The term Data Curation has taken hold and is now mainstream enough to be picked up by both marketing teams and MBAs as one of the ways they explain "what the data people do".
Years ago, "data steward" was used to describe the caretakers of data within an organization. Data Steward is still widely used to describe a person who is responsible for the care and feeding of a specific database or process. (I'll draft a definition of that term if there's interest, or feel free to post one.)
Both Data Curation and Data Steward can mean many different things to the speaker and the listener, with enough ambiguity that the two can talk past each other and leave the conversation with different understandings of who is doing what, without either of them knowing it.
With that in mind, I'll offer this definition for Data Curation:
Data Curation: the process of gathering, sorting, formatting, cleansing, standardizing and maintaining data with a sense of responsibility for preserving, describing and delivering the value of the information contained in data to users.
My assertion is that Data Curation is an action of responsibility and authority that offers a service to other data users. Describing the specific applications of the data, and its value in those intended uses, may be the most important aspect of this definition. A conversation with Colleen Reed, a co-worker and community member, left me with the question: does curation include documentation of such aspects as use cases?
Please let me know your thoughts on this. I've been called a Data Curator a couple of times in the last week, and I'm curious to dig into what that means within the broader data community.
I believe it absolutely does. Properly used, data is very powerful and can serve its intended use cases. Misused, it can lead to erroneous conclusions.
I go back to a simple definition of quality that I subscribe to: "fitness for intended use". I can use a hammer to drive a screw, but chances are I'm going to make a mess of things (better to bring a nail); the hammer is intended to work with the nail. Similarly, in data, we can perform all the curation steps you mention, but if the use case we are targeting is wrong, we will still make a mess of things.
Sometimes a data consumer, because of time or financial pressure or simple ignorance, might take a shortcut and apply data curated for one use case to another. As data curators, it's our responsibility to outline the envisioned or tested use cases for the data and to proactively team with innovative customers to ensure their use case will be a winning new use case, and not a mangled screw.
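One way to make "outline the envisioned or tested use cases" concrete is to keep that list next to the data itself. What follows is only a minimal Python sketch under my own assumptions: the `CuratedDataset` class, its fields, and the example dataset and use-case names are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedDataset:
    """Hypothetical record a curator might keep alongside a dataset."""

    name: str
    steward: str
    # Use cases the curator has envisioned or tested for this data.
    intended_uses: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Normalize so that lookups ignore case and stray whitespace.
        self.intended_uses = {u.strip().lower() for u in self.intended_uses}

    def is_fit_for(self, use_case: str) -> bool:
        """Return True only if the curator has documented this use case."""
        return use_case.strip().lower() in self.intended_uses

    def approve_use(self, use_case: str) -> None:
        """Record a new use case once curator and consumer have reviewed it."""
        self.intended_uses.add(use_case.strip().lower())


# Illustrative values only.
sales = CuratedDataset(
    name="regional_sales",
    steward="data-team@example.com",
    intended_uses={"Quarterly revenue reporting"},
)

print(sales.is_fit_for("quarterly revenue reporting"))  # True
print(sales.is_fit_for("customer churn modeling"))      # False: not yet reviewed

# After the curator and the consumer have vetted the new use together:
sales.approve_use("customer churn modeling")
print(sales.is_fit_for("customer churn modeling"))      # True
```

The point of the sketch is the workflow rather than the class: a new use of the data starts as "not documented" and only becomes a supported use after the curator has looked at it with the consumer.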
As a community, we have the opportunity to ask important questions about our new industry as both it and we develop:
❏ Have we listed how this technology can be attacked or abused?
❏ Have we tested our training data to ensure it is fair and representative?
❏ Have we studied and understood possible sources of bias in our data?
❏ Does our team reflect diversity of opinions, backgrounds, and kinds of thought?
❏ What kind of user consent do we need to collect to use the data?
❏ Do we have a mechanism for gathering consent from users?
❏ Have we explained clearly what users are consenting to?
❏ Do we have a mechanism for redress if people are harmed by the results?
❏ Can we shut down this software in production if it is behaving badly?
❏ Have we tested for fairness with respect to different user groups?
❏ Have we tested for disparate error rates among different user groups?
❏ Do we test and monitor for model drift to ensure our software remains fair over time?
❏ Do we have a plan to protect and secure user data?
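Some of these questions can be checked mechanically. As one example, the question about disparate error rates among user groups could be approached roughly as in the Python sketch below. It assumes each record is a `(group, actual, predicted)` tuple; the function names, the group labels, the sample records, and the `0.1` threshold are all hypothetical choices made for illustration, not an established standard.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the share of wrong predictions for each group.

    records: iterable of (group, actual, predicted) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, actual, predicted in records:
        totals[group] += 1
        if actual != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def max_error_gap(rates):
    """Difference between the worst and best per-group error rate."""
    return max(rates.values()) - min(rates.values())


# Made-up records, for illustration only.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 0),
]

rates = error_rate_by_group(records)
gap = max_error_gap(rates)

# Hypothetical tolerance; a real project would have to choose and justify its own.
MAX_ALLOWED_GAP = 0.1
needs_review = gap > MAX_ALLOWED_GAP

print(rates)
print(f"gap={gap:.2f}, needs_review={needs_review}")
```

A single overall error rate would hide the difference between the two groups here, which is why the check is done per group. Running the same comparison on fresh data at regular intervals is one simple way to approach the model drift question as well.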