Data Sandbox


What is "Data Curation"

  • 1.  What is "Data Curation"

    Pitney Bowes
    Posted 05-09-2019 10:50

    The term Data Curation has taken hold and is now mainstream enough to be picked up by both marketing teams and MBAs as one of the ways they explain "what the data people do". 

    Years ago, "data steward" was used to describe the caretakers of data within an organization. Data Steward is still widely used to describe a person who is responsible for the care and feeding of a specific database or process. (I'll draft a definition of that term if there's interest, or feel free to post one.)

    Both Data Curation and Data Steward can mean different things to the speaker and the listener, with enough ambiguity that the two can talk past each other and leave the conversation with different understandings of who is doing what, without either of them realizing it.

    With that in mind, I'll offer this definition for Data Curation:

    Data Curation: the process of gathering, sorting, formatting, cleansing, standardizing and maintaining data with a sense of responsibility for preserving, describing and delivering the value of the information contained in data to users.

    My assertion is that Data Curation is an action of responsibility and authority that offers a service to other data users. Describing the specific applications of the data, and the value of the data in its intended uses, may be the most important aspect of this definition. A conversation with Colleen Reed, a co-worker and community member, left me with the question: does curation include documentation of such aspects as use cases?

    Please let me know your thoughts on this. I've been called a Data Curator a couple of times in the last week, and I'm curious to dig into what that means within the broader data community.



    ------------------------------
    Dan Adams
    Pitney Bowes
    White River Junction VT
    ------------------------------


  • 2.  RE: What is "Data Curation"

    Pitney Bowes
    Posted 05-15-2019 20:55
    Hi Dan - to your question "does curation include documentation of such aspects as use cases?"

    I believe it absolutely does. Properly used, data is very powerful and can solve its intended use cases. Misused, it can lead to erroneous conclusions.

    I go back to a simple definition of quality I subscribe to: "fitness for intended use". I can use a hammer to drive a screw, but chances are I'm going to make a mess of things (better to bring a nail); the hammer is intended to work with the nail. Similarly, in data, we can perform all the curation steps you mention, but if the use case we apply the result to (the target) is wrong, then we will make a mess of things.

    Sometimes a data consumer might take shortcuts because of time or financial pressures, or out of ignorance, and attempt to apply data curated for one use case to another. As data curators, it's our responsibility to outline the envisioned or tested use cases for the data and to proactively team with innovative customers to ensure their use case will be a winning new use case, and not a mangled screw.
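
    One way to make the envisioned and tested use cases explicit is to record them alongside the dataset itself. The following is only a minimal Python sketch under stated assumptions: the CuratedDataset class, its fields, and the example dataset and use cases are all hypothetical and do not come from any actual product or process.

```python
from dataclasses import dataclass, field


@dataclass
class CuratedDataset:
    """A dataset plus the use cases its curator has documented (hypothetical structure)."""
    name: str
    tested_use_cases: set[str] = field(default_factory=set)
    envisioned_use_cases: set[str] = field(default_factory=set)

    def fitness_for(self, use_case: str) -> str:
        """Classify a proposed use against what the curator documented."""
        if use_case in self.tested_use_cases:
            return "tested"
        if use_case in self.envisioned_use_cases:
            return "envisioned"
        # Not necessarily wrong, but a signal to talk to the curator first.
        return "undocumented"


# Illustrative, made-up example data.
addresses = CuratedDataset(
    name="customer_addresses",
    tested_use_cases={"mail delivery"},
    envisioned_use_cases={"territory planning"},
)

for proposed in ("mail delivery", "territory planning", "credit scoring"):
    print(f"{proposed}: {addresses.fitness_for(proposed)}")
```

    An "undocumented" result does not mean the new use is a bad one; it marks the point where the consumer and the curator should team up before the data is applied.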

    Thanks
    Tom



    ------------------------------
    Tom Gilligan
    Pitney Bowes
    White River Junction, VT, USA
    ------------------------------



  • 3.  RE: What is "Data Curation"

    Pitney Bowes
    Posted 07-18-2019 09:29
    Hi Dan -
    In light of this fascinating Forbes article, I thought I would comment on data curation and its responsibilities to the providers of data, as well as its role in spelling out intended use cases for potential customers/clients.
    Here is the article for review:
    https://www.forbes.com/sites/johnkoetsier/2019/07/17/viral-app-faceapp-now-owns-access-to-more-than-150-million-peoples-faces-and-names/#55c2f33362f1

    FaceApp has now curated over 150 million important, not-so-private pieces of data. How much does FaceApp owe the providers of the data (consumers of the app), and to what degree is FaceApp now responsible for how the data can or will be used in the future?

    In my opinion, consumers have fully embraced the shift to data-driven apps as a necessary evil. In doing so, they also believe they are somehow protected.
    The responsibility then ultimately falls on the data curators, who should be held accountable for "preserving, describing and delivering the value of the information contained in data to users."
    In this case, FaceApp has the responsibility to find thoughtful uses for the data it has curated.

    Thanks
    Sam

    ------------------------------
    Samantha Martino
    Pitney Bowes
    White River Junction, VT, USA
    ------------------------------



  • 4.  RE: What is "Data Curation"

    Pitney Bowes
    Posted 07-18-2019 11:24
    This brings to mind the issue of ethics around data curation and data use, a subject that is growing in importance as we collect, use and analyze personal data. Here's a checklist of questions to ask when collecting and using (curating) data that can help guide us (from Ethics and Data Science by Mike Loukides, Hilary Mason & DJ Patil, Kindle version, location 157, https://www.amazon.com/Ethics-Data-Science-Mike-Loukides-ebook/dp/B07GTC8ZN7). Since Dan started this conversation around data curation, I've highlighted some items below that I think are key in that context.

    "

    As a community we have the opportunity to ask important questions about our new industry as it and we develop:

    ❏ Have we listed how this technology can be attacked or abused?

    ❏ Have we tested our training data to ensure it is fair and representative?

    ❏ Have we studied and understood possible sources of bias in our data?

    ❏ Does our team reflect diversity of opinions, backgrounds, and kinds of thought?

    What kind of user consent do we need to collect to use the data?

    Do we have a mechanism for gathering consent from users?

    Have we explained clearly what users are consenting to?

    Do we have a mechanism for redress if people are harmed by the results?

    ❏ Can we shut down this software in production if it is behaving badly?

    ❏ Have we tested for fairness with respect to different user groups?

    ❏ Have we tested for disparate error rates among different user groups?

    ❏ Do we test and monitor for model drift to ensure our software remains fair over time?

    Do we have a plan to protect and secure user data?"
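
    As an illustration of the checklist item on disparate error rates, the check can be as simple as computing the error rate separately for each user group and comparing them. This is a minimal Python sketch, not something taken from the book: the function name, the record layout and the sample records are all made up for illustration.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: fraction of records where predicted != actual}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Made-up records: (user group, predicted label, actual label).
sample = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]

rates = error_rates_by_group(sample)
gap = max(rates.values()) - min(rates.values())
print(rates)
print(f"largest gap in error rate: {gap:.2f}")
```

    How large a gap is acceptable is a judgment call for the team and depends on the use case; the sketch only makes the gap visible.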

    Thanks,
    Cecily

    ------------------------------
    Cecily Herzig
    PITNEY BOWES SOFTWARE, INC
    Maitland FL
    ------------------------------