Data Sandbox

How might we reduce data regression?


    Pitney Bowes
    Posted 30 days ago

    Are you responsible for data content or quality? If so, perhaps you could help brainstorm how data producers/consumers can reduce Data Regression. I suggest 4 ways below, but I'd like to hear from the community, so please share your thoughts below if you have creative ideas to share!

    Data Regression is the unintended loss in one or more of the 4C's of Data Quality (Completeness, Correctness, Coverage and Currency), usually as a side effect of a data quality improvement or data update activity. Flaws in a data production/update process can cause Data Regression, and you might incur serious losses in data quality without even knowing it's happening. Whether you are a data producer or a data consumer simply updating data you acquired, you are at risk and should take action to detect or prevent it!

    How does Data Regression occur? It commonly happens when a project team loses sight of its impact on existing data, and data that already had good quality becomes compromised. Typically, existing data outside the scope of the project (for example, other features that are spatially related to the project scope, or joined to it by means of an ID) is damaged. For example, a project team responsible for improving water polygon completeness can successfully introduce large amounts of spatially accurate water, but neglect to check the impact on less spatially accurate roads, points of interest or buildings. As a result, overall data quality decreases because, after the water polygons are improved, roads, points of interest and/or buildings now display inside them. Damage like this can happen to virtually all data types.
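    The "features displaying inside new water" scenario above can be caught with a simple spatial QC check. The sketch below is plain Python with made-up feature data and field names (not any real dataset or API); it flags point features that fall inside newly added water polygons using a ray-casting point-in-polygon test.

```python
# Hypothetical QC check: flag point features (e.g. points of interest)
# that now fall inside newly added water polygons. All names and data
# here are illustrative.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside polygon (a list of (x, y) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def flag_drowned_features(points, new_water_polygons):
    """Return point features that now display inside a water polygon."""
    return [p for p in points
            if any(point_in_polygon(p["x"], p["y"], poly)
                   for poly in new_water_polygons)]

# Example: a square lake introduced by the water-completeness project
lake = [(0, 0), (10, 0), (10, 10), (0, 10)]
pois = [{"id": "cafe-1", "x": 5, "y": 5},   # inside the new lake -> regression!
        {"id": "cafe-2", "x": 20, "y": 5}]  # unaffected

flagged = flag_drowned_features(pois, [lake])
print([p["id"] for p in flagged])  # ['cafe-1']
```

    In practice you would run such a check over the project's output before release; production systems typically use a spatial library or database for this, but the principle is the same.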

    How can you reduce data regression? Here are 4 ideas:

    1. First, acknowledge it could occur and brainstorm as a project team how your existing data could be negatively impacted by your data update actions. Prior to project execution, identify the risks and how they can be mitigated. This includes using completeness and correctness QC checks.
    2. Produce and closely watch data statistics – at the right level of detail. High-level statistical counts might not reveal the loss of good data. If, for example, you are adding several million new customer records, the loss of tens of thousands of existing records might not be noticeable. If updating spatial data, monitor your existing data at the smallest practical geography. For example, rather than monitoring counts at a national or state level in the USA, monitor at the county level if possible. A reasonable increase at the state level might hide a sizeable decrease in a county.
    3. More is not always better; be alert for over-completeness in the form of introduced duplicates. Have logical QC checks to prevent (and detect) the introduction of duplicates. Duplicates can also obscure losses in counts of good records.
    4. Have "Golden Records". Golden Records are the relatively few records known to be absolutely correct, which you monitor during any data update process. They should be a subset/sample of all your records and should include your most important records. Those should not change, and if you detect a change in them (loss, duplication, change in their attributes) it should set off major alarms. Ideally, you have also invested in metadata that tracks the currency of individual data attributes, so you don't overwrite newer data with older data.
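    Ideas 2 through 4 above can be sketched as simple checks run against a dataset before and after an update. Everything below (the county names, record counts, field names, and 5% threshold) is hypothetical, just to show the shape of such checks in Python; note how the state-level total goes up while one county quietly loses records.

```python
# Illustrative sketches of ideas 2-4: compare counts at the smallest
# practical geography, detect introduced duplicates, and verify that
# golden records survived the update. All data here is made up.

from collections import Counter

def county_count_regressions(before, after, threshold=0.05):
    """Counties whose record count dropped by more than `threshold` (a fraction)."""
    losses = {}
    for county, old in before.items():
        new = after.get(county, 0)
        if old and (old - new) / old > threshold:
            losses[county] = (old, new)
    return losses

def find_duplicates(records, key="id"):
    """Key values that occur more than once after the update."""
    counts = Counter(r[key] for r in records)
    return [k for k, n in counts.items() if n > 1]

def golden_record_violations(golden, records):
    """Golden records that were lost or whose attributes changed."""
    by_id = {r["id"]: r for r in records}
    return [g["id"] for g in golden
            if g["id"] not in by_id or by_id[g["id"]] != g]

# The state total rises from 2000 to 2150 (+150), which looks healthy,
# yet Windsor county lost 200 records - visible only at county level.
before = {"Windsor": 1000, "Orange": 1000}
after_counts = {"Windsor": 800, "Orange": 1350}
print(county_count_regressions(before, after_counts))  # {'Windsor': (1000, 800)}
```

    The same pattern applies at whatever geography or grouping fits your data: compute the statistic per group before and after, and alarm on any group that regressed, even if the aggregate improved.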

    Please share your thoughts below on Data Regression and how data producers/consumers can reduce it.



    ------------------------------
    Tom Gilligan
    Pitney Bowes Software, Inc.
    White River Junction, VT, USA
    ------------------------------