Summary: Creating a culture of data quality not only has a concrete impact on the accuracy of your findings; it also inspires greater organizational buy-in when your framework adheres to certain standards.
More data is better data. It’s a common refrain in data science and other analytical circles – and one that makes sense. If you have more to scrutinize, previously undiscovered insights may be lurking in unexplored places. But while mostly accurate, the phrase omits one key element: quality.
As the tools that analyze data grow more sophisticated, the quality of the data that feeds them is becoming ever more critical to success. For the most part, firms collect plenty to drive insight. The limitation comes with the enterprise’s ability to ensure the quality of the data, so that they in turn can trust the result of any analyses or insights derived from it.
A robust data quality framework is the key to running well-managed processes that build trust – internally and externally – in results. It’s important because, not surprisingly, auditors and risk functions will look for the validation of inputs and the reliability of outputs. A clear data quality framework is also important in automated environments, where invalid values can break a process or creep into a process run by machines that don’t know any better.
Most organizations tackle this problem with a patchwork of tools, manual checks and eyeball tests that are usually held to some threshold for materiality, but this approach is hardly confidence inspiring for leaders. It also can reduce the efficacy of more advanced tools like machine learning, where algorithms home in on data correlations or anomalies that could just as easily be shifts in customer behavior or a single month of incorrect application data.
Elements of Data Quality
While the concept of “quality” data seems straightforward, it can get nuanced in a hurry. What if the credit application and credit monitoring systems reflect different information on a borrower’s FICO score? What if they match, but they’re both wrong? What if the CEO is making decisions based on that data?
Qualifying data to be “of good quality” requires a few elements, with various assessment means – all of which can be used to gain comfort in the data and confidence in the decisions it drives. A complete framework can be characterized by its:
- Completeness: Is the full volume of data being collected and presented in its entirety?
- Consistency: Are data elements uniform across all systems, sources and instances?
- Correctness: Are the values in the data objectively the right values?
- Uniqueness: Are there duplicate values or multiple records where there shouldn’t be?
- Age: Is the data current and still relevant, or is it overdue for a refresh?
While dedupes, reconciliations and timestamps can easily account for most of the above, correctness is hard to nail down, and can mean something different to each user, which makes an integrated strategy crucial.
Attributes of an Effective Data Quality Framework
While your use case should drive the selection of a specific data quality methodology – whether you sit upstream with large source systems or downstream with small analytical data stores – you’ll want to make sure it accounts for these key elements:
There is typically an authoritative source – also known as the “single source of truth” – for each dataset. It is usually a source system or a view with particularly robust quality controls that give the rest of the organization something to anchor to.
Wherever possible, data should be pulled exclusively from authoritative sources to reduce data hops, eliminate superfluous copies, remove any obscurity in how it was altered, and lessen the number of failure points created through movement and transformation. Quality checks should be incorporated into source systems, data lakes, warehouses and production processes, as well as analytical exercises. Centralizing quality checks on these authoritative sources consolidates the work with a few key teams, taking the burden off analysts and other users to perform additional steps themselves while instilling comfort that the data being consumed is suitable for use.
It takes a concerted effort across the organization and a data quality culture to discourage users from creating bespoke local datasets that contain copies of key data elements, as is so often the path of least resistance. But by controlling the creation and consumption of data, and better monitoring how it is used, organizations can take a huge leap forward in its overall quality.
As a foundational layer, you’ll want to create data movement checks and reconciliations to confirm all the data was captured and that it was moved correctly. Comparing row counts is useful, but for data that is updated in batches, or has undergone transformations or aggregation, statistical profiling and more sophisticated change point detection from cycle to cycle (methods like “average balance of these accounts, incidence of invalid values, and the cumulative number of payments made on those accounts all are within a confidence interval”) can provide a reasonable amount of comfort that you have at least accounted for the full population of data.
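The cycle-to-cycle profiling described above can be sketched in a few lines. This is a minimal illustration, not a production reconciliation engine: the function name, the 10% drift band and the sample balances are all assumptions chosen for the example.

```python
from statistics import mean

def reconcile_cycles(prev_balances, curr_balances, rel_tol=0.10):
    """Compare two load cycles of account balances.

    Returns a dict of check results: whether row counts match, and
    whether the mean balance moved more than rel_tol (here 10%)
    between cycles. The 10% band is illustrative, not a standard.
    """
    checks = {}
    checks["row_count_ok"] = len(prev_balances) == len(curr_balances)
    prev_mean, curr_mean = mean(prev_balances), mean(curr_balances)
    drift = abs(curr_mean - prev_mean) / abs(prev_mean)
    checks["mean_within_band"] = drift <= rel_tol
    checks["mean_drift"] = round(drift, 4)
    return checks

# Hypothetical balances from two consecutive load cycles
prev = [1200.0, 850.0, 430.0, 990.0]
curr = [1180.0, 870.0, 455.0, 1010.0]
print(reconcile_cycles(prev, curr))
```

In practice you would extend the same pattern to other metrics named above – incidence of invalid values, cumulative payments – and alert when any metric falls outside its band rather than silently passing the data along.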
Comparisons across systems ensure uniformity among the primary source and other authoritative sources. Checking for identical values keeps things in sync, ensuring that the real truth – versus versions of the truth from competing systems – is preserved across all usages. Additionally, confirming that data types are consistent for similar elements across systems (integer vs. float) and that nulls are denoted in the same way each time they are represented (“ ” vs. “NULL”) adds another level of confidence that data is being stored consistently and can therefore be used more effectively.
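A cross-system comparison along these lines might look like the sketch below, which checks the same record from two systems field by field, treating different null representations as equivalent. The record shapes, field names and null-token list are assumptions for illustration.

```python
def consistent(rec_a, rec_b, null_tokens={"", "NULL", "null", None}):
    """Field-by-field comparison of one record from two systems.

    Treats the listed null tokens as equivalent, and flags fields
    whose values or Python types disagree. The token list is
    illustrative; real systems may use other sentinels.
    """
    mismatches = []
    for field in rec_a.keys() & rec_b.keys():
        a, b = rec_a[field], rec_b[field]
        a_null, b_null = a in null_tokens, b in null_tokens
        if a_null and b_null:
            continue  # both null, just represented differently
        if a_null != b_null or type(a) is not type(b) or a != b:
            mismatches.append(field)
    return sorted(mismatches)

# Hypothetical application vs. monitoring system views of one borrower
app = {"fico": 712, "limit": 5000.0, "mi": ""}
mon = {"fico": 712, "limit": 5000, "mi": "NULL"}
print(consistent(app, mon))  # ['limit'] -- float vs. int mismatch
```

Note that the type check catches the integer-vs-float case called out above even when the numeric values agree.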
Knowing with certainty that data is correct is not possible, but we can create reasonable assurance. It starts with making sure that any data being used is complete and consistent with the source, as well as with any third-party sources, such as an accounting system. This narrows the focus for quality to whether or not the values themselves are accurate.
Qualitatively, visualizations and business intuition from key leaders are also helpful, providing guardrails to determine if you’re in the ballpark before releasing data for use. The obvious drawback to this is the inherent confirmation bias that what happened in the past will keep happening in the future, making true shifts in the data appear incorrect.
More quantitative checks should rely on a set of rules and ranges that define thresholds and acceptable values. Some of these might include:
- Acceptable values
- Checks for a certain number of digits/characters
- Data types
- Nullable vs. not-nullable
- Caps and floors to identify unreasonable values
- Statistical profiling
  - Mean, median, standard deviation, etc.
  - Outlier detection for anomalies
  - Excess changes in key metrics cycle-to-cycle
  - Change point detection models
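The field-level rules above, and the flag-don’t-drop principle that follows, can be combined in a simple validator. Everything here is an assumption for illustration – the FICO cap and floor, the 5-digit ZIP rule and the record shape are example thresholds, not prescriptions.

```python
def run_rules(record, valid_products, fico_floor=300, fico_cap=850):
    """Apply illustrative data quality rules to one application record.

    Failures are returned for SME review, never dropped -- a new
    product code might be a legitimate new offering rather than bad
    data. The thresholds (FICO 300-850, 5-digit ZIP) are examples.
    """
    failures = []

    fico = record.get("fico")
    if not isinstance(fico, int):
        failures.append("fico: wrong type or null")
    elif not fico_floor <= fico <= fico_cap:
        failures.append("fico: outside cap/floor")

    zip_code = record.get("zip", "")
    if not (len(zip_code) == 5 and zip_code.isdigit()):
        failures.append("zip: not 5 digits")

    if record.get("product") not in valid_products:
        failures.append("product: unknown code, route to SME review")

    return failures

# A hypothetical record that trips all three rules
rec = {"fico": 912, "zip": "0210", "product": "PLUS"}
print(run_rules(rec, valid_products={"STD", "GOLD"}))
```

The key design choice is that the function reports failures rather than filtering records out, so a subject matter expert decides whether a rule or the data needs to change.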
In any case, thresholds should be monitored, but data should not be dropped from the system or dismissed out of hand. Anomalies and rule failures require the expertise and investigation of subject matter experts to be handled correctly. For instance, if a new product ID code appears in a dataset, it may be because the rules have not been updated for a new offering, or it could be due to underlying issues in the data source or the code that pulls it. Either way, writing it off as immaterial or dropping it as invalid would be a mistake.
Duplicates in the data are common, especially when data is collected from multiple systems and joined together. Standardizing the unique identifier by which a record is named across as many use cases as possible is Data 101. Another common method for deduping is to simply group-by the unique identifier, but depending on the implementation, a group-by over multiple records may randomly select a record, take an average, or behave in other unintended ways. Instead, it’s better to ensure some sort of rank-order is applied to dedupe in a logical, consistent way for all cases, and that the methodology is shared explicitly with users. Sometimes consistent application of the rules is just as important as getting to the “most correct” data.
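A deterministic, rank-ordered dedupe along the lines described above might look like this sketch. The field names (`account_id`, `updated_at`) and the "latest update wins" rule are assumptions; the point is that the tie-break is explicit rather than left to an arbitrary group-by.

```python
def dedupe_latest(records, key="account_id", order="updated_at"):
    """Keep, for each key, the record with the greatest order value
    (here, the most recent update). Deterministic and documentable,
    unlike an arbitrary group-by pick. Field names are illustrative.
    """
    best = {}
    for rec in records:
        k = rec[key]
        if k not in best or rec[order] > best[k][order]:
            best[k] = rec
    return list(best.values())

# Hypothetical rows joined from two systems, with one duplicate key
rows = [
    {"account_id": "A1", "updated_at": "2024-01-05", "balance": 100},
    {"account_id": "A1", "updated_at": "2024-03-01", "balance": 90},
    {"account_id": "B2", "updated_at": "2024-02-10", "balance": 250},
]
print(dedupe_latest(rows))  # one record per account, latest wins
```

Whatever rank-order you choose, publishing it alongside the dataset lets every consumer reproduce the same deduped view.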
Reducing the number of data sources is the best way to ensure data is not stale. Timestamps, clear procedures, workflows for overwrites and other like methods can keep data relevant. A best practice, however, is to archive items, rather than overwriting them, so that they can be reproduced if needed during an audit or some other review where repeatability is important.
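The archive-rather-than-overwrite practice can be sketched as an append-only store that keeps every version with a timestamp, so any past state can be reproduced for an audit. This is a toy in-memory model, not a real implementation; the class and method names are invented for the example, and integer timestamps are used for clarity.

```python
import datetime

class ArchivingStore:
    """Append-only store: writes never overwrite. Every version is
    kept with a timestamp so past states can be reproduced during an
    audit or other review where repeatability matters. A sketch only
    -- a real implementation would persist to a database or files.
    """
    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value)

    def put(self, key, value, ts=None):
        if ts is None:
            ts = datetime.datetime.now(datetime.timezone.utc)
        self._versions.setdefault(key, []).append((ts, value))

    def latest(self, key):
        return self._versions[key][-1][1]

    def as_of(self, key, ts):
        """Value that was current at time ts (for repeatable reviews)."""
        hits = [v for t, v in self._versions[key] if t <= ts]
        return hits[-1] if hits else None

store = ArchivingStore()
store.put("limit_policy", "v1", ts=1)  # integer timestamps for clarity
store.put("limit_policy", "v2", ts=2)
print(store.latest("limit_policy"), store.as_of("limit_policy", 1))
```

Because old versions are retained, `as_of` can answer "what did we believe at the time?" – the exact question an auditor asks.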
Other Data Management Frameworks
While quality is an important part of data management, it’s not the whole ballgame – especially if users can’t find it or use it. Good data is made more effective by useful metadata, clear ownership agreements between producers and consumers, and documentation that users can and should lean on to make informed decisions about how best to leverage the available data.
Reporting to Instill Further Confidence
Data quality reporting is a topic deserving of its own discussion. In general, though, reporting should be as granular as possible – even to the element level. Capturing detailed comparisons and metrics at the column level in a given dataset will provide maximum flexibility to summarize results across a number of dimensions to meet the needs of many users. Although a data producer may have a specific rubric for determining what data is of good enough quality to release for use, users will need more information.
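Element-level reporting of the kind described above could start as simply as a per-column profile: one row of metrics per column, which users can then roll up along whatever dimensions matter to them. The metric choice (null rate, distinct count) and the null tokens are assumptions for the sketch.

```python
from collections import defaultdict

def column_profile(rows):
    """Element-level quality metrics: for each column, the null rate
    and distinct-value count. Granular per-column output lets users
    summarize across the dimensions they care about. The metrics and
    null tokens here are illustrative choices.
    """
    nulls = defaultdict(int)
    values = defaultdict(set)
    for row in rows:
        for col, val in row.items():
            if val in (None, "", "NULL"):
                nulls[col] += 1
            else:
                values[col].add(val)
    n = len(rows)
    return {col: {"null_rate": nulls[col] / n,
                  "distinct": len(values[col])}
            for col in {c for row in rows for c in row}}

# A hypothetical three-row dataset with one missing FICO score
data = [{"fico": 700, "state": "NY"},
        {"fico": None, "state": "NY"},
        {"fico": 680, "state": "CA"}]
print(column_profile(data))
```

From output this granular, a producer can publish an overall pass/fail while a sensitive downstream process drills into exactly the columns it depends on.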
While some analytics may not be impacted by seemingly immaterial anomalies in data, some processes can be very sensitive to these issues. Moreover, the governance frameworks and internal controls around many processes (such as SOX-controlled processes) dictate that users perform additional checks of their own – duplicative effort that should really have been baked in at the data’s source.
Summarization and segmentation can play an important role in reporting of data quality, as well. Issues in a small portion of a large dataset may appear immaterial in aggregate, but if a portion of those records is critical to some downstream process, that small difference may have outsized impacts on some users downstream. Organizations that provide flexible reporting, allowing users to focus on the pieces that matter to them or their process, will surface more potential issues before they cause harm.
As data grows, the enterprise will struggle to keep up. A federated model of every team fending for themselves creates distractions and a drain on resources that might otherwise be driving analysis and decision-making. But a thoughtful framework of data quality approaches, implemented centrally and at scale, can provide an organization with data that is useful and the confidence to act on the insights it might reveal.