Effective decision-making depends heavily on the quality of the data available. In this blog, we will delve into six core dimensions of data quality, with practical examples as well as basic MSSQL queries to develop better insights into your data.
We are also sharing posts and previews on channels like Instagram, TikTok, LinkedIn and YouTube. Feel free to join the communities over there and keep up to date on Data Quality, Governance, SQL, Data Integration, Analytics and much more!
What is Data Quality?
Data Quality (DQ) is the assessment of the usefulness and reliability of data for its purpose. This process is often referred to as Data Profiling: a subjective exploration to understand your data and its usability relative to the standards defined by the user.
An example of Subjective Data Quality:
A financial controller needs to know every decimal and cent in the data for it to be accurate. The sales department only needs approximated sales numbers to determine sales trends.
Partly intertwined are the processes of Data Integration (DI) and Master Data Management (MDM), which together are sometimes called Data Cleansing. This involves resolving and managing issues, standardizing and enriching data, and integrating various sources into a data system, with the goal of unlocking full-value, high-quality data.
Data Quality Metrics
In 2020 the Data Management Association (DAMA) compiled a list of 65 dimensions and subdimensions for Data Quality, with 12 of these marked as ‘common’. Out of these twelve there are six core dimensions:
- Completeness: the main features required for analysis do not have missing values.
- Uniqueness: there is only one record of each observation in the data.
- Timeliness and Currency: data is up to date and the most recent record reflects the most recent change.
- Consistency: contradictions within a dataset or between datasets are minimized or removed.
- Validity: data has the right type and format and is within an acceptable range.
- Accuracy: measures the correctness of the form and content of data.
Managing Data Quality
Data Governance is a data management concept concerning the capability of a business to ensure high data quality exists during the lifecycle of the data, and that controls are in place that support business objectives.
In Data Governance the responsibility is skewed mostly towards business users rather than the IT infrastructure. While data is expected to be correct at creation, it is often not assessed until it is consumed. That is why analysts are often burdened with the added task of detecting and resolving issues and flaws in the data.
This has led to a new role to centralize this responsibility: the Data Steward. A Data Steward is tasked with administering all data in compliance with policy and regulation. This role is often taken on by a Data Analyst or Data Scientist connected to the project or organization.
COMPLETENESS
Completeness is measured as the ratio of non-missing values for the features that are within the scope of the project. A completeness of 95% would mean that only 5 out of 100 records have missing values.
A real-world example: At the cash register in the grocery store all products are scanned with a barcode on the packaging. For one of your items the barcode is missing, thus it cannot be registered for payment. The item is manually processed by the cashier, looking up the item code or price.
This manual involvement is surprisingly similar to how completeness is handled in business. Only when the salesperson tries to send an email to a customer does the system return a message that the email address isn’t available. They manually resolve the issue, either by getting the data from another system or by calling the customer to retrieve the missing email address.
Below is an example of a basic test on the completeness of a table in an MSSQL database (the 100.0 avoids integer division truncating the percentage):
SELECT 100.0 * COUNT(CASE WHEN col_A IS NOT NULL THEN 1 END) / COUNT(*) AS completeness_col_A FROM table_name;
There are several options to resolve missing data and improve the overall completeness of your dataset. We’ve listed some of the solutions below:
- If available, use other data sources. This is a great solution for standardized data such as addresses and can be seen in online shopping. They only ask for your postal code and house number, which is used to retrieve the correct street name. A solution like this requires time to test external data sources and determine if they can be used to complete empty records or fields.
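A sketch of this approach in MSSQL, assuming a hypothetical customers table and a postal_codes reference table (all table and column names here are illustrative):

```sql
-- Fill missing street names from a reference table of postal codes.
-- Table and column names are hypothetical.
UPDATE c
SET c.street_name = p.street_name
FROM customers AS c
JOIN postal_codes AS p
  ON p.postal_code = c.postal_code
WHERE c.street_name IS NULL;
```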
- Imputation is inferring the missing values from the existing ones, for example by replacing the missing values with the median, mean or mode of the data. Imputation does require a very clear definition of the goal and the precision of data needed.
An example in MSSQL:
SELECT COALESCE(col_A, AVG(col_A) OVER ()) AS col_A_imputed FROM table_name;
- Manually fixing values, although possible and quite often common practice, is an undesirable solution. When the number of missing values is very low, and the business logic is well known, this is sometimes used as a quick fix. But the major downsides of manual resolution are non-scalability and human error.
- Dropping missing values is an option when the dataset is large enough and the ratio of completeness is high enough. If dropping the empty records doesn’t impact the analysis, you could drop the null values. But this does mean losing information, and possibly trends, since important outliers may be among the records dropped for having missing values.
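A minimal sketch of dropping incomplete records, reusing the hypothetical table and column from the completeness query:

```sql
-- Exclude incomplete records from the analysis.
SELECT *
FROM table_name
WHERE col_A IS NOT NULL;

-- Or remove them permanently (destructive; consider archiving first):
DELETE FROM table_name
WHERE col_A IS NULL;
```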
UNIQUENESS
Objects in a dataset should be unique occurrences to give an accurate insight during data analysis, as well as to provide the correct data to the user looking for specific information. Data deduplication is the process of removing or filtering out duplicate records and works mostly for identical records. But even if a record is not identical to any others, that does not guarantee uniqueness.
An example: We’re hosting a business event and during the event we collect information from the visitors to store in a database. After a month we want to contact the visitors for follow-up questions and in the list of applicants are three very similar names: Isa Daniels, Isabella Daniels, and Isadora Daniels. Two records, Isa and Isadora, share the same email and phone number. Even if they’re not identical records, they still aren’t unique.
Actual duplicate records are easier to detect and manage, and often harder to create, since database tables require unique keys to prevent duplicates and promote easy querying. If no primary key is required, an identical duplicate record could be inserted into a dataset, skewing the results and allocating unnecessary resources. Most statistical packages used in data analysis are well equipped to handle identical duplicate values.
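Near-duplicates like the event example above can often be surfaced by grouping on contact fields instead of names. A sketch, assuming a hypothetical visitors table and SQL Server 2017 or later for STRING_AGG:

```sql
-- Find visitors that share an email and phone number but may differ in name.
SELECT email, phone, COUNT(*) AS cnt,
       STRING_AGG(name, ', ') AS names
FROM visitors
GROUP BY email, phone
HAVING COUNT(*) > 1;
```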
To check for duplicate values in MSSQL you can use:
SELECT col_A, col_B, col_C, COUNT(*) AS cnt FROM table_A GROUP BY col_A, col_B, col_C HAVING COUNT(*) > 1;
Another option would be to use the DISTINCT keyword when using the COUNT function.
SELECT COUNT(DISTINCT customer_id) AS num_unique_customers FROM customer;
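When identical duplicates do slip into a table without a primary key, one common T-SQL pattern removes them with a window function. A sketch with hypothetical column names:

```sql
-- Keep one copy per duplicate group and delete the rest.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY col_A, col_B, col_C
               ORDER BY (SELECT NULL)
           ) AS rn
    FROM table_A
)
DELETE FROM ranked
WHERE rn > 1;
```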
TIMELINESS and CURRENCY
The timeliness dimension is a measurement of the delay between an event occurring and the data being available to the business. It is important to remember that the data is still valid; it is just late. An example from a business perspective:
An online business promises next-day delivery and a customer makes an order on Monday. Because of maintenance in the order processing system the order gets added to the warehouse packing list on Tuesday, resulting in delivery on Wednesday. The order data is accurate in the context of the business, but because of the delay it is no longer quality data according to timeliness.
What counts as up to date is subjective, but timeliness should nevertheless be part of the Data Quality metrics. A dedicated tool is often unnecessary; it is helpful simply to check the important dates, or when data was last updated.
A simple MSSQL example of a timeliness check:
SELECT MIN([date]) AS earliest, MAX([date]) AS latest FROM table_A;
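If a table records both when an event happened and when it was loaded, the delay itself can be measured. A sketch assuming hypothetical event_date and loaded_date columns:

```sql
-- Average delay in days between the event and its arrival in the database.
SELECT AVG(DATEDIFF(day, event_date, loaded_date)) AS avg_delay_days
FROM table_A;
```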
Currency is a similar dimension and often grouped with (or overlooked next to) timeliness. It measures whether data correctly reflects state changes in a collection. The state of an object can change over time, and if that change is not captured correctly the data loses quality or becomes completely useless.
In this example a distribution business keeps track of the last delivery dates for each of their customers and prepares orders in advance to ensure on-time delivery. When the last delivery date is not updated correctly, an order might be prepared even though a customer has already received a shipment on a more recent date. The data still shows the old state for that customer’s latest delivery, which means the data currency is bad.
This highlights the difference between timeliness and currency. Timeliness concerns the late arrival of data: the information is delayed but still accurate. Currency concerns whether the data has lost its value because a state change was never captured.
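A currency check like the delivery example can be sketched by comparing the recorded state against the source of truth; the customers and shipments tables and their columns here are hypothetical:

```sql
-- Flag customers whose recorded last delivery date is stale.
SELECT c.customer_id,
       c.last_delivery_date,
       MAX(s.shipped_on) AS latest_shipment
FROM customers AS c
JOIN shipments AS s
  ON s.customer_id = c.customer_id
GROUP BY c.customer_id, c.last_delivery_date
HAVING MAX(s.shipped_on) > c.last_delivery_date;
```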
Next time we’re back with more about Consistency, Validity and Accuracy! And there is even more educational content available in other posts in our blog. Also follow us on Instagram, TikTok, LinkedIn, YouTube for more information and don’t hesitate to contact us!