
  1. Data Quality

Transcript

- Hello, in this part we will focus on identifying quality issues in data, how to clean data, how to transform data, and how to use visual resources to explore the data. Let's get started. The data needs to satisfy several key requirements in order to be of use for generating insights and predictions. In our business, we can identify six basic data quality requirements:

- Completeness. All the relevant data for the analysis was recorded. For example, is the data comprehensive enough? Are there missing records or values?
- Consistency. Data shares the same information across different systems. For example, are there discrepancies in data values or columns across different data tables?
- Accuracy. How well the data reflects reality. For example, is the data represented with the right number of bits, digits, or characters?
- Conformity. The data follows standard conventions. For example, is there a change in data format? Are the units correct?
- Timeliness. The data is kept up to date. For example, when was the last time that the data was recorded from the source? Is data from the source still being collected?
- Integrity. The data is valid and can be related and traced across all data in the database. For example, are there missing relations or primary key values?

Having the previous requirements in mind, it is key to perform a list of sanity checks to detect possible data issues. Detection can range from simple eyeballing to advanced statistical and visual methods. With manual checks, we may be able to verify the data specifications: Are all the attributes present? Can we scan the full set of records? Are there unusual gaps, missing, or duplicate values? Can we check that the number of files, file sizes, or number of records is correct? However, manual checks are mostly suitable for inspecting small data sets. A more systematic and rigorous check will involve comparing estimates such as averages, medians, or frequencies with expected values and bounds, and visualizing the data to find trends, relations, and distributions with the aid of multiple graphing methods; the code sketches after this transcript illustrate both kinds of checks. In the next slides, we will focus on detailing these types of checks, as well as describing ways to cleanse the data.

The following table illustrates several issues that we may encounter when dealing with data. The most common one is missing data, indicated in gray. Another one is a lack of conformity, caused either by type or format discrepancies, indicated here in green, or by human typos, indicated in red. Data could also be truncated by software or hardware limitations or by human intervention; this induces a loss of accuracy, indicated in yellow. We also observe sudden or anomalous changes in data trends, also called outliers, indicated in pink. Additional issues of data consistency and integrity arise when different data sets are compared and intended to be merged. Thus, you can infer that detecting data issues can be a highly time-consuming and challenging task without the aid of suitable software and visual resources. Overlooking these data quality issues will lead to misleading or biased analytics results. That's why a major portion of the process is devoted to data cleansing.
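As a concrete starting point, here is a minimal sketch of the basic sanity checks described above, written with pandas. The file name, the `order_id` key column, and the expected row count are hypothetical stand-ins chosen for illustration, not values from the lesson.

```python
import pandas as pd

# Load a hypothetical table; the file name is an assumption for illustration.
df = pd.read_csv("orders.csv")

# Completeness: count missing values per column.
print(df.isna().sum())

# Integrity: look for duplicate records and duplicate primary-key values.
print("duplicate rows:", df.duplicated().sum())
print("duplicate keys:", df["order_id"].duplicated().sum())  # assumed key column

# Scale: confirm the record count matches what we expect from the source.
expected_rows = 10_000  # assumed expectation about the source
assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
```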
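For the more systematic checks, one way to compare estimates such as means, medians, and category frequencies against expected values and bounds is sketched below; the column names, bounds, and the set of valid status values are all assumptions chosen for illustration.

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file name

# Compare simple estimates with expected values and bounds.
price = df["price"]  # assumed numeric column
assert 5.0 <= price.mean() <= 50.0, "mean price outside the expected range"
assert price.median() > 0, "median price should be positive"

# Conformity: check categorical values against the standard convention.
valid_status = {"pending", "shipped", "delivered"}  # assumed valid domain
unexpected = set(df["status"].unique()) - valid_status
print("non-conforming status values:", unexpected)

# Outliers: flag values outside 1.5 * IQR, a common rule of thumb.
q1, q3 = price.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(price < q1 - 1.5 * iqr) | (price > q3 + 1.5 * iqr)]
print("potential outliers:", len(outliers))
```

The IQR rule here is just one common heuristic for the anomalous values the transcript calls outliers; domain-specific bounds are often more reliable when they are known.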
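Finally, a matplotlib sketch of the visual exploration the transcript mentions: a histogram to reveal gaps, spikes, or truncation in a distribution, and a time-series plot to expose sudden or anomalous changes in a trend. The column names are again assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("orders.csv")  # hypothetical file name

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: gaps, suspicious spikes, or truncation show up in the shape.
df["price"].plot.hist(bins=50, ax=axes[0], title="price distribution")

# Trend over time: sudden or anomalous changes stand out visually.
daily = (df.assign(date=pd.to_datetime(df["date"]))  # assumed date column
           .groupby("date")["price"].mean())
daily.plot(ax=axes[1], title="mean price over time")

plt.tight_layout()
plt.show()
```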