Setting Up the Data Quality Process

Data Quality is the process used to derive unique, standardized, and complete master data. Data Quality routines ensure that the data entered into a repository is "golden", so that the data can be managed appropriately. If data quality is low, the repository may contain two or more records for the same logical item, resulting in duplicate data.

To avoid duplicate records, you must bring each logical record into a standardized form. Only after this data standardization can you reliably check for duplicates. This process of standardizing the data is also known as data cleansing.
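
The following is a minimal sketch of what such a standardization (cleansing) step can look like. The field names and the specific rules (whitespace collapsing, case normalization, digit-only phone numbers) are illustrative assumptions, not a prescribed schema; real cleansing rules depend on your data model.

```python
import re


def standardize_record(record: dict) -> dict:
    """Bring a raw record into a standardized form before checking for duplicates.

    Field names and rules here are illustrative assumptions only.
    """
    cleaned = dict(record)

    # Normalize free-text fields: trim, collapse internal whitespace, title-case.
    for field in ("first_name", "last_name", "city"):
        value = cleaned.get(field, "")
        cleaned[field] = re.sub(r"\s+", " ", value).strip().title()

    # Keep only digits in the phone number so "+1 (555) 010-9999"
    # and "15550109999" compare as the same value.
    cleaned["phone"] = re.sub(r"\D", "", cleaned.get("phone", ""))

    # Lower-case the e-mail address, which is treated as case-insensitive.
    cleaned["email"] = cleaned.get("email", "").strip().lower()

    return cleaned


raw = {
    "first_name": "  aNNa ",
    "last_name": "smith",
    "city": "new  york",
    "phone": "+1 (555) 010-9999",
    "email": "Anna.Smith@Example.com",
}
print(standardize_record(raw))
```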

Even after data cleansing, it can be difficult to determine whether a record is really new or is a variation (that is, a version) of an existing record. Deciding this typically requires a mix of automated decisions (for most records) and some human intervention, and it is often not a simple decision. For example, deciding whether two person records refer to the same person is difficult when a reliable or unique ID is missing. The answer typically depends on the nature of the data and, in particular, on which attributes are needed for identification. If a reliable or unique ID is supplied, deduplication is not required; however, this ideal scenario rarely occurs in the real world.
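
The sketch below illustrates this mix of automated and manual decisions under stated assumptions: the similarity measure (a simple string ratio from Python's standard `difflib`), the identifying attributes, and the two thresholds are all hypothetical choices for illustration; in practice they depend on the nature of the data and which attributes identify a record.

```python
from difflib import SequenceMatcher

# Illustrative thresholds; real values depend on the data and its identifying attributes.
AUTO_MATCH = 0.90   # similar enough to treat as the same logical record
AUTO_NEW = 0.60     # dissimilar enough to treat as a genuinely new record


def similarity(a: dict, b: dict) -> float:
    """Average string similarity over the attributes used for identification."""
    fields = ("first_name", "last_name", "city")
    scores = [
        SequenceMatcher(None, a.get(f, ""), b.get(f, "")).ratio() for f in fields
    ]
    return sum(scores) / len(scores)


def classify(candidate: dict, existing: dict) -> str:
    """Automate the clear cases and route the ambiguous ones to a person."""
    score = similarity(candidate, existing)
    if score >= AUTO_MATCH:
        return "duplicate"        # automated decision: same logical record
    if score <= AUTO_NEW:
        return "new"              # automated decision: genuinely new record
    return "manual_review"        # ambiguous: requires human intervention


existing = {"first_name": "Anna", "last_name": "Smith", "city": "New York"}
candidate = {"first_name": "Ana", "last_name": "Smith", "city": "New York"}
print(classify(candidate, existing))  # prints "duplicate" for this near-identical pair
```

Records that fall between the two thresholds are queued for human review rather than being merged or inserted automatically, which keeps the automated path for the clear majority of records.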