A Density-Based Data Cleaning Approach for Deduplication with Data Consistency and Accuracy

Abstract

Data cleaning is a critical part of the data transformation stage in data warehousing, where data extracted from relational databases is usually unclean. Unclean data can compromise critical organizational tasks such as data analysis and decision making. Current data cleaning techniques generally address only one or two quality aspects, and they either assume the availability of master data or require user involvement, for example by manually assigning confidence scores that represent the correctness of data values. In this paper, we present a uniform framework and algorithms that integrate data deduplication with the repair of inconsistent data and the discovery of accurate values. We exploit the density information embedded in the data to fix errors, grouping tuples that are close to each other. We present a weight model that assigns confidence scores based on data density; the assignment is fully automated and requires no user involvement. We model inconsistent data as violations of a set of functional dependencies (FDs), since such violations are common in practice, and we present a cost model for data repairing built on the weight model. We experimentally verify the quality and scalability of our algorithms on synthetic and real datasets.
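The paper's algorithms are not reproduced on this page; the sketch below is only a minimal illustration of the general idea in the abstract, under assumed simplifications: confidence ("density") is approximated by the relative frequency of a value within a group of tuples, and repairing a violation of an assumed functional dependency zip → city means keeping the highest-confidence candidate. The function names, the toy records, and the chosen FD are hypothetical and are not the authors' implementation.

```python
from collections import Counter, defaultdict

# Toy records: (id, zip, city). Assume the FD zip -> city should hold,
# i.e. all tuples sharing a zip value must agree on the city value.
records = [
    (1, "90210", "Beverly Hills"),
    (2, "90210", "Beverly Hills"),
    (3, "90210", "Beverley Hils"),   # likely error: low-density value
    (4, "10001", "New York"),
    (5, "10001", "New York"),
]

def density_confidence(values):
    """Confidence of each candidate value = its relative frequency
    within the group (a simple stand-in for a density-based weight)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def fd_repair(records, lhs=1, rhs=2):
    """Repair violations of the FD records[lhs] -> records[rhs] by
    replacing each right-hand-side value with the highest-confidence
    (densest) candidate in its group. Illustrative only."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[lhs]].append(rec)

    repaired = []
    for group in groups.values():
        conf = density_confidence([rec[rhs] for rec in group])
        best = max(conf, key=conf.get)   # densest candidate wins
        for rec in group:
            fixed = list(rec)
            fixed[rhs] = best
            repaired.append(tuple(fixed))
    return repaired

print(fd_repair(records))
# Tuple 3's city is repaired to "Beverly Hills", the densest value
# among the tuples sharing zip 90210.
```

A cost-based repair such as the one described in the abstract would additionally weigh how many cells are changed and their confidence scores; in this sketch the densest value simply wins.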

Authors

Al-Janabi S; Janicki R

Pagination

pp. 492-501

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

July 1, 2016

DOI

10.1109/sai.2016.7556026

Name of conference

2016 SAI Computing Conference (SAI)