Quantifying duplication to improve data quality

Abstract

Deduplication is a costly and tedious task that involves identifying duplicate records in a dataset. High duplication rates lead to poor data quality, introducing ambiguity as to whether two records refer to the same entity. Existing deduplication techniques compare sets of attribute values and check whether given similarity thresholds are satisfied. While these techniques identify potential duplicate records, they do not provide users with any information about the degree of duplication, i.e., the varying levels of closeness among the attribute values and between the records that define the duplicates. In this paper, we present a duplication metric that quantifies the level of duplication for an attribute value and within an attribute. Analysts can use this metric to understand the distribution and similarity of values during the data cleaning process. We present a deduplication framework that differentiates terms during the similarity matching step and is agnostic to the ordering of values within a record. We compare our framework against two existing approaches and show that it achieves improved accuracy and performance on real data collections.
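The abstract does not define the metric itself; as a rough illustration of the general idea of quantifying duplication per value and per attribute, the sketch below scores each attribute value by the fraction of other values in the same column that exceed a string-similarity threshold, and averages those scores for an attribute-level figure. The similarity function (Python's difflib), the threshold, and the helper names are assumptions for illustration only, not the paper's definitions.

```python
# Illustrative sketch only: this is NOT the paper's metric, just a simple
# approximation of "degree of duplication" per value and per attribute.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_duplication(values: list[str], idx: int, threshold: float = 0.8) -> float:
    """Fraction of the other values that are near-duplicates of values[idx]."""
    others = [v for i, v in enumerate(values) if i != idx]
    if not others:
        return 0.0
    hits = sum(1 for v in others if similarity(values[idx], v) >= threshold)
    return hits / len(others)


def attribute_duplication(values: list[str], threshold: float = 0.8) -> float:
    """Average per-value duplication level across an attribute column."""
    if not values:
        return 0.0
    levels = [value_duplication(values, i, threshold) for i in range(len(values))]
    return sum(levels) / len(levels)


if __name__ == "__main__":
    # Hypothetical "city" attribute with typographical near-duplicates.
    city = ["Toronto", "toronto", "Tornto", "Markham", "Markam", "Ottawa"]
    for i, v in enumerate(city):
        print(f"{v!r}: duplication level = {value_duplication(city, i):.2f}")
    print(f"attribute-level duplication = {attribute_duplication(city):.2f}")
```

In this toy example, the misspelled city names receive high per-value scores while "Ottawa" scores zero, which is the kind of per-value and per-attribute signal the abstract describes an analyst using during data cleaning.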

Authors

Huang Y; Chiang F; Saillet Y; Maier A; Spisic D; Petitclerc M; Zuzarte C

Pagination

pp. 272-278

Publication Date

January 1, 2020

Conference proceedings

Proceedings of the 27th Annual International Conference on Computer Science and Software Engineering (CASCON 2017)