Conference
Quantifying duplication to improve data quality
Abstract
Deduplication is the costly and tedious task of identifying duplicate records in a dataset. High duplication rates degrade data quality, creating ambiguity as to whether two records refer to the same entity. Existing deduplication techniques compare a set of attribute values and verify whether given similarity thresholds are satisfied. While these techniques identify potential duplicate records, they do not provide …
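The threshold-based comparison the abstract describes can be sketched as follows; this is an illustrative example only, not the authors' method. The attribute names, records, and thresholds are hypothetical, and `difflib`'s ratio stands in for whatever similarity measure a real system would use.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] (difflib's ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_potential_duplicate(rec1: dict, rec2: dict, thresholds: dict) -> bool:
    """Flag two records as potential duplicates when every compared
    attribute meets its per-attribute similarity threshold."""
    return all(
        similarity(rec1[attr], rec2[attr]) >= t
        for attr, t in thresholds.items()
    )

# Hypothetical records: a likely duplicate pair with a minor name variation.
r1 = {"name": "Jon Smith", "city": "Toronto"}
r2 = {"name": "John Smith", "city": "Toronto"}
print(is_potential_duplicate(r1, r2, {"name": 0.8, "city": 0.9}))  # True
```

Note that this yields only a yes/no flag per pair; it says nothing about how severe duplication is across the whole dataset, which is the gap the paper's quantification addresses.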
Authors
Huang Y; Chiang F; Saillet Y; Maier A; Spisic D; Petitclerc M; Zuzarte C
Pagination
pp. 272-278
Publication Date
January 1, 2020
Conference proceedings
Proceedings of the 27th Annual International Conference on Computer Science and Software Engineering (CASCON 2017)