
Models for Distributed, Large Scale Data Cleaning

Abstract

Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying inconsistencies and proposing updates to the data. However, much of this work has focused on cleaning data in static environments. In the Big Data era, modern applications operate in dynamic environments where large-scale data may change frequently: consider sensor environments with a continual stream of data arrivals, or financial data such as stock prices and trading volumes. Data cleaning in such dynamic environments requires understanding the properties of the incoming data streams, and configuring system parameters to maximize performance and improve data quality. In this paper, we present a set of queueing models and analyze the impact of various system parameters on the output quality of a data cleaning system and on its performance. We assume random routing in our models, and consider a variety of system configurations that reflect potential data cleaning scenarios. We present experimental results showing that our models closely predict expected system behaviour.
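
The paper's specific models are not reproduced on this page, but the sketch below illustrates the kind of setup the abstract describes: dirty records arrive as a Poisson stream and are routed uniformly at random to one of k parallel cleaning servers, each behaving as an M/M/1 queue. All names and parameter values here are illustrative assumptions, not the authors' actual models.

```python
import random

# Illustrative sketch (not the paper's model): k parallel cleaning
# servers, Poisson arrivals routed uniformly at random, exponential
# service times. Under random routing each server sees arrival rate
# lam / k, so the M/M/1 mean time in system is 1 / (mu - lam / k),
# provided lam / k < mu.

def simulate(lam=8.0, mu=3.0, k=4, n_jobs=200_000, seed=1):
    rng = random.Random(seed)
    free_at = [0.0] * k          # time each server next becomes free
    t = 0.0                      # arrival clock
    total_sojourn = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(lam)      # Poisson arrival process
        s = rng.randrange(k)           # random routing to a server
        start = max(t, free_at[s])     # wait if the server is busy
        free_at[s] = start + rng.expovariate(mu)  # exponential service
        total_sojourn += free_at[s] - t           # departure - arrival
    return total_sojourn / n_jobs

if __name__ == "__main__":
    lam, mu, k = 8.0, 3.0, 4
    predicted = 1.0 / (mu - lam / k)   # M/M/1 response time per server
    print(f"simulated mean time in system: {simulate(lam, mu, k):.3f}")
    print(f"M/M/1 prediction:              {predicted:.3f}")
```

With these assumed parameters (lam = 8, mu = 3, k = 4), each server sees a Poisson stream of rate 2, so the predicted mean time in system is 1 / (3 - 2) = 1.0, and the simulation should agree closely.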

Authors

Maccio VJ; Chiang F; Down DG

Series

Lecture Notes in Computer Science

Volume

8643

Pagination

pp. 369-380

Publisher

Springer Nature

Publication Date

January 1, 2014

DOI

10.1007/978-3-319-13186-3_34

ISSN

0302-9743