UGuide - User-guided discovery of FD-detectable errors

  • Saravanan Thirumuruganathan
  • Laure Berti-Equille
  • Mourad Ouzzani
  • Jorge Arnulfo Quiane-Ruiz
  • Nan Tang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

26 Citations (Scopus)

Abstract

Error detection is the process of identifying problematic data cells that are different from their ground truth. Functional dependencies (FDs) have been widely studied in support of this process. Oftentimes, it is assumed that FDs are given by experts. Unfortunately, it is usually hard and expensive for the experts to define such FDs. In addition, automatic data profiling over dirty data in order to find correct FDs is known to be a hard problem. In this paper, we propose an end-to-end solution to detect FD-detectable errors from dirty data. The broad intuition is that given a dirty dataset, it is feasible to automatically find approximate FDs, as well as data that is possibly erroneous. Arguably, at this point, only experts can confirm true FDs or true errors. However, in practice, experts never have enough budget to find all errors. Hence, our problem is, given a limited budget of expert's time, which questions we should ask, either FDs, cells, or tuples, such that we can find as many data errors as possible. We present efficient algorithms to interact with the user. Extensive experiments demonstrate that our proposed framework is effective in detecting errors from dirty data.
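To make the notion of an FD-detectable error concrete, here is a minimal sketch of checking one candidate FD against a dirty table. It is an illustration only, not the algorithm from the paper: the function name `fd_violations`, the majority-vote heuristic for picking suspect rows, and the sample `rows` are all assumptions introduced for this example.

```python
from collections import Counter, defaultdict


def fd_violations(rows, lhs, rhs):
    """Check the candidate FD lhs -> rhs on a list of dict rows.

    Returns (suspect_indices, error_ratio). A row is a suspect when its
    rhs value differs from the most common rhs value among rows sharing
    the same lhs values (a majority-vote heuristic, assumed here).
    """
    # Group row indices by their left-hand-side values.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)

    suspects = []
    for indices in groups.values():
        counts = Counter(rows[i][rhs] for i in indices)
        if len(counts) > 1:  # the FD is violated inside this group
            majority, _ = counts.most_common(1)[0]
            suspects.extend(i for i in indices if rows[i][rhs] != majority)

    # Fraction of rows that would have to be dropped for the FD to hold.
    ratio = len(suspects) / len(rows) if rows else 0.0
    return sorted(suspects), ratio


# Hypothetical sample data: zip -> city should hold, row 2 breaks it.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]
suspects, ratio = fd_violations(rows, ["zip"], "city")
```

A small error ratio suggests the FD may hold approximately, with the flagged rows as candidate errors. The check alone cannot tell whether the FD or the data is wrong, which is the kind of question the abstract proposes putting to an expert under a limited budget.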

Original language: English
Title of host publication: SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1385-1397
Number of pages: 13
ISBN (Electronic): 9781450341974
Publication status: Published - 9 May 2017
Event: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017 - Chicago, United States
Duration: 14 May 2017 to 19 May 2017

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume: Part F127746
ISSN (Print): 0730-8078

Conference

Conference: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Country/Territory: United States
City: Chicago
Period: 14/05/17 to 19/05/17
