Record Linkage and Fusion over Web Databases

Mourad Ouzzani, Eduard Dragut, El Kindi, Amgad Madkour

Research output: Contribution to conference › Poster › peer-review

Abstract

Many data-intensive applications on the Web require integrating data from multiple sources (Web databases) at query time. Online sources may refer to the same real-world entity in different ways, and some may provide outdated or erroneous data. An important task is to recognize and merge the various references to the same entity at query time. Almost all existing duplicate detection and fusion techniques work in the offline setting and thus do not meet the online constraint. At least two aspects differentiate online duplicate detection and fusion from their offline counterparts. First, the offline setting assumes that the entire data set is available, while the online setting cannot make such a strong assumption. Second, several iterations (query submissions) may be required to compute the “ideal” representation of an entity in the online setting.

We propose a general framework to address this problem: an interactive caching solution. A set of frequently requested records is cleaned offline and cached for future reference. Records that newly arrive in response to a stream of queries are cleaned jointly with the records in the cache, presented to users, and appended to the cache.
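The caching loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `LinkageCache`, the name-similarity matcher `similar`, the gap-filling rule in `fuse`, the 0.85 threshold, and the LRU eviction policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical matcher: case-insensitive string similarity on names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuse(old: dict, new: dict) -> dict:
    """Hypothetical fusion rule: keep existing values, fill gaps from the new record."""
    merged = dict(old)
    for key, value in new.items():
        if not merged.get(key):
            merged[key] = value
    return merged


class LinkageCache:
    """Cache of cleaned entities with LRU eviction (one assumed policy among many)."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.entities = OrderedDict()  # entity id -> fused record
        self._next_id = 0

    def process(self, record: dict) -> dict:
        """Link one arriving record against the cache, fuse it, and store the result."""
        # Linkage step: look for a cached entity that matches the new record.
        match_id = None
        for entity_id, cached in self.entities.items():
            if similar(cached["name"], record["name"]):
                match_id = entity_id
                break

        if match_id is None:
            # No match: the record becomes a new cached entity.
            match_id = self._next_id
            self._next_id += 1
            fused = dict(record)
        else:
            # Match: fuse the new record with the cached representation.
            fused = fuse(self.entities[match_id], record)

        self.entities[match_id] = fused
        self.entities.move_to_end(match_id)  # mark as most recently used
        if len(self.entities) > self.capacity:
            self.entities.popitem(last=False)  # evict the least recently used entity
        return fused


cache = LinkageCache(capacity=2)
cache.process({"name": "Luigi's Pizzeria", "phone": ""})
result = cache.process({"name": "Luigis Pizzeria", "phone": "555-0100"})
print(result)
```

In this sketch each query result is passed through `process`, so the cached representation of an entity can improve over several query submissions as more references to it arrive.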

We introduce two online record linkage and fusion approaches: (i) a record-based approach and (ii) a graph-based approach. They differ chiefly in how they organize data in the cache and in their computational characteristics. We conduct a comprehensive empirical study of the two techniques with real data from the Web. We couple their analysis with commonly used cache settings: static versus dynamic caching, cache size, and eviction policies.
Original language: English
Publication status: Published - Nov 2011