TY - JOUR
T1 - Synthesizing entity matching rules by examples
AU - Singh, Rohit
AU - Meduri, Venkata Vamsikrishna
AU - Elmagarmid, Ahmed
AU - Madden, Samuel
AU - Papotti, Paolo
AU - Quiané-Ruiz, Jorge Arnulfo
AU - Solar-Lezama, Armando
AU - Tang, Nan
N1 - Publisher Copyright:
© 2017 VLDB Endowment.
PY - 2018
Y1 - 2018
N2 - Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given highlevel specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positivenegative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).
AB - Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given highlevel specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions (∧), disjunctions (∨) and negations (¬), and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positivenegative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g., decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g., decision trees with high depth, gradient-boosting trees, random forests and SVM).
UR - https://www.scopus.com/pages/publications/85074643202
U2 - 10.14778/3149193.3149199
DO - 10.14778/3149193.3149199
M3 - Conference article
AN - SCOPUS:85074643202
SN - 2150-8097
VL - 11
SP - 189
EP - 202
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
T2 - 44th International Conference on Very Large Data Bases, VLDB 2018
Y2 - 27 August 2018 through 31 August 2018
ER -