RPT: Relational pre-trained transformer is almost all you need towards democratizing data preparation

  • Nan Tang
  • Ju Fan*
  • Fangyi Li
  • Jianhong Tu
  • Xiaoyong Du
  • Guoliang Li
  • Sam Madden
  • Mourad Ouzzani

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

59 Citations (Scopus)

Abstract

Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models (“X” could be tuple, token, label, JSON, and so on). RPT is pre-trained as a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques, such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
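The pre-training objective described in the abstract (corrupt an input tuple, then learn to reconstruct the original) can be sketched as follows. This is a minimal illustrative sketch: the `[A]`/`[V]` serialization format, the `[MASK]` token, and the masking probability are assumptions for exposition, not the paper's exact scheme.

```python
import random

MASK = "[MASK]"  # placeholder token standing in for a corrupted value (assumed name)

def serialize(tup):
    """Flatten a relational tuple (dict of attribute -> value) into a
    token sequence; the [A]/[V] markers are an illustrative convention."""
    return " ".join(f"[A] {attr} [V] {val}" for attr, val in tup.items())

def corrupt(tup, mask_prob=0.5, rng=None):
    """Randomly replace attribute values with MASK, producing the
    noisy encoder input for denoising pre-training."""
    rng = rng or random.Random(0)
    return {attr: MASK if rng.random() < mask_prob else val
            for attr, val in tup.items()}

# One training pair: the encoder reads the corrupted tuple,
# the decoder is trained to emit the original tuple.
t = {"name": "Michael Stonebraker", "affiliation": "MIT", "city": "Cambridge"}
source = serialize(corrupt(t))  # corrupted input sequence
target = serialize(t)           # reconstruction target
```

A sequence-to-sequence model (bidirectional encoder, autoregressive decoder) would then be trained on many such (source, target) pairs drawn from relational tables.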

Original language: English
Pages (from-to): 1254-1261
Number of pages: 8
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 8
DOIs
Publication status: Published - 2021
Event: 47th International Conference on Very Large Data Bases, VLDB 2021 - Virtual, Online
Duration: 16 Aug 2021 - 20 Aug 2021

