TY - JOUR
T1 - RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
T2 - 47th International Conference on Very Large Data Bases, VLDB 2021
AU - Tang, Nan
AU - Fan, Ju
AU - Li, Fangyi
AU - Tu, Jianhong
AU - Du, Xiaoyong
AU - Li, Guoliang
AU - Madden, Sam
AU - Ouzzani, Mourad
N1 - Publisher Copyright:
© 2021, VLDB Endowment. All rights reserved.
PY - 2021
Y1 - 2021
AB - Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models (“X” could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
UR - https://www.scopus.com/pages/publications/85115297213
U2 - 10.14778/3457390.3457391
DO - 10.14778/3457390.3457391
M3 - Conference article
AN - SCOPUS:85115297213
SN - 2150-8097
VL - 14
SP - 1254
EP - 1261
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 8
Y2 - 16 August 2021 through 20 August 2021
ER -