Unsupervised Code-switched Text Generation from Parallel Text

  • Jie Chi
  • Brian Lu
  • Jason Eisner
  • Peter Bell
  • Preethi Jyothi
  • Ahmed M. Ali

Research output: Contribution to journal › Conference article › peer-review

5 Citations (Scopus)

Abstract

There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. Collecting real transcribed spoken CS data is expensive and difficult because CS data is hard to find and identify in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require CS data during training or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.
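To make the idea of "forcing" a translation model to code-switch concrete, here is a minimal toy sketch of one way such forcing could work: greedy decoding where, at a chosen step, the next-token distribution is masked to tokens from the *other* language. The vocabulary, score table, and `force_switch_at` parameter are all hypothetical illustrations, not the system described in the paper.

```python
# Hypothetical sketch: forcing a code-switch during greedy decoding by
# masking the next-token scores to the other language's vocabulary.
# `toy_scores` is a stand-in for an MT decoder, not the paper's model.

# Toy shared Mandarin-English vocabulary, each token tagged with its language.
VOCAB = {
    "我": "zh", "喜欢": "zh", "音乐": "zh",
    "I": "en", "like": "en", "music": "en", "</s>": None,
}

def toy_scores(prefix):
    """Stand-in for an MT decoder's next-token scores (higher = better)."""
    table = {
        (): {"我": 2.0, "I": 1.5},
        ("我",): {"喜欢": 2.0, "like": 1.0},
        ("我", "喜欢"): {"音乐": 2.0, "music": 1.8},
    }
    return table.get(tuple(prefix), {"</s>": 3.0})

def decode(force_switch_at=None, max_len=10):
    """Greedy decode; at step `force_switch_at`, mask out tokens in the
    language of the previous token so the decoder must switch languages."""
    out = []
    for step in range(max_len):
        scores = toy_scores(out)
        if force_switch_at == step and out:
            prev_lang = VOCAB[out[-1]]
            masked = {t: s for t, s in scores.items()
                      if VOCAB.get(t) != prev_lang}
            scores = masked or scores  # fall back if masking empties the set
        tok = max(scores, key=scores.get)
        if tok == "</s>":
            break
        out.append(tok)
    return out

print(decode())                   # unconstrained: stays monolingual
print(decode(force_switch_at=2))  # forced to switch languages at step 2
```

Unconstrained decoding yields the monolingual "我 喜欢 音乐", while forcing a switch at step 2 yields the code-switched "我 喜欢 music". A real system would apply such a constraint to an actual multilingual decoder's logits rather than a fixed score table.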

Original language: English
Pages (from-to): 1419-1423
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
DOIs
Publication status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Keywords

  • Code-switching
  • Data augmentation
  • Encoder-decoder
  • Text generation
  • Unsupervised learning
