Speech Collage: Code-Switched Audio Generation by Collaging Monolingual Corpora

  • Amir Hussein*
  • Dorsa Zeinali
  • Ondrej Klejch
  • Matthew Wiesner
  • Brian Yan
  • Shammur Chowdhury
  • Ahmed Ali
  • Shinji Watanabe
  • Sanjeev Khudanpur

  *Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

4 Citations (Scopus)

Abstract

Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness of the generated audio using an overlap-add approach. We investigate the impact of generated data on speech recognition in two scenarios: using in-domain CS text and a zero-shot approach with synthesized CS text. Empirical results show relative reductions of up to 34.4% in Mixed-Error Rate for the in-domain scenario and up to 16.2% in Word-Error Rate for the zero-shot scenario. Lastly, we demonstrate that CS augmentation strengthens the model's tendency to code-switch and reduces its monolingual bias.
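
The splicing-with-overlap-add idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the window shape or overlap length, so the raised-cosine crossfade, the fixed `overlap` parameter, and the function names `crossfade_weights` and `overlap_add` are all assumptions made for illustration. Segments are plain lists of samples, standing in for audio units cut from monolingual recordings.

```python
import math


def crossfade_weights(overlap):
    """Fade-in weights of a raised-cosine ramp (assumed window shape).

    The matching fade-out weight at each position is 1 - w, so the two
    weights always sum to 1 across the overlap region.
    """
    return [0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
            for i in range(overlap)]


def overlap_add(segments, overlap):
    """Join audio segments, blending `overlap` samples at every boundary.

    segments: list of sample sequences (hypothetical spliced units).
    overlap:  number of samples shared by neighbouring segments.
    """
    if not segments:
        return []
    out = [float(x) for x in segments[0]]
    fade_in = crossfade_weights(overlap) if overlap > 0 else []
    for seg in segments[1:]:
        if len(seg) < overlap or len(out) < overlap:
            raise ValueError("segment shorter than the overlap length")
        start = len(out) - overlap
        # Blend the tail of the running output with the head of the next segment.
        for i, w in enumerate(fade_in):
            out[start + i] = (1.0 - w) * out[start + i] + w * seg[i]
        # Append the part of the new segment that lies beyond the overlap.
        out.extend(float(x) for x in seg[overlap:])
    return out


# Two toy "segments": the join ramps smoothly from 1.0 down to 0.0
# instead of jumping abruptly at the boundary.
joined = overlap_add([[1.0] * 8, [0.0] * 8], overlap=4)
```

With plain concatenation the boundary between the two toy segments would be a hard step; with the crossfade, the four overlapping samples decrease gradually, which is the kind of smoothing the overlap-add step is meant to provide. Each join shortens the output by `overlap` samples relative to simple concatenation.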

Original language: English
Pages (from-to): 12006-12010
Number of pages: 5
Journal: Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
Publication status: Published - 18 Mar 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Keywords

  • ASR
  • Code-switching
  • Data augmentation
  • End-to-end
  • Zero-shot learning
