TY - GEN
T1 - Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition
AU - Nandi, Rabindra Nath
AU - Menon, Mehadi Hasan
AU - Al Muntasir, Tareq
AU - Sarker, Sagor
AU - Muhtaseem, Quazi Sarwar
AU - Islam, Md Tariqul
AU - Chowdhury, Shammur Absar
AU - Alam, Firoj
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - One of the major challenges in developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hour labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then used the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR system on publicly available datasets and compared it with other available models. To investigate its efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data, among others. Our results demonstrate the efficacy of the model trained on pseudo-labeled data on the designed test set as well as on publicly available Bangla datasets. The experimental resources will be publicly available.
AB - One of the major challenges in developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hour labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then used the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR system on publicly available datasets and compared it with other available models. To investigate its efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data, among others. Our results demonstrate the efficacy of the model trained on pseudo-labeled data on the designed test set as well as on publicly available Bangla datasets. The experimental resources will be publicly available.
UR - https://www.scopus.com/pages/publications/85185001457
M3 - Conference contribution
AN - SCOPUS:85185001457
T3 - BLP 2023 - 1st Workshop on Bangla Language Processing, Proceedings of the Workshop
SP - 196
EP - 200
BT - BLP 2023 - 1st Workshop on Bangla Language Processing, Proceedings of the Workshop
A2 - Sadeque, Farig
A2 - Amin, Ruhul
A2 - Kar, Sudipta
A2 - Chowdhury, Shammur Absar
A2 - Alam, Firoj
PB - Association for Computational Linguistics (ACL)
T2 - 1st Workshop on Bangla Language Processing, BLP 2023
Y2 - 7 December 2023
ER -