TY - GEN
T1 - TituLLMs
T2 - 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
AU - Nahin, Shahriar Kabir
AU - Nandi, Rabindra Nath
AU - Sarker, Sagor
AU - Muhtaseem, Quazi Sarwar
AU - Kowsher, Md
AU - Shill, Apu Chandraw
AU - Ibrahim, Md
AU - Menon, Mehadi Hasan
AU - Al Muntasir, Tareq
AU - Alam, Firoj
N1 - Publisher Copyright:
© 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. There was a lack of benchmarking datasets for evaluating LLMs in Bangla. To address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs generally outperforms its initial multilingual versions, although this is not always the case, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available.
AB - In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1B and 3B parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. There was a lack of benchmarking datasets for evaluating LLMs in Bangla. To address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs generally outperforms its initial multilingual versions, although this is not always the case, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available.
UR - https://www.scopus.com/pages/publications/105028556428
U2 - 10.18653/v1/2025.findings-acl.1279
DO - 10.18653/v1/2025.findings-acl.1279
M3 - Conference contribution
AN - SCOPUS:105028556428
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 24922
EP - 24940
BT - Findings of the Association for Computational Linguistics: ACL 2025
A2 - Che, Wanxiang
A2 - Nabende, Joyce
A2 - Shutova, Ekaterina
A2 - Pilehvar, Mohammad Taher
PB - Association for Computational Linguistics (ACL)
Y2 - 27 July 2025 through 1 August 2025
ER -