TY - JOUR
T1 - From Words to Waves
T2 - 26th Interspeech Conference 2025
AU - Ersoy, Asım
AU - Mousi, Basel
AU - Chowdhury, Shammur
AU - Alam, Firoj
AU - Dalvi, Fahim
AU - Durrani, Nadir
N1 - Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.
PY - 2025
Y1 - 2025
N2 - The emergence of large language models has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. To support reproducibility, we have released our code along with a curated audio version of the SST-2 dataset for public access.
AB - The emergence of large language models has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. To support reproducibility, we have released our code along with a curated audio version of the SST-2 dataset for public access.
KW - Conceptual Abstractions
KW - Interpretability
KW - Multimodal Learning
UR - https://www.scopus.com/pages/publications/105020090673
U2 - 10.21437/Interspeech.2025-2180
DO - 10.21437/Interspeech.2025-2180
M3 - Conference article
AN - SCOPUS:105020090673
SN - 2308-457X
SP - 241
EP - 245
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 17 August 2025 through 21 August 2025
ER -