From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

Research output: Contribution to journal; Conference article; peer-review

Abstract

The emergence of large language models has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties that can be associated with general intelligence. This raises an intriguing question: do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. To support reproducibility, we have released our code along with a curated audio version of the SST-2 dataset for public access.

Original language: English
Pages (from-to): 241-245
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 to 21 Aug 2025

Keywords

  • Conceptual Abstractions
  • Interpretability
  • Multimodal Learning
