TY - GEN
T1 - Speech Representation Analysis Based on Inter- and Intra-Model Similarities
AU - El Kheir, Yassine
AU - Ali, Ahmed
AU - Chowdhury, Shammur Absar
N1 - Publisher Copyright:
©2024 IEEE.
PY - 2024/4/19
Y1 - 2024/4/19
N2 - Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models that vary in their training paradigm – contrastive (Wav2Vec2.0) and predictive (HuBERT) – and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons; (ii) layer representations; (iii) attention weights; and (iv) comparisons of the representations with their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts. We publicly release our code to facilitate further research.
AB - Self-supervised models have revolutionized speech processing, achieving new levels of performance in a wide variety of tasks with limited resources. However, the inner workings of these models are still opaque. In this paper, we aim to analyze the encoded contextual representations of these foundation models based on their inter- and intra-model similarity, independent of any external annotation and task-specific constraint. We examine different SSL models that vary in their training paradigm – contrastive (Wav2Vec2.0) and predictive (HuBERT) – and in model size (base and large). We explore these models at different levels of localization/distributivity of information, including (i) individual neurons; (ii) layer representations; (iii) attention weights; and (iv) comparisons of the representations with their finetuned counterparts. Our results highlight that these models converge to similar representation subspaces but not to similar neuron-localized concepts. We publicly release our code to facilitate further research.
KW - Inter- and Intra-Similarities
KW - Self-Supervised Learning
KW - Speech Models
UR - https://www.scopus.com/pages/publications/85202281466
U2 - 10.1109/ICASSPW62465.2024.10669908
DO - 10.1109/ICASSPW62465.2024.10669908
M3 - Conference contribution
AN - SCOPUS:85202281466
T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
SP - 848
EP - 852
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Y2 - 14 April 2024 through 19 April 2024
ER -