WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning

Abstract

Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text-LM is unnecessary. We introduce WhiSPA (Whisper with Semantic and Psychological Alignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper's latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Over self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving an average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output in order to get a rich psychological representation of human communication.
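As a rough illustration of the student-teacher alignment objective described in the abstract, the sketch below shows an InfoNCE-style contrastive loss between pooled audio (student) embeddings and frozen text (teacher) embeddings. The function name, pooling choice, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical InfoNCE-style contrastive loss aligning student audio
    embeddings with teacher text embeddings.

    audio_emb: (batch, dim) pooled Whisper encoder/decoder states (student)
    text_emb:  (batch, dim) SBERT sentence embeddings (teacher, frozen)
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise cosine similarities scaled by temperature: (batch, batch)
    logits = audio_emb @ text_emb.T / temperature
    # Each audio segment should be closest to its own transcript's teacher embedding
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return F.cross_entropy(logits, targets)

# Example usage: a batch of 4 segments with 768-dimensional embeddings
audio = torch.randn(4, 768)  # e.g., mean-pooled Whisper states after a projection head
text = torch.randn(4, 768)   # e.g., SBERT embeddings of the corresponding transcripts
loss = contrastive_alignment_loss(audio, text)
```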

Publication
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics