Contact person: Mireia Diez Sanchez (mireia@fit.vutbr.cz)

Internal Partners:

  1. BUT, Brno University of Technology, Mireia Diez Sanchez, mireia@fit.vutbr.cz; cernocky@fit.vutbr.cz
  2. TUB, Technische Universität Berlin, Tim Polzehl, tim.polzehl@dfki.de; klaus.r.mueller@googlemail.com


In this microproject, “AI: Automatic Speech Recognition made accessible for people with dysarthria”, we pursued enabling access to AI technology for those who may have special needs when interacting with it. Dysarthria is a motor speech disorder resulting from neurological injury and is characterized by poor articulation of phonemes. Within automatic speech recognition (ASR), dysarthric speech recognition is a challenging task due to the scarcity of supervised data and the diversity of the impairment.

The project studied the adaptation of ASR systems to impaired speech. Specifically, the microproject focused on improving ASR for speakers with dysarthria and/or stuttering impairments of varying severity. The work used the German “Lautarchive” data, comprising only 130 hours of untranscribed doctor-patient conversations in German, and the English TORGO dataset, applying human-in-the-loop methods: we spotted individual errors and regions of low certainty in the ASR output in order to apply human-originated improvements and clarifications to the AI decision process.
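
As a concrete illustration of the human-in-the-loop step, the sketch below flags contiguous low-confidence regions of an ASR hypothesis for human review. It is a minimal sketch, not the project's actual pipeline: the Word class, the threshold, and the confidence scores are illustrative placeholders; real scores would come from the decoder (e.g. lattice posteriors or CTC frame probabilities).

    # Minimal sketch (illustrative, not the project's pipeline): flag
    # contiguous low-confidence word runs in an ASR hypothesis so a human
    # can verify or correct them.
    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        start: float        # start time in seconds
        end: float          # end time in seconds
        confidence: float   # decoder confidence in [0, 1]

    def flag_for_review(words, threshold=0.6):
        """Return contiguous runs of words whose confidence is below threshold."""
        runs, current = [], []
        for w in words:
            if w.confidence < threshold:
                current.append(w)
            elif current:
                runs.append(current)
                current = []
        if current:
            runs.append(current)
        return runs

    # Example hypothesis with made-up scores:
    hyp = [Word("the", 0.0, 0.2, 0.97), Word("patient", 0.2, 0.8, 0.91),
           Word("reported", 0.8, 1.4, 0.42), Word("dizziness", 1.4, 2.1, 0.38)]
    for run in flag_for_review(hyp):
        span = f"{run[0].start:.1f}-{run[-1].end:.1f}s"
        print(span, " ".join(w.text for w in run))  # hand this region to an annotator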

Results Summary

In particular, we studied the performance of different ASR architectures on dysarthric speech: hybrid LF-MMI, Transformer, and wav2vec2 models. The analysis revealed the superiority of the wav2vec2 models on this task. We investigated the importance of speaker-dependent auxiliary features, such as fMLLR transforms and x-vectors, for adapting wav2vec2 models to dysarthric speech. We showed that, in contrast to hybrid systems, wav2vec2 did not improve when its model parameters were adapted to each individual speaker.
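
For context, x-vectors are fixed-dimensional speaker embeddings extracted from a pretrained speaker-verification network. The sketch below shows one common way to obtain a per-speaker x-vector, assuming the publicly available SpeechBrain model speechbrain/spkrec-xvect-voxceleb (an assumption for illustration; the project's exact extractor may differ). fMLLR transforms, the other auxiliary feature, are estimated by a Kaldi-style HMM-GMM pass and are not shown.

    # Sketch: derive one x-vector per speaker by averaging utterance-level
    # embeddings. Assumes the pretrained SpeechBrain x-vector model; in
    # SpeechBrain >= 1.0 the import path is speechbrain.inference instead.
    import torch
    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    xvector_model = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-xvect-voxceleb")

    def speaker_xvector(wav_paths):
        """Average 512-dim x-vectors over all utterances of one speaker."""
        embeddings = []
        for path in wav_paths:
            wav, sr = torchaudio.load(path)                       # (channels, samples)
            wav = torchaudio.functional.resample(wav, sr, 16000)  # model expects 16 kHz
            embeddings.append(xvector_model.encode_batch(wav).squeeze())
        return torch.stack(embeddings).mean(dim=0)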

We proposed a wav2vec2 adapter module that ingests speaker features as auxiliary information to perform effective speaker normalization during fine-tuning. We showed that, when using the adapter module, fMLLR and x-vectors are complementary, and we demonstrated the effectiveness of the approach by outperforming the existing state of the art (SoTA) on UASpeech dysarthric ASR.
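
A minimal sketch of the adapter idea follows. Dimensions, layer placement, and the bottleneck design are illustrative assumptions, not the exact architecture from the paper: a residual adapter inside the wav2vec2 encoder conditions each frame on a speaker embedding (x-vector and/or an fMLLR-derived vector) to normalize hidden states during fine-tuning.

    # Illustrative speaker adapter (PyTorch). The real module in the paper
    # may differ; this shows the general mechanism: concatenate a speaker
    # embedding to every frame, pass through a bottleneck, add residually.
    import torch
    import torch.nn as nn

    class SpeakerAdapter(nn.Module):
        def __init__(self, hidden_dim=768, spk_dim=512, bottleneck=256):
            super().__init__()
            self.down = nn.Linear(hidden_dim + spk_dim, bottleneck)
            self.up = nn.Linear(bottleneck, hidden_dim)
            self.norm = nn.LayerNorm(hidden_dim)

        def forward(self, hidden, spk_emb):
            # hidden: (batch, frames, hidden_dim); spk_emb: (batch, spk_dim)
            spk = spk_emb.unsqueeze(1).expand(-1, hidden.size(1), -1)
            delta = self.up(torch.relu(self.down(torch.cat([hidden, spk], dim=-1))))
            return self.norm(hidden + delta)  # residual: adapter learns a correction

Such a module can be inserted after encoder blocks during fine-tuning; one common design choice (again an assumption, not necessarily the paper's setup) is to update only the adapter and output-head parameters, which keeps the per-speaker storage cost small.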

In our cross-lingual experiments, we also showed that combining English and German data for training can further improve the performance of our systems, which proves useful in scenarios where few training examples exist for a particular language.
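
The sketch below illustrates the data-pooling idea behind the cross-lingual experiments: English and German utterance lists are merged, optionally oversampling the smaller set, before fine-tuning a single model. The function and the oversampling factor are illustrative assumptions, not the exact recipe used in the project.

    # Sketch of bilingual training-data pooling; corpus entries are
    # (audio_path, transcript) pairs and the repeat factor is a tunable
    # assumption, not a value from the project.
    import random

    def mix_corpora(english, german, german_repeats=1, seed=0):
        """Pool English and German utterances, oversampling the German set
        if it is the smaller corpus, then shuffle for joint fine-tuning."""
        pooled = list(english) + list(german) * max(1, german_repeats)
        random.Random(seed).shuffle(pooled)
        return pooled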


Tangible Outcomes

  1. M. K. Baskar, T. Herzig, D. Nguyen, M. Diez, T. Polzehl, L. Burget, J. Černocký, “Speaker adaptation for Wav2vec2 based dysarthric ASR,” Proc. Interspeech 2022, pp. 3403-3407, doi: 10.21437/Interspeech.2022-10896. https://www.isca-speech.org/archive/pdfs/interspeech_2022/baskar22b_interspeech.pdf
  2. Open-source tool for training ASR models for dysarthric speech. The repository contains a baseline recipe to train a TDNN-CNN hybrid ASR system, prepared for the TORGO dataset, and an end-to-end model using the ESPnet framework, prepared for the UASpeech dataset. https://github.com/creatorscan/Dysarthric-ASR