Contact person: Petr Schwarz, Brno University of Technology (schwarzp@fit.vutbr.cz)

Internal Partners:

  1. Brno University of Technology, Petr Schwarz, schwarzp@fit.vutbr.cz
  2. Charles University, Ondrej Dusek, odusek@ufal.mff.cuni.cz


This project provides data, tools, and baselines that enable us to study and improve context exchange among components and between dialog sides (the AI agent and the human) in voice dialog systems. Better context exchange allows us to build more accurate automatic speech transcription, better dialog flow modeling, more fluent speech synthesis, and more powerful AI agents. Context exchange can be seen as interactive grounding in two senses: between dialog sides (for example, technologies like automatic speech transcription rarely use information from the other dialog side to adapt) and among dialog system components (speech synthesis rarely uses dialog context to produce more fluent or expressive speech). The individual project outputs are summarized below.

Results Summary

1) Audio data collection software based on the Twilio platform and WebRTC desktop/mobile clients. Its purpose is to collect audio recordings of communication between agents (companies or service providers, for example a travel info provider) and users. The software enables us to collect very realistic voice dialogs with high-quality audio (>= 16 kHz sampling frequency) on the agent side and narrowband telephone-quality audio on the user side. The code is available here: https://github.com/oplatek/speechwoz
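The agent/user quality gap above can be illustrated with a toy simulation: reducing 16 kHz wideband audio to 8 kHz telephone bandwidth. This is only a minimal numpy sketch of the idea, not part of the speechwoz software; the function name and the pair-averaging anti-alias filter are illustrative choices.

```python
import numpy as np

def to_telephone_quality(wideband, in_rate=16000, out_rate=8000):
    """Crude simulation of the user-side channel: downsample 16 kHz
    wideband audio to 8 kHz telephone bandwidth by averaging sample
    pairs (a simple anti-aliasing low-pass before 2:1 decimation).

    Illustrative only -- a real pipeline would use a proper resampler."""
    assert in_rate == 2 * out_rate, "sketch only handles 2:1 decimation"
    wideband = np.asarray(wideband, dtype=float)
    n = len(wideband) // 2 * 2          # drop a trailing odd sample, if any
    pairs = wideband[:n].reshape(-1, 2)
    return pairs.mean(axis=1)

# One second of a 440 Hz tone at 16 kHz (agent side), reduced to telephone rate.
t = np.arange(16000) / 16000.0
agent_side = np.sin(2 * np.pi * 440 * t)
user_side = to_telephone_quality(agent_side)
```

Averaging adjacent samples is the simplest possible low-pass filter; it merely makes the bandwidth asymmetry between the two call sides concrete.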

2) We have established relationships with Paweł Budzianowski (Poly.AI) and Izhak Shafran (Google). Paweł created the MultiWOZ database, an excellent dialog corpus (https://arxiv.org/abs/1810.00278) that we use for the text-based experiments, and we decided to collect our audio data similarly. Izhak organized the DSTC11 Speech-Aware Dialog System Technology Challenge (https://arxiv.org/abs/2212.08704) and created artificial audio data for MultiWOZ through speech synthesis, reading, and paraphrasing. Both provided us with valuable advice for our data collection.

3) Speech dialog data: preparing the data collection platform and collecting the data are very time-consuming. Data collection is in progress, and the data will be released before June 26th, 2023.

4) Initial experiments with context exchange between dialog sides (user and agent) were performed. These experiments show a solid improvement on the automatic speech recognition side. The experiments will be re-run on the collected data and published when the collection is finished.

5) Initial experiments with training-instance weighting for response generation, which brings context into dialog system response generation, were performed. The experiments were based on the AuGPT system previously developed at CUNI. The code is available here: https://github.com/knalin55/augpt. Instance weighting increases the re-use of context compared to standard training and can push it even beyond its natural occurrence in the data. Simple threshold-based weighting appears better than designing a complex instance weight (in terms of automated metrics; the limited manual evaluation is inconclusive). Cross-entropy loss works better than unlikelihood loss, under which dialogue success may be reduced.
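The threshold-based instance weighting described above can be sketched roughly as follows. This is a minimal numpy illustration, not the actual AuGPT implementation: the context-overlap score, threshold value, and boost factor are all hypothetical stand-ins for whatever the real system uses.

```python
import numpy as np

def overlap_score(context_tokens, response_tokens):
    """Fraction of response tokens that also appear in the dialog context
    (a hypothetical proxy for how much a response re-uses context)."""
    if not response_tokens:
        return 0.0
    ctx = set(context_tokens)
    return sum(t in ctx for t in response_tokens) / len(response_tokens)

def instance_weight(score, threshold=0.3, boost=2.0):
    """Simple threshold weighting: up-weight instances that re-use context."""
    return boost if score >= threshold else 1.0

def weighted_cross_entropy(probs, weights):
    """Weighted mean negative log-likelihood over a batch; probs[i] is the
    model's probability of the i-th gold response."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * -np.log(probs)) / np.sum(weights))

# Toy batch of two (context, response) pairs.
batch = [
    (["what", "time", "is", "the", "train"],
     ["the", "train", "leaves", "at", "ten"]),
    (["hello"],
     ["how", "can", "i", "help"]),
]
scores = [overlap_score(c, r) for c, r in batch]
weights = [instance_weight(s) for s in scores]
```

In this sketch the first pair re-uses context tokens ("the", "train") and so gets the boosted weight, steering training toward context-echoing responses; this is the "simple weighting (threshold)" idea, as opposed to designing a more complex per-instance weight.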

6) We organized a team at the JSALT research summer workshop on “Automatic design of conversational models from observation of human-to-human conversation” (https://jsalt2023.univ-lemans.fr/en/automatic-design-of-conversational-models-from-observation-of-human-to-human-conversation.html, https://www.clsp.jhu.edu/2023-jelinek-summer-workshop, https://jsalt2023.univ-lemans.fr/en/index.html). This is a prestigious workshop organized by Johns Hopkins University every year; this year it was supported and co-organized by the University of Le Mans. Our topic passed a scientific review by more than 40 world-class AI researchers in Baltimore, USA, in December 2022, and was selected for the workshop, together with three others, out of 15 proposals. The workshop topic builds on the outcome of this micro-project and will reuse the collected data.

Tangible Outcomes

  1. Nalin Kumar and Ondrej Dusek. 2024. LEEETs-Dial: Linguistic Entrainment in End-to-End Task-oriented Dialogue systems. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 727–735, Mexico City, Mexico. Association for Computational Linguistics https://aclanthology.org/2024.findings-naacl.46/ 
  2. Code for audio data collection: https://github.com/oplatek/speechwoz 
  3. Code for end-to-end response generation: https://github.com/knalin55/augpt 
  4. Report for end-to-end response generation: https://docs.google.com/document/d/1iQB1YWr3wMO8aEB08BUYBqiLh0KreYjyO4EHnb395Bo/edit 
  5. “Automatic design of conversational models from observation of human-to-human conversation” workshop in the prestigious JSALT research summer workshops program https://jsalt2023.univ-lemans.fr/en/automatic-design-of-conversational-models-from-observation-of-human-to-human-conversation.html
  6. Workshop proposal: https://docs.google.com/document/d/19PAOkquQY6wnPx_wUXIx2EaInYchoCRn/edit
  7. Presentations from the prestigious JSALT research summer workshop: https://youtu.be/QS5zXkpXV3Q