Contact person: Patrick Paroubek, LIMSI-CNRS (pap@limsi.fr)
Internal Partners:
- Centre national de la recherche scientifique (CNRS), Patrick Paroubek
External Partners:
- Charles University Prague, O. Dušek
We aim to evaluate the usefulness of current dialogue dataset annotation and to propose annotation unification and automated enhancements that support better user modeling by enabling training on larger amounts of data. Current dataset annotation is often geared only toward teaching the dialogue system how to answer, whereas the user representation should be explicit, consistent, and as complete as possible to support more complex (e.g. cognitive) user modeling. The project will start from existing annotated dialogue corpora, such as bAbI++ and MultiWOZ, and produce extended versions with improved annotation consistency and additional user representation annotations generated automatically. We will explore unifying annotations from multiple datasets and evaluate the enhanced annotation using our own end-to-end dialogue models based on memory networks.
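As an illustration of what unifying annotations from multiple datasets can involve, the following is a minimal sketch that maps dataset-specific turn annotations onto one shared schema. Everything in it is an assumption made for illustration: the `UnifiedTurn` structure, the `dataset_a`/`dataset_b` source names, the intent labels, and the `unify_turn` helper are hypothetical and do not reflect the project's actual annotation format.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedTurn:
    """Hypothetical unified representation of a single dialogue turn."""
    speaker: str                 # "user" or "system"
    text: str                    # raw utterance
    intent: str                  # label from the shared (unified) intent set
    slots: dict = field(default_factory=dict)


# Hypothetical mappings from dataset-specific intent labels to shared labels.
INTENT_MAP = {
    "dataset_a": {"find_restaurant": "inform", "book_restaurant": "book"},
    "dataset_b": {"api_call": "request_api", "inform": "inform"},
}


def unify_turn(source: str, raw: dict) -> UnifiedTurn:
    """Convert one raw annotated turn from `source` into the unified schema."""
    # Labels without a known mapping are marked "unknown" rather than guessed.
    intent = INTENT_MAP[source].get(raw["intent"], "unknown")
    # Normalize slot names so the same slot is spelled identically everywhere.
    slots = {
        name.lower().replace("-", "_"): value
        for name, value in raw.get("slots", {}).items()
    }
    return UnifiedTurn(
        speaker=raw["speaker"].lower(),
        text=raw["text"],
        intent=intent,
        slots=slots,
    )


if __name__ == "__main__":
    raw_turn = {
        "speaker": "USER",
        "text": "I need a cheap restaurant",
        "intent": "find_restaurant",
        "slots": {"Price-Range": "cheap"},
    }
    print(unify_turn("dataset_a", raw_turn))
```

Keeping the per-dataset mappings in a plain lookup table means that adding another corpus only requires a new mapping entry, while unmapped labels stay visible as "unknown" for later manual review.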
Results Summary
- A corpus of 37,173 annotated dialogues with unified and enhanced annotations was built from existing open dialogue resources.
- Code and trained models (GPT-2, MarCo) for dialogue response generation on the above corpus were produced.
- Ongoing collaboration between LISN (Paris-Saclay University) and the Faculty of Mathematics and Physics (Charles University, Prague).
Tangible Outcomes
- Léon-Paul Schaub, Vojtěch Hudeček, Daniel Stancl, Ondřej Dušek, and Patrick Paroubek. 2021. Defining and detecting inconsistent system behavior in task-oriented dialogues. In Traitement Automatique des Langues Naturelles, pages 142–152. ATALA. https://hal.science/TALN-RECITAL2021/hal-03265892 https://aclanthology.org/2021.jeptalnrecital-taln.13/
- Vojtěch Hudeček, Léon-Paul Schaub, Daniel Stancl, Patrick Paroubek, and Ondřej Dušek. 2022. DIASER: A Unifying View On Task-oriented Dialogue Annotation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1286–1296, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.137/
- Dataset: DIASER corpus – Ondrej Dusek: A corpus of 37,173 annotated dialogues with unified and enhanced annotations built from existing open dialogue resources. https://gitlab.com/ufal/dsg/diaser
- Video presentation summarizing the project