Current speech translation data sets contain pre-segmented speech audio, post-processed transcripts, and reference translations. Such data do not make it possible to identify the error contributions of individual components in the speech translation pipeline, and they often lack the detail needed to pinpoint the major error contributors. There is also a lack of data for evaluating speech translation of web-based meetings (Zoom, Microsoft Teams, etc.), which have become vital for remote work and remote international cooperation.

To address the gaps, we propose:

1. to collect a multi-speaker data set covering various domains and languages.

2. to create multi-layer annotations of the data that allow evaluating individual components of pipeline-based (and also end-to-end) speech translation systems and measuring each component's contribution to the total error.
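One way such per-component contributions could be measured is oracle substitution: the output of each pipeline stage is replaced, in order, by the corresponding gold annotation layer, and the drop in the final error is attributed to that stage. The sketch below only illustrates the bookkeeping. The function name, the component names, and the error values are hypothetical placeholders, not results or methods from this project.

```python
from typing import Dict, List, Tuple


def component_contributions(
    total_error: float,
    oracle_errors: List[Tuple[str, float]],
    final_component: str,
) -> Dict[str, float]:
    """Attribute the end-to-end error to pipeline components.

    total_error: error of the fully automatic pipeline.
    oracle_errors: ordered (component, error) pairs, where error is the
        end-to-end error once this component and all earlier ones have
        been replaced by gold annotations.
    final_component: last stage, which receives the residual error.
    """
    contributions: Dict[str, float] = {}
    previous = total_error
    for name, error in oracle_errors:
        # Error removed by making this stage perfect.
        contributions[name] = previous - error
        previous = error
    # Whatever remains with all earlier stages gold belongs to the last stage.
    contributions[final_component] = previous
    return contributions


# Placeholder numbers, for illustration only.
example = component_contributions(
    total_error=0.40,
    oracle_errors=[("segmentation", 0.34), ("transcription", 0.22)],
    final_component="translation",
)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

The contributions add up to the total error by construction. The attribution depends on the order in which stages are replaced, so it is best read as one possible decomposition rather than a unique one.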

The data set will feature audio in multiple languages, including Latvian and Lithuanian, for which no speech translation evaluation sets have ever been created.


Data set for the evaluation of speech translation systems, consisting of audio data (in at least 3 languages).

Multi-layer annotations of the data (with at least the following annotation layers – speaker segmentation, sentence segmentation, orthographic transcriptions, normalised transcriptions, translations), and documentation of the data.
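The annotation layers listed above could be represented along the following lines. This is a minimal sketch under assumed names: the classes `Sentence` and `SpeakerTurn`, the helper `layer`, the field names, and the example utterance are all hypothetical and do not describe the project's actual annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sentence:
    """One sentence segment with its text layers (times in seconds)."""
    start: float
    end: float
    orthographic: str            # transcription as written, with punctuation
    normalised: str              # normalised form, e.g. numbers spelled out
    translations: Dict[str, str] = field(default_factory=dict)  # by language code


@dataclass
class SpeakerTurn:
    """One speaker segment containing one or more sentence segments."""
    speaker: str
    start: float
    end: float
    sentences: List[Sentence] = field(default_factory=list)


def layer(turns: List[SpeakerTurn], name: str) -> List[str]:
    """Extract a single text layer across all turns, in document order."""
    return [getattr(s, name) for t in turns for s in t.sentences]


# Invented example content, for illustration only.
turns = [
    SpeakerTurn(
        speaker="A",
        start=0.0,
        end=2.5,
        sentences=[
            Sentence(
                start=0.0,
                end=2.5,
                orthographic="We meet at 10.",
                normalised="we meet at ten",
                translations={"de": "Wir treffen uns um 10."},
            )
        ],
    )
]
print(layer(turns, "normalised"))
```

Keeping each layer separately addressable means a component can be scored against the layer it is meant to produce, for example speech recognition output against the normalised transcriptions and translation output against the translations.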

Project Partners:

  • Charles University Prague, Ondrej Bojar

Primary Contact: Aivars Bērziņš, Tilde