Critical review and meta-analysis of existing speech datasets for affective computing from the perspective of inclusivity, transparency, and fair use

As AI-powered devices, software solutions, and other products become prevalent in everyday life, there is an urgent need to prevent the creation or perpetuation of stereotypes and biases related to gender, age, race, and other social characteristics at risk of discrimination.

There are well-documented limitations in our practices for collecting, maintaining, and distributing the datasets used in current ML models. Moreover, these AI/ML systems, their underlying datasets, and the stakeholders involved in their creation often do not reflect the diversity of human societies, further exacerbating structural and systemic biases. It is therefore critical for the AI community to address this lack of diversity, acknowledge its impact on technology development, and seek solutions that ensure diversity and inclusion.

Audio is a natural means of human communication and allows the expression of a wide range of information. Its analysis through AI applications can provide insights into the emotions and inner state of the speaker, information that cannot be captured by analyzing text alone. The analysis of the speech component is therefore valuable in any AI application that requires an understanding of human users beyond their explicit textual expressions, as in the research area of affective computing.

Affective computing refers to the study and development of systems and devices that can recognize, interpret, and simulate human emotions and related affective phenomena. Most currently available speech datasets suffer from significant limitations, such as a lack of diversity in the speaker population, which can compromise the accuracy and inclusivity of speech recognition systems for speakers with different accents, dialects, or speech patterns.

Other limitations include narrow context and small scale of recordings, data quality issues, limited representation, and limited availability of data. These issues must be carefully addressed when selecting and using speech datasets in an affective computing context, to ensure that speech recognition systems can effectively contribute to applications such as intelligent virtual assistants, mental health diagnosis, and emotion recognition in diverse populations.

In this MP, we aim to contribute to the creation of future datasets and to facilitate a more aware use of existing ones. We propose to perform an extensive review of the literature on the topic, in particular of existing speech datasets, with two main objectives.

First, we want to identify the key characteristics required for the creation of unbiased and inclusive speech datasets, and how such characteristics should be pursued and validated.

Second, we want to perform a meta-analysis of the domain, focusing on the underlying limitations of existing datasets. We want to provide a critical evaluation not only of the datasets themselves, but also of the scientific articles in which they were presented. Such a fine-grained analysis will allow us to derive a more general, coarse-grained evaluation of the domain.
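As a purely illustrative sketch (not part of the proposed methodology, and using hypothetical criteria names and records), the step from fine-grained, per-dataset evaluations to a coarse-grained view of the domain could be as simple as computing, for each evaluation criterion, the share of reviewed datasets that satisfy it:

```python
from collections import Counter

# Hypothetical per-dataset evaluation records; the criteria below are
# placeholders, not the criteria the project will actually define.
evaluations = [
    {"dataset": "A", "reports_speaker_demographics": True,  "balanced_gender": False, "documents_consent": True},
    {"dataset": "B", "reports_speaker_demographics": False, "balanced_gender": False, "documents_consent": True},
    {"dataset": "C", "reports_speaker_demographics": True,  "balanced_gender": True,  "documents_consent": False},
]

# Coarse-grained view: fraction of reviewed datasets satisfying each criterion.
criteria = [key for key in evaluations[0] if key != "dataset"]
counts = Counter()
for record in evaluations:
    counts.update(criterion for criterion in criteria if record[criterion])

for criterion in criteria:
    share = counts[criterion] / len(evaluations)
    print(f"{criterion}: {share:.0%} of reviewed datasets")
```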

This MP would naturally fit the topic “ELS evaluation projects (WP5)”. Our purpose is to evaluate existing speech datasets used for the development of AI solutions according to ethical and societal principles, and to formalize best practices into a guidelines document.

The project will be divided into three phases: an in-depth analysis of the domain and of existing methodologies; the discussion and formulation of the criteria to be evaluated in our analysis; and a systematic literature review to perform the meta-analysis. Given the partners' complementary backgrounds and expertise on key aspects of this project, we plan to meet in person after each phase to discuss the results and refine the methodology for the following phase.

Output

We are planning to deliver two resources to the community:

1) A document identifying and discussing requirements, desiderata, and best practices covering the main aspects of the creation and validation of a speech dataset, such as how to ensure that data collection is inclusive, that pre-processing and the experimental setting do not introduce harmful biases, and that the presentation of the work in a scientific publication includes all relevant information.

2) A scientific report resulting from our meta-analysis. Depending on its outcome, we will select an appropriate publication venue (conference or journal).

Project Partners

  • University of Bologna, Andrea Galassi
  • Uppsala University, Ana Tanevska

Primary Contact

Andrea Galassi, University of Bologna