Transformers and self-attention (Vaswani et al., 2017) have become the dominant approach for natural language processing (NLP), with systems such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) rapidly displacing more established RNN and CNN structures with an architecture composed of stacked encoder and decoder modules built on self-attention.

This micro-project will assess tools and data sets for experiments and for a first demonstration of the potential of transformers for multimodal perception and multimodal interaction. We explore research challenges, benchmark data sets, and performance metrics for multimodal perception and modeling tasks such as (1) audio-visual narration of scenes, actions, and activities, (2) audio-video recordings of lectures and TV programs, and (3) perception and evocation of engagement, attention, and emotion.

(The full description and bibliography exceed 200 words and are available on request.)


Project Partners:

  • Institut national de recherche en sciences et technologies du numérique (INRIA), James Crowley
  • Eötvös Loránd University (ELTE), Andras Lorincz
  • Université Grenoble Alpes (UGA), Fabien Ringeval
  • Centre national de la recherche scientifique (CNRS), François Yvon
  • Institut “Jožef Stefan” (JSI), Marko Grobelnik

Primary Contact: James Crowley, INRIA

Main results of micro-project:

This micro-project has explored the potential of transformers for multimodal perception and interaction to support Humane AI, providing:
1) A tutorial on the use of transformers for multimodal interaction.
2) A report on available tools for experiments.
3) A survey of data sets and research challenges for experiments.
These results have opened a new approach to building practical tools for interaction and collaboration between people and intelligent systems.

Contribution to the objectives of HumaneAI-net WPs

This micro-project has promoted the use of transformers and self-attention for multimodal interaction by Humane AI Net researchers by identifying relevant tools and benchmark data sets, providing tutorials and training materials for education, and identifying research challenges for multimodal perception and interaction with transformers.

Tangible outputs