Transformers and self-attention (Vaswani et al., 2017) have become the dominant approach for natural language processing (NLP). Systems such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) are rapidly displacing more established RNN and CNN structures with an architecture composed of stacked encoder and decoder modules using self-attention.
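As a rough illustration of the self-attention operation mentioned above, the following is a minimal sketch of single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined by Vaswani et al. (2017). It is not the project's implementation: the function names, the toy input and the identity projection matrices are assumptions chosen only to keep the example small. The input rows stand in for token embeddings, which in a multimodal setting could equally be audio frames or image patches.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: sequence of embeddings (one row per token).
    w_q, w_k, w_v: projection matrices (hypothetical, for illustration).
    Returns the attended outputs and the attention weights.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k), then normalise each row into a distribution.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: three "tokens" with two features each (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` is a probability distribution over the input positions, so every output row is a weighted average of the value vectors. Real systems learn the projection matrices and run several such heads in parallel inside each stacked module.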

This micro-project will provide tools and data sets for experiments, along with a first demonstration of the potential of transformers for multimodal perception and multimodal interaction. We will define research challenges, benchmark data sets, and performance metrics for multimodal perception and interaction tasks such as (1) audio-visual narration of scenes, cooking actions, and activities, (2) audio-video recordings of lectures and TV programs, (3) audio-visual deictic (pointing) gestures, and (4) perception and evocation of engagement, attention, and emotion.

(The full description and bibliography cover 200 words and are available on request.)


Benchmark data and performance targets for a phased set of research challenges of increasing difficulty.

Tools for experiments that explore the use of embeddings, encoder-decoders, and self-attention architectures, as well as related problems that arise when applying transformers to different modalities.

Concept demonstrations for simple examples of multimodal perception.

Project Partners:

  • Institut national de recherche en sciences et technologies du numérique (INRIA), James Crowley
  • Eötvös Loránd University (ELTE), Andras Lorincz
  • Université Grenoble Alpes (UGA), Fabien Ringeval
  • Centre national de la recherche scientifique (CNRS), François Yvon
  • Institut “Jožef Stefan” (JSI), Marko Grobelnik

Primary Contact: James Crowley, INRIA