Transformers and self-attention (Vaswani et al., 2017) have become the dominant approach to natural language processing (NLP), with systems such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020) rapidly displacing more established RNN and CNN structures in favor of an architecture composed of stacked encoder and decoder modules built on self-attention.
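
For readers unfamiliar with the mechanism named above, the following is a minimal, illustrative sketch of scaled dot-product self-attention in PyTorch. The function name, tensor shapes, and random projection matrices are assumptions chosen purely for illustration and are not part of the micro-project's deliverables.

    # Minimal sketch of scaled dot-product self-attention (Vaswani et al., 2017).
    # Shapes and names are illustrative assumptions, not project code.
    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v               # queries, keys, values
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
        weights = F.softmax(scores, dim=-1)                # attention over positions
        return weights @ v                                 # weighted sum of values

    # Toy usage: 5 "tokens" with model dimension 8
    d_model, d_k = 8, 8
    x = torch.randn(5, d_model)
    w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)                 # shape: (5, 8)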

This micro-project will provide tools and data sets for experiments and an initial demonstration of the potential of transformers for multimodal perception and multimodal interaction. We will define research challenges, benchmark data sets and performance metrics for multimodal perception and interaction tasks such as (1) audio-visual narration of scenes, cooking actions and activities, (2) audio-video recordings of lectures and TV programs, (3) audio-visual deictic (pointing) gestures, and (4) perception and evocation of engagement, attention, and emotion.

(The full description and bibliography cover 200 words and are available on request.)

Output

Benchmark data and performance targets for a phased set of research challenges of increasing difficulty.

Tools for experiments to explore the use of embeddings, encoder-decoders, self-attention architectures and related problems associated with applying transformers to different modalities (a minimal illustration of such a multimodal architecture appears after this list).

Concept demonstrations for simple examples of multimodal perception.

Presentations
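
To make the intended experiments more concrete, the following is a hypothetical sketch of how per-modality features could be projected to a common dimension and fused with a standard PyTorch transformer encoder. All feature sizes, layer counts, and variable names are illustrative assumptions rather than project specifications.

    # Illustrative sketch (assumed, not from the project) of fusing two modalities
    # by projecting their embeddings to a shared dimension and encoding them jointly.
    import torch
    import torch.nn as nn

    d_model = 64
    audio_proj = nn.Linear(128, d_model)   # hypothetical audio feature size 128
    video_proj = nn.Linear(512, d_model)   # hypothetical video feature size 512
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=2,
    )

    audio = torch.randn(1, 20, 128)        # (batch, audio frames, features)
    video = torch.randn(1, 8, 512)         # (batch, video frames, features)
    tokens = torch.cat([audio_proj(audio), video_proj(video)], dim=1)
    fused = encoder(tokens)                # joint audio-visual representation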

Project Partners:

  • Institut national de recherche en sciences et technologies du numérique (INRIA), James Crowley
  • Eötvös Loránd University (ELTE), Andras Lorincz
  • Université Grenoble Alpes (UGA), Fabien Ringeval
  • Centre national de la recherche scientifique (CNRS), François Yvon
  • Institut “Jožef Stefan” (JSI), Marko Grobelnik

Primary Contact: James Crowley, INRIA

Main results of the micro-project:

This micro-project will survey tools and data sets for experiments demonstrating the potential of transformers for multimodal perception and multimodal interaction. We will define research challenges and performance metrics for multimodal perception and interaction tasks such as audio-visual narration of scenes, cooking actions and activities, audio-visual deictic (pointing) gestures, and perception and evocation of engagement, attention, and emotion. We will provide tutorials on the use of transformers for multimodal perception and interaction.

Contribution to the objectives of HumaneAI-net WPs

This micro-project will aid and encourage the use of transformers and self-attention for multimodal interaction by HumaneAI-net researchers, by identifying relevant tools and benchmark data sets, by providing tutorials and training materials for education, and by identifying research challenges for multimodal perception and interaction with transformers.

Tangible outputs

  • Dataset: A survey of tools and data sets for multimodal perception with transformers – James Crowley
  • Other: A tutorial on the use of transformers for multimodal perception – François Yvon
  • Other: Research challenges for the use of transformers for multimodal perception and interaction – James Crowley