Contact person: James Crowley (James@crowley-coutaz.fr)
Internal Partners:
- Eotvos Lorand University – ELTE, Andras Lorincz
- Univ Grenoble Alpes, Dominique Vaufreydaz, Fabien Ringeval
- Univ Paris Saclay, Camille Guinaudeau, Marc Evrard
- Jozef Stefan Institute – JSI, Marko Grobelnik
- Charles University, Pavel Pecina
Transformers and self-attention (Vaswani et al., 2017) have become the dominant approach for natural language processing (NLP), with systems such as BERT (Devlin et al., 2018) and GPT-3 (Brown et al., 2020) rapidly displacing the more established RNN and CNN structures with an architecture composed of stacked encoder-decoder modules using self-attention. This micro-project provides tools and data sets for experiments and an initial demonstration of the potential of transformers for multimodal perception and multimodal interaction. We define research challenges, benchmark data sets, and performance metrics for multimodal perception and interaction tasks such as (1) audio-visual narration of scenes, cooking actions, and activities, (2) audio-video recordings of lectures and TV programs, (3) audio-visual deictic (pointing) gestures, and (4) perception and evocation of engagement, attention, and emotion.
1) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762
2) Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
3) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., and Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Results Summary
In this project, we explore the potential of Transformer-based models in two significant domains: unsupervised object discovery and multimodal emotion recognition using physiological signals. First, we demonstrate a novel approach for unsupervised object discovery that leverages self-supervised learning with a self-distillation loss (DINO). Our method treats visual tokens as nodes in a weighted graph, where edge weights reflect connectivity scores based on token similarity. By formulating the problem as a normalized graph-cut and solving it through spectral clustering with a generalized eigen-decomposition, we isolate foreground objects. This approach effectively segments self-similar regions, with the second-smallest eigenvector of the decomposition providing the cut that indicates which tokens belong to the foreground object. This technique not only simplifies the object discovery process but also achieves substantial performance improvements over current state-of-the-art methods such as LOST, outperforming it by 6.9%, 8.1%, and 8.1% on the VOC07, VOC12, and COCO20K benchmarks, respectively. Furthermore, integrating a second-stage class-agnostic detector (CAD) enhances these results, and the method's adaptability is demonstrated by its application to unsupervised saliency detection and weakly supervised object detection, achieving notable IoU improvements on the ECSSD, DUTS, and DUT-OMRON datasets.
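To make the graph-cut step concrete, the snippet below is a minimal sketch, assuming DINO patch features have already been extracted. The similarity threshold, the small epsilon used to keep the graph connected, and the foreground-selection heuristic are illustrative assumptions rather than the published implementation.

```python
# Minimal sketch of normalized-cut object discovery over self-supervised ViT tokens.
import numpy as np
from scipy.linalg import eigh

def ncut_foreground_mask(features: np.ndarray, tau: float = 0.2, eps: float = 1e-5):
    """features: (N, D) array of patch/token embeddings from a self-supervised ViT (e.g. DINO)."""
    # Edge weights = cosine similarity between tokens, binarized with a threshold;
    # a small epsilon keeps the graph connected.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    W = np.where(sim > tau, 1.0, eps)

    # Normalized cut: solve the generalized eigenproblem (D - W) y = lambda * D y.
    D = np.diag(W.sum(axis=1))
    eigvals, eigvecs = eigh(D - W, D)

    # The second-smallest eigenvector gives the bipartition; thresholding it at its
    # mean splits tokens into two groups.
    y = eigvecs[:, 1]
    mask = y > y.mean()
    # Illustrative heuristic (an assumption): take the smaller group as the foreground object.
    if mask.sum() > (~mask).sum():
        mask = ~mask
    return mask
```

In practice, the resulting token mask is reshaped to the ViT patch grid and upsampled to image resolution to obtain the object segment.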
In parallel, we address the challenge of multimodal emotion recognition from physiological signals using Transformer-based models. Recognizing the advantages of attention mechanisms in Transformers for creating contextualized representations, we propose a model for processing electrocardiogram (ECG) data to predict emotions. This model highlights significant segments of the signal, ensuring that relevant information is given priority. Due to the limited size of datasets with emotional labels, we adopt a self-supervised learning approach. We pre-train our model using unlabelled ECG datasets to build robust representations and then fine-tune it on the AMIGOS dataset for emotion recognition. Our findings confirm that this approach achieves state-of-the-art results in emotion recognition tasks involving ECG signals. Additionally, the success of this strategy underscores the broader potential of Transformers and pre-training techniques for analyzing time-series data in emotion recognition tasks.
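As an illustration of this pipeline, the sketch below shows a Transformer encoder over fixed-length ECG patches with two heads: a reconstruction head for self-supervised pre-training and a classification head for fine-tuning on emotion labels. The patch length, model width, and masked-reconstruction objective are assumptions made for illustration, not the exact published architecture.

```python
# Minimal PyTorch sketch of a Transformer encoder for ECG emotion recognition.
import torch
import torch.nn as nn

class ECGTransformer(nn.Module):
    def __init__(self, patch_len=64, d_model=128, n_heads=4, n_layers=4, n_classes=2):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)                      # project raw ECG patches to tokens
        self.pos = nn.Parameter(torch.randn(1, 512, d_model) * 0.02)   # learned positions (up to 512 patches)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.recon_head = nn.Linear(d_model, patch_len)                 # self-supervised pre-training head
        self.cls_head = nn.Linear(d_model, n_classes)                   # emotion classification head

    def forward(self, ecg, pretrain=False):
        # ecg: (batch, T) raw signal, with T divisible by patch_len
        b, t = ecg.shape
        patches = ecg.view(b, t // self.patch_len, self.patch_len)
        tokens = self.embed(patches) + self.pos[:, : t // self.patch_len]
        h = self.encoder(tokens)                                        # contextualized patch representations
        if pretrain:
            return self.recon_head(h)                                   # reconstruct patches (masking applied by the training loop)
        return self.cls_head(h.mean(dim=1))                             # pool over time, predict emotion class
```

Under this sketch, pre-training would minimize a reconstruction loss (e.g., MSE) on unlabelled ECG recordings with part of the patches masked, after which the encoder is retained and the classification head is trained with cross-entropy on the AMIGOS labels.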
Overall, the outcomes of our project demonstrate that Transformer-based models, coupled with self-supervised learning, can significantly enhance the performance of both unsupervised object discovery and emotion recognition from physiological signals. These methods provide robust solutions for complex visual and temporal signal analysis tasks, marking a substantial step forward in computer vision and affective computing.
Tangible Outcomes
- Y. Wang, X. Shen, S. Hu, Y. Yuan, J. L. Crowley, D. Vaufreydaz, "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut," IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), pp. 14543-14553, New Orleans, Jun 2022. https://arxiv.org/abs/2202.11539
- J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, J. L. Crowley, "Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals," 10th International Conference on Affective Computing and Intelligent Interaction (ACII), Oct 2022. https://ieeexplore.ieee.org/document/9953852
- J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, J. L. Crowley, "Transformer-Based Self-Supervised Learning for Emotion Recognition," 26th International Conference on Pattern Recognition (ICPR 2022), Montreal, Canada, Aug 2022. https://arxiv.org/abs/2204.05103
- A survey of tools and datasets for multimodal perception with transformers (http://crowley-coutaz.fr/jlc/HumanE-AI-Net/TransfomerMicroProject/TransformerTools.pdf)
- A tutorial on the use of transformers for multimodal perception (http://crowley-coutaz.fr/jlc/Courses/ACAI2021/Multimodal-Transformer-Tutorial.html)
- Report on challenges for the use of transformers for multimodal perception and interaction (http://crowley-coutaz.fr/jlc/HumanE-AI-Net/TransfomerMicroProject/ReseachChallengesDataSets.pdf)