Contact person: Carmela Comito, CNR (carmela.comito@icar.cnr.it)
Internal Partners:
- Consiglio Nazionale delle Ricerche (CNR), Carmela Comito, carmela.comito@icar.cnr.it
- Umeå University (UMU), Nina Khairova, nina.khairova@umu.se
- Università di Bologna (UNIBO), Andrea Galassi, p.torroni@unibo.it
- TILDE
In this project, we work with a Ukrainian academic refugee to combine methods for semantic text similarity with expert human knowledge in a participatory way. The goal is to develop a training corpus of news articles containing information on extremism and terrorism.
Results Summary
1) Collection and curation of two event-based datasets of news about the Russian-Ukrainian war.
The datasets support the analysis of information alteration among news outlets (agencies and media), with a particular focus on Russian, Ukrainian, Western (EU and USA), and international news sources, over the period from February to September 2022. We manually selected a set of critical events of the Russian-Ukrainian war. Then, for each event, we created a short list of language-specific keywords in Ukrainian, Russian, and English. Finally, in addition to scraping the selected sources, we gathered articles using an external news intelligence platform, Event Registry, which keeps track of world events and analyzes media in real time. Using this platform, we were able to collect more articles from a larger number of news outlets and to expand the dataset with two distinct article sets. The final version of the RUWA Dataset is thus composed of two distinct partitions.
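The keyword step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the event name, the keywords, the language codes, and the `match_events` helper are hypothetical placeholders, not the actual lists or code used to build the RUWA Dataset.

```python
# Illustrative sketch only: the event name and keywords below are placeholders,
# not the lists used for the RUWA Dataset.
EVENT_KEYWORDS = {
    "example_event": {
        "en": ["example keyword"],
        "uk": ["приклад"],
        "ru": ["пример"],
    },
}


def match_events(text, language, event_keywords=EVENT_KEYWORDS):
    """Return the events whose keywords for `language` occur in `text`."""
    lowered = text.casefold()  # case-insensitive matching, also for Cyrillic
    matched = []
    for event, keywords_by_language in event_keywords.items():
        keywords = keywords_by_language.get(language, [])
        if any(keyword.casefold() in lowered for keyword in keywords):
            matched.append(event)
    return matched
```

Plain substring matching is the simplest possible choice here. A real pipeline would probably also need tokenization or lemmatization, since Ukrainian and Russian words inflect heavily.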
2) Development of an unsupervised methodology to establish whether news from the various parties is similar enough to say that the reports reflect each other or, instead, diverge completely, in which case one of them is likely not trustworthy. We focused on textual and semantic similarity (sentence embedding methods such as Sentence-BERT), comparing news articles to assess whether they have a similar meaning. Another contribution of the proposed methodology is a comparative analysis of the different media sources in terms of sentiments and emotions, extracting subjective points of view as they are reported in the texts by combining a variety of NLP-based AI techniques and sentence embedding techniques. Finally, we applied NLP techniques to detect propaganda in news articles, relying on self-supervised NLP systems such as RoBERTa and suitable existing propaganda datasets.
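As a rough illustration of the similarity comparison, the sketch below computes the mean pairwise cosine similarity between the articles of each pair of sources for one event. It assumes the article embeddings have already been computed (for example with a Sentence-BERT model). The function names, the input layout, and the choice of mean pairwise cosine similarity as the aggregate are assumptions made here for illustration, not the project's confirmed procedure.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cross_source_similarity(embeddings_by_source):
    """Map each pair of sources to the mean cosine similarity of their articles.

    `embeddings_by_source` maps a source label to a list of embedding vectors,
    one per article about the same event (hypothetical input layout).
    """
    result = {}
    for source_a, source_b in combinations(sorted(embeddings_by_source), 2):
        similarities = [
            cosine_similarity(u, v)
            for u in embeddings_by_source[source_a]
            for v in embeddings_by_source[source_b]
        ]
        if similarities:  # skip pairs where one source has no articles
            result[(source_a, source_b)] = sum(similarities) / len(similarities)
    return result
```

With real data, each vector would be the embedding of one article, and a low score for a pair of sources would indicate divergent coverage of the event.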
3) Preliminary qualitative results:
When the events concern civilians, all sources are very dissimilar from one another, although Ukrainian and Western sources are more similar to each other. When the events concern military targets, Russian and Ukrainian sources are very dissimilar from the other sources, and there is more propaganda in the Ukrainian and Russian ones.
Tangible Outcomes
- Github repository of datasets and software: https://github.com/fablos/ruwa-dataset