Many industrial NLP applications emphasise the processing and detection of nouns, especially proper nouns (Named Entity Recognition, NER). The processing of verbs, however, has been neglected in recent years, even though it is crucial for the development of full NLU systems, e.g., for detecting intents in spoken-language utterances or events in written news articles. The META-O-NLU microproject focuses on proving the feasibility of a multilingual event-type ontology based on classes of synonymous verb senses, complemented with semantic roles and links to existing semantic lexicons. Such an ontology should be usable for content- and knowledge-based annotation, which in turn should allow NLU parsers/analyzers to be developed. The concrete goal is to extend the existing Czech-English SynSemClass lexicon (which has all the necessary features, but only for two languages) with German and Polish, as a first step towards showing that it can be extended to other languages as well.
Output
Joint paper co-authored by the proposers (possibly with external partners)
Extended version of SynSemClass (entries in additional languages)
Presentations
Project Partners:
- Charles University Prague, Jan Hajič
- German Research Centre for Artificial Intelligence (DFKI), Georg Rehm
Primary Contact: Jan Hajič, Univerzita Karlova (Charles University, CUNI)
Main results of micro project:
The main result of the META-O-NLU microproject is the extension of the original SynSemClass dataset with German classes or, more precisely, the addition of German verbs and event descriptors to the existing classes in SynSemClass. The individual verbs have also been linked to existing German lexical resources (GermaNet, E-VALBU and GUP). Adding a third language demonstrated that future extension to other languages is feasible in terms of the annotation rules, the dataset itself, and the creation of a new web browser that shows entries in all languages alongside each other with all the external links. The data is freely available in the LINDAT/CLARIAH-CZ repository (and soon also through the European Language Grid), and a web browser for the resource is now also available.
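As an illustration only, the idea of one class holding verbs from several languages, each with links to external lexical resources, can be sketched in Python as follows. This is a minimal sketch and not the actual SynSemClass data format: the class and field names, the class identifier, the role labels, the example verbs and the link identifiers are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClassMember:
    """One verb (or event descriptor) belonging to a synonym class."""
    lemma: str
    language: str  # language code, e.g. "cs", "en", "de"
    # (resource name, entry id) pairs; the ids used below are placeholders
    external_links: Tuple[Tuple[str, str], ...] = ()


@dataclass
class SynonymClass:
    """A class of synonymous verb senses sharing one set of semantic roles."""
    class_id: str            # hypothetical identifier
    roles: List[str]         # illustrative role labels
    members: List[ClassMember] = field(default_factory=list)

    def add_member(self, member: ClassMember) -> None:
        """Add a verb in any language to this existing class."""
        self.members.append(member)

    def by_language(self) -> Dict[str, List[str]]:
        """Group member lemmas by language, for side-by-side display."""
        grouped: Dict[str, List[str]] = {}
        for member in self.members:
            grouped.setdefault(member.language, []).append(member.lemma)
        return grouped


# A class that starts with Czech and English members ...
example = SynonymClass(class_id="example-class", roles=["Agent", "Theme"])
example.add_member(ClassMember("přinést", "cs"))
example.add_member(ClassMember("bring", "en"))

# ... and is then extended with a German verb linked to an external resource.
example.add_member(
    ClassMember("bringen", "de", external_links=(("GermaNet", "placeholder-id"),))
)

print(example.by_language())
```

In this sketch, adding a language does not change the class or its roles; it only appends members, which mirrors the description above of including German verbs in the existing classes.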
Contribution to the objectives of HumaneAI-net WPs
Task 3.6 focuses on both spoken and written language-based interactions (dialogues, chats), in particular questions of multilinguality that are essential to the European vision of human-centric AI. The results of this microproject contribute especially to the multilingual aspect and are directed towards full NLU (Natural Language Understanding) by describing event types, for which no general ontology exists yet. The resulting resource will be used for both text and dialogue annotation, to allow for evaluation and possibly also training of NLU systems.
Tangible outputs
- Dataset: SynSemClass 3.5 dataset – Jan Hajič, Georg Rehm, Zdeňka Urešová, Karolina Zaczynska, Eva Fučíková
http://hdl.handle.net/11234/1-3750
- Other: SynSemClass 3.5 browser – Jan Hajič, Georg Rehm, Zdeňka Urešová, Karolina Zaczynska, Eva Fučíková
https://lindat.cz/services/SynSemClass35/