Developing user-friendly software for narrative analysis of text data.

In this project we continue the development of the Segram package for Python. The purpose of the package is to provide tools for automated narrative analysis of text data focused on extracting information on basic building blocks of narratives – agents (both active and passive), actions, events, or relations between agents and actions (e.g. determining subjects and objects of actions), as well as descriptions of actors, actions and events. The development process is also naturally paired with conceptual work on representations of narratives.

The package is designed as a graybox model. It is based on an opaque statistical language model providing linguistic annotations, which are subsequently used by transparent deterministic algorithms for discovering narrative elements. Thus, the final output should be easy to interpret and validate by human users, whenever necessary. Moreover, by lifting the analysis from the purely linguistic level to the arguably more intuitive level of narratives, it is hoped that the provided tools will be significantly easier to use and understand for end users, including those without training in linguistics and/or computer science.

The proposed framework is aimed at language understanding and information extraction, as opposed to language generation. Namely, the role of the package is to organize narrative information in convenient data structures allowing effective querying and deriving of various statistical descriptions. Crucially, thanks to its semi-transparent nature, the produced output should be easy to validate for human users. This should facilitate the development of shared representations of narratives (corresponding to the WP1 and WP2 motivated goal: "Establishing Common Ground for Collaboration with AI Systems") that are understandable for both humans and machines and at the same time trustworthy, because they are easy for humans to validate. This is arguably a desirable feature, for instance in comparison to increasingly powerful but hard-to-trust large language models. In particular, the package should be useful for facilitating and informing human-driven analyses of text data.

An alpha version of the package, implementing core functionalities related to grammatical and narrative analysis, is ready. The goal of the present microproject is to improve the package and release a beta version. This will include implementing an easy-to-use interface for end users (operating at the level of narrative concepts) that allows effective querying and analysis of the data produced by Segram, as well as developing comprehensive documentation. Thus, the planned release should be ready for broader adoption across a wide array of use cases and by users with different levels of linguistic/computational expertise.

Output

1. Segram package for Python published officially on the Python Package Index (PyPI, https://pypi.org/). It may also be published on conda-forge, but this is not guaranteed at this stage.

2. Comprehensive package documentation available online at the Read the Docs platform (https://readthedocs.org/).

Project Partners

  • University of Warsaw, Andrzej Nowak

Primary Contact

Szymon Talaga, University of Warsaw

Results Description

The aim of the project is to develop a software package (for Python) providing tools for extracting narrative information that are easy to use and understand, also for researchers not trained in computer science or linguistics. The extracted information covers active and passive actors, the actions they perform, and descriptions of both actors and actions, which together define events. It is organized in rich hierarchical data structures (the data model is implicitly graphical), from which different sorts of descriptive statistics can subsequently be generated depending on particular research questions. Crucially, for this to be practically possible, a legible and efficient framework for querying the produced data is needed.
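
To make the idea of hierarchical narrative data and querying more concrete, the following is a minimal sketch in plain Python. It is not Segram's actual data model or API: the class names (Phrase, Event), the helper events_with_active and the example events are all hypothetical and were written by hand purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Phrase:
    """An actor or an action, together with its descriptions (hypothetical)."""
    text: str
    descriptions: list[str] = field(default_factory=list)


@dataclass
class Event:
    """An action linked to its active and passive actors (hypothetical)."""
    action: Phrase
    active: list[Phrase] = field(default_factory=list)
    passive: list[Phrase] = field(default_factory=list)


# Hand-written example data; a real analysis would derive it from text.
events = [
    Event(Phrase("interviewed"), [Phrase("journalist", ["young"])], [Phrase("mayor")]),
    Event(Phrase("criticized", ["sharply"]), [Phrase("mayor")], [Phrase("council")]),
    Event(Phrase("praised"), [Phrase("journalist")], [Phrase("council")]),
]


def events_with_active(events, actor):
    """Query: return all events in which `actor` is an active participant."""
    return [e for e in events if any(a.text == actor for a in e.active)]


# Descriptive statistic: how often each actor appears in the active role.
active_counts = Counter(a.text for e in events for a in e.active)

print([e.action.text for e in events_with_active(events, "journalist")])
print(active_counts.most_common())
```

Because events, actors and descriptions are ordinary nested objects, both the query and the summary statistic reduce to short, readable expressions. This is the kind of legibility a query framework operating at the level of narrative concepts would aim for.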

The above goal fits into a broader HumanE-AI objective of developing common ground concepts providing better representations shared by humans and machines alike. In particular, the contribution of the project to work on aligning machine analyses with human perspective through the notion of narratives is twofold. Firstly, narrative-oriented tools for automated text analyses can empower human analysts as, arguably, the narrative framework provides a more natural and meaningful context for people without formal training in linguistics and/or computer science for reasoning about textual data. Secondly, the development of the software for narrative analysis is naturally intertwined with conceptual work on the core terms and building blocks of narratives, which can inform subsequent work on more advanced approaches.

Importantly, the software is developed as a graybox model, in which core low-level NLP tasks, such as POS and dependency tagging, are performed by a blackbox statistical model, whose output is then transformed into higher-order grammar and narrative data by a set of transparent deterministic rules. This is to ensure high explainability of the approach, which is crucial for systems in which the machine part is supposed to be a helper of a human analyst instead of an implicit leader.
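
The graybox division of labour can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the Token class, the tag and dependency labels, and the single extraction rule are hypothetical and do not reproduce Segram's rules. The annotations are hard-coded stand-ins for what a statistical parser would produce in the blackbox step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token; the fields mimic typical parser output."""
    idx: int    # position in the sentence
    text: str
    pos: str    # coarse part-of-speech tag
    dep: str    # dependency label relative to the head
    head: int   # index of the syntactic head


# Blackbox step (stand-in): annotations written by hand for this example.
SENTENCE = [
    Token(0, "The", "DET", "det", 1),
    Token(1, "journalist", "NOUN", "nsubj", 2),
    Token(2, "interviewed", "VERB", "ROOT", 2),
    Token(3, "the", "DET", "det", 4),
    Token(4, "mayor", "NOUN", "dobj", 2),
]


def extract_events(tokens):
    """Transparent step: one deterministic rule per verb.

    A verb is treated as an action; its 'nsubj' dependents become
    subjects and its 'dobj'/'obj' dependents become objects.
    """
    events = []
    for verb in (t for t in tokens if t.pos == "VERB"):
        children = [t for t in tokens if t.head == verb.idx and t.idx != verb.idx]
        subjects = [t.text for t in children if t.dep == "nsubj"]
        objects = [t.text for t in children if t.dep in ("dobj", "obj")]
        events.append({"action": verb.text, "subjects": subjects, "objects": objects})
    return events


print(extract_events(SENTENCE))
```

Since the second step is an explicit rule over labelled tokens, a human analyst can trace any extracted event back to the specific annotations that produced it, and any error can be attributed either to the parser output or to the rule itself.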

Currently, the core modules of the package responsible for the grammatical analysis are mostly ready (but several improvements are still planned). This also includes a coreference resolution module. Moreover, the core part of the semantic module, which translates grammatical information into more semantic constructs focused on actors, actions and descriptions, is also ready. What is still missing is an interface exposing methods that give end users easy access to, and analysis of, the rich data produced by the package, as well as a principled and convenient query framework on which the interface should be based. This is the main focus of the ongoing and future work. The second missing part is the documentation, but this part is best finished after the interface is ready.
Thus, even though the package in the current state can seem a little rough from the perspective of an end user, its quality and usefulness will increase steadily as new updates are delivered.

Publications

None

Links to Tangible results

https://github.com/sztal/segram