[TMP-026] Collection of datasets tailored for HumanE-AI multimodal perception and modelling

Contact person: Mireille Hildebrandt (m.hildebrandt@cs.ru.nl)

Internal Partners:

University of Sussex (UOS)
German Research Centre for Artificial Intelligence (DFKI)
Vrije Universiteit Brussel (VUB)

HumanE-AI research needs data to advance. Researchers often struggle to progress for lack of data, yet collecting a rich and accurate dataset is no easy task. We therefore propose to share, through the AI4EU platform, the datasets already collected by different research groups. The datasets are curated to be ready to use for researchers. Possible extensions and variations of these datasets are also generated using artificial techniques and published on the platform. A performance baseline is provided for each dataset in the form of a publication reference, a developed model or written documentation. The relevant legal framework will be investigated with specific attention to privacy and data protection, so as to highlight limitations and challenges for the use and extension of existing datasets, as well as for future multimodal data collection for perception modelling.

Results Summary

There were 2 main outputs:

- Datasets: The partners involved have created, curated and released datasets for Human Activity Recognition (HAR) tasks, in particular the extended dataset OPPORTUNITY++ and the Wearlab BeachVolleyball dataset. Participation in the microproject has offered the chance to take a closer look at the practices, doubts and difficulties emerging within the scientific community involved in the creation, curation and dissemination of training datasets. Considering that one of the goals of the HumanE-AI Net is to connect research with relevant use cases in European society and industry, participation in the microproject has offered the occasion to situate dataset collection, curation and release within the broader context of the AI pipeline.

- A comprehensive report introducing the concept of “Legal Protection Debt”: the Report examines the potential issues that arise within current ML practices and provides an analysis of the relevant normative frameworks that govern such practices. By bridging the gap between practices and legal norms, the Report provides researchers with the tools to assess the risks to fundamental rights and freedoms that may occur when AI research is implemented in real-world situations, and recommends a set of mitigating measures to reduce infringements and prevent violations.

The Report acknowledges that datasets constitute the backbone infrastructure underpinning the development of Machine Learning. The datasets that are created, curated and disseminated by ML practitioners provide the data to train ML models and the benchmarks to test the improvement of such models in performing the tasks for which they are intended.

However, until recently, the practices, processes and interactions that take place further upstream in the ML pipeline, between the collection of data and the use of datasets for training ML models, have tended to fade into the background.

The Report argues that the practices of dataset creation, curation and dissemination play a crucial role in setting the level of legal protection afforded to all the legal subjects located downstream in ML pipelines. Where such practices lack appropriate legal safeguards, a “Legal Protection Debt” can mount up incrementally along the stages of ML pipelines.

In section 1.1., the Report provides a brief overview of how current data science practices depend on and perpetuate an ecosystem characterised by a lack of structural safeguards for the risks posed by data processing. This can lead to the accumulation of “technical debt”. Such debt, in turn, can assume relevance in the perspective of compliance with legal requirements. Taking inspiration from the literature on technical and ethical debt, the Report introduces the concept of Legal Protection Debt. Because of this legal protection debt, data-driven systems implemented at the end of the ML pipeline may lack the safeguards necessary to avoid downstream harm to natural persons.

The Report argues that the emergence of Legal Protection Debt and its accumulation at the end of the ML pipeline can be countered through the adoption of a Legal Protection by Design approach. This implies overcoming a siloed understanding of legal liability that mirrors the modular character of ML pipelines. Addressing legal protection debt requires ML practitioners to adopt a forward-looking perspective. Such a perspective should situate the stage of development in which practitioners are involved in the context of the further stages that take place both upstream and downstream in the pipeline. The consideration of the downstream stages of the ML pipeline shall, as it were, back propagate and inform the choices as to the technical and organisational measures to be taken upstream: upstream design decisions must be based on the anticipation of the downstream uses afforded by datasets and the potential harms that the latter may cause. Translated into a legal perspective, this implies that the actors upstream in the pipeline should take into consideration the legal requirements that apply to the last stages of the pipeline.

The Report illustrates how data protection law lays down a set of legal requirements that overcome modularity and encompass the ML pipeline in its entirety, connecting the actors upstream with those downstream. The GDPR makes controllers responsible for the effects of the processing that they carry out. In section 2, the Report shows how the GDPR provides the tools to mitigate the problem of many hands in ML pipelines. The duties and obligations set by the GDPR require controllers to implement, by design, safeguards that reconcile the need to address downstream harms with the necessity to comply with the standards that govern scientific research. In this perspective, the Report shows that the obligations established by data protection law either instantiate or harden most of the requirements set by the Open Science and Open Data framework, as well as the best practices emerging within the ML community.

In section 2.1. the Report illustrates the core structure of the liability regime to which controllers are subject under the GDPR. This regime hinges upon controllers’ duty to perform a context-dependent judgment. That judgment must inform controllers’ decisions as to the measures to be adopted to ensure compliance with all the obligations established by the GDPR, and it must be based on the consideration of the downstream harms posed by the processing.

In essence, the duty to anticipate and address potential downstream harms requires controllers to adopt a forward-looking approach. In order to ensure compliance with the GDPR, controllers must engage in a dynamic, recursive practice that addresses the requirements of present processing in the light of potential future developments. At the same time, the planning effort required by the GDPR is strictly connected with compliance with obligations set by other normative frameworks. In this sense, compliance with the GDPR and compliance with obligations such as those imposed by the Open Science and Open Data framework go hand in hand. Compliance with the GDPR is a prerequisite for complying with the Open Science and Open Data framework. Simultaneously, the perspective of open access and re-usability of datasets affects the content of the obligations set by the GDPR.

As a result, the consideration of “what happens downstream” – i.e., the potential uses of datasets, potential harms that the latter may cause, further requirements imposed by other normative frameworks – back propagates, determining the requirements that apply upstream.

In section 2.2. we show how compliance with the documentation obligations set by the GDPR can counter the accumulation of a documentation debt and ensure controllers’ compliance with the obligations established by other normative frameworks, such as Open Data and Open Science. The overlap between the documentation requirements established by these different frameworks shows, first, that a serious approach to compliance with the GDPR can provide the safeguards necessary to counter the accumulation of a documentation debt. In this way, compliance with the documentation obligations set by the GDPR can prevent the accumulation of other forms of technical debt and, eventually, of legal protection debt. Second, the convergence between the requirements set by the GDPR and those established by the FAIR principles and the Horizon DMP template shows how fulfilling the documentation obligations established by the GDPR can also facilitate compliance with requirements specific to data processing conducted in the context of scientific research.

A correct framing of the practices of dataset creation, curation and release in the context of research requires an effort towards the integrity of the legal framework as a whole, taking into consideration the relations between Open Data, Open Science and data protection law. It is first important to stress that compliance with data protection law represents a prerequisite for achieving the goals of the Open Data and Open Science framework.

In section 2.3. the Report analyses the requirements that govern the release and downstream (re)use of datasets. Compliance with the requirements set by the GDPR is essential to prevent dataset dissemination from giving rise to the accumulation of legal protection debt along ML pipelines. Based on the assessment of adequacy and effectiveness required for all forms of processing, controllers can consider adopting a range of measures to ensure that data transfers are compliant with the GDPR. Among such measures, the Report examines the use of licenses, the provision of adequate documentation for the released dataset, data access management and traceability measures, including the use of unique identifiers.
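As a rough illustration of how the measures listed above (a license, documentation, access management and a unique identifier) might be recorded together at release time, the following Python sketch may help. It is not taken from the Report: the class name `ReleaseRecord`, its fields and the helper `make_release` are hypothetical, and keeping such a record does not by itself establish compliance with the GDPR.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ReleaseRecord:
    """Hypothetical manifest accompanying one released dataset version."""
    name: str
    license_id: str      # license chosen by the controller (assumed free-text label)
    documentation: str   # pointer to the dataset documentation
    checksum: str        # SHA-256 of the released bytes, for integrity checks
    # Unique identifier generated once per release, for traceability.
    release_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    access_log: list = field(default_factory=list)

    def grant_access(self, recipient: str, purpose: str) -> dict:
        """Record who received the dataset and for what stated purpose."""
        entry = {
            "release_id": self.release_id,
            "recipient": recipient,
            "purpose": purpose,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self.access_log.append(entry)
        return entry


def make_release(name: str, data: bytes, license_id: str,
                 documentation: str) -> ReleaseRecord:
    """Build a release record, deriving the checksum from the released bytes."""
    return ReleaseRecord(
        name=name,
        license_id=license_id,
        documentation=documentation,
        checksum=hashlib.sha256(data).hexdigest(),
    )
```

A controller could call `make_release(...)` once per published version and `grant_access(...)` for each recipient, so that every downstream copy can be traced back to a specific release and stated purpose.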

The Report contains an Annex illustrating the provisions of the GDPR that establish a special regime for processing carried out for scientific research purposes. We highlight how most of the provisions contained in the GDPR are not subject to any derogation or exemption in view of the scientific research purpose of the processing. All in all, the research regime provided by the GDPR covers the application of a limited number of provisions (or parts of provisions). Processing that is unlawful in that it does not comply with the general provisions set by the GDPR cannot enjoy the effects of the derogations provided by the research regime. The derogations allowed under the special research regime concern almost exclusively the GDPR provisions on the rights of data subjects, while no derogation is possible for the general obligations that delineate the responsibility of the controller. The derogations provided under the special research regime allow controllers to modulate their obligations towards data subjects where the processing of personal data is not likely to significantly affect the natural persons who are identified or identifiable through such data. As it were, the decrease in the level of potential harm makes it possible to lessen the safeguards required to ensure the protection of data subjects. Even in such cases, however, no derogation is allowed with respect to requirements other than those concerning the rights of the data subject. This circumstance makes manifest that the system established by the GDPR aims at providing a form of protection that goes beyond the natural persons whose personal data are processed at that time by controllers.

