AITHENA: DATA – Life cycle management and generation

In the context of the AITHENA project, where the goal is to make AI more trustworthy and human-centric, data management and privacy sit at the heart of the issue. Work Package 2 comprises three tasks that together cover the whole data lifecycle, from generation to management.

The first task of the WP focuses on creating representative datasets. The first step was a questionnaire sent to the partners working on the use cases, collecting their requirements and the desired file format for each sensor. After gathering the requirements, some of the partners started data collection campaigns, while others generated synthetic data. The simulated data was produced with the Simcenter Prescan simulation engine, which modelled camera, LiDAR and RADAR sensors in an urban environment. For the recorded data, several partners drove their vehicles on public roads or on test tracks, recording with the use cases in mind.

@ AITHENA project

All of the collected data has been pre-annotated using various AI models, and we’ll now be reviewing the pre-annotations so that everything is labelled correctly.

We are also working on extending existing datasets, such as nuScenes and KITTI, to better suit the purposes of the project. On the one hand, 10 synthetic multimodal sensor corruptions have been applied to the nuScenes dataset so that researchers can test the robustness of their models. On the other hand, the same dataset has been processed with the map and CAN bus expansions to extract scenarios. With the scenarios added, it will be easier to test our models under specific desired circumstances. The extracted scenarios are hard braking, hard acceleration, lane change, cut-in, cut-out, overtaking, pedestrian crossing at a zebra crossing, pedestrian crossing the road, following, and near collision.
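To give a feel for what an image-level corruption looks like, here is a minimal NumPy sketch of one plausible example, additive Gaussian noise on a camera frame. The function name and severity parameter are illustrative assumptions; the project's actual 10 corruption types and their severity levels are not specified here.

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: float = 0.08) -> np.ndarray:
    """Add zero-mean Gaussian noise to an 8-bit RGB camera frame.

    `severity` is the noise standard deviation in normalised [0, 1] space.
    """
    img = image.astype(np.float32) / 255.0
    noisy = img + np.random.normal(0.0, severity, img.shape)
    # Clip back to the valid range and restore the 8-bit encoding.
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

# Corrupt a dummy mid-grey frame in place of a real nuScenes image.
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
corrupted = gaussian_noise(frame)
```

A robustness benchmark would typically apply each corruption at several severity levels and compare model accuracy against the clean baseline.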

For the second task, we worked on data provenance and governance. The main output so far is a data card for the generated data: a document that summarises the essential information about a dataset and acts as its documentation. It reports not only the contents of the data, but also sensitive information, possible ethical concerns or biases, privacy issues, data owners and usage licences. As part of this task, we are currently integrating the FiftyOne dataset tool to analyse and explore our data and automatically report the results to the data cards.
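A minimal data card can be sketched as a small Python structure that serialises to JSON. The field names and example values below are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DataCard:
    """A toy data card: one record per dataset, exportable as JSON."""
    name: str
    description: str
    sensors: list
    owner: str
    licence: str
    ethical_concerns: str = "none identified"
    privacy_notes: str = ""

card = DataCard(
    name="urban-synthetic-v1",                      # hypothetical dataset name
    description="Simulated camera/LiDAR/RADAR drives in an urban environment",
    sensors=["camera", "lidar", "radar"],
    owner="AITHENA WP2",
    licence="project-internal",
    privacy_notes="fully synthetic data, no personal information captured",
)

# The card serialises to JSON for archiving alongside the dataset.
card_json = json.dumps(asdict(card), indent=2)
```

Automated tooling can then fill fields such as class distributions or sample counts directly from the dataset, which is the kind of reporting the FiftyOne integration is meant to provide.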

Finally, for the third task of the WP, we are working on privacy-preserving techniques and anonymisation. As part of the task, we have developed a tool that automatically blurs out licence plates and faces in the data captured by our vehicles. As this is the more ‘traditional’ approach, we are also exploring a different way of anonymising the data. To remove identifiable information while retaining as much useful data as possible for the models to learn from, we are developing a tool that replaces the pedestrians and number plates in the scene with synthetically generated ones. In this approach, we detect each pedestrian and their pose, then generate a fake pedestrian that is rendered on top of them; this removes more of the person’s identifiable information while keeping the image closer to what the vehicle cameras would actually capture.
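The ‘traditional’ step can be sketched in a few lines, assuming detections are already available as bounding boxes. The pixelation approach and interface below are a simplified stand-in, not the project's actual tool:

```python
import numpy as np

def pixelate_region(image: np.ndarray, box: tuple, k: int = 4) -> np.ndarray:
    """Anonymise a bounding box (x0, y0, x1, y1) by block-averaging.

    Every k-by-k block inside the box is replaced with its mean colour,
    destroying fine detail (faces, plate characters) in that region.
    """
    out = image.copy()
    x0, y0, x1, y1 = box
    region = out[y0:y1, x0:x1].astype(np.float32)
    h, w = region.shape[:2]
    for y in range(0, h, k):
        for x in range(0, w, k):
            block = region[y:y + k, x:x + k]
            block[...] = block.mean(axis=(0, 1))
    out[y0:y1, x0:x1] = region.astype(image.dtype)
    return out

# Pixelate a hypothetical detection box inside a dummy 8x8 frame.
img = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)
anon = pixelate_region(img, (2, 2, 6, 6), k=2)
```

In practice a detector (faces, plates) supplies the boxes, and a Gaussian blur is often used instead of pixelation; the privacy effect is the same, removal of identifying detail inside the detected region.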

@ AITHENA project

For the licence plates, we detect them and replace their content with random numbers and letters. We also want to add a check that the random sequence does not match any European number plate format, so that we do not accidentally insert an existing number plate into our anonymised images.
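Such a check could be sketched as rejection sampling: draw random plate text and redraw whenever it looks like a valid real-world plate. The German-style layout, the handful of district codes and the regular expression below are purely illustrative; a production check would consult the full registries and formats of every relevant country.

```python
import random
import re
import string

# Illustrative subset of real German district codes, NOT a complete registry.
KNOWN_DISTRICTS = {"B", "M", "K", "HH", "S", "F"}
# Shape of the plates this sketch generates: 1-3 letters, 2 letters, 3 digits.
PLATE_SHAPE = re.compile(r"^[A-Z]{1,3}-[A-Z]{2}-\d{3}$")

def random_plate() -> str:
    """Generate a German-style plate layout filled with random content."""
    prefix = "".join(random.choices(string.ascii_uppercase, k=random.randint(1, 3)))
    letters = "".join(random.choices(string.ascii_uppercase, k=2))
    digits = "".join(random.choices(string.digits, k=3))
    return f"{prefix}-{letters}-{digits}"

def safe_plate() -> str:
    """Redraw until the prefix matches no known district code, so the
    replacement text cannot collide with a plausibly real plate."""
    while True:
        plate = random_plate()
        if plate.split("-")[0] not in KNOWN_DISTRICTS:
            return plate

replacement = safe_plate()
```

The same rejection loop extends to other countries by adding their formats to the reject list.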

@ AITHENA project