GIVE metadata project
Much of the content in meemoo’s archive is currently not adequately described. The fourth project in GIVE, Gecoördineerd Initiatief voor Vlaamse Erfgoeddigitalisering (Coordinated Initiative for Flemish Heritage Digitisation), is therefore completely dedicated to metadata. In this project, we’re investigating the possibilities of an automatic description process – a crucial step to improve findability and re-use.
At meemoo, we archive a vast number of audio and video files from cultural, media and heritage organisations. The counter is currently at over 6 million items in total, with 2 million items consisting of audiovisual content. Where do all these files come from? We’ve successfully digitised a large proportion of the audiovisual carriers in Flemish cultural archives over recent years, and meemoo’s archive system also accommodates born-digital content.
However, much of this content has been annotated poorly or not at all, which makes it hard to search and discourages its re-use. A file that isn’t described cannot be found, and so cannot be re-used.
The solution is to add and expand metadata, but catching up on all this work manually is a hopeless task: processing metadata by hand simply takes too long. That’s why we’re focusing on an automatic description process using techniques such as artificial intelligence (AI), machine learning and computer vision.
Meemoo is responsible for organising and coordinating the GIVE metadata project. Wherever possible, we’re opting for services and algorithms that have already been developed, and we’re cooperating with external suppliers for the implementation. This means we will train or roll out new models only in a limited number of cases, where no other option is available.
What are we planning?
Given its funding, this project has an impact on all the collections stored by meemoo, except for those from our media partners. In order to add metadata to these collections, we’re launching three activities around metadata creation over the next two years (2022-2023). We’ll be focusing on mature techniques for this and aiming to develop workflows that can continue to be used after the project.
Activity 1: speech recognition
In this first activity, we’re focusing on recognising Dutch speech in some 160,000 audio and video files. This means providing metadata for more than 170,000 hours of content, a staggering amount. The speech in the audio and video files will be converted into searchable text with time stamps using existing and commercially available tooling.
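To illustrate why time-stamped text makes a recording searchable, here is a minimal sketch in Python. It does not perform speech recognition and is not the tooling used in the project: the `Segment` structure, the `search` and `format_timestamp` helpers and the sample sentences are all assumptions made up for this example. It only shows how a search term can be traced back to the exact moment it is spoken.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognised stretch of speech (hypothetical structure)."""
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def search(segments, term):
    """Return (start, end) pairs of segments whose text contains the term."""
    needle = term.casefold()
    return [(s.start, s.end) for s in segments if needle in s.text.casefold()]


def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Invented sample transcript, standing in for speech recognition output.
transcript = [
    Segment(0.0, 4.2, "Goedenavond en welkom bij het journaal"),
    Segment(4.2, 9.8, "Vandaag opent het museum in Gent opnieuw de deuren"),
    Segment(3725.0, 3731.5, "Het museum sluit om zes uur"),
]

hits = search(transcript, "museum")
for start, end in hits:
    print(format_timestamp(start), "-", format_timestamp(end))
```

Because each hit carries its own start and end time, a search result can point straight to the relevant fragment instead of to the file as a whole.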
Activity 2: entity recognition in text
We will then apply named entity recognition (NER) to the texts generated in the speech recognition activity. NER searches these texts for names of people, organisations or locations, for example. Where possible, some of these entities will be linked to existing entries in linked open data sources. The underlying technology used in entity recognition is natural language processing (NLP): software that ‘understands’ written texts.
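The idea of finding entities in a text and linking them to an identifier can be sketched as follows. Note that this example swaps in a plain dictionary (gazetteer) lookup, which is far simpler than the NLP models a real entity recognition service relies on. The `GAZETTEER` table, the `find_entities` function and the `example:` identifiers are hypothetical placeholders, not identifiers from any real linked open data source.

```python
import re

# Hypothetical lookup table: lower-cased name -> (entity type, placeholder id).
GAZETTEER = {
    "gent": ("LOCATION", "example:loc-001"),
    "meemoo": ("ORGANISATION", "example:org-001"),
}


def find_entities(text):
    """Return every word of the text that appears in the gazetteer."""
    entities = []
    for match in re.finditer(r"\w+", text):
        key = match.group().casefold()
        if key in GAZETTEER:
            entity_type, identifier = GAZETTEER[key]
            entities.append({
                "text": match.group(),
                "type": entity_type,
                "id": identifier,
                "offset": match.start(),  # character position in the text
            })
    return entities


found = find_entities("Het museum in Gent werkt samen met meemoo")
for entity in found:
    print(entity["text"], entity["type"], entity["id"])
```

A lookup like this only finds names it already knows and cannot tell apart two people with the same name, which is exactly where language-aware software earns its place.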
Activity 3: face detection and face recognition
We’re using some 120,000 hours of video content in this third and final activity. We want to start by detecting faces without immediately naming them: not every face that appears in a video is one we need to attach a name to, after all. Building on this, we’ll apply face recognition to the detected faces, working with a fixed set of faces that we will link to public figures. Where possible, we will link to existing data sources such as VIAF, Wikidata and ODIS. This activity builds on insights gained in the FAME project.
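The difference between detecting a face and recognising it against a fixed set can be sketched as follows. This is an illustration under stated assumptions, not the method used in the project: it assumes each detected face has already been turned into a numeric vector (an embedding), and it compares that vector with a small reference set using cosine similarity. The three-number vectors, the names in `REFERENCE_SET`, the `identify` function and the 0.9 threshold are all invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical fixed reference set: one toy embedding per public figure.
REFERENCE_SET = {
    "public-figure-A": [1.0, 0.0, 0.0],
    "public-figure-B": [0.0, 1.0, 0.0],
}


def identify(embedding, threshold=0.9):
    """Return the best-matching reference name, or None if nobody is close enough."""
    best_name, best_score = None, threshold
    for name, reference in REFERENCE_SET.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


print(identify([0.98, 0.1, 0.05]))  # close to the first reference vector
print(identify([0.5, 0.5, 0.7]))    # resembles nobody in the set: stays unnamed
```

Faces that match nobody in the reference set return `None` and simply remain detected but unnamed, which mirrors the choice to name only a fixed set of public figures.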
Need for legal framework
We must not lose sight of privacy and a proper legal framework in this process, especially for face detection and recognition. That’s why we already took a first step in 2021 by carrying out a Data Protection Impact Assessment (DPIA).
Ready for re-use
A final, essential step is to make the acquired metadata accessible. The metadata gained in the three activities will be shared and made usable through our content partners’ applications and by meemoo, and we will also store this metadata in our metadata infrastructure. This will make the content more searchable for the general public. Furthermore, we’re getting started with data mining, an automatic analysis technique to extract information and knowledge from metadata.
Meemoo is also contributing to other elements in the Flemish Government’s 'Resilience Recovery Plan', in particular for Flemish heritage databases, supervising cultural organisations in their digital collection registration projects and the digital leap in education.
We’re working with some 120 content partners from the cultural sector in the GIVE metadata project.