GIVE metadata project
Much of the content in meemoo’s archive is currently not adequately described. The fourth project in GIVE, Gecoördineerd Initiatief voor Vlaamse Erfgoeddigitalisering (Coordinated Initiative for Flemish Heritage Digitisation), is therefore completely dedicated to metadata. In this project, we’re investigating the possibilities of an automatic description process – a crucial step to improve findability and re-use.
This project is part of the Flemish Government’s 'Resilience Recovery Plan' and has been made possible thanks to support from the European Regional Development Fund (ERDF) [links in Dutch].
At meemoo, we archive a vast number of audio and video files from cultural, media and heritage organisations. At the end of 2022, the counter stood at over 6.5 million items in total, of which 2 million are audiovisual content. Where do all these files come from? We’ve successfully digitised a large proportion of the audiovisual carriers in Flemish cultural archives over recent years, and meemoo’s archive system also accommodates born-digital content.
However, much of this content has not been annotated properly, or not consistently. It is therefore not easily searchable, which discourages re-use: a file that isn’t described cannot be found, and so cannot be re-used.
The solution is to add and expand metadata, but catching up on all this work manually is a hopeless task: processing metadata by hand takes a long time. That’s why we’re focusing on an automatic description process using techniques such as artificial intelligence (AI), machine learning (link in Dutch) and computer vision.
Meemoo is responsible for organising and coordinating the GIVE metadata project. Wherever possible, we’re opting for services and algorithms that have already been developed, and we’re cooperating with external suppliers for the implementation. This means we will train or roll out new models only in a limited number of cases, where no other option is available.
What are we planning?
Because of how it is funded, this project covers all the collections stored by meemoo except those from our media partners, whose collections will be enriched in the Shared AI project. In order to add metadata to the collections of our culture and government partners, we’re launching three activities around metadata creation over two and a half years (July 2021 to the end of 2023). We are focusing on mature techniques and are developing workflows that can continue to be used after the project.
Activity 1: speech recognition
In this first activity, we’re focusing on recognising the Dutch spoken in some 160,000 audio and video files. This means providing metadata for more than 160,000 hours of content. The speech in the audio and video files will be converted into searchable text with time stamps using existing, commercially available tooling from Speechmatics.
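To illustrate why time-stamped text improves findability, here is a minimal Python sketch. The `Word` structure, the `find_term` function and the sample transcript are assumptions made for illustration only; they do not reflect the actual output format of Speechmatics or meemoo’s implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognised word with its start time in seconds (hypothetical format)."""
    text: str
    start: float


def find_term(transcript: list[Word], term: str) -> list[float]:
    """Return the start time of every occurrence of `term`, ignoring case."""
    needle = term.lower()
    return [
        word.start
        for word in transcript
        # Strip trailing punctuation so "feest." still matches "feest".
        if word.text.lower().strip(".,!?") == needle
    ]


# Invented sample transcript, for illustration only.
transcript = [
    Word("Welkom", 0.0),
    Word("in", 0.6),
    Word("Gent", 0.8),
    Word("vandaag.", 1.3),
    Word("Gent", 5.2),
    Word("viert", 5.7),
    Word("feest.", 6.1),
]

hits = find_term(transcript, "gent")
```

With time stamps attached to each word, a search result can point to the exact moment in a recording rather than only to the file as a whole.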
Activity 2: entity recognition in text
We will then apply named entity recognition (NER) to the texts generated in the speech recognition activity. This is how we search for names of people, organisations or locations, for example. Where possible, some of these entities will be linked to existing entries in linked open data sources. The underlying technology used in entity recognition is natural language processing (NLP): software that ‘understands’ written texts.
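The idea of finding entities in a transcript and linking them to an external record can be sketched as follows. This sketch uses a simple dictionary (gazetteer) lookup as a stand-in for the statistical NER models used in practice; the gazetteer contents, the `tag_entities` function and the `example:` identifiers are hypothetical placeholders, not real Wikidata or VIAF identifiers.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> (entity type, placeholder id).
# The identifiers are invented placeholders, not real linked open data IDs.
GAZETTEER = {
    "gent": ("LOCATION", "example:gent"),
    "meemoo": ("ORGANISATION", "example:meemoo"),
}


def tag_entities(text: str) -> list[dict]:
    """Find gazetteer entries in `text` and report type, link and character offset."""
    entities = []
    for match in re.finditer(r"\w+", text):
        entry = GAZETTEER.get(match.group().lower())
        if entry is not None:
            entity_type, identifier = entry
            entities.append({
                "text": match.group(),
                "type": entity_type,
                "id": identifier,
                "offset": match.start(),
            })
    return entities


entities = tag_entities("Meemoo archiveert beelden uit Gent.")
```

A real NER model also handles names it has never seen and uses context to disambiguate them, which a fixed lookup table cannot do.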
Activity 3: face detection and face recognition
We’re enriching some 120,000 hours of video content in this third and final activity, and want to start by detecting faces without immediately naming them. After all, not every face that appears in a video is one we need to attach a name to. Building on this, we’ll apply face recognition to the detected faces, using a fixed set of faces that we will link to known public figures. Where possible, we will link to existing data sources such as VIAF, Wikidata and ODIS. In this activity, we will build on the insights gained in the FAME project, which allows us to scale up the processing of the video content.
Need for legal and ethical framework
We must not lose sight of privacy and a proper legal and ethical framework in this process, especially for face detection and recognition. That’s why we took the first step back in 2021 by carrying out a Data Protection Impact Assessment (DPIA) [link in Dutch]. In addition, we developed a sound ethical framework together with the Knowledge Centre Data & Society and several stakeholders.
We fully acknowledge that technologies like face recognition need to be handled with care. Meemoo colleagues Bart Magnus and Rutger Goeminne wrote a tech blog about the legal and ethical challenges within the FAME project and the first phase of the GIVE metadata project. Read it here.
Ready for re-use
A final, essential step is to make the acquired metadata accessible. The metadata gained in the three activities will be stored in our metadata infrastructure, and shared and made usable through meemoo’s and our content partners’ applications. This will make the content more searchable for the general public. Furthermore, we’re getting started with data mining, an automatic analysis technique to extract information and knowledge from metadata.
More GIVE projects?
The GIVE metadata project is one of four within the GIVE initiative. In addition to metadata enrichment, the digitisation of newspapers (Primeur), glass plates and Flemish masterpieces is also on the agenda. You can read how we selected these four projects here.
Meemoo is also contributing to other elements in the Flemish Government’s 'Resilience Recovery Plan', in particular for Flemish heritage databases, supervising cultural organisations in their digital collection registration projects and the digital leap in education.
We’re working with some 120 content partners from the cultural sector in the GIVE metadata project.