Metadata roadmap: the route to an improved metadata infrastructure

If you’ve already clicked to open this report, we probably don’t need to convince you of the importance of metadata. You know it’s essential for sustainably storing and searching any archive or collection. It’s the engine that makes information accessible in various ways and allows content partners to curate their own archive content. For meemoo too, metadata is crucial for setting up and running efficient internal processes and services.

The amount of data that we and our content partners feed into our archive system has increased dramatically over recent years. And because more and more content partners are joining us with new (types of) collections, metadata is also continuing to grow and change. Additions to existing metadata, and new insights in general, play a key role here. Consider the new information that will soon be available from AI and machine learning, for example, and you will realise the kind of challenge we are facing.

Moreover, most of our content partners have their own management systems and structures, which they use to organise information about their collections. This means that the model we developed in 2016 and are continuing to expand is no longer as uniform as we’d like. This roadmap will help us take on the less visible and somewhat underexposed role of data integrator – focusing on unifying all the various metadata to make it accessible in an unambiguous way.

Various platforms and applications – belonging to meemoo, suppliers and content partners – organise data based on the functionality they offer to the user. These platforms are often ill-equipped to exchange data on a large scale, however, which results in so-called data silos. This inevitably leads to difficult transfers between different platforms, with complex and error-prone data transformations.

How do we resolve this?

There are several reasons why the complex task of data integration is one of the greatest challenges in various domains. In order to catch up with our historical backlog and prepare for increasing amounts of metadata (such as data streams from AI projects), we are setting out a metadata infrastructure roadmap. It needs to enable the correct management and processing of the ever-expanding volume and variety of metadata.

The roadmap focuses on making the following more sustainable, robust and uniform:

  • metadata storage and organisation: choosing database technology and framework;

  • machine-readable presentation of metadata: designing and using data models and formats;

  • metadata integration with user applications: method for (internal) tools and opening up platforms to search, add, modify and request metadata.

This roadmap translates objectives from our strategic plan into specific, high-level infrastructure goals. We’re using five concrete horizons to transform existing and future collections of metadata into a sustainable and accessible collective memory for meemoo and our partners and users.

Horizon 1: measurable and validated metadata quality

Metadata quality is context-dependent. The aspects of metadata that make it suitable for operating the archive system differ from the requirements for opening up platforms, digitisation and the digital influx of metadata. We are therefore using a context-related definition and classification as our starting point.

A classification groups metadata according to the functionality it enables (e.g. ‘findable in The Archive for Education’ or ‘can be digitised by the digitisation company’). A definition contains the list of measurable parameters that can provide an indication of quality within a class (e.g. ‘the presence of keywords increases findability in The Archive for Education’).

This definition and classification make it possible to set up a metadata quality framework that collects relevant metrics and underlying measurable parameters per category. The scores that this produces tell us how ‘good’ the metadata is for a particular collection. A first example of such a metric is the completeness of a certain field: for which objects is this field filled in, and for which is it empty?
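As an illustration of the completeness metric, the sketch below counts how many records have a non-empty value for a given field. The function name, the record layout and the field names are assumptions made for this example only; they do not describe meemoo's actual quality framework.

```python
from typing import Any


def field_completeness(records: list[dict[str, Any]], field: str) -> float:
    """Return the share of records in which `field` has a non-empty value."""
    if not records:
        return 0.0
    # A field counts as "filled in" when it is present and not an empty value.
    filled = sum(
        1 for record in records if record.get(field) not in (None, "", [], {})
    )
    return filled / len(records)


# Hypothetical sample records; identifiers and field names are illustrative.
records = [
    {"id": "a1", "title": "Het Journaal", "keywords": ["news"]},
    {"id": "a2", "title": "Interview", "keywords": []},
    {"id": "a3", "title": "", "keywords": ["education", "history"]},
    {"id": "a4", "title": "Reportage"},
]

title_score = field_completeness(records, "title")        # 3 of 4 records
keyword_score = field_completeness(records, "keywords")   # 2 of 4 records
```

A score per field and per collection like this could then feed into the broader, context-dependent metrics described above.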

We’re also continuing to roll out validation per application: each process or tool must be able to check whether the metadata it receives meets its specific expectations.
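Validation per application could be sketched as a set of named rules per consuming tool, where each tool only checks what it needs. The application names, rule names and fields below are hypothetical and serve only to illustrate the idea.

```python
from typing import Any, Callable

Rule = Callable[[dict[str, Any]], bool]

# Hypothetical expectations per application; names and rules are illustrative.
RULES: dict[str, dict[str, Rule]] = {
    "education-platform": {
        "has title": lambda r: bool(r.get("title")),
        "has keywords": lambda r: bool(r.get("keywords")),
    },
    "digitisation": {
        "has carrier format": lambda r: bool(r.get("carrier_format")),
    },
}


def validate(record: dict[str, Any], application: str) -> list[str]:
    """Return the names of the rules that `record` fails for `application`."""
    return [name for name, rule in RULES[application].items() if not rule(record)]


record = {"title": "Het Journaal", "carrier_format": "Betacam SP"}
education_failures = validate(record, "education-platform")
digitisation_failures = validate(record, "digitisation")
```

The same record can therefore be valid for one context (digitisation) and incomplete for another (the education platform), which matches the context-dependent view of quality in this horizon.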

Horizon 2: broadly supported metadata model

Metadata isn’t limited to what is attached to a certain object in a system. It’s much broader than this and can support any kind of knowledge. We can store metadata about the length of a video, but also about the organisation that manages it or the people who appear in it.

Figure 1: A modular data model could look like this. (For illustration purposes only – this example does not show any final choices for domains, models or other.)

We want to include this wide range of knowledge as completely as possible in our metadata model. Everything we know at meemoo about archive objects and their contexts must be linked and potentially made explicit and searchable. We’re also making it possible (by providing the facilities, tools and processes) to convert knowledge into metadata, although we are not yet implementing this for all types of information.

We don’t see metadata modelling as just an IT issue, and are introducing this practice across all the various teams. We use the organisation’s sector-specific language for this: no tables, XML, nodes or edges, but an intuitive representation of the concepts, entities and relationships used within the operation.

We use a modular data model to describe this broad knowledge, consisting of:

  • a specific sub-model per relevant domain (e.g. preservation, digitisation, usage) to describe in detail the metadata created in this field;

  • a broad core model with the general concepts that appear in the sub-models;

  • meta models to annotate the accumulated knowledge such as usage rights (who can do what with the metadata or content) or data origin (who has added or modified metadata and when).

Metadata is therefore built up using the concepts from the specific sub-models and the broad core model. The models used by content partners are mapped onto ours for the objects they feed into our archive system from their digital collections; the result can combine concepts from the core model with concepts from one or more domain models. The metadata itself can then be annotated with concepts from the meta models to form a kind of meta-metadata.
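To make the layering more concrete, the following minimal sketch represents a metadata statement that uses a concept from either the core model or a domain sub-model, and carries its own annotations as meta-metadata. All class names, model names, concepts and values are assumptions for illustration; they are not the final model choices.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    model: str  # "core", or a domain sub-model such as "digitisation"
    name: str


@dataclass
class Statement:
    subject: str
    concept: Concept
    value: str
    # Meta-metadata, e.g. data origin or usage rights for this statement.
    annotations: dict[str, str] = field(default_factory=dict)


# Hypothetical concepts from the core model and one domain sub-model.
TITLE = Concept("core", "title")
CARRIER = Concept("digitisation", "carrier_format")

statements = [
    Statement("object-1", TITLE, "Het Journaal", {"added_by": "content-partner-x"}),
    Statement("object-1", CARRIER, "Betacam SP", {"added_by": "digitisation-company-y"}),
]

# Which models contribute to the description of this object?
models_used = sorted({s.concept.model for s in statements})
```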

We use as many existing standards (e.g. Dublin Core, MPEG, PREMIS or OSLO) and authoritative thesauri (e.g. VIAF or ISO 639 language codes) as possible. This simplifies mapping for content partners who are already using these standards or thesauri, and helps them to link more quickly.

We have developed our data models primarily to support our service provision and processes, both for feeding in content and making it accessible. In order to help all parties as much as possible, we’re also looking at how we can regularly align fields with direct implications for our (content) partners – such as descriptive metadata, usage rights and determining the rights status.

Within meemoo, the underlying models also need to be editable and expandable for colleagues with varying levels of technical ability, e.g. using a tool. Focusing on internal procedures and documentation for applying the data models and for creating models, thesauri, lists, etc. ensures that we can maintain these practices within our organisation in the long term, regardless of any change in personnel.

Horizon 3: sustainable data integrations

A specific implementation of the models from Horizon 2, and linking them to the metadata available in the various tools and platforms, results in the construction of a single, collective knowledge graph. A graph is a descriptive model that looks like a widespread network of links that can easily be extended.

We use a graph model (RDF) and associated technology for this: graph databases, ontology and data management tools, mapping tools and integrations with other software libraries. The Knowledge Graph unifies knowledge by making the metadata, thesauri, controlled lists and domain models accessible together in a uniform way. This creates a universal overview of metadata, which is always surrounded by context that provides meaning. It also means we can substantiate our linked data objectives from the core.

Customers can benefit from this by selecting the part of the Knowledge Graph in which they are interested. For example, The Archive for Education could select metadata based on the core model, the ‘Education’ domain model, the LOM thesauri and the licence that permits the content to be used for educational purposes.

The Knowledge Graph therefore has two layers:

  • the modelling layer: the concepts and relationships that exist (e.g. ‘video’ and ‘title’ are concepts, and ‘a video can have a title’ is a relationship) – this is the materialisation of the result from Horizon 2;

  • the instance of the modelling layer: the data models are applied to create metadata, e.g. this video has the title ‘Uitzending Het Journaal 10/10’ (‘Broadcasting The News 10/10’).
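The two layers can be pictured as two sets of subject-predicate-object statements. The sketch below uses plain Python tuples rather than real RDF, and the predicate names (`is_a`, `can_have`) and identifiers are invented for this example; it only illustrates how the instance layer can be checked against the modelling layer.

```python
Triple = tuple[str, str, str]

# Modelling layer: which concepts exist and which properties they may carry.
MODEL: set[Triple] = {
    ("Video", "is_a", "Concept"),
    ("title", "is_a", "Property"),
    ("Video", "can_have", "title"),
}

# Instance layer: the model applied to one (hypothetical) archive object.
INSTANCES: set[Triple] = {
    ("video-123", "is_a", "Video"),
    ("video-123", "title", "Uitzending Het Journaal 10/10"),
}


def allowed_properties(concept: str) -> set[str]:
    """Return the properties the modelling layer allows for `concept`."""
    return {o for s, p, o in MODEL if s == concept and p == "can_have"}


def unmodelled_statements(instances: set[Triple]) -> set[Triple]:
    """Return instance triples whose property is not allowed for the subject's type."""
    types = {s: o for s, p, o in instances if p == "is_a"}
    return {
        (s, p, o)
        for s, p, o in instances
        if p != "is_a" and p not in allowed_properties(types.get(s, ""))
    }


conforming = unmodelled_statements(INSTANCES)
extra = unmodelled_statements(INSTANCES | {("video-123", "duration", "PT30M")})
```

Here `conforming` is empty because every instance statement is covered by the model, while `extra` contains the `duration` statement, since the toy modelling layer does not define that property for a video.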

Horizon 4: uniform access to metadata

It must be possible to access everything that we at meemoo know about archive objects and their contexts in the best possible way via our archive infrastructure. This creates an unambiguous view of the data, makes organisation-wide analyses possible, and provides extensive context to AI-based or other processes that create or enrich metadata.

Persistent identifiers (PIDs) are extended into persistent URIs for this. Provided the necessary access is available, these URIs are retrievable and represent the current knowledge about the entity in a location-independent way. The entire collection of metadata is searchable through a simple visual interface, independently of existing software applications.
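Extending a PID into a persistent URI could be as simple as placing it under a stable namespace that says nothing about where the data is stored. The base URI below is a placeholder, since this report does not specify the actual URI scheme, and the function name is hypothetical.

```python
from urllib.parse import quote

# Hypothetical namespace; the real URI scheme is not specified in this report.
BASE_URI = "https://data.example.org/id/"


def pid_to_uri(pid: str, base_uri: str = BASE_URI) -> str:
    """Build a persistent URI from a PID without encoding any storage location."""
    pid = pid.strip()
    if not pid:
        raise ValueError("PID must not be empty")
    # Percent-encode the PID so that it always forms a single path segment.
    return base_uri + quote(pid, safe="")


uri = pid_to_uri("abc123xyz")
```

Because the URI depends only on the PID and the namespace, the systems behind it can change without the identifier changing.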

Software integrations can also query this metadata collection – provided the necessary access is available or in line with usage rights – using a standard query language (e.g. GraphQL, openCypher or SPARQL), or export it in standardised data formats (e.g. downloading metadata as linked datasets). In order to improve general findability further, relevant metadata is also embedded in public web pages (hetarchief.be, other platforms and websites) in a structured way that is readable by search engines.
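One common way to embed structured metadata in a web page is a JSON-LD script element using the schema.org vocabulary. The report does not say which vocabulary or format will be used, so the choice of schema.org `VideoObject`, the function name and the sample values below are all assumptions for illustration.

```python
import json


def to_jsonld_script(name: str, duration: str, url: str) -> str:
    """Render a minimal schema.org VideoObject as an embeddable JSON-LD script tag."""
    doc = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "duration": duration,  # ISO 8601 duration, e.g. "PT30M"
        "url": url,
    }
    # Escape "</" so the JSON cannot prematurely close the script element.
    payload = json.dumps(doc, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


snippet = to_jsonld_script(
    "Het Journaal",
    "PT30M",
    "https://data.example.org/id/abc123xyz",  # hypothetical persistent URI
)
```

The resulting snippet can be placed in the HTML of a public page, next to the human-readable description of the same object.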

We’re also paying attention to ensuring we have an accessible and validated influx to our metadata collection. We provide links with commonly used standards (e.g. METS or Adlib XML) or working methods that are popular with content partners, and support them with the right software applications. We then build a complete metadata migration process, which migrates the existing metadata from the meemoo archive system (MAM). We also offer a training programme to support content partners in adapting to the new ways of feeding in metadata.

Horizon 5: embedding the archive in a local cultural heritage metadata network

Steadily expanding a network of external data sources that can enrich the metadata is crucial for improving the findability of archive content, opening it up to new domains, and achieving our linked data objectives. The infrastructure described in the previous horizons makes that possible.

How we go about this and help to build a collective linked-data publishing strategy is the final building block on our roadmap, but perhaps also one of the most promising. It enables us to manage the metadata in a decentralised way. This means that metadata no longer needs to be migrated or mapped, but that meemoo can use the metadata directly from the source (e.g. the content partner or external sources). And the reverse is of course also possible: other partners can use the metadata made available by meemoo, provided they have the necessary access rights.

Do you have a question?
Contact Miel Vander Sande
Data Architect