Linked data: recognising entities in your data
In 2018 we started experimenting with linked data in our News of the Great War newspaper database. After running some successful tests linking a sample of our newspapers to DBpedia, we wanted to go a step further. The metadata for all The Archive’s newspapers were published as linked data, and we wanted a specific use case to show what you could do with it, such as creating links from newspaper content. Our search for a project led us to the List of Names (Namenlijst) at the In Flanders Fields Museum.
Which data are we going to link?
The List of Names is a database containing the names of over 500,000 casualties from the First World War, with information such as date and place of birth and death, profession, address, regiment and similar. We wanted to link these names to names in our archive’s newspapers, so we started the project by investigating the best way to link these two data sources together.
The first step: indexing
The first thing to do was index the names in the newspapers to make them quickly and easily searchable, but the questionable quality of the newspaper OCR text complicated this. The accuracy of OCR text depends on how legible the original text is – its quality, clarity, font, colour and contrast. In our case, there were too many recognition errors and the strange and unexpected characters made indexing difficult, so we looked at three techniques to get round the issue.
1. NER tagging
NER software – short for named entity recognition – seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as people or places, minimising the amount of data to search through. Research by Stanford suggested it would be the most suitable software for our needs, but NER tagging is far from 100% accurate – especially when combined with mistakes in OCR text. So we were still missing lots of names because they weren’t tagged as ‘person’ in the text, but as something else or not at all. We therefore stopped using NER in this project, and tried a second searching strategy instead.
2. Solr with proximity search
With NER tagging we had started from the texts in our News of the Great War collection; in this new strategy, the List of Names was our starting point. Solr – an open source enterprise search platform – is designed for scalability and fault tolerance, and makes all data quickly searchable. We were particularly interested in its proximity search feature, which creates a greater tolerance for recognising names, e.g. if you search for ‘Jan Janssens’ with a proximity tolerance of one word, ‘Jan Wilfried Janssens’ will also be recognised. It all seemed very promising at first, but this proximity search feature ultimately produced lots of false-positive matches, so it wasn’t very suitable for our project. We did keep using Solr, however, because it’s so much faster than other tried-and-tested methods such as binary search and database queries.
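To make the proximity idea concrete, here is a minimal sketch. The first function builds a phrase query in the standard Lucene/Solr syntax ("..."~N); the field name "text" is an assumption, not the name used in the project. The second function is a simplified, in-order stand-in for proximity matching written with a regular expression, so the effect can be tried without a Solr server; real Solr slop matching is more permissive than this.

```python
import re


def proximity_query(field: str, name: str, slop: int) -> str:
    """Build a Solr/Lucene phrase query that tolerates `slop` position moves."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"~{slop}'


def gap_pattern(name: str, max_gap: int) -> re.Pattern:
    """Simplified stand-in: the words of `name` in order, with at most
    `max_gap` extra words between two neighbouring words."""
    words = [re.escape(word) for word in name.split()]
    filler = r"(?:\W+\w+){0,%d}\W+" % max_gap
    return re.compile(r"\b" + filler.join(words) + r"\b", re.IGNORECASE)


query = proximity_query("text", "Jan Janssens", 1)  # "text" is a hypothetical field

# A middle name is tolerated with a gap of one word...
hit = gap_pattern("Jan Janssens", 1).search("soldaat Jan Wilfried Janssens sneuvelde")
# ...but so is an unrelated pair of neighbours, which is a false positive.
false_positive = gap_pattern("Jan Janssens", 1).search("Jan Peeters, Janssens en anderen")
```

The second example shows where the false positives come from: any two people listed next to each other can combine into a name that neither of them has.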
3. Solr without proximity search
Once the newspapers had been indexed in Solr, we searched the index for matches with every name in the List of Names. We also searched for variations of these names in case the same person sometimes had a different first name or surname, e.g. with an alternative spelling and/or middle names. The potential matches we found in the newspapers were written to a relational database for further analysis.
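The workflow in this step could look roughly like the sketch below. Everything in it is illustrative: the sample pages, the person identifier, the table layout and the way variants are generated are assumptions, and a small dictionary with a whole-word regular expression stands in for the Solr index.

```python
import re
import sqlite3
from itertools import product


def name_variants(first_names, family_names):
    """Every 'first family' combination of the known spellings."""
    return sorted({f"{first} {family}" for first, family in product(first_names, family_names)})


# Toy stand-in for the Solr index: page id -> OCR text (invented sample data).
pages = {
    "page-1": "Onze stadsgenoot Jan Janssens is gesneuveld",
    "page-2": "De heer Jean Janssens werd gewond",
    "page-3": "Geen namen op deze bladzijde",
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE potential_match (person_id TEXT, variant TEXT, page_id TEXT)")

person_id = "person-42"  # hypothetical identifier from the List of Names
for variant in name_variants(["Jan", "Jean"], ["Janssens", "Janssen"]):
    # Exact phrase on whole words, without any proximity tolerance.
    pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
    for page_id, text in pages.items():
        if pattern.search(text):
            db.execute(
                "INSERT INTO potential_match VALUES (?, ?, ?)",
                (person_id, variant, page_id),
            )

rows = db.execute(
    "SELECT variant, page_id FROM potential_match ORDER BY page_id"
).fetchall()
```

Storing the potential matches in a table keeps the search step separate from the later analysis, which can then be done with ordinary SQL queries.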
Frequently occurring names
This analysis told us that some names – or rather, some combinations of first and family names – appear very frequently. These are names such as: James Street, Richard Wagner, Wilhelm Kaiser, George Lloyd, Karl Marx and Albert Hall, which appear in the List of Names but also have a well-known counterpart. Someone by the name of Richard Wagner died in Henegouwen on 20 November 1917, for example, but the mentions in the newspapers aren’t about the casualty; they’re about the famous composer. Further analysis showed us that it was virtually impossible to link these ‘famous’ casualties mentioned in the newspapers to their corresponding names in the List of Names. So we eliminated anything that appeared too frequently – such as street names, places and famous people – from the list of potential matches, and ended up with 152,614 links for 556 commonly occurring names.
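A frequency filter of this kind can be sketched as follows. The cut-off value and the sample data are invented for illustration; the article does not state which threshold was actually used.

```python
from collections import Counter


def split_by_frequency(matches, max_mentions):
    """Split (name, page_id) matches into kept matches and names dropped
    because they are mentioned more than `max_mentions` times."""
    counts = Counter(name for name, _page in matches)
    too_frequent = {name for name, n in counts.items() if n > max_mentions}
    kept = [(name, page) for name, page in matches if name not in too_frequent]
    return kept, too_frequent


# Invented sample: a 'famous' name with many mentions and an ordinary one.
sample = [("Richard Wagner", f"page-{i}") for i in range(1, 6)]
sample.append(("Jan Janssens", "page-9"))

kept, dropped = split_by_frequency(sample, max_mentions=3)  # threshold is an assumption
```

With this sample, all mentions of Richard Wagner are dropped and the single mention of Jan Janssens is kept.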
All these steps produced 110,000 matches: people from the List of Names (out of more than 500,000) whose first and family name also appear in the newspapers of the News of the Great War collection, with some people appearing several times in different newspapers. We also found 1.1 million potential links between the List of Names and mentions in the News of the Great War newspapers. These links appeared on 102,729 of the 274,924 newspaper pages available at News of the Great War (around 37%).
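The page coverage figure can be checked with a quick calculation using the two page counts given above:

```python
pages_with_links = 102_729
pages_total = 274_924

# Share of newspaper pages that contain at least one potential link.
coverage_percent = round(pages_with_links / pages_total * 100, 1)
print(f"{coverage_percent}% of pages contain a potential link")
```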
It should be noted that these are only potential links; sharing the same first and family name is no guarantee it’s the same person, as we saw with Richard Wagner. That’s why we also take other factors into account, such as date of birth, before publishing links with the List of Names.
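As one illustration of how a date of birth can help, the sketch below applies a single sanity check: a newspaper printed before someone was born cannot be about that person. This is only an assumed example of such a factor, with invented records and field names, and not a description of the checks actually used before publication.

```python
from datetime import date


def could_be_same_person(birth_date: date, newspaper_date: date) -> bool:
    """A mention can only refer to the person if the paper appeared after their birth."""
    return newspaper_date >= birth_date


# Invented candidate links: same name, different people in the List of Names.
candidates = [
    {"person_id": "person-1", "birth_date": date(1893, 5, 1), "newspaper_date": date(1915, 3, 2)},
    {"person_id": "person-2", "birth_date": date(1916, 8, 9), "newspaper_date": date(1915, 3, 2)},
]

plausible = [
    link["person_id"]
    for link in candidates
    if could_be_same_person(link["birth_date"], link["newspaper_date"])
]
```

A check like this can only rule links out; a link that passes is still just a potential link.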