Dev8D: Working with British Library Data
February 24th, 2011
What can you do with open access to data? What great ideas do you have for utilising open access to bibliographic catalogues? Or what example prototypes can you come up with in a month? We want to find out!
Thus began the Open Knowledge Foundation’s Open Bibliographic Data Challenge at Dev8D last week. The challenge is essentially: we have data, do something with it. Research Revealed had a deadline running up to Dev8D, so I must confess I hadn’t been paying attention. However, the announcement on Wednesday morning immediately caught my interest, since we had already considered exploring one of these datasets: the JISC Open Bibliography British National Bibliography dataset.
The Dataset
The British Library data was released late last year. It describes:
new books published in the UK since 1950; this represents about 20% of the total BL catalogue
And all under the very liberal CC0 licence (although note that this doesn’t say anything about database rights).
Our (Bristol) Data
On our side we have data largely derived from Bristol’s IRIS publications database. IRIS (as far as I can tell) has been operating for well over a decade. My impression is that it was primarily motivated by the need to support the various higher education research assessment exercises.
IRIS data is added by the researchers themselves. You can find an example here, which gives an idea of what it contains. This is one of the better examples, containing as it does a web link and an abstract. IRIS data is a little patchy: 88% of records lack abstracts and few include links to the content. However, it’s still an incredibly useful source, and now contains over 80,000 items.
What I did
So on one hand we have a vast collection of information concerning books, on the other a less well-curated publication database.
My first step was converting the rdf/xml to n-triples. It’s always worth doing this ‘preflight’ step to prevent broken data getting into the database. In this case it found:
- Extensive use of rdf:description rather than rdf:Description
- Unicode NFC issues throughout.
The former is simple to fix with sed 's/rdf:description/rdf:Description/g'. The latter is a nuisance, and I couldn’t find a tool to fix it. Warnings switched off…
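Both fixes could in principle be scripted with nothing more than the Python standard library. The following is a minimal sketch, not what was run at Dev8D: the helper name preflight is hypothetical, and it assumes the file is UTF-8 text in which normalising whole lines to NFC is safe.

```python
import unicodedata


def preflight(line: str) -> str:
    """Repair one line of RDF/XML text (hypothetical helper, not from the post)."""
    # Same substitution as the sed command: restore the capital D.
    line = line.replace("rdf:description", "rdf:Description")
    # Recompose characters into Unicode Normalization Form C (NFC).
    return unicodedata.normalize("NFC", line)


# "e" followed by a combining acute accent (U+0301) is not in NFC.
raw = "<rdf:description><dcterms:title>Cafe\u0301</dcterms:title></rdf:description>"
fixed = preflight(raw)
print(fixed)
```

Applied line by line to the dump before conversion, something like this might remove the need to switch the warnings off, though it has not been tested against the real BL file.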
Next I put it all into an RDF store:
- Bristol data. ~ 3 million triples, ~ 100,000 publications (broadly conceived). Loaded into named graphs.
- BL data. ~ 100 million triples. Loaded into default graph.
Loading this on my laptop was, in retrospect, a little rash. I used TDB for storage and Fuseki for querying and update. On my laptop (13 inch MacBook Pro, SSD, 8GB memory) using tdbloader2 it seemed to load in a few hours, but accurate timing wasn’t possible given the constant moving around at Dev8D. It chugged along in the background happily enough.
So what can we do with this? My two ideas weren’t especially inspired:
- Identify people between the datasets, and contribute the links back.
- Enrich our publication and people data with BL data.
The former struck me as a more laudable aim, but as we’ll see it’s currently impossible.
So how to match people? A cursory examination of the data wasn’t very promising. People only seem to exist as name labels hanging off dc:contributor. The names also differed in form from our names, but this was straightforward to fix with a SPARQL update and a custom function:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix dc: <http://purl.org/dc/terms/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix ilrt: <http://www.ilrt.org/#>
insert into <urn:x:normalised>
  { ?person foaf:name ?nlabel }
where
{
  # normalise each contributor's label with the custom function
  select ?person (ilrt:normaliseName(?label) as ?nlabel)
  {
    graph ?g1 { ?s dc:contributor ?person }
    graph ?g2 { ?person rdfs:label ?label }
  }
}
Here I use a custom java function ‘normaliseName’ to coerce our names into the right form, from ‘Forename(s) Surname’ to ‘Surname, Forename(s).’.
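The Java function itself isn’t shown here, so the following is only a Python stand-in for the idea. It assumes the last whitespace-separated token is the surname, which is wrong for multi-word surnames, and it leaves out any trailing full stop because the exact target form isn’t clear from the description. The name normalise_name and the sample input are illustrative.

```python
def normalise_name(label: str) -> str:
    """Turn 'Forename(s) Surname' into 'Surname, Forename(s)'.

    Assumption: the last whitespace-separated token is the surname, which
    is wrong for names such as 'Ludwig van Beethoven'.
    """
    parts = label.split()
    if len(parts) < 2:
        # A single token (or an empty string) has nothing to reorder.
        return label.strip()
    surname = parts[-1]
    forenames = " ".join(parts[:-1])
    return f"{surname}, {forenames}"


print(normalise_name("Jane Anne Smith"))  # Smith, Jane Anne
```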
With that in place let’s try some simple name matching:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
insert into <http://www.dev8d.org/>
  { ?mine skos:closeMatch ?theirs }
where
{
  graph ?g1 { ?mine a foaf:Person }
  graph ?g2 { ?mine foaf:name ?name }  # UoB names
  ?theirs rdfs:label ?name             # BL names
}
Here I use skos:closeMatch for my 11,609 putative matches, in the absence of anything vaguer.
As matches go this is pretty dodgy, so let’s look for more clues. As far as I can tell the bulk of the data concerns books (surprise), and we do have book information. Of the ~82,000 research outputs 2,442 are books authored by our researchers, so they will be listed directly as contributors. There are many other publications, of course, but the nature of the contribution is less direct. (Revisiting this I now realise writing a chapter in a book — 9,000 cases — should also be made direct). So we try a simple coupling:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
select distinct ?name ?title
{
  # get cross matched people
  graph <http://www.dev8d.org/> { ?local skos:closeMatch ?remote }
  # get their names
  graph ?g1 { ?local rdfs:label ?name }
  # get the title of books they contributed to (UoB)
  graph ?g2 { ?book dc:contributor ?local ; dc:title ?title }
  # match remote versions where book titles are identical
  ?s dc:contributor ?remote ; dc:title ?title
}
This produces a list where both name and book titles match. Counting the distinct individuals I found 50 people (from a total of 34,000) that I feel reasonably sure are correctly identified.
Although we’re really stuck with author names, books have much more information including — hurrah — isbns, which our researchers have also helpfully provided. The BL data looks like this:
<rdf:Description>
  <dcterms:title>A list of composers, authors...</dcterms:title>
  ...
  <dcterms:identifier>URN:ISBN:1905419155</dcterms:identifier>
</rdf:Description>
The equivalent in my data would look like:
<rdf:Description rdf:about="http://...."> <!-- or urn:isbn:1905419155 -->
  <dcterms:title>A list of composers, authors...</dcterms:title>
  ...
  <bibo:isbn>1905419155</bibo:isbn>
</rdf:Description>
(The URI used varies depending on whether we have a record about this thing, or a record mentioning this thing.)
Which is close, but not close enough. But it’s fixable:
prefix dc: <http://purl.org/dc/terms/>
prefix bibo: <http://purl.org/ontology/bibo/>
prefix owl: <http://www.w3.org/2002/07/owl#>
insert into <http://www.dev8d.org/>
  { ?local owl:sameAs ?bl }
where
{
  graph ?g { ?local bibo:isbn ?isbn }
  # build the BL-style identifier string from our bare ISBN
  bind (concat('URN:ISBN:', ?isbn) as ?isbnurn)
  ?bl dc:identifier ?isbnurn
}
(I’m not sure about the status of BIND in SPARQL 1.1. We can do the equivalent using subqueries, as I did above.)
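The same join can be pictured outside the store. This Python sketch uses toy data: the example.org URI and the blank node label are made up. Unlike the query above, it also strips hyphens and spaces from the local ISBN, on the assumption that hand-entered ISBNs may not be in bare form.

```python
def isbn_urn(isbn: str) -> str:
    """Build the 'URN:ISBN:...' string form seen in the BL dcterms:identifier."""
    # Assumption: local ISBNs may contain hyphens or spaces; strip them.
    bare = isbn.replace("-", "").replace(" ", "").upper()
    return "URN:ISBN:" + bare


# Toy data: local URI -> ISBN, and BL identifier string -> BL node label.
local_books = {"http://example.org/book/1": "1-905419-15-5"}
bl_books = {"URN:ISBN:1905419155": "_:b42"}

# Pair up every local book whose derived URN appears in the BL data.
same_as = [
    (local, bl_books[isbn_urn(isbn)])
    for local, isbn in local_books.items()
    if isbn_urn(isbn) in bl_books
]
print(same_as)
```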
This gives about 6000 matches, a pretty fair proportion of the books available.
Finally we can try the name match again:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix owl: <http://www.w3.org/2002/07/owl#>
select (count(distinct ?local) as ?people)
{
  # get cross matched people
  graph <http://www.dev8d.org/> { ?local skos:closeMatch ?remote }
  # get the books they contributed to (UoB)
  graph ?g2 { ?book1 dc:contributor ?local }
  # cross match books
  graph <http://www.dev8d.org/> { ?book1 owl:sameAs ?book2 }
  # get the books they contributed to (BL)
  ?book2 dc:contributor ?remote
}
Answer: 220. A handy improvement.
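To make the logic of that query concrete, here is the same four-way join over plain Python sets. Every identifier is hypothetical toy data. A person link survives only when some ISBN-matched pair of books lists the local person on one side and the remote name on the other.

```python
# Toy data, all identifiers hypothetical.
close_match = {("uob:alice", "bl:p1"), ("uob:bob", "bl:p2")}  # people linked by name
same_as = {("uob:book1", "bl:book9")}                         # books linked by ISBN
uob_contrib = {("uob:book1", "uob:alice"), ("uob:book2", "uob:bob")}
bl_contrib = {("bl:book9", "bl:p1"), ("bl:book7", "bl:p2")}

# Keep a local person only if a matched book pair backs up the name match.
confirmed = {
    local
    for local, remote in close_match
    for book1, book2 in same_as
    if (book1, local) in uob_contrib and (book2, remote) in bl_contrib
}
print(len(confirmed))  # 1
```

Here bob’s name match is dropped because none of his books has an ISBN-matched counterpart in the BL data.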
Results
Returning to my two ideas, how did they work out?
- Identify people between the datasets, and contribute the links back.
We have 220 links but (and here’s the dreadful punchline to the whole process) the BL uses blank nodes everywhere, and doesn’t provide any identifiers at all for people. I’ve created a cross linking file which uses the same identifier form as the BL; however, the more astute amongst you will notice that I didn’t use the BL data to create this.
- Enrich our publication and people data with BL data.
I ran out of time, but there’s a wealth of information available here. For our work the most tempting parts are the book descriptions and subjects, which are straightforward to add. But the enrichment will go beyond this, because we’re also looking at using data mining techniques to extract topics and detect similarities in research. Any textual data helps this task.
Next steps
In the next few weeks I intend to:
- Look at pulling in data by ISSN. Won’t improve people matches, will improve our data.
- Consider using subject data.
- Improve name matching beyond exact string matches. Our LARQ free text index may help here.
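As an illustration of what matching beyond exact strings might look like, here is a small sketch using difflib from the Python standard library. It is not the LARQ free text approach mentioned above, just a different and simpler technique. The 0.9 threshold and the sample names are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the lowercased strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_matches(name, candidates, threshold=0.9):
    """Return the candidates whose similarity to name meets the threshold."""
    return [c for c in candidates if similarity(name, c) >= threshold]


# Hypothetical labels in 'Surname, Forename(s)' form.
candidates = ["Smith, Jane A.", "Smith, John", "Smythe, Jane"]
result = fuzzy_matches("Smith, Jane A", candidates)
print(result)
```

A threshold this high only forgives small differences such as trailing punctuation. Lowering it would catch more variants at the cost of more false matches, which matters when names are the only evidence available.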
The British Library appear to be committed to improving their RDF. They published samples of improved data last month which fix some of the issues I tripped over.
The JISC OpenBibliography project is also working hard on improving the data. Most importantly they have generated URLs, so I might add the following to the list above:
- Link our contributors to their publications, and publish those links
Acknowledgements
First and foremost, thanks must go to the British Library for opening up their data. I hope my experiment encourages them.
Secondly, thanks to the challenge organisers, Rufus Pollock, Jo Walsh and Mark MacGillivray for putting up with an entry mostly made of sparql queries.
And finally thanks to the organisers of Dev8D for putting on such a great event.





