Dev8D: Working with British Library Data

February 24th, 2011

What can you do with open access to data? What great ideas do you have for utilising open access to bibliographic catalogues? Or what example prototypes can you come up with in a month? We want to find out!

Thus began the Open Knowledge Foundation’s Open Bibliographic Data Challenge at Dev8D last week. The challenge is essentially: we have data, do something with it. Research Revealed had a deadline running up to Dev8D, so I must confess I hadn’t been paying attention. However, the announcement on Wednesday morning immediately caught my interest, since we had already considered exploring one of these datasets: the JISC Open Bibliography British National Bibliography dataset.

The Dataset

The British Library data was released late last year. It describes:

new books published in the UK since 1950; this represents about 20% of the total BL catalogue

And all under the very liberal CC0 licence (although note that this doesn’t say anything about database rights).

Our (Bristol) Data

On our side we have data largely derived from Bristol’s IRIS publications database. IRIS (as far as I can tell) has been operating for well over a decade. My impression is that it was primarily motivated by the need to support the various higher education research assessment exercises.

IRIS data is added by the researchers themselves. You can find an example here, which gives an idea of what it contains. This is one of the better examples, containing as it does a web link and an abstract. IRIS data is a little patchy, usually (88%) missing abstracts and rarely including links to the content; however, it’s still an incredibly useful source, and now contains over 80,000 items.

What I did

So on one hand we have a vast collection of information concerning books, on the other a less well curated publication database.

My first step was converting the rdf/xml to n-triples. It’s always worth doing this ‘preflight’ step to prevent broken data getting into the database. In this case it found:

  • Extensive use of rdf:description rather than rdf:Description
  • Unicode NFC issues throughout.

The former is simple to fix with sed s/rdf:description/rdf:Description/g. The latter is a nuisance, and I couldn’t find a tool to fix it. Warnings switched off…
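For what it is worth, both problems could be handled in one pass with a few lines of Python. This is only a sketch, not the tooling used at the time: the helper name `preflight_line` is hypothetical, and it assumes the NFC warnings come from text that is simply not in Unicode Normalization Form C.

```python
import re
import unicodedata

# Hypothetical helper: the name and the line-at-a-time design are assumptions.
_BAD_TAG = re.compile(r"(</?)rdf:description\b")


def preflight_line(line: str) -> str:
    """Repair one line of RDF/XML before handing it to a parser."""
    # Same fix as the sed command, but restricted to element tags.
    line = _BAD_TAG.sub(r"\1rdf:Description", line)
    # Recompose decomposed characters (e.g. 'e' + combining acute -> 'é').
    return unicodedata.normalize("NFC", line)
```

Applied to each line of the dump before conversion, this would leave well-formed ASCII markup untouched and only change the mis-cased tags and decomposed characters.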

  • Bristol data. ~ 3 million triples, ~ 100,000 publications (broadly conceived). Loaded into named graphs.
  • BL data. ~ 100 million triples. Loaded into default graph.

Next I put it all into an RDF store. Loading this on my laptop was, in retrospect, a little rash. I used TDB for storage and Fuseki for querying and update. On my laptop (13 inch MacBook Pro, SSD, 8GB memory) using tdbloader2 it seemed to load in a few hours, but accurate timing wasn’t possible given the constant moving around at Dev8D. It chugged along in the background happily enough.

So what can we do with this? My two ideas weren’t especially inspired:

  • Identify people between the datasets, and contribute the links back.
  • Enrich our publication and people data with BL data.

The former struck me as a more laudable aim, but as we’ll see it’s currently impossible.

So how to match people? A cursory examination of the data wasn’t very promising. People only seem to exist as name labels hanging off dc:contributor. The names also differed in form from our names, but this was straightforward to fix with a SPARQL update and a custom function:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/terms/>
prefix ilrt: <http://www.ilrt.org/#>

insert into <urn:x:normalised>
{ ?person foaf:name ?nlabel }
where
{
  select ?person (ilrt:normaliseName(?label) as ?nlabel)
  {
    graph ?g1 { ?s dc:contributor ?person }
    graph ?g2 { ?person rdfs:label ?label }
  }
}

Here I use a custom Java function, ‘normaliseName’, to coerce our names into the right form, from ‘Forename(s) Surname’ to ‘Surname, Forename(s).’.
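The Java source of that function isn't shown here, so the following is only a guess at the idea, written in Python. The name `normalise_name` is hypothetical, and the sketch assumes the last whitespace-separated token is the entire surname and that no trailing full stop is wanted.

```python
def normalise_name(label: str) -> str:
    """Turn 'Forename(s) Surname' into 'Surname, Forename(s)'."""
    parts = label.split()
    if len(parts) < 2:
        # Single-token names are returned unchanged (whitespace trimmed).
        return label.strip()
    # Assumption: the final token is the whole surname.
    *forenames, surname = parts
    return f"{surname}, {' '.join(forenames)}"
```

Multi-word surnames (‘van der Berg’) and titles would need extra handling that this sketch does not attempt.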

With that in place let’s try some simple name matching:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

insert into <http://www.dev8d.org/> { ?mine skos:closeMatch ?theirs }
where
{
  graph ?g1 { ?mine a foaf:Person }
  graph ?g2 { ?mine foaf:name ?name  } # UoB names
  ?theirs rdfs:label ?name # BL names
}

Here I use skos:closeMatch for my 11,609 putative matches, in the absence of anything vaguer.

As matches go this is pretty dodgy, so let’s look for more clues. As far as I can tell the bulk of the data concerns books (surprise), and we do have book information. Of the ~82,000 research outputs 2,442 are books authored by our researchers, so they will be listed directly as contributors. There are many other publications, of course, but the nature of the contribution is less direct. (Revisiting this I now realise writing a chapter in a book — 9,000 cases — should also be made direct). So we try a simple coupling:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>

select distinct ?name ?title
{
  graph <http://www.dev8d.org/> { ?local skos:closeMatch ?remote  } # get cross matched people
  graph ?g1 { ?local rdfs:label ?name } # get their names
  graph ?g2 { ?book dc:contributor ?local ; dc:title ?title } # get the title of books they contributed to (UoB)
  ?s dc:contributor ?remote ; dc:title ?title # match remote versions where book titles are identical
}

This produces a list where both name and book titles match. Counting the distinct individuals I found 50 people (from a total of 34,000) that I feel reasonably sure are correctly identified.

Although we’re really stuck with author names, books have much more information including — hurrah — isbns, which our researchers have also helpfully provided. The BL data looks like this:

<rdf:Description>
    <dcterms:title>A list of composers, authors...</dcterms:title>
    ...
    <dcterms:identifier>URN:ISBN:1905419155</dcterms:identifier>
</rdf:Description>

The equivalent in my data would look like:

<rdf:Description rdf:about="http://...."> <!-- or urn:isbn:1905419155 -->
    <dcterms:title>A list of composers, authors...</dcterms:title>
    ...
    <bibo:isbn>1905419155</bibo:isbn>
</rdf:Description>

(The uri used varies depending on whether we have a record about this thing, or a record mentioning this thing)

Which is close, but not close enough. But it’s fixable:

prefix dc: <http://purl.org/dc/terms/>
prefix bibo: <http://purl.org/ontology/bibo/>
prefix owl: <http://www.w3.org/2002/07/owl#>

insert into <http://www.dev8d.org/>
{ ?local owl:sameAs ?bl }
where
{
  graph ?g { ?local bibo:isbn ?isbn  }
  bind (concat('URN:ISBN:', ?isbn) as ?isbnurn)
  ?bl dc:identifier ?isbnurn
}

(I’m not sure about the status of BIND in sparql 1.1. We can do the equivalent using subqueries as I did above)
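The same join can be sketched outside the store in Python. The function names and the dictionary shapes below are hypothetical, and stripping hyphens and spaces is an extra step of my own: the query above concatenates the ISBN exactly as recorded.

```python
def isbn_urn(isbn: str) -> str:
    """Build a 'URN:ISBN:...' identifier from a locally recorded ISBN."""
    # Drop hyphens and spaces; upper-case the ISBN-10 check character 'x'.
    digits = "".join(ch for ch in isbn if ch.isalnum()).upper()
    return "URN:ISBN:" + digits


def match_books(local_isbns: dict, bl_identifiers: dict) -> list:
    """Pair local and BL records whose ISBN URNs are equal.

    local_isbns:    local book URI -> ISBN string
    bl_identifiers: 'URN:ISBN:...' string -> BL node
    """
    pairs = []
    for local_uri, isbn in local_isbns.items():
        bl_node = bl_identifiers.get(isbn_urn(isbn))
        if bl_node is not None:
            pairs.append((local_uri, bl_node))
    return pairs
```

Each returned pair corresponds to one owl:sameAs statement in the insert above.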

This gives about 6000 matches, a pretty fair proportion of the books available.

Finally we can try the name match again:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dc: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select count(distinct ?local)
{
  # get cross matched people
  graph <http://www.dev8d.org/> { ?local skos:closeMatch ?remote }
  graph ?g2 { ?book1 dc:contributor ?local } # get the books they contributed to (UoB)
  # cross match books
  graph <http://www.dev8d.org/> {  ?book1 owl:sameAs ?book2 }
  ?book2 dc:contributor ?remote # get the books they contributed to (BL)
}

Answer: 220. A handy improvement.
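The logic of that last query, reduced to plain Python sets, looks roughly like this. It is a sketch on in-memory data rather than anything run against the store, and every name in it is hypothetical.

```python
def confirmed_people(name_matches, local_books, remote_books, same_book):
    """Return local people whose name match is backed by a shared book.

    name_matches: set of (local_person, remote_person) pairs (skos:closeMatch)
    local_books:  local person -> set of local books they contributed to
    remote_books: remote person -> set of BL books they contributed to
    same_book:    set of (local_book, remote_book) pairs (owl:sameAs)
    """
    confirmed = set()
    for local, remote in name_matches:
        candidates = remote_books.get(remote, ())
        for book in local_books.get(local, ()):
            # Keep the person if any of their books is cross-matched to a
            # book the remote person also contributed to.
            if any((book, other) in same_book for other in candidates):
                confirmed.add(local)
                break
    return confirmed
```

A name match on its own is weak evidence; the book cross-match is what makes the remaining pairs believable.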

Results

Returning to my two ideas, how did they work out?

  • Identify people between the datasets, and contribute the links back.

We have 220 links but — and here’s the dreadful punchline to the whole process — the BL uses blank nodes everywhere, and doesn’t provide any identifiers at all for people. I’ve created a cross linking file which uses the same identifier form as the BL, however the more astute amongst you will notice that I didn’t use the BL data to create this.

  • Enrich our publication and people data with BL data.

I ran out of time, but there’s a wealth of information available here. For our work the most tempting parts are the book descriptions and subjects, which are straightforward to add. But the enrichment will go beyond this, because we’re also looking at using data mining techniques to extract topics and detect similarities in research. Any textual data helps this task.

Next steps

In the next few weeks I intend to:

  • Look at pulling in data by ISSN. Won’t improve people matches, will improve our data.
  • Consider using subject data.
  • Improve name matching beyond exact string matches. Our LARQ free text index may help here.
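As an illustration of the last item, here is a minimal fuzzy comparison in Python. It swaps in the standard library's difflib in place of a LARQ (Lucene) index, which works quite differently, and both the function names and the 0.9 threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def _clean(name: str) -> str:
    # Lower-case and collapse runs of whitespace.
    return " ".join(name.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and spacing."""
    return SequenceMatcher(None, _clean(a), _clean(b)).ratio()


def close_names(name: str, candidates, threshold: float = 0.9) -> list:
    """Return the candidates at least `threshold` similar to `name`."""
    return [c for c in candidates if name_similarity(name, c) >= threshold]
```

Comparing every local name with every BL label this way would be far too slow at this scale, which is where a proper text index earns its keep.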

The British Library appear to be committed to improving their RDF. They published samples of improved data last month which fix some of the issues I tripped over.

The JISC OpenBibliography project is also working hard on improving the data. Most importantly they have generated urls, so I might add

  • Link our contributors to their publications, and publish those links

to the list above.

Acknowledgements

First and foremost, thanks must go to the British Library for opening up their data. I hope my experiment encourages them.

Secondly, thanks to the challenge organisers, Rufus Pollock, Jo Walsh and Mark MacGillivray, for putting up with an entry mostly made of SPARQL queries.

And finally thanks to the organisers of Dev8D for putting on such a great event.

ResearchRevealed Demonstrator – latest release

February 15th, 2011

There has been some great progress on production of our online demonstrator, with the UX and Design team joining forces with the “backend” programmers to produce a fantastic result that combines all our efforts in scaling up research information integration at the University of Bristol with the visual design for the presentation layer tool. Still plenty of work to do and lots of user engagement is taking place (on which we will build further), but we are really pleased with these initial results in terms of the “complete” system.

Screen Shot of the ResearchRevealed demo

How did the visual design come about? Through meeting with our stakeholders and the project’s design consultant (Ben Hayes, Jack Fruit Design Ltd). Ben interpreted our design needs, saying the tool needs to be “Transformative. Takes cold, hard data and makes it accessible and alive. Adds intelligence and knowledge to raw data. The style needs to be very modern, smart, fairly serious/formal, and feel simple and helpful to use. It needs to have a ‘human’ touch, so that the style works to ‘soften the harshness’ of some of the content”. We think he’s done a great job, and as I say, the work continues.

Linked Data Workshop – Hack Day event in collaboration with DevCsi

February 8th, 2011

More information about our very successful two-day event in January 2011 is here: http://www.ukoln.ac.uk/events/devcsi/linked_data_hackdays/, with documentation on our wiki, and a full write up of the day with videos here: http://devcsi.ukoln.ac.uk/blog/2011/01/26/linked-data-hack-days-13th-14th-january-2010-bristol/

Workshop to gather University project initiatives for research information management

December 8th, 2010

Our fourth workshop in the series took place on 23rd November 2010 and brought together four interrelated strands of activity at the University of Bristol: ResearchRevealed, and three University-funded, “internal” projects: Examiner (a project applying data mining techniques to the University’s data about research), the People Profiler project (a project developing extensions to support how researchers and research managers publish their research output thematically) and the Research Information Management project (a project examining the options and requirements for a system to manage research information successfully, for the benefit of institutional research managers in particular).

The commonalities across these projects were surfaced for workshop participants, which was extremely useful all round. These projects will continue to inform each other going forward.

User-Centred Design for ResearchRevealed

October 19th, 2010

The success of Research Revealed will depend to a large degree on how useful and usable the tool is to researchers and administrators. Because of this, we’re committed to taking a user-centred approach to the development of the tool.

The key steps in this User-Centred Design (UCD) approach are:

* User research and engagement. We have been working with users from a range of departments at the University of Bristol to try to understand what sort of research data they need and how to best present it to them. This is achieved mainly via a series of contextual interviews and user workshops.

* Prototyping. The findings from our user research are fed into the design of an interactive prototype, created using Axure software. Prototyping is a quick and effective way of bringing a design to life in a way that can be evaluated without the need for costly development.

* Evaluation. The prototype and demonstrator sites are tested with academics and research managers. We also recruit a panel of academics to provide informal feedback on each design iteration.

This iterative approach allows us to refine and improve our designs over time and, we believe, gives us a much greater probability of designing and delivering the right solution.

Our user research so far suggests that the core goals for academic researchers using ResearchRevealed will be to (i) foster research collaborations, (ii) manage research outputs and impacts, and (iii) monitor developments within their own school or department. Research managers and heads of departments will be seeking to use it as a tool to simplify the process of exploring and tracking research trends within their departments.

We’re planning more user research over the next couple of months, so if you’re an academic at the University of Bristol and would like to take part, please email Stuart Church (stuartc@pureusability.co.uk).

By Stuart Church, Pure Usability Ltd., Bristol

RDFa and ResearchRevealed

October 15th, 2010

This article introduces RDFa and explains the part it plays in ResearchRevealed. ResearchRevealed supports work on java-rdfa, an RDFa parser written in Java.

Part of ResearchRevealed’s purpose is to examine how institutions and researchers can work with the web of research data. I say ‘web’ here because although research data is typically ensconced in databases and repositories it is typically surfaced via the web.

ResearchRevealed has a page on the JISC web site. But it isn’t just a page: it mentions the start and end dates; it links to the page representing the funding programme, the strand, topics, a person (Nikki, via her email address). It also links to our own project page, from which you can find the ILRT’s page for me… And so it continues.

This is the familiar web, nothing exciting surely? Well, it is exciting when you tilt your head slightly and see beyond the hypertext. JISC and UoB databases aren’t merged, but in a sense they are on the web. JISC is surfacing some of their data about the project and is using (very) foreign keys to point to our information about the same thing at Bristol. For other researchers the picture is even richer: their publications, grants, departments all have web pages on disparate systems (publisher, funding body, institution), and the integration is correspondingly more complex.

The reflection of information in the web was the subject of a talk by Tim Berners-Lee in 1994, and it’s what led to the so-called ‘semantic web’ or ‘linked data’.

Semantic Web imagined by tim from 1994

Tim Berners-Lee
1994, http://www.w3.org/Talks/WWW94Tim/

The advice for publishing data on the web can be daunting and involves rather exotic technologies like RDF and content negotiation, things outside the web manager’s comfort zone. Yet the web pages are almost there. They mention the data values, they link. What isn’t apparent is what the values are, or what the nature of the link is, and that’s where RDFa fits in.

RDFa came out of the XHTML 2.0 work originally. It provides a way to decorate your HTML to make those values and links explicit, much like microformats. RDFa consists of a small number of attributes which can be added to html elements.

For my example I’ll start with the ILRT page at the university. Let’s make this serve the dual purpose of hypertext and data.

Firstly we’ll need to add some XML namespaces, aiiso (http://purl.org/vocab/aiiso/schema#) and foaf (http://xmlns.com/foaf/0.1/), to help us describe the organisation. Next we’ll make it clear what we’re talking about:

<body about="#org" typeof="aiiso:Institute">

This gives the ILRT itself a URL (http://www.bris.ac.uk/contact/organisation/getDetails?organisationCode=ILRT#org) (distinct from the page) and says it’s an institute.

Under ‘Hierarchy’ you’ll find that the ILRT is part of Information Services.

<a href="getDetails?organisationCode=ISER">
  <span class="underline">Information Services</span>
</a>

We can add an attribute to make that explicit:

<a rel="aiiso:part_of" href="getDetails?organisationCode=ISER#org">
  <span class="underline">Information Services</span>
</a>

‘rel’ is used to type links.

Now the page says that the ILRT is part of ISER. We can also label the ISER with its name:

<span class="underline" property="foaf:name">Information Services</span>

‘property’ is like ‘rel’, but is used for content rather than links.

We can do the same trick for me and other staff:

<a rel="foaf:member" href="../person/getDetails?personKey=W1K8jX7rokwl8OKZUeQMmPO1FXAvt0#person">
  <span class="underline" property="foaf:name">
    <span class="fn">
      <span class="n">
        <span class="honorific-prefix">Mr</span>
        <span class="given-name">Damian</span>
        <span class="family-name">Steer</span>
      </span>
    </span>
  </span>
</a>
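To see what a consumer gets out of those attributes, here is a rough Python sketch that lists the ‘rel’ links and ‘property’ values in a fragment. It is not a conforming RDFa parser (java-rdfa is the real thing): it ignores ‘about’, ‘typeof’ and namespace resolution, the class and function names are hypothetical, and it assumes well-nested markup with no unclosed elements such as a bare <br>.

```python
from html.parser import HTMLParser


class RdfaSketch(HTMLParser):
    """Collect (rel, href) links and (property, text) values."""

    def __init__(self):
        super().__init__()
        self.links = []    # (rel, href) pairs
        self.values = []   # (property, text) pairs
        self._stack = []   # one (property or None, text parts) per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "rel" in attrs and "href" in attrs:
            self.links.append((attrs["rel"], attrs["href"]))
        self._stack.append((attrs.get("property"), []))

    def handle_data(self, data):
        # Text belongs to every open element that carries a property.
        for prop, parts in self._stack:
            if prop is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        prop, parts = self._stack.pop()
        if prop is not None:
            # Collapse whitespace left over from nested markup.
            self.values.append((prop, " ".join("".join(parts).split())))


def extract(html: str):
    """Return (links, values) found in an HTML fragment."""
    parser = RdfaSketch()
    parser.feed(html)
    parser.close()
    return parser.links, parser.values
```

Fed the staff link above, this yields one foaf:member link and one foaf:name value whose text is the person's full name, which is the data that was hiding in the hypertext.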

We haven’t actually changed the university contact pages for ResearchRevealed, but we have provided similar pages to show the concept which cover the entire institution (several thousand people).

RDFa is taking off on the wider web. Google refers to such embedded data as ‘rich snippets’, and it forms the basis for Facebook’s Open Graph Protocol. You can find RDFa in every IMDB page, on slideshare, Best Buy, Overstock, and (although broken) youtube.

Introducing … Glint

September 27th, 2010

Glint is a Mac application for querying SPARQL endpoints. A SPARQL endpoint allows agents (human or otherwise) to query RDF data stores via the SPARQL language. For example, a number of endpoints have been made available with the recently launched data.gov.uk website. In fact, a SPARQL endpoint sits in front of the ResearchRevealed datastore.
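Under the hood a client such as Glint has to do something like the following, shown here as a Python sketch and not taken from Glint's own Objective-C code. The function names are hypothetical, the endpoint URL in any call is whatever you supply, and the sketch assumes the endpoint accepts GET requests, has no query string of its own, and can return SPARQL JSON results.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the GET URL for a SPARQL protocol query request."""
    return endpoint + "?" + urlencode({"query": query})


def run_select(endpoint: str, query: str) -> list:
    """Send a SELECT query and return the result bindings (needs network)."""
    request = Request(
        sparql_get_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Everything else a desktop client offers (stored endpoints, highlighting, result tables) is convenience layered on top of this one request.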

Each SPARQL endpoint usually comes with a simple form that allows you to send queries. However, from my point of view, it would be useful to have a desktop tool that allows me to store the location of different endpoints, query the endpoints, provide syntax highlighting, keep a history of queries etc.

At the moment, Glint has the following features:

  • Store the location of SPARQL endpoints
  • Syntax highlighting for queries
  • View the results of SELECT statements in a table, or view the results as XML or JSON
  • View the results of DESCRIBE and CONSTRUCT queries as RDF/XML, Turtle or N3
  • Export the results to file
  • Ability to receive automatic updates to the application

So, why develop a Mac application? Glint represents a merger of some of my professional and personal interests: the Semantic Web, SPARQL endpoints and developing Mac applications with Objective-C and Cocoa. Basically, I wanted to write a Mac desktop client that allows me to query SPARQL endpoints. It also provides a refreshing antithesis to my main development skills and focus at the moment, namely Java, web applications and the Mobile Web. The vast majority of the development effort on Glint has been done in my own time.

The project is still in the early stages and I would appreciate any comments, criticisms and suggestions for features that should be supported. Providing a history of queries is pretty high on the list of new features.

The latest DMG can be obtained from the project page on GitHub.

SPARQL query in Glint

Tabular results in Glint

You can track the progress of the project on my personal blog.

ResearchRevealed Demonstrator Progress

June 23rd, 2010

Over the last couple of months we have done a significant amount of work in scaling up our demonstrator from the previous Content Integration Project pilot. Our User Experience consultant has been helping us develop the design of our online tool that will be usable both by Research Managers and Researchers at the University of Bristol. It will offer a faceted browse over integrated research information – drawn from both central University systems and from remote sources such as Research Council websites. It will allow an end-user to browse through content to pages about research projects, research outputs (such as publications and conference papers), research departments and the researchers themselves. Researchers – our domain experts! – will be able to see all the data held about their research activity in one place and to enhance that data by verifying it, annotating it and linking in supporting information about the impact of their research.

ResearchRevealed Demonstrator - Faceted Browse Screenshot

Our initial demo is being scaled up to reveal data covering the entire University’s research activity, thus testing how robust we can make our Linked Data solution for research information integration. Our Web Designer is now working with us to help us develop a complete visual design for the final demonstrator which will be offered online and released as opensource software in Spring next year. Meanwhile, we are testing the solution iteratively with departments here at the University of Bristol and developing a requirements analysis around the needs for using, enhancing and promoting research information across the institution.

Research Impact Tool prototype

April 7th, 2010

As part of this project we are developing a new, Web browser-based, bookmarklet-style tool to facilitate the ongoing capture of research impact by Researchers and Research Managers across the University. The tool allows any web page to be simply bookmarked by our widget. The widget display automatically shows the web page’s title and url and allows the user to enter a description, keywords, people associated with the research impact in question and classification of the type of impact. Keywords are free text but auto-complete is offered to show all keywords entered to date by other users of the system. Auto-complete is also offered for all staff known to the central database at the University of Bristol.

Research Impact Capture Bookmarklet Tool

Impact classifications are thus far based on pilot work towards Research Excellence Framework submissions for research impact, but we aim to extend this to cater for Research Council submissions as well. The tool is now to be trialled within selected departments at the University of Bristol and will be backed by a pilot “Research Impact” repository. Senior technical developer Damian Steer and User Experience expert Stuart Church collaborated to develop the first version of this tool. The software is open source and designed with general reuse in mind as far as possible.

An introduction to manipulating SPARQL using ARQ

April 1st, 2010

When you’ve been working with SPARQL you quickly find that static queries are restrictive. Maybe you want to vary a value, perhaps add a filter, alter the limit, etc etc. Being an impatient sort you dive in to the query string, and it works. But what about little Bobby Tables? And, even if you sanitise your inputs, string manipulation is a fraught process and syntax errors await you. Although it might seem harder than string munging, the ARQ api is your friend in the long run.
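To make the hazard concrete, here is a small Python sketch (the post itself works in Java with ARQ). Both function names are hypothetical, and the escaping shown covers only the common cases for a double-quoted SPARQL literal.

```python
FOAF_NAME = "<http://xmlns.com/foaf/0.1/name>"


def naive_query(name: str) -> str:
    # Unsafe: the value is pasted straight into the query text.
    return 'select * { ?person %s "%s" }' % (FOAF_NAME, name)


def escape_literal(value: str) -> str:
    """Escape a string for use inside a double-quoted SPARQL literal."""
    replacements = {
        "\\": "\\\\",
        '"': '\\"',
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def safer_query(name: str) -> str:
    # Still string building, but the value can no longer close the literal.
    return naive_query(escape_literal(name))
```

With the input `x" . ?s ?p ?o } #`, naive_query produces a query whose pattern now matches every triple, while safer_query keeps the whole input inside the literal. Even so, hand-rolled escaping is easy to get subtly wrong, which is the case for letting the API do the binding.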

Let’s begin with something simple. Suppose we wanted to restrict the following query to a particular person:

select * { ?person <http://xmlns.com/foaf/0.1/name> ?name }

String#replaceAll would work, but there is a safer way. QueryExecutionFactory in most cases lets you supply a QuerySolution with which you can prebind values.

// Prebind ?person so the query only matches the given resource
QuerySolutionMap initialBinding = new QuerySolutionMap();
initialBinding.add("person", personResource);
QueryExecution qe = QueryExecutionFactory.create(query, dataset, initialBinding);
