Vocabulary reconciliation

Role of Controlled Vocabularies

Subset of natural language
- Created to avoid the problems that arise when natural language is used for indexing and retrieval
Improving precision and recall
- By controlling synonymy, controlled vocabularies improve recall: the proportion of documents relevant to the search that were successfully retrieved. By controlling polysemy, they improve precision: the proportion of retrieved documents that are relevant to the search.
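
The two measures can be written out explicitly. As a sketch, let $R$ be the set of documents relevant to a search and $A$ the set of documents actually retrieved (both symbols are introduced here for illustration only):

```latex
\[
\text{recall} = \frac{|R \cap A|}{|R|},
\qquad
\text{precision} = \frac{|R \cap A|}{|A|}
\]
```

For instance, if 20 documents are relevant and a search returns 10 documents of which 8 are relevant, recall is 8/20 = 0.4 and precision is 8/10 = 0.8.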

Come in different flavours

Within the LIS community we can distinguish between three different types of controlled vocabularies:

classification schemes

for example the Dewey Decimal Classification

subject headings

for example the Library of Congress Subject Headings

thesauri

for example the Art & Architecture Thesaurus
It is essential to understand their differences before using them as Linked Data.

Classification schemes

Propose systematically arranged classes that group documents on the same topic together

Classes are represented by notations; captions provide a human-readable description of the scope of each class

The Dewey Decimal Classification (DDC) is the best-known example

Example: Dewey Decimal Classification

We can systematically drill down the DDC in order to locate the class for Cubism:

7	Art and recreation
70	Arts
709	History, geographic treatment, biography
709.04	20th century, 1900–1999
709.0403	Cubism and futurism
709.04032	Cubism

Subject headings

Single words or phrases which describe the subject of resources to facilitate their intellectual access
Alphabetically organized, with cross-references
Complex composition of headings in a pre-coordinated manner
LCSH are used world-wide to describe the topics of resources during cataloging

Example: Library of Congress Subject Headings

Cubism (May Subd Geog)
BT	Aesthetics
BT	Art
BT	Art, Modern - 20th Century
BT	Modernism (Art)
BT	Painting
RT	Post-impressionism (Art)
NT	Decoration and ornament - Cubism
NT	Drawing, Cubist
NT	Purism (Art)

Thesauri

Represent all the concepts of a specific domain in a consistent manner and label each concept with a preferred term as well as synonyms. Hierarchical relationships between concepts are expressed.

The only type of controlled vocabulary for which a formal standard (ISO 25964) exists, which makes it possible to check compliance. Examples include the Art & Architecture Thesaurus (AAT) and the Education Resources Information Center (ERIC) thesaurus.

Example: Art & Architecture Thesaurus

Cubist

Styles and periods
- European
  - Modern European styles and movements
    - Cubist
      - Analytical cubist
      - Synthetic cubist

Drawbacks of Controlled Vocabularies

Despite their obvious advantages, controlled vocabularies also present a number of problems:

cost: expensive to develop and maintain
complexity: end-users have trouble using them
slow evolution: updating takes time
subjectivity: always express a certain worldview

Yet a little semantics goes a long way…

Difficulties in implementing the full-blown Semantic Web, and the move to the more pragmatic Linked Data approach, stirred new interest in controlled vocabularies.

Inferencing capabilities are limited, but awareness grew that existing vocabularies can be reused to establish connections between data.

This evolution spurred interest in the development of SKOS.

Simple Knowledge Organization System (SKOS)

Data format to represent controlled vocabularies on the Web
Confusing relation with ontologies: what is the difference for example between skos:narrower and rdfs:subClassOf?
Controlled vocabularies are hierarchically and associatively structured, but do not have the same rigor as formal axioms or facts about the world.
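
One way to see the difference is to look at what a reasoner may conclude in each case. The Turtle sketch below uses a hypothetical ex: namespace and invented resources; only the skos: and rdfs: terms are standard.

```turtle
@prefix ex:   <http://example.org/demo#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

# Ontology style: every instance of ex:CubistPainting is also an ex:Painting.
ex:CubistPainting rdfs:subClassOf ex:Painting.
ex:someCanvas a ex:CubistPainting.
# An RDFS reasoner may therefore infer: ex:someCanvas a ex:Painting.

# Vocabulary style: the concept "Cubism" is filed under the concept "Modern art".
ex:Cubism a skos:Concept;
  skos:broader ex:ModernArt.
ex:ModernArt a skos:Concept;
  skos:narrower ex:Cubism.
# Nothing follows about resources indexed with ex:Cubism:
# skos:broader and skos:narrower only organize the concepts themselves.
```

Note also that skos:broader is not defined as transitive, whereas rdfs:subClassOf is; SKOS offers skos:broaderTransitive for applications that want to follow whole chains.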

SKOS properties

Concepts are expressed through concrete terms (=labels) and may have three kinds of properties:

labeling properties
- skos:prefLabel
- skos:altLabel
semantic properties
- skos:narrower
- skos:broader
- skos:related
- …
documentation properties
- skos:definition
- skos:scopeNote
- …
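
As a quick illustration, the three kinds of properties can appear side by side on a single concept. The ex: namespace, the related concepts and the wording of the notes below are made up for this sketch.

```turtle
@prefix ex:   <http://example.org/demo#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

ex:Cubism a skos:Concept;
  # labeling properties
  skos:prefLabel "Cubism"@en;
  skos:altLabel "Cubist movement"@en;
  # semantic properties
  skos:broader ex:ModernArt;
  skos:narrower ex:AnalyticalCubism, ex:SyntheticCubism;
  skos:related ex:Purism;
  # documentation properties (illustrative wording)
  skos:definition "Early 20th-century art movement."@en;
  skos:scopeNote "Use for the movement, not for individual artists."@en.
```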

SKOS example from LCSH

@prefix : <http://id.loc.gov/authorities/subjects/>.
@prefix ch: <http://purl.org/vocab/changeset/schema#>.
@prefix org: <http://id.loc.gov/vocabulary/organizations/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.

:sh85034652 a skos:Concept;
  skos:inScheme <http://id.loc.gov/authorities/subjects>;
  skos:prefLabel "Cubism"@en;
  skos:broader :sh85001441, :sh85007461, :sh85007805,
               :sh85086445, :sh85096661;
  skos:narrower :sh85036235, :sh85039437, :sh85109192;
  skos:related :sh2001008665, :sh85105416;
  skos:closeMatch <http://d-nb.info/gnd/4165855-3>;
  skos:exactMatch <http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119361753>;
  skos:changeNote [
       a ch:ChangeSet;
       ch:changeReason "new"^^xsd:string;
       ch:createdDate "2001-06-22T00:00:00"^^xsd:dateTime;
       ch:creatorName org:dlc;
       ch:subjectOfChange :sh85034652
     ],
     [
       a ch:ChangeSet;
       ch:changeReason "revised"^^xsd:string;
       ch:createdDate "2001-07-19T13:07:56"^^xsd:dateTime;
       ch:creatorName org:dlc;
       ch:subjectOfChange :sh85034652
    ].

Let’s create our own thesaurus in SKOS!

Domain analysis
- Analyze the Wikipedia page on modern art to collect terms.
Hierarchy
- Once we have made a list of the different terms (modern art, cubism, purism, etc.), we need to establish a hierarchy.
Tools
- Several tools can be used to create thesauri, such as Tematres. However, let’s encode a Turtle file manually.

Coding our thesaurus in a text editor

Create a new file called modern-art-skos.ttl
(pay attention to the .ttl extension)
Add prefixes:

@prefix art: <http://example.org/art/#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix dc: <http://purl.org/dc/terms/>.

We need to add a title and a top concept

For convenience, stick to ASCII characters in identifiers and avoid whitespace or any special tokens:

art:ModernArtPeriodsThesaurus a skos:ConceptScheme;
  dc:title "Modern art periods thesaurus"@en;
  skos:hasTopConcept art:ModernArt.

Adding concepts

Adding the top concept:

art:ModernArt a skos:Concept;
  skos:prefLabel "Modern art"@en;
  skos:inScheme art:ModernArtPeriodsThesaurus.

Adding multilingual labels and alternative names

art:ModernArt a skos:Concept;
  skos:prefLabel "Modern art"@en;
  skos:prefLabel "Art moderne"@fr;
  skos:prefLabel "Moderne Kunst"@de;
  skos:inScheme art:ModernArtPeriodsThesaurus.

art:ModernArt skos:altLabel "Modern art (1860-1945)"@en.

Now we can add some supplementary genres and subgenres

art:HeidelbergSchool a skos:Concept;
  skos:prefLabel "Heidelberg School"@en;
  skos:broader art:Impressionism;
  skos:inScheme art:ModernArtPeriodsThesaurus.

art:DieBrucke a skos:Concept;
  skos:prefLabel "Die Brücke"@en;
  skos:broader art:Expressionism;
  skos:inScheme art:ModernArtPeriodsThesaurus.
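
The two concepts above name art:Impressionism and art:Expressionism as their broader concepts, but neither is defined in the fragments shown so far. A minimal sketch of how they could be added follows. Placing them directly under art:ModernArt and the English labels are assumptions made for illustration. The prefix lines are repeated so that the fragment parses on its own; they can be left out when pasting into modern-art-skos.ttl.

```turtle
@prefix art: <http://example.org/art/#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

art:Impressionism a skos:Concept;
  skos:prefLabel "Impressionism"@en;
  skos:broader art:ModernArt;
  skos:narrower art:HeidelbergSchool;
  skos:inScheme art:ModernArtPeriodsThesaurus.

art:Expressionism a skos:Concept;
  skos:prefLabel "Expressionism"@en;
  skos:broader art:ModernArt;
  skos:narrower art:DieBrucke;
  skos:inScheme art:ModernArtPeriodsThesaurus.
```

Since SKOS defines skos:narrower as the inverse of skos:broader, stating both directions is optional; spelling them out simply makes the hierarchy easier to read in a hand-written file.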

Play on your own!

Download our mini-thesaurus in SKOS/Turtle and add some extra concepts!
Keep the file, you’ll need it later on.

Validation and visualization tools:

Install the RDF extension

Download the extension and place it in the extensions folder (accessible by clicking on Browse workspace directory from the home page of OpenRefine).

A new RDF button appears after the installation.

Detailed instructions can be found on http://refine.deri.ie/installationDocs.

Creating a project with a dataset

You can either create your own metadata, or download our test file on modern art.

After importing the file, OpenRefine’s preview will show headers such as Title, Artist, Year etc.

You will see some corrupted accents, so don’t forget to set the encoding to UTF-8 by clicking on the Encoding field in Preview mode.

[OpenRefine interface] — ©2017 OpenRefine

Reconciling the Powerhouse Museum metadata with the LCSH

Now that we have played with our own mini-thesaurus and some dummy data, let’s work on a real-life case!

We have used the metadata set of the museum extensively for various cleaning and linking operations.

For this exercise, either download the cleaned OpenRefine project or create a new project using the cleaned metadata.

[Powerhouse Museum] — ©2017 Powerhouse Museum

Issues with LCSH we first need to solve

OpenRefine cannot make a choice when a label matches several headings. If one concept uses a label as its preferred term and another uses the same label as a non-preferred term, OpenRefine cannot choose between them.

For example, skating is an alternative label of the term with preferred label Ice skating (sj96005713), but a separate term with the preferred label Skating (sj85123105) also exists!
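
In SKOS terms, the clash looks roughly as follows. The lcsh: prefix below is a placeholder namespace chosen for this sketch, not the real identifier scheme, and the capitalization of the alternative label is assumed; only the two identifiers and the labels come from the example above.

```turtle
@prefix lcsh: <http://example.org/lcsh-demo/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

lcsh:sj96005713 a skos:Concept;
  skos:prefLabel "Ice skating"@en;
  skos:altLabel "Skating"@en.      # non-preferred term

lcsh:sj85123105 a skos:Concept;
  skos:prefLabel "Skating"@en.     # same string, but as preferred term
```

A cell containing Skating matches both concepts as soon as alternative labels are taken into account, so there is no single best candidate.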

Preprocessing of the LCSH

Some changes were made in our version of the LCSH:

Subdivisions are only present if they do not conflict with an existing main heading with the same label.
Alternate labels were only added to the extent that they do not cause clashes with other labels.

Configuring LCSH as reconciliation source

Click the RDF button, select Add reconciliation service, Based on SPARQL endpoint.

Set its parameters as follows:

Name: LCSH (preprocessed)
Endpoint URL: http://sparql.freeyourmetadata.org/
Graph URI: http://sparql.freeyourmetadata.org/authorities-processed/
Type: Virtuoso
Label properties: check only skos:prefLabel

Now let’s reconcile!

Start the reconciliation process for the Categories column with this new endpoint (Reconcile, Start reconciling, LCSH (preprocessed), Start Reconciling)

Important: Experiment first with a very small subset of the records, fewer than 100. Reconciliation takes a lot of time: launched on the entire data set, it will take at least a day on your laptop. For example, create a filter on Record ID with 123 so that you have results after a couple of seconds.

[JASIST] — ©2013 Journal for the American Society of Information Science and Technology (JASIST)

Impact of reconciliation

Creating a link between your catalog record and an entry of the LCSH unfortunately does not allow you to be connected automatically with all other records which link to that heading!

Always keep in mind that URLs are unidirectional: you point to the LCSH, but the LoC is agnostic of the fact that you point to them.

I wanted the act of adding a link to be trivial. So long as I didn’t introduce some central link database, everything would scale nicely.
Tim Berners-Lee, Weaving the Web, 1999

The unidirectionality of links was an explicit design choice. Asking the linked entity to confirm the link would create too much of a bottleneck for the Web to grow. Imagine someone at LoC whose job it is to check each link created to the LCSH… However, alternatives exist, such as Xanadu by Ted Nelson.
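
Expressed as data, a reconciled record boils down to a triple like the one sketched below. The ex: namespace and the record identifier are hypothetical; the lcsh: prefix and the identifier for Cubism are taken from the LCSH example earlier in this module.

```turtle
@prefix ex:      <http://example.org/catalog/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix lcsh:    <http://id.loc.gov/authorities/subjects/>.

# This triple is stored in YOUR dataset only.
ex:record123 dcterms:subject lcsh:sh85034652.   # Cubism

# A second institution can publish the same kind of link in ITS dataset:
# <http://example.net/items/42> dcterms:subject lcsh:sh85034652.
# Neither dataset, nor the LCSH itself, learns about the other's triple.
```

Finding all records that share a heading therefore requires bringing the datasets together first, for example by loading them into one store and querying that.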

Self-assessment 1: thesauri

What key aspect distinguishes thesauri from other forms of controlled vocabularies?

A formal standard exists to verify their well-formedness.
- Yes, the ISO 25964 standard defines exactly how a thesaurus should be constructed.
A thesaurus provides description at a more granular level.
- No, this does not depend on the type of vocabulary.
Thesauri can be represented in SKOS.
- Thesauri can indeed be represented in SKOS, but so can other types of vocabularies, as illustrated by the LCSH.

Self-assessment 2: non-preferred terms

Why add non-preferred terms to a vocabulary?

It reduces the negative effect of synonymy on search results.
- Yes! Even if an end-user searches on a synonym that is encoded as a non-preferred term of the preferred term used for indexing, the results are the same.
It reduces the negative effect of polysemy on search results.
- No! Adding too many and potentially even irrelevant non-preferred terms will increase the negative impact of polysemy.
You can increase the success rate of the reconciliation process.
- Yes! That is, if you have configured the process to include the non-preferred terms.

Self-assessment 3: labels and concepts

How do labels and concepts relate to each other in a SKOS vocabulary?

Labels allow defining the structure whereas concepts can express the specific terms used.
- No! Completely wrong.
Labels are used to express semantic relations between concepts.
- No, semantic relations are expressed by using properties such as broader or narrower.
Concepts are abstract units of thought; labels are strings of characters associated with a concept.
- Yes!

Self-assessment 4: unidirectionality

Why is it important to acknowledge unidirectionality when creating URLs?

It explains why we don’t need SPARQL.
- No, it’s the opposite! SPARQL is precisely what allows us to traverse links in both directions across a graph.
It helps understand why it isn’t straightforward to connect all records that point to a central authority file.
- Exactly! Just because you link to the LCSH does not mean that the LCSH, or other people referring to the same heading, are made aware of your link’s existence.
In order to prevent the creation of dead links.
- No, but understanding unidirectionality helps us to realize why dead links are unavoidable.

Linked Data for Librarians

Part 2 – Module 1: Vocabulary reconciliation