When someone looks up a URI, provide useful information using standards.
Include links to other URIs so people can discover more.
Controlled vocabularies can play a pivotal role in addressing these two principles.
Role of Controlled Vocabularies
Subset of natural language
Created to avoid the problems that arise when natural language is used
for indexing and retrieval
Improving precision and recall
By providing synonymy control, controlled vocabularies improve recall,
which is the proportion of the documents relevant to the search that
were successfully retrieved. Greater precision, the proportion of retrieved documents
relevant to the search, is achieved through the control of polysemy.
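The two measures defined above can be made concrete with a minimal sketch. The document identifiers are invented for illustration, and the function name precision_recall is an assumption, not part of any library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: four documents are relevant, three were retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9"}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2). Synonymy control would aim to bring d3 and d4 into the result set; polysemy control would aim to keep d9 out.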
Come in different flavours
Within the LIS community we can distinguish between three different types
of controlled vocabularies:
It is essential to understand their differences before using them as Linked Data.
Classification schemes
Propose systematically arranged classes to group documents
on the same topic together
Classes are represented by notations and captions
Captions provide a human-readable description of the scope of the class
The Dewey Decimal Classification (DDC) is the best-known example
Example: Dewey Decimal Classification
We can systematically drill down the DDC in order to locate the class for
Cubism:
7: Art and recreation
70: Arts
709: History, geographic treatment, biography
709.04: 20th Century, 1900–1999
709.0403: Cubism and futurism
709.04032: Cubism
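The drill-down above relies on the fact that each broader class is a prefix of the narrower notation. A minimal sketch of that idea follows; the CAPTIONS table only contains the six classes from this example, and the function name drill_down is hypothetical.

```python
# Captions copied from the example above; a real DDC edition has far more classes.
CAPTIONS = {
    "7": "Art and recreation",
    "70": "Arts",
    "709": "History, geographic treatment, biography",
    "709.04": "20th Century, 1900–1999",
    "709.0403": "Cubism and futurism",
    "709.04032": "Cubism",
}


def drill_down(notation, captions=CAPTIONS):
    """List (notation, caption) pairs from the broadest known class to the given one."""
    path = []
    for end in range(1, len(notation) + 1):
        prefix = notation[:end]
        if prefix in captions:  # skip prefixes that have no caption in our table
            path.append((prefix, captions[prefix]))
    return path


for step, caption in drill_down("709.04032"):
    print(step, caption)
```

The loop prints the six classes in the same order as the example, from 7 down to 709.04032.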
Subject headings
Single words or phrases which describe the subject of resources to facilitate
their intellectual access
Alphabetically organized, with cross-references
Complex composition of headings in a pre-coordinated manner
The Library of Congress Subject Headings (LCSH) are used worldwide
to describe the topics of resources during cataloging
Example: Library of Congress Subject Headings
Cubism (May Subd Geog)
BT Aesthetics
BT Art
BT Art, Modern - 20th Century
BT Modernism (Art)
BT Painting
RT Post-impressionism (Art)
NT Decoration and ornament - Cubism
NT Drawing, Cubist
NT Purism (Art)
Thesauri
Represent all the concepts of a specific domain in a consistent manner
and label each concept with a preferred term, as well as synonyms.
Hierarchical relationships between concepts are expressed.
Despite their obvious advantages, controlled vocabularies also present
a number of problems:
cost: expensive to develop and maintain
complexity: end-users have trouble using them
slow evolution: updating takes time
subjectivity: always express a certain worldview
Yet a little semantics goes a long way…
Difficulties in implementing the full-blown Semantic Web and the move
to the more pragmatic Linked Data approach stirred new
interest in controlled vocabularies.
Inferencing capabilities are limited, but awareness grew that
existing vocabularies can be reused to establish connections between data.
This evolution spurred interest in the development of SKOS.
Simple Knowledge Organization System (SKOS)
Data format to represent controlled vocabularies on the Web
Confusing relation with ontologies: what is the difference, for
example, between skos:narrower and rdfs:subClassOf?
Controlled vocabularies are hierarchically and associatively structured,
but do not have the same rigor as formal axioms or facts about the world.
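One way to see the difference in rigor is transitivity. RDFS treats rdfs:subClassOf as transitive, so a reasoner may chain statements, whereas the SKOS Reference deliberately does not declare skos:broader (or its inverse skos:narrower) transitive and offers skos:broaderTransitive separately. The sketch below illustrates this with plain Python sets rather than an RDF library; the class names are hypothetical, and the two broader links are taken from the Cubism heading shown earlier.

```python
def transitive_closure(pairs):
    """Return every (a, c) reachable by chaining (a, b) and (b, c)."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# rdfs:subClassOf carries formal semantics, so chaining is a valid inference.
# The class names are invented for illustration.
sub_class_of = {("CubistPainting", "Painting"), ("Painting", "Artwork")}
inferred_classes = transitive_closure(sub_class_of)

# skos:broader only records the links asserted by the vocabulary editors.
# An application may choose to chain them, but nothing in SKOS entails it.
broader = {("Drawing, Cubist", "Cubism"), ("Cubism", "Art")}
```

After running this, inferred_classes contains the derived pair (CubistPainting, Artwork), while broader still holds only the two asserted links.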
SKOS properties
Concepts are expressed through concrete terms (=labels)
and may have three kinds of properties:
OpenRefine cannot make a choice if a label matches several headings.
If one concept uses a label as its preferred term, and another uses
the same label to designate a non-preferred term, OpenRefine cannot choose.
For example, skating is an alternative label of the term with preferred label
Ice skating (sj96005713), but a separate term with the preferred label Skating
(sj85123105) also exists!
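A minimal sketch of how such clashes could be detected before reconciliation. The dictionary layout and the function name ambiguous_labels are assumptions made for illustration; only the two headings and their identifiers come from the example above, and the case-insensitive comparison is a design choice, not documented OpenRefine behaviour.

```python
from collections import defaultdict

# The two headings from the example, with their preferred and alternative labels.
concepts = {
    "sj96005713": {"pref": "Ice skating", "alt": ["skating"]},
    "sj85123105": {"pref": "Skating", "alt": []},
}


def ambiguous_labels(concepts):
    """Map each label (compared case-insensitively) to the concepts it names,
    keeping only labels that name more than one concept."""
    index = defaultdict(set)
    for concept_id, labels in concepts.items():
        for label in [labels["pref"], *labels["alt"]]:
            index[label.casefold()].add(concept_id)
    return {label: ids for label, ids in index.items() if len(ids) > 1}


clashes = ambiguous_labels(concepts)
```

Here clashes contains a single entry for "skating", pointing at both identifiers, which is exactly the situation in which an automatic match cannot be decided.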
Preprocessing of the LCSH
Some changes were made in our version of the LCSH:
Subdivisions are only present if they do not conflict with
an existing main heading with the same label.
Alternate labels were only added to the extent that they do not
cause clashes with other labels.
Configuring LCSH as reconciliation source
Click the RDF button,
select Add reconciliation service, Based on SPARQL endpoint.
Start the reconciliation process for the Categories column with
this new endpoint (Reconcile, Start reconciling, LCSH (preprocessed),
Start Reconciling)
Important: Experiment first with a very small subset of the records,
ideally fewer than 100. The process takes a lot of time: if you launch it on
the entire data set, it will take at least a day on your laptop.
For example, create a filter on Record ID with 123 so that you have
results after a couple of seconds.
Impact of reconciliation
Creating a link between your catalog record and an entry of the LCSH
unfortunately does not automatically connect you with all
other records that link to that heading!
Always keep in mind that URLs are unidirectional: you point to the LCSH,
but the LoC is unaware that you point to them.
I wanted the act of adding a link to be trivial. So long
as I didn’t introduce some central link database, everything
would scale nicely.
The unidirectionality of links was an explicit design choice. Asking
the linked entity to confirm the link would create too much of a bottleneck
for the Web to grow. Imagine someone at LoC whose job it is to check each link created to the LCSH…
However, alternatives exist, such as Xanadu by Ted Nelson.
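The consequence of unidirectional links can be sketched with two small data sets. All record identifiers and the lcsh:Cubism shorthand are invented for illustration. Following a link outward only needs your own data, whereas finding who links to a heading requires access to every data set that might contain such a link.

```python
# Each catalog stores only its own outgoing links; the target stores nothing.
my_catalog = {"record-123": ["lcsh:Cubism"]}
other_catalog = {"record-987": ["lcsh:Cubism"]}


def outgoing(dataset, record):
    """Links leaving a record: a cheap lookup in one data set."""
    return dataset.get(record, [])


def incoming(target, *datasets):
    """Records pointing at target: only answerable for the data sets we can scan."""
    return sorted(
        record
        for dataset in datasets
        for record, links in dataset.items()
        if target in links
    )


print(outgoing(my_catalog, "record-123"))
print(incoming("lcsh:Cubism", my_catalog))
print(incoming("lcsh:Cubism", my_catalog, other_catalog))
```

With only my_catalog available, the second call cannot know about record-987. The other record shows up only once its data set is also gathered in one place, which is the kind of central index the Web's design avoids.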
Self-assessment 1: thesauri
What key aspect distinguishes thesauri from other forms of controlled vocabularies?
A formal standard exists to verify their well-formedness.
Yes, the ISO 25964 standard defines exactly how a thesaurus should be constructed.
A thesaurus provides description at a more granular level.
No, this does not depend on the type of vocabulary.
Thesauri can be represented in SKOS.
Thesauri can indeed be represented in SKOS, but so can other types of vocabularies, as illustrated by the LCSH.
Self-assessment 2: non-preferred terms
Why add non-preferred terms to a vocabulary?
It reduces the negative effect of synonymy on search results.
Yes! Even if an end-user searches on a synonym that is encoded as a non-preferred term for the preferred term used in indexing, the results are the same.
It reduces the negative effect of polysemy on search results.
No! Adding too many and potentially even irrelevant non-preferred terms will increase the negative impact of polysemy.
You can increase the success rate of the reconciliation process.
Yes! That is, if you have configured the process to include the non-preferred terms.
Self-assessment 3: labels and concepts
How do labels and concepts relate to each other in a SKOS vocabulary?
Labels allow defining the structure whereas concepts can express the specific terms used.
No, it is the other way around: concepts carry the structure, whereas labels express the specific terms used.
Labels are used to express semantic relations between concepts.
No, semantic relations are expressed by using properties such as broader or narrower.
Concepts are abstract units of thought; labels are strings of characters associated with a concept.
Yes!
Self-assessment 4: unidirectionality
Why is it important to acknowledge unidirectionality when creating URLs?
It explains why we don’t need SPARQL.
No, it’s the opposite! SPARQL is precisely what allows us to traverse links in both directions across a graph.
It helps us understand why it isn’t straightforward to connect all the records that point to a central authority file.
Exactly! Linking to the LCSH does not make the LCSH, or other people referring to the same heading, aware that your link exists.
In order to prevent the creation of dead links.
No, but understanding unidirectionality helps us to realize why dead links are unavoidable.