Part 1 – Modules
Introduction
Understanding data models
Possibilities and limitations of RDF
Data quality
Data profiling and cleaning
Vocabulary reconciliation
Metadata enriching
REST
Decentralization and federation
Conclusions
(Non-)information resources
should be uniquely identifiable.
Pointing to resources with a common name
or description is often ambiguous.
Washington (city? state? president?)
Reuse or mint a URL for them.
Machines especially need unambiguous identifiers.
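DBpedia, for instance, provides distinct URLs
that disambiguate the different Washingtons:
http://dbpedia.org/resource/Washington,_D.C. (the city)
http://dbpedia.org/resource/Washington_(state) (the state)
http://dbpedia.org/resource/George_Washington (the president)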
The URL is part of a broader family
of technologies related to identification.
URL – Uniform Resource Locator
unique identification and location of resources
mailto:ruben.verborgh@ugent.be
URN – Uniform Resource Name
location-independent resource identifier
urn:isbn:0-83891251-6
URI – Uniform Resource Identifier
umbrella term covering both URLs and URNs
Not all characters are allowed in a URI.
The broadest family is IRI,
which supports non-ASCII characters.
IRI – Internationalized Resource Identifier
Non-ASCII characters don’t need to be encoded.
Characters with a reserved meaning still need encoding.
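As an illustration, an IRI may contain é directly,
whereas a URI must percent-encode it (based on UTF-8):
IRI: http://example.org/café
URI: http://example.org/caf%C3%A9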
An HTTP URL identifies and locates
a resource anywhere in the universe.
A string is a unique identifier if
at most one entity corresponds to it.
A national number uniquely identifies a person,
but does not allow locating him or her.
A string is a unique locator if
at most one location corresponds to it.
A street address uniquely identifies a location,
but does not identify a specific person.
Using HTTP URLs ensures that
anybody can look up the resource.
An HTTP URI of a resource can be dereferenced:
use an HTTP client to retrieve a representation.
This relies on the double role of an HTTP URI
as identifier and locator.
Dereferencing is a core principle of Linked Data.
If you don’t know something, look it up.
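For example, dereferencing http://dbpedia.org/resource/Pablo_Picasso
with an Accept: text/turtle header returns a Turtle representation.
An illustrative (not verbatim) excerpt of such a representation:
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dbr:Pablo_Picasso a dbo:Artist ;
    foaf:name "Pablo Picasso"@en .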
By including links to other resources,
we create a Web of Data.
Links connect a resource to known concepts.
Links give meaning to data.
Links allow exploration of related data.
Find more by the same author.
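A sketch of such a link, with a hypothetical local book record
pointing to a DBpedia identifier for its author:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dbr: <http://dbpedia.org/resource/> .
<http://example.org/books/123> dcterms:creator dbr:Umberto_Eco .
A client that follows the dbr:Umberto_Eco link
can discover other books by the same author.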
An immense amount of Linked Data
is available on the Web for reuse.
On the structural level, hundreds of vocabularies
can provide building blocks to model your data.
They provide properties
and classes to reuse.
Although not always counted as Linked Data
because of their small size, most follow all 4 principles.
On the content level, thousands of datasets
provide identifiers and data of individuals.
Strive to reuse identifiers rather than to mint new ones.
No Linked Data set is ever complete.
We make the open-world assumption.
Relational databases use highly rigid structures.
They strive for complete data.
A NULL value can signal missing data,
but this is typically an undesired situation.
With Linked Data, no source has all of the truth.
Other sources might have more data on a subject.
The absence of a fact does not imply its falsehood.
A fact has 2 possible states: true and unknown.
Several vocabularies are used frequently
across different datasets.
modeling vocabularies
general-purpose vocabularies
concept-specific vocabularies
Find the one you need at Linked Open Vocabularies.
The Dublin Core terms are a set of
15 common metadata properties.
Each property is generic,
and hence applicable in many cases.
Many applications use the Dublin Core terms.
good interoperability of high-level semantics
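A minimal sketch (hypothetical book record, real Dublin Core terms):
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dbr: <http://dbpedia.org/resource/> .
<http://example.org/books/123>
    dcterms:title   "The Name of the Rose"@en ;
    dcterms:creator dbr:Umberto_Eco ;
    dcterms:date    "1980" .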
Schema.org is a single vocabulary
that covers many different fields.
Created and maintained by major search engines,
it mainly provides discovery data for search.
Its concepts are defined rather loosely.
This makes it flexible to use for developers.
Machines cannot derive much knowledge from it.
Schema.org is manually curated
and open for extension.
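A minimal sketch of Schema.org discovery data
(hypothetical book record, real Schema.org terms):
@prefix schema: <http://schema.org/> .
<http://example.org/books/123> a schema:Book ;
    schema:name   "The Name of the Rose" ;
    schema:author <http://example.org/authors/eco> .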
Billions of Linked Data facts are
on the Web with an open license.
The most well-known dataset is DBpedia.
Data is extracted automatically from Wikipedia.
Like Wikipedia, it exists in several different languages.
Its quality is acceptable for many queries.
Wikidata is a manually curated alternative.
It has its own data model on top of RDF.
It grows fast and might overtake DBpedia.
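Both datasets often describe the same individuals.
Such identifiers can be connected with owl:sameAs,
for example (Picasso in DBpedia and Wikidata):
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix wd:  <http://www.wikidata.org/entity/> .
dbr:Pablo_Picasso owl:sameAs wd:Q5593 .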
You can find many other datasets on Datahub.
RDF Schema is an RDF vocabulary
to model other RDF vocabularies.
RDFS defines classes, properties, and datatypes
that are used to define vocabularies.
RDFS defines concepts in two namespaces.
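A minimal sketch of an RDFS vocabulary
(hypothetical ex: terms; rdf: and rdfs: are the two namespaces):
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .
ex:Person a rdfs:Class .
ex:Artist a rdfs:Class ;
    rdfs:subClassOf ex:Person .
ex:influencedBy a rdf:Property ;
    rdfs:domain ex:Person ;
    rdfs:range  ex:Person .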
Practitioners in the RDF world often
refer to vocabularies as ontologies.
RDFS captures basic ontological relations,
but lacks several common and important concepts.
cardinality restrictions on properties
inverse, symmetric, and transitive properties
equality and disjointness
…
OWL extends RDFS with advanced concepts.
RDFS and OWL are used side by side.
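A minimal sketch of such OWL additions (hypothetical ex: terms):
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/vocab#> .
ex:influencedBy a owl:ObjectProperty ;
    owl:inverseOf ex:influenced .
ex:partOf a owl:TransitiveProperty .
ex:Person owl:disjointWith ex:Artwork .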
SPARQL is a query language.
Select specific data from an RDF dataset.
Insert, change, or delete data in an RDF dataset.
The SPARQL protocol is a Web API definition
for querying in the SPARQL language over HTTP.
A SPARQL endpoint executes SPARQL queries by clients
through HTTP, and replies with their results.
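In its simplest form, a client sends the URL-encoded query
as the query parameter of an HTTP GET request to the endpoint,
for example against DBpedia’s public endpoint:
https://dbpedia.org/sparql?query=SELECT%20*%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%201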
A basic graph pattern (BGP) is a set of triple patterns.
Their syntax is a superset of Turtle.
A triple pattern is a triple in which
each of the components can be a variable.
Variables start with a question mark (?name).
A SPARQL query engine finds solution mappings.
Variables and blank nodes are mapped to URIs,
blank nodes, or literals according to dataset triples.
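For instance, the triple pattern
?person dbo:influencedBy dbr:Pablo_Picasso.
could yield a solution mapping such as
?person → dbr:Salvador_Dalí
(assuming the dataset contains a matching triple).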
This query searches DBpedia
for artists influenced by Picasso.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person a dbo:Artist.
  ?person foaf:name ?personLabel.
  ?person dbo:influencedBy dbr:Pablo_Picasso.
}
Here is the live result of that query.
This query searches Wikidata
for artists influenced by Picasso.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person wdt:P106/wdt:P279* wd:Q483501.  # occupation is artist or a subclass of it
  ?person wdt:P737 wd:Q5593.              # influenced by Pablo Picasso
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
Here is the live result of that query.
Why are the results
and the queries different?
DBpedia and Wikidata contain different data.
So far, so good. This was expected.
DBpedia and Wikidata use different ontologies.
Ideally, the same SPARQL queries would suffice.
…and they can, when ontologies link to each other.
In practice, some bridging is still needed.
Semantic Web reasoning can bridge the gap,
but is often not enabled on query endpoints.
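A hypothetical bridging axiom (not necessarily published anywhere)
that a reasoner could use to make both queries interchangeable:
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
dbo:influencedBy owl:equivalentProperty wdt:P737 .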
Heterogeneity exists on multiple levels
across metadata collections.
Heterogeneity exists on the data level.
We can choose our own vocabularies.
How do we ensure they align?
Heterogeneity exists on the interface level.
We can choose how consumers can query our data.
How can clients consume multiple datasets easily?
Heterogeneity is our best friend
and our largest enemy.
Anybody on the Web is free
to publish however they want.
This works great for people (sometimes).
It often doesn’t work great for machines.
Standardization helps us align.
delicate balance between flexibility and interoperability
Standardization and agreement
have provided us with foundations.
the Semantic Web family of standards
ontologies and vocabularies
Dublin Core
DBpedia ontology
Wikidata ontology
Schema.org
…
Web APIs: not yet
Linked Data Platform
OAI-ORE
The current level of standardization
still leaves some areas uncovered.
vocabulary usage
vocabulary agreement
the right terms for the right clients
Web APIs
stop reinventing the wheel
Which vocabularies should we use
to describe our metadata, and how?
We need to develop examples and guidance.
vocabulary usage
URL strategy
…
Reasoning can fill vocabulary gaps.
We can never cover all vocabularies.
Web APIs are the Achilles’ heel
of interoperability on the Web.
Shall we all have our SPARQL endpoints?
Shall we all support the Linked Data Platform?
That doesn’t solve querying…
Shall we all have our own custom APIs?
That’s not a sustainable approach.
See more in the REST module.
Self-assessment 1: HTTP URLs
Why are HTTP URLs important for Linked Data?
HTTP URLs are not important for Linked Data.
No. While not necessary for the RDF data model itself,
the Linked Data principles mandate HTTP URLs.
Because they guarantee consistent semantics.
No. Semantic consistency comes from
the reuse of unique concept identifiers (URIs),
not specifically from HTTP URLs.
So we can look up (un)known concepts.
Yes. HTTP URLs can be dereferenced:
to obtain more data about a concept, follow its URL.
Self-assessment 2: OWL and RDFS
Which of the following propositions are true?
RDFS and OWL are an answer to Schema.org.
No: Schema.org is (mainly) a vocabulary
with terms to describe concrete things,
such as books, people, articles, …
RDFS and OWL contain terms to model other ontologies or vocabularies
(such as Schema.org).
OWL replaces RDFS.
No: RDFS is still required to express basic ontological relations.
As such, RDFS and OWL are often used side by side.
OWL extends RDFS.
Yes: OWL extends RDFS with more advanced ontological concepts.
Self-assessment 3: SPARQL
What is SPARQL?
A data model.
No. The data model is RDF.
A query language.
Yes, SPARQL is a query language for RDF.
A protocol.
Yes, SPARQL is a protocol to execute SPARQL queries over HTTP.
Self-assessment 4: SPARQL queries
Does the same SPARQL query return the same results
on different sources about the same metadata?
In theory, but not in practice.
Yes: in theory, SPARQL queries should be interoperable across datasources. Even if different sources use different ontologies, reasoning can bridge the gap.
In practice, but not in theory.
No: in practice, datasets use different ontologies and most endpoints do not have reasoning enabled to bridge the gap.
In theory and in practice.
No: in practice, datasets use different ontologies and most endpoints do not have reasoning enabled to bridge the gap.