Part 1 – Modules
Introduction
Understanding data models
Possibilities and limitations of RDF
Data quality
Data profiling and cleaning
Vocabulary reconciliation
Metadata enriching
REST
Decentralization and federation
Conclusions
(Non-)information resources
should be uniquely identifiable.
Pointing to resources with a common name
or description is often ambiguous.
Washington (city? state? president?)
Reuse or mint a URL for them.
Machines especially need unambiguous identifiers.
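DBpedia, for instance, provides distinct URLs
that disambiguate the different Washingtons:
http://dbpedia.org/resource/Washington,_D.C. (the city)
http://dbpedia.org/resource/Washington_(state) (the state)
http://dbpedia.org/resource/George_Washington (the president)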
The URL is part of a broader family
of technologies related to identification.
URL – Uniform Resource Locator
unique identification and location of resources
mailto:ruben.verborgh@ugent.be
URN – Uniform Resource Name
location-independent resource identifier
urn:isbn:0-83891251-6
URI – Uniform Resource Identifier
umbrella term covering both URLs and URNs
Not all characters are allowed in a URI.
The broadest family is IRI,
which supports non-ASCII characters.
IRI – Internationalized Resource Identifier
Non-ASCII characters don’t need to be encoded.
Characters with a reserved meaning still need encoding.
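As an illustration, an IRI may contain é directly,
whereas a URI must percent-encode it (based on UTF-8):
IRI: http://example.org/café
URI: http://example.org/caf%C3%A9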
An HTTP URL identifies and locates
a resource anywhere in the universe.
A string is a unique identifier if
at most one entity corresponds to it.
A national number uniquely identifies a person,
but does not allow locating him or her.
A string is a unique locator if
at most one location corresponds to it.
A street address uniquely identifies a location,
but does not identify a specific person.
Using HTTP URLs ensures that
anybody can look up the resource.
An HTTP URI of a resource can be dereferenced:
use an HTTP client to retrieve a representation.
This relies on the double role of an HTTP URI
as identifier and locator.
Dereferencing is a core principle of Linked Data.
If you don’t know something, look it up.
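For example, dereferencing http://dbpedia.org/resource/Pablo_Picasso
with an Accept: text/turtle header returns a Turtle representation.
An illustrative (not verbatim) excerpt of such a representation:
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dbr:Pablo_Picasso a dbo:Artist ;
    foaf:name "Pablo Picasso"@en .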
By including links to other resources,
we create a Web of Data.
Links connect a resource to known concepts.
Links give meaning to data.
Links allow exploration of related data.
Find more by the same author.
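A sketch of such a link, with a hypothetical local book record
pointing to a DBpedia identifier for its author:
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dbr: <http://dbpedia.org/resource/> .
<http://example.org/books/123> dcterms:creator dbr:Umberto_Eco .
A client that follows the dbr:Umberto_Eco link
can discover other books by the same author.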
An immense amount of Linked Data
is available on the Web for reuse.
On the structural level, hundreds of vocabularies
can provide building blocks to model your data.
They provide properties
and classes to reuse.
Although not always counted as Linked Data
because of their small size, most follow all 4 principles.
On the content level, thousands of datasets
provide identifiers and data of individuals.
Strive to reuse identifiers rather than to mint new ones.
No Linked Data set is ever complete.
We make the open-world assumption.
Relational databases use highly rigid structures.
They strive for complete data.
A NULL value can signal missing data,
but this is typically an undesired situation.
With Linked Data, no source has all of the truth.
Other sources might have more data on a subject.
The absence of a fact does not imply its falsehood.
A fact has 2 possible states: true and unknown.
Several vocabularies are used frequently
across different datasets.
modeling vocabularies
general-purpose vocabularies
concept-specific vocabularies
Find the one you need at Linked Open Vocabularies.
The Dublin Core terms are a set of
15 common metadata properties.
Each property is generic,
and hence applicable in many cases.
Many applications use the Dublin Core terms.
good interoperability of high-level semantics
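A minimal sketch (hypothetical book record, real Dublin Core terms):
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dbr: <http://dbpedia.org/resource/> .
<http://example.org/books/123>
    dcterms:title   "The Name of the Rose"@en ;
    dcterms:creator dbr:Umberto_Eco ;
    dcterms:date    "1980" .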
Schema.org is a single vocabulary
that covers many different fields.
Created and maintained by major search engines,
it mainly provides discovery data for search.
Its concepts are defined rather loosely.
This makes it flexible to use for developers.
Machines cannot derive much knowledge from it.
Schema.org is manually curated
and open for extension.
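A minimal sketch of Schema.org discovery data
(hypothetical book record, real Schema.org terms):
@prefix schema: <http://schema.org/> .
<http://example.org/books/123> a schema:Book ;
    schema:name   "The Name of the Rose" ;
    schema:author <http://example.org/authors/eco> .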
Billions of Linked Data facts are
on the Web with an open license.
The most well-known dataset is DBpedia.
Data is extracted automatically from Wikipedia.
Like Wikipedia, it exists in several different languages.
Its quality is acceptable for many queries.
Wikidata is a manually curated alternative.
It has its own data model on top of RDF.
It grows fast and might overtake DBpedia.
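Both datasets often describe the same individuals.
Such identifiers can be connected with owl:sameAs,
for example (Picasso in DBpedia and Wikidata):
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix wd:  <http://www.wikidata.org/entity/> .
dbr:Pablo_Picasso owl:sameAs wd:Q5593 .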
You can find many other datasets on Datahub.
RDF Schema is an RDF vocabulary
to model other RDF vocabularies.
RDFS defines classes, properties, and datatypes
that are used to define vocabularies.
RDFS defines concepts in two namespaces.
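A minimal sketch of an RDFS vocabulary
(hypothetical ex: terms; rdf: and rdfs: are the two namespaces):
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/vocab#> .
ex:Person a rdfs:Class .
ex:Artist a rdfs:Class ;
    rdfs:subClassOf ex:Person .
ex:influencedBy a rdf:Property ;
    rdfs:domain ex:Person ;
    rdfs:range  ex:Person .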
Practitioners in the RDF world often
refer to vocabularies as ontologies.
RDFS captures basic ontological relations,
but lacks several common and important concepts.
cardinality restrictions on properties
inverse, symmetric, and transitive properties
equality and disjointness
…
OWL extends RDFS with advanced concepts.
RDFS and OWL are used side by side.
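A minimal sketch of such OWL additions (hypothetical ex: terms):
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/vocab#> .
ex:influencedBy a owl:ObjectProperty ;
    owl:inverseOf ex:influenced .
ex:partOf a owl:TransitiveProperty .
ex:Person owl:disjointWith ex:Artwork .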
SPARQL is a query language.
Select specific data from an RDF dataset.
Insert, change, or delete data in an RDF dataset.
The SPARQL protocol is a Web API definition
for querying in the SPARQL language over HTTP.
A SPARQL endpoint executes SPARQL queries by clients
through HTTP, and replies with their results.
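In its simplest form, a client sends the URL-encoded query
as the query parameter of an HTTP GET request to the endpoint,
for example against DBpedia’s public endpoint:
https://dbpedia.org/sparql?query=SELECT%20*%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%201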
A basic graph pattern (BGP) is a set of triple patterns.
Their syntax is a superset of Turtle.
A triple pattern is a triple in which
each of the components can be a variable.
Variables start with a question mark (?name).
A SPARQL query engine finds solution mappings.
Variables and blank nodes are mapped to URIs,
blank nodes, or literals according to dataset triples.
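For instance, the triple pattern
?person dbo:influencedBy dbr:Pablo_Picasso.
could yield a solution mapping such as
?person → dbr:Salvador_Dalí
(assuming the dataset contains a matching triple).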
This query searches DBpedia
for artists influenced by Picasso.
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person a dbo:Artist.
  ?person foaf:name ?personLabel.
  ?person dbo:influencedBy dbr:Pablo_Picasso.
}
Here is the live result of that query.
This query searches Wikidata
for artists influenced by Picasso.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
SELECT DISTINCT ?person ?personLabel WHERE {
  ?person wdt:P106/wdt:P279* wd:Q483501.  # occupation is artist or a subclass of it
  ?person wdt:P737 wd:Q5593.              # influenced by Pablo Picasso
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
Here is the live result of that query.
Why are the results
and the queries different?
DBpedia and Wikidata contain different data.
So far, so good. This was expected.
DBpedia and Wikidata use different ontologies.
Ideally, the same SPARQL queries would suffice.
…and they can, when ontologies link to each other.
In practice, some bridging is still needed.
Semantic Web reasoning can bridge the gap,
but is often not enabled on query endpoints.
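A hypothetical bridging axiom (not necessarily published anywhere)
that a reasoner could use to make both queries interchangeable:
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
dbo:influencedBy owl:equivalentProperty wdt:P737 .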
Heterogeneity exists on multiple levels
across metadata collections.
Heterogeneity exists on the data level.
We can choose our own vocabularies.
How do we ensure they align?
Heterogeneity exists on the interface level.
We can choose how consumers can query our data.
How can clients consume multiple datasets easily?
Heterogeneity is our best friend
and our largest enemy.
Anybody on the Web is free
to publish however they want.
This works great for people (sometimes).
It often doesn’t work great for machines.
Standardization helps us align.
delicate balance between flexibility and interoperability
Standardization and agreement
have provided us with foundations.
the Semantic Web family of standards
ontologies and vocabularies
Dublin Core
DBpedia ontology
Wikidata ontology
Schema.org
…
Web APIs: not yet
Linked Data Platform
OAI-ORE
The current level of standardization
still leaves some areas uncovered.
vocabulary usage
vocabulary agreement
the right terms for the right clients
Web APIs
stop reinventing the wheel
Which vocabularies should we use
to describe our metadata, and how?
We need to develop examples and guidance.
vocabulary usage
URL strategy
…
Reasoning can fill vocabulary gaps.
We can never cover all vocabularies.
Web APIs are the Achilles’ heel
of interoperability on the Web.
Shall we all have our SPARQL endpoints?
Shall we all support the Linked Data Platform?
That doesn’t solve querying…
Shall we all have our own custom APIs?
That’s not a sustainable approach.
See more in the REST module.
Self-assessment 1: HTTP URLs
Why are HTTP URLs important for Linked Data?
HTTP URLs are not important for Linked Data.
No. While not necessary for the RDF data model itself,
the Linked Data principles mandate HTTP URLs.
Because they guarantee consistent semantics.
No. Semantic consistency comes from
the reuse of unique concept identifiers (URIs),
not specifically from HTTP URLs.
So we can look up (un)known concepts.
Yes. HTTP URLs can be dereferenced:
to obtain more data about a concept, follow its URL.
Self-assessment 2: OWL and RDFS
Which of the following propositions are true?
RDFS and OWL are an answer to Schema.org.
No: Schema.org is (mainly) a vocabulary
with terms to describe concrete things,
such as books, people, articles, …
RDFS and OWL contain terms to model other ontologies or vocabularies
(such as Schema.org).
OWL replaces RDFS.
No: RDFS is still required to express basic ontological relations.
As such, RDFS and OWL are often used side by side.
OWL extends RDFS.
Yes: OWL extends RDFS with more advanced ontological concepts.
Self-assessment 3: SPARQL
What is SPARQL?
A data model.
No. The data model is RDF.
A query language.
Yes, SPARQL is a query language for RDF.
A protocol.
Yes, SPARQL is a protocol to execute SPARQL queries over HTTP.
Self-assessment 4: SPARQL queries
Does the same SPARQL query return the same results
on different sources about the same metadata?
In theory, but not in practice.
Yes: in theory, SPARQL queries should be interoperable across datasources. Even if different sources use different ontologies, reasoning can bridge the gap.
In practice, but not in theory.
No: in practice, datasets use different ontologies and most endpoints do not have reasoning enabled to bridge the gap.
In theory and in practice.
No: in practice, datasets use different ontologies and most endpoints do not have reasoning enabled to bridge the gap.