Models help us to abstract from reality. You may think of a data model as a
particular pair of glasses, influencing the way in which you see the world.
Implementation and serialization formats
Models are made concrete through file formats, query languages and software.
How does the model address updates and sharing?
Models and serialization formats
Tabular data
CSV, TSV
Relational model
binary files
Meta-markup languages
SGML, XML
RDF
Turtle, N-Triples, RDF-XML
First model: tabular data
Also referred to as flat files
Intuitive approach to organize data
Represents the world in one big gigantic table or spreadsheet
Consists of columns and rows, their intersection gives meaning to the data
Tabular data—example
Title           | Creator       | Date | Collection
Guernica        | Pablo Picasso | 1937 | Museo Reina Sofia
First Communion | Picasso       | 1895 | Museo Picasso
Puppy           | Koons, Jeff   | 1992 | Guggenheim
Serializing tabular data
Common serialization formats for tabular data include
CSV
and TSV
Example of our data in CSV:
title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
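As a minimal sketch of how a parser reads this example, the following uses Python's standard csv module. The variable and function names are illustrative. Note how the quoted field "Koons, Jeff" survives as a single value even though it contains the delimiter.

```python
import csv
import io

# The CSV example from above, one record per line.
CSV_TEXT = '''title,creator,date,collection
Guernica,Pablo Picasso,1937,Reina Sofia
First Communion,Picasso,1895,Museo Picasso
Puppy,"Koons, Jeff",1992,Guggenheim
'''

def read_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(CSV_TEXT)
for record in records:
    print(record["title"], "|", record["creator"])
```

Every value comes back as a string, including the date: CSV itself carries no type information.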
Limits and possibilities of tabular data?
Data quality: prone to inconsistencies!
Search and retrieval: ineffective
Updates—change: easy
Distribution: easy
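The example table already shows the first two points: the same creator is spelled "Pablo Picasso" in one row and "Picasso" in another, and nothing in a flat file prevents this. A minimal sketch in Python, with illustrative names, of what that does to a simple search:

```python
# Rows of the example table: (title, creator, date, collection).
ROWS = [
    ("Guernica", "Pablo Picasso", "1937", "Museo Reina Sofia"),
    ("First Communion", "Picasso", "1895", "Museo Picasso"),
    ("Puppy", "Koons, Jeff", "1992", "Guggenheim"),
]

def titles_by_creator(rows, creator):
    """Return titles whose creator cell matches the given string exactly."""
    return [title for title, row_creator, _, _ in rows if row_creator == creator]

exact = titles_by_creator(ROWS, "Pablo Picasso")
# Only one of the two Picasso works is found, because the second row
# spells the creator differently.
print(exact)
```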
How do we overcome the inconsistencies and poor search within tabular data?
Copy/paste the SQL code from the previous slides in order to create
a table, insert metadata and query them
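For readers without the earlier SQL at hand, here is a minimal stand-in sketch using Python's built-in sqlite3 module. The table and column names (creators, works, creator_id) are assumptions made for illustration, not the schema used in the course.

```python
import sqlite3

def build_catalog():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE creators (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE works (
            id         INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            year       INTEGER,
            creator_id INTEGER NOT NULL REFERENCES creators(id)
        );
    """)
    conn.executemany("INSERT INTO creators (id, name) VALUES (?, ?)",
                     [(1, "Pablo Picasso"), (2, "Jeff Koons")])
    conn.executemany(
        "INSERT INTO works (title, year, creator_id) VALUES (?, ?, ?)",
        [("Guernica", 1937, 1), ("First Communion", 1895, 1), ("Puppy", 1992, 2)],
    )
    return conn

conn = build_catalog()
# Each creator is stored once, so one query finds all of Picasso's works.
picasso_titles = [row[0] for row in conn.execute(
    "SELECT w.title FROM works AS w JOIN creators AS c ON c.id = w.creator_id "
    "WHERE c.name = ? ORDER BY w.year", ("Pablo Picasso",))]
print(picasso_titles)
```

Storing each creator in one row and referring to it by key is what removes the spelling variants that plague the flat file.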
Dealing with change
Updating the schema of a database can be very complex
Apart from ensuring the normalization of the modified schema, the modifications
might also affect public front-ends
Quick-and-dirty ad hoc solutions are often adopted,
with disastrous consequences over time
Sharing your Data
Both data and the schema are locked up in a binary file,
coupled to a specific software—you can’t just copy/paste a database and
give it to someone else!
Leaving aside the technical issues of migrating and integrating
databases, the main complexity resides in the semantics:
a database schema is rarely well documented in practice, making it very
complex to understand how sometimes hundreds of tables are connected
Third model: markup
Origins of markup lie in the tradition of typesetting,
where an author marks up
a manuscript in order to explain how it should be printed
Certain passages, such as the titles of chapters, should be printed in bold,
whereas footnotes should be printed in a smaller font than the normal
paragraphs
Two options: either you apply makeup or markup…
Applying makeup
The quick and dirty way…
In the context of HTML, applying makeup would imply the following:
<font size="20"><b>Linked Data for
librarians</b></font>
We simply indicate how that specific string
of characters should be displayed—it’s makeup!
Applying markup
Indicate the role a part of a document
plays and define separately the aesthetics of that role
Let’s use markup on our HTML example:
h1 {font-size: 20pt; font-weight: bold}
…
<h1>Linked Data for librarians</h1>
Defining the lay-out of structural elements of a document (e.g. h1),
opens a new world of possibilities
HTML detour
1989: Tim Berners-Lee was inspired by SGML
but simplified it radically by proposing a fixed set
of tags to represent the structural components of a Web page
Examples: <head>, <body>,
<h1>, etc.
Most people forgot all about SGML, but its influence has been
enormous…
XML
End of the 1990s => desire to focus again on the structure and not the lay-out
of the Web
XML 1.0: W3C recommendation in 1998
Effort was made to maintain 80% of SGML’s functionality
with only 20% of its complexity
Big impact: open standard which is platform and application independent
Modeling XML
Structure documents with
elements: serialized as tags surrounded by angle brackets (<tag>)
attributes: key/value modifiers of a tag
Each document starts with a declaration:
<?xml version="1.0" encoding="UTF-8"?>
<Art title="Modern art"/>
Let’s add a work to our XML catalog of art objects
Note how we can model all of the metadata as attributes:
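A minimal sketch of one such record, built here with Python's xml.etree.ElementTree so that the output is well-formed. The attribute names (title, creator, year) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Root element of the catalog, as declared above.
art = ET.Element("Art", {"title": "Modern art"})

# A work whose metadata is carried entirely by attributes (names are assumed).
ET.SubElement(art, "Work", {"title": "Puppy", "creator": "Jeff Koons", "year": "1992"})

xml_text = ET.tostring(art, encoding="unicode")
print(xml_text)
```

Attributes keep the record compact, but each one can hold only a single flat value; anything that needs internal structure, such as a creator with separate first and last names, is better modeled as a child element.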
XPath allows you to traverse XML trees and collect element and attribute values
Examples:
/Art/Work/Creator
Creator/LastName
Work/descendant::LastName
Work/@year
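To try path expressions like these, here is a sketch using Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The catalog document is hypothetical: its element and attribute names are assumptions chosen to match the paths listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element and attribute names are assumptions.
CATALOG = """<?xml version="1.0" encoding="UTF-8"?>
<Art title="Modern art">
  <Work year="1937">
    <Title>Guernica</Title>
    <Creator><FirstName>Pablo</FirstName><LastName>Picasso</LastName></Creator>
  </Work>
  <Work year="1992">
    <Title>Puppy</Title>
    <Creator><FirstName>Jeff</FirstName><LastName>Koons</LastName></Creator>
  </Work>
</Art>"""

root = ET.fromstring(CATALOG.encode("utf-8"))  # root is the <Art> element

# /Art/Work/Creator: paths are written relative to the root element here.
creators = root.findall("Work/Creator")

# Work/descendant::LastName: ElementTree spells the descendant axis as ".//".
last_names = [e.text for w in root.findall("Work") for e in w.findall(".//LastName")]

# Work/@year: attribute values are read with .get() rather than "@year".
years = [w.get("year") for w in root.findall("Work")]

print(len(creators), last_names, years)
```

A full XPath engine accepts the expressions exactly as written above; the rewrites in the comments are only needed because of ElementTree's restricted syntax.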
When to use a database or XML?
The quick answer is: it depends on the context
Read this
paper by a historian who explains the pros and cons of
each approach to model an inventory
XML is often criticized for its verbose nature, and
JSON has become more popular
to represent structured data
Limitations?
Inconsistencies: usage of XML Schema to validate data is powerful
Search and retrieval: less performant than relational databases
Updates—change: as painful as with relational databases
Distribution: open standard, but still one needs to understand
the schema
Fourth model: RDF
Resource Description Framework (RDF)
Worldview consisting of a gigantic ever-expanding graph of triples
A triple consists of a subject, a predicate and an object
Any resource (the subject) can have a relationship (the predicate) to any other resource (the object)
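A minimal sketch of this idea in Python: a triple is a three-part record, and a graph is simply a set of them. The "ex:" identifiers are illustrative placeholders, not terms from a real vocabulary.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # named "obj" to avoid shadowing Python's built-in "object"

# Identifiers are illustrative placeholders.
graph = {
    Triple("ex:Puppy", "ex:creator", "ex:Jeff_Koons"),
    Triple("ex:Guernica", "ex:creator", "ex:Pablo_Picasso"),
}

# The graph can grow at any time: any resource may be linked to any other.
graph.add(Triple("ex:Puppy", "ex:collection", "ex:Guggenheim"))

resources = {t.subject for t in graph} | {t.obj for t in graph}
print(sorted(resources))
```

Unlike a table, there is no fixed set of columns: a new kind of statement only needs a new predicate.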
Model
Through a radically simplified data model, the semantics are made explicit by the triple itself
Both databases and XML are based on the principle that only data conforming to a locally defined schema
may exist => closed world assumption
RDF sails under the open world paradigm flag
Serialization—different options
RDF/XML
Developed in 2001 at the beginning of the XML omnipresence.
Now considered to be too verbose and hard to parse
Turtle
Allows RDF triples to be expressed in a compact and natural text form.
The components (subject, predicate, object) are separated by whitespace, and a triple
ends with a dot
Turtle example
Let’s see how we can express the metadata explaining Jeff Koons created
the artwork Puppy:
Multiple statements about the same subject can be written
tersely by using a semicolon if the subject is repeated, and a comma
if the subject and predicate are repeated
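The semicolon and comma shorthand can be sketched as a small Python function that groups triples before writing them out. This is an illustration of the abbreviation rules only, not a complete Turtle serializer: the "ex:" prefixed names are hypothetical, and a real file would also need matching @prefix declarations.

```python
from collections import defaultdict

def to_turtle(triples):
    """Group triples by subject, then predicate, and emit Turtle-style text."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][predicate].append(obj)

    blocks = []
    for subject, predicates in grouped.items():
        # Objects sharing subject and predicate are joined by commas.
        parts = [f"{pred} {', '.join(objs)}" for pred, objs in predicates.items()]
        # Predicates sharing a subject are joined by semicolons; a dot ends it.
        blocks.append(f"{subject} " + " ;\n    ".join(parts) + " .")
    return "\n".join(blocks)

# Prefixed names are hypothetical placeholders.
TRIPLES = [
    ("ex:Jeff_Koons", "ex:created", "ex:Puppy"),
    ("ex:Jeff_Koons", "ex:created", "ex:Balloon_Dog"),
    ("ex:Jeff_Koons", "ex:name", '"Jeff Koons"'),
]
turtle = to_turtle(TRIPLES)
print(turtle)
```

The three input triples collapse into a single statement block: the subject appears once, the repeated predicate appears once, and the block ends with one dot.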
RDF includes literal values (Puppy) in its model as well, as some properties
ultimately do not point at another resource but rather at a non-decomposable value.
Also, note how :influencedBy has an empty namespace prefix,
which indicates that it is locally defined:
SPARQL: recursive acronym for SPARQL Protocol and RDF Query Language
Example: let’s retrieve all triples which have Picasso as the subject:
SELECT ?predicate ?object
WHERE { <http://dbpedia.org/resource/Pablo_Picasso> ?predicate ?object . }
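What this query asks for can be sketched in plain Python as pattern matching over a list of triples: the subject is fixed, and the two variables are bound to whatever fills the other positions. The sample graph is made up for illustration, and a real triplestore would of course evaluate the query with indexes rather than a linear scan.

```python
PICASSO = "http://dbpedia.org/resource/Pablo_Picasso"

# A tiny in-memory graph; predicates and objects are illustrative placeholders.
GRAPH = [
    (PICASSO, "ex:name", '"Pablo Picasso"'),
    (PICASSO, "ex:created", "ex:Guernica"),
    ("ex:Jeff_Koons", "ex:created", "ex:Puppy"),
]

def select_predicate_object(graph, subject):
    """Mimic: SELECT ?predicate ?object WHERE { <subject> ?predicate ?object . }"""
    return [(p, o) for s, p, o in graph if s == subject]

bindings = select_predicate_object(GRAPH, PICASSO)
for predicate, obj in bindings:
    print(predicate, obj)
```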
Implementation
Triplestore: database used to store and query RDF triples
Either natively built or on top of existing relational databases
Despite recent developments, performance remains an issue
Limits of RDF?
Inconsistencies: fantastic in theory, but often problematic in practice
Search and retrieval: tremendous possibilities, but complex to execute
Updates—change: new information can be added at any point
Distribution: where the model shines!
Self-assessment 1: tabular data
Creating metadata as tabular data is a bad idea:
If you want to share your metadata.
No, tabular data is platform-independent and very easy to exchange
If you want to avoid spelling mistakes.
Yes, flat files do not offer the possibility to ensure consistency when encoding metadata
If you want to express hierarchy in your metadata.
Yes, tabular data does not give the opportunity to express hierarchy. Relational databases or XML are then a better fit
Self-assessment 2: entities or attributes
How to decide to represent something as an entity or as an attribute within a database schema?
Opt for an attribute if it is an important aspect of the reality
you are modeling.
No! In that case an entity would be a better choice as you can
further document the entity with the help of attributes.
It depends on how important that part of reality is within the database.
Yes! If you want to give additional information about something,
it will be better to model it as an entity, as you can attach extra information
to an entity with attributes.
It depends on the amount of data you want to store in the database.
No, this is an irrelevant argument.
Self-assessment 3: XML and preservation
Why is XML interesting from a digital preservation point of view?
XML files are non-binary files.
Yes, XML files are text files which can be opened and modified with a simple text editor and are independent of any particular software.
XML is self-describing.
Yes and no: XML tags allow you to state explicitly the role of a specific part of a document, but the interpretation of the name of a tag might be problematic. The name of a tag might quickly lose its meaning after a couple of years, especially when acronyms are used.
XML files take up less space than JSON.
No, JSON was actually developed as a reaction to the verbose nature of XML and drastically reduces the number of characters used to represent data.
Self-assessment 4: XML and HTML
How is XML related to HTML?
XML is a subset of HTML.
No, XML is a meta-markup language whereas HTML is simply a markup language.
Both are examples of a markup language.
The precise answer is no, as XML is a meta-markup language: you have the possibility to define your own tags. HTML is simply a markup language with pre-defined tags.
XML was developed as a reaction to the evolution of HTML from markup to makeup.
Yes! Towards the end of the 1990s, browsers proposed tags such as <blink> which merely had an aesthetic role, undermining the potential for a smarter Web.