Isabelle Boydens reused Fernand Braudel’s stratified time concept as a
hermeneutical approach to audit the quality of social security databases.
Implications of living in an Empirical World
The work of Boydens demonstrates that we cannot assert a direct correspondence between the empirical, ever-changing world
and the metadata and database schemas representing it.
Defining data quality in a deterministic manner (e.g., the MIT Total Data Quality Management program)
makes no sense for empirical application domains.
Stratified time applied to social security databases by Boydens:
long term: evolution of policies and legislation
intermediate term: evolution of technologies and standards
short term: evolution of the objects the database documents
From theory to practice…
Change is a fundamental notion to deal with…
Boydens underlined that we should not ask "Are the metadata correct?"
but "How do they evolve through time?"
Incorporate tools which help to monitor change
If values do not correspond to the schema, the schema itself may have
to be questioned
The practice of data profiling helps to develop data auditing skills
Getting to grips with data profiling
The use of analytical techniques to discover the true structure, content
and quality of a collection of data.
The next module will explain how a tool such as OpenRefine
can help you to spot data quality issues.
Recipe for a data quality audit
We will focus on the following ingredients:
flattening data
column names
empty columns
data types
length of entries
empty fields
field overloading
Flattening data
Data profiling and cleaning techniques are applied outside the information system itself.
These applications mostly ingest flat files, so an export from your
information system to CSV or TSV files will be needed, as sketched below.
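A minimal sketch of such an export, assuming the source can be reached from Python (here an SQLite file called catalogue.db holding an artwork table; both names are purely illustrative):

import csv
import sqlite3

# Illustrative names: adapt to your own information system and export location.
DB_PATH = "catalogue.db"
EXPORT_PATH = "artwork.csv"

connection = sqlite3.connect(DB_PATH)
cursor = connection.cursor()
cursor.execute("SELECT * FROM artwork")

with open(EXPORT_PATH, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    # Write the column names as a header row, then all data rows.
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor)

connection.close()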
Columns as a starting point
A simple SQL command to add a column to a table:
ALTER TABLE artwork ADD title VARCHAR;
It is surprising how much can go wrong with:
name of the column
data type
length
Interpretation of the title of a field
Seems simple to interpret in the context of our art catalogue, but this is not always the case.
Generic, ambiguous or polysemic names can create a lot of confusion:
What to expect from a field entitled Sold?
A binary yes/no, the name of the seller, the amount?
Unused columns are often reused without being renamed.
Issues with empty columns
Appear everywhere, due to a variety of reasons:
Pre-configured settings of databases resulting in a series of irrelevant columns which are never used
Data can lose their relevance and be deleted at some point, but the columns as such are rarely removed from the system
Fear of losing data prevents people from removing columns, but results in systems with lots of unused columns
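Profiling makes such columns visible. A hedged sketch which scans the exported flat file (reusing the illustrative artwork.csv from the export step) and lists the columns that never contain a value:

import csv

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    # Track for every column whether at least one non-empty value was seen.
    seen_value = {name: False for name in reader.fieldnames}
    for row in reader:
        for name in seen_value:
            value = row.get(name)
            if value is not None and value.strip():
                seen_value[name] = True

empty_columns = [name for name, seen in seen_value.items() if not seen]
print("Columns without a single value:", empty_columns)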
Data types
In our SQL example, VARCHAR was mentioned as a data type.
Other common data types include Text, Number, Boolean, and Date.
All data types can easily be encoded as Text, which saves time at first, but things
can get very messy over time.
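A rough way to verify whether a column behaves like the data type you expect is to try to parse its values. The sketch below counts, for a hypothetical year column in the illustrative artwork.csv export, how many values parse as integers, as ISO dates, or only as free text:

import csv
from datetime import datetime

COLUMN = "year"  # hypothetical column name, adapt to your own export

counts = {"integer": 0, "iso date": 0, "text": 0, "empty": 0}

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        value = (row.get(COLUMN) or "").strip()
        if not value:
            counts["empty"] += 1
        elif value.isdigit():
            counts["integer"] += 1
        else:
            try:
                # Only ISO-style dates (e.g. 1912-10-09) are recognised here.
                datetime.strptime(value, "%Y-%m-%d")
                counts["iso date"] += 1
            except ValueError:
                counts["text"] += 1

print(counts)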
Biggest metadata horror: dates!
There is an incredible range of possibilities to express dates. Here
are just a few examples:
Pattern        Example
empty
9999-9999      1891-1912
9999           1912
99-99/9999     09-10/1912
99/9999        01-1912
99/99/99       04/08/12
AAA 9999       May 1912
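The pattern notation used above (9 for a digit, A for a letter) can be derived automatically, which is essentially what profiling tools do. A minimal sketch, again assuming the illustrative artwork.csv export and a hypothetical date column:

import csv
from collections import Counter

COLUMN = "date"  # hypothetical column name, adapt to your own export

def pattern(value):
    # Replace every digit with 9 and every letter with A, keep punctuation.
    if not value.strip():
        return "empty"
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

patterns = Counter()
with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        patterns[pattern(row.get(COLUMN) or "")] += 1

# The most frequent patterns give a quick overview of how dates are encoded.
for pat, count in patterns.most_common(10):
    print(count, pat)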
Length of entries
Seems to have lost its pertinence with decreasing storage costs
However, the text length of entries can be a very interesting parameter to analyse
If you expect the name of an individual in a field and the average
text length is 250 characters, there might be an issue…
The outlier values especially, i.e. the very short or very long ones, can
be a good place to start your analysis
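Basic length statistics are straightforward to compute per column. The sketch below, based on the same illustrative artwork.csv export, prints the minimum, maximum and average length of every column so that outliers stand out:

import csv

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    lengths = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in lengths:
            lengths[name].append(len(row.get(name) or ""))

# Report per column; very short or very long outliers are worth a closer look.
for name, values in lengths.items():
    if values:
        average = sum(values) / len(values)
        print(name, "min:", min(values), "max:", max(values), "avg:", round(average, 1))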
Empty fields
Variety of options to express that a field is empty:
no value at all, an encoded NULL, etc.
Diversity of reasons: value is not (yet) known, not applicable, lack of resources
Rarely possible to document this
Specific case: trailing white spaces
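Counting how emptiness is actually expressed in a column can be revealing. A hedged sketch for a hypothetical creator column in the illustrative artwork.csv export; it also flags leading or trailing white space, and the placeholder strings it looks for are assumptions to adapt to your own data:

import csv
from collections import Counter

COLUMN = "creator"  # hypothetical column name, adapt to your own export
PLACEHOLDERS = {"null", "n/a", "unknown"}  # assumed placeholder values

observations = Counter()
with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        value = row.get(COLUMN) or ""
        if value == "":
            observations["truly empty"] += 1
        elif not value.strip():
            observations["white space only"] += 1
        elif value.strip().lower() in PLACEHOLDERS:
            observations["placeholder value"] += 1
        elif value != value.strip():
            observations["leading or trailing white space"] += 1
        else:
            observations["regular value"] += 1

print(observations)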
Field overloading
Encoding values in one field which should be split out over multiple fields,
due to:
repeating values
different realities addressed by a generic field
Impacts search and retrieval but also limits data analysis and cleaning
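One pragmatic way to spot candidates for field overloading is to count how many values of each column contain a separator character. A sketch, once more assuming the illustrative artwork.csv export; the list of separators is an assumption to adapt:

import csv
from collections import Counter

SEPARATORS = (";", "|")  # assumed separators, adapt to your own data

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    overloaded = Counter()
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name) or ""
            if any(sep in value for sep in SEPARATORS):
                overloaded[name] += 1

# Columns where many values contain a separator may pack several values into one field.
for name, count in overloaded.most_common():
    print(name, ":", count, "values containing a separator")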
Self-assessment 1: relevance
Why is data quality so relevant in a Linked Data context?
The closed-world assumption of the Linked Data paradigm limits the amount of data available.
No, Linked Data is based on the open-world assumption, implying that no one knows at a given moment exactly what type of data are available and which constraints they respect.
Linked Data holds the potential danger of introducing erroneous and conflicting data.
Yes, without specific efforts to clean the original data sources and without standardised methods and tools to evaluate and compare data sets published as Linked Data, the issue of data quality might seriously undermine the potential of Linked Data for libraries.
The introduction of Linked Data will boost the quality of library catalogs.
It depends! Using data from very diverse and heterogeneous sources might seriously undermine the quality of catalogs.
Self-assessment 2: deterministic data
Why is it important to distinguish deterministic from empirical data when talking about metadata quality?
Contrary to deterministic data, there exist no formal theories to validate empirical data.
Yes! For deterministic data there are fixed theories which no longer evolve, as is the case with algebra: 1 + 1 will always equal 2.
There are more issues with deterministic data.
No, irrelevant answer.
Because empirical data cannot be cleaned.
No; the fact that we cannot establish a direct correspondence between the observable world and the data does not mean that one cannot identify errors and rectify them.
Self-assessment 3: field overloading
What is field overloading and why is it problematic?
The issue arises when you go beyond the number of characters which may be encoded in a field.
No! The length of an entry can definitely be an interesting data quality indicator, but field overloading is not linked to the length of an entry.
This issue mainly arises when you transfer data from a flat file to a database.
No, it tends to be the other way around. Moving from a well-structured database, with clear definitions of fields, to a flat file might result in packing together related but different fields (e.g. given name and family name).
Field overloading occurs when related data are put together in the same field.
Yes, this limits the possibility to define clear encoding constraints and to perform structured searches.
Self-assessment 4: absent values
Why is it important to think about how we communicate about absent values?
In order to save space.
No, this is not a relevant answer.
In order to avoid them at all times.
No! Both for conceptual and operational reasons, it is impossible to avoid empty fields. The important aspect is to document the reason behind the absence of a value.
An empty field can be there for a large variety of reasons. Knowing the reason can be important in order to know how to interpret the absence.
Yes! A value might not be known or not applicable, or there simply might not be enough resources to fill it in.