Isabelle Boydens reused Fernand Braudel’s stratified time concept as a
hermeneutical approach to audit the quality of social security databases.
Implications of living in an Empirical World
The work of Boydens demonstrates that we cannot assert a direct correspondence between the empirical, ever-changing world
and the metadata and database schemas representing it.
Defining data quality in a deterministic manner (e.g., the MIT Total Data Quality Management program)
makes no sense for empirical application domains.
Stratified time applied to social security databases by Boydens:
long term: evolution of policies and legislation
intermediate term: evolution of technologies and standards
short term: evolution of the objects the database documents
From theory to practice…
Change is a fundamental notion to deal with…
Boydens underlined that we should not ask "Are the metadata correct?"
but "How do they evolve through time?"
Incorporate tools which help to monitor change
If values do not correspond to the schema, the schema itself may have
to be questioned
The practice of data profiling helps to develop data auditing skills
Getting to grips with data profiling
The use of analytical techniques to discover the true structure, content
and quality of a collection of data.
The next module will explain how a tool such as OpenRefine
can help you to spot data quality issues.
Recipe for a data quality audit
We will focus on the following ingredients:
flattening data
column names
empty columns
data types
length of entries
empty fields
field overloading
Flattening data
Data profiling and cleaning techniques are applied outside the information system itself.
These applications mostly ingest flat files, so an export from your
information system to CSV or TSV files will be needed, as sketched below.
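A minimal sketch of such an export, assuming the source can be reached from Python (here an SQLite file called catalogue.db holding an artwork table; both names are purely illustrative):

import csv
import sqlite3

# Illustrative names: adapt to your own information system and export location.
DB_PATH = "catalogue.db"
EXPORT_PATH = "artwork.csv"

connection = sqlite3.connect(DB_PATH)
cursor = connection.cursor()
cursor.execute("SELECT * FROM artwork")

with open(EXPORT_PATH, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    # Write the column names as a header row, then all data rows.
    writer.writerow([description[0] for description in cursor.description])
    writer.writerows(cursor)

connection.close()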
Columns as a starting point
A simple SQL command to add a column to a table:
ALTER TABLE artwork ADD title VARCHAR;
It is surprising how much can go wrong with:
name of the column
data type
length
Interpretation of the title of a field
Seems simple to interpret in the context of our art catalogue, but this is not always the case.
Generic, ambiguous or polysemic names can create a lot of confusion:
What to expect from a field entitled Sold?
A binary yes/no, the name of the seller, the amount?
Unused columns are often reused without being renamed.
Issues with empty columns
Appear everywhere, due to a variety of reasons:
Pre-configured settings of databases resulting in a series of irrelevant columns which are never used
Data can lose their relevance and be deleted at some point, but the columns as such are rarely removed from the system
Fear of losing data prevents people from removing columns, but results in systems with lots of unused columns
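Profiling makes such columns visible. A hedged sketch which scans the exported flat file (reusing the illustrative artwork.csv from the export step) and lists the columns that never contain a value:

import csv

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    # Track for every column whether at least one non-empty value was seen.
    seen_value = {name: False for name in reader.fieldnames}
    for row in reader:
        for name in seen_value:
            value = row.get(name)
            if value is not None and value.strip():
                seen_value[name] = True

empty_columns = [name for name, seen in seen_value.items() if not seen]
print("Columns without a single value:", empty_columns)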
Data types
In our SQL example, VARCHAR was mentioned as a data type.
Other common data types include Text, Number, Boolean, and Date.
All data types can easily be encoded as Text, which saves time at first, but things
can get very messy over time.
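A rough way to verify whether a column behaves like the data type you expect is to try to parse its values. The sketch below counts, for a hypothetical year column in the illustrative artwork.csv export, how many values parse as integers, as ISO dates, or only as free text:

import csv
from datetime import datetime

COLUMN = "year"  # hypothetical column name, adapt to your own export

counts = {"integer": 0, "iso date": 0, "text": 0, "empty": 0}

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        value = (row.get(COLUMN) or "").strip()
        if not value:
            counts["empty"] += 1
        elif value.isdigit():
            counts["integer"] += 1
        else:
            try:
                # Only ISO-style dates (e.g. 1912-10-09) are recognised here.
                datetime.strptime(value, "%Y-%m-%d")
                counts["iso date"] += 1
            except ValueError:
                counts["text"] += 1

print(counts)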
Biggest metadata horror: dates!
There is an incredible range of possibilities to express dates. Here
are just a few examples:
Pattern        Example
empty
9999-9999      1891-1912
9999           1912
99-99/9999     09-10/1912
99/9999        01-1912
99/99/99       04/08/12
AAA 9999       May 1912
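The pattern notation used above (9 for a digit, A for a letter) can be derived automatically, which is essentially what profiling tools do. A minimal sketch, again assuming the illustrative artwork.csv export and a hypothetical date column:

import csv
from collections import Counter

COLUMN = "date"  # hypothetical column name, adapt to your own export

def pattern(value):
    # Replace every digit with 9 and every letter with A, keep punctuation.
    if not value.strip():
        return "empty"
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

patterns = Counter()
with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        patterns[pattern(row.get(COLUMN) or "")] += 1

# The most frequent patterns give a quick overview of how dates are encoded.
for pat, count in patterns.most_common(10):
    print(count, pat)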
Length of entries
Seems to have lost its pertinence with decreasing storage costs
However, the text length of entries can be a very interesting parameter to analyse
If you expect the name of an individual in a field and the average
text length is 250 characters, there might be an issue…
The outlier values especially, i.e. the very short or very long ones, can
be a good place to start your analysis
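Basic length statistics are straightforward to compute per column. The sketch below, based on the same illustrative artwork.csv export, prints the minimum, maximum and average length of every column so that outliers stand out:

import csv

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    lengths = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in lengths:
            lengths[name].append(len(row.get(name) or ""))

# Report per column; very short or very long outliers are worth a closer look.
for name, values in lengths.items():
    if values:
        average = sum(values) / len(values)
        print(name, "min:", min(values), "max:", max(values), "avg:", round(average, 1))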
Empty fields
Variety of options to express that a field is empty:
no value at all, an encoded NULL, etc.
Diversity of reasons: value is not (yet) known, not applicable, lack of resources
Rarely possible to document this
Specific case: trailing white spaces
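Counting how emptiness is actually expressed in a column can be revealing. A hedged sketch for a hypothetical creator column in the illustrative artwork.csv export; it also flags leading or trailing white space, and the placeholder strings it looks for are assumptions to adapt to your own data:

import csv
from collections import Counter

COLUMN = "creator"  # hypothetical column name, adapt to your own export
PLACEHOLDERS = {"null", "n/a", "unknown"}  # assumed placeholder values

observations = Counter()
with open("artwork.csv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle):
        value = row.get(COLUMN) or ""
        if value == "":
            observations["truly empty"] += 1
        elif not value.strip():
            observations["white space only"] += 1
        elif value.strip().lower() in PLACEHOLDERS:
            observations["placeholder value"] += 1
        elif value != value.strip():
            observations["leading or trailing white space"] += 1
        else:
            observations["regular value"] += 1

print(observations)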
Field overloading
Encoding values in one field which should be split out over multiple fields,
due to:
repeating values
different realities addressed by a generic field
Impacts search and retrieval but also limits data analysis and cleaning
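One pragmatic way to spot candidates for field overloading is to count how many values of each column contain a separator character. A sketch, once more assuming the illustrative artwork.csv export; the list of separators is an assumption to adapt:

import csv
from collections import Counter

SEPARATORS = (";", "|")  # assumed separators, adapt to your own data

with open("artwork.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    overloaded = Counter()
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name) or ""
            if any(sep in value for sep in SEPARATORS):
                overloaded[name] += 1

# Columns where many values contain a separator may pack several values into one field.
for name, count in overloaded.most_common():
    print(name, ":", count, "values containing a separator")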
Self-assessment 1: relevance
Why is data quality so relevant in a Linked Data context?
The closed-world assumption of the Linked Data paradigm limits the amount of data available.
No, Linked Data is based on the open-world assumption, implying that no one knows at a given moment exactly what type of data are available and which constraints they respect.
Linked Data holds the potential danger of introducing erroneous and conflicting data.
Yes, without specific efforts to clean the original data sources and without standardised methods and tools to evaluate and compare data sets published as Linked Data, the issue of data quality might seriously undermine the potential of Linked Data for libraries.
The introduction of Linked Data will boost the quality of library catalogs.
It depends! Using data from very diverse and heterogeneous sources might seriously undermine the quality of catalogs.
Self-assessment 2: deterministic data
Why is it important to distinguish deterministic from empirical data when talking about metadata quality?
Contrary to deterministic data, there exist no formal theories to validate empirical data.
Yes! For deterministic data there are fixed theories which no longer evolve, as is the case with algebra: 1 + 1 will always equal 2.
There are more issues with deterministic data.
No, irrelevant answer.
Because empirical data cannot be cleaned.
No; the fact that we cannot establish a direct correspondence between the observable world and the data does not mean that one cannot identify errors and rectify them.
Self-assessment 3: field overloading
What is field overloading and why is it problematic?
The issue arises when you go beyond the number of characters which may be encoded in a field.
No! The length of an entry can definitely be an interesting data quality indicator, but field overloading is not linked to the length of an entry.
This issue mainly arises when you transfer data from a flat file to a database.
No, it tends to be the other way around. Moving from a well-structured database, with clear definitions of fields, to a flat file might result in packing together related but different fields (e.g. given name and family name).
Field overloading occurs when related data are put together in the same field.
Yes, this limits the possibility to define clear encoding constraints and to perform structured searches.
Self-assessment 4: absent values
Why is it important to think about how we communicate about absent values?
In order to save space.
No, this is not a relevant answer.
In order to avoid them at all times.
No! Both for conceptual and operational reasons, it is impossible to avoid empty fields. The important aspect is to document the reason behind the absence of a value.
An empty field can be there for a large variety of reasons. Knowing the reason can be important in order to know how to interpret the absence.
Yes! A value might not be known or not applicable, or there simply might not be enough resources to fill it in.