The LIS community has to fundamentally rethink how to create added-value
metadata in contexts where it is impossible to continue with a traditional
manual cataloging approach
Other communities have illustrated over the last decade how algorithms, crowdsourcing
or outsourcing via micro-payments can provide alternative strategies
Digital Humanities are embracing scale
The DH community in particular can inspire libraries to embrace the avalanche of digital collections
not as a problem,
but as an opportunity
Methods and tools developed by the DH community to make sense out of very big volumes
of non-structured text can also be reused for automated metadata creation
What can you experiment with as a librarian?
Low-hanging fruit from the information extraction field, often
used in DH research:
Named-Entity Recognition (NER)
Topic Modeling (TM)
Both methods can be used with a minimum of resources and technical
experience. This module gives you the opportunity to experiment on your own!
The increased availability of data and text-mining technologies has
given a tremendous boost to quantitative methods in the humanities.
The ease with which certain technologies can be applied does hold a danger,
in the sense that the tool increasingly defines the method and the research question
itself. The DH community has been accused of being too tool-driven…
Librarians as data curators
DH have created the opportunity for librarians to reposition themselves
as data curators
Librarians should increasingly act as intermediaries between
digital corpora and academics with their research questions
Job opportunities in research environments: helping academics critically
assess automated methods for analyzing large corpora
Named-Entity Recognition (NER)
Consider the sentence: On 25 September 2006, we visited Washington
to see the White House.
Identification of entities
25 September 2006
Washington
White House
Disambiguation
Washington: the state, the city, the jazz musician, etc.?
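The two steps above, identification and disambiguation, can be illustrated with a toy sketch. This is not how real NER services work internally (they rely on trained models and large knowledge bases); the gazetteer, the candidate meanings and the function name find_entities are all hypothetical and only serve to show what each step produces for the example sentence.

```python
import re

# Hypothetical gazetteer: surface form -> candidate meanings (illustrative only).
GAZETTEER = {
    "Washington": ["U.S. state", "city (Washington, D.C.)", "jazz musician"],
    "White House": ["building in Washington, D.C."],
}

# Matches dates written like "25 September 2006".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)

def find_entities(text):
    """Return (surface form, type, candidate meanings) in order of appearance."""
    found = []
    # Step 1a: identify dates with a pattern.
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE", []))
    # Step 1b: identify names by looking them up in the gazetteer.
    for name, candidates in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, "NAME", candidates))
    found.sort(key=lambda item: item[0])
    return [(surface, kind, cands) for _, surface, kind, cands in found]

sentence = "On 25 September 2006, we visited Washington to see the White House."
entities = find_entities(sentence)
for surface, kind, candidates in entities:
    # Step 2: more than one candidate means disambiguation is still needed.
    status = "ambiguous" if len(candidates) > 1 else "unambiguous"
    print(surface, kind, status, candidates)
```

The sketch only flags "Washington" as ambiguous; choosing between the candidates requires context, which is what real services try to exploit.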
The same URL cannot be used to also identify the person.
Another URL could identify him, but not locate him,
as we cannot represent an individual electronically.
Do URLs identify only electronic resources or also real-world objects?
Information resource
A Wikipedia article about Jeff Koons is clearly an information resource,
as it can be represented in electronic form. Its URL can be used to identify,
locate and access an electronic representation of it.
Non-information resource
We can represent a document about Jeff Koons, but this would be
a different resource than the artist himself, as they have different properties (e.g.
date of creation). The artist is therefore a non-information resource.
DBpedia takes a pragmatic approach to distinguishing the two
When Jeff Koons is mentioned in the document,
the first URL will be used.
Furthermore, if we visit the first URL in the browser,
DBpedia is not able to represent Jeff Koons,
so it redirects you towards the page about Jeff Koons instead.
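The redirect behaviour described above can be sketched without any network access. The sketch assumes DBpedia's /resource/ pattern for the thing itself and the /page/ pattern for its description, and assumes the redirect is signalled with HTTP status 303 (See Other), the code commonly used for this pattern; the function names are hypothetical.

```python
# Assumed URL patterns: /resource/ names the thing, /page/ names the document.
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
PAGE_PREFIX = "http://dbpedia.org/page/"

def document_url_for(thing_url):
    """Map the URL of a thing to the URL of the page describing it."""
    if not thing_url.startswith(RESOURCE_PREFIX):
        raise ValueError("not a thing URL: " + thing_url)
    return PAGE_PREFIX + thing_url[len(RESOURCE_PREFIX):]

def simulate_browser_visit(url):
    """Return (status, location) roughly as a browser would see it (no network)."""
    if url.startswith(RESOURCE_PREFIX):
        # The artist cannot be sent over the wire: point to his description.
        return 303, document_url_for(url)
    # A page is an information resource and can be served directly.
    return 200, url

koons = RESOURCE_PREFIX + "Jeff_Koons"
print(simulate_browser_visit(koons))
```

The point of the design is that the two URLs stay distinct: metadata about the artist attaches to the first, metadata about the page to the second.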
NER services
Rapidly evolving marketplace
Start-ups sprout up and are acquired at a rapid pace by bigger players.
Some examples of NER services include:
Commercial services sometimes create custom URLs which are hardly used
outside their service. OpenCalais for example mints its own
PermID, which represents the LoC as
https://permid.org/1-5037352201.
Deciding which service to use: questions you should ask yourself
How to represent the entity Library of Congress?
Information versus non-information resources:
Services often refer to documentation about a resource,
such as the DBpedia page that gathers all sorts of information about,
for example, the LoC:
http://dbpedia.org/page/Library_of_Congress.
However, services such as Zemanta may identify the LoC
with
https://www.loc.gov/, from which one has immediate access to
its services.
You can perform NER on your data directly from within OpenRefine
Click the Named-entity recognition button
at the top right corner,
and choose Configure services…
Start the NER process
First, identify the relevant column.
In contrast to reconciliation,
for NER, fields with a lot of text are the best candidates.
Then, choose the services and start enriching.
Click a column triangle
and choose Extract named entities.
Pick the services you want
and choose Start extracting.
Extracted entities are displayed in a new column.
Topic Modeling (TM)
Returns clusters of terms which appear together
Statistical approach expressing to what extent terms are
used in each other's presence
Unsupervised machine-learning approach
No training data or a priori knowledge about the corpus
needed
The number of topics to extract and the number of terms clustered
per topic are the main parameters to play with
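The idea of terms being "used in each other's presence" can be made concrete by counting co-occurrences. The sketch below is not topic modeling itself (tools such as Mallet fit a probabilistic model); it only shows the raw signal such models build on. The mini corpus is made up and the function name cooccurrence_counts is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up mini corpus; each document is already split into terms.
documents = [
    ["potatoes", "farmer", "hunger"],
    ["farmer", "corn", "potatoes"],
    ["alarm", "siren", "bunker"],
    ["siren", "explosion", "alarm"],
]

def cooccurrence_counts(docs):
    """Count, for every pair of terms, how many documents contain both."""
    counts = Counter()
    for doc in docs:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(documents)
print(counts[("farmer", "potatoes")])  # appear together in two documents
print(counts[("potatoes", "siren")])   # never appear together
```

Pairs with high counts hint at terms that belong to the same cluster; a real TM tool generalizes this over the whole vocabulary and a chosen number of topics.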
First step: extracting the clusters
Running a popular TM tool such as Mallet on a corpus of English historical newspapers
during the period of the Second World War might result in the following clusters:
potatoes, farmer, transport, hunger, corn
alarm, bunker, explosion, siren, airplane
doctor, nurse, bed, medicine, death
Second step: labeling the clusters
In order to identify an appropriate label for a cluster, a good understanding
of the context of the corpus is often essential. For the three clusters from
the previous slide, we could come up with these labels:
food shortage
airstrikes
hospitalization
Does TM understand the semantics?
The algorithm is constrained by the words used in the text;
if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription
of your dream of bear-fights and big caves, the algorithm will tell you nothing about
your father and your mother; it’ll only tell you things about bears and caves.
It’s all text and no subtext.
…refers to the possibility to automatically create abstracts of books.
No, the concept refers to a generic approach of using quantitative methods to deal with large volumes of data and does not represent one specific method or technique.
…implies we no longer need to actually read texts.
No, distant and close reading practices complement one another and are not mutually exclusive.
…can be helpful to librarians.
Yes! Automated methods such as TM or NER can be applied to extract metadata from large corpora and create navigational paths for end-users.
Self-assessment 2: NER
What makes NER so interesting as an information extraction technique?
Entities are identified by a URL.
Yes! The URL makes it possible to disambiguate the entity and to obtain information about it.
It lets you perform sentiment analysis on patron feedback.
No! Sentiment analysis uses other techniques.
You can parse ambiguous dates to an ISO-standardized date format.
No! Certain scripts can do that, but they have nothing to do with NER.
Self-assessment 3: (non-)information
Why is it important to distinguish information and non-information resources on the Web?
They have different metadata.
Yes! For example, a Web page about a person (information resource) has a different creation date than the date of creation of the actual individual (non-information resource), which would be his or her date of birth.
It is actually not so important.
It is really important! If not, it becomes really difficult to refer to actual objects, people, ideas, etc. on the Web.
It has an important impact on the design of URLs.
Yes, the URL needs to allow differentiating between the actual thing and documentation about the thing.
Self-assessment 4: Topic Modeling
When is Topic Modeling relevant for librarians?
To extract key terms from metadata fields such as description.
No, TM requires a substantial volume of text and does not work on just a couple of paragraphs.
When you want to create an overview of recurring themes in a collection of scanned literature.
Yes, this would potentially be a good application for TM.
When you want to identify all place names from a corpus.
No! NER is the application you need for this type of task.