Metadata enriching

[Mining for semantics.] — ©2015 Bureau of Land Management Oregon and Washington State

Strategies for metadata creation

The LIS community has to fundamentally rethink how to create added-value metadata in contexts where it is impossible to continue with a traditional manual cataloging approach

Other communities have illustrated over the last decade how algorithms, crowdsourcing or outsourcing via micro-payments can provide alternative strategies

Digital Humanities are embracing scale

The DH community in particular can inspire libraries to embrace the avalanche of digital collections not as a problem, but as an opportunity

Methods and tools developed by the DH community to make sense out of very big volumes of non-structured text can also be reused for automated metadata creation

What can you experiment with as a librarian?

Low-hanging fruit from the information extraction field, often used in DH research:

Named-Entity Recognition (NER)
Topic Modeling (TM)

Both methods can be used with a minimum of resources and technical experience. This module gives you the opportunity to experiment on your own!

[Distant reading in a library catalog.] — ©2017 DH at Notre Dame University

With great power comes great responsibility.
Spider-Man, 1962

The increased availability of data and text-mining technologies has given a tremendous boost to quantitative methods in the humanities. The ease with which certain technologies can be applied does hold a danger, in the sense that the tool increasingly defines the method and the research question itself. The DH community has been accused of being too tool-driven…

Librarians as data curators

DH have created the opportunity for librarians to reposition themselves as data curators
Librarians should function increasingly as a middleman between digital corpora and academics with their research questions
Job opportunity in research environments: helping academics critically assess automated methods for analyzing large corpora

[DH debate in the NYT.] — ©2012 Stanley Fish

Named-Entity Recognition (NER)

Consider the sentence: On 25 September 2006, we visited Washington to see the White House.

Identification of entities
- 25 September 2006
- Washington
- White House
Disambiguation
- Washington: the state, the city, the jazz musician, etc.?

Disambiguation through URLs

Each entity is associated with a meaning:
- http://dbpedia.org/page/White_House
- http://dbpedia.org/page/Washington,_D.C.
Distinguish recognition and disambiguation:
- Recognition is about finding the entities in a text fragment, whereas disambiguation is about mapping those entities to a universal identifier
However, recall that a URL serves the double purpose of identification and location.
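The difference between the two steps can be sketched in a few lines of Python. This is only a toy illustration, not how a real NER service works: the lookup table is a hypothetical stand-in for a knowledge base, the Washington (state) entry is an assumed extra candidate, and the disambiguation step naively picks the first candidate instead of using context.

```python
import re

# Hypothetical mini knowledge base: surface form -> candidate identifiers.
# The White House and Washington, D.C. URLs come from the slide above;
# the Washington (state) URL is an assumed second candidate.
KNOWLEDGE_BASE = {
    "White House": ["http://dbpedia.org/page/White_House"],
    "Washington": [
        "http://dbpedia.org/page/Washington,_D.C.",
        "http://dbpedia.org/page/Washington_(state)",
    ],
}


def recognise(text):
    """Recognition: find the known surface forms that occur in the text."""
    found = []
    for surface_form in KNOWLEDGE_BASE:
        if re.search(r"\b" + re.escape(surface_form) + r"\b", text):
            found.append(surface_form)
    return found


def disambiguate(surface_form):
    """Disambiguation: map a surface form to a single identifier.

    This toy version simply takes the first candidate; real services
    rely on the surrounding context to choose between candidates.
    """
    return KNOWLEDGE_BASE[surface_form][0]


sentence = "On 25 September 2006, we visited Washington to see the White House."
entities = {form: disambiguate(form) for form in recognise(sentence)}
```

Here `recognise` only answers "which strings are entities?", while `disambiguate` answers "which thing does this string refer to?".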

What is actually identified by https://en.wikipedia.org/wiki/Jeff_Koons?

Wikipedia page about Jeff Koons?
- But which version? Wikipedia is constantly edited, so one would have to use http://en.wikipedia.org/w/index.php?title=Jeff_Koons&oldid=569081712 in order to refer to a specific version.
Jeff Koons himself?
- The same URL cannot be used to also identify the person. Another URL could identify him, but not locate him, as we cannot represent an individual electronically.
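As a small illustration of the versioning point, the following sketch builds the version-specific URL quoted above from a page title and a revision number. The function name `wikipedia_permalink` is a hypothetical helper, not part of any Wikipedia library.

```python
from urllib.parse import urlencode


def wikipedia_permalink(title, oldid):
    """Build a URL that refers to one specific revision of a Wikipedia page."""
    query = urlencode({"title": title, "oldid": oldid})
    return "http://en.wikipedia.org/w/index.php?" + query


# The title and revision number are the ones used on the slide above.
permalink = wikipedia_permalink("Jeff_Koons", 569081712)
```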

Do URLs identify only electronic resources or also real-world objects?

Information resource
- A Wikipedia article about Jeff Koons is clearly an information resource, as it can be represented in electronic form. Its URL can be used to identify, locate and access an electronic representation of it.
Non-information resource
- We can represent a document about Jeff Koons, but this would be a different resource than the artist himself, as they have different properties (e.g. date of creation). The artist is therefore a non-information resource.

DBpedia takes a pragmatic approach to distinguish both

http://dbpedia.org/resource/Jeff_Koons
- identifies the artist
http://dbpedia.org/page/Jeff_Koons
- identifies the document about the artist

When Jeff Koons is mentioned in the document, the first URL will be used. Furthermore, if we visit the first URL in the browser, DBpedia is not able to represent Jeff Koons, so it redirects you towards the page about Jeff Koons instead.
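The naming convention can be sketched as a simple string mapping. This only mirrors the two URL patterns shown above; it does not contact DBpedia or reproduce its actual redirect mechanism, and the function names are hypothetical.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"  # identifies the thing itself
PAGE_PREFIX = "http://dbpedia.org/page/"          # identifies the document about it


def document_url(resource_url):
    """Return the URL of the document describing the given resource."""
    if not resource_url.startswith(RESOURCE_PREFIX):
        raise ValueError("not a DBpedia resource URL: " + resource_url)
    return PAGE_PREFIX + resource_url[len(RESOURCE_PREFIX):]


def is_information_resource(url):
    """True for the document about a thing, False for the thing itself."""
    return url.startswith(PAGE_PREFIX)


artist = "http://dbpedia.org/resource/Jeff_Koons"
page = document_url(artist)
```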

NER services

Rapidly evolving marketplace
- Start-ups sprout up and are acquired at a rapid pace by bigger players. Some examples of NER services include:
  - OpenCalais
  - NetOwl
Open-source alternatives

Deciding which service to use—questions you should ask yourself:

How to represent the entity Library of Congress?

Reusing URLs versus minting URLs:
- Services that rely on recognised knowledge bases, already well integrated as Linked Data, will refer to, for example, http://dbpedia.org/page/Library_of_Congress.
- Commercial services sometimes create custom URLs which are hardly used outside their service. OpenCalais for example mints its own PermID, which represents the LoC as https://permid.org/1-5037352201.

Deciding which service to use—questions you should ask yourself:

How to represent the entity Library of Congress?

Information versus non-information resources:
- Services often refer to documentation about a resource, such as the DBpedia page which gathers all sorts of information about, for example, the LoC: http://dbpedia.org/page/Library_of_Congress.
- However, services such as Zemanta potentially identify the LoC with https://www.loc.gov/ from where one has immediate access to their services.

You can perform NER on your data directly from within OpenRefine

You will need to install the NER extension first.
- Download the extension from our website.
- Follow the installation instructions.
We will use the metadata of the British Library to demonstrate NER in OpenRefine.
- Create a project with the British Library dataset.

Configure the NER services

By default, DBpedia Spotlight is enabled.
- provides good initial results
Other services might require an API key.
- Click the Named-entity recognition button at the top right corner, and choose Configure services…

Start the NER process

First, identify the relevant column.
- In contrast to reconciliation, for NER, fields with a lot of text are the best candidates.
Then, choose the services and start enriching.
- Click a column triangle and choose Extract named entities.
- Pick the services you want and choose Start extracting.
Extracted entities are displayed in a new column.
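For readers who prefer a script over the OpenRefine interface, the sketch below shows roughly what a call to DBpedia Spotlight could look like. It rests on several assumptions that are not taken from this module: the public endpoint URL, the `text` and `confidence` parameters, and the `Resources`, `@surfaceForm` and `@URI` keys of the JSON reply. Check them against the current Spotlight documentation before relying on them. The sample reply is hand-written for illustration, not real service output, and the code says nothing about how the OpenRefine extension works internally.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed public endpoint of DBpedia Spotlight (English model).
SPOTLIGHT_ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Prepare (but do not send) an annotation request asking for JSON."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(SPOTLIGHT_ENDPOINT + "?" + query,
                   headers={"Accept": "application/json"})


def extract_entities(reply):
    """Turn a parsed JSON reply into (surface form, URL) pairs."""
    return [(r["@surfaceForm"], r["@URI"]) for r in reply.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send the request; requires network access to the service."""
    with urlopen(build_request(text, confidence)) as reply:
        return extract_entities(json.load(reply))


# Hand-written sample in the assumed reply format, for illustration only.
sample_reply = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/White_House",
         "@surfaceForm": "White House"},
    ]
}
sample_entities = extract_entities(sample_reply)
```

Calling `annotate("We visited Washington to see the White House.")` would send the request; as in OpenRefine, long free-text fields are the most promising input.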

Topic Modeling (TM)

Returns clusters of terms which appear together
- Statistical approach expressing to what extent terms are used in each other's presence
Unsupervised machine-learning approach
- No training data or a priori knowledge about the corpus needed
- Number of extracted topics and how many terms per topic are clustered together are the main parameters to play with
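To give a feel for the underlying intuition, the following sketch counts how often two terms appear in the same document. It is deliberately not a topic model: tools such as Mallet use probabilistic models rather than raw counts. The four mini-documents and the stopword list are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus; a real corpus would be far larger.
documents = [
    "the farmer could not transport potatoes and corn",
    "hunger spread because the farmer lost his corn",
    "the siren gave the alarm before the explosion",
    "an airplane passed and the siren sounded the alarm",
]

# Assumed, minimal stopword list.
STOPWORDS = {"the", "and", "an", "his", "not", "could", "because", "before"}


def cooccurrence_counts(docs):
    """Count, for every pair of terms, the documents in which both occur."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.split()) - STOPWORDS)
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(documents)
```

Pairs such as `("corn", "farmer")` and `("alarm", "siren")` occur together in two documents each, whereas `("corn", "siren")` never do. A topic model generalises this kind of signal into clusters of terms.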

First step: extracting the clusters

Running a popular TM tool such as Mallet on a corpus of English historical newspapers during the period of the Second World War might result in the following clusters:

potatoes, farmer, transport, hunger, corn
alarm, bunker, explosion, siren, airplane
doctor, nurse, bed, medicine, death

Second step: labeling the clusters

In order to identify an appropriate label for a cluster, a good understanding of the context of the corpus is often essential. For the three clusters from the previous slide, we could come up with these labels:

food shortage
airstrikes
hospitalization

TM understands the semantics?

The algorithm is constrained by the words used in the text; if Freudian psychoanalysis is your thing, and you feed the algorithm a transcription of your dream of bear-fights and big caves, the algorithm will tell you nothing about your father and your mother; it’ll only tell you things about bears and caves. It’s all text and no subtext.
Scott B. Weingart, Topic Modeling for Humanists: A Guided Tour, 2012

Self-assessment 1: distant reading

The concept of distant reading…

…refers to the possibility to automatically create abstracts of books.
- No, the concept refers to a generic approach of using quantitative methods to deal with large volumes of data and does not represent one specific method or technique.
…implies we no longer need to actually read texts.
- No, distant and close reading practices complement one another and are not mutually exclusive.
…can be helpful to librarians.
- Yes! Automated methods such as TM or NER can be applied to extract metadata from large corpora and create navigational paths for end-users.

Self-assessment 2: NER

What makes NER so interesting as an information extraction technique?

Entities are identified by a URL.
- Yes! The URL makes it possible to disambiguate the entity and to obtain information about it.
It allows you to perform sentiment analysis on feedback by patrons.
- No! Sentiment analysis uses other techniques.
You can parse ambiguous dates to an ISO-standardized date format.
- No! Certain scripts can do that, but they have nothing to do with NER.

Self-assessment 3: (non-)information

Why is it important to distinguish information and non-information resources on the Web?

They have different metadata.
- Yes! For example, a Web page about a person (information resource) has a different creation date than the date of creation of the actual individual (non-information resource), which would be his or her date of birth.
It is actually not so important.
- It is really important! Otherwise, it becomes very difficult to refer to actual objects, people, ideas, etc. on the Web.
It has an important impact on the design of URLs.
- Yes, the URL needs to allow differentiating between the actual thing and documentation about the thing.

Self-assessment 4: Topic Modeling

When is Topic Modeling relevant for librarians?

To extract key terms from metadata fields such as description.
- No, TM requires a substantial volume of text and does not work for a couple of paragraphs.
When you want to create an overview of recurring themes in a collection of scanned literature.
- Yes, this would potentially be a good application for TM.
When you want to identify all place names from a corpus.
- No! NER is the application you need for this type of task.

Linked Data for Librarians

Part 2 – Module 2: Metadata enriching
