Project 3:
Information Retrieval
KDDM2 – Initial Presentation
Team #18
Dominik Moesslang 1031448
Lukas Greussing 1130939
Problem Statement
Task
• Provide a ranked list of items relevant to a given piece of text
Use-Case
• Consider a user writing a blog post. While typing, the user is shown a list of
suggested items that might be relevant or helpful
• Data Set: Europeana
Advanced Task
• Identify Wikipedia concepts within the written text. Provide a link to the
corresponding Wikipedia page
• Data Set: Wikipedia
Data Set (Task) – Europeana
• Access to different types of content from different types of heritage
institutions
• More than 30 million items from a range of Europe’s leading galleries,
libraries, archives and museums
Breakdown of Item Types in Europeana: 2012-2015
Data Set (Task) – Europeana
• More than 2000 institutions across Europe have contributed to
Europeana
Items Per Country in Europeana: 2012-2015
Data Set (Task) – Europeana
• Europeana collects metadata including a thumbnail
• Items are not stored on a central server
• Items are licensed differently
Total Items in Europeana & Re-use Status: 2012-2015
Europeana - API
• Standard REST calls over HTTP
• Responses returned in JSON
• Internally, Europeana uses Apache Solr
• Queries support the Apache Lucene query syntax
• Provides basic search, Boolean search, range search and timestamp
search
• It should be possible to send weighted queries to the Europeana API
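Since the API accepts Lucene query syntax, a weighted query could be expressed with Lucene's boost operator (`term^weight`). The following is only a minimal sketch under assumptions: the endpoint URL and the parameter names `wskey`, `query` and `rows` should be checked against the current Europeana API documentation, and the helper names `build_weighted_query` and `build_search_url` are made up for illustration. The code only builds the request URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Europeana Search API; verify against the API docs.
EUROPEANA_SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def build_weighted_query(weighted_terms):
    """Turn {term: weight} into a Lucene query, e.g. '"old town"^2 OR graz'."""
    parts = []
    for term, weight in weighted_terms.items():
        # Escape embedded quotes and quote multi-word terms as phrases.
        escaped = term.replace('"', '\\"')
        clause = f'"{escaped}"' if " " in term else escaped
        # A boost of 1 is the Lucene default, so it is omitted.
        if weight != 1:
            clause += f"^{weight:g}"
        parts.append(clause)
    return " OR ".join(parts)


def build_search_url(weighted_terms, api_key, rows=10):
    """Build the request URL (parameter names are assumptions, see above)."""
    params = {
        "wskey": api_key,
        "query": build_weighted_query(weighted_terms),
        "rows": rows,
    }
    return f"{EUROPEANA_SEARCH_URL}?{urlencode(params)}"
```

For example, `build_weighted_query({"old town": 2, "graz": 1})` yields `"old town"^2 OR graz`, so items matching the phrase would be ranked higher than items matching only the single term.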
Data Set (Advanced Task) - Wikipedia
Wikipedia - API
• Standard REST calls over HTTP
• Responses returned in JSON
• Internally, Wikipedia uses the CirrusSearch search engine
• Supports stemming, search suggestions, fuzzy search, phrase search
and proximity search
• Different endpoints for different languages
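To illustrate the per-language endpoints, the sketch below builds a full-text search request for the MediaWiki Action API on a given language edition. The function name `wikipedia_search_url` and the default values are illustrative choices, not part of the project; the code only constructs the URL and performs no network call.

```python
from urllib.parse import urlencode


def wikipedia_search_url(text, lang="en", limit=5):
    """Build a MediaWiki full-text search URL for one language edition."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srlimit": limit,
        "format": "json",
    }
    # Each language edition has its own host, e.g. en.wikipedia.org
    # or de.wikipedia.org.
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"
```

Switching the language only changes the host name, so the same request logic could be reused once the language of the blog text is known.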
Framework for Text Processing
• https://www.sensium.io/
• Scalable data mining and analysis platform from Know-Center
• Accessible via Standard REST calls over HTTP
• Focused on text mining services
Sensium - Features
Natural Language Processing
• Language Recognition
• Stemming & Normalization
• Part-of-Speech Tagging
Information Extraction
• Keyphrase Extraction
• Temporal Event Extraction
• Named Entity Extraction
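To give a rough idea of what keyphrase extraction returns, here is a deliberately naive local stand-in based on term frequency. It does not call Sensium and says nothing about how Sensium works internally; the function name `naive_keyphrases` and the tiny stopword list are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def naive_keyphrases(text, top_n=3):
    """Return the top_n most frequent non-stopword tokens with their counts."""
    # Lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)
```

The counts could later serve as weights when querying a search API, although a real keyphrase extractor would also handle multi-word phrases and normalization.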
Approach – First Design Concept
• Blog entry: is updated while the user is typing
• Ranked list of items from Europeana relevant to the blog entry
• Automatically created links to Wikipedia
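The design concept can be sketched as a single function that is re-run whenever the blog text changes. This is a minimal sketch under assumptions: `suggest` and its three callable parameters are hypothetical stand-ins for the text-processing, Europeana and Wikipedia services, and nothing here reflects an actual implementation.

```python
def suggest(text, extract_keyphrases, search_items, find_wiki_link):
    """Recompute suggestions for the current blog text.

    All three callables are hypothetical stand-ins:
    - extract_keyphrases(text) -> list of (phrase, weight)
    - search_items({phrase: weight}) -> ranked list of items
    - find_wiki_link(phrase) -> URL string, or None if no page was found
    """
    keyphrases = extract_keyphrases(text)
    items = search_items(dict(keyphrases))
    links = {}
    for phrase, _weight in keyphrases:
        url = find_wiki_link(phrase)
        if url is not None:
            links[phrase] = url
    return {"items": items, "links": links}
```

In an editor, this function would typically be called after a short pause in typing rather than on every keystroke, to limit the number of requests sent to the remote services.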
Thank you for your attention!