Project 3: Information Retrieval
KDDM2 – Initial Presentation
Team #18
Dominik Moesslang 1031448
Lukas Greussing 1130939

Problem Statement
Task
• Provide a ranked list of items relevant to a given piece of text
Use-Case
• Consider a user writing a blog post. While typing, the user is presented with a list of suggested items that might be relevant or helpful
• Data Set: Europeana
Advanced Task
• Identify Wikipedia concepts within the written text and provide a link to the corresponding Wikipedia page
• Data Set: Wikipedia

Data Set (Task) – Europeana
• Access to different types of content from different types of heritage institutions
• More than 30 million items from a range of Europe’s leading galleries, libraries, archives and museums
[Figure: Breakdown of Item Types in Europeana: 2012-2015]

Data Set (Task) – Europeana
• More than 2000 institutions across Europe have contributed to Europeana
[Figure: Items Per Country in Europeana: 2012-2015]

Data Set (Task) – Europeana
• Europeana collects metadata, including a thumbnail
• Items are not stored on a central server
• Items are licensed differently
[Figure: Total Items in Europeana & Re-use Status: 2012-2015]

Europeana - API
• Standard REST calls over HTTP
• Responses returned in JSON
• Internally, Europeana uses Apache Solr
• Queries support the Apache Lucene query syntax
• Provides basic search, Boolean search, range search and timestamp search
• It should be possible to send weighted queries to the Europeana API

Data Set (Advanced Task) - Wikipedia

Wikipedia - API
• Standard REST calls over HTTP
• Responses returned in JSON
• Internally, Wikipedia uses the CirrusSearch engine
• Supports stemming, search suggestions, fuzzy search, phrase search and proximity search
• Different endpoints for different languages

Framework for Text Processing
• https://www.sensium.io/
• Scalable data mining and analysis platform from Know-Center
• Accessible via standard REST calls over HTTP
• Focused on text mining services

Sensium - Features
Natural Language Processing
• Language Recognition
• Stemming & Normalization
• Part-of-Speech Tagging
Information Extraction
• Keyphrase Extraction
• Temporal Event Extraction
• Named Entity Extraction

Approach – First Design Concept
Ranked list of items from Europeana relevant to the blog entry
Automatically created links to Wikipedia
Blog entry
Is updated while the user is typing

Thank you for your attention!
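
Backup: weighted query sketch
The Europeana API slide notes that weighted queries should be possible through the Lucene query syntax. The sketch below shows one way such a query could be assembled from extracted keyphrases, using the standard Lucene boost operator (`^`). The endpoint URL, the `wskey`/`query`/`rows` parameter names, the helper function names, and the example weights are assumptions for illustration and should be checked against the Europeana API documentation. No request is sent; the code only builds the query string and URL.

```python
from typing import Mapping
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Europeana API documentation.
EUROPEANA_SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"


def escape_phrase(phrase: str) -> str:
    """Escape backslashes and double quotes so the phrase can be quoted."""
    return phrase.replace("\\", "\\\\").replace('"', '\\"')


def build_weighted_query(weighted_terms: Mapping[str, float]) -> str:
    """Join quoted phrases with OR, attaching a Lucene boost (^weight) to each.

    Phrases are ordered by descending weight so the query is easy to read.
    """
    ordered = sorted(weighted_terms.items(), key=lambda kv: -kv[1])
    clauses = [f'"{escape_phrase(p)}"^{w:g}' for p, w in ordered]
    return " OR ".join(clauses)


def build_search_url(weighted_terms: Mapping[str, float],
                     api_key: str, rows: int = 10) -> str:
    """Build a search URL (hypothetical parameter names: wskey, query, rows)."""
    params = {
        "wskey": api_key,
        "query": build_weighted_query(weighted_terms),
        "rows": rows,
    }
    return f"{EUROPEANA_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example keyphrases and weights, e.g. as a keyphrase extractor might score them.
    terms = {"Gustav Klimt": 3.0, "Vienna": 1.5, "painting": 1.0}
    print(build_weighted_query(terms))
    print(build_search_url(terms, api_key="YOUR_API_KEY"))
```

In the first design concept, the weights could come from keyphrase scores of the blog text, and the query would be rebuilt as the user types. Whether the API honors the boosts in its ranking remains to be verified.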