CLARIN-PL
CLARIN-PL – Research User-driven
Language Technology Infrastructure
Maciej Piasecki
Wrocław University of Technology
G4.19 Research Group
2014-11-27
Basic Notions
Humanistyka Cyfrowa
Warszawa
2014-11-27
CLARIN-PL
 Language Technology (LT)
 language resources and tools
 robust in terms of quality and coverage
 multipurpose
 component based
 Language Technology Infrastructure
 a software framework (architecture or platform)
 for combining language tools with language resources into
processing chains (or pipelines)
 the defined processing chains are then applied to language
data sources
 interoperability, including with external systems
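The idea of combining tools into a processing chain can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the two "tools" (`tokenize`, `lowercase_lemmas`) and the `build_chain` helper are hypothetical stand-ins and not part of any CLARIN-PL software.

```python
from typing import Callable, Iterable

# A "tool" is any function that takes a document (a dict of annotation
# layers) and returns it with one more layer added. All names here are
# hypothetical and serve only to illustrate the chaining idea.
Document = dict
Tool = Callable[[Document], Document]

def tokenize(doc: Document) -> Document:
    # Naive whitespace segmentation, standing in for a real tokenizer.
    return {**doc, "tokens": doc["text"].split()}

def lowercase_lemmas(doc: Document) -> Document:
    # Stand-in for morphological analysis: lower-cases each token.
    return {**doc, "lemmas": [t.lower() for t in doc["tokens"]]}

def build_chain(tools: Iterable[Tool]) -> Tool:
    """Combine tools into one processing chain applied left to right."""
    steps = list(tools)

    def chain(doc: Document) -> Document:
        for step in steps:
            doc = step(doc)
        return doc

    return chain

chain = build_chain([tokenize, lowercase_lemmas])
result = chain({"text": "Ala ma kota"})
```

Each tool only reads layers produced earlier in the chain, so the order of tools in the list is the order of dependencies.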
LT in Humanities and Social
Sciences: Barriers
 Physical – language tools and resources are not
accessible in Internet
 Informational – descriptions are not available or there is no
means for searching
 Technological – lack of commonly accepted standards for
LT, lack of a common platform, varieties of technological
solutions, insufficient users’ computers
 Related to knowledge – the use of LT requires
programming skills or knowledge from the area of natural
language engineering
 Legal – licences for language resources and tools (LRTs)
limit their applications
CLARIN Support for Humanities
& Social Sciences
 CLARIN is an ERIC-type consortium of
 11 countries (Austria, Bulgaria, Czech Republic, Denmark,
Estonia, Germany, Lithuania, The Netherlands, Poland,
Portugal, Sweden) and The Dutch Language Union
 1 observer: Norway
 Focus area:
 Supporting research in Humanities and Social Sciences
 Users: researchers, PhD students, students and scientific
institutions
 CLARIN Mission
 To significantly lower the barriers for the use of Language
Technology in Humanities & Social Sciences (H&SS)
 To facilitate or enable research methods based on automated
analysis of text and speech resources
CLARIN Offer
 Integration of different LT components into one
interoperable system
 Common, flexible meta-data standard (CMDI)
 Central searching for resources (Virtual Language
Observatory)
 Single sign-on and one login into the distributed infrastructure
 Decreased Physical and Informational Barriers
 Common standards: promoting, co-ordinating, harmonising
 Web Services for Language Tools and Resources
 Decreased Technological Barrier
 Installation-free, access via Web Applications
 Decreased Knowledge Barrier
 Common licences and promotion of open access
 Decreased Legal Barrier
CLARIN: Portal
CLARIN: Virtual Language
Observatory
CLARIN: Federated Content
Search – Searching Corpora
LTI Development Paradigms
 Bottom-up
 a collected offer approach
 based on linking together the already existing Language
Resources and Tools
 focused on accessibility, technical interoperability and
processing chains
 Top-down
 following the user-centred design paradigm
 research applications for H&SS are a starting point
 Bi-directional
 linking of Language Resources and Tools
 combined with the development of research applications
Bi-directional LTI Development
 Idea
 development of the necessary elements
 a distributed network infrastructure
 basic LT processing chain
 combined with user-centred approach to the development of
research applications
 Top-down part
 close co-operation with key users from the H&SS domain
 modelled on an Agile-like lightweight software design
method with emphasis on prototyping
 amendments to the shape of the technical basis: LRTs,
standards,
 inspirations, identification of the further user needs, next
iterations
CLARIN-PL: the Consortium
 Polish scientific consortium
 Wrocław University of Technology, G4.19 Research Group
 Institute of Computer Science, Polish Academy of Sciences
 Polish-Japanese Institute of Information Technology, Chair of
Multimedia
 University of Łódź, PELCRA group at Chair of English Language
and Applied Linguistics
 Institute of Slavic Studies, Polish Academy of Sciences
 Wrocław University
 Goal: implementation of the Polish part of the CLARIN
ERIC LTI
 Follows the bi-directional approach to LTI development
CLARIN-PL: Mission
 Starting point
 Several publicly available language resources and tools for
Polish,
 But still many were lacking
 Deeper technological barrier: restricted applications
 CLARIN-PL Pillars:
 CLARIN-PL Language Technology Centre
www.clarin-pl.eu
 the Polish node of the CLARIN distributed infrastructure
 Complete set of the basic Language Resources & Tools for
Polish
 Research applications for H&SS
 first set for key users and selected H&SS sub-domains.
CLARIN-PL Language
Technology Centre
 Located at Wrocław University of Technology
 based on a modified DSpace system from LINDAT (Czech CLARIN)
 Single sign-on, one login (a member of the PIONIER.Id Federation)
 Advanced repository system for language resources
 Persistent Identifiers for resources and tools
 Rich CMDI meta-data – CLARIN-wide visibility in the central search
 Interface for Federated Content Search
 Depositing service for researchers from H&SS
 Application for the Data Seal of Approval
 Adherence to all CLARIN specifications about standards and protocols
 Web Services for LRTs:
 the basic processing chain of Polish
 Prototype system for flexible composition of the natural language
processing chains
 support for developers: SOAP & REST interfaces
 Web Applications for LRTs
 Knowledge Sharing: expertise and support for the users
CLARIN-PL: Language
Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
 Wordnet for Polish (plWordNet 3.0)
 formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar
CLARIN-PL: Language
Resources
 Starting point – a set of large resources
 a huge National Corpus of Polish (1 billion tokens)
 plWordNet 2.1 – a very large wordnet for Polish
 Korpus Politechniki Wrocławskiej – an open Polish corpus
with rich annotation
 Expanded resources
 plWordNet 3.0 – a huge semantic lexicon of Polish
 a comprehensive description of the Polish lexico-semantic
system (~200 000 lemmas, ~280 000 senses)
 fully mapped to English Princeton WordNet
 described formally by mapping to an ontology
 Dictionary of multiword expressions described syntactically
 NELexicon 2.0 – a huge lexicon of Polish Proper Names (2.5
million)
CLARIN-PL: Language
Resources for Polish
 Expanded resources
 Conversational corpus (following PELCRA and NKJP)
 A large semantic valency lexicon for Polish predicative
lexical units
 Newly built resources
 Transcribed training-testing Polish speech corpus
 Bi-lingual corpora:
 Polish-English, Polish-Bulgarian-Russian, Polish-Lithuanian
 Polish historical corpus (for the years 1945-1954)
 Corpora annotated for: meta-data, anaphora, time
expressions, spatial expressions, semantic relations and
situations
plWordNet 2.2 in CLARIN-PL
http://plwordnet.pwr.edu.pl
CLARIN-PL: Language Tools for
Polish
 Systems for searching corpora, especially Polish corpora
 Spokes for conversational and bilingual corpora
 Poliqarp 2.0 for richly annotated corpora
 Historical corpora [New]
 Text mining (information extraction)
 Recognition and classification of Proper Names
 Recognition of anaphoric links
 Recognition and classification of time expressions and spatial
expressions [New]
 Situation recognition [New]
 Extraction of multiword expressions (collocations)
 A generic set of morpho-syntactic tools for Polish that can
be adapted to a domain specified by the user [New]
CLARIN-PL: Language Tools for
Polish
 Word Sense Disambiguation based on plWordNet
 Shallow semantic parser [New]
 Deep syntactic-semantic parser [New]
 Tools for the extraction of semantic-pragmatic
information from documents and collections of documents,
e.g.
 keywords [New],
 semantic relations between text fragments
 and text summaries
Basic Language Tools for Polish
1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and
context sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and
constituent
14. Deep semantic parser
CLARIN-PL: Processing Chain
for Polish
CLARIN-PL: Recognition and
classification of Proper Names
Bi-directional - Top-down Part:
First Applications
 Approaching users
 already active, interested, working on large textual and
speech resources, …
 covering a maximal variety of research areas, e.g. linguistics,
literary studies, psychology, political studies and sociology
 matching the available language tools for Polish
 the first set of several prototype applications illustrating
possibilities and facilitating identification of the needs
 First applications
 Spokes – searching corpora of conversational data
 A system for collecting Polish text corpora from the Web
 An open textometric and stylometric system focused on Polish
 Semantic text classification for sociology
 Literary Map
Spokes (University of Łódź)
http://spokes.clarin-pl.eu
System for Collecting Polish Text
Corpora from the Web
 Requests from the users revealed gaps in the available
technology
 existing corpus building systems were too sensitive to text
encoding errors found on the Web
 not designed for informal corpora like blogs
 A system for collecting Polish text corpora from the Web
had to be constructed:
 based on tools from the Masaryk University in Brno
 detects texts containing a larger number of errors (by
morphological analysis)
 supports semi-automated extraction of texts from blogs, posts
on forums, etc.
 integrated with tools for processing
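One way to picture error detection by morphological analysis is to measure how many word forms a lexicon fails to recognise. The sketch below is a simplified assumption, not the actual CLARIN-PL system: `LEXICON` is a tiny hypothetical stand-in for a morphological analyser, and the 0.3 threshold is arbitrary.

```python
# Hypothetical stand-in for a morphological analyser: a set of known forms.
LEXICON = {"ala", "ma", "kota", "i", "psa"}

def unknown_ratio(text, lexicon):
    """Share of alphabetic tokens that the lexicon does not recognise."""
    words = [w.strip(".,;:!?\"()").lower() for w in text.split()]
    words = [w for w in words if w.isalpha()]
    if not words:
        return 1.0  # nothing analysable: treat the text as fully unknown
    unknown = sum(1 for w in words if w not in lexicon)
    return unknown / len(words)

def keep_text(text, lexicon, max_unknown=0.3):
    """Keep a downloaded text only if few of its words are unrecognised."""
    return unknown_ratio(text, lexicon) <= max_unknown

clean = "Ala ma kota i psa."
noisy = "Ala ma kotta i psaa"
```

With this toy lexicon the first text would be kept and the second rejected, because two of its five words are not recognised.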
Open Textometric and
Stylometric System
 System designed for characteristic features of Polish
 like rich inflection, weakly constrained word order
 Based on several existing components including Stylo
(Eder & Rybicki)
 Enabling the use of features defined on any level of the
linguistic structure:
 from the level of word forms
 up to the level of the semantic-pragmatic structures.
 Available as Web Application and a Web Service
 Stylometric techniques appear to be applicable in many
tasks of H&SS
 sociology (features characteristic of different
subgroups), political studies (similarities and differences
between political parties), literary studies …
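At the lowest level, that of word forms, a stylometric comparison can be sketched as a distance between word-frequency profiles. The example below is a deliberately simple assumption: it uses the mean absolute difference of relative frequencies, which is neither Burrows's Delta (that measure standardises the frequencies first) nor the Stylo implementation, and the function names are hypothetical.

```python
from collections import Counter

def relative_frequencies(text):
    """Map each lower-cased word form to its share of all tokens."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    return {w: c / len(tokens) for w, c in Counter(tokens).items()}

def style_distance(text_a, text_b, top_n=50):
    """Mean absolute difference in relative frequency over the top_n
    most frequent word forms of the two texts combined."""
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    combined = Counter(text_a.lower().split()) + Counter(text_b.lower().split())
    features = [w for w, _ in combined.most_common(top_n)]
    if not features:
        return 0.0
    total = sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in features)
    return total / len(features)
```

Features from higher levels of linguistic structure (lemmas, tags, semantic categories) could be plugged in by replacing the tokenisation step with the output of the corresponding language tool.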
Semantic Text Classification
for Sociology
 Users: Collegium Civitas, Warsaw
 Goal
 Support for large scale analysis of the source materials
 Automatically annotate documents and text fragments with
pre-defined semantic categories
 Definition of categories by examples
 Automated semantic grouping of documents and text
fragments
 Support for
 Corpus building
 Manual annotation of the learning sub-corpus
 Automated annotation process
 Statistical analysis of the results
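Defining categories by examples and then annotating new fragments automatically can be illustrated with a very small classifier. The sketch below assumes a bag-of-words profile per category and cosine similarity; these choices, the function names and the sample texts are illustrative assumptions and do not describe the method used in the CLARIN-PL application.

```python
import math
from collections import Counter

def bag(text):
    """Bag of lower-cased word forms."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(examples):
    """examples: {category: [example texts]} -> {category: summed bag}."""
    profiles = {}
    for category, texts in examples.items():
        profile = Counter()
        for text in texts:
            profile.update(bag(text))
        profiles[category] = profile
    return profiles

def classify(text, profiles):
    """Return the category whose profile is most similar to the text."""
    vector = bag(text)
    return max(profiles, key=lambda c: cosine(vector, profiles[c]))

# Hypothetical manually annotated learning sub-corpus.
profiles = train({
    "work": ["praca zarobki etat", "szef praca urlop"],
    "family": ["matka ojciec dzieci", "dom rodzina dzieci"],
})
```

For Polish, comparing lemmas instead of raw word forms would likely matter a great deal because of rich inflection; the sketch skips that step.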
GeTClasS – Generalised Text
Classification for Sociology
Literary Map
 Users: Digital Humanities Centre of The Institute of Literary
Research (Polish Academy of Sciences)
 Goal
 Support for using maps in literary criticism
 Tool for the identification of all geographical names in the
literary text (or a corpus) and mapping them onto a
geographical map
 Tasks
1. Identification and semantic classification of the referring
language expressions
2. Disambiguation of the referents
3. Mapping the referents onto a map (geo-location)
4. Recognition of the semantic relations and statistical analysis
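Tasks 1 and 3, plus a simple frequency count, can be sketched with a gazetteer lookup. Everything in the example is a hypothetical illustration: the mini-gazetteer of inflected forms, the approximate coordinates and the function name are assumptions, and disambiguation of referents (task 2) is not attempted.

```python
# Hypothetical mini-gazetteer: inflected surface form -> base place name.
SURFACE_TO_PLACE = {
    "warszawa": "Warszawa", "warszawie": "Warszawa", "warszawy": "Warszawa",
    "wrocław": "Wrocław", "wrocławiu": "Wrocław",
}
# Approximate coordinates (latitude, longitude), for illustration only.
COORDINATES = {"Warszawa": (52.23, 21.01), "Wrocław": (51.11, 17.03)}

def locate_places(text):
    """Return (place, lat, lon, count) for every gazetteer name in the text."""
    counts = {}
    for raw in text.split():
        token = raw.strip(".,;:!?\"()").lower()
        place = SURFACE_TO_PLACE.get(token)
        if place:
            counts[place] = counts.get(place, 0) + 1
    return [(place, *COORDINATES[place], n) for place, n in counts.items()]

sample = "Mieszkał w Warszawie, potem we Wrocławiu. Wrócił do Warszawy."
points = locate_places(sample)
```

The resulting tuples are the kind of data a map layer could plot, with the count controlling marker size.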
Conclusions
 Application of LT to research in Humanities & Social
Sciences seems to be much more challenging than in
commercial systems!
 LT for Polish has reached a stage at which valuable support
can be provided for research applications
 Bi-directional approach combines
 development of the basic, universal set of language tools and
resources
 with inspirations from the research applications
CLARIN-PL
Thank you very much for your attention!
www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher
Education [CLARIN-PL]
Bi-directional: bottom-up part
PALC 2014
Łódź
2014-11-22
CLARIN-PL
 LRTs and LRT chains can be useful …
 if the required tools and resources exist,
 and they are robust!
 What is the minimal set of LRTs?
 What kind of LRTs can be called robust?
 automated applications in H&SS seem to require high quality
of language tools and mostly large coverage of resources
 BLARK – The Basic Language Resource Kit
 “the minimal set of language resources that is necessary to do
any precompetitive research and education at all” (Krauwer,
2003) and also basic processing chains
 possible reference point to compare LRTs for different
languages