plWordNet - Clarin PL

A Multilayer System of Lexical
Resources for Language
Technology Infrastructure
Paweł Kędzia, Michał Marcińczuk,
Marek Maziarz, Maciej Piasecki,
Adam Radziszewski, Ewa Rudnicka
G4.19 Research Group
Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl
System of Lexico-semantic Resources
Lexicon of lexico-syntactic structures of
multi-word expressions
plWordNet 3.0 (Słowosieć 3.0)
plWordNet 3.0 to WordNet 3.1 mapping
Semantic lexicon of proper names
Mapping to an ontology
Valency lexicon linked to plWordNet
System of Lexico-semantic Resources
[Diagram of the linked resources: Valence lexicon, MWE lexicon, plWordNet 3.0, WordNet 3.1 + extension, Proper Names, Ontology: SUMO + intermediate level; one link is labelled "describes"]
Wordnet
{ samochodzik 2 `small car’ }
diminutiveness
{samochód 1, pojazd samochodowy 1,
auto 1, wóz 1 `car, automobile’ }
meronymy
hypernymy/hyponymy
{bagażnik 1 `boot’ }
{pogotowie 3, karetka 1, sanitarka 1,
karetka pogotowia 1 `ambulance’ }
plWordNet 2.2
Synset and constitutive relations
Synset as a notational convention
for a group of lexical units sharing certain relations
represents synonyms
{afekt 1 `passion’, uczucie 2 `feeling’} hypernym
{miłość 1 `love’, umiłowanie 1 `affection’ , kochanie 1
`loving’}
This is based on constitutive relations
Additional distinctions: stylistic register and aspect
Minimal commitment principle: make as few assumptions as possible
plWordNet model:
non-relational aspects
Constitutive features
stylistic registers,
verb aspect
and semantic verb classes
Referred to in the relation definitions
e.g. relations limited to verbs of the same
aspect and semantic class
Glosses help wordnet editors
Usage examples: direct links to the corpus
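A minimal sketch of how a relation definition might refer to constitutive features. The `LexicalUnit` fields, the class label "state", the sense numbers and the `may_link` helper are illustrative assumptions, not plWordNet's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexicalUnit:
    lemma: str
    sense: int                        # sense numbers below are made up
    pos: str                          # "noun", "verb", "adj", ...
    aspect: Optional[str] = None      # "perf" / "imperf", verbs only
    verb_class: Optional[str] = None  # hypothetical semantic class label


def may_link(source: LexicalUnit, target: LexicalUnit) -> bool:
    """Accept a link only between verbs of the same aspect and semantic class."""
    if source.pos != "verb" or target.pos != "verb":
        return False
    return (source.aspect == target.aspect
            and source.verb_class == target.verb_class)


burn = LexicalUnit("palić się", 1, "verb", "imperf", "state")
light_up = LexicalUnit("zapalić się", 1, "verb", "perf", "state")
burn_again = LexicalUnit("palić się", 2, "verb", "imperf", "state")

print(may_link(burn, burn_again))  # same aspect and class
print(may_link(burn, light_up))    # aspects differ
```

Relations that deliberately cross aspect, such as inchoativity, would need their own definition rather than this check.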
plWordNet: Constitutive Relations
• Traditional wordnet relations
• e.g. tiger 'Panthera tigris' -hyponymy → tiger
meronymy, cause, instance of (only Proper
Names)
• Additional constitutive relations
• e.g., swallow -verb meronymy→ eat
preceding, presupposition,
gradation (only Adjectives)
• Number: 10 and about 40 subtypes
plWordNet: Relations of Lexical Units
• Traditional wordnet relations
• e.g. antonymy, fuzzynymy
• Additional relations
• converse
plWordNet: derivationally based
lexico-semantic relations
• Number: 8 and about 40 subtypes
• For instance,
góral ‘highlander’
–inhabitant– góry ‘highlands’
zapalić się (perfective) `light, start burning’
–inchoativity–
palić się (imperfective) `burn, produce light’
chamieć (imperfective) `to become a boor’
–process–
cham `boor’
State: plWordNet on 22nd Nov. 2014
Number of       plWN      PWN 3.1   enWN      plWN 3.0
lemmas          156,360   155,593   157,691   ~185,000
lexical units   220,848   206,978   209,329   >260,000
synsets         164,233   117,659   119,441   ~195,000
• plWN – plWordNet (the version: 20th Nov. 2014)
• PWN 3.1 – Princeton WordNet 3.1
• enWN 0.1 – PWN 3.1 expanded in CLARIN-PL
(20th Nov. 2014)
• plWN 3.0 – the target size of plWN
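The figures on this slide can be turned into simple density ratios. The sketch below assumes the numbers are assigned to resources as written in the `stats` dictionary (lemmas, lexical units and synsets for plWN, PWN 3.1 and enWN); the ratio names are illustrative, and the plWN 3.0 targets are left out because they are approximate.

```python
# Counts as read from the slide; the assignment of figures to cells is an assumption.
stats = {
    "plWN":    {"lemmas": 156_360, "lexical_units": 220_848, "synsets": 164_233},
    "PWN 3.1": {"lemmas": 155_593, "lexical_units": 206_978, "synsets": 117_659},
    "enWN":    {"lemmas": 157_691, "lexical_units": 209_329, "synsets": 119_441},
}


def ratios(row):
    """Average number of senses per lemma and of lexical units per synset."""
    return {
        "lus_per_lemma": row["lexical_units"] / row["lemmas"],
        "lus_per_synset": row["lexical_units"] / row["synsets"],
    }


for name, row in stats.items():
    r = ratios(row)
    print(f"{name:8s} LUs/lemma={r['lus_per_lemma']:.2f} "
          f"LUs/synset={r['lus_per_synset']:.2f}")
```

Under this reading, plWN has fewer lexical units per synset than PWN 3.1, which fits synsets that are delimited only by shared constitutive relations.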
Lexicon of multi-word expressions
Non-trivial morphology of Polish MWEs
more than 100 nominal structural patterns
Description of the lexico-syntactic
structures of MWEs
Multi-word LUs as semantic atoms
no internal semantic relations
Dynamic lexicon
a tool for automatic MWE extraction
60 000 described in the lexicon and plWordNet
Multi-word lexical units
Dictionary of MWLUs
goal: 60k entries
semantic description by mapping to plWordNet
syntactic description by WCCL constraints
Criteria of distinguishing MWLUs
* collocations that are:
• terms,
• non-compositional expressions,
• syntactically fixed expressions.
Example: gęśliki podhalańskie
'~fiddle'
<mwegroup type='fix' name='SubstAdjPlFix' class='subst'>
  <condition>
    and(
      inter(base[0],$s:S),
      inter(nmb[0], {pl}),
      inter(base[1],$s:A),
      inter(class[1],{adj,ppas,pact}),
      inter(class[0],{subst,ger,depr}),
      agrpp(0,1,{nmb,gnd,cas}),
      setvar($Pos1, 0),
      setvar($Pos2, 1)
    )
  </condition>
  <instances>
    <MWE base='gęśliki podhalańskie'>
      <head>in(class[0],{subst,ger,depr})</head>
      <var name='S'>gęśliki</var>
      <var name='A'>podhalański</var>
    </MWE>
  </instances>
</mwegroup>
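The constraint above can be read as a check on two adjacent tokens. The following is a simplified Python sketch of that reading, not the WCCL engine: it assumes fully disambiguated tokens (WCCL's `inter` works on sets of possible interpretations), and the `Token` class, the function name and the gender/case tag values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    orth: str   # surface form
    base: str   # lemma
    cls: str    # grammatical class, e.g. "subst", "adj"
    nmb: str    # number: "sg" or "pl"
    gnd: str    # gender (tag values here are illustrative)
    cas: str    # case


NOUN_CLASSES = {"subst", "ger", "depr"}
ADJ_CLASSES = {"adj", "ppas", "pact"}


def matches_subst_adj_pl_fix(t0, t1, s_base, a_base):
    """Plural noun followed by an agreeing adjective, with fixed lemmas S and A."""
    return (
        t0.base == s_base
        and t0.nmb == "pl"
        and t1.base == a_base
        and t1.cls in ADJ_CLASSES
        and t0.cls in NOUN_CLASSES
        # agrpp(0,1,{nmb,gnd,cas}): agreement in number, gender and case
        and (t0.nmb, t0.gnd, t0.cas) == (t1.nmb, t1.gnd, t1.cas)
    )


noun = Token("gęśliki", "gęśliki", "subst", "pl", "m3", "nom")
adj = Token("podhalańskie", "podhalański", "adj", "pl", "m3", "nom")
print(matches_subst_adj_pl_fix(noun, adj, "gęśliki", "podhalański"))
```

The `setvar` calls of the original condition, which record the positions of the matched tokens, are omitted from the sketch.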
plWordNet to WordNet 3.1 mapping
plWordNet: built independently to obtain a faithful
description
Manual mapping
bottom-up order
comparison of the relation structures
a cascading list of inter-lingual relations
plWordNet verification as an important side effect
Present state: 113,265 N and Adj synsets mapped
Target: complete plWordNet 3.0 mapped
Hierarchy of inter-lingual relations
• Inter-lingual Synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy
for parts, elements or materials of bigger wholes
• I-holonymy
for a whole made of smaller parts, elements or
materials
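One way to read the cascade is: try the relations in the listed order and take the first one that applies. The sketch below assumes exactly that reading; the function names, the synset identifiers and the way the "only one per synset" rule is enforced are illustrative assumptions, not the WordnetLoom implementation.

```python
# Relations in the order given on the slide, strongest first.
I_RELATIONS = [
    "Inter-lingual synonymy",
    "Inter-lingual inter-register synonymy",
    "I-partial synonymy",
    "I-hyponymy",
    "I-hypernymy",
    "I-meronymy",
    "I-holonymy",
]


def choose_relation(applicable):
    """Return the first relation in the cascade that the editor judged applicable."""
    for relation in I_RELATIONS:
        if relation in applicable:
            return relation
    return None


def add_mapping(mappings, pl_synset, en_synset, relation):
    """Record a link, allowing at most one inter-lingual synonymy per synset."""
    links = mappings.setdefault(pl_synset, [])
    if relation == I_RELATIONS[0] and any(r == relation for _, r in links):
        raise ValueError(f"{pl_synset} already has an inter-lingual synonym")
    links.append((en_synset, relation))


mappings = {}
best = choose_relation({"I-meronymy", "I-hyponymy"})
add_mapping(mappings, "pl:example", "en:example", best)
print(best, mappings)
```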
WordnetLoom: editing the mapping
NELexicon 2.0
• NELexicon 2.0 is a dictionary of proper
names containing 2.3 million entries.
• The hierarchy of proper name categories is
based on Sekine's Extended Named Entity
Hierarchy [http://nlp.cs.nyu.edu/ene/].
– 7 top-level categories: event, facility,
living, location, organization, product,
other,
– 3-level hierarchy,
– 107 fine-grained categories.
NELexicon 2.0: hierarchy (fragment)
nam_loc (location)
nam_loc_astronomical
nam_loc_country_region
nam_loc_gpe
nam_loc_gpe_admin
nam_loc_gpe_city
nam_loc_gpe_conurbation
nam_loc_gpe_country
nam_loc_gpe_district
nam_loc_gpe_subdivision
nam_loc_historical_region
nam_loc_hydronym
nam_loc_hydronym_bay
nam_loc_hydronym_lagoon
nam_loc_hydronym_lake
nam_loc_hydronym_ocean
nam_loc_hydronym_river
nam_loc_hydronym_sea
nam_loc_land_cape
nam_loc_land_continent
nam_loc_land_desert
nam_loc_land_island
nam_loc_land_mountain
nam_loc_land_peak
nam_loc_land_peninsula
…
nam_fac (facility)
nam_fac_bridge
nam_fac_cossroad
nam_fac_goe
nam_fac_goe_market
nam_fac_goe_stop
nam_fac_park
nam_fac_road
nam_fac_square
nam_fac_system
NELexicon 2.0 – statistics
Coarse-grained categories breakdown
[Pie chart over the top-level categories Event, Organization, Facility, Other, Living, Product and Location; slice labels: 4%, 1%, 6%, 0%, 30%, 43%, 16%]
Top 10 fine-grained categories
Count     Category
450 351   nam_org_company
418 786   nam_org_organization
371 390   nam_liv_person_last
281 013   nam_liv_person
197 197   nam_loc_gpe_city
72 537    nam_loc_gpe_admin3
44 184    nam_fac_road
34 156    nam_loc_astronomical
28 629    nam_fac_other
23 153    nam_org_institution
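The top-10 counts can be rolled up to coarse-grained categories. The sketch below pairs each count with the category name listed next to it and assumes that the second underscore-separated field of a name (`org`, `liv`, `loc`, `fac`) is the top-level code; the `coarse` helper is illustrative. The result covers only these ten categories, not the whole lexicon.

```python
from collections import Counter

# (count, fine-grained category) pairs as listed on the slide.
TOP10 = [
    (450_351, "nam_org_company"),
    (418_786, "nam_org_organization"),
    (371_390, "nam_liv_person_last"),
    (281_013, "nam_liv_person"),
    (197_197, "nam_loc_gpe_city"),
    (72_537, "nam_loc_gpe_admin3"),
    (44_184, "nam_fac_road"),
    (34_156, "nam_loc_astronomical"),
    (28_629, "nam_fac_other"),
    (23_153, "nam_org_institution"),
]


def coarse(category):
    """Top-level code, assuming names of the form nam_<top>_<rest>."""
    return category.split("_")[1]


totals = Counter()
for count, category in TOP10:
    totals[coarse(category)] += count

for code, total in totals.most_common():
    print(f"{code}: {total}")
```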
Mapping to ontology
Ontology: unambiguous concepts defined formally
Lexical meanings
imprecisely delimited
constrained by usage, stylistic register and sentiment
Mapping to ontology
precise, formal description for meanings
association: concepts – their lexical embodiment
SUMO selected
Princeton WordNet mapping
Semi-automated mapping of plWordNet
SUMO Ontology
• SUMO – Suggested Upper Merged Ontology,
– Available under the General Public Licence,
– Contains ~25 000 concepts and ~80 000 axioms,
– Concepts are connected by one of the relations: subclass,
subrelation, instance, subAttribute.
– Each concept has a formal definition written in the SUO-KIF
language:
(<=>
  (exists (?BUILD)
    (and
      (instance ?BUILD Constructing)
      (result ?BUILD ?ARTIFACT)))
  (instance ?ARTIFACT StationaryArtifact))
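SUO-KIF formulas are parenthesised prefix expressions, so a few lines of Python are enough to load one into nested lists. This is a minimal reader written for illustration, not a SUMO tool: it handles only parentheses and whitespace-separated symbols (no strings or comments), and the axiom text is the one from the slide with its parentheses balanced.

```python
AXIOM = """
(<=>
  (exists (?BUILD)
    (and
      (instance ?BUILD Constructing)
      (result ?BUILD ?ARTIFACT)))
  (instance ?ARTIFACT StationaryArtifact))
"""


def tokenize(text):
    """Split into '(' , ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse exactly one expression into nested Python lists."""
    def read(pos):
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        if token == ")":
            raise ValueError("unexpected ')'")
        if token != "(":
            return token, pos + 1
        items, pos = [], pos + 1
        while True:
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            if tokens[pos] == ")":
                return items, pos + 1
            item, pos = read(pos)
            items.append(item)

    expr, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the expression")
    return expr


def variables(expr):
    """Collect the ?-prefixed variable names used in an expression."""
    if isinstance(expr, str):
        return {expr} if expr.startswith("?") else set()
    return set().union(*(variables(e) for e in expr)) if expr else set()


tree = parse(tokenize(AXIOM))
print(tree[0], sorted(variables(tree)))
```

An unbalanced formula makes `parse` raise `ValueError`, which is a cheap sanity check for axioms copied from slides or papers.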
plWordNet mapping to SUMO
Applications
A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciołek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and
Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution
About 600 registered users, ~70 declared commercial applications
Conclusions
• plWordNet 2.2 – a national wordnet not
translated from Princeton WordNet
• plWordNet 2.2.1 is larger than WordNet 3.1
in size, as well as in lexical coverage, hypernymy depth
and relation density
• Synset membership depends only on constitutive
relations between lexical units.
• A unique mapping strategy and a unique
opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
– a comprehensive wordnet of Polish
– 185k lemmas and 260k LUs, mapped to enWN
Thank you!
www.plwordnet.pwr.wroc.pl
NELexicon 2.0 – sources
• Sources:
– NELexicon 1.0 (1.4 million),
– Wikipedia infoboxes (manually created
mapping for 970 infobox attributes),
– Wikipedia internal links (base forms for
inflected forms),
– Names recognized by Liner2 in
Wikipedia (statistical model for NER for
Polish),
– Inflected forms from Wiktionary.
Features for the mapping rules
• Interlingual relation between plWordNet and
WordNet: i-synonymy, i-hyponymy, i-part-of-meronymy, ...
• Mapping relation between WordNet and SUMO:
equivalent, instance of and subsumed.
• Domains of plWordNet and WordNet synsets:
body, grp, food, loc, . . .
• Capital letter in the first lemma of a plWordNet
synset.
• SUMO concept: Currency, GroupOfPeople,
FieldOfStudy, Human, . . .
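The features listed above could be collected into one record per plWordNet synset and fed to mapping rules. The slide does not give the rules themselves, so the two rules in `propose_mapping` are hypothetical examples, and the function names, the record layout and the sample inputs are assumptions made for illustration.

```python
def extract_features(pl_lemmas, pl_domain, i_relation,
                     en_domain, sumo_relation, sumo_concept):
    """Bundle the features named on the slide into one record."""
    return {
        "i_relation": i_relation,        # plWordNet -> WordNet relation
        "sumo_relation": sumo_relation,  # WordNet -> SUMO relation
        "pl_domain": pl_domain,
        "en_domain": en_domain,
        "capitalised": pl_lemmas[0][:1].isupper(),
        "sumo_concept": sumo_concept,
    }


def propose_mapping(features):
    """Hypothetical rules, NOT the rules used for plWordNet.

    1. An i-synonym inherits the WordNet synset's SUMO link unchanged.
    2. An i-hyponym of a synset that is equivalent to, or subsumed by,
       a concept is itself subsumed by that concept.
    """
    relation = features["i_relation"]
    if relation == "i-synonymy":
        return features["sumo_relation"], features["sumo_concept"]
    if (relation == "i-hyponymy"
            and features["sumo_relation"] in {"equivalent", "subsumed"}):
        return "subsumed", features["sumo_concept"]
    return None  # leave the synset for manual mapping


# Made-up sample input for illustration.
sample = extract_features(["Wrocław"], "loc", "i-synonymy",
                          "loc", "instance of", "City")
print(sample["capitalised"], propose_mapping(sample))
```

Returning `None` for uncovered cases matches the semi-automated setting: whatever the rules cannot decide is left to a human editor.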
Constitutive relations
• Synset = a group of lexical units which share all
constitutive relations
• Constitutive relation = a lexico-semantic relation
which
– is frequent enough
– and frequently shared by groups
Also
– is established in linguistics
– and accepted in the wordnet tradition
• Examples: hypernymy, meronymy, cause
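The definition of a synset as a group of lexical units that share all constitutive relations can be sketched as a grouping step. The triples below are toy input built from lexical units shown on earlier slides, with relation targets identified by one representative unit; the function name and the data layout are assumptions, not the plWordNet editing procedure.

```python
from collections import defaultdict


def group_into_synsets(relations):
    """Group lexical units whose sets of (relation, target) pairs are identical."""
    profile = defaultdict(set)
    for unit, relation, target in relations:
        profile[unit].add((relation, target))

    groups = defaultdict(list)
    for unit, pairs in profile.items():
        groups[frozenset(pairs)].append(unit)
    return sorted(sorted(units) for units in groups.values())


# Toy constitutive relation instances (illustrative only).
relations = [
    ("miłość 1", "hypernym", "uczucie 2"),
    ("umiłowanie 1", "hypernym", "uczucie 2"),
    ("kochanie 1", "hypernym", "uczucie 2"),
    ("karetka 1", "hypernym", "samochód 1"),
]

for synset in group_into_synsets(relations):
    print("{" + ", ".join(synset) + "}")
```

In this toy input the three units with the same hypernym end up in one group, while `karetka 1`, whose relation profile differs, forms its own.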
Applications
Strong universal basis
a comprehensive wordnet >200 000 lemmas
resulting in ~285 000 LUs and ~210 000 synsets
one of the largest ever Polish dictionaries
Modularly constructed toolkit
a layered architecture of large software systems
separate but linked layers
each layer is exchangeable and based on a limited set of
notions and principles
The core of the CLARIN-PL language technology
infrastructure