Lemmatize - LemmaGen

http://kt.jis.si
Matjaž Juršič, Igor Mozetič, Nada Lavrač
1/13
WHAT IS LEMMATIZATION?
process of determining a lemma of a given word
LEMMA
- canonical form of a word
- usually corresponds to a
headword in a dictionary
WORD
pišem
piše
pisali
pisali
pisalom
LEMMA
pisati
pisati
pisati
pisalo
pisalo
MOTIVATION
- important step during preprocessing text for
majority knowledge discovery methods
2/13
WHAT ARE RIPPLE DOWN RULES (RDR)?
- incremental knowledge acquisition methodology
- knowledge representation formalism
DATA STRUCTURE
- tree-like decision structure
- single node is extended if-then rule
- if <A> then <E>
except if <A&B> then <F>
except if <A&C> then <G>
else if <D> then <H>
3/13
RDR ON LEMMATIZATION DOMAIN
if <condition> then <conclusion>
except <exception sub-rules>
else <else sub-rule>
CONDITION
comparison to word’s suffix
CONCLUSION
replace original suffix of a word with another suffix to
generate lemma
EXCEPTION SUB-RULES
more specific rules (suffix in condition is longer)
ELSE SUB-RULE
not needed in this domain
4/13
EXAMPLE OF RDR TREE FOR LEMMATIZATION
1
( if suffix = “”, then “”->“” ) except
1.1 ( if suffix = “i”, then “i”->“o“ ) except
1.1.1 ( if suffix = “li”, then “li”->“ti“ )
1.1.2 ( if suffix = “ni”, then “ni”->“ti“ )
1.1.3 ( if suffix = “ti”, then “”->““ )
1.1.4 ( if suffix = “ši”, then “ši”->“sati“ )
1.2 ( if suffix = “l”, then “l”->“ti“ )
1.3 ( if suffix = “mo”, then “”->““ ) except
1.3.1 ( if suffix = “šemo”, then “šemo”->“sati“ )
1.3.2 ( if suffix = “šimo”, then “šimo”->“sati“ )
5/13
LEMMAGEN ARCHITECTURE
lexicon of previously
lemmatized words
RDR
LEARNER
LEMMATIZER
BUILDER
text that we want
to lemmatize
RDR
LEMMATIZER
RDR tree
lemmatized
text
6/13
RIPPLE DOWN RULE LEARNING
RDR ALGORITHM
- strictly follows original RDR methodology
- Plisson, J., Lavrač, N. in Mladenid, D. A rule based
approach to word lemmatization
LEMMAGEN ALGORITHM
- non incremental learning
- considers previously unused information
- improved learning efficiency
- higher accuracy of learned rules
7/13
LEXICONS STATISTICS
MULTEXT
MULTEXT- EAST
LANGUAGE
SLOVENIAN
SERBIAN
BULGARIAN
CZECH
ENGLISH
ESTONIAN
FRENCH
HUNGARIAN
ROMANIAN
ENGLISH
FRENCH
GERMAN
ITALIAN
SPANISH
NO. OF
RECORDS
557.970
20.294
55.200
184.628
71.784
135.094
306.795
64.042
428.194
66.216
306.795
233.858
145.530
510.709
MORPH.
FORMS
198.507
16.907
40.910
57.391
48.460
89.591
232.079
51.095
352.279
43.371
232.079
51.010
115.614
474.158
LEMMAS
16.389
8.392
22.982
23.435
27.467
46.933
29.446
28.090
39.359
22.874
29.446
10.655
8.877
13.236
NO. OF DIFFERENT
MORPH. FORMS
MORPH. SPECS
PER LEMMA
2.083
906
338
1.428
135
643
380
619
616
133
380
227
247
264
12,63
2,07
1,95
2,55
1,80
2,19
8,01
2,03
9,35
1,93
8,01
4,87
13,85
36,07
LEMS PER
MORPH. FORM
1,0430
1,0285
1,1002
1,0441
1,0206
1,1507
1,0164
1,1209
1,0447
1,0182
1,0164
1,0174
1,0636
1,0069
8/13
EVALUATION
CROSS VALIDATION
- 10 times 5-fold for each lexicon
- both algorithms (RDR & LemmaGen)
PROBLEM
how to split a lexicon on train and test set?
3 different splits:
- known words (optimistic)
- unknown words (pessimistic)
- random test words (approx. realistic)
9/13
EVALUATION RESULTS
ACCURACY (%)
MULTEXT
MULTEXT- EAST
LANGUAGE
SLOVENIAN
SERBIAN
BULGARIAN
CZECH
ENGLISH
ESTONIAN
FRENCH
HUNGARIAN
ROMANIAN
ENGLISH
FRENCH
GERMAN
ITALIAN
SPANISH
KNOWN WORDS
(OPTIMISTIC)
RDR
L-GEN ERRORS
95,35
97,61
-48,6
94,36
97,86
-62,1
91,22
93,68
-28,0
96,61
97,89
-37,8
97,75
98,84
-48,3
86,81
89,51
-20,5
96,72
98,80
-63,5
90,23
91,88
-16,9
94,96
96,75
-35,6
98,20
99,00
-44,5
96,72
98,80
-63,5
95,88
98,70
-68,5
93,75
95,58
-29,2
99,10
99,48
-42,1
RANDOM TEST WORDS
(REALISTIC)
RDR
L-GEN ERRORS
92,59
94,38
-24,1
70,34
73,49
-10,6
74,52
76,10
-6,2
92,77
93,66
-12,3
92,05
93,07
-12,8
73,52
73,93
-1,6
91,78
92,94
-14,1
74,82
74,33
2,0
78,16
79,17
-4,6
93,29
94,14
-12,7
91,79
92,95
-14,2
95,06
97,13
-41,9
85,87
86,08
-1,5
94,65
95,73
-20,1
UNKNOWN WORDS
(PESSIMISTIC)
RDR
L-GEN ERRORS
80,68
82,12
-7,5
64,26
65,85
-4,5
69,29
71,52
-7,2
78,09
81,13
-13,9
89,27
91,03
-16,4
66,69
66,54
0,5
86,80
88,22
-10,8
72,73
72,86
-0,5
73,48
74,14
-2,5
90,82
92,48
-18,1
86,85
88,25
-10,7
79,56
84,15
-22,4
82,05
82,11
-0,3
94,32
95,45
-19,9
10/13
COMPARISON OF ALGORITHMS / LANGUAGES
Error percentage on random test words
SPANISH
SERBIAN
ROMANIAN
GERMAN
HUNGARIAN
ITALIAN
FRENCH
ESTONISH
CZECH
BOLGARIAN
ENGLISH
SLOVENIAN
%
5%
10%
15%
LemmaGen
20%
RDR
25%
30%
35%
11/13
ALGORITHMS COMPARISON
Error decrease (LemmaGen compared to RDR algorithm)
SPANISH
SERBIAN
ROMANIAN
GERMAN
HUNGARIAN
ITALIAN
FRENCH
ESTONISH
CZECH
BOLGARIAN
ENGLISH
SLOVENIAN
-10%
%
10%
20%
30%
40%
50%
12/13
CONCLUSION
SET OF UTILITIES FOR LEMMATIZATION
- learning, building: (LemLearn, LemBuild)
- application: (Lemmatize)
- evaluation: (LemXval, LemTest, LemSplit, LemStat)
OPEN SOURCE LEMMAGEN LIBRARY IN C++
- complete lemmatization functionality for integration in
more complex systems
LEMMATIZERS FOR 12 EUROPEAN LANGUAGES
- pre-learned, pre-build and ready to use lemmatizers
- suitability estimation of RDR principle for each language
13/13
FURTHER WORK
IMPROVED LEMMATIZERS
- Higher accuracy using better and larger lexicons
- New lexicons for new languages
ACCESSIBILITY OF DEVELOPED METHODS
- Integration in existing systems for text mining
- Web service for lemmatization
POSSIBLE APPLICATION ON OTHER DOMAINS ???
- Transformation of serialized data between two
similar forms that have different suffix
DISCUSSION
RESULTS AVAILABLE AT
http://kt.ijs.si/software/LemmaGen
CONTACT(CRITICS, SUGGESTIONS, USAGE SUPPORT)
[email protected], [email protected], [email protected]
APPLICATION ON LEXICONS
TWO SETS OF LEXICONS
- Multext (9 languages)
- Multext-east (5 languages)
12 different languages (Slovenian, Serbian, Bulgarian,
Czech, English, Estonian, French, Hungarian, Romanian,
German, Italian, Spanish)
EXAMPLE OF NOT OPTIMIZED RDR TREE
1
( if suffix = “”, then “”->“” ) except
1.1 ( if suffix = “šemo”, then “šemo”->“sati“ )
1.2 ( if suffix = “ši”, then “ši”->“sati“ )
1.3 ( if suffix = “šimo”, then “šimo”->“sati“ )
1.4 ( if suffix = “l”, then “l”->“ti“ )
1.5 ( if suffix = “i”, then “i”->“o“ ) except
1.5.1 ( if suffix = “ni”, then “ni”->“ti“ )
1.5.2 ( if suffix = “ti”, then “”->““ )
HIGHER ACCURACY
WHAT “NEW” INFORMATION DID WE USE?
- Rule coverage
- determines how general/specific rule is
- to choose the best rule conclusion when creating
ambiguous rules (equal words - diff. lemmas)
- Special treatment for cases when whole word is used
as a suffix
avstrijski,
avstrijski,
torek,
koroŠki,
priČja-gripa
avstrijski-koroŠki dvojeziČn
priČja
gripa
zakon,
istospolen,
elektrarna,
registracija
krŠka,
nuklearen
drnovŠek,
slovenski,
republika-janez-drnovŠek,
reprezentanca,
avstrijski,
janz-drnovŠek
poroČal-slovenski
avstrijski-tiskoven-agencija
korupcija,
avstrijski tiskoven
predsednik,
prekupČevanje,
makedonski,
beograjski-tiskoven-agencija,
obisk,
komisija
maedonski-tiskoven-agencija
beograjski-tiskoven
hrvaŠka
tolar,
makedonski-tiskoven
tiskoven-agencija-beta
cena,
hina,
root
tiskoven-agencija-sČg,
dpa,
goriva
banka,
agencija-sČg-tanjug
nemŠka,
ljubljanski
priČja
sČg-tanjug
nemŠkapriČja-gripa
odstotek,
tiskoven
hrvaŠka,
gripa
povzeti,
tiskoven-agencija-hina
st
hina,
hrvaŠka-tiskoven-agencija
tito,
mercator,
napis
nov,
zunanji,
petek
minister,
predsednik,
zakon,
Rupel
Premier
evropski,
drnovsek,
ekoloska,
mejen,
Janez
unija,
janz
epikontinentale
mejen-prehod,
Jansa
eo
n
prehod
WHAT IS LEMMATIZATION?
process of determining lemma of a given word
LEMMA
- canonical form of a word
- usually corresponds to a
headword in a dictionary
WORD
pišem
piše
pisali
pisali
pisalom
LEMMA
pisati
pisati
pisati
pisalo
pisalo
MOTIVATION
- important step during preprocessing text for
majority knowledge discovery methods
STEM
piš
piš
pisa
pisal
pisal
20/15
.