http://kt.jis.si Matjaž Juršič, Igor Mozetič, Nada Lavrač 1/13 WHAT IS LEMMATIZATION? process of determining a lemma of a given word LEMMA - canonical form of a word - usually corresponds to a headword in a dictionary WORD pišem piše pisali pisali pisalom LEMMA pisati pisati pisati pisalo pisalo MOTIVATION - important step during preprocessing text for majority knowledge discovery methods 2/13 WHAT ARE RIPPLE DOWN RULES (RDR)? - incremental knowledge acquisition methodology - knowledge representation formalism DATA STRUCTURE - tree-like decision structure - single node is extended if-then rule - if <A> then <E> except if <A&B> then <F> except if <A&C> then <G> else if <D> then <H> 3/13 RDR ON LEMMATIZATION DOMAIN if <condition> then <conclusion> except <exception sub-rules> else <else sub-rule> CONDITION comparison to word’s suffix CONCLUSION replace original suffix of a word with another suffix to generate lemma EXCEPTION SUB-RULES more specific rules (suffix in condition is longer) ELSE SUB-RULE not needed in this domain 4/13 EXAMPLE OF RDR TREE FOR LEMMATIZATION 1 ( if suffix = “”, then “”->“” ) except 1.1 ( if suffix = “i”, then “i”->“o“ ) except 1.1.1 ( if suffix = “li”, then “li”->“ti“ ) 1.1.2 ( if suffix = “ni”, then “ni”->“ti“ ) 1.1.3 ( if suffix = “ti”, then “”->““ ) 1.1.4 ( if suffix = “ši”, then “ši”->“sati“ ) 1.2 ( if suffix = “l”, then “l”->“ti“ ) 1.3 ( if suffix = “mo”, then “”->““ ) except 1.3.1 ( if suffix = “šemo”, then “šemo”->“sati“ ) 1.3.2 ( if suffix = “šimo”, then “šimo”->“sati“ ) 5/13 LEMMAGEN ARCHITECTURE lexicon of previously lemmatized words RDR LEARNER LEMMATIZER BUILDER text that we want to lemmatize RDR LEMMATIZER RDR tree lemmatized text 6/13 RIPPLE DOWN RULE LEARNING RDR ALGORITHM - strictly follows original RDR methodology - Plisson, J., Lavrač, N. in Mladenid, D. A rule based approach to word lemmatization LEMMAGEN ALGORITHM - non incremental learning - considers previously unused information - improved learning efficiency - higher accuracy of learned rules 7/13 LEXICONS STATISTICS MULTEXT MULTEXT- EAST LANGUAGE SLOVENIAN SERBIAN BULGARIAN CZECH ENGLISH ESTONIAN FRENCH HUNGARIAN ROMANIAN ENGLISH FRENCH GERMAN ITALIAN SPANISH NO. OF RECORDS 557.970 20.294 55.200 184.628 71.784 135.094 306.795 64.042 428.194 66.216 306.795 233.858 145.530 510.709 MORPH. FORMS 198.507 16.907 40.910 57.391 48.460 89.591 232.079 51.095 352.279 43.371 232.079 51.010 115.614 474.158 LEMMAS 16.389 8.392 22.982 23.435 27.467 46.933 29.446 28.090 39.359 22.874 29.446 10.655 8.877 13.236 NO. OF DIFFERENT MORPH. FORMS MORPH. SPECS PER LEMMA 2.083 906 338 1.428 135 643 380 619 616 133 380 227 247 264 12,63 2,07 1,95 2,55 1,80 2,19 8,01 2,03 9,35 1,93 8,01 4,87 13,85 36,07 LEMS PER MORPH. FORM 1,0430 1,0285 1,1002 1,0441 1,0206 1,1507 1,0164 1,1209 1,0447 1,0182 1,0164 1,0174 1,0636 1,0069 8/13 EVALUATION CROSS VALIDATION - 10 times 5-fold for each lexicon - both algorithms (RDR & LemmaGen) PROBLEM how to split a lexicon on train and test set? 3 different splits: - known words (optimistic) - unknown words (pessimistic) - random test words (approx. realistic) 9/13 EVALUATION RESULTS ACCURACY (%) MULTEXT MULTEXT- EAST LANGUAGE SLOVENIAN SERBIAN BULGARIAN CZECH ENGLISH ESTONIAN FRENCH HUNGARIAN ROMANIAN ENGLISH FRENCH GERMAN ITALIAN SPANISH KNOWN WORDS (OPTIMISTIC) RDR L-GEN ERRORS 95,35 97,61 -48,6 94,36 97,86 -62,1 91,22 93,68 -28,0 96,61 97,89 -37,8 97,75 98,84 -48,3 86,81 89,51 -20,5 96,72 98,80 -63,5 90,23 91,88 -16,9 94,96 96,75 -35,6 98,20 99,00 -44,5 96,72 98,80 -63,5 95,88 98,70 -68,5 93,75 95,58 -29,2 99,10 99,48 -42,1 RANDOM TEST WORDS (REALISTIC) RDR L-GEN ERRORS 92,59 94,38 -24,1 70,34 73,49 -10,6 74,52 76,10 -6,2 92,77 93,66 -12,3 92,05 93,07 -12,8 73,52 73,93 -1,6 91,78 92,94 -14,1 74,82 74,33 2,0 78,16 79,17 -4,6 93,29 94,14 -12,7 91,79 92,95 -14,2 95,06 97,13 -41,9 85,87 86,08 -1,5 94,65 95,73 -20,1 UNKNOWN WORDS (PESSIMISTIC) RDR L-GEN ERRORS 80,68 82,12 -7,5 64,26 65,85 -4,5 69,29 71,52 -7,2 78,09 81,13 -13,9 89,27 91,03 -16,4 66,69 66,54 0,5 86,80 88,22 -10,8 72,73 72,86 -0,5 73,48 74,14 -2,5 90,82 92,48 -18,1 86,85 88,25 -10,7 79,56 84,15 -22,4 82,05 82,11 -0,3 94,32 95,45 -19,9 10/13 COMPARISON OF ALGORITHMS / LANGUAGES Error percentage on random test words SPANISH SERBIAN ROMANIAN GERMAN HUNGARIAN ITALIAN FRENCH ESTONISH CZECH BOLGARIAN ENGLISH SLOVENIAN % 5% 10% 15% LemmaGen 20% RDR 25% 30% 35% 11/13 ALGORITHMS COMPARISON Error decrease (LemmaGen compared to RDR algorithm) SPANISH SERBIAN ROMANIAN GERMAN HUNGARIAN ITALIAN FRENCH ESTONISH CZECH BOLGARIAN ENGLISH SLOVENIAN -10% % 10% 20% 30% 40% 50% 12/13 CONCLUSION SET OF UTILITIES FOR LEMMATIZATION - learning, building: (LemLearn, LemBuild) - application: (Lemmatize) - evaluation: (LemXval, LemTest, LemSplit, LemStat) OPEN SOURCE LEMMAGEN LIBRARY IN C++ - complete lemmatization functionality for integration in more complex systems LEMMATIZERS FOR 12 EUROPEAN LANGUAGES - pre-learned, pre-build and ready to use lemmatizers - suitability estimation of RDR principle for each language 13/13 FURTHER WORK IMPROVED LEMMATIZERS - Higher accuracy using better and larger lexicons - New lexicons for new languages ACCESSIBILITY OF DEVELOPED METHODS - Integration in existing systems for text mining - Web service for lemmatization POSSIBLE APPLICATION ON OTHER DOMAINS ??? - Transformation of serialized data between two similar forms that have different suffix DISCUSSION RESULTS AVAILABLE AT http://kt.ijs.si/software/LemmaGen CONTACT(CRITICS, SUGGESTIONS, USAGE SUPPORT) [email protected], [email protected], [email protected] APPLICATION ON LEXICONS TWO SETS OF LEXICONS - Multext (9 languages) - Multext-east (5 languages) 12 different languages (Slovenian, Serbian, Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, German, Italian, Spanish) EXAMPLE OF NOT OPTIMIZED RDR TREE 1 ( if suffix = “”, then “”->“” ) except 1.1 ( if suffix = “šemo”, then “šemo”->“sati“ ) 1.2 ( if suffix = “ši”, then “ši”->“sati“ ) 1.3 ( if suffix = “šimo”, then “šimo”->“sati“ ) 1.4 ( if suffix = “l”, then “l”->“ti“ ) 1.5 ( if suffix = “i”, then “i”->“o“ ) except 1.5.1 ( if suffix = “ni”, then “ni”->“ti“ ) 1.5.2 ( if suffix = “ti”, then “”->““ ) HIGHER ACCURACY WHAT “NEW” INFORMATION DID WE USE? - Rule coverage - determines how general/specific rule is - to choose the best rule conclusion when creating ambiguous rules (equal words - diff. lemmas) - Special treatment for cases when whole word is used as a suffix avstrijski, avstrijski, torek, koroŠki, priČja-gripa avstrijski-koroŠki dvojeziČn priČja gripa zakon, istospolen, elektrarna, registracija krŠka, nuklearen drnovŠek, slovenski, republika-janez-drnovŠek, reprezentanca, avstrijski, janz-drnovŠek poroČal-slovenski avstrijski-tiskoven-agencija korupcija, avstrijski tiskoven predsednik, prekupČevanje, makedonski, beograjski-tiskoven-agencija, obisk, komisija maedonski-tiskoven-agencija beograjski-tiskoven hrvaŠka tolar, makedonski-tiskoven tiskoven-agencija-beta cena, hina, root tiskoven-agencija-sČg, dpa, goriva banka, agencija-sČg-tanjug nemŠka, ljubljanski priČja sČg-tanjug nemŠkapriČja-gripa odstotek, tiskoven hrvaŠka, gripa povzeti, tiskoven-agencija-hina st hina, hrvaŠka-tiskoven-agencija tito, mercator, napis nov, zunanji, petek minister, predsednik, zakon, Rupel Premier evropski, drnovsek, ekoloska, mejen, Janez unija, janz epikontinentale mejen-prehod, Jansa eo n prehod WHAT IS LEMMATIZATION? process of determining lemma of a given word LEMMA - canonical form of a word - usually corresponds to a headword in a dictionary WORD pišem piše pisali pisali pisalom LEMMA pisati pisati pisati pisalo pisalo MOTIVATION - important step during preprocessing text for majority knowledge discovery methods STEM piš piš pisa pisal pisal 20/15 .
© Copyright 2025