EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS

Karine Megerdoomian
University of Maryland, College Park
[email protected]
دومین کارگاه پژوهشی زبان فارسی و رایانه دانشگاه تهران (Second Workshop on Persian Language and Computer, University of Tehran)

Talk Outline
• Persian weblogs
  – Persian is the 4th largest blog language in the world (~75,000 sites)
• Description of a finite-state morphological analyzer for Persian
  – System description
  – Language issues and implementation
• Computational issues in weblogs

Language of Blogs
• Blogs contain both formal and informal morphology
• Morphology
  – Informal text is very different from formal
      formal: مرا گرفته است
      informal: گرفته تم
  – Features that don't exist in formal text
      فروشندهه ؛ رفتش
  – Shortened verbal stems and inflection
      formal: می گویند
      informal: میگن

Language of Blogs
• Morphology
  – Colloquial pronunciation
      غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن
      ازشون ؛ خودتون ؛ نگاه های شان ؛ همسایه اشون
  – Spelling errors and non-standard punctuation and spacing
  – Emoticons and hyperlinks

Language of Blogs
• Lexicon
  – Word forms follow pronunciation
      اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم
  – Colloquial forms
      تو دانشگاه ؛ واسه استادام
  – New words
      لینکدونی ؛ دوستان کامنت گذار

Language of Blogs
• Lexicon
  – Loan words
      چت روم ؛ آن لاین ؛ دان لود کنین
  – Interjections
      آاااخ! ؛ والا ؛ وای ؛ اوووه!
  – More idiomatic expressions
      دمش گرم آقا

Language of Blogs
• Huge amount of variation!!
• Need for flexible rules
• Phonological rules to represent colloquial speech
• Need to disambiguate (statistical component?)
• Formal blog text is also different from traditional formal text

Language of Blogs

  خوابگرد (Khabgard) | BBC
  موافق اند | موافقند
  بیننده گان | بینندگان
  کتاب اش | کتابش
  کم تر | کمتر
  کافی ست | کافیست
  حتا | حتی

Finite-State Transducers (FST)
• Two-level network or transducer
  – Input = lower side of arc
  – Output = upper side of arc

[Figure: a transducer path with upper side "b i r d +Noun +Pl" and lower side "b i r d s"]

MA: System Description
• Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]
• Components:
  – Lexicon and morphology rules (lexc)
  – Phonological rules (regular expressions)
• Compiled into an FST (finite-state transducer)
• An FST for each part of speech is created separately, then composed → final FST for morphological analysis

MA: System Description

[Figure: the Noun FST, Verb FST, and Adverb FST are combined with the phonology rules by composition into the final FST for morphology, which maps an input string to an output string]

MA: System Description
• Coverage: formal Persian language
  – Full verbal conjugation
  – Nonverbal inflection: مسافرین ؛ فقرا
  – Productive derivational morphology: سرسام آور
  – ~20 phonological rules
  – Proper nouns of people, places, organizations

Inflectional Morphology

  LEXICON Root
  ktab     Noun ;

  LEXICON Noun
  +Pl:ha   # ;    ! کتابها
  +Pl:_ha  # ;    ! کتاب ها
  +Sg:0    # ;    ! کتاب
  +Pl:a    # ;    ! کتابا (colloquial)

Complex Tokens
• Two different POS categories in one token
    دردفتر ؛ وگفت ؛ بعقیده شما ؛ اینکار ؛ بهترست

  بعقیده      bh+Prep<eqydh+Noun+Sg
  دردفتر      dr+Prep<dftr+Noun+Sg
  کتابهایمان  ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
  برادرشه     bradr+Noun+Sg>av+Pron+Pers+Poss+1P+Pl>bvdn+Verb+Ind+Pres+3P+Sg

Verbal Morphology
• Two different stems

  Infinitive | Present Stem | Past Stem
  توانستن | توان | توانست
  رفتن | رو | رفت

Verbal Morphology

  LEXICON PastStem
  tvanst        Infl1 ;
  rft           Infl1 ;
  xndyd         Infl1 ;

  LEXICON PstStemBlog
  tvnst         InflBlog1 ;

  LEXICON PresentStem
  tvanst:tvan   Infl2 ;
  rft:rv        Infl2 ;
  xndyd:xnd     Infl2 ;

  LEXICON PrStemBlog
  tvanst:tvn    Infl2 ;
  rft:r         Infl2 ;

Long Distance Dependencies
• Some tenses of the verb can only be determined if we take into account the co-occurrence of the prefix and
  the person inflection / auxiliary → problem for linear approaches

  Morpheme sequences and the resulting forms:
    می (Imperf.) + گذار (Present) + د (Pres.3sg)
        formal: می گذارد
        informal: میذاره
    می (Imperf.) + گذاشت (Past) + '' (Past.3sg)
        formal: می گذاشت
        informal: میذاشت
    می (Imperf.) + گذاشت (Past) + ه (Past.3sg) + است (Pres. Aux.3sg)
        formal: می گذاشته است
        informal: میذاشته

Long Distance Dependencies
• Leads to very complex paths and continuation classes in lexc
• Using filters greatly increases the size of the FST
• Use flag diacritics for unification (@U.Feature.Value@)
  – Keeps the FST small
  – Can apply constraints between non-adjacent morphemes

Phonology Rules
• The form of an affix may change based on the final character of the stem
    Formal: صدایش ؛ همسایه اش / کتابش ؛ چشم هایش
    Informal: صداش ؛ همسایش / کتابش ؛ چشماش

  define clitic1 [ ^NB -> 0 || Cons _ ] ;
  define clitic2 [ ^NB -> y || Vowel _ ] ;
  define clitic3 [ ^NB -> "\u200c" a || e _ ] ;

  (Optional in informal blog text)

  Example inputs: ktab^NBš ؛ Sda^NBš ؛ hmsaye^NBš

Evaluation
• FST: 178,452 states; 928,982 arcs before optimization
• Speed: 20.84 CPU seconds for a 10 MB file on a Sun SPARCstation
• Coverage = 97.5%; Accuracy = 95%
• Unanalyzed tokens: proper nouns + missing lexicon words
• No weblog language rules included yet!

Conclusion
• Challenges in morphological analysis of Persian formal text → solutions in the XFST system
• New issues and variance due to blog language
• Need a robust system:
  – Lexicon updated with colloquial forms
  – Flexible morphological rules + derivational morphology rules
  – Transliteration component for loan words
  – Statistical approach to disambiguate and to deal with unknowns
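
As an illustration of the three clitic-boundary rules on the Phonology Rules slide, here is a minimal Python sketch that imitates them with regular expressions on the transliterated example inputs. It is not XFST code and not the analyzer's implementation. The function names `realize` and `variants`, the vowel set `OTHER_VOWELS`, and the choice to test the `e` context before the general vowel context are assumptions made for this example only.

```python
import re

ZWNJ = "\u200c"        # zero-width non-joiner, as written in the clitic3 rule
OTHER_VOWELS = "aiu"   # hypothetical stand-in for the Vowel class (e is handled separately)


def realize(form: str) -> str:
    """Rewrite the ^NB boundary in a transliterated form (formal spelling)."""
    # clitic3: ^NB -> ZWNJ a, after a stem-final e
    form = re.sub(r"(?<=e)\^NB", ZWNJ + "a", form)
    # clitic2: ^NB -> y, after any other vowel
    form = re.sub(rf"(?<=[{OTHER_VOWELS}])\^NB", "y", form)
    # clitic1: ^NB -> 0, after a consonant (here: any character not listed as a vowel)
    form = re.sub(rf"(?<=[^e{OTHER_VOWELS}])\^NB", "", form)
    return form


def variants(form: str) -> set:
    """Formal result plus the variant where the boundary is simply dropped.

    Assumption: when an optional rule does not fire, ^NB is deleted.
    """
    return {realize(form), form.replace("^NB", "")}


if __name__ == "__main__":
    for word in ("ktab^NBš", "Sda^NBš", "hmsaye^NBš"):
        print(word, "->", realize(word), sorted(variants(word)))
```

With these assumptions, `realize` turns `Sda^NBš` into `Sdayš` (صدایش), and `variants` additionally yields `Sdaš` (صداش). The informal form همسایش on the slide also drops the stem-final vowel, which this sketch does not model.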