In 1882 he enrolled in primary … Ivan Cankar: the greatest master of the Slovenian word. Ivan Cankar was born in Vrhnika (na Klancu) on 10 May 1876 as the eighth child in the declining artisan-proletarian family of a market-town tailor …

Agenda

1 Motivation
2 Information extraction
    Definition
    Related work
    Systems classification
    Conditional random fields (CRF)
3 Information extraction
    Coreference resolution
    Relationship extraction
    Named entity recognition
    Iterative and joint information extraction
4 Further work

Slavko Žitnik & Marko Bajec (FRI) Information Extraction 15th May 2015 7 / 47

Definition

Information extraction
  - a type of information retrieval
  - goal: to automatically extract structured data from (semi-)structured data sources
Preprocessing, followed by an information extraction method
Subtasks
  - named entity recognition
  - relationship extraction
  - coreference resolution

Preprocessing

  Input:                   John is married to Jena . They work at OBI .
  Sentence detection:      John is married to Jena . They work at OBI .
  Tokenization:            John is married to Jena . They work at OBI .
  Lemmatization:           John be marry to Jena . They work at OBI .
  Part-of-speech tagging:  NNP VBZ VBN TO NNP . PRP VBP IN NNP .
  Dependency parsing:      John is married to Jena . They work at OBI .
                           (arc labels in the figure: nsubjpass, auxpass, prep, pobj in the first sentence; nsubj, prep, pobj in the second)

Proposed systems

[Figure: named entity recognition, relationship extraction and coreference resolution applied to a Slovenian example. Entity types: PERSON, LOCATION, ORGANIZATION, DATE, EVENT. Relationships: poročenZ (married to), zaposlenPri (employed at), imaPoklic (has profession), imaDelovnoMesto (has job position). Mentions: Mojca, Mojco, Janez, OBI-ju, mehanik (mechanic).]
Iterative and joint information extraction using an ontology

General approaches

Information extraction
  Pattern-based
    Discovery: seed expansion
    Rules: JAPE, taxonomy label matching
  Machine learning-based
    Probabilistic: HMM, CRF; n-gram; SVM, naive Bayes, ...
    Induction: linguistic, structural

Algorithm selection

  Generative:   Naive Bayes          -> (sequence) Hidden Markov Models  -> (general graph) Generative directed model
  Conditional:  Logistic regression  -> (sequence) Linear-chain CRF      -> (general graph) General CRF
  Each model in the generative row has its conditional counterpart directly below it.

Conditional random fields (CRF)

\[
P(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n} \exp\left( \sum_{j=1}^{m_1} \lambda_j f_j(y_i, x_i) \right) \exp\left( \sum_{j=1}^{m_2} \lambda_j f_j(y_i, y_{i-1}, x_i) \right)
\]

\[
f_1(y_i, y_{i-1}, x_i) =
\begin{cases}
1, & \text{if } x_{i-1} = \text{Mr.},\ y_i = \text{PER},\ y_{i-1} = \text{O} \\
0, & \text{otherwise.}
\end{cases}
\]

[Figure: linear-chain CRF with labels y_1, y_2, y_3, ..., y_n over observations x_1, x_2, x_3, ..., x_n]

  - discriminative model
  - probabilistic label distributions
  - complex interdependent sequences
  - lots of features

About coreference resolution

John is married to Jena. He is a mechanic at OBI and she also works there. It is a DIY market.

Coreference resolution approaches, grouped in the slide by mention representation (mention sequences or mention pairs):
  Unsupervised: [3], [4], [6]
  Supervised: SkipCor, [11], [16], [8], [7], [15], [5], [20], [13], [17], [9], [14], [18]

SkipCor: coreference resolution method

John is married to Jena. He is a mechanic at OBI and she also works there. It is a DIY market.

x = [John, Jena, He, OBI, she, there, It, DIY market].
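
The mention list x can be split into skip-mention sequences by taking every (s+1)-th mention. The following is a minimal Python sketch of that step only, not the authors' implementation. The function names and the gold clustering in `entity_ids` are assumptions introduced for illustration (the clustering is read off the O/C labels drawn on the SkipCor slides), and the tagging rule (C when a mention is coreferent with the previous mention in the same sequence, O otherwise) is likewise inferred from those slides.

```python
from typing import List


def skip_mention_sequences(n_mentions: int, skip: int) -> List[List[int]]:
    """Split mention indices 0..n_mentions-1 into sequences that skip `skip` mentions."""
    step = skip + 1
    return [list(range(start, n_mentions, step))
            for start in range(min(step, n_mentions))]


def coreference_labels(sequence: List[int], entity_ids: List[int]) -> List[str]:
    """Tag a mention C if it shares an entity with the previous mention in the sequence, else O."""
    labels = []
    for pos, mention in enumerate(sequence):
        same = pos > 0 and entity_ids[mention] == entity_ids[sequence[pos - 1]]
        labels.append("C" if same else "O")
    return labels


mentions = ["John", "Jena", "He", "OBI", "she", "there", "It", "DIY market"]
# Assumed gold clustering (hypothetical ids): {John, He}, {Jena, she}, {OBI, there, It, DIY market}
entity_ids = [0, 1, 0, 2, 1, 2, 2, 2]

for skip in range(3):
    for seq in skip_mention_sequences(len(mentions), skip):
        tags = coreference_labels(seq, entity_ids)
        print(skip, [f"{mentions[m]}/{t}" for m, t in zip(seq, tags)])
```

Working on indices rather than mention strings keeps repeated surface forms (for example two occurrences of "he") distinct. In a trained system the labels would be predicted by one sequence model per skip value rather than read from a gold clustering.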

SkipCor: coreference resolution method

[Figure: distribution of distances between two consecutive coreferent mentions, SemEval-2010 data set]

SkipCor: coreference resolution method

x = [John, Jena, He, OBI, she, there, It, DIY market].

Skip-mention sequences, each mention tagged C (coreferent with the preceding mention in the sequence) or O (otherwise):
  Skipping one mention:
    John/O  He/C  she/O  It/O
    Jena/O  OBI/O  there/C  DIY Market/C
  Skipping two mentions:
    John/O  OBI/O  It/C
    Jena/O  she/C  DIY Market/O
    He/O  there/O

SkipCor: coreference resolution method

[Figure: SkipCor pipeline. Input documents, document mention detection, skip-mention sequences (skipping 0, 1, 2, ..., s mentions), model training (Model 0, Model 1, Model 2, ..., Model s), mention labeling, entity clustering.]

SkipCor: results

SemEval2010
                        MUC                 BCubed              CEAF
  System            P     R     F       P     R     F       P     R     F
  SkipCor          68.8  30.1  41.8    94.8  80.8  87.3    74.0  78.5  76.2
  SkipCorZero      67.0   3.6   6.8    99.6  75.1  85.7    73.0  73.1  73.1
  SkipCorPair      76.7  35.6  48.7    97.1  79.0  87.1    72.7  79.4  75.9
  RelaxCor [19]    72.4  21.9  33.7    97.0  74.8  84.5    75.6  75.6  75.6
  SUCRE [10]       54.9  68.1  60.8    78.5  86.7  82.4    74.3  74.3  74.3
  TANL-1 [1]       24.4  23.7  24.0    72.1  74.6  73.4    61.4  75.0  67.6
  UBIU [22]        25.5  17.2  20.5    83.5  67.8  74.8    68.2  63.4  65.7

Results of the proposed SkipCor system, baseline approaches and other systems on the SemEval-2010 data set. Metrics are MUC [21], BCubed [2] and CEAF [12].

SkipCor: results examples, CoNLL 2012

Example 1
  Target entities:
    Entity(Milosevic ’s successor, Vojislav Kostunica, a critic of the War Crimes Tribunal at The Hague, He, Kostunica)
    Entity(Carla Ponte, The chief U.N. war crimes prosecutor)
    Entity(Milosevic ’s, Milosevic, Milosevic, former President Slobodan Milosevic)
  Identified entities:
    Entity(Milosevic ’s, Milosevic ’s successor, Vojislav Kostunica, a critic of the War Crimes Tribunal at The Hague, Milosevic, He, Milosevic, Kostunica, former President Slobodan Milosevic)
    Entity(Carla Ponte)
    Entity(The chief U.N. war crimes prosecutor)

Example 2
  Target entities:
    Entity(Israel and the Palestinians, The two sides)
    Entity(Israel, Israel, Israel)
  Identified entities:
    Entity(Israel and the Palestinians)
    Entity(The two sides)
    Entity(Israel, Israel, Israel)

Example 3
  Target entities:
    Entity(shot, shot, an assassination attempt)
    Entity(Belgrade, Belgrade)
    Entity(The Serbian Prime Minister, Zoran Djindjic, Zoran Djindjic, the Prime Minister, his, he, his, he)
  Identified entities:
    Entity(an assassination attempt)
    Entity(Belgrade, Belgrade)
    Entity(shot, The Serbian Prime Minister, Zoran Djindjic, Zoran Djindjic, the Prime Minister, his, shot, he, his, he)

Example 4
  Target entities:
    Entity(Northern Ireland, Northern Ireland)
    Entity(President Clinton, he, Mr. Clinton ’s, Mr. Clinton, He, he)
  Identified entities:
    Entity(Northern Ireland, Northern Ireland)
    Entity(President Clinton, he, Mr. Clinton ’s, Mr. Clinton, He, he)

Error types in coreference resolution: CoNLL2012

Table: Sums of specific error types for SkipCor on the CoNLL2012-BN-Test data set.

  Error                Occurrences
  Span error                     3
  Missing entity               124
  Extra entity                   0
  Missing mention              255
  Extra mention                  3
  Divided entity               399
  Conflated entities           568

About relationship extraction

[Figure: a relationship connects a subject to an object]

About relationship extraction

John is married to Jena. He is a mechanic at OBI and she also works there. It is a DIY market.
Relationships annotated in the figure: marriedWith, employedAt, employedAt, hasProfession, isA.

Related approaches
  - relationship mention extraction
  - binary classification
  - unsupervised extraction

[Figure: the mention sequence John, Jena, He, mechanic, OBI, she, there, It, DIY market tagged with relationship labels; the labels shown are O, marriedWith, employedAt and isA.]

CRF-based relationship extraction method

[Figure: pipeline. Mention extraction, skip-mention sequences (skipping 0, 1, 2, ..., s mentions), model training (Model 0, Model 1, Model 2, ..., Model s), mention labeling, relation identification and modification. The example output is a gene network over rocG, gerE, sigK, ykvP, comK, sigD, lytC, spoIIG and sigA with the relation types Interaction.Inhibition, Interaction.Transcription, Interaction.Regulation, Interaction.Binding and Site_of.]

Relationship extraction: BioNLP 2013

Processing sieves, from input documents to the identification and modification of relations:
  Preprocessing sieve
  Mention extraction sieve
  CRF-based processing sieves
    (iii) Event extraction sieve
    (iv) Mention processing sieve
    (v) Event processing sieve
    (vi) Gene processing sieve
    (vii) Event-based gene processing sieve
  Rule-based processing sieve
  Data cleaning sieve

  Participant        S    D    I    M    SER
  U. of Ljubljana    8   50    6   30    0.73
  K. U. Leuven      15   53    5   20    0.83
  TEES-2.1           9   59    8   20    0.86
  IRISA-TexMex      27   25   28   36    0.91
  EVEX              10   67    4   11    0.92

BioNLP 2013 GRN challenge official results. The table shows the number of substitutions (S), deletions (D), insertions (I), matches (M) and the “slot error rate” (SER) score.

  - usage of manual rules: regular expressions
  - fine-tuned to 0.67 SER (feature functions, additional processing)

[Figure: resulting gene network over sigK, rocG, comK, ykvP, gerE, sigD, lytC, spoIIG and sigA with the relation types Interaction.Regulation, Interaction.Transcription, Interaction.Inhibition, Interaction.Binding and Site_of.]

Relationship extraction: BioNLP 2013 result

[Figure]

About named entity recognition

John is married to Jena. He is a mechanic at OBI and she also works there. It is a DIY market.
Entity labels shown above the text: Person, Person, Position, Organization (first two sentences); Organization (third sentence).

  - named entities
  - also called “entity extraction”
  - mostly sequence labeling, also multinomial/binomial classification
  - IOB notation (e.g. B-ORG I-ORG)

Token labels for the first sentence:
  John/PERSON  is/O  married/O  to/O  Jena/PERSON  ./O

Named entity recognition: CHEMDNER 2013

[Figure: pipeline. Input text. Preprocessing: token detection, sentence detection, part-of-speech tagging, shallow parsing, lemmatization. Extraction: token-based all CRF, token-based mention CRF, phrase-based all CRF, phrase-based mention CRF. Merging and redundancy elimination. Meta classification. Final results: ranked list of unique mentions (CDI), ranked list of mentions (CEM).]

                             Micro-average
  Model          Pred.     P      R      F
  CDI results
  TS (run1)      12381    0.83   0.75   0.79
  TM (run2)      12783    0.81   0.76   0.78
  TSM (run3)     13172    0.80   0.77   0.79
  CEM results
  TS (run1)      20438    0.87   0.70   0.77
  TM (run2)      21109    0.85   0.71   0.77
  TSM (run3)     21562    0.85   0.72   0.78

Official CHEMDNER 2013 results. The table shows the number of extracted chemical entities and drugs (Pred.), precision (P), recall (R) and F1 score. The models were trained on the training and development data sets and evaluated on the test set. We used token-based (TS), token-based mention (TM) and a combination of both (TSM).
We later improved the results to 84.6 F1 (CDI) and 83.2 F1 (CEM), 4% lower than the best performer.

About iterative and joint information extraction

John is married to Jena. He is a mechanic at OBI and she also works there. It is a DIY market.
Relationships annotated in the figure: marriedTo, employedAt, employedAt, hasProfession, isA.
Entity labels annotated in the figure: Person, Person, Position, Organization (first two sentences); Organization (third sentence).

  - linear-chain CRF for all tasks
  - ontology as a schema, as rules and as a lexicon
  - improvement of specific IE tasks because of the influence of the others
  - additional feature functions

Iterative information extraction system: IOBIE

[Figure: pipeline. Input text. Preprocessing: token detection, sentence detection, part-of-speech tagging, shallow parsing, lemmatization, dependency parsing. Extraction: named entity recognition, relationship extraction, coreference resolution, supported by the ontology. Data integration. Ontology-based output: a graph with the nodes {John, He}, {Jena, she}, {OBI, there, It}, mechanic and DIY market, and the edge labels hasProfession, marriedTo, worksAt (twice) and isA.]

Results: named entity recognition

  Model               Error (%)    CA     MaP    MaR    MaF    MiF
  Independent             -       97.0   54.0   30.4   38.9   90.8
  Second iteration       10.3     97.2   55.0   33.2   41.4   91.6
  Third iteration        15.0     97.8   55.2   33.5   41.7   92.2
  Fourth iteration       15.0     97.8   55.2   33.5   41.7   92.2
  Fifth iteration        15.0     97.8   55.2   33.5   41.7   92.2

Table: Named entity recognition results. Shown measures are error reduction in % (Error), classification accuracy (CA), macro-averaged precision (MaP), macro-averaged recall (MaR), macro-averaged F-score (MaF) and micro-averaged F-score (MiF).

Results: relationship extraction

  Model               Error (%)    P      R      F
  Independent             -       54.3   55.2   54.7
  Second iteration        0.8     55.1   55.6   55.3
  Third iteration         2.4     54.8   55.6   55.2
  Fourth iteration        2.4     55.0   55.4   55.2
  Fifth iteration         2.0     54.2   55.7   54.9

Table: Relationship extraction results. Shown measures are error in % (Error), precision (P), recall (R) and F-score (F).

Results: coreference resolution

  Model               MUC    BCubed   CEAF
  Independent         73.2    73.9    49.8
  Second iteration    73.8    73.5    50.0
  Third iteration     74.0    74.1    52.9
  Fourth iteration    74.3    73.8    52.8
  Fifth iteration     74.3    73.8    52.8

Table: Coreference resolution results. Used measures are MUC [21], BCubed [2] and CEAF [12].

Further work

  - Interdependencies between tasks.
  - Evaluation of “end-to-end” information extraction systems.
  - Weighting of models with respect to the number of skipped mentions.

Thanks!

@szitnik [email protected]

References I

Giuseppe Attardi, Stefano Dei Rossi, and Maria Simi.