Statistical Relational Learning for NLP
Ray Mooney & Razvan Bunescu

Presented by Michele Banko
Outline

• Quick intro to NLP
  – Problems, history
  – Why SRL?
• SRL for two NLP tasks
  – Information Extraction using RMNs
  – Semantic Parsing using ILP
Quick Tour of NLP

• Systems
  – Translation, Question Answering / NL Search, Reading Comprehension
• Sub-tasks
  – Part-of-Speech Tagging, Phrase Finding, Parsing [syntax, semantics], Named-Entity Recognition
Quick Tour of NLP [2]

• Knowledge engineering/linguistics vs. statistics
• Recently, many NLP tasks have been treated as sequence labeling problems
  – POS tagging: The/DT cat/NN can/MD walk/VB
  – NP finding: Stanford/B-NP University/I-NP is/O in/O California/B-NP
  – SRL with one relation: adjacency
• HMMs or CRFs to model, Viterbi to label (see the sketch below)
  – Find the most probable assignment of labels
  – States = part-of-speech tags
  – Compute P(w|t): the probability of emitting word w in a particular state (tag) t
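A minimal sketch of Viterbi decoding for an HMM POS tagger, to make the bullet concrete. The tiny start/transition/emission tables are invented purely for illustration; a real tagger would estimate P(t'|t) and P(w|t) from a tagged corpus.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words`."""
    V = [{}]      # V[i][tag] = best probability of any path ending in `tag` at position i
    back = [{}]   # backpointers for recovering the best path
    for t in tags:
        V[0][t] = start_p[t] * emit_p[t].get(words[0], 1e-6)
        back[0][t] = None
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            prev, score = max(
                ((p, V[i-1][p] * trans_p[p][t]) for p in tags),
                key=lambda x: x[1])
            V[i][t] = score * emit_p[t].get(words[i], 1e-6)
            back[i][t] = prev
    best = max(tags, key=lambda t: V[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "MD", "VB"]
start_p = {"DT": 0.7, "NN": 0.1, "MD": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"DT": 0.01, "NN": 0.9, "MD": 0.04, "VB": 0.05},
    "NN": {"DT": 0.05, "NN": 0.2, "MD": 0.5, "VB": 0.25},
    "MD": {"DT": 0.05, "NN": 0.05, "MD": 0.05, "VB": 0.85},
    "VB": {"DT": 0.4, "NN": 0.3, "MD": 0.1, "VB": 0.2},
}
emit_p = {
    "DT": {"The": 0.6, "the": 0.3},
    "NN": {"cat": 0.4, "walk": 0.1},
    "MD": {"can": 0.8},
    "VB": {"walk": 0.5, "can": 0.1},
}
print(viterbi(["The", "cat", "can", "walk"], tags, start_p, trans_p, emit_p))
# -> ['DT', 'NN', 'MD', 'VB']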
Why SRL?

NLP involves...
• Complex reasoning about entities and the relationships between them
  – Predicate logic
• Resolving & integrating ambiguities on multiple levels (morphology, syntax, semantics, ...)
  – More than just adjacency!
  – Bayesian methods, graphical models
Intro to Information Extraction

• Early 90s: DARPA Message Understanding Conference
  – Identify references to named entities (people, companies, locations, ...)
  – Multi-lingual, multi-document
  – Attributes, relationships, events

Example text:
"Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo."
Attributes
  NAME: Fletcher Maddox, Dr. Maddox
  DESCRIPTORS: former Dean of the UCSD Business School; his father; the firm's CEO
  CATEGORY: PERSON

Facts (PERSON)
  Employee_Of(Fletcher Maddox, UCSD Business School)
  Employee_Of(Fletcher Maddox, La Jolla Genomatics)
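One way to picture the output of such a system is as structured records. The sketch below represents the extracted template above as plain Python data; the field names are illustrative, not any particular MUC system's schema.

# Hypothetical representation of the extracted attributes and facts as plain Python data.
person = {
    "category": "PERSON",
    "names": ["Fletcher Maddox", "Dr. Maddox"],
    "descriptors": ["former Dean of the UCSD Business School", "the firm's CEO"],
}
facts = [
    ("Employee_Of", "Fletcher Maddox", "UCSD Business School"),
    ("Employee_Of", "Fletcher Maddox", "La Jolla Genomatics"),
]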
IE for Protein Identification

• Medline DB: 750K abstracts, 3700 proteins

Example abstract:
"Production of nitric oxide (NO) in endothelial cells is regulated by direct interactions of endothelial nitric oxide synthase (eNOS) with effector proteins such as Ca2+-calmodulin ... which avidly binds to the carboxyl terminal region of the eNOS oxygenase domain."

• Approaches: rule-based, HMM, SVM, MaxEnt
• CRFs outperform the rest (Ramani et al., 2005)
  – But may fail to capture long-distance dependencies
Collective IE using RMNs
(Bunescu & Mooney, 2004)

• Typical IE: extractions are made in isolation
• Want to consider influences between extractions
  – If the context surrounding one occurrence strongly indicates a protein, that should affect the tagging of other occurrences in different contexts
  – Acronyms & their long forms
• Use a sequence-labeling treatment to get all substrings that make up protein names
  – Labels: Start, End, Continue, Unique, Other
• Classify candidate sequences, not individual tokens (see the sketch below)
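A minimal sketch (assumed, not taken from the paper's code) of how protein-name phrases can be recovered from per-token labels in the Start/Continue/End/Unique/Other scheme:

def extract_entities(tokens, labels):
    """Return (start, end, phrase) spans labeled as protein names."""
    spans, i = [], 0
    while i < len(labels):
        if labels[i] == "Unique":           # single-token name
            spans.append((i, i + 1))
            i += 1
        elif labels[i] == "Start":          # multi-token name: Start, Continue*, End
            j = i + 1
            while j < len(labels) and labels[j] == "Continue":
                j += 1
            if j < len(labels) and labels[j] == "End":
                spans.append((i, j + 1))
                i = j + 1
            else:
                i += 1                      # ill-formed label sequence, skip
        else:
            i += 1
    return [(s, e, " ".join(tokens[s:e])) for s, e in spans]

tokens = "the eNOS oxygenase domain binds Ca2+ - calmodulin".split()
labels = ["Other", "Start", "Continue", "End", "Other", "Start", "Continue", "End"]
print(extract_entities(tokens, labels))
# -> [(1, 4, 'eNOS oxygenase domain'), (5, 8, 'Ca2+ - calmodulin')]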
RMNs (Relational Markov Networks)
(Taskar et al., 2002)

• For each document d in the collection:
  – Associate d with a set of candidate entities d.E (see the sketch below)
    ∙ Entities = token sequences = too many possible phrases
    ∙ Either constrain their length or their form (base NPs)
  – Characterize each entity e in d.E with a predefined set of boolean features e.F
    ∙ e.label = 1 if e is a valid extraction
    ∙ e.HeadWord, e.POS_Left, e.BigramRight, e.Prefix
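A sketch of enumerating candidate entities and attaching boolean features, in the spirit of d.E and e.F; the feature names and the max_len cutoff here are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class CandidateEntity:
    tokens: list                              # the token sequence for this candidate phrase
    features: set = field(default_factory=set)
    label: int = 0                            # 1 if a valid extraction, 0 otherwise

def candidates(tokens, max_len=4):
    """Enumerate all token sequences up to max_len as candidate entities."""
    out = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            e = CandidateEntity(tokens[i:j])
            e.features.add("HEAD=" + tokens[j - 1].lower())
            e.features.add("WORD_LEFT=" + (tokens[i - 1].lower() if i else "<s>"))
            if tokens[i][0].isupper():
                e.features.add("PREFIX_CAPITALIZED")
            out.append(e)
    return out

ents = candidates("the eNOS oxygenase domain".split())
print(len(ents), ents[1].tokens, sorted(ents[1].features))
# -> 10 ['the', 'eNOS'] ['HEAD=enos', 'WORD_LEFT=<s>']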
Clique Templates

Find all subsets of entities satisfying a given constraint, then for each subset form a clique c (a sketch follows the list).

• Local templates
  – Number of hidden labels = 1
  – Model correlations between an entity's observed features and its label
• Global templates
  – Number of hidden labels > 1
  – Model influences among multiple entities
  – Overlap template, Repeat template, Acronym template
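A sketch, with assumed data structures, of what matching clique templates against d.E might look like: a local template yields one singleton clique per candidate entity, while a repeat-style global template yields a clique over every pair of candidates with the same surface string.

from itertools import combinations

def local_cliques(entities):
    """One clique (a tuple of entity indices) per entity: its label vs. its own features."""
    return [(i,) for i in range(len(entities))]

def repeat_cliques(entities):
    """One clique per pair of entities whose surface strings match (repeated mentions)."""
    return [(i, j) for i, j in combinations(range(len(entities)), 2)
            if " ".join(entities[i]).lower() == " ".join(entities[j]).lower()]

ents = [["eNOS"], ["effector", "proteins"], ["eNOS"]]
print(local_cliques(ents))   # [(0,), (1,), (2,)]
print(repeat_cliques(ents))  # [(0, 2)]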
Using Local Templates

• Variable nodes: the labels of all candidate entities in a document
• Potential nodes: represent correlations between more than one entity attribute by linking to variable nodes
  – Edges: created by matching clique templates against d.E
RMN by Local Templates

[Figure: each candidate entity's label node e.label is linked, through local-template potentials, to its observed feature nodes.]

Factor Graph by Local Templates

[Figure: factor graph in which potential nodes such as φ.PREFIX=A0, φ.HEAD=enzyme, φ.POSLEFT=NOUN, and φ.WORDLEFT=the connect feature nodes e.f1 = vi, e.f2 = vj, ..., e.fh = vk to the label node e.label.]
Using Global Templates

• Connect the label nodes of two or more entities
• Acronym factor graph example: "The antioxidant superoxide dismutase-1 (SOD1)"

[Figure: acronym factor graph in which an acronym potential φAT links the short form's label to an auxiliary OR node uOR (with potential φOR) over the candidate long-form entities "antioxidant superoxide dismutase-1", "superoxide dismutase-1", and "dismutase-1".]
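A sketch of how an acronym template could pair a parenthesized short form with its candidate long forms, as in the SOD1 example above; the heuristic (look left of the opening parenthesis, up to max_len tokens) is an assumption, not the paper's exact rule.

def acronym_candidates(tokens, acronym_index, max_len=5):
    """For an acronym at tokens[acronym_index] (inside parentheses), return the
    candidate long-form phrases ending just before the opening parenthesis."""
    end = acronym_index - 1          # position of '('
    cands = []
    for length in range(1, max_len + 1):
        start = end - length
        if start < 0:
            break
        cands.append(" ".join(tokens[start:end]))
    return cands

tokens = "The antioxidant superoxide dismutase-1 ( SOD1 )".split()
print(acronym_candidates(tokens, tokens.index("SOD1")))
# -> ['dismutase-1', 'superoxide dismutase-1', 'antioxidant superoxide dismutase-1',
#     'The antioxidant superoxide dismutase-1']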
Inference & Learning in RMNs

• Inference
  – Labels are the only hidden features
  – The probability distribution over hidden entity labels is computed using a Gibbs distribution (written out below)
  – Find the most probable assignment of values to labels using the max-product algorithm
• Learning
  – The structure is defined by the clique templates
  – Find clique potentials that maximize the likelihood over the training data, using the Voted Perceptron
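For reference, the Gibbs distribution over the hidden labels y given the observed features x takes the standard Markov-network form (notation mine, following the clique-template setup above):

P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \prod_{c \in C(\mathbf{x})} \phi_c(\mathbf{x}_c, \mathbf{y}_c)

where C(x) is the set of cliques obtained by matching the templates against the document, φ_c is the potential of clique c, and Z(x) is the partition function that normalizes over all label assignments.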
Experiments

• Datasets
  – Yapex, Aimed: ~200 Medline abstracts and ~4000 protein references each
• Systems
  – LT-RMN: RMN with local + overlap templates
  – GLT-RMN: RMN with local + global templates
  – CRF (McCallum, 2002)
• Evaluation
  – Position-based precision and recall
Results

Yapex
  Method     Precision   Recall   F-Measure
  LT-RMN     70.79       53.81    61.14
  GLT-RMN    69.71       65.76    67.68
  CRF        72.45       58.64    64.81

Aimed
  Method     Precision   Recall   F-Measure
  LT-RMN     81.33       72.79    76.82
  GLT-RMN    82.79       80.04    81.39
  CRF        85.37       75.90    80.36
Intro to Semantic Parsing

• NL input → logical form
• Assignment of semantic roles
  "John gives the book to Mary"
  gives1(subj: John2, dobj: book3, iobj: Mary4)
• Ambiguities
  "Every man loves a woman"
  – Is there one woman loved by all, or ...
    ∃Y ∀X man(X) ^ woman(Y) -> loves(X,Y)
    ∀X ∃Y man(X) ^ woman(Y) -> loves(X,Y)
  – The LF just says:
    loves1(every m1:man(m1), a w1:woman(w1))
Previous Work in Semantic Parsing

• Hand-built / linguistic systems
  – Grammar development
  – NLPwin (MSR)
• Data-driven approaches
  – Tagging problem: for each NP, tag its role
    ∙ Gildea & Jurafsky (2002): NB classifier combination using lexical and syntactic features, previous labels
    ∙ Jurafsky et al. (2004): SVMs
  – Sentence + LF → Parser
    ∙ CHILL: ILP to learn a generalized Prolog parser
ILP in CHILL

• Parser induction = learn the control rules of a shift-reduce parser
  – Shift symbols from the input string onto a stack
  – Reduce items on the stack by applying a matching grammar rule
• Can be encoded as a Prolog program:
    parse(S,Parse) :- parse([],S,[Parse],[]).
• Start with 3 generic operators
  – INTRODUCE pushes a predicate onto the stack based on the word input
    ∙ Lexicon: 'capital' -> capital(_,_)
  – COREF_VARS unifies two variables under consideration
  – DROP_CONJ embeds one predicate as an argument of another
  – SHIFT pushes the word input onto the stack
What is the capital of Texas?

Each step shows the action(s) applied, then the resulting stack and input buffer.

Stack: [answer(_,_):[]]
Input Buffer: [what,is,the,capital,of,texas,?]

SHIFT, SHIFT, SHIFT
Stack: [answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

INTRODUCE
Stack: [capital(_,_):[], answer(_,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

COREF_VARS
Stack: [capital(C,_):[], answer(C,_):[the,is,what]]
Input Buffer: [capital,of,texas,?]

SHIFT, SHIFT
Stack: [capital(C,_):[of,capital], answer(C,_):[the,is,what]]
Input Buffer: [texas,?]

INTRODUCE, COREF_VARS
Stack: [const(S,stateid(texas)):[], capital(C,S):[of,capital], answer(C,_):[the,is,what]]
Input Buffer: [texas,?]

DROP_CONJ, SHIFT, SHIFT
Stack: [answer(C,(capital(C,S),const(S,stateid(texas)))):[?,texas,of,capital,the,is,what]]
Input Buffer: []
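A minimal sketch of the shift-reduce control loop behind the trace above; strings stand in for Prolog terms and there is no unification, so this only illustrates how SHIFT and INTRODUCE manipulate the (stack, buffer) state. The LEXICON entries mirror the slide's example.

LEXICON = {"capital": "capital(_,_)", "texas": "const(_,stateid(texas))"}

def shift(stack, buf):
    """SHIFT: move the next input word onto the top stack entry's word list."""
    (term, words), rest = stack[0], stack[1:]
    return [(term, [buf[0]] + words)] + rest, buf[1:]

def introduce(stack, buf):
    """INTRODUCE: push a predicate for the next input word, without consuming it."""
    return [(LEXICON[buf[0]], [])] + stack, buf

# First two steps of the trace above; in the overly-general parser every applicable
# operator is a choice point, and CHILL's learned control rules decide which one fires.
stack, buf = [("answer(_,_)", [])], ["what", "is", "the", "capital", "of", "texas", "?"]
for _ in range(3):
    stack, buf = shift(stack, buf)
stack, buf = introduce(stack, buf)
print(stack)   # [('capital(_,_)', []), ('answer(_,_)', ['the', 'is', 'what'])]
print(buf)     # ['capital', 'of', 'texas', '?']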
Learning Control Rules

• Operator generation
  – The initial parser is too general and will produce spurious parses
  – Use training data to extend the program
• Example analysis (sketched below)
  – Run the general parser, recording parse states
  – Positive examples
    ∙ Parse states to which an operator should be applied
    ∙ Find the first correct parse of each training pair; the operators used to achieve its subgoals become positive examples
  – Negative examples, for single-parse systems
    ∙ S is a negative example for the current action if S is a positive example for a previous action
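A sketch (the data layout and the use of an explicit operator ordering are my assumptions) of turning a recorded parse trace into control-rule training examples: each state from the first correct parse is positive for the operator applied there, and, following the rule above, that same state becomes a negative example for operators ordered after it.

def collect_examples(trace, operator_order):
    """`trace` is a list of (state, action) pairs from the first correct parse."""
    positives = {op: [] for op in operator_order}
    negatives = {op: [] for op in operator_order}
    for state, action in trace:
        positives[action].append(state)
        # the same state is a negative example for operators tried after `action`
        later = operator_order[operator_order.index(action) + 1:]
        for op in later:
            negatives[op].append(state)
    return positives, negatives

trace = [("state1", "INTRODUCE"), ("state2", "SHIFT")]
pos, neg = collect_examples(trace, ["INTRODUCE", "COREF_VARS", "DROP_CONJ", "SHIFT"])
print(pos["INTRODUCE"], neg["SHIFT"])   # ['state1'] ['state1']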
Induction in CHILL

• Control-rule induction
  – Cover all positive examples and no negative examples
  – Bottom-up: compact the rule set by forming Least-General Generalizations of clause pairs
  – Top-down: overly-general rules are specialized by adding literals
  – Invention of new predicates, e.g.:
      op([X,[Y,det:the]], [the|Z], A, B) :- animate(Y).
      animate(man).  animate(boy).  animate(girl).  ...
• Fold the learned constraints back into the general parser
ProbCHILL: Parsing via ILP Ensembles

• ILP + statistical parsing (SP)
  – ILP: not forced to make decisions based on predetermined features
  – SP: handles multiple sources of uncertainty
• Ensemble classifier
  – Combine the outputs of more than one classifier
  – Bagging, boosting
• TABULATE: generate several ILP hypotheses
  – Use SP to estimate a probability for each potential operator
  – Find the most probable semantic parse via beam search (see the sketch below)
    ∙ P(parse state) = product of the probabilities of the operators used to reach that state
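A minimal beam-search sketch (all names are assumptions, not ProbCHILL's code) of the idea that a parse state's score is the product of the probabilities of the operators used to reach it.

import heapq

def beam_search(initial_state, expand, is_final, beam_width=5, max_steps=50):
    """`expand(state)` yields (operator_probability, next_state) pairs; a state's
    score is the product of the operator probabilities along the path to it."""
    beam, finished = [(1.0, initial_state)], []
    for _ in range(max_steps):
        candidates = []
        for score, state in beam:
            if is_final(state):
                finished.append((score, state))
                continue
            for op_prob, nxt in expand(state):
                candidates.append((score * op_prob, nxt))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    best = finished if finished else beam
    return max(best, key=lambda c: c[0])

In ProbCHILL's setting, expand(state) would propose each operator applicable in that parse state with the probability assigned by the statistical model, and is_final(state) would test for an empty input buffer with a single completed query on the stack.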
Experiments

• U.S. Geography DB with 800 Prolog facts
• 250 questions from 50 humans, annotated with Prolog queries

  "What is the capital of the state with the highest population?"
  answer(C, (capital(S,C), largest(P, (state(S), population(S,P)))))

• 10-fold CV, measuring:

  Recall    = # correct queries produced / # test sentences
  Precision = # correct queries produced / # complete parses produced
Results

System      Recall   Precision   F-measure
Geobase     56.00    96.40       70.85
CHILL       68.50    97.65       80.52
ProbCHILL   80.40    88.16       84.10