Announcements
Homeworks will be returned on Friday.

Context-free Languages I
David L. Dill
Department of Computer Science, Stanford University

Outline
1 Context-free languages
    Definition
    Language of a context-free grammar
    Parse trees
    Ambiguity and precedence
    Relation to other classes of languages

Section: Context-free languages

Subsection: Definition

Introduction to Context-free Grammars
We just saw that the language L = {0^n 1^n | n ≥ 0} is not regular.
But it is easy to describe recursively:
    ε is in L.
    0S1 is in L if S is in L.
There is a general notation for recursive definitions of languages like this:
    S → ε
    S → 0S1
This is called a context-free grammar (CFG).
CFGs describe the context-free languages, a class of languages that is a proper superset of the regular languages.

Context-Free Languages
Context-free languages are more expressive than regular languages, but not as powerful as Turing machine deciders (to be presented later).
Context-free languages were identified in the 1950s by the linguist Noam Chomsky as a natural level in a hierarchy of languages that also includes the regular languages.
They are extremely important in practical systems. They are used to specify all kinds of notations, from programming languages to email header formats.
There are widely used, efficient parser generators that automatically build parsers from context-free grammars. (A parser checks that its input conforms to the grammar and builds a tree capturing the structure of the input.)
I used the “jison” parser generator for JavaScript to implement the parsers for arithmetic (in the first lecture), Boolean formulas (in the truth table), and first-order logic (in the blocks world).

A Context-free Grammar for Arithmetic Expressions
Here is a simple CFG for arithmetic expressions.
    E → E + E
    E → E − E
    E → E ∗ E
    E → −E
    E → (E)
    E → num
Here “num” represents any number. I could have specified the structure of numbers as sequences of digits, too.
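The recursive description of L = {0^n 1^n | n ≥ 0} given above translates almost directly into a recursive membership test. The following is a minimal sketch, not part of the original slides; the function name in_L is made up for illustration.

```python
def in_L(w: str) -> bool:
    """Return True iff w is in L = {0^n 1^n | n >= 0}.

    Mirrors the two rules of the grammar:
      S -> (empty string)   base case
      S -> 0 S 1            recursive case
    """
    if w == "":
        return True  # base case: the empty string is in L
    if len(w) >= 2 and w[0] == "0" and w[-1] == "1":
        return in_L(w[1:-1])  # strip one 0 and one 1, then recurse on the middle
    return False


candidates = ["", "01", "0011", "001", "10", "0101"]
print([w for w in candidates if in_L(w)])
```

Each recursive call undoes one application of S → 0S1, so the depth of the recursion equals n.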
Formal Definition of Context-Free Grammars
A context-free grammar (CFG) is a 4-tuple (V, Σ, R, S), where
    V is a finite set of variables (also called non-terminal symbols),
    Σ is a finite set of terminal symbols,
    R ⊆ V × (V ∪ Σ)∗ is a finite set of rules (also called productions), and
    S ∈ V is the sentence symbol.
V and Σ must be disjoint sets.
Productions are written as A → aBc. A is the left-hand side (LHS), also called the head, and aBc is the right-hand side (RHS), also called the body.

Notation
    a, b, c, 0, 1, . . . are terminals.
    A, B, S, . . . are non-terminals.
    w, x, y, z are terminal strings (members of Σ∗).
    α, β, γ are strings of terminals and/or variables (members of (V ∪ Σ)∗).
Several productions with common heads can be combined: A → a | Aa | bAb
A → ε means the RHS is the empty string. ε is not a terminal symbol.

Subsection: Language of a context-free grammar

Derivations and the Language of a CFG
A CFG can be regarded as a collection of rules.
Strings in the language are generated by starting with S and repeatedly replacing A with α whenever the CFG has a production A → α. (CFGs are sometimes called generative grammars.)
The sequence of replacements is called a derivation.
Here is the CFG for arithmetic, again (in compressed notation):
    E → E + E | E − E | E ∗ E | −E | (E) | num
Example:
    E ⇒ E + E ⇒ E + E ∗ E ⇒ num + E ∗ E ⇒ num + num ∗ E ⇒ num + num ∗ num
shows that num + num ∗ num is in the language of the CFG.

Derivations and the Language of a CFG
The language of a given CFG, G = (V, Σ, R, S), can be characterized using the concept of a derivation.
Definition (Yields)
    αAβ yields αγβ (written αAβ ⇒ αγβ) whenever A → γ is in R.
Definition (Derives)
    α derives β if α ⇒ γ1 ⇒ γ2 ⇒ . . . ⇒ β in zero or more steps. This is written α ⇒∗ β.
Definition (Language of a CFG)
    The language of G, written L(G), is {w ∈ Σ∗ | S ⇒∗ w}.
Note that w must be a terminal string (also called a sentence).
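The definitions of "yields" and "derives" can be made concrete with a small search. This is only a sketch under stated assumptions: the encoding of the grammar as a dictionary named RULES, the use of tuples for sentential forms, and the function names yields and derives are all choices made here for illustration, and ASCII - and * stand in for − and ∗. It is a brute-force search, not how practical parsers work.

```python
from collections import deque

# Assumed encoding: each variable maps to a list of production bodies,
# and a sentential form is a tuple of symbols.
RULES = {
    "E": [("E", "+", "E"), ("E", "-", "E"), ("E", "*", "E"),
          ("-", "E"), ("(", "E", ")"), ("num",)],
}


def yields(form):
    """Return every form reachable from `form` in exactly one step."""
    result = set()
    for i, sym in enumerate(form):
        for body in RULES.get(sym, []):
            # alpha A beta => alpha gamma beta, for the production A -> gamma
            result.add(form[:i] + body + form[i + 1:])
    return result


def derives(start, target):
    """Decide whether start =>* target by breadth-first search.

    The search is finite because no body in RULES is empty: forms never
    shrink, so any form longer than the target can be discarded.
    """
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for nxt in yields(form):
            if len(nxt) <= len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


print(derives(("E",), ("num", "+", "num", "*", "num")))  # True
print(derives(("E",), ("num", "+", "*")))                # False
```

The length bound relies on the grammar having no ε-productions; a grammar with S → ε would need a different termination argument.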
A language is context-free if it is L(G) for some CFG G.
Note: The intermediate strings in a derivation are called sentential forms.

Subsection: Parse trees

Parse trees
A parse tree is a tree that shows how to derive a string from a non-terminal.
An interior node represents the LHS of a production, and its children represent the RHS that would replace it in a derivation.
Here is a parse tree for num + num − num + num:
[Figure: parse tree with root E and leaves num(1) + num(2) − num(3) + num(4)]

Yield of a parse tree
The concatenation of the symbols (or ε) at the leaves of a parse tree is called the yield of the parse tree.
The yield can always be derived from the symbol at the root of the tree.
There is a parse tree with root S and yield x ∈ Σ∗ iff x is in L(G).

Leftmost Derivations
There are many ways to extract a derivation from a parse tree.
If we put a restriction on how the derivation is done, we can get a unique derivation.
Definition (Leftmost Derivation)
    A leftmost derivation is one where each step replaces the leftmost non-terminal in the sentential form.
Here is the derivation I showed earlier. Is it leftmost?
    E ⇒ E + E ⇒ E + E ∗ E ⇒ num + E ∗ E ⇒ num + num ∗ E ⇒ num + num ∗ num
No. The second step, E + E ⇒ E + E ∗ E, replaces a non-leftmost non-terminal.
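The question "is it leftmost?" can be checked mechanically. The sketch below is an illustration only: the grammar encoding and the names BODIES, is_leftmost_step and is_leftmost are assumptions made here, symbols are separated by spaces, and ASCII - and * stand in for − and ∗.

```python
# Assumed encoding of the arithmetic grammar: bodies as space-separated symbols.
RULES = {
    "E": ["E + E", "E - E", "E * E", "- E", "( E )", "num"],
}
BODIES = {head: [b.split() for b in bodies] for head, bodies in RULES.items()}


def is_leftmost_step(cur, nxt):
    """True if nxt can be obtained from cur by rewriting the leftmost variable."""
    idx = next((i for i, sym in enumerate(cur) if sym in BODIES), None)
    if idx is None:
        return False  # cur has no variable left, so no step is possible
    return any(cur[:idx] + body + cur[idx + 1:] == nxt
               for body in BODIES[cur[idx]])


def is_leftmost(derivation):
    """Check every consecutive pair of sentential forms in the derivation."""
    forms = [form.split() for form in derivation]
    return all(is_leftmost_step(a, b) for a, b in zip(forms, forms[1:]))


earlier = ["E", "E + E", "E + E * E", "num + E * E",
           "num + num * E", "num + num * num"]
always_left = ["E", "E + E", "num + E", "num + E * E",
               "num + num * E", "num + num * num"]
print(is_leftmost(earlier))      # False: step 2 rewrites the second E
print(is_leftmost(always_left))  # True
```

A step such as E + E ⇒ E + E + E is consistent with rewriting either E, so the check accepts any step that could have been a leftmost replacement.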
Leftmost derivation of num + num ∗ num:
    E ⇒ E + E ⇒ num + E ⇒ num + E ∗ E ⇒ num + num ∗ E ⇒ num + num ∗ num

Subsection: Ambiguity and precedence

Ambiguity
Definition (Ambiguous CFG)
    A CFG is ambiguous if there is more than one leftmost derivation for the same string.
Equivalently: a CFG is ambiguous if there is more than one parse tree for the same string.
Ambiguity often causes problems:
    With interpretation.
    With parsing.

Ambiguity
Here is the CFG for arithmetic expressions, again:
    E → E + E | E − E | E ∗ E | −E | (E) | num
Is this ambiguous?
Yes, in several ways, all related to precedence and associativity.
    Multiple parses for num + num ∗ num.
    Multiple parses for num + num + num.
    Similar problems for all combinations of operators.

Arithmetic expressions
CFG:
    E → E + E | E − E | E ∗ E | −E | (E) | num
Suppose 1, 2, 3, and 4 are all instances of num.
Since the CFG is ambiguous, there are multiple ways to parse 1 + 2 − 3 + 4.
The tree on the left groups it as (1 + 2) − (3 + 4) = −4 (wrong!) while the tree on the right groups it as ((1 + 2) − 3) + 4 = 4 (right!).
[Figure: two parse trees for 1 + 2 − 3 + 4, one for each grouping]

Making CFGs Unambiguous
An ambiguous CFG can sometimes be rewritten to make it unambiguous by making it group according to precedence and associativity rules.
There is no general method and sometimes it is not possible, but there is a way to do it that often works for practical instances.

Making a CFG Unambiguous
CFG:
    E → E + E | E − E | E ∗ E | −E | (E) | num
Recipe:
    Rank operators by precedence. (For arithmetic, from highest to lowest: unary −, then ∗, then + and binary −, which have the same precedence.)
    Introduce a new non-terminal for each precedence level.
    (E will represent + and binary −, T will represent ∗, and F will represent unary −.)
    Make the sentence symbol go to the lowest-precedence non-terminal. (We will keep E as the sentence symbol.)
    Productions choose between generating the current operator or the non-terminal for the next higher precedence operator.
    Use left/right recursion for left/right associativity.

Unambiguous CFG for Arithmetic
    E → E + T | E − T | T
    T → T ∗ F | F
    F → −F | (E) | num
The precedence is correct, but the tree is more complex than that from the original grammar.
[Figure: parse tree under the unambiguous grammar, with leaves num(1) through num(4)]

Parentheses
If we really wanted to compute 1 + 2 − (3 + 4), we would have to put the parentheses in the input string.
[Figure: parse tree for the parenthesized input, with leaves num(1) through num(4)]

Example: Function Call
A function call in many programming languages looks like: f(x + 1, y + 1).
Here is a fragment of a CFG (E is a non-terminal for expressions):
    C → id(P)
    P → P, E | ε
What’s the problem?
It puts a comma before each argument, not between them.
    C ⇒ id(P) ⇒ id(P, E) ⇒ id(, E)
Here is a way to do it right:
    C → id(P) | id()
    P → P, E | E

Subsection: Relation to other classes of languages

Every Regular Language is Context-Free
Theorem
    Every regular language is context-free.
Proof.
Let L be any regular language. Then there is a regular expression R such that L = L(R). We can construct a CFG G with the same language as R. The proof is by induction on the structure of regular expressions.
    If R = ∅, the CFG S → S has the same language.
    If R = ε, the CFG is S → ε.
    If R = a, the CFG is S → a.

Every Regular Language is Context-Free, cont.
For the inductive cases, the induction hypothesis is that the subexpressions Ri have equivalent CFGs.
Note that we can rename non-terminals arbitrarily, so, without loss of generality, let the CFG for Ri be Gi with sentence symbol Si, and assume that the non-terminals for Ri and Rj are distinct if i ≠ j.
    If R = R1 ∪ R2, the desired G is the union of the productions from G1 and G2, with additional productions S → S1 and S → S2, where S is the sentence symbol of G.
    If R = R1∗, G has the productions from G1 plus S → ε | S S1, where S1 is the sentence symbol of the CFG for R1 (or S → ε | S1 S if it’s more convenient).
    (R = R1 · R2 is left as an exercise.)
So, L is context-free.

∗ in a CFG
R = R1∗ is S → ε | S S1, where S1 is the sentence symbol for the CFG for R1 (or S → ε | S1 S if it’s more convenient).
[Figure: parse trees illustrating the two forms of the rule]

Extended BNF
BNF (“Backus-Naur Form”) is another name for context-free grammars. The notation was used to define the syntax of the ALGOL programming language.
Extended Backus-Naur Form (EBNF) allows regular expressions on the right-hand sides of productions.
    E → E (+E)∗
    C → id lparen P? rparen
    P → P (, E)∗
(Parentheses here are EBNF notation, not terminal symbols, and R? means R | ε.)
An EBNF grammar can be expanded to an ordinary CFG using the transformations to convert regular expressions to CFGs from the previous slides.

A CFL that is not regular
The language {0^i 1^i | i ≥ 0} is context-free.
    S → ε
    S → 0S1
We proved earlier that this language is not regular.
This is a context-free language that is not regular.

Languages that are not context-free
A simple language that is not context-free is {0^i 1^i 2^i | i ≥ 0}.
This can be proved using the context-free pumping lemma, which is like the pumping lemma for regular languages, only harder.

Classes of Languages: Summary
[Figure: diagram of language classes labeled finite, regular, and CFLs, with the example 0∗]
“Regular languages can count one thing, context-free languages can count two things, and Turing machines can count many things.”
Warning: This is only approximately correct, and never acceptable as part of a proof.
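As a closing illustration, the unambiguous arithmetic grammar (E, T, F) from the ambiguity subsection can be turned into a small hand-written evaluator. This is a sketch under several assumptions, not the jison-generated parser mentioned earlier: it uses recursive descent, which cannot handle the left-recursive rules E → E + T and T → T ∗ F directly, so it substitutes the EBNF-style forms E → T ((+ | −) T)∗ and T → F (∗ F)∗ and folds results from left to right to keep left associativity. The names tokenize, Parser and evaluate are made up here, num is taken to be a sequence of digits, and ASCII - and * stand in for − and ∗.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(text):
    """Split the input into numbers, operators and parentheses."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):  # E -> T ((+ | -) T)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):  # T -> F (* F)*
        value = self.factor()
        while self.peek() == "*":
            self.take()
            value *= self.factor()
        return value

    def factor(self):  # F -> -F | (E) | num
        tok = self.take()
        if tok == "-":
            return -self.factor()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("1 + 2 - 3 + 4"))    # 4
print(evaluate("1 + 2 - (3 + 4)"))  # -4
print(evaluate("1 + 2 * 3"))        # 7
```

One function per non-terminal mirrors the recipe of one non-terminal per precedence level: the deeper the function, the tighter the operator binds.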
© Copyright 2024