Announcements
Homeworks will be returned on Friday.
Context-free Languages I
David L. Dill
Department of Computer Science
Stanford University
Outline
1 Context-free languages
Definition
Language of a context-free grammar
Parse trees
Ambiguity and precedence
Relation to other classes of languages
Section
Context-free languages
Subsection
Definition
Introduction to Context-free Grammars
We just saw that the language L = {0^n 1^n | n ≥ 0} is not regular.
But it is easy to describe recursively:
• ε is in L.
• 0S1 is in L if S is in L.
There is a general notation for recursive definitions of languages like this:
S → ε
S → 0S1
This is called a context-free grammar (CFG). CFGs describe the
context-free languages, a class of languages that is a proper
superset of the regular languages.
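The recursive description of L can be turned almost directly into a recognizer. The following is a minimal Python sketch, not part of the lecture; the function name in_L is an assumption made for illustration.

```python
def in_L(s: str) -> bool:
    """Return True if s is in L = {0^n 1^n | n >= 0}, following the two rules."""
    if s == "":
        # Rule 1: the empty string is in L.
        return True
    # Rule 2: 0S1 is in L if S is in L.
    return len(s) >= 2 and s[0] == "0" and s[-1] == "1" and in_L(s[1:-1])


candidates = ["", "01", "0011", "010", "10", "001"]
print([w for w in candidates if in_L(w)])  # ['', '01', '0011']
```

Each recursive call strips one 0 from the front and one 1 from the back, mirroring one use of the production S → 0S1.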
Context-Free Languages
Context-free languages are more expressive than regular languages, but
not as powerful as Turing machine deciders (to be presented later).
Context-free languages were identified in the 1950s by linguist Noam
Chomsky, as a natural place in a hierarchy of languages, which included
the regular languages.
They are extremely important in practical systems.
They are used to specify all kinds of notations, from programming
languages to email header formats.
There are widely used, efficient parser generators that automatically build
parsers from context-free grammars. (Parsers check that the input conforms
to the grammar, and build a tree capturing the structure of the input.)
I used the “jison” parser generator for JavaScript to implement the
parsers for arithmetic (in the first lecture), Boolean formulas (in the truth
table), and first order logic (in the blocks world).
A Context-free Grammar for Arithmetic Expressions
Here is a simple CFG for arithmetic expressions.
E → E + E
E → E − E
E → E ∗ E
E → −E
E → (E)
E → num
Here “num” represents any number. I could have specified the structure
of numbers as sequences of digits, too.
Formal Definition of Context-Free Grammars
A context-free grammar (CFG) is a 4-tuple (V, Σ, R, S), where
• V is a finite set of variables (also non-terminal symbols),
• Σ is a finite set of terminal symbols,
• R ⊆ V × (V ∪ Σ)∗ is a finite set of rules (also productions), and
• S ∈ V is the sentence symbol.
V and Σ must be disjoint sets.
Productions are written as A → aBc. A is the left-hand side (LHS), also
called the head, and aBc is the right-hand side (RHS), also called the
body.
Notation
a, b, c, 0, 1, . . . are terminals
A, B, S, . . . are non-terminals
w, x, y, z are terminal strings (members of Σ∗)
α, β, γ are strings of terminals and/or variables (members of (V ∪ Σ)∗).
Several productions with common heads can be combined:
A → a | Aa | bAb
A → ε means the RHS is the empty string.
ε is not a terminal symbol.
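The 4-tuple definition maps naturally onto a small data structure. The sketch below is one possible Python encoding, not the lecture's; the class name CFG, its field names, and the check method are assumptions introduced for illustration. Bodies are tuples of symbols, and the empty tuple stands for an ε body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset  # V
    terminals: frozenset  # Sigma
    rules: frozenset      # R: set of (head, body) pairs, body is a tuple of symbols
    start: str            # S, the sentence symbol

    def check(self) -> None:
        """Raise ValueError if a side condition of the definition is violated."""
        if self.variables & self.terminals:
            raise ValueError("V and Sigma must be disjoint")
        if self.start not in self.variables:
            raise ValueError("S must be a member of V")
        symbols = self.variables | self.terminals
        for head, body in self.rules:
            if head not in self.variables:
                raise ValueError(f"head {head!r} is not a variable")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"body of {head!r} uses an unknown symbol")


# S -> epsilon | 0 S 1
G = CFG(
    variables=frozenset({"S"}),
    terminals=frozenset({"0", "1"}),
    rules=frozenset({("S", ()), ("S", ("0", "S", "1"))}),
    start="S",
)
G.check()  # passes silently for a well-formed grammar
```

Representing ε as an empty body, rather than as a symbol, matches the remark that ε is not a terminal.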
Subsection
Language of a context-free grammar
Derivations and the Language of a CFG
A CFG can be regarded as a collection of rules.
Strings in the language are generated by starting with S and repeatedly
replacing A with α whenever the CFG has a production A → α. (CFGs
are sometimes called generative grammars.)
The sequence of replacements is called a derivation. Here is the CFG for
arithmetic, again (in compressed notation):
E → E + E | E − E | E ∗ E | −E | (E) | num
Example: E =⇒ E + E =⇒ E + E ∗ E =⇒ num + E ∗ E =⇒
num + num ∗ E =⇒ num + num ∗ num shows that num + num ∗ num is in
the language of the CFG.
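The idea of generating strings by repeated replacement can be sketched in Python as a breadth-first search over sentential forms. This is an illustration under stated assumptions, not the lecture's code: the names RULES, yields, and language are hypothetical, and the grammar used is S → ε | 0S1. The search prunes only on the number of terminals, which is enough for this grammar but would not terminate for every grammar (for example, one with a rule such as S → SS).

```python
from collections import deque

# Hypothetical encoding: each variable maps to a list of bodies (tuples of symbols).
RULES = {"S": [(), ("0", "S", "1")]}


def yields(form, rules):
    """All forms reachable from `form` in one step: replace one A by a body of A."""
    out = []
    for i, sym in enumerate(form):
        for body in rules.get(sym, []):
            out.append(form[:i] + body + form[i + 1:])
    return out


def language(start, rules, max_len):
    """Terminal strings of length <= max_len derivable from `start`."""
    seen = {(start,)}
    found = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if not any(sym in rules for sym in form):
            found.add("".join(form))  # no variables left: a terminal string
            continue
        for nxt in yields(form, rules):
            terminal_count = sum(sym not in rules for sym in nxt)
            if terminal_count <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(found, key=len)


print(language("S", RULES, 6))  # ['', '01', '0011', '000111']
```

Every path from (S,) to a terminal string in this search corresponds to a derivation in the sense used above.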
Derivations and the Language of a CFG
The language of a given CFG, G = (V, Σ, R, S), can be characterized using
the concept of a derivation.
Definition
αAβ yields αγβ (written as αAβ =⇒ αγβ) whenever A → γ is in R.
Definition (Derives)
We say α derives β if α =⇒ γ1 =⇒ γ2 =⇒ . . . =⇒ β in zero or more steps.
This is written: α =⇒∗ β.
Definition (Language of a CFG)
The language of G, written L(G), is {w ∈ Σ∗ | S =⇒∗ w}.
Note that w must be a terminal string (also called a sentence).
A language is context-free if it is L(G) for some CFG G.
Note: The intermediate strings in a derivation are called sentential forms.
Subsection
Parse trees
Parse trees
A parse tree is a tree that shows how to derive a string from a
non-terminal.
An interior node represents the LHS of a production, and its children
represent the RHS that would replace it in a derivation.
Here is a parse tree for num + num − num + num
[Figure: a parse tree for num(1) + num(2) − num(3) + num(4); every interior node is labeled E.]
Yield of a parse tree
The concatenation of the symbols (or ε) at the leaves of a parse tree is
called the yield of the parse tree.
The yield can always be derived from the symbol at the root of the tree.
There is a parse tree with root S and yield x ∈ Σ∗ iff x is in L(G).
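Computing the yield is a left-to-right walk over the leaves. The sketch below uses a hypothetical tree encoding (a node is a pair of a symbol and a list of children; a leaf has an empty list) and hypothetical helper names leaf, E, num, and tree_yield. Leaves labeled ε are not modeled here.

```python
def leaf(sym):
    """A leaf node holding a terminal symbol."""
    return (sym, [])


def E(*children):
    """An interior node labeled E with the given children."""
    return ("E", list(children))


def num():
    """The subtree for the production E -> num."""
    return E(leaf("num"))


def tree_yield(node):
    """Concatenate the leaf symbols of the tree from left to right."""
    sym, children = node
    if not children:
        return [sym]
    out = []
    for child in children:
        out.extend(tree_yield(child))
    return out


# One possible tree for num + num - num + num, grouped as ((num + num) - num) + num.
tree = E(E(E(num(), leaf("+"), num()), leaf("-"), num()), leaf("+"), num())
print(" ".join(tree_yield(tree)))  # num + num - num + num
```

A differently shaped tree over the same leaves has the same yield, which is why the yield alone does not determine the tree.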
Leftmost Derivations
There are many ways to extract a derivation from a parse tree.
If we put a restriction on how the derivation is done, we can get a unique
derivation.
Definition (Leftmost Derivation)
A leftmost derivation is one where each step replaces the leftmost
non-terminal in the sentential form.
Here is the derivation I showed earlier. Is it leftmost?
E =⇒ E + E =⇒ E + E ∗ E =⇒ num + E ∗ E =⇒ num + num ∗ E =⇒
num + num ∗ num
No. The second step, E + E =⇒ E + E ∗ E, replaces a non-leftmost
non-terminal.
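Reading the leftmost derivation off a parse tree can be sketched as follows. This is an illustrative Python sketch, not the lecture's code: the tree encoding (a node is a pair of a symbol and a list of children) and the names T, E, and leftmost_derivation are assumptions, and the sketch assumes the grammar has no ε productions, so a node without children is always a terminal.

```python
def T(sym):
    """A terminal leaf."""
    return (sym, [])


def E(*children):
    """An interior node labeled E."""
    return ("E", list(children))


def leftmost_derivation(tree):
    """Return the sentential forms of the leftmost derivation described by `tree`."""
    form = [tree]             # current sentential form, kept as a list of subtrees
    steps = [[tree[0]]]
    while True:
        # Index of the leftmost subtree that still has children to expand.
        idx = next((i for i, node in enumerate(form) if node[1]), None)
        if idx is None:
            return steps
        form[idx:idx + 1] = form[idx][1]   # replace the non-terminal by its children
        steps.append([node[0] for node in form])


# Tree for num + num * num with + at the root.
tree = E(E(T("num")), T("+"), E(E(T("num")), T("*"), E(T("num"))))
print(" => ".join(" ".join(step) for step in leftmost_derivation(tree)))
# E => E + E => num + E => num + E * E => num + num * E => num + num * num
```

Because the leftmost non-terminal is chosen at every step, each parse tree gives exactly one such sequence.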
Leftmost Derivations, cont.
Leftmost derivation of num + num ∗ num:
E =⇒ E + E =⇒ num + E =⇒ num + E ∗ E =⇒ num + num ∗ E =⇒
num + num ∗ num
Subsection
Ambiguity and precedence
Ambiguity
Definition (Ambiguous CFG)
A CFG is ambiguous if there is more than one leftmost derivation for the
same string.
Equivalently: A CFG is ambiguous if there is more than one parse tree for
the same string.
Ambiguity often causes problems:
• With interpretation.
• With parsing.
Ambiguity
Here is the CFG for arithmetic expressions, again:
E → E + E | E − E | E ∗ E | −E | (E) | num
Is this ambiguous? Yes, in several ways, all related to precedence and
associativity.
Multiple parses for num + num ∗ num.
Multiple parses for num + num + num.
Similar problems for all combinations of operators.
Arithmetic expressions
CFG: E → E + E | E − E | E ∗ E | −E | (E) | num
Suppose 1, 2, 3, and 4 are all instances of num.
Since the CFG is ambiguous, there are multiple ways to parse
1 + 2 − 3 + 4. The tree on the left groups it as (1 + 2) − (3 + 4) = −4
(wrong!) while the tree on the right groups it as ((1 + 2) − 3) + 4 = 4
(right!).
[Figure: two parse trees for 1 + 2 − 3 + 4. Left: the root uses E → E − E and both operands are E + E subtrees. Right: the root uses E → E + E, its left operand is an E − E subtree, and that subtree's left operand is an E + E subtree.]
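Since a CFG is ambiguous when some string has more than one parse tree, one way to see the ambiguity concretely is to count parse trees. The sketch below is an illustration, not the lecture's method; it is specialized to the grammar E → E + E | E − E | E ∗ E | −E | (E) | num, and the function name count_trees is an assumption.

```python
from functools import lru_cache

BINARY_OPS = {"+", "-", "*"}


def count_trees(tokens):
    """Number of parse trees for `tokens` under E -> E+E | E-E | E*E | -E | (E) | num."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving toks[i:j] from E."""
        if j <= i:
            return 0
        if j - i == 1:
            return 1 if toks[i] == "num" else 0
        total = 0
        if toks[i] == "-":                          # E -> -E
            total += count(i + 1, j)
        if toks[i] == "(" and toks[j - 1] == ")":   # E -> (E)
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):               # E -> E op E, operator at k
            if toks[k] in BINARY_OPS:
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))


print(count_trees("num + num * num".split()))        # 2
print(count_trees("num + num - num + num".split()))  # 5
```

A count greater than 1 for any single string is enough to show that the grammar is ambiguous.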
Making CFGs Unambiguous
An ambiguous CFG can sometimes be rewritten to make it unambiguous
by making it group according to precedence and associativity rules.
There is no general method and sometimes it is not possible, but there is
a way to do it that often works for practical instances.
Making a CFG Unambiguous
CFG: E → E + E | E − E | E ∗ E | −E | (E) | num
Recipe:
• Rank operators by precedence. (For arithmetic, from highest to lowest:
unary −, then ∗, then + and binary −, which have the same precedence.)
• Introduce a new non-terminal for each precedence level. (E will
represent + and binary −, T will represent ∗, and F will represent unary −.)
• Make the sentence symbol go to the lowest precedence non-terminal.
(We will keep E as the sentence symbol.)
• Productions choose between generating the current operator or the
non-terminal for the next higher precedence operator.
• Use left/right recursion for left/right associativity.
Unambiguous CFG for Arithmetic
E → E + T | E − T | T
T → T ∗ F | F
F → −F | (E) | num
The precedence is correct, but the tree is more complex than that from
the original grammar.
[Figure: parse tree for num(1) + num(2) − num(3) + num(4) under this grammar, with interior nodes E, T, and F.]
Parentheses
If we really wanted to compute 1 + 2 − (3 + 4), we would have to put the
parentheses in the input string.
[Figure: parse tree for num(1) + num(2) − (num(3) + num(4)) under the same grammar.]
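The grouping imposed by the unambiguous grammar can be checked with a small evaluator. The following Python sketch is an assumption-laden illustration, not the jison parser mentioned earlier: the function name evaluate is hypothetical, num is taken to be a non-negative integer literal, and the left-recursive productions E → E + T and T → T ∗ F are implemented as loops, which gives the same left-associative grouping. Characters outside digits, operators, and parentheses are silently skipped by the tokenizer.

```python
import re


def evaluate(text):
    """Evaluate text grouped as E -> E+T | E-T | T, T -> T*F | F, F -> -F | (E) | num."""
    tokens = re.findall(r"\d+|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_E():
        value = parse_T()
        while peek() in ("+", "-"):      # left associative: fold as we go
            if take() == "+":
                value += parse_T()
            else:
                value -= parse_T()
        return value

    def parse_T():
        value = parse_F()
        while peek() == "*":
            take()
            value *= parse_F()
        return value

    def parse_F():
        tok = take() if peek() is not None else None
        if tok == "-":                   # F -> -F
            return -parse_F()
        if tok == "(":                   # F -> (E)
            value = parse_E()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return value
        if tok is not None and tok.isdigit():
            return int(tok)              # F -> num
        raise SyntaxError(f"unexpected token {tok!r}")

    result = parse_E()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate("1 + 2 - 3 + 4"))    # 4
print(evaluate("1 + 2 - (3 + 4)"))  # -4
print(evaluate("1 + 2 * 3"))        # 7
```

There is one function per non-terminal, so each precedence level of the grammar shows up as one level of the call structure.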
Example: Function Call
A function call in many programming languages looks like: f(x + 1, y + 1).
Here is a fragment of a CFG (E is a non-terminal for expressions):
C → id(P)
P → P, E | ε
What’s the problem? It puts a comma before each argument, not
between them.
C =⇒ id(P) =⇒ id(P, E) =⇒ id(, E)
Here is a way to do it right:
C → id(P) | id()
P → P, E | E
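The difference between the two versions of P can be seen by spelling out what each derives. In this small Python sketch the function names args_wrong and args_right are hypothetical; each returns the sentential form obtained from P for a given number of arguments, leaving E unexpanded.

```python
def args_wrong(n):
    """Form derived from P using P -> P, E exactly n times and then P -> epsilon."""
    return "" if n == 0 else args_wrong(n - 1) + ", E"


def args_right(n):
    """Form derived from P using P -> P, E exactly n - 1 times and then P -> E (n >= 1)."""
    return "E" if n == 1 else args_right(n - 1) + ", E"


print("id(" + args_wrong(2) + ")")  # id(, E, E)
print("id(" + args_right(2) + ")")  # id(E, E)
```

In the corrected grammar the recursion bottoms out in an argument rather than in ε, so the zero-argument call needs its own production C → id().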
Subsection
Relation to other classes of languages
Every Regular Language is Context-Free
Theorem
Every regular language is context-free.
Proof. Let L be any regular language. Then there is a regular expression
R such that L = L(R).
We can construct a CFG G with the same language as R. The proof is
by induction on the structure of regular expressions.
• If R = ∅, the CFG S → S has the same language.
• If R = ε, the CFG is S → ε.
• If R = a, the CFG is S → a.
Every Regular Language is Context-Free, cont.
For the inductive cases, the induction hypothesis is that the
subexpressions Ri have equivalent CFGs.
Note that we can rename non-terminals arbitrarily, so, without loss of
generality, let the CFG for Ri be Gi with sentence symbol Si, and assume
that the non-terminals of Gi and Gj are distinct if i ≠ j.
For R = R1 ∪ R2, the desired G is the union of the productions from G1 and
G2, with additional productions S → S1 and S → S2, where S is the
sentence symbol of G.
For R = R1∗, the desired G has the productions from G1, with additional
productions S → ε | SS1, where S1 is the sentence symbol for the CFG for
R1 (or S → ε | S1S if it’s more convenient).
(R = R1 · R2 left as an exercise.)
So, L is context-free. □
∗ in a CFG
The CFG for R = R1∗ uses S → ε | SS1, where S1 is the sentence symbol for
the CFG for R1 (or S → ε | S1S if it’s more convenient).
[Figure: parse trees built from S, S1, and ε, illustrating the left-recursive form S → ε | SS1 and the right-recursive form S → ε | S1S.]
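The construction in this proof can be sketched as a recursive function over the structure of a regular expression. The encoding of regular expressions as nested tuples and the name regex_to_cfg are assumptions made for this illustration. Concatenation is deliberately not handled, since that case is left as an exercise.

```python
from itertools import count


def regex_to_cfg(regex):
    """Return (sentence_symbol, rules) for a regex given as a nested tuple.

    Hypothetical encoding: ("empty",), ("eps",), ("sym", a),
    ("union", r1, r2), ("star", r1).
    Rules are (head, body) pairs; body is a tuple of symbols, () for epsilon.
    """
    fresh = count(1)
    rules = []

    def build(r):
        s = f"S{next(fresh)}"    # a fresh name keeps sub-grammars disjoint
        kind = r[0]
        if kind == "empty":
            rules.append((s, (s,)))          # S -> S derives no terminal string
        elif kind == "eps":
            rules.append((s, ()))            # S -> epsilon
        elif kind == "sym":
            rules.append((s, (r[1],)))       # S -> a
        elif kind == "union":
            s1, s2 = build(r[1]), build(r[2])
            rules.extend([(s, (s1,)), (s, (s2,))])
        elif kind == "star":
            s1 = build(r[1])
            rules.extend([(s, ()), (s, (s, s1))])
        else:
            raise ValueError(f"unsupported operator {kind!r}")
        return s

    start = build(regex)
    return start, rules


# (a ∪ b)∗
start, rules = regex_to_cfg(("star", ("union", ("sym", "a"), ("sym", "b"))))
for head, body in rules:
    print(head, "->", " ".join(body) or "ε")
```

Generating a fresh non-terminal for every subexpression plays the role of the renaming step in the proof.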
Extended BNF
BNF = “Backus-Naur Form” is another name for context-free grammars.
The notation was used to define the syntax of the ALGOL programming
language.
Extended Backus-Naur Form (EBNF) allows regular expressions on the
right-hand sides of productions.
E → E (+E)∗
C → id lparen P? rparen
P → P(, E)∗
(Parentheses here are EBNF notation, not terminal symbols, and R?
means R | ε.)
An EBNF grammar can be expanded to an ordinary CFG using the
transformations to convert regular expressions to CFGs from the previous
slides.
A CFL that is not regular
The language {0^i 1^i | i ≥ 0} is context-free.
S → ε
S → 0S1
We proved earlier that this language is not regular.
This is a context-free language that is not regular.
Languages that are not context-free
A simple language that is not context-free is
{0^i 1^i 2^i | i ≥ 0}.
This can be proved using the context-free pumping lemma, which is like
the pumping lemma for regular languages, only harder.
“Regular languages can count one thing, context-free languages can
count two things, and Turing machines can count many things.”
Warning: This is only approximately correct, and never acceptable as a
part of a proof.
Classes of Languages: Summary
[Figure: nested classes of languages, with finite inside regular inside CFLs; example languages shown are 0∗ and {0^i 1^i | i ≥ 0}∗.]