Winter 2012-2013
Compiler Principles
Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich
Ben-Gurion University
General stuff

Topics taught by me







Lexical analysis (scanning)
Syntax analysis (parsing)
…
Dataflow analysis
Register allocation
Slides will be available from web-site after
lecture
Request: please mute mobiles, tablets,
super-cool squeaking devices
Today

Understand role of lexical analysis

Lexical analysis theory

Implementing modern scanner
Role of lexical analysis

First part of compiler front-end
[Figure: compiler pipeline. High-level Language → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code (scheme)]

Convert stream of characters into stream
of tokens


Split text into most basic meaningful strings
Simplify input for syntax analysis
From scanning to parsing
program text: 5 + (7 * x)
  | Lexical Analyzer
token stream: num + ( num * id )
  | Parser: valid, or syntax error
Grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )
Abstract Syntax Tree:
+
  num 5
  *
    num 7
    id x
Javascript example

Identify basic units in this code
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching",
              "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    }
    else {
      elt.style.display = "none";
    }
  }
}
Javascript example

Basic units identified in the code above: keyword, identifier, operator,
numeric literal, string literal, punctuation, whitespace
Scanner output
var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications",
              "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    }
    else {
      elt.style.display = "none";
    }
  }
}

Stream of Tokens
LINE: ID(value)
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...
What is a token?

Lexeme – substring of original text constituting an identifiable unit
  Identifiers, Values, reserved words, …
Record type storing:
  Kind
  Value (when applicable)
  Start-position/end-position
  Any information that is useful for the parser
Different for different languages
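A token record of this kind might be sketched as follows in Python. The names `Token`, `kind`, `value`, `start`, and `end` are illustrative assumptions, not part of the lecture; the token kinds are the ones from the scanner output slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: kind, optional value, and position."""
    kind: str              # e.g. "VAR", "ID", "INT_LITERAL"
    value: Optional[str]   # lexeme text when the parser needs it, else None
    start: int             # offset of the first character of the lexeme
    end: int               # offset one past the last character


source = "var currOption = 0;"
tokens = [
    Token("VAR", None, 0, 3),
    Token("ID", "currOption", 4, 14),
    Token("EQ", None, 15, 16),
    Token("INT_LITERAL", "0", 17, 18),
    Token("SEMI", None, 18, 19),
]

# Each lexeme can be recovered from the source text via its positions.
lexemes = [source[t.start:t.end] for t in tokens]
```

Whitespace does not appear in the list: it separates lexemes but is not passed on to the parser.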
C++ example 1


Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int>> myVector
Is >> one operator token, or two tokens > and >?
C++ example 2


Splitting text into tokens can be tricky
How should the code below be split?
  vector<vector<int> > myVector
>, > are two tokens
Example tokens
Type         Examples
Identifier   x, y, z, foo, bar
NUM          42
FLOATNUM     -3.141592654
STRING       “so long, and thanks for all the fish”
LPAREN       (
RPAREN       )
IF           if
…
Separating tokens

Type          Examples
Comments      /* ignore code */
              // ignore until end of line
White spaces  \t \n

Lexemes are recognized but get consumed rather than transmitted to parser
if
if
i/*comment*/f
Preprocessor directives in C
Type                Examples
Include directives  #include<foo.h>
Macros              #define THE_ANSWER 42
Designing a scanner

Define each type of lexeme






Reserved words: var, if, for, while
Operators: < = ++
Identifiers: myFunction
Literals: 123 “hello”
Annotations: @SuppressWarnings
But how do we define lexemes of
unbounded length?
Regular expressions
Regular languages refresher

Formal languages




Alphabet = finite set of letters
Word = sequence of letters
Language = set of words
Regular languages defined equivalently by


Regular expressions
Finite-state automata
Regular expressions





Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*



Shorthand: R+ stands for R R*
scope: (R)
Example: (0* 1*) | (1* 0*)

What is this language?
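One way to explore the example is to test words against it. The sketch below uses Python's `re` module, whose syntax for concatenation, union, and star matches the notation above; the helper name `in_language` is an assumption made for illustration. Under this reading, the language contains the words over {0, 1} that switch between 0 and 1 at most once.

```python
import re

# (0* 1*) | (1* 0*) written in Python's regular expression syntax.
PATTERN = re.compile(r"(0*1*)|(1*0*)")


def in_language(word: str) -> bool:
    """True if the whole word matches the regular expression."""
    return PATTERN.fullmatch(word) is not None


examples = {w: in_language(w) for w in ["", "000", "0011", "1100", "010", "101"]}
```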
Exercise 1 - Question

Language of Java identifiers


Identifiers start with either an underscore ‘_’
or a letter
Continue with either underscore, letter, or digit
Exercise 1 - Answer

Language of Java identifiers




Identifiers start with either an underscore ‘_’
or a letter
Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros
First = _|a|b|…|z|A|…|Z
Next = First|0|…|9
R = First Next*
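The macros translate almost directly into Python's `re` syntax, with character classes standing in for the long unions. This is a sketch of the exercise's simplified identifier rule only; the names `FIRST`, `NEXT`, and `is_identifier` are illustrative.

```python
import re

FIRST = r"[_a-zA-Z]"       # First = _|a|b|…|z|A|…|Z
NEXT = r"[_a-zA-Z0-9]"     # Next  = First|0|…|9
IDENT = re.compile(FIRST + NEXT + "*")   # R = First Next*


def is_identifier(word: str) -> bool:
    """True if the whole word matches First Next*."""
    return IDENT.fullmatch(word) is not None
```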
Exercise 2 - Question

Language of rational numbers in decimal representation (no leading or trailing zeros)





0
123.757
.933333
Not 007
Not 0.30
Exercise 2 - Answer


Language of rational numbers in decimal representation (no leading or trailing zeros)
Digit = 1|2|…|9
Digit0 = 0|Digit
Num = Digit Digit0*
Frac = Digit0* Digit
Pos = Num | .Frac | 0.Frac | Num.Frac
PosOrNeg = (Є|-)Pos
R = 0 | PosOrNeg
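The answer can be checked against the examples from the question by transcribing the macros into Python's `re` syntax. The variable names mirror the macros; `is_rational` is a helper name assumed for this sketch.

```python
import re

DIGIT = r"[1-9]"                       # Digit  = 1|2|…|9
DIGIT0 = r"[0-9]"                      # Digit0 = 0|Digit
NUM = f"{DIGIT}{DIGIT0}*"              # Num    = Digit Digit0*
FRAC = f"{DIGIT0}*{DIGIT}"             # Frac   = Digit0* Digit
# Pos = Num | .Frac | 0.Frac | Num.Frac  (the dot is escaped: it is a literal)
POS = f"(?:{NUM}|\\.{FRAC}|0\\.{FRAC}|{NUM}\\.{FRAC})"
R = re.compile(f"(?:0|-?{POS})")       # R = 0 | (Є|-)Pos


def is_rational(word: str) -> bool:
    """True if the whole word matches R."""
    return R.fullmatch(word) is not None
```

The words `0`, `123.757`, and `.933333` are accepted, while `007` and `0.30` are rejected.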
Exercise 3 - Question

Equal number of opening and closing brackets: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 3 - Answer

Not regular
Context-free
Grammar:
S ::= [] | [S]
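The grammar can be read as a recursive recognizer: a word is in the language if it is `[]`, or if it is `[`, a word of the language, and `]`. The sketch below follows that reading; the function name `derives_s` is an assumption. The recursion depth grows with the input, which a finite automaton with a fixed number of states cannot imitate.

```python
def derives_s(word: str) -> bool:
    """Recognize the language of S ::= [] | [S]."""
    if word == "[]":                       # S ::= []
        return True
    return (                               # S ::= [S]
        len(word) > 2
        and word[0] == "["
        and word[-1] == "]"
        and derives_s(word[1:-1])
    )
```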
Finite automata

An automaton is defined by states and
transitions
[Figure: example automaton over the letters a, b, c, with callouts marking the start state, a transition, and the accepting state]
Automaton running example

Words are read left-to-right
[Figure: the automaton reading the input word a b c, one letter per step]
word accepted
Word outside of language

Missing transition means non-acceptance
[Figure: the run of the automaton on the input word b b c gets stuck on a missing transition]
Exercise - Question

What is the language defined by the automaton below?
[Figure: the example automaton with transitions labeled a, b, c]

Exercise - Answer

a b* c
In general: the words spelled by all paths from the start state to an accepting state
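A deterministic automaton for a b* c can be run with a transition table and a single current state. The state names `q0`, `q1`, `q2` and the exact shape of the automaton are assumptions of this sketch, since the figure is not reproduced here; a missing table entry plays the role of a missing transition.

```python
# Transition table: (state, letter) -> next state. Missing entries reject.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",   # loop: any number of b's
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(word: str) -> bool:
    """Read the word left-to-right; accept if the run ends in an accepting state."""
    state = START
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:        # missing transition means non-acceptance
            return False
    return state in ACCEPTING
```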
Non-deterministic automata

Allow multiple transitions from given state
labeled by same letter
[Figure: example NFA over the letters a, b, c, in which a state has several outgoing transitions with the same label]
NFA run example

Maintain set of states
Accept word if any of the states in the set is accepting
[Figure: the NFA reading the input word a b c while tracking the set of current states]
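Maintaining a set of states can be sketched as below. The NFA used here is a hypothetical one (words over {a, b} that end in ab), chosen because its start state has two transitions labeled a; it is not the automaton from the figure.

```python
# (state, letter) -> set of possible next states
NFA = {
    ("q0", "a"): {"q0", "q1"},   # two transitions labeled a from q0
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(word: str) -> bool:
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {t for s in current for t in NFA.get((s, letter), set())}
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)
```

Each step touches every state in the current set, which is where the extra cost of running an NFA directly comes from.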
NFA+Є automata

Є transitions can “fire” without reading the
input
[Figure: example NFA+Є with transitions labeled a, b, c and one Є transition]
NFA+Є run example

[Figure: the NFA+Є reading the input word a b c, one frame per step]
At one step the Є transition can take place non-deterministically, without reading input
Word accepted
Reg-exp vs. automata

Regular expressions are declarative
Offer a compact way for humans to define a regular language
Don’t offer a direct way to check whether a given word is in the language
Automata are operational
Define an algorithm for deciding whether a given word is in a regular language
Not a natural notation for humans
From reg. exp. to automata


Theorem: there is an algorithm to build an
NFA+Є automaton for any regular
expression
Proof: by induction on the structure of the
regular expression



For each sub-expression R we build an
automaton with exactly one start state and
one accepting state
Start state has no incoming transitions
Accepting state has no outgoing transitions
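Base cases

R = Є
[Figure: start state connected to the accepting state by a single Є transition]
R = a
[Figure: start state connected to the accepting state by a single transition labeled a]

Construction for R1 | R2

[Figure: a new start state has Є transitions to the automata for R1 and R2; each of them has a Є transition to a new accepting state]

Construction for R1 R2

[Figure: Є transitions chain a new start state to the automaton for R1, R1 to the automaton for R2, and R2 to a new accepting state]

Construction for R*

[Figure: the automaton for R between a new start state and a new accepting state, with Є transitions for entering R, leaving R, repeating R, and skipping R altogether]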
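The inductive construction can be sketched as a set of small builder functions, one per case. This is an illustration under assumptions: the class and function names are invented here, Є is represented by `None`, and the Є transitions are placed as in the standard textbook construction, which may differ in detail from the figures. Every fragment has exactly one start state and one accepting state, as the proof requires.

```python
import itertools

EPS = None                    # label used for Є transitions (assumption)
_fresh = itertools.count()    # source of new state numbers


class Fragment:
    """Automaton with exactly one start state and one accepting state."""
    def __init__(self, edges=()):
        self.start = next(_fresh)
        self.accept = next(_fresh)
        self.edges = list(edges)          # (source, label, target) triples


def empty():                              # R = Є
    f = Fragment()
    f.edges.append((f.start, EPS, f.accept))
    return f


def letter(a):                            # R = a
    f = Fragment()
    f.edges.append((f.start, a, f.accept))
    return f


def union(r1, r2):                        # R1 | R2
    f = Fragment(r1.edges + r2.edges)
    f.edges += [(f.start, EPS, r1.start), (f.start, EPS, r2.start),
                (r1.accept, EPS, f.accept), (r2.accept, EPS, f.accept)]
    return f


def concat(r1, r2):                       # R1 R2
    f = Fragment(r1.edges + r2.edges)
    f.edges += [(f.start, EPS, r1.start), (r1.accept, EPS, r2.start),
                (r2.accept, EPS, f.accept)]
    return f


def star(r):                              # R*
    f = Fragment(r.edges)
    f.edges += [(f.start, EPS, r.start), (r.accept, EPS, f.accept),
                (r.accept, EPS, r.start), (f.start, EPS, f.accept)]
    return f


def closure(f, states):
    """All states reachable from `states` using only Є transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in f.edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(f, word):
    current = closure(f, {f.start})
    for a in word:
        moved = {dst for src, label, dst in f.edges
                 if src in current and label == a}
        current = closure(f, moved)
    return f.accept in current


# a b* c, built bottom-up; each sub-expression gets its own fragment.
abc = concat(letter("a"), concat(star(letter("b")), letter("c")))
```

Each builder adds two states and at most four transitions, so the automaton stays linear in the size of the regular expression.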
From NFA+Є to DFA



Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on string of length m takes O(m·n²) time
Solution: determinization via subset construction
Number of states worst-case exponential in n
Running time O(m)
Subset construction

For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual states for every letter
  NFA+Є: s1 --a--> s2, s4 --a--> s7
  DFA:   [s1,s4] --a--> [s2,s7]
Extend macro states by states reachable via Є transitions
  NFA+Є: s1 --Є--> s4
  DFA:   [s1,s2] becomes [s1,s2,s4]
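Both steps, simulating letters and extending by Є, fit in a short worklist algorithm. The sketch below is one possible formulation, not the lecture's own code: Є is represented by the empty string, macro states are `frozenset`s, and the empty macro state [] is left out (a missing entry in the resulting table means rejection). The small NFA combines the two fragments shown above into one automaton, which is an assumption made for the example.

```python
from collections import deque

EPS = ""   # label used for Є transitions (assumption of this sketch)


def eps_closure(nfa, states):
    """Extend a set of states by everything reachable via Є transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(nfa, start, accepting, alphabet):
    """Return (start macro state, DFA transition table, accepting macro states)."""
    start_set = eps_closure(nfa, {start})
    dfa, seen, queue = {}, {start_set}, deque([start_set])
    while queue:
        macro = queue.popleft()
        for a in alphabet:
            # Simulate the letter on every individual state, then extend by Є.
            moved = {t for s in macro for t in nfa.get((s, a), ())}
            target = eps_closure(nfa, moved)
            if not target:
                continue                  # no state survives: missing transition
            dfa[(macro, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start_set, dfa, {m for m in seen if m & accepting}


nfa = {("s1", "a"): {"s2"}, ("s4", "a"): {"s7"}, ("s1", EPS): {"s4"}}
start_set, dfa, dfa_accepting = subset_construction(nfa, "s1", {"s7"}, "a")
```

Only macro states reachable from the start are generated, so the exponential worst case shows up only when the NFA actually forces it.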
Scanning challenges


Regular expressions allow us to define the
language of all sequences of tokens
Automata theory provides an algorithm for
checking membership of words



But we are interested in splitting the text not
just deciding on membership
How do we determine lexemes?
How do we handle ambiguities – lexemes
matching more than one token?
Separating lexemes

ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton in which the start state moves on a-z to an accepting state for ID with a self-loop on a-z, and on 1 to an accepting state for ONE]
Maximal munch




ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
Solution: find longest matching lexeme
Keep reading text until the automaton cannot continue, remembering the last accepting state it passed through
Return token corresponding to that accepting state
Reset: go back to start state and continue reading input from the end of the returned lexeme
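The maximal munch loop can be sketched for the ID/ONE automaton as follows. The state names (`start`, `in_id`, `one`) and the function names are assumptions made for this illustration; the scanner remembers the last accepting state it passed through and falls back to it when the automaton gets stuck.

```python
import string


def step(state, ch):
    """Transition function of the ID/ONE automaton (state names assumed)."""
    if state in ("start", "in_id") and ch in string.ascii_lowercase:
        return "in_id"
    if state == "start" and ch == "1":
        return "one"
    return None                            # missing transition


TOKEN_OF = {"in_id": "ID", "one": "ONE"}   # accepting state -> token kind


def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_kind, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state is None:              # automaton cannot continue
                break
            i += 1
            if state in TOKEN_OF:          # longest match seen so far
                last_kind, last_end = TOKEN_OF[state], i
        if last_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((last_kind, text[pos:last_end]))
        pos = last_end                     # reset and continue from here
    return tokens
```

On the input `abb1` this yields `ID(abb)` followed by `ONE`.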
Handling ambiguities

ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: NFA with one branch accepting ID (a-z followed by a loop on a-z) and one branch accepting IF (i followed by f)]
[Figure: DFA obtained from the NFA; the state reached by reading i then f accepts both IF and ID, while the other letters (a-z\i, a-z\f, a-z) lead to states that accept ID only]
Handling ambiguities

ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
Output: ID(if)
[Figure: the DFA for ID and IF; the state reached by reading i then f accepts both IF and ID]
Handling ambiguities

IF = if
ID = (a|b|…|z) (a|b|…|z)*
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
Output: IF
Conclusion: list keyword token definitions before identifier definition
[Figure: the DFA for IF and ID; the state reached by reading i then f accepts both IF and ID]
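The effect of definition order can be tried out with a small rule-based matcher. This sketch uses Python's `re` module in place of a hand-built DFA; the function name `longest_match` and the rule lists are illustrative assumptions. The longest lexeme wins, and among equally long matches the rule listed first wins.

```python
import re


def longest_match(rules, text, pos=0):
    """Return (kind, lexeme) of the longest match; earlier rules win ties."""
    best = None
    for kind, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strictly longer is required to replace an earlier rule's match.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


ID_FIRST = [("ID", r"[a-z]+"), ("IF", r"if")]
IF_FIRST = [("IF", r"if"), ("ID", r"[a-z]+")]
```

With `ID_FIRST` the input `if` is reported as an identifier; with `IF_FIRST` it is reported as the keyword, while a longer word such as `iffy` is still an identifier because the longest match takes precedence over order.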
Implementing scanners in practice
Implementing scanners

Manual construction of automata +
determinization is




Very tedious
Error-prone
Non-incremental
Fortunately there are tools that
automatically generate code from a
specification for most languages

C: Lex, Flex
Java: JLex, JFlex
Using JFlex



Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser
[Figure: the regular expressions in MyScanner.lex are given to JFlex, which generates MyScanner.java; the generated scanner turns a stream of characters into tokens]
Common format for reg-exps
Basic Patterns          Matching
x                       The character x
.                       Any character, usually except a new line
[xyz]                   Any of the characters x,y,z
Repetition Operators
R?                      An R or nothing (=optionally an R)
R*                      Zero or more occurrences of R
R+                      One or more occurrences of R
Composition Operators
R1R2                    An R1 followed by R2
R1|R2                   Either an R1 or R2
Grouping
(R)                     R itself
Escape characters

What is the expression for one or more +
symbols?






(+)+ won’t work
(\+)+ will
Backslash \ before an operator turns it into a standard character
\*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t
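The same escaping rule can be observed in Python's `re` module, used here only as a convenient stand-in; scanner generators have their own syntax details. The variable names are illustrative.

```python
import re

# Unescaped: the inner + is an operator with nothing to repeat.
try:
    re.compile(r"(+)+")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped: \+ is the plain character "+", repeated one or more times.
plus_run = re.compile(r"(\+)+")
matches_three = plus_run.fullmatch("+++") is not None
```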
Shorthands

Use names for expressions





letter = a | b | … | z | A | B | … | Z
letter_ = letter | _
digit = 0 | 1 | 2 | … | 9
id = letter_ (letter_ | digit)*
Use hyphen to denote a range


letter = a-z | A-Z
digit = 0-9
Catching errors


What if input doesn’t match any token
definition?
Trick: Add a “catch-all” rule that matches
any character and reports an error

Add after all other rules
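A catch-all rule can be sketched with Python's `re` module by combining the rules into one alternation with named groups. The rule names and the tiny token set are assumptions for the example. Note that Python tries alternatives in order rather than picking the longest match, so this shows the catch-all idea only, not maximal munch.

```python
import re

RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[a-z]+"),
    ("SKIP", r"[ \t\n]+"),   # whitespace: recognized but not emitted
    ("ERROR", r"."),         # catch-all, listed after all other rules
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in RULES))


def scan(text):
    tokens = []
    for m in MASTER.finditer(text):
        kind = m.lastgroup               # name of the rule that matched
        if kind == "ERROR":
            raise SyntaxError(
                f"unexpected character {m.group()!r} at offset {m.start()}")
        if kind != "SKIP":
            tokens.append((kind, m.group()))
    return tokens
```

Because the catch-all comes last, it only fires on characters that no earlier rule accepts.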
Next lecture: parsing