6th Lecture

A common word processor facility is to
search for a given word in a document.
Generally, the problem is to search for
occurrences of a short string in a long string.

Do the first then do the other one
the
Dr. Maged Wafy
The brute force algorithm:
◦ invented in the dawn of computer history
◦ re-invented many times, still common
Knuth & Pratt invented a better one in 1970
◦ invented independently by Morris
◦ published 1977 as “Knuth-Morris-Pratt”
Boyer & Moore found a better one before 1976
◦ found independently by Gosper
Karp & Rabin found a “better” one in 1980

The obvious algorithm is to try the word at each
possible place, and compare all the characters:
for i := 0 to n-m do                      (doc length n)
   for j := 0 to m-1 do                   (word length m)
      compare word[j] with doc[i+j]
      if not equal, exit the inner loop

The complexity is at worst O(m*n) and best
O(n).
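
The loop above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name brute_force_search and the convention of returning the first matching position (or -1) are assumptions added here, since the slide only describes the comparisons.

```python
def brute_force_search(word, doc):
    """Return the first index where word occurs in doc, or -1 (assumed convention)."""
    n, m = len(doc), len(word)
    for i in range(n - m + 1):          # every possible starting place
        for j in range(m):              # compare the characters one by one
            if word[j] != doc[i + j]:
                break                   # not equal: exit the inner loop
        else:
            return i                    # inner loop finished without a mismatch
    return -1


print(brute_force_search("the", "Do the first then do the other one"))
```

The best case arises when almost every alignment fails on its first character, so about n comparisons are made in total.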
Surprisingly, there is a faster algorithm where
you compare the last characters first:

Do the first then do the other one
the
compare 'e' with ' ', fail so move along 3 places

Do the first then do the other one
the
can only move along 2 places

In every case where the document character
is not one of the characters in the word, we
can move along m places. Sometimes, it is
less.





Let p be the pattern string
Let t be the target string
Let k be the index of the character in the
target string that “lies over” the first
character of the pattern
Given two strings, p and t, over the
alphabet Σ, determine whether p occurs as
a substring of t
That is, determine whether there exists k
such that p = Substring(t,k,|p|).
function SimpleStringSearch(string p,t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
for k from 0 to Length(t) - Length(p) do
    i <- 0
    while i < Length(p) and p[i] = t[k+i] do
        i <- i+1
    if i == Length(p) then return k
return -1
Example: SimpleStringSearch for p = ABCD in t = ABCEFGABCDE
(Y = characters match, N = mismatch)

t:      A B C E F G A B C D E      (t[0] .. t[10])
k = 0:  A B C D                    Y Y Y N   p[3] = D differs from t[3] = E
k = 1:    A B C D                  N         p[0] = A differs from t[1] = B
k = 2:      A B C D                N         p[0] = A differs from t[2] = C
k = 3:        A B C D              N         p[0] = A differs from t[3] = E
k = 4:          A B C D            N         p[0] = A differs from t[4] = F
k = 5:            A B C D          N         p[0] = A differs from t[5] = G
k = 6:              A B C D        Y Y Y Y   match found, return 6

Worst case:
◦ Pattern string always matches completely
except for last character
◦ Example: search for XXXXXXY in target string of
XXXXXXXXXXXXXXXXXXXX
◦ Outer loop executed once for (nearly) every character in
target string
◦ Inner loop executed once for every character in
pattern
◦ Θ(|p| * |t|)
Okay if patterns are short, but better
algorithms exist
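
To make the worst case concrete, the sketch below counts character comparisons made by the simple search. The function name count_comparisons and the choice of a 20-character target of X's are assumptions made for illustration only.

```python
def count_comparisons(p, t):
    """Simple string search that also counts character comparisons.

    Returns (location or -1, number of comparisons). Hypothetical helper,
    written only to illustrate the worst case.
    """
    comparisons = 0
    for k in range(len(t) - len(p) + 1):
        i = 0
        while i < len(p):
            comparisons += 1
            if p[i] != t[k + i]:
                break
            i += 1
        if i == len(p):
            return k, comparisons
    return -1, comparisons


# Pattern of six X's and a Y, target of twenty X's (assumed length).
print(count_comparisons("XXXXXXY", "X" * 20))
```

With these inputs there are 14 alignments and each one needs all 7 comparisons before failing, giving 98 comparisons and no match.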


The simple search takes Θ(|p| * |t|) time in the worst case.
Key idea:
◦ if pattern fails to match, slide pattern to right by as
many boxes as possible without permitting a match
to go unnoticed
Example: pattern XYXYZ; the first four characters match and the
fifth is compared with an unknown target character c:

t[0] t[1] t[2] t[3] t[4] ...
 X    Y    X    Y    c
p[0] p[1] p[2] p[3] p[4]
 X    Y    X    Y    Z
 Y    Y    Y    Y    ?




Correct motion of pattern depends on both
location of mismatch and the mismatching
character
If c == X : move 2 boxes to right
If c == E : move 5 boxes to right
If c == Z : target found; algorithm terminates

Goal: determine d, number of boxes to right
pattern should move; smallest d such that:





p[0] = t[k+d]
p[1] = t[k+d+1]
p[2] = t[k+d+2]
…
p[i-d] = t[k+i]


Note: can be stated largely in terms of
pattern alone.
Value of d depends only on:
◦ The pattern
◦ The value of i
◦ The mismatching character c (at t[k+i])

Can define a function KMPskip(p,i,c) to give
correct d
◦ Return smallest integer d such that 0 <= d <= i,
such that p[i-d] == c and p[j] == p[j+d] for
each 0 <= j <= i-d-1
◦ Return i+1 if no such d exists
Calculate all values of KMPskip for pattern p
and store them in KMPskiparray;
do a lookup at each mismatch
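
The definition of KMPskip can be transcribed almost directly into Python. This is a sketch under assumptions: the names kmp_skip and compute_kmp_skip_array are illustrative, and storing the array as a list of dictionaries (one per position i, with characters outside the pattern handled by a default of i+1) is a choice made here, not something the slides specify.

```python
def kmp_skip(p, i, c):
    """Smallest d with 0 <= d <= i, p[i-d] == c and p[j] == p[j+d]
    for each 0 <= j <= i-d-1; i+1 if no such d exists."""
    for d in range(i + 1):
        if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
            return d
    return i + 1


def compute_kmp_skip_array(p):
    """One dictionary per position i, keyed by the characters of p.
    A character not in p is looked up with the default i + 1."""
    return [{c: kmp_skip(p, i, c) for c in set(p)} for i in range(len(p))]


table = compute_kmp_skip_array("XYXYZ")
print(table[4]["X"], table[4].get("E", 4 + 1), table[4]["Z"])
```

For the pattern XYXYZ with the mismatch at i = 4, this gives 2 for c == X, 5 for a character such as E that is not in the pattern, and 0 for c == Z.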

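For pattern ABCD (columns: position i of the mismatch; rows: mismatching character c):

i            0    1    2    3
p[i]         A    B    C    D
c = A        0    1    2    3
c = B        1    0    3    4
c = C        1    2    0    4
c = D        1    2    3    0
c = other    1    2    3    4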

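For pattern XYXYZ (columns: position i of the mismatch; rows: mismatching character c):

i            0    1    2    3    4
p[i]         X    Y    X    Y    Z
c = X        0    1    0    3    2
c = Y        1    0    3    0    5
c = Z        1    2    3    4    0
c = other    1    2    3    4    5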
function KMPSearch(string p, t): integer
{Find p in t; return its location or -1 if p is not a substring of t}
KMPskiparray <- ComputeKMPskiparray(p)
k <- 0
i <- 0
while k <= Length(t) - Length(p) do
    if i == Length(p) then return k
    d <- KMPskiparray[i, t[k+i]]
    k <- k + d
    i <- i + 1 - d
return -1
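
A runnable Python sketch of this search is given below. It assumes the loop should continue while k <= Length(t) - Length(p), so that a match ending at the last character of t is not missed. The names kmp_search and skip are illustrative, and the skip array is rebuilt inline from the KMPskip definition so that the block is self-contained.

```python
def kmp_search(p, t):
    """Find p in t; return its location or -1 if p is not a substring of t."""
    m, n = len(p), len(t)

    def kmp_skip(i, c):
        # Smallest shift d that keeps the already matched prefix consistent.
        for d in range(i + 1):
            if p[i - d] == c and all(p[j] == p[j + d] for j in range(i - d)):
                return d
        return i + 1

    # skip[i][c] for characters of p; other characters default to i + 1.
    skip = [{c: kmp_skip(i, c) for c in set(p)} for i in range(m)]

    k = 0   # position in t lying over p[0]
    i = 0   # position in p being compared
    while k <= n - m:
        if i == m:
            return k
        d = skip[i].get(t[k + i], i + 1)
        k += d          # slide the pattern d boxes to the right
        i += 1 - d      # d == 0 means a match: advance within the pattern
    return -1


print(kmp_search("ABCD", "ABCEFGABCDE"))
```

For the earlier example (ABCD in ABCEFGABCDE) the pattern jumps from k = 0 straight to k = 4 after the mismatch at p[3], and the match is reported at position 6.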

Coming soon ….
To work out how far to skip when the last
character does not match, build a table.
Care is needed with repeated letters:

word cab
skip   1  *  2  3  3 ...
       a  b  c  d  e ...

word abba
skip   *  1  4  4  4 ...
       a  b  c  d  e ...

skip[c] = distance of last occurrence of c from end
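
One way to build such a table is sketched below. The function name build_skip and the use of a dictionary are assumptions; characters that do not occur in the word are not stored and are looked up with a default equal to the word length. The last character of the word gets distance 0, which corresponds to the * entries, since the table is only consulted when the last character does not match.

```python
def build_skip(word):
    """Map each character of word to the distance of its last
    occurrence from the end of the word (0 for the final character)."""
    m = len(word)
    skip = {}
    for j, ch in enumerate(word):
        skip[ch] = m - 1 - j    # later occurrences overwrite earlier ones
    return skip


for word in ("cab", "abba"):
    skip = build_skip(word)
    # Characters absent from the word skip the full word length.
    print(word, [skip.get(ch, len(word)) for ch in "abcde"])
```

For cab this gives 1, 0, 2, 3, 3 for the letters a to e, and for abba it gives 0, 1, 4, 4, 4, with 0 standing where the slide shows *.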
The algorithm becomes:
i := 0
while i <= n-m do
   if word[m-1] = doc[i+m-1] then
      for j := 0 to m-1 do
         compare word[j] with doc[i+j]
      i := i + 1
   else i := i + skip[doc[i+m-1]]
This is still O(n*m) in the worst case, but now it
is O(n/m) in the best case, because m
characters may be skipped at each stage.
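
The last-character algorithm might look like this in Python. It is a sketch under assumptions: the name last_char_search, the dictionary-based skip table, and returning the first match (or -1) are choices made here for illustration.

```python
def last_char_search(word, doc):
    """Return the first index of word in doc, or -1 (assumed convention)."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    # Distance of the last occurrence of each character from the end of word.
    skip = {ch: m - 1 - j for j, ch in enumerate(word)}
    i = 0
    while i <= n - m:
        last = doc[i + m - 1]
        if word[m - 1] == last:
            if doc[i:i + m] == word:    # compare all the characters
                return i
            i += 1
        else:
            # Characters not in the word allow a skip of the full length m.
            i += skip.get(last, m)
    return -1


print(last_char_search("the", "Do the first then do the other one"))
```

On the example sentence, the first comparison is 'e' against ' ', which fails and moves the word along 3 places, directly onto the first occurrence at position 3.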
The last-character algorithm can be generalised
by making the skip table work for partial
matches, and by adding a secondary table. The
result is the Boyer-Moore algorithm.
It is possible to show that the complexity of the
Boyer-Moore algorithm is guaranteed to be only
O(n) in the worst case, as well as O(n/m) in the
best case.
It has generally been regarded as too difficult to
understand, and so has not been used much.
Karp & Rabin found an algorithm which is:
◦ almost as fast as Boyer-Moore
◦ simple enough to understand easily
◦ can be adapted for 2-dimensional searches for patterns in pictures
Go back to the brute force idea, but now use
a single number to represent the word you
are searching for, and a single number for the
current portion of the document you are
comparing against.


Suppose we are searching for 4-letter words. Then
the whole (English) word fits in one (computer)
word w of 4 bytes. If the current 4 bytes of the
document are also in one word d, a single
comparison can match the two in one step. To
move along the document, shift d and add in the
next character.
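
The 4-letter case can be imitated in Python by packing four bytes into one integer. This is only a sketch: the function name find_4_letter_word is hypothetical, the inputs are assumed to be bytes objects with one byte per character, and the mask 0xFFFFFFFF stands in for the natural overflow of a 32-bit machine word.

```python
def find_4_letter_word(word, doc):
    """Find a 4-byte word in doc (both bytes objects); return index or -1."""
    assert len(word) == 4
    if len(doc) < 4:
        return -1
    w = int.from_bytes(word, "big")      # the word as one 32-bit number
    d = int.from_bytes(doc[:4], "big")   # current 4 bytes of the document
    i = 0
    while True:
        if d == w:                       # single comparison per position
            return i
        if i + 4 >= len(doc):
            return -1
        # Shift out the oldest byte and add in the next character.
        d = ((d << 8) & 0xFFFFFFFF) | doc[i + 4]
        i += 1


print(find_4_letter_word(b"then", b"Do the first then do the other one"))
```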
For longer words, use hashing. The characters of
the word and the document are combined into
single hash numbers wh and dh. The hash number
dh can be updated by doing a suitable sum and
adding in the code for the next character.
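
A rolling hash along these lines could be sketched as follows. The particular hash (a polynomial in base 256 taken modulo 1000003), the parameter names base and mod, and the function name karp_rabin_search are assumptions chosen for illustration; the slides do not fix a specific hash function. Because different strings can share a hash value, the sketch confirms each hash match with a direct comparison.

```python
def karp_rabin_search(word, doc, base=256, mod=1_000_003):
    """Return the first index of word in doc, or -1 (assumed convention)."""
    m, n = len(word), len(doc)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)     # weight of the oldest character
    wh = dh = 0
    for j in range(m):               # hash of word and of the first window
        wh = (wh * base + ord(word[j])) % mod
        dh = (dh * base + ord(doc[j])) % mod
    for i in range(n - m + 1):
        if wh == dh and doc[i:i + m] == word:
            return i
        if i < n - m:
            # Remove doc[i], shift, and add in the code for the next character.
            dh = ((dh - ord(doc[i]) * high) * base + ord(doc[i + m])) % mod
    return -1


print(karp_rabin_search("other", "Do the first then do the other one"))
```

Each step updates dh in constant time, so the hashes are compared once per position without rehashing the whole window.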