
Language Learning Week 10
Pieter Adriaans: [email protected]
Sophia Katrenko: [email protected]
Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency
The universal distribution: m
• The coding theorem (Levin): −log m(x) = −log P_U(x) + O(1) = K(x) + O(1)
• A distribution is simple if it is dominated by a recursively enumerable distribution.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).
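To make the coding-theorem intuition concrete, here is a minimal sketch (not from the slides): a real compressor (zlib) stands in for the non-computable K(x), and m(x) is approximated as 2^(−K(x)), so compressible strings receive far more probability than random ones. The function names approx_K and approx_m are my own.

```python
import os
import zlib

def approx_K(x: bytes) -> int:
    # Computable stand-in for K(x): length in bits of the zlib-compressed
    # string. Real Kolmogorov complexity is non-computable; a compressor
    # only gives an upper bound.
    return 8 * len(zlib.compress(x, 9))

def approx_m(x: bytes) -> float:
    # Coding-theorem intuition: m(x) is proportional to 2^{-K(x)},
    # so compressible (low-complexity) strings get much higher probability.
    return 2.0 ** (-approx_K(x))

simple_string = b"ab" * 500       # highly regular, low complexity
random_string = os.urandom(1000)  # incompressible with high probability

print(approx_K(simple_string), approx_K(random_string))
print(approx_m(simple_string) > approx_m(random_string))  # True
```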
Problem: finite recursive grammar with an infinite number of sentences
• How do we know we have 'enough' examples?
• The notion of a characteristic sample. Informally: a sample S of a language L(G) is characteristic if we can reconstruct the grammar G from S.
How can we draw a characteristic sample under m?
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that
\[
  \forall s \in C_G : \; |s| \;\le\; c \log K(G)
\]
Simple and Shallow for dummies
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• Shallow structures are constructed from an exponential number of small building blocks.
• Simple distributions are typical for those sets that are generated by a computational process.
• The universal distribution m is a non-computable distribution that dominates all simple distributions multiplicatively.
• Objects with low Kolmogorov complexity have high probability.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).
Characteristic sample
Let Σ be an alphabet and Σ* the set of all strings over Σ.
L(G) = S ⊆ Σ* is the language generated by a grammar G.
C_G ⊆ S is a characteristic sample for G.
(Diagram: the nested sets C_G ⊆ S ⊆ Σ*.)
Shallowness
• Seems to be an independent category.
• There are finite languages that are not shallow. Example: the finite language
{ 01, 11, 101, 100, 001, 110, 1110, 0001, 1011, 000000000001100101011000010010010100101010 }
(intuitively, the one long random-looking string must itself occur in any characteristic sample, so the needed examples cannot all be short).
(Diagram: the Chomsky hierarchy (regular, context-free, context-sensitive, type 0), with the shallow languages as a class cutting across it.)
Shallowness is very restrictive
(Diagram: a context-free grammar G, with rules of length < log |G| against a total grammar size < |G|.)
In terms of natural growth: if one expands the longest rule by 1 bit, one must double the number of rules.
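A rough check of this growth claim, assuming the shallowness bound |s| ≤ c log K(G) and, for simplicity, taking K(G) to be measured by the number of rules:

```latex
% Allowing characteristic examples one bit longer requires the grammar's
% complexity to grow by a factor 2^{1/c}; for c = 1 it must double.
\[
  |s| + 1 \;\le\; c \log K(G')
  \quad\Longleftrightarrow\quad
  K(G') \;\ge\; 2^{1/c}\, K(G)
\]
```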
Complexity of sets: Enumeration 1
• Example program p1: for x := 1 to i, print x
• Example program p2: for x := 1 to i, print x!
• Example program p3 (G is a context-free grammar): for x := 1 to i, print the x-th sentence generated by G
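A minimal Python sketch of these three enumeration programs. The concrete grammar used for p3 (S → ab | aSb) and its enumeration order are my own illustrative choices, not taken from the slides.

```python
from itertools import count, islice
from math import factorial

def p1(i: int) -> None:
    # p1: print the numbers 1..i
    for x in range(1, i + 1):
        print(x)

def p2(i: int) -> None:
    # p2: print the factorials 1!, 2!, ..., i!
    for x in range(1, i + 1):
        print(factorial(x))

def sentences_of_G():
    # Toy stand-in for a context-free grammar G (my own choice):
    # S -> ab | aSb, so L(G) = { a^n b^n : n >= 1 }, enumerated by n.
    for n in count(1):
        yield "a" * n + "b" * n

def p3(i: int) -> None:
    # p3: print the first i sentences generated by G
    for s in islice(sentences_of_G(), i):
        print(s)
```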
The conditional complexity of an element of a set
• Given a program p that enumerates the elements of a set S, the conditional complexity of an element x of S is
K(x | S) ≤ log i + c
(where i is the index of x in S).
• Example: K(479001600 | p-factorial) ≤ log 12 + c, since 479001600 = 12!.
Complexity of sets: Enumeration 3
• Given a program p that enumerates the elements of a set S, the conditional complexity of an initial finite segment of k elements of S is
K(S_{<k} | S) ≤ log k + c
Complexity of sets: Enumeration 2
• Given a program p that enumerates the elements of an (in)finite set S, the complexity of S is
K(S) ≤ K(p) + c
Complexity of sets: Selection 1

i:   1  2  3  4  5  6  7  8  9  10
v_i: 0  1  1  0  0  1  0  0  1  0

• The vector v = 0110010010 is a characteristic vector.
• The set {2, 3, 6, 9} is described by the following program:
  for i := 1 to 10: if v_i = 1, print i
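A minimal Python version of the slide's program, decoding the set from its characteristic vector (the function name is my own):

```python
def set_from_characteristic_vector(v: str) -> set:
    # The slide's program: for i := 1 to |v|, output i whenever v_i = 1.
    return {i for i, bit in enumerate(v, start=1) if bit == "1"}

print(set_from_characteristic_vector("0110010010"))  # {2, 3, 6, 9}
```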
Estimate of K(v)
\[
  \binom{n}{k} \;=\; \frac{n!}{k!\,(n-k)!}
\]
\[
  K(v) \;\le\; \log \binom{|v|}{k} + c
\]
• K(v) depends on the density of v (k is the number of ones in v).
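A quick worked check of this bound for the vector from the previous slide (a sketch; the variable names are mine):

```python
from math import comb, log2

v = "0110010010"
k = v.count("1")               # number of ones in v (the density): 4
bound = log2(comb(len(v), k))  # log binom(|v|, k)
print(k, round(bound, 2))      # 4 7.71  -- so K(v) <= ~7.71 + c bits
```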
Complexity of sets: Selection 2
• Given a program p that enumerates the elements of a set S, the conditional complexity of a finite subset S' of size k is
\[
  K(S' \mid S) \;\le\; K(v) + c \;\le\; \log \binom{|v|}{k} + c
\]
• where v is the characteristic vector of S'.
Basic Intuition
• For a good model M of some string s, the complexity K(s) of s will be close to the length of a random index of M.
• In this case s is a typical element of M and the randomness deficiency is low.
• Model optimization = minimization of randomness deficiency.
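For reference, the standard formula behind this intuition (supplied here, as in Vereshchagin & Vitányi 2004):

```latex
% Randomness deficiency of a string x in a finite model (set) M with x in M:
\[
  \delta(x \mid M) \;=\; \log |M| \;-\; K(x \mid M)
\]
% x is typical for M when the deficiency is small: x then needs a nearly
% full-length ("random") index of about log|M| bits to be singled out in M.
```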
Complexity of the Data given a Model
• Suppose |M| = m and |D| = d.
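Following the Selection 2 bound above, the natural reading (an assumption on my part, not spelled out on the slide) is that a data set D ⊆ M of size d is given, relative to the model M, by its index among the size-d subsets of M:

```latex
% Assumed completion, following Selection 2: D is encoded by its
% characteristic vector over M.
\[
  K(D \mid M) \;\le\; \log \binom{m}{d} + O(1),
  \qquad |M| = m,\; |D| = d
\]
```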
Regular Languages: Deterministic Finite Automata (DFA)
(Diagram: a four-state DFA over {a, b}, states 0-3, accepting for example aa, abab and abaaabbb.)
{w ∈ {a,b}* | #_a(w) and #_b(w) are both even}
DFA = NFA (Non-deterministic) = REG
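A direct Python encoding of this even-a's/even-b's automaton (the state naming as parity pairs and the transition-table layout are my own; the language is the one on the slide):

```python
# Four-state DFA for {w in {a,b}* : #a(w) and #b(w) are both even}.
# State = (parity of a's, parity of b's); start state = accepting state = (0, 0).
DELTA = {
    ((0, 0), "a"): (1, 0), ((0, 0), "b"): (0, 1),
    ((1, 0), "a"): (0, 0), ((1, 0), "b"): (1, 1),
    ((0, 1), "a"): (1, 1), ((0, 1), "b"): (0, 0),
    ((1, 1), "a"): (0, 1), ((1, 1), "b"): (1, 0),
}

def accepts(word: str) -> bool:
    state = (0, 0)
    for ch in word:
        state = DELTA[(state, ch)]
    return state == (0, 0)

for w in ["aa", "abab", "abaaabbb", "aba"]:
    print(w, accepts(w))   # True, True, True, False
```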
Learning DFA using only positive examples with MDL
S+ = {c, cab, cabab, cababab, cababababab}
(Diagram: two candidate automata L1 and L2 consistent with S+, each described by its number of states, number of letters (including the empty letter), and number of arrows to and from the outside world.)
Coding in bits:
|L1| ≈ 5 log2(3+1) · 2 log2(1+1) = 20
|L2| ≈ 5 log2(3+1) · 2 log2(1+3) = 40
so MDL prefers the smaller code, L1.
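A sketch of the MDL comparison between the two candidates. The cost function below is my reading of the slide's figures (arrows times a per-state and per-letter term), not an authoritative formula; the function and variable names are mine.

```python
from math import log2

def code_length(n_arrows: int, n_states: int, n_letters: int) -> float:
    # Cost as read off the slide:
    # (# arrows) * log2(# states + 1) * 2 * log2(# letters + 1)   [assumed form]
    return n_arrows * log2(n_states + 1) * 2 * log2(n_letters + 1)

L1 = code_length(5, 3, 1)   # 5 arrows, 3 states, 1 letter  -> 20.0 bits
L2 = code_length(5, 3, 3)   # 5 arrows, 3 states, 3 letters -> 40.0 bits
print(L1, L2, "MDL prefers L1" if L1 < L2 else "MDL prefers L2")
```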
Randomness Deficiency
Minimal Randomness
Deficiency
Kolmogorov Structure function
Two-Part Code
Constrained MDL estimator
• where the MDL code is the two-part code length Λ(M) = K(M) + log |M|
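For reference, the standard definitions behind these notions, as in Vereshchagin & Vitányi (2004) (a supplied summary; α denotes the complexity bound on the model, and δ(x|S) is the randomness deficiency defined above):

```latex
% Minimal randomness deficiency at model-complexity level alpha:
\[ \beta_x(\alpha) \;=\; \min_S \{\, \delta(x \mid S) : x \in S,\; K(S) \le \alpha \,\} \]
% Kolmogorov structure function:
\[ h_x(\alpha) \;=\; \min_S \{\, \log|S| : x \in S,\; K(S) \le \alpha \,\} \]
% Two-part code length of a model S, and the constrained MDL estimator:
\[ \Lambda(S) \;=\; K(S) + \log|S|, \qquad
   \lambda_x(\alpha) \;=\; \min_S \{\, \Lambda(S) : x \in S,\; K(S) \le \alpha \,\} \]
```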
Two-part Code Optimization
• Let x be a dataset and S a set with x ∈ S; then
\[
  K(x) \;\le\; K(S) + \log |S| + O(1)
\]
• Kolmogorov structure function: two-part code optimization always results in a model of best fit, irrespective of whether the source producing the data is in the model class considered (Vereshchagin & Vitányi 2004).
\[
  h_x(\alpha) \;=\; \min_S \{\, \log |S| : x \in S,\; K(S) \le \alpha \,\}
\]
A useful approximation
\[
  \lim_{k \to \infty} \frac{\log k!}{\int_1^k \log x \, dx} \;=\; 1
\]
\[
  \log \binom{n}{k} \;\approx\; \int_{n-k}^{n} \log x \, dx \;-\; \int_{1}^{k} \log x \, dx
\]
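A quick numeric check of this approximation (a sketch; the helper names int_log2 and approx_log2_binom are mine, and the antiderivative x·log2 x − x/ln 2 is used in place of numerical integration):

```python
from math import comb, log2, log

def int_log2(a: float, b: float) -> float:
    # Integral of log2(x) dx from a to b, via the antiderivative x*log2(x) - x/ln(2).
    F = lambda t: t * log2(t) - t / log(2)
    return F(b) - F(a)

def approx_log2_binom(n: int, k: int) -> float:
    # Slide's approximation: integral over [n-k, n] minus integral over [1, k].
    return int_log2(n - k, n) - int_log2(1, k)

n, k = 100, 40
print(log2(comb(n, k)), approx_log2_binom(n, k))  # ~93.5 vs ~95.7: close, within a few bits
```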
Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency