Language Learning, Week 10
Pieter Adriaans: [email protected]
Sophia Katrenko: [email protected]

Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency

The universal distribution: m
• The coding theorem (Levin): −log m(x) = −log P_U(x) + O(1) = K(x) + O(1)
• A distribution is simple if it is dominated by a recursively enumerable distribution.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).

Problem: a finite recursive grammar with an infinite number of sentences
• How do we know we have 'enough' examples?
• The notion of a characteristic sample. Informally: a sample S of a language G is characteristic if we can reconstruct G from S.
• How can we draw a characteristic sample under m?
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that every s ∈ C_G satisfies |s| ≤ c · log K(G).

Simple and Shallow for dummies
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• Shallow structures are constructed from an exponential number of small building blocks.
• Simple distributions are typical for those sets that are generated by a computational process.
• The universal distribution m is a non-computable distribution that dominates all simple distributions multiplicatively.
• Objects with low Kolmogorov complexity have high probability.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).

Characteristic sample
• Let Σ be an alphabet and Σ* the set of all strings over Σ.
• L(G) = S ⊆ Σ* is the language generated by a grammar G.
• C_G ⊆ S is a characteristic sample for G (i.e., G can be reconstructed from C_G).

Shallowness
• Shallowness seems to be a category independent of the Chomsky hierarchy.
• There are finite languages that are not shallow, e.g. {01, 11, 101, 100, 001, 110, 1110, 0001, 1011, 000000000001100101011000010010010100101010}: the long random string must itself appear in any characteristic sample, and its length far exceeds log K(G).
[Figure: the Chomsky hierarchy (type 0 ⊃ context-sensitive ⊃ context-free ⊃ regular) with the class of shallow languages cutting across all levels.]

Shallowness is very restrictive
[Figure: a context-free grammar G of total size |G| whose individual rules have length < log |G|.]
• In terms of natural growth: if one expands the longest rule by one bit, one must double the number of rules.

Complexity of sets: Enumeration 1
• Example program p1: for x = 1 to i: print x
• Example program p2: for x = 1 to i: print x!
• Example program p3 (G is a context-free grammar): for x = 1 to i: print G_x, the x-th sentence generated by G

The conditional complexity of an element of a set
• Given a program p that enumerates the elements of a set S, the conditional complexity of an element x of S is: K(x|S) ≤ log i + c, where i is the index of x in S.
• Example: K(479001600 | p-factorial) ≤ log 12 + c (since 479001600 = 12!)

Complexity of sets: Enumeration 3
• Given a program p that enumerates the elements of a set S, the conditional complexity of an initial finite segment S_<k of the first k elements of S is: K(S_<k | S) ≤ log k + c

Complexity of sets: Enumeration 2
• Given a program p that enumerates the elements of an (in)finite set S, the complexity of S is: K(S) ≤ K(p) + c

Complexity of sets: Selection 1
  index: 1 2 3 4 5 6 7 8 9 10
  v:     0 1 1 0 0 1 0 0 1 0
• The vector v = 0110010010 is a characteristic vector.
• The set {2, 3, 6, 9} is described by the following program: for i = 1 to 10: if v_i = 1 print i
• Estimate of K(v): with C(n, k) = n! / (k! (n − k)!), we have K(v) ≤ log C(|v|, k) + c, where k is the number of ones in v.
• K(v) depends on the density of v!
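To make the Selection 1 slide concrete, here is a minimal Python sketch (not from the lecture; the helper name k_estimate and the extra test vector are illustrative) that reconstructs the set {2, 3, 6, 9} from the characteristic vector v = 0110010010 and evaluates the density-dependent bound log2 C(|v|, k) on K(v), with the additive constant c dropped:

    # Sketch of the "Selection 1" example: decode a subset from its
    # characteristic vector and compute the density-dependent estimate
    # log2 C(|v|, k) of K(v) (the additive constant c is omitted).
    import math

    v = "0110010010"   # characteristic vector from the slide

    # "for i = 1 to 10: if v_i = 1 print i"
    subset = [i for i in range(1, len(v) + 1) if v[i - 1] == "1"]
    print(subset)      # [2, 3, 6, 9]

    def k_estimate(v: str) -> float:
        """Upper bound on K(v), up to a constant: log2 C(|v|, #ones(v))."""
        n, k = len(v), v.count("1")
        return math.log2(math.comb(n, k))

    print(round(k_estimate("0110010010"), 2))   # 7.71 bits: 4 ones out of 10 positions
    print(round(k_estimate("0000000001"), 2))   # 3.32 bits: a sparser vector is cheaper

The second call illustrates the final bullet above: the sparser the characteristic vector, the smaller the estimate, even though both vectors have the same length.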
Complexity of sets: Selection 2
• Given a program p that enumerates the elements of a set S, the conditional complexity of a finite subset S′ of size k is: K(S′ | S) ≤ K(v) + c ≤ log C(|v|, k) + c
• where v is the characteristic vector of S′.

Basic Intuition
• For a good model M of a string s, the complexity K(s) of s will be close to the length of a random index of M.
• In this case s is a typical element of M and the randomness deficiency is low.
• Model optimization = minimization of randomness deficiency.

Complexity of the Data given a Model
• Suppose |M| = m and |D| = d.

Regular Languages: Deterministic Finite Automata (DFA)
• {w ∈ {a,b}* | #a(w) and #b(w) are both even}
• Accepted examples: aa, abab, abaaabbb
[Figure: a four-state DFA (states 0, 1, 2, 3) over {a, b} accepting this language.]
• DFA = NFA (non-deterministic) = REG

Learning DFA using only positive examples with MDL
• S+ = {c, cab, cabab, cababab, cababababab}
[Figure: two candidate automata L1 and L2 for S+, drawn with states 0, 1, 2, the letters a, b, c plus an empty letter, and an "outside world" state; the code for an automaton counts the number of states, the number of letters, and the number of arrows.]
• Coding in bits (see the sketch at the end of this section):
  |L1| = 5 · log2(3+1) · 2 · log2(1+1) = 20
  |L2| = 5 · log2(3+1) · 2 · log2(1+3) = 40

Randomness Deficiency
• The randomness deficiency of x in a finite set S with x ∈ S is δ(x|S) = log |S| − K(x|S).

Minimal Randomness Deficiency
• β_x(α) = min_S {δ(x|S) : x ∈ S, K(S) ≤ α}

Kolmogorov Structure Function
• h_x(α) = min_S {log |S| : x ∈ S, K(S) ≤ α}

Two-Part Code: Constrained MDL estimator
• λ_x(α) = min_S {Λ(S) : x ∈ S, K(S) ≤ α}, where the MDL code Λ(S) = K(S) + log |S|.

Two-Part Code Optimization
• Let x be a dataset and S a set with x ∈ S; then K(x) ≤ K(S) + log |S| + O(1).
• Two-part code optimization always results in a model of best fit, irrespective of whether the source producing the data is in the model class considered (Vereshchagin & Vitányi 2004).

A useful approximation
• log k! ≈ ∫_1^k log x dx (in the limit of large k)
• Hence log C(n, k) = log n! − log k! − log (n−k)! ≈ ∫_1^n log x dx − ∫_1^k log x dx − ∫_1^(n−k) log x dx

Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency
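To tie the two-part code Λ(S) = K(S) + log |S| and the bound K(x) ≤ K(S) + log |S| + O(1) to an actual computation, here is a small Python sketch. It is an illustration only, not the lecture's DFA encoding: the "model cost" terms below are crude assumed stand-ins for K(S), chosen just to expose the trade-off between model cost and data-to-model cost.

    # Two-part code lengths Lambda(S) = K(S) + log2|S| for two toy model classes
    # of a binary dataset x. The model-cost terms are rough assumptions standing
    # in for K(S); only the trade-off between the two parts matters here.
    import math

    def two_part_codes(x: str):
        n, k = len(x), x.count("1")
        # Model A: S = {0,1}^n. Describing S needs roughly log2(n) bits (just n).
        total_a = math.log2(n) + n                    # K(S) + log2|S|, with |S| = 2^n
        # Model B: S = all length-n strings with exactly k ones.
        # Describing S needs roughly log2(n) + log2(n+1) bits (n and k).
        total_b = (math.log2(n) + math.log2(n + 1)) + math.log2(math.comb(n, k))
        return total_a, total_b

    sparse = ("0" * 9 + "1") * 3      # 30 bits, 3 ones
    balanced = "01" * 15              # 30 bits, 15 ones
    for x in (sparse, balanced):
        a, b = two_part_codes(x)
        print(f"ones={x.count('1'):2d}: uniform model {a:.1f} bits, density model {b:.1f} bits")

For the sparse string the density model gives a much shorter total code (about 22 vs 35 bits), while for the balanced string the richer model does not pay for its extra model cost. This is the same trade-off that the MDL comparison of the candidate automata L1 and L2 above is meant to expose.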
© Copyright 2024