Language Learning, Week 10
Pieter Adriaans: [email protected]
Sophia Katrenko: [email protected]

Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency

The universal distribution: m
• The coding theorem (Levin): −log m(x) = −log P_U(x) + O(1) = K(x) + O(1)
• A distribution is simple if it is dominated by a recursively enumerable distribution.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).

Problem: a finite recursive grammar with an infinite number of sentences
• How do we know we have 'enough' examples?
• The notion of a characteristic sample. Informally: a sample S of a language G is characteristic if we can reconstruct G from S.
• How can we draw a characteristic sample under m?
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that every s ∈ C_G satisfies |s| ≤ c · log K(G).

Simple and Shallow for dummies
• Solution: the notion of shallowness. Informally: a language is shallow if we only need short examples to learn it.
• Shallow structures are constructed from an exponential number of small building blocks.
• Simple distributions are typical for those sets that are generated by a computational process.
• The universal distribution m is a non-computable distribution that dominates all simple distributions multiplicatively.
• Objects with low Kolmogorov complexity have high probability.
• Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x).

Characteristic sample
• Let Σ be an alphabet and Σ* the set of all strings over Σ.
• L(G) = S ⊆ Σ* is the language generated by a grammar G.
• C_G ⊆ S is a characteristic sample for G (i.e., G can be reconstructed from C_G).

Shallowness
• Shallowness seems to be a category independent of the Chomsky hierarchy.
• There are finite languages that are not shallow, e.g. {01, 11, 101, 100, 001, 110, 1110, 0001, 1011, 000000000001100101011000010010010100101010}: the long random string must itself appear in any characteristic sample, and its length far exceeds log K(G).
[Figure: the Chomsky hierarchy (type 0 ⊃ context-sensitive ⊃ context-free ⊃ regular) with the class of shallow languages cutting across all levels.]

Shallowness is very restrictive
[Figure: a context-free grammar G of total size |G| whose individual rules have length < log |G|.]
• In terms of natural growth: if one expands the longest rule by one bit, one must double the number of rules.

Complexity of sets: Enumeration 1
• Example program p1: for x = 1 to i: print x
• Example program p2: for x = 1 to i: print x!
• Example program p3 (G is a context-free grammar): for x = 1 to i: print G_x, the x-th sentence generated by G

The conditional complexity of an element of a set
• Given a program p that enumerates the elements of a set S, the conditional complexity of an element x of S is: K(x|S) ≤ log i + c, where i is the index of x in S.
• Example: K(479001600 | p-factorial) ≤ log 12 + c (since 479001600 = 12!)

Complexity of sets: Enumeration 3
• Given a program p that enumerates the elements of a set S, the conditional complexity of an initial finite segment S_<k of the first k elements of S is: K(S_<k | S) ≤ log k + c

Complexity of sets: Enumeration 2
• Given a program p that enumerates the elements of an (in)finite set S, the complexity of S is: K(S) ≤ K(p) + c

Complexity of sets: Selection 1
  index: 1 2 3 4 5 6 7 8 9 10
  v:     0 1 1 0 0 1 0 0 1 0
• The vector v = 0110010010 is a characteristic vector.
• The set {2, 3, 6, 9} is described by the following program: for i = 1 to 10: if v_i = 1 print i
• Estimate of K(v): with C(n, k) = n! / (k! (n − k)!), we have K(v) ≤ log C(|v|, k) + c, where k is the number of ones in v.
• K(v) depends on the density of v!
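To make the Selection 1 slide concrete, here is a minimal Python sketch (not from the lecture; the helper name k_estimate and the extra test vector are illustrative) that reconstructs the set {2, 3, 6, 9} from the characteristic vector v = 0110010010 and evaluates the density-dependent bound log2 C(|v|, k) on K(v), with the additive constant c dropped:

    # Sketch of the "Selection 1" example: decode a subset from its
    # characteristic vector and compute the density-dependent estimate
    # log2 C(|v|, k) of K(v) (the additive constant c is omitted).
    import math

    v = "0110010010"   # characteristic vector from the slide

    # "for i = 1 to 10: if v_i = 1 print i"
    subset = [i for i in range(1, len(v) + 1) if v[i - 1] == "1"]
    print(subset)      # [2, 3, 6, 9]

    def k_estimate(v: str) -> float:
        """Upper bound on K(v), up to a constant: log2 C(|v|, #ones(v))."""
        n, k = len(v), v.count("1")
        return math.log2(math.comb(n, k))

    print(round(k_estimate("0110010010"), 2))   # 7.71 bits: 4 ones out of 10 positions
    print(round(k_estimate("0000000001"), 2))   # 3.32 bits: a sparser vector is cheaper

The second call illustrates the final bullet above: the sparser the characteristic vector, the smaller the estimate, even though both vectors have the same length.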
Complexity of sets: Selection 2
• Given a program p that enumerates the elements of a set S, the conditional complexity of a finite subset S′ of size k is: K(S′ | S) ≤ K(v) + c ≤ log C(|v|, k) + c
• where v is the characteristic vector of S′.

Basic Intuition
• For a good model M of a string s, the complexity K(s) of s will be close to the length of a random index of M.
• In this case s is a typical element of M and the randomness deficiency is low.
• Model optimization = minimization of randomness deficiency.

Complexity of the Data given a Model
• Suppose |M| = m and |D| = d.

Regular Languages: Deterministic Finite Automata (DFA)
• {w ∈ {a,b}* | #a(w) and #b(w) are both even}
• Accepted examples: aa, abab, abaaabbb
[Figure: a four-state DFA (states 0, 1, 2, 3) over {a, b} accepting this language.]
• DFA = NFA (non-deterministic) = REG

Learning DFA using only positive examples with MDL
• S+ = {c, cab, cabab, cababab, cababababab}
[Figure: two candidate automata L1 and L2 for S+, drawn with states 0, 1, 2, the letters a, b, c plus an empty letter, and an "outside world" state; the code for an automaton counts the number of states, the number of letters, and the number of arrows.]
• Coding in bits (see the sketch at the end of this section):
  |L1| = 5 · log2(3+1) · 2 · log2(1+1) = 20
  |L2| = 5 · log2(3+1) · 2 · log2(1+3) = 40

Randomness Deficiency
• The randomness deficiency of x in a finite set S with x ∈ S is δ(x|S) = log |S| − K(x|S).

Minimal Randomness Deficiency
• β_x(α) = min_S {δ(x|S) : x ∈ S, K(S) ≤ α}

Kolmogorov Structure Function
• h_x(α) = min_S {log |S| : x ∈ S, K(S) ≤ α}

Two-Part Code: Constrained MDL estimator
• λ_x(α) = min_S {Λ(S) : x ∈ S, K(S) ≤ α}, where the MDL code Λ(S) = K(S) + log |S|.

Two-Part Code Optimization
• Let x be a dataset and S a set with x ∈ S; then K(x) ≤ K(S) + log |S| + O(1).
• Two-part code optimization always results in a model of best fit, irrespective of whether the source producing the data is in the model class considered (Vereshchagin & Vitányi 2004).

A useful approximation
• log k! ≈ ∫_1^k log x dx (in the limit of large k)
• Hence log C(n, k) = log n! − log k! − log (n−k)! ≈ ∫_1^n log x dx − ∫_1^k log x dx − ∫_1^(n−k) log x dx

Contents Week 10
• Shallow languages
• Complexity of sets
• Kolmogorov's structure function
• Randomness deficiency
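To tie the two-part code Λ(S) = K(S) + log |S| and the bound K(x) ≤ K(S) + log |S| + O(1) to an actual computation, here is a small Python sketch. It is an illustration only, not the lecture's DFA encoding: the "model cost" terms below are crude assumed stand-ins for K(S), chosen just to expose the trade-off between model cost and data-to-model cost.

    # Two-part code lengths Lambda(S) = K(S) + log2|S| for two toy model classes
    # of a binary dataset x. The model-cost terms are rough assumptions standing
    # in for K(S); only the trade-off between the two parts matters here.
    import math

    def two_part_codes(x: str):
        n, k = len(x), x.count("1")
        # Model A: S = {0,1}^n. Describing S needs roughly log2(n) bits (just n).
        total_a = math.log2(n) + n                    # K(S) + log2|S|, with |S| = 2^n
        # Model B: S = all length-n strings with exactly k ones.
        # Describing S needs roughly log2(n) + log2(n+1) bits (n and k).
        total_b = (math.log2(n) + math.log2(n + 1)) + math.log2(math.comb(n, k))
        return total_a, total_b

    sparse = ("0" * 9 + "1") * 3      # 30 bits, 3 ones
    balanced = "01" * 15              # 30 bits, 15 ones
    for x in (sparse, balanced):
        a, b = two_part_codes(x)
        print(f"ones={x.count('1'):2d}: uniform model {a:.1f} bits, density model {b:.1f} bits")

For the sparse string the density model gives a much shorter total code (about 22 vs 35 bits), while for the balanced string the richer model does not pay for its extra model cost. This is the same trade-off that the MDL comparison of the candidate automata L1 and L2 above is meant to expose.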
© Copyright 2024