
Handwriting, signatures, and convolutions
Ben Graham
University of Warwick
Department of Statistics and Centre for Complexity Science
November 2014
Machine Learning
- Machine learning: building systems that can learn from data
- Classification problems, e.g. handwriting recognition
- In general it is very difficult
- Over the last n years there have been huge improvements
  - Learning algorithms
    - SVMs
    - Back propagation for non-convex optimization of ANNs
    - Convolutional NNs
  - Computing power / GPUs
  - Challenging datasets
    - MNIST
    - Google Street View House Numbers
    - CIFAR-10, CIFAR-100 and ImageNet
    - CASIA-OLHWDB1.0
Machine learning (Paul Handel, 1931)
MNIST (Offline)
- 60,000 training images, 10,000 test images. Greyscale 28x28x1
- 784 features: ANN with dropout 99% (Hinton)
- 28x28 grid: CNN with dropconnect 99.8% (LeCun)
- 28x28 grid: need ∼3000 training samples for 99% accuracy
Online handwriting recognition
- (For simplicity) Consider an isolated character
- Each character is made up of a number of strokes
- Each stroke is a list of (x,y) coordinates
- Hard if
  - Small #training samples per character class
  - Lots of character classes
  - Lots of variation between writers.
- Simple strategies
  - Trace the strokes to produce an offline bitmap
    - Take advantage of offline classifiers
    - Morally wrong
  - Produce a low-resolution array of pen-direction histograms
    - Need to iron out the characters to approach invariance
n = 10: Pendigits
n = 183: Assamese (UCI dataset)
n ≈ 7000: Chinese (CASIA-OLHWDB datasets)
Picard Iteration

Y(0) = 0
dY(t) = f(Y(t)) dX(t)

Y_0(t_0) = 0
Y_1(t_1) = ∫_0^{t_1} f(0) dX(t_0)
Y_2(t_2) = ∫_0^{t_2} f( ∫_0^{t_1} f(0) dX(t_0) ) dX(t_1)
Y_3(t_3) = ∫_0^{t_3} f( ∫_0^{t_2} f( ∫_0^{t_1} f(0) dX(t_0) ) dX(t_1) ) dX(t_2)
...
Tensors
Tensor product:
(x_i)_{i=1}^a ⊗ (y_j)_{j=1}^b = (x_i y_j)_{i=1, j=1}^{a, b}
R^a ⊗ R^b ≡ R^{ab}

Example: probability distributions of independent random variables
X ∈ {1, . . . , a},  L(X) ∈ R^a
Y ∈ {1, . . . , b},  L(Y) ∈ R^b
L(X, Y) = L(X) ⊗ L(Y) ∈ R^{ab}
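As a concrete illustration (not from the slides), the outer product of two marginal laws realizes L(X) ⊗ L(Y) as a vector in R^{ab}; the distributions below are made-up numbers.

```python
# Illustrative only: the joint law of independent X, Y as an outer product.
import numpy as np

law_X = np.array([0.2, 0.8])             # L(X) in R^2 (made-up marginal)
law_Y = np.array([0.5, 0.3, 0.2])        # L(Y) in R^3 (made-up marginal)
joint = np.multiply.outer(law_X, law_Y)  # L(X, Y) = L(X) ⊗ L(Y), shape (2, 3)
print(joint.ravel())                     # the same object viewed as a vector in R^6
```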
Iterated integrals

Driving path X : [0, 1] → R^d

n-th iterated integral:
X^n_{0,1} = ∫_{0 < u_1 < ··· < u_n < 1} 1 dX(u_1) ⊗ . . . ⊗ dX(u_n) ∈ R^{d^n}

Path signature

S(X)_{s,t} = (1, X^1_{s,t}, X^2_{s,t}, . . . )

Picard iteration: Y(1) = ∑_{n=0}^∞ f^(n) X^n_{0,1}, where each f^(n) is a linear operator depending on f.

Computationally: Path → Signature is easy; Signature → Path is hard.
Sound and rough paths
[Figure: a sound waveform (amplitude vs. sample index) and its spectrogram (Frequency vs. Time)]

- Lyons and Sidorova: sound compression via the signature.
  - Limiting factor: signature → path.
- Papavasiliou: sound recognition from the signature time lag.
Calculating signatures
- If X_{s,t} is a straight line from 0 to x then
  S(X)_{s,t} = (1, x, (x ⊗ x)/2!, (x ⊗ x ⊗ x)/3!, . . . )
- Chen's identity: if s, t, u ∈ R then
  X^n_{s,u} = ∑_{k=0}^{n} X^k_{s,t} ⊗ X^{n−k}_{t,u},   n = 0, 1, 2, . . .
- Higher order iterated integrals are Hölder smoother
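The two facts above give a simple recipe for computing truncated signatures of piecewise-linear paths: take the tensor exponential of each segment and combine segments with Chen's identity. Below is a minimal numpy sketch of that recipe (the function names are my own, not from the talk).

```python
import numpy as np

def segment_signature(x, m):
    # Signature of a straight segment with increment x in R^d, truncated at level m:
    # level n is x^{(tensor)n} / n!, stored as an array of shape (d,)*n; level 0 is 1.
    levels = [np.array(1.0)]
    for n in range(1, m + 1):
        levels.append(np.multiply.outer(levels[-1], x) / n)
    return levels

def chen(a, b, m):
    # Chen's identity: level n of the concatenated path is sum_k a_k (tensor) b_{n-k}.
    return [sum(np.multiply.outer(a[k], b[n - k]) for k in range(n + 1))
            for n in range(m + 1)]

def path_signature(points, m):
    # points: array of shape (N, d) listing the vertices of a piecewise-linear path.
    sig = segment_signature(points[1] - points[0], m)
    for p, q in zip(points[1:-1], points[2:]):
        sig = chen(sig, segment_signature(q - p, m), m)
    return sig

# Example: an L-shaped path in R^2; level 1 is the total increment,
# and the antisymmetric part of level 2 gives the Lévy (signed) area.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
sig = path_signature(pts, m=3)
print(sig[1])   # [1. 1.]
print(sig[2])   # [[0.5, 1.0], [0.0, 0.5]]; (sig[2] - sig[2].T)/2 encodes the Lévy area
```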
Log signatures
- Tensor log: log(1 + x) = − ∑_{n≥1} (−x)^n / n
- Free Lie algebra: bracket [·, ·] : g × g → g with
  - [ax + by, z] = a[x, z] + b[y, z],  [z, ax + by] = a[z, x] + b[z, y]
  - [x, x] = 0
  - [x, [y, z]] + [z, [x, y]] + [y, [z, x]] = 0
- Dimensionality reduction: Hall basis
  1, 2, [1, 2], [1, [1, 2]], [2, [1, 2]], [1, [1, [1, 2]]], [2, [1, [1, 2]]], [2, [2, [1, 2]]]
- Baker-Campbell-Hausdorff formula for efficient computation?

log(e^X e^Y) = ∑_{n>0} ((−1)^{n+1} / n) ∑_{r_i + s_i > 0, 1 ≤ i ≤ n} (∑_i (r_i + s_i))^{−1} / (r_1! s_1! · · · r_n! s_n!) [X^{r_1} Y^{s_1} . . . X^{r_n} Y^{s_n}]
            = X + Y + (1/2)[X, Y] + (1/12)([X, [X, Y]] + [Y, [Y, X]]) − (1/24)[Y, [X, [X, Y]]] − (1/720)(· · ·) − · · ·
Uniqueness
S(X)_{s,t} characterizes X_{s,t}, more or less.

Theorem (Ben Hambly, Terry Lyons 2010). Let α, β : [s, t] → R^d be two paths of bounded variation. Then S(α)_{s,t} = S(β)_{s,t} if and only if α ∗ β^{−1} is tree-like. Given the signature, there is a unique path of bounded variation with minimal length.
Inverting Signatures
Inverting signatures is hard:

Theorem (Terry Lyons, Weijun Xu, 2014). Using symmetrization, you can recover any C^1 path from its signature.
- Only uses the 2n-th order iterated integrals

Intuition
- Consider an increasing 2d path (x(t), y(t))_{t∈[0,1]}, with ẋ(t), ẏ(t) > 0
- Consider a Poisson process producing letters x at rate ẋ(t) and y at rate ẏ(t) → W = xxyxyx . . . xy
- P(W = w | |w| = n) ∝ X_{s,t}(w)
Rotational invariants of the signature

Theorem (Joscha Diehl 2013). Rotational invariants for paths in R^2:

(1/2) C(xx) + (1/2) C(yy)   (displacement squared)
(1/2) C(xy) − (1/2) C(yx)   (enclosed area)
Plus 3 of order 4.
Plus 7 of order 6.
Can be used for rotation invariant character recognition!
Characters as paths
Pretend the pen never left the writing surface.
Normalize to get X : [0, 1] → [0, 1]^2.
The signature truncated at level m, S(X)^m_{0,1}, has dimension 2 + 2^2 + · · · + 2^m = 2^{m+1} − 2.
1D: Consider a sliding window of truncated signatures
{S(X)^m_{(i−k)/n, i/n} : i = k, k + 1, . . . , n}
2D: Calculate the sliding windows for each stroke.
Draw them in a square grid.
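To make the sliding-window idea concrete, here is a small sketch (my own illustration, reusing the hypothetical path_signature helper from the earlier signature example) that turns one normalized stroke into a matrix of window features of length 2^{m+1} − 2.

```python
import numpy as np

def sliding_window_features(stroke, k, m):
    # stroke: array of shape (n + 1, 2) of points in [0, 1]^2.
    # For each i = k, ..., n, flatten the truncated signature of the sub-path
    # through points i-k, ..., i into a vector of length 2 + 2^2 + ... + 2^m.
    n = len(stroke) - 1
    feats = []
    for i in range(k, n + 1):
        sig = path_signature(stroke[i - k:i + 1], m)      # levels 0..m
        feats.append(np.concatenate([lvl.ravel() for lvl in sig[1:]]))
    return np.array(feats)                                # shape (n - k + 1, 2^{m+1} - 2)
```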
Signature of characters
- Random forest (a generic code sketch follows the error table below)
  - A classifier composed of many random decision trees
  - Trees constructed iid and form a democracy
  - Each tree sees a random subset of the data.
  - Tree branches iteratively use individual features
  - From a random subset of features, the most informative feature is used to split the dataset roughly 50:50.
1000 trees. Error %:

m            1     2     3     4     5     6     7     8     9
#Features    2     6    14    30    62   126   254   510  1022
Pendigits  47.5  18.7   7.5   6.0   4.4   3.7   3.4   3.1   3.2
Assamese   93.4  85.2  70.2  64.1  56.9  53.1  49.5  47.6  46.7
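The experiments above are described only at the level of "random forest, 1000 trees"; the snippet below is a generic scikit-learn sketch of that kind of setup, with made-up placeholder data standing in for the signature features.

```python
# Not the talk's code: a generic 1000-tree random forest on placeholder data
# shaped like level-3 signature window features (14 columns) for 10 classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2000, 14)), rng.integers(0, 10, size=2000)
X_test, y_test = rng.normal(size=(500, 14)), rng.integers(0, 10, size=500)

clf = RandomForestClassifier(n_estimators=1000, max_features="sqrt", n_jobs=-1)
clf.fit(X_train, y_train)
print("error %:", 100 * (1 - clf.score(X_test, y_test)))
```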
An ink dimension
- Signatures are invariant w.r.t. tree-like excursions.
- Useful to know if the pen is on the paper?
- Solution: add a third dimension = ink used.

1000 trees. Error %:

m            1     2     3     4     5     6     7
#Features    3    12    39   120   363  1092  3279
Pendigits  39.7  11.1   6.1   3.7   2.5   1.7   1.5
Assamese   89.8  69.9  56.8  48.1  42.3  37.2  34.2

ANN+translations:

m            1     2     3     4     5
#Features    3    12    39   120   363
Assamese   87.1  64.7  39.9  28.8  21.9
Artificial neural networks

Directed weighted graph. For each node:
output = σ(b + ∑_i w(i) input(i))
For classification, the final layer is weighted to give a probability distribution.

[Figure: the activation function σ, rising from 0 to 1 over roughly [−4, 4]]

input ∈ R^a
hidden1 = σ(input · W1 + B1) ∈ R^b
hidden2 = σ(hidden1 · W2 + B2) ∈ R^c
hidden3 = σ(hidden2 · W3 + B3) ∈ R^d
output = softmax(hidden3 · W4 + B4) ∈ R^e

#Parameters: (a + 1) × b + (b + 1) × c + (c + 1) × d + (d + 1) × e
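As a sketch of the layer formulas above (my own numpy illustration; the layer sizes and the logistic choice of σ are assumptions), the forward pass can be written in a few lines:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))            # logistic activation

def softmax(z):
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

def forward(x, params):
    (W1, B1), (W2, B2), (W3, B3), (W4, B4) = params
    h1 = sigma(x @ W1 + B1)                    # R^a -> R^b
    h2 = sigma(h1 @ W2 + B2)                   # R^b -> R^c
    h3 = sigma(h2 @ W3 + B3)                   # R^c -> R^d
    return softmax(h3 @ W4 + B4)               # R^d -> probability distribution in R^e

a, b, c, d, e = 784, 500, 500, 500, 10         # e.g. MNIST-sized layers (assumed)
rng = np.random.default_rng(0)
params = [(rng.normal(0, 0.01, size=(m, n)), np.zeros(n))
          for m, n in [(a, b), (b, c), (c, d), (d, e)]]
print(forward(rng.normal(size=a), params).sum())   # sums to 1.0
```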
Boolean function

If we normalize all the features, it looks a bit like we are dealing with Boolean functions, i.e.
f : {0, 1}^n → {0, 1}

Building-block Boolean functions:

x   NOT x        x  y   x AND y   x OR y
0     1          0  0      0         0
1     0          0  1      0         1
                 1  0      0         1
                 1  1      1         1

a AND b = σ(20a + 20b − 30)
a OR b  = σ(20a + 20b − 10)
NOT a   = σ(−20a + 10)
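A quick check of the three sigmoid gates above (illustrative only):

```python
import itertools, math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

for a, b in itertools.product([0, 1], repeat=2):
    print(a, b,
          round(sigma(20 * a + 20 * b - 30)),   # a AND b
          round(sigma(20 * a + 20 * b - 10)),   # a OR b
          round(sigma(-20 * a + 10)))           # NOT a
```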
Shallow networks bad, deep networks good

- MNIST: 1-layer NN, 12.0% (LeCun)
- The XOR function cannot be represented by a 1-layer NN
- Almost all Boolean functions have exponential circuit complexity (Shannon)
- Some functions can be represented far more efficiently by DEEP Boolean formulae.
  - Parity(x_1, ..., x_n) := x_1 + · · · + x_n mod 2
  - Depth O(log n) → size O(n)
  - Fixed depth → size exponential in n (Håstad)
- In the brain there are ∼20 layers of neurons between seeing and recognizing.
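To illustrate the parity claim (my own sketch, not from the talk): a balanced XOR tree computes parity with O(n) two-input gates at depth O(log n).

```python
def parity(bits):
    # Reduce pairwise with XOR; each pass halves the width, so depth is O(log n).
    layer = list(bits)
    while len(layer) > 1:
        layer = [layer[i] ^ layer[i + 1] if i + 1 < len(layer) else layer[i]
                 for i in range(0, len(layer), 2)]
    return layer[0]

print(parity([1, 0, 1, 1, 0]))   # 1, since three bits are set
```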
What do the layers do?

ANN training roughly does the following:
- Find features that will be useful, e.g. the curve at the top of a 2 or 3.
- Calculate how the features correspond to the different classes.
  - H1: features weakly correlated with being a 2 or a 3.
  - H2: look at how many hidden-layer-1 features seem to indicate 2-ness.
  - H3: look at how many hidden-layer-2 features seem to indicate 2-ness.
  - output: weigh the evidence.
Learning is actually top down
- Start with random weights
- Forward propagate input → output
- Errors at the top are back-propagated down (chain rule)
What do the layers do?
Ranzato, Boureau, LeCun: Sparse features
Convolutional Neural Networks
- LeCun, Bottou, Bengio, Haffner 1998
- Spatial pooling using "max-pool"
- Shared weights within each layer
- Easy to train
- Spatial invariances encoded
Representing pen strokes for CNNs
Motivation:
- The 8 × 8 × 8 grid for Chinese character recognition
- Convolutional networks
- Rough path theory

Algorithm:
- Normalized strokes X_i : [0, l(i)] → [0, 1]^2.
- Initialize a (2^{m+1} − 1) × k × k array to all zeros.
- Calculate S(X_i)^m_{t−ε, t+ε}, truncated at level m.
- Put this information into the array.
- Put the array into a CNN.

Sparsity: most of the columns of the array are zero.
Cost of evaluating the CNN greatly reduced.
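A rough sketch of the array construction above (my own illustration; it reuses the hypothetical path_signature helper from the earlier example and makes assumptions about the window length and where features are placed):

```python
import numpy as np

def stroke_feature_grid(strokes, k, m, window=4):
    # strokes: list of (n_i, 2) arrays of points in [0, 1]^2.
    depth = 2 ** (m + 1) - 1                       # 1 + 2 + 4 + ... + 2^m
    grid = np.zeros((depth, k, k))
    for stroke in strokes:
        for i in range(window, len(stroke)):
            sig = path_signature(stroke[i - window:i + 1], m)
            feat = np.concatenate([lvl.ravel() for lvl in sig])   # length = depth
            x, y = stroke[i]                       # drop the feature at the pen position
            grid[:, min(int(y * k), k - 1), min(int(x * k), k - 1)] = feat
    return grid                                    # mostly-zero columns: a sparse input
```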
CASIA-OLHWDB1.1
- 3755 Chinese characters
- 240 training samples
- 60 test samples

Results
- 8×8×8 method and MQDF classifier: 7.61%
- CNN: 5.61% (Ciresan et al.)
- Sparse signature CNN: 3.59% (G.)
Convolutional architectures
- Input 28x28x1
- 20 5x5 convolutional filters: 24x24x20
- 2x2 pooling: 12x12x20
- 50 5x5 convolutional filters: 8x8x50
- 2x2 pooling: 4x4x50 ≡ 800
- Fully connected layer: 500
- Output: #classes
input-20C5-MP2-50C5-MP2-500N-output
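Not the talk's code (the talk's examples use Theano): a PyTorch sketch of the same input-20C5-MP2-50C5-MP2-500N-output architecture, with the activation choice assumed.

```python
import torch.nn as nn

def lenet_like(num_classes):
    return nn.Sequential(
        nn.Conv2d(1, 20, kernel_size=5),   # 28x28x1 -> 24x24x20
        nn.ReLU(),
        nn.MaxPool2d(2),                   # -> 12x12x20
        nn.Conv2d(20, 50, kernel_size=5),  # -> 8x8x50
        nn.ReLU(),
        nn.MaxPool2d(2),                   # -> 4x4x50
        nn.Flatten(),                      # -> 800
        nn.Linear(800, 500),               # fully connected layer: 500
        nn.ReLU(),
        nn.Linear(500, num_classes),       # output: #classes
    )
```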
Ciresan, Meier and Schmidhuber
- Input 48x48x1
- 100 3x3 convolutional filters: 46x46x100
- 2x2 pooling: 23x23x100
- 200 2x2 convolutional filters: 22x22x200
- 2x2 pooling: 11x11x200
- 300 2x2 convolutional filters: 10x10x300
- 2x2 pooling: 5x5x300
- 400 2x2 convolutional filters: 4x4x400
- 2x2 pooling: 2x2x400 ≡ 1600
- Fully connected layer: 500
- Output: #classes
input-100C3-MP2-200C2-MP2-300C2-MP2-400C2-MP2-500N-output
DeepCNets(l,k)
- input-
- (k)C3-MP2-
- (2k)C2-MP2-
- ...
- (lk)C2-MP2-
- (l+1)k N-
- output
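Spelled out as a string (an illustrative helper of my own, following the layer pattern above):

```python
def deepcnet_spec(l, k):
    # DeepCNet(l, k): one kC3 layer, then 2k, 3k, ..., lk C2 layers, each followed
    # by 2x2 max-pooling, then an (l+1)k fully connected layer.
    layers = ["input", f"{k}C3-MP2"]
    layers += [f"{(i + 1) * k}C2-MP2" for i in range(1, l)]
    layers += [f"{(l + 1) * k}N", "output"]
    return "-".join(layers)

print(deepcnet_spec(4, 20))
# input-20C3-MP2-40C2-MP2-60C2-MP2-80C2-MP2-100N-output
```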
Sparse DeepCNets
- The convolutional and pooling operations can be memoized
- Computation bottleneck becomes the top of the network
- Normally the other way round!
- GPU
  - 3000 MNIST digits/second
  - 200 Chinese characters/second
CIFAR-10
- 50,000 training images, 10,000 test images. Color 32x32x3
- Kaggle competition
  - Top entry is 92.61% accuracy
  - Can you do better? (4 months to go)
Frogs and Horses
CIFAR-100
aquatic mammals: beaver, dolphin, otter, seal, whale
fish: aquarium fish, flatfish, ray, shark, trout
flowers: orchids, poppies, roses, sunflowers, tulips
food containers: bottles, bowls, cans, cups, plates
fruit and vegetables: apples, mushrooms, oranges, pears, sweet peppers
household electrical devices: clock, computer keyboard, lamp, telephone, television
household furniture: bed, chair, couch, table, wardrobe
insects: bee, beetle, butterfly, caterpillar, cockroach
large carnivores: bear, leopard, lion, tiger, wolf
large man-made outdoor things: bridge, castle, house, road, skyscraper
large natural outdoor scenes: cloud, forest, mountain, plain, sea
large omnivores and herbivores: camel, cattle, chimpanzee, elephant, kangaroo
medium-sized mammals: fox, porcupine, possum, raccoon, skunk
non-insect invertebrates: crab, lobster, snail, spider, worm
people: baby, boy, girl, man, woman
reptiles: crocodile, dinosaur, lizard, snake, turtle
small mammals: hamster, mouse, rabbit, shrew, squirrel
trees: maple, oak, palm, pine, willow
vehicles 1: bicycle, bus, motorcycle, pickup truck, train
vehicles 2: lawn-mower, rocket, streetcar, tank, tractor

- 100 categories (listed above)
- 50,000 training images (i.e. 500/class), 10,000 test images, 32x32x3
- 25M parameters, 70.2% accuracy
ImageNet 2010
Example Neural Networks
- Theano CNN - DeepCNet(4,20)
- ann.py
- OverFeat

CUDA - NVIDIA C extension
CUDA - Matrix Multiplication
Tricks
- Minibatches
  - When calculating gradients, don't use the whole dataset.
  - Use a subset of size ∼100
  - Much quicker
  - The noise in the gradients stops you getting stuck
- Rectified Linear units (ReLU)
  - Positive part function
  - P(d ReLU(x)/dx = 1) ≈ 1/2
Tricks
- Dropout
  - Deleting many of the hidden nodes during training forces the network to be more robust
  - Delete half of the hidden units during training (and maybe some of the input).
  - Halve W1, W2, . . . during testing to balance things out
  - Back propagation is adjusted accordingly.
  - Robust natural process that deletes 50% of the available data??
- Nesterov's Accelerated Gradient
  - A momentum method, similar to
    v_{t+1} = μ v_t − ε ∇f(θ_t)
    θ_{t+1} = θ_t + v_{t+1}
    but looking ahead:
    v_{t+1} = μ v_t − ε ∇f(θ_t + μ v_t)
    θ_{t+1} = θ_t + v_{t+1}
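A minimal numpy sketch of the look-ahead update above (my own illustration; the hyperparameters μ, ε and the toy objective are assumptions):

```python
import numpy as np

def nesterov_step(theta, v, grad_f, mu=0.9, eps=0.01):
    # Evaluate the gradient at the look-ahead point theta + mu*v, then step.
    v_next = mu * v - eps * grad_f(theta + mu * v)
    return theta + v_next, v_next

# Toy example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta, v = np.ones(3), np.zeros(3)
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_f=lambda th: th)
print(theta)   # close to the minimizer at 0
```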