The A2iA Multi-lingual Text Recognition System at the Second Maurdor Evaluation
Bastien Moysset, Théodore Bluche, Maxime Knibbe,
Mohamed Faouzi Benzeghiba, Ronaldo Messina, Jérôme Louradour,
Christopher Kermorvant
www.a2ialab.com
The Maurdor database
8,774 fully annotated pages: layout, images, graphics, languages, textual
content, logical order.
The Maurdor challenge
Text recognition given the position, type (printed/handwritten) and
language
Text zone statistics per language and type (number of text zones):
HW-AR, HW-EN: 13 146, 11 745
HW-FR: 24 515
PRN-FR: 79 248
PRN-AR: 28 595
PRN-EN: 35 028
The A2iA system for this challenge
Same system for printed/handwritten, French/English/Arabic:
Text detection: test images -> image pre-processing -> text line segmentation -> text line hypotheses
Optical model: train images -> RNN training on text line images; RNN character recognition -> character lattice hypotheses
Language model: Train2+Dev2 text -> language models (IRSTLM), lexicon -> decoding graph; decoding with language model (KALDI)
Decision: paragraph transcription -> best transcription selection -> postprocessing
The A2iA system for this challenge
Performance improvement
Word Error Rate (%) per system version:
          12/2012   04/2013   12/2013
PRN AR       90        58        23
HW FR       110        72        22
HW EN       137        59        35
HW AR       125        58        30
PRN FR and PRN EN are also plotted.
Error rate divided by 3 to 5
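The "divided by 3 to 5" claim can be checked against the values shown for the first (12/2012) and last (12/2013) systems. The snippet below is only an arithmetic check on the four categories whose values appear on the slide; the function and variable names are illustrative.

```python
# Word error rates (%) read from the slide: (12/2012 system, 12/2013 system).
wer = {
    "PRN AR": (90, 23),
    "HW FR": (110, 22),
    "HW EN": (137, 35),
    "HW AR": (125, 30),
}


def improvement_factor(first: float, last: float) -> float:
    """Return how many times smaller the last error rate is than the first."""
    return first / last


# One factor per category, rounded to one decimal for display.
factors = {name: round(improvement_factor(a, b), 1) for name, (a, b) in wer.items()}

for name, factor in factors.items():
    print(f"{name}: error rate divided by {factor}")
```

The four factors fall between roughly 3.9 and 5, which is consistent with the summary line.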
Keys of success
Text detection: consider several line detection hypotheses
Optical model: train using a curriculum
Language model: model word and character sequences
Text line detection
Finding the text lines is difficult, even when the text block is given
Generate hypotheses:
use 2 different line detectors
try with and without deskew
try line compression and stretching
normalize or not the line height
re-order the text lines
try not to detect the lines
Recognize all the hypotheses, keep the best
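The "generate, recognize all, keep the best" strategy can be sketched as a search over processing configurations. This is a minimal sketch under assumptions: the option names, the scaling factors, and the `recognize` callback (returning a transcription and a confidence score) are hypothetical, and line re-ordering is left out for brevity. The slides do not say how the best hypothesis is scored.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical option names mirroring the slide; the real settings are not given.
OPTIONS: Dict[str, List] = {
    "detector": ["detector_a", "detector_b", None],  # None = do not detect lines
    "deskew": [True, False],
    "scale": [0.8, 1.0, 1.2],  # line compression / stretching (assumed factors)
    "normalize_height": [True, False],
}


def generate_hypotheses() -> List[Dict]:
    """Enumerate every combination of the processing options."""
    keys = list(OPTIONS)
    return [dict(zip(keys, values)) for values in product(*(OPTIONS[k] for k in keys))]


def best_transcription(
    zone: object,
    recognize: Callable[[object, Dict], Tuple[str, float]],
) -> Tuple[str, float, Dict]:
    """Recognize the zone under every configuration and keep the best-scored result."""
    best = None
    for config in generate_hypotheses():
        text, score = recognize(zone, config)
        if best is None or score > best[1]:
            best = (text, score, config)
    return best
```

With these assumed options the search covers 3 x 2 x 3 x 2 = 36 configurations per text zone, so recognition cost grows with every alternative added.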
Text line detection
Results
Figure: Word Error Rate (%) without and with alternative line hypotheses, for PRN-FR, PRN-EN, PRN-AR, HW-FR, HW-EN and HW-AR.
HW-EN: 43 without alternatives, 35 with alternatives
HW-AR: 33 without alternatives, 30 with alternatives
up to 50% word error rate reduction
Optical model: LSTM neural network
The RNN predicts character hypotheses:
Latin letters with accents, upper and lower case
120 shapes of Arabic forms
punctuation, inter-word space and a "not a character" label
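To illustrate the role of the "not a character" and inter-word space labels, here is a minimal sketch of turning per-frame label predictions into a string. It assumes CTC-style best-path decoding, which the slides do not name explicitly; the label names `<blank>` and `<space>` are hypothetical. The actual system decodes a character lattice with a language model rather than taking the best path.

```python
from typing import List, Sequence

BLANK = "<blank>"  # hypothetical name for the "not a character" label
SPACE = "<space>"  # hypothetical name for the inter-word space label


def collapse(frame_labels: Sequence[str]) -> str:
    """Merge repeated labels, drop blanks, and map the space label to ' '."""
    out: List[str] = []
    previous = None
    for label in frame_labels:
        # A label is emitted only when it differs from the previous frame
        # and is not the "not a character" label.
        if label != previous and label != BLANK:
            out.append(" " if label == SPACE else label)
        previous = label
    return "".join(out)
```

A blank between two identical labels is what allows doubled letters to survive the merge, as in the two "l" of "hello".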
Optical model: LSTM neural network
Training
Curriculum training: start training with easy samples and
gradually increase the difficulty
simple databases at word level (Rimes, IAM, OpenHaRT)
simple databases at line level
difficult database at line level (Maurdor)
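The three-stage schedule above can be sketched as a loop over datasets ordered from easy to hard. This is an illustration under assumptions: the `train_epoch` callback, the sample representation, and the number of epochs per stage are hypothetical, since the slides give only the ordering of the databases.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # (image path, transcription); hypothetical representation
Stage = Tuple[str, Sequence[Sample], int]  # (stage name, samples, number of epochs)


def curriculum_train(
    stages: Sequence[Stage],
    train_epoch: Callable[[Sequence[Sample]], None],
) -> List[str]:
    """Train stage by stage, in the given (easy to hard) order.

    The same model is updated throughout: each stage starts from the
    weights reached at the end of the previous one. Returns the name of
    the stage used for each epoch, in order.
    """
    log: List[str] = []
    for name, samples, epochs in stages:
        for _ in range(epochs):
            train_epoch(samples)
            log.append(name)
    return log
```

A call would list the stages in the order of the slide, for example word-level data first, then simple line-level data, then Maurdor lines.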
Optical model
Improvements
RNN training set                            # of training lines   Word error rate
Single lines                                        7310               54.7%
First step of automatic location                   10570               43.8%
Second step of automatic location                  10925               35.2%
Total number of lines (without location)           11608                 -
40% word error rate reduction for handwritten English
Language models
Prepare the data:
explicitly model the inter-word space
separate punctuation signs and digit sequences (reduce the vocabulary)
clean the textual data from unknown characters
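The data preparation steps can be sketched as a small tokenizer. This is a sketch under assumptions: the `<space>` token name is hypothetical, "unknown characters" is taken to mean characters outside a given alphabet, and digit sequences are assumed to be split into single digits, which is one reading of the slide.

```python
import re
from typing import List, Set

SPACE = "<space>"  # hypothetical token for the explicit inter-word space


def prepare(text: str, alphabet: Set[str]) -> List[str]:
    """Turn raw text into language model tokens."""
    # Drop characters outside the known alphabet (whitespace is kept).
    cleaned = "".join(c for c in text if c in alphabet or c.isspace())
    tokens: List[str] = []
    for i, word in enumerate(cleaned.split()):
        if i > 0:
            tokens.append(SPACE)  # the space becomes a token of its own
        # Split into letter runs, single digits, and single punctuation signs.
        tokens.extend(re.findall(r"[^\W\d_]+|\d|[^\w\s]|_", word))
    return tokens
```

Because the space is an explicit token, "Hello," becomes "Hello" followed directly by ",", while two words are always separated by `<space>`; this keeps punctuated and numeric forms out of the word vocabulary.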
Train the language model:
out-of-vocabulary words modelled with character n-grams
in-vocabulary word sequences modelled with word n-grams
for Arabic, use part-of-Arabic-word decomposition
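One way to combine word n-grams and character n-grams is to rewrite the training text so that in-vocabulary words stay whole and other words are spelled out character by character. The sketch below shows that rewriting only; the `<c>` marker and the function name are hypothetical, and the slides do not state that the system builds its hybrid model this way.

```python
from typing import Iterable, List, Set


def to_lm_tokens(words: Iterable[str], vocabulary: Set[str]) -> List[str]:
    """Keep in-vocabulary words whole and spell out the others as characters."""
    tokens: List[str] = []
    for word in words:
        if word in vocabulary:
            tokens.append(word)
        else:
            # Hypothetical "<c>" marker keeps character tokens distinct
            # from one-letter words such as "a".
            tokens.extend(f"<c>{ch}" for ch in word)
    return tokens
```

An n-gram model trained on such token sequences can then score both known words and unseen spellings.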
Language models
Language models evaluation :
                                       %Hit-Ratio
Language   Type   PPL   %OOV   3-gram   2-gram   1-gram
English    PRN    111    3.9    37.1     41.8     21.1
English    HWR     66    4.5    49.7     35.1     15.2
French     PRN     48    3.6    54.7     31.9     13.4
French     HWR     73    3.9    48.6     36.6     14.7
Arabic     PRN    146   11.2    31.8     33.9     34.3
Arabic     HWR    134    8.4    29.5     44.9     25.6
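For readers unfamiliar with the PPL and %OOV columns, the sketch below computes them with their standard definitions: the out-of-vocabulary rate over running tokens, and perplexity from per-token base-10 log-probabilities. The function names are illustrative, and the slides do not state the exact conventions used (for example whether OOV tokens count in the perplexity).

```python
from typing import Sequence, Set


def oov_rate(tokens: Sequence[str], vocabulary: Set[str]) -> float:
    """Percentage of running tokens that are not in the vocabulary."""
    missing = sum(1 for token in tokens if token not in vocabulary)
    return 100.0 * missing / len(tokens)


def perplexity(log10_probs: Sequence[float]) -> float:
    """Perplexity from the base-10 log-probability of each token."""
    average = sum(log10_probs) / len(log10_probs)
    return 10 ** (-average)
```

The hit ratios are not computed here; in the table, the three values of each row sum to about 100%.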
Maurdor evaluation
Maurdor second evaluation results:
Word Error Rate on Test set (%)
          second best   A2iA
HW FR         37         22
HW EN         48         35
HW AR         44         30
PRN AR: 23 for A2iA. *PRN FR and *PRN EN are also plotted.
Now?
Layout Analysis is the weak link
Need for document understanding
The future of document processing?
Initial Maurdor workflow
Non-linear workflow
Thank you!