Data Recording, Transcription, and Speech Recognition for Egypt
Tanja Schultz
Carnegie Mellon University
Cairo, Egypt, May 21, 2001
Outline
• Requirements for Speech Recognition
  - Data Requirements
    Audio data
    Pronunciation dictionary
    Text corpus data
  - Recording of Audio data
  - Transcription of Audio data
• Initialization of an Egypt Speech Recognition Engine
• Multilingual Speech Recognition
• Rapid Adaptation to new Languages
Part 1
Requirements for Speech Recognition
• Data Requirements
  - Audio data
  - Pronunciation dictionary
  - Text corpus data
• Recording of Audio data
• Transcription of Audio data
Thanks to Celine Morel and Susanne Burger
Speech Recognition
[Figure: Speech Input - Preprocessing, Decoding / Search, Postprocessing - Synthesis (TTS). The spoken input "h e l l o" is decoded into candidate hypotheses such as "Hello", "Hale Bob", "Hallo", ...]
Fundamental Equation of SR
P(W|x) = [ P(x|W) * P(W) ] / P(x)
• Acoustic Model, P(x|W): A-b A-m A-e
• Pronunciation: Am → AE M, Are → AR, I → AI, you → JU, we → VE
• Language Model, P(W): I am, you are, we are, ...
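The fundamental equation can be illustrated with a small sketch. Since P(x) is the same for every hypothesis W, the recognizer only has to maximize P(x|W) * P(W). The function name `decode` and all probability values below are made up for illustration; they are not taken from the slides.

```python
import math

def decode(acoustic_logprob, lm_logprob):
    """Pick the word sequence W that maximizes log P(x|W) + log P(W).

    P(x) is constant over all hypotheses, so it can be ignored in the search.
    """
    best, best_score = None, -math.inf
    for hypothesis, acoustic_score in acoustic_logprob.items():
        score = acoustic_score + lm_logprob[hypothesis]
        if score > best_score:
            best, best_score = hypothesis, score
    return best

# Hypothetical scores for the three candidates shown on the slide.
acoustic = {"hello": math.log(0.30), "hallo": math.log(0.35), "hale bob": math.log(0.35)}
language = {"hello": math.log(0.60), "hallo": math.log(0.10), "hale bob": math.log(0.01)}

best = decode(acoustic, language)
print(best)
```

In this toy setting the acoustic scores alone slightly prefer the wrong candidates; the language model prior is what tips the decision towards "hello".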
SR: Data Requirements
• Acoustic Model (A-b A-m A-e): needs Audio Data and Phoneme Set
• Pronunciation (Am → AE M, Are → AR, I → AI, you → JU, we → VE): needs Pronunciation Dictionary
• Language Model (I am, you are, we are, ...): needs Text Data
Audio Data
For training and testing the SR engine, a large amount of high-quality data in the target language should be collected
• What kind of data are needed
  - Scenario and Task
• How to collect these data, Recording setup
  - Preparation of Information
• Quality of data
  - Sampling rate, resolution
• Amount of data
  - Number of dialogs and speakers
• Transcription of Audio Data
What kind of Audio Data
C-Star Scenario: Travel arrangement
(planning a vacation trip, booking a hotel room, ...)
• Scenario is realistic and attractive to the people
• Dialog between two people:
  - One Agent: Travel assistant
  - One Client: Traveler, pretends to visit a specific site
• Speakers get instructions about what task they have to accomplish but not HOW to do that
• Role playing setup
How to collect Audio Data
• Recording setup
  - The dialog partners can NOT see each other, i.e. no face-to-face (in preparation of telephone, web applications)
  - No non-verbal communication
  - Spontaneous speech (noise effects, disfluencies, ... may occur)
  - No push-to-talk, try to avoid crosstalk
  - Balanced dialogs
• Dialog structure, Task
  - Greetings and formalities between dialog partners
  - Client gives information like number of persons traveling, date of travel (arrival/departure), interests
  - Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits to sights or cultural events
  - Agent provides information according to the client's questions
Prepare Information for Client and Agent
• A: Hotel list (3-4 hotels per dialog)
• A: Transportation list (3-4 flights, train, bus schedules)
• A: List of 3-4 cultural events per dialog
• C: Information about the specific task:
  - who is traveling (e.g. client travels with partner + two kids)
  - when is s/he traveling (e.g. 2 weeks vacation trip in July)
  - where (e.g. trip to Pennsylvania, US)
  - how (e.g. direct flight to Pittsburgh, rental car)
  - what are the places of interest (CMU - Pittsburgh, Liberty Bell in Philadelphia, ...)
• Date and time of recording might be faked
• Dialog takes place at recording place
• Example sheets: Celine Morel
Quality and Quantity of Audio Data
• Quality of data
  - High quality clean speech: close-speaking microphone, like Sennheiser H-420
  - 16 kHz sampling rate, 16 bit resolution
• Amount of data
  - Minimum of 10 hours of spoken speech
  - Average length of dialogs 10 - 20 minutes
  - 10 hours → 30 - 60 dialogs
• Number of speakers
  - As many speakers as possible (speaker independent AM)
  - 30 - 60 dialogs = maximum of 120 different speakers
  - Split up the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set
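One possible way to produce the three disjoint subsets is to split by dialog, so that both speakers of a dialog always land in the same subset. The sketch below assumes that each speaker appears in exactly one dialog; the function names, the 10% fractions for the two test sets, and the dialog and speaker IDs are illustrative assumptions, not part of the recording plan.

```python
import random

def split_dialogs(dialogs, dev_frac=0.1, eval_frac=0.1, seed=0):
    """Split dialog IDs into train / dev / eval lists.

    dialogs maps a dialog ID to its (agent_speaker, client_speaker) pair.
    The fractions are assumptions chosen only for this example.
    """
    ids = sorted(dialogs)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(ids) * dev_frac))
    n_eval = max(1, round(len(ids) * eval_frac))
    return {
        "dev": ids[:n_dev],
        "eval": ids[n_dev:n_dev + n_eval],
        "train": ids[n_dev + n_eval:],
    }

def speakers_of(dialogs, ids):
    """Collect all speakers that take part in the given dialogs."""
    return {speaker for dialog_id in ids for speaker in dialogs[dialog_id]}

# 60 hypothetical dialogs with two distinct speakers each (120 speakers).
dialogs = {f"dlg{i:02d}": (f"agent{i:02d}", f"client{i:02d}") for i in range(60)}
parts = split_dialogs(dialogs)

train_spk = speakers_of(dialogs, parts["train"])
dev_spk = speakers_of(dialogs, parts["dev"])
eval_spk = speakers_of(dialogs, parts["eval"])
print(len(parts["train"]), len(parts["dev"]), len(parts["eval"]))
```

If speakers take part in several dialogs, the split would instead have to be made over speakers first, and dialogs that mix speakers from different subsets would have to be dropped or reassigned.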
Recording Tool: Total Recorder
• http://www.highcriteria.com/download/totrec.exe
  - Registration fee: $11.95
• IBM compatible PC, soundcard (e.g. Soundblaster)
• Close-speaking microphone (e.g. Sennheiser H-420)
• Win95, Win98, Win2000, WinNT
[Figure: Soundboard, Soundboard Driver, Total Recorder]
Transcription of Audio Data
For training the SR engine we need to transcribe the spoken data manually
• Very time consuming (10-20 times real time)
• The more accurate the transcription, the more valuable the data
• Since we do have the pronunciations, only word-based transcriptions are needed
• Transcription convention from Susanne Burger
  - Download from http://www.cs.cmu.edu/~tanja
  - Describes notation
• Transcription tool: transEdit (Burger & Meier)
Transliteration conventions
Example:
tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart
• Parsability - one turn per line: Tanja_0001
• Consistency
• Filter programs
• Tagging of proper names: ~Tanja
• Tagging of numbers
• Special noise markers: +uhm+
• No capitalization at the beginning of turns
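Because every turn sits on one line and markers follow a fixed notation, a filter program can be quite short. The sketch below is inferred only from the single example line above: it treats anything of the form +...+ as a marker and anything prefixed with ~ as a proper name. The regular expressions and the function `parse_turn` are assumptions, not the actual convention document or the transEdit tool.

```python
import re

# Patterns inferred from the example turn; the full convention may define more.
TURN_RE = re.compile(r"^(?P<turn_id>\w+_\d+):\s*(?P<text>.*)$")
MARKER_RE = re.compile(r"\+[^+\s]+\+")   # e.g. +uhm+, +pause+, +/cos/+
NAME_RE = re.compile(r"~(\w+)")          # e.g. ~Tanja

def parse_turn(line):
    """Split one transcription line into turn ID, markers, names and plain words."""
    match = TURN_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid turn line: {line!r}")
    text = match.group("text")
    markers = MARKER_RE.findall(text)
    names = NAME_RE.findall(text)
    # Remove the markers and strip the name tag to obtain the spoken words.
    words = [w.lstrip("~") for w in MARKER_RE.sub(" ", text).split()]
    return {"turn_id": match.group("turn_id"), "markers": markers,
            "names": names, "words": words}

line = ("tanja_0001: this sentence +uhm+ was spoken +pause+ by "
        "~Tanja and +/cos/+ contains one restart")
turn = parse_turn(line)
print(turn["turn_id"], turn["markers"], turn["names"])
```

A line that does not start with a turn ID raises an error, which is one simple way to check the parsability requirement automatically.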
Pronunciation Dictionary
For each word seen in the training set, a pronunciation of this word has to be defined in terms of the phoneme set
• Define an appropriate phoneme set: atomic sounds of the language
• Describe each word to be recognized in terms of this phoneme set
  - Example in English: I → AI, you → JU
• Strong Grapheme-to-Phoneme relation in Egypt/Arabic IF the vocalization is transcribed (romanized transcription)
• Grapheme-to-Phoneme tool for Standard Arabic (collected in Tunisia and Palestine) already developed at CMU (master's student Jamal Abu-Alwan)
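A pronunciation dictionary can be thought of as a simple lookup table from words to phoneme sequences. The sketch below uses the English toy entries shown earlier in the slides (I → AI, you → JU, am → AE M, are → AR, we → VE); the function `pronounce` and the lowercase lookup are assumptions made for illustration. It also reports words without an entry, which is the check needed to make sure every training word is covered.

```python
# Toy lexicon built from the English examples on the slides.
PRON_DICT = {
    "i": ["AI"],
    "you": ["JU"],
    "we": ["VE"],
    "am": ["AE", "M"],
    "are": ["AR"],
}

def pronounce(words, lexicon):
    """Return the phoneme sequence for the words plus the words with no entry."""
    phonemes, missing = [], []
    for word in words:
        entry = lexicon.get(word.lower())
        if entry is None:
            missing.append(word)
        else:
            phonemes.extend(entry)
    return phonemes, missing

phonemes, missing = pronounce("I am".split(), PRON_DICT)
print(phonemes, missing)

# A word outside the lexicon is reported instead of being silently skipped.
_, uncovered = pronounce("we were".split(), PRON_DICT)
print(uncovered)
```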
Phoneme Set (e.g. Standard Arabic)
Phon. Symbol   Trans. Symbol   Name            Arabic Symbol
E              E               hamza           ء
AA             A~              wasla           آ
AE             Ae              hamza           إ, أ
O              O               hamza           ؤ
I              I               hamza           ئ
A              A               alif            ا
U              U               alif maksura    ى
B              B               ba              ب
TE             Te              ta marbuta      ة
T              T               ta              ت
TH             Th              sa              ث
J              J               jeem            ج
H7             7               ha              ح
H              H               ha              هه
KH             Kh              khaf            خ
D              D               daal            د
DH             Dh              thal            ذ
R              R               ra              ر
Z              Z               za              ز
S              S               seen            س
SH             Sh              sha             ش
SD             Sd              saad            ص
DD             Dd              daad            ض
TT             Tt              tta             ط
DS             D~              tha             ظ
E3             3               ain             ع
GH             Gh              gin             غ
F              F               fa              ف
Q              Q               qaaf            ق
K              K               kaaf            ك
L              L               lam             ل
M              M               mim             م
N              N               noon            ن
W              W               waw             و
Y              Y               yaa             ي
a              a               fatha           ـَ
u              u               damma           ـُ
i              i               kasra           ـِ
an             an              tanwin fatha    ـً
un             un              tanwin damma    ـٌ
in             in              tanwin kasra    ـٍ
Text Data
For training the language model we need a huge corpus of text data from the same domain
• The language model helps guide the search
• Compute probabilities of words, word pairs and word triples
• Millions of words needed to calculate these probabilities
• Text corpus should be as close as possible to the given domain
• Writing systems must be the same
• Other text might be useful as background information
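The probabilities of words, word pairs and word triples are estimated from n-gram counts over the text corpus. The following is a minimal sketch with plain relative frequencies on the three toy sentences from the slides; the function names and the sentence boundary symbols `<s>` and `</s>` are assumptions, and a real language model would add smoothing for unseen n-grams.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams, padding each sentence with <s> and </s> boundary symbols."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def bigram_prob(sentences, prev, word):
    """Relative-frequency estimate of P(word | prev), without smoothing."""
    bigrams = ngram_counts(sentences, 2)
    history = sum(c for (first, _), c in bigrams.items() if first == prev)
    return bigrams[(prev, word)] / history if history else 0.0

corpus = ["i am", "you are", "we are"]

unigrams = ngram_counts(corpus, 1)
trigrams = ngram_counts(corpus, 3)
p_are_given_we = bigram_prob(corpus, "we", "are")
p_you_at_start = bigram_prob(corpus, "<s>", "you")
print(unigrams[("are",)], p_are_given_we, p_you_at_start)
```

With only three sentences most word pairs never occur, which is why millions of words are needed before such estimates become reliable.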
Computer Requirements
• Data collection
  - IBM compatible PC
  - High quality soundcard like Soundblaster
  - Close-speaking microphone like Sennheiser H-420
  - Operating system: Win95 or later
  - Large hard disk
• Speech Recognition
  - 16000 x 2 bytes per sec ≈ 30 kBytes/sec ≈ 2 MB/min ≈ 120 MB/hr ≈ 1.2 GigaBytes for 10 hr of spoken speech
  - Fast processor - as fast as possible
  - RAM: at least 512 MB
  - Additional 2-4 GigaBytes for temporary files during training and testing
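The disk space estimate follows directly from the recording format (16 kHz sampling rate, 16 bit = 2 bytes per sample). The sketch below repeats that arithmetic; the function name `audio_bytes` and the single-channel default are assumptions.

```python
def audio_bytes(hours, sample_rate_hz=16000, bytes_per_sample=2, channels=1):
    """Uncompressed audio size in bytes for the given recording time."""
    return int(hours * 3600 * sample_rate_hz * bytes_per_sample * channels)

per_second = audio_bytes(1 / 3600)   # 32000 bytes, roughly 30 kBytes
per_hour = audio_bytes(1)            # 115,200,000 bytes, roughly 120 MB
ten_hours = audio_bytes(10)          # 1,152,000,000 bytes, roughly 1.2 GB
print(per_second, per_hour, ten_hours / 1e9)
```

If each dialog partner is recorded on a separate channel, the totals double accordingly.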
Translation

Donna, Lori?
Discussion
Speech Recognizer in Egypt or Standard Arabic language?
• Egypt
  - Spoken -used- language: more interesting for a human-to-human speech-to-speech translation system?
  - Standardized pronunciation?
  - Large text resources available in Egypt?
  - Parser output follows Standard Arabic vocalization?
  - Use Egypt CallHome data and pronunciation dictionaries (LDC)?
• Standard Arabic
  - Useful to a larger community?
  - Canonical pronunciation?
  - Preliminary speech recognizer and data already available at CMU
  - Larger text resources available?
Do we want monolingual dialogs (agent & client) or multilingual recordings?
Part 2
Initialization of an Egypt Speech Recognition Engine

Multilingual Speech Recognition

Rapid Adaptation to new Languages
Initialization of Egypt SR Engine
Rapid initialization of an Egypt/Arabic speech recognizer?
• Pronunciation dictionary: Grapheme-to-Phoneme tool available if vocalization, romanization is provided by trl
• Language model: text corpora if vocalized
• Apply Egypt parser for vocalization?
• Acoustic models: Initialization or Adaptation according to our fast adaptation approach PDTS
GlobalPhone
Arabic
Ch-Mandarin
Ch-Shanghai
English
French
German
Japanese
Korean
Croatian
Portuguese
Russian
Spanish
Swedish
Tamil
Turkish
Multilingual Database
• Widespread languages
• Native Speakers
• Uniformity
• Broad domain
• Huge text resources
  - Internet Newspapers
Total sum of resources
 15 languages so far
  300 hours speech data
  1400 native speakers
Speech Recognition in Multiple Languages
Goal: Speech recognition in many different languages
Problem: Only few or no training data available (costs, time)
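[Figure: resources needed for a recognizer in a new language (example: Portuguese). Sound system and speech data (on the order of 10 hours) feed the acoustic model (AM); pronunciation rules feed the lexicon (Lex: ela /e/l/a/, eu /e/u/, sou /s/u/); text data feeds the language model (LM: eu sou, você é, ela é).]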
Multilingual Acoustic Modeling
Step 1:
• Combine acoustic models
• Share data across languages
Multilingual Acoustic Modeling
Sound production is human, not language specific:
• International Phonetic Alphabet (IPA)
• Multilingual Acoustic Modeling
1) Universal sound inventory based on IPA
   485 sounds are reduced to 162 IPA-sound classes
2) Each sound class is represented by one "phoneme" which is trained through data sharing across languages
   • m,n,s,l occur in all languages
   • p,b,t,d,k,g,f and i,u,e,a,o occur in almost all languages
   • no sharing of triphthongs and palatal consonants
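Data sharing across languages can be pictured as pooling training samples by IPA sound class rather than by language. The sketch below is only an illustration of that idea: the language-specific phoneme names, the mapping table `IPA_CLASS`, the function `pool_training_data` and the feature values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from (language, language-specific phoneme) to an IPA class.
IPA_CLASS = {
    ("german", "M"): "m",
    ("portuguese", "m"): "m",
    ("german", "S"): "s",
    ("portuguese", "s"): "s",
}

def pool_training_data(samples, ipa_class):
    """Group feature vectors by IPA class so one model is trained per class."""
    pooled = defaultdict(list)
    for language, phoneme, features in samples:
        pooled[ipa_class[(language, phoneme)]].append((language, features))
    return dict(pooled)

# Made-up feature vectors standing in for real acoustic training data.
samples = [
    ("german", "M", [0.1, 0.2]),
    ("portuguese", "m", [0.3, 0.1]),
    ("portuguese", "s", [0.9, 0.4]),
]
pooled = pool_training_data(samples, IPA_CLASS)
print({ipa: len(items) for ipa, items in pooled.items()})
```

Sounds that are not shared, such as the triphthongs and palatal consonants mentioned above, would simply keep a class that receives data from one language only.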
Rapid Language Adaptation
Step 2:
• Use ML acoustic models, borrow data
• Adapt ML acoustic models to target language
[Figure: target-language recognizer with acoustic model (AM), lexicon (Lex: ela /e/l/a/, eu /e/u/, sou /s/u/) and language model (LM: eu sou, você é, ela é).]
Rapid Language Adaptation
Model mapping to the target language
1) Map the multilingual phonemes to Portuguese ones based on the IPA-scheme
2) Copy the corresponding acoustic models in order to initialize Portuguese models
Problem: Contexts are language specific, how to apply context dependent models to a new target language
Solution: Adaptation of multilingual contexts to the target language based on limited training data
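The two mapping steps can be sketched as a lookup followed by a copy. This illustrates the context-independent initialization only, not the PDTS adaptation of contexts. The model contents, the function `init_target_models`, and the Portuguese phoneme "lh" with its IPA class are hypothetical; the phonemes e, l, a, u, s are the ones appearing in the example lexicon entries.

```python
import copy

def init_target_models(ml_models, target_to_ipa):
    """Initialize target-language models by copying the mapped multilingual ones.

    Returns the initialized models and the target phonemes without a counterpart.
    """
    initialized, unmapped = {}, []
    for target_phoneme, ipa_class in target_to_ipa.items():
        if ipa_class in ml_models:
            # Deep copy, so later adaptation does not alter the multilingual model.
            initialized[target_phoneme] = copy.deepcopy(ml_models[ipa_class])
        else:
            unmapped.append(target_phoneme)
    return initialized, unmapped

# Placeholder multilingual models, one per IPA class (parameters are made up).
ml_models = {ipa: {"mean": [0.0, 0.0], "var": [1.0, 1.0]}
             for ipa in ["e", "l", "a", "u", "s"]}

# Step 1: hypothetical mapping of Portuguese phonemes to IPA classes.
portuguese_to_ipa = {"e": "e", "l": "l", "a": "a", "u": "u", "s": "s", "lh": "ʎ"}

# Step 2: copy the corresponding models.
models, unmapped = init_target_models(ml_models, portuguese_to_ipa)

# Adapting a Portuguese model leaves the multilingual original untouched.
models["e"]["mean"][0] = 0.5
print(sorted(models), unmapped)
```

Phonemes that end up in `unmapped` have no multilingual counterpart and would need another initialization, for example from the limited target-language data.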
Language Adaptation Experiments
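[Figure: bar chart of word error rate [%] (axis 0 to 100) for the systems Ø Tree, ML-Tree, Po-Tree and PDTS, plotted against the amount of adaptation data (axis labels 0, 0:15, 0:25, 1:30, 16:30). Values shown: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19.]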
Summary
 Multilingual database suitable for MLVCSR
 Covers the most widespread languages
 Language dependent recognition in 10 languages
 Language independent acoustic modeling
 Global phoneme set that covers  10 languages
 Data sharing thru multilingual models
 Language adaptive speech recognition
 Limited amount of language specific data
 Create speech engines in new target languages
using only limited data, save time and money
Selected Publications

Language Independent and Language Adaptive Acoustic Modeling
Tanja Schultz and Alex Waibel, in: Speech Communication, to appear 2001

Multilinguality in Speech and Spoken Language Systems
Alex Waibel, Petra Geutner, Laura Mayfield-Tomokiyo, Tanja Schultz, and Monika Woszczyna, in: Proceedings of the IEEE, Special Issue on Spoken Language Processing, Volume 88(8), pp 1297-1313, August 2000

Polyphone Decision Tree Specialization for Language Adaptation
Tanja Schultz and Alex Waibel, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2000), Istanbul, Turkey, June 2000

Download from http://www.cs.cmu.edu/~tanja