ppt - University of Macau

Domain Adaptation for
Statistical Machine Translation
Master Defense
By
Longyue WANG, Vincent
MT Group, NLP2CT Lab, FST, UM
Supervised by Prof. Lidia S. Chao, Prof. Derek F. Wong
20/08/2014
Research Scope
Figure 1: Our Research Scope [1] [2]
(Diagram nodes: Computational Linguistics; Machine Translation; Speech Translation; Text Translation; Rule-based MT; Hybrid MT; Statistical MT; Domain-Specific MT; Domain-Specific Statistical MT)
[1] Daniel Jurafsky and James Martin (2008) An Introduction to Natural Language Processing, Computational Linguistics, and Speech
Recognition, Second Edition. Prentice Hall.
[2] Wikipedia, http://en.wikipedia.org/wiki/Machine_translation.
(2/84)
Agenda
• Introduction
• Proposed Method I: New Criterion
• Proposed Method II: Combination
• Proposed Method III: Linguistics
• Domain-Specific Online Translator
• Conclusion
(3/84)
Part I: Introduction
(4/84)
The First Question
WHAT IS STATISTICAL MACHINE
TRANSLATION?
5
Statistical Machine Translation
Figure 2: Phrase-based SMT Framework
• SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3].
• Currently, the most successful SMT approach is phrase-based SMT, in which the basic translation unit is a phrase, i.e. a sequence of n consecutive words.

[3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. Computational Linguistics. 19:263–311.
(6/84)
Statistical Machine Translation
Figure 2: Phrase-based SMT Framework (components on this slide: Parallel Corpus, Monolingual Corpus)
• A corpus is a collection of texts, e.g. the IWSLT2012 official corpus.
• A bilingual (parallel) corpus is a collection of texts paired with their translations into another language. A monolingual corpus contains text in one language only, usually the target language.
• Corpora may come from different genres, topics, etc.
(7/84)
Statistical Machine Translation
Figure 2: Phrase-based SMT Framework (components on this slide: Word Alignment, Translation Table, Reordering Model)
• Word alignments can be learned with the EM algorithm.
• Phrase pairs are then extracted from the word alignments to build the translation table.
• The distance-based reordering model penalizes changes in the position of translated phrases.
(8/84)
Statistical Machine Translation
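Figure 2: Phrase-based SMT Framework (component on this slide: Language Model)
A language model assigns a probability to a sequence of words, typically with an n-gram model [4]:
$$p_{LM}(s) = \prod_{i=1}^{l+1} p(w_i \mid w_{i-n+1}^{i-1}) \qquad (1)$$
where l is the number of words in the sentence s.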
[4] F Song and W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval.
pp. 279–280..
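Equation (1) can be illustrated with a small bigram model (n = 2). This is only a sketch: the add-one smoothing, the `<s>` and `</s>` boundary markers and the function names are assumptions made for the example and are not taken from the slides.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts from tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def sentence_logprob(sent, model):
    """Log of Equation (1) for n = 2: l words plus the end marker give l + 1 factors."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sent + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing (an assumption) keeps unseen bigrams from scoring zero.
        logp += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return logp


model = train_bigram([["a", "b"], ["a", "c"]])
seen = sentence_logprob(["a", "b"], model)
unseen = sentence_logprob(["b", "a"], model)
```

A word order observed in training (`seen`) receives a higher log-probability than an unobserved one (`unseen`), which is the property the decoder relies on for fluency.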
(9/84)
Statistical Machine Translation
Figure 2: Phrase-based SMT Framework (components on this slide: Source Text, Decoding, Searching, Translation Candidates, Target Text)
$$e_{best} = \arg\max_{e} \prod_{i=1}^{I} \phi(f_i \mid e_i)\, d(start_i - end_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1}) \qquad (2)$$
The decoding function consists of three components: the phrase translation table, which ensures that the foreign phrases match the target ones; the reordering model, which reorders the phrases appropriately; and the language model, which ensures that the output is fluent.
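The way the three components combine for a single translation candidate can be sketched in log space as follows. The exponential distortion penalty `alpha ** |x|`, the value of `alpha`, the tuple layout and the function names are assumptions for illustration; a real decoder also searches over many candidates, which this sketch does not do.

```python
import math


def distortion(start, prev_end, alpha=0.5):
    """Distance-based reordering penalty d(start_i - end_{i-1} - 1), assumed exponential."""
    return alpha ** abs(start - prev_end - 1)


def candidate_logscore(phrases, lm_logprob, alpha=0.5):
    """Log-score of one candidate under the three-component model.

    phrases: list of (start, end, phi, target_words) in target order, where
    start/end are 1-based source positions and phi is the phrase translation
    probability. lm_logprob: callable returning the LM log-probability.
    """
    score, prev_end, target = 0.0, 0, []
    for start, end, phi, words in phrases:
        score += math.log(phi) + math.log(distortion(start, prev_end, alpha))
        prev_end = end
        target.extend(words)
    return score + lm_logprob(target)


flat_lm = lambda words: 0.0  # placeholder language model for the example
monotone = candidate_logscore([(1, 2, 0.5, ["a"]), (3, 3, 0.25, ["b"])], flat_lm)
reordered = candidate_logscore([(3, 3, 0.25, ["b"]), (1, 2, 0.5, ["a"])], flat_lm)
```

With the same phrase probabilities, the reordered candidate scores lower than the monotone one because it pays the distortion penalty.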
(10/84)
The Second Question
WHAT IS DOMAIN-SPECIFIC SMT
SYSTEM?
11
Typical SMT vs. Domain-Specific SMT
• Typical SMT systems are trained on a large, broad corpus (i.e., general-domain) and translate texts without regard to their domain.
• Performance depends heavily on the quality and quantity of the training data.
• Outputs preserve the semantics of the source side but lack morphological and syntactic correctness.
• Translation quality is merely understandable. BBC News example [5]:
Input:
Hollywood actor Jackie Chan has apologised over his son's arrest on
drug-related charges, saying he feels "ashamed" and "sad".
Google Output:
好萊塢影星成龍已經道歉了他兒子的被捕與毒品有關的指控,說他
感覺“羞恥”和“悲傷”。
[5] Available at http://www.bbc.com/news/world-asia-china-28871698. (BBC News 20 August 2014.)
(12/84)
13
Typical SMT vs. Domain-Specific SMT
• Domain-specific SMT systems are trained on a small but relevant corpus (i.e., in-domain) and translate texts from one specific domain.
• They take into account the relevance between the training data and the text to be translated (the test data).
• Outputs preserve the semantics of the source side as well as morphological and syntactic correctness.
• Translation quality is publishable. Patent document example [6]:
Input:
本发明涉及新的tetramic酸型化合物,它从CCR-5活性复合物中分离出来,在控制
条件下通过将生物纯的微生物培养液(球毛壳霉Kunze SCH 1705 ATCC 74489)发酵来
制备复合物。[5]
ICONIC Translator Output:
Novel tetramic acid-type compounds isolated from a CCR-5 active complex produced by
fermentation under controlled conditions of a biologically pure culture of the
microorganism, Chaetomium globosum Kunze SCH 1705, ATCC 74489 ., pharmaceutical
compositions containing the compounds.
[6] Chinese Patent WO01/74772《受体拮抗剂趋化因子》.
(14/84)
The Third Question
WHAT IS DOMAIN-SPECIFIC
TRANSLATION CHALLENGE?
15
Challenge 1 – Ambiguity

A word with several meanings may not map to a single word in the other language. The English word "mouse" refers both to an animal and to an electronic device, whereas Chinese uses two different words. Choosing the wrong translation variant is a potential cause of miscomprehension.
Figure 3: Translation ambiguity example
(16/84)
Challenge 2 – Language Style
News Domain
• Tries to deliver rich information in very economical language.
• Short sentences with simple structure make the text easy to understand.
• Many abbreviations, dates and named entities.
China's Li Duihong won the women's 25-meter sport pistol
Olympic gold with a total of 687.9 points early this morning
Beijing time. (Guangming Daily, 1996/07/02)
我国女子运动员李对红今天在女子运动手枪决赛中,以
687.9环战胜所有对手,并创造新的奥运记录。(《光明
日报》 1996年7月2日)
(17/84)
Challenge 2 – Language Style
Law Domain
• Very rigorous, even at the cost of repeating terms.
• Uses fewer pronouns, abbreviations, etc. to avoid any ambiguity.
• High frequency of words such as shall, may, must and be to.
• Long sentences with long subordinate clauses.
When an international treaty that relates to a contract and which the People’s Republic of China has concluded or participated into has provisions that differ from the law of the People’s Republic of China, the provisions of the said treaty shall be applied, but with the exception of clauses to which the People’s Republic of China has declared reservation.
中华人民共和国缔结或者参加的与合同有关的国际条约同中华人民共
和国法律有不同规定的,适用该国际条约的规定。但是,中华人民共和
国声明保留的条款除外。
(18/84)
Challenge 3 – Out-Of-Vocabulary


Terminology: words or phrases that mainly occur in specific
contexts with specific meanings.
New terms keep appearing: variants of existing terms, newly coined terms, combinations of terms, etc.
8.36%
BHT
91.64%
2,6-二叔丁基
-4-甲基苯酚
Figure 4: Out-of-Vocabulary Example
(19/84)
Domain Adaptation
As SMT is corpus-driven, domain-specificity of training
data with respect to the test data is a significant factor that
we cannot ignore.
 There is a mismatch between the domain of available
training data and the target domain.
 Unfortunately, the training resources in specific domains
are usually relatively scarce.

In such scenarios, various domain adaptation techniques are
employed to improve domain-specific translation quality by
leveraging general-domain data.
(20/84)
Domain Adaptation for SMT
Domain adaptation can be employed in different SMT
components: word-alignment model, language model,
translation model and reordering model. [6] [7]
Model
Figure 5: Domain Adaptation Approaches
[6] Hua, Wu, Wang Haifeng, and Liu Zhanyi. "Alignment model adaptation for domain-specific word alignment." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.
[7] Koehn, Philipp, and Josh Schroeder. "Experiments in domain adaptation for statistical machine translation." Proceedings of the Second Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2007.
(21/84)
Domain Adaptation for SMT
Various resources can be used for domain adaptation: monolingual corpora, parallel corpora, comparable corpora and dictionaries. [8]
Resources
Figure 5: Domain Adaptation Approaches
[8] Wu, Hua, Haifeng Wang, and Chengqing Zong. "Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora." Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 2008.
(22/84)
Domain Adaptation for SMT
In terms of supervision, domain adaptation approaches can be divided into supervised, semi-supervised and unsupervised. [9]
Supervision
Figure 5: Domain Adaptation Approaches
[9] Snover, Matthew, Bonnie Dorr, and Richard Schwartz. "Language and translation model adaptation using comparable corpora." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
(23/84)
My Thesis
• Data Selection: addresses the ambiguity and language-style problems by shifting the data distribution of the training corpora toward the target domain.
• Domain Focused Web-Crawling: reduces OOVs by mining in-domain dictionaries and parallel and monolingual sentences from comparable corpora (the web).
Figure 6: My Domain Adaptation Approaches
(24/84)
Part II: Data Selection
(25/84)
Definition
Selecting data suitable for the domain at hand from large
general-domain corpora, under the assumption that a
general corpus is broad enough to contain sentences that are
similar to those that occur in the domain.
Figure 7: Data Selection Definition (a general-domain corpus covering Law, Subtitle, Dialog, News, Novel, …; the selected data trains an SMT system for the Spoken Domain)
(26/84)
Framework – TM Adaptation
Figure 8: My Data Selection Framework (Domain Estimation step: the general-domain corpus, with its source-language and target-language sides, is scored against the target domain)
$$Score(S_i, T_i) \rightarrow Sim(V_i, M_R)$$
We define the set {<S_i>, <T_i>, <S_i, T_i>} as V_i. M_R is an abstract model representing the target domain.
(27/84)
Framework – TM Adaptation
After Domain Estimation, the general-domain sentence pairs (source and target language) are handled as follows:
• Rank sentence pairs
according to score.
• Select top K% of
general-domain data.
• K is a tunable threshold.
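The rank-and-select step can be sketched as below. The function name, the `higher_is_better` flag and the toy data are assumptions for illustration; similarity scores such as cosine are ranked in descending order, whereas cross-entropy style scores would be ranked in ascending order.

```python
def select_top_k(pairs, scores, k_percent, higher_is_better=True):
    """Rank sentence pairs by score and keep the top K% of the corpus."""
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=higher_is_better)
    n_keep = int(len(pairs) * k_percent / 100)
    return [pairs[i] for i in order[:n_keep]]


corpus = [("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")]
similarity = [0.1, 0.9, 0.5, 0.7]
pseudo_in_domain = select_top_k(corpus, similarity, k_percent=50)
```

Here `pseudo_in_domain` holds the two highest-scoring pairs; K would be tuned on a development set.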
Figure 8: My Data Selection Framework
(28/84)
Framework – TM Adaptation
Figure 8: My Data Selection Framework (TM adaptation: Domain Estimation selects a pseudo in-domain subset of the general-domain corpus; Translation Model (Pseudo), trained on that subset, is combined with Translation Model (IN) by log-linear or linear interpolation to give Translation Model (Final))
Log-linear interpolation:
$$p(x) = \exp \sum_{i=1}^{n} \lambda_i h_i(x), \qquad 0 \le \lambda_i \le 1, \ \sum_i \lambda_i = 1$$
Linear interpolation:
$$p_w(f \mid e, a) = \sum_{i=0}^{n} \beta_i \, p_{w,i}(f \mid e, a), \qquad 0 \le \beta_i \le 1, \ \sum_i \beta_i = 1$$
(29/84)
Framework – LM Adaptation
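Figure 8: My Data Selection Framework (LM adaptation: Domain Estimation selects a pseudo in-domain subset of the target-language data; Language Model (Pseudo), trained on that subset, is combined with Language Model (IN) by log-linear or linear interpolation to give Language Model (Final))
Log-linear interpolation:
$$p(x) = \exp \sum_{i=1}^{n} \lambda_i h_i(x), \qquad 0 \le \lambda_i \le 1, \ \sum_i \lambda_i = 1$$
Linear interpolation:
$$p(s) = \sum_{i=1}^{n} \lambda_i P_{LM_i}(s), \qquad 0 \le \lambda_i \le 1, \ \sum_i \lambda_i = 1$$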
(30/84)
Framework – LM Adaptation
Figure 8: My Data Selection Framework
(31/84)
Related Work
Vector space model (VSM): each sentence is converted into a term-weighted vector, and a vector similarity function then measures domain relevance. The sentence S_i is represented as a vector:
$$S_i = \langle w_{i1}, w_{i2}, \ldots, w_{in} \rangle \qquad (3)$$
Standard tf-idf weight: n is the size of the vocabulary, and each weight w_{ij} is calculated as follows:
$$w_{ij} = tf_{ij} \times \log(idf_j) \qquad (4)$$
Cosine measure: the similarity between two sentences is defined as the cosine of the angle between their vectors:
$$\cos\theta = \frac{S_{Gen} \cdot S_{IN}}{\lVert S_{Gen} \rVert \, \lVert S_{IN} \rVert} \qquad (5)$$
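Equations (3) to (5) can be sketched with sparse dictionaries instead of full vocabulary-sized vectors. Taking idf_j as N / df_j (number of sentences divided by the number of sentences containing the term) is an assumption, as are the function names and the toy sentences.

```python
import math
from collections import Counter


def idf_table(docs):
    """idf_j = N / df_j, so that log(idf_j) matches Equation (4)."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: len(docs) / count for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse tf-idf vector: w_ij = tf_ij * log(idf_j)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(idf[term]) for term in tf if term in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors, Equation (5)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["hotel", "room"], ["hotel", "price"], ["court", "law"]]
idf = idf_table(docs)
v0, v1, v2 = (tfidf(doc, idf) for doc in docs)
```

The two travel sentences share the term "hotel" and obtain a positive cosine, while the legal sentence has no overlap with them and scores zero. This also shows the limitation noted later in the slides: only overlapping single words contribute.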
(32/84)
Related Work
Perplexity-based model: n-gram in-domain language models are used to score the perplexity of each sentence in the general-domain corpus.
• Cross-entropy is the average of the negative logarithm of the word probabilities:
$$H(p, q) = -\sum_{i=1}^{N} p(w_i) \log q(w_i) = -\frac{1}{N} \sum_{i=1}^{N} \log q(w_i) \qquad (6)$$
where N is the number of words in the scored text.
• Perplexity pp is obtained from the cross-entropy by exponentiation with the base b in which the cross-entropy is measured (e.g., b = 2 for bits or b = e for nats):
$$pp = b^{H(p, q)} \qquad (7)$$
• Perplexity and cross-entropy are monotonically related.
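Equations (6) and (7) can be sketched with a unigram model standing in for the n-gram in-domain language model. The unigram simplification, the add-one smoothing with one extra slot for unknown words, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Word counts and total token count of a tokenized corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, model, base=2.0):
    """Equation (6): average negative log-probability per word."""
    counts, total = model
    vocab_size = len(counts) + 1  # one extra slot for unknown words (assumption)
    logs = [math.log((counts[w] + 1) / (total + vocab_size), base) for w in sentence]
    return -sum(logs) / len(logs)


def perplexity(sentence, model, base=2.0):
    """Equation (7): pp = b ** H(p, q)."""
    return base ** cross_entropy(sentence, model, base)


in_domain_lm = train_unigram([["a", "a", "b"]])
h_bits = cross_entropy(["a"], in_domain_lm)
pp_bits = perplexity(["a", "b"], in_domain_lm, base=2.0)
pp_nats = perplexity(["a", "b"], in_domain_lm, base=math.e)
```

The perplexity does not depend on the base as long as the same base is used in both equations, which is why ranking sentences by cross-entropy or by perplexity gives the same order.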
(33/84)
Related Work
So far, there are three perplexity-based variants:
• The first, basic one [13] scores a sentence x by its cross-entropy under the in-domain source-side language model:
$$H_{I\text{-}src}(x) \qquad (8)$$
• The second is called Moore-Lewis [14]:
$$H_{I\text{-}src}(x) - H_{O\text{-}src}(x) \qquad (9)$$
which tries to select the sentences that are more similar to the in-domain data but different from the out-of-domain data.
• The third is modified Moore-Lewis [15]:
$$[H_{I\text{-}src}(x) - H_{O\text{-}src}(x)] + [H_{I\text{-}tgt}(x) - H_{O\text{-}tgt}(x)] \qquad (10)$$
which considers both the source and the target language.
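The three criteria differ only in which cross-entropies they combine, as the sketch below shows. The cross-entropy functions are passed in as callables; the function names and the toy entropy values are assumptions for illustration. For all three scores, lower means closer to the target domain.

```python
def basic_score(h_in_src, src):
    """Equation (8): in-domain cross-entropy of the source sentence."""
    return h_in_src(src)


def moore_lewis(h_in_src, h_out_src, src):
    """Equation (9): in-domain minus out-of-domain cross-entropy."""
    return h_in_src(src) - h_out_src(src)


def modified_moore_lewis(h_in_src, h_out_src, h_in_tgt, h_out_tgt, src, tgt):
    """Equation (10): the Moore-Lewis difference summed over both language sides."""
    return (h_in_src(src) - h_out_src(src)) + (h_in_tgt(tgt) - h_out_tgt(tgt))


# Toy cross-entropy tables standing in for four trained language models.
h_in_src = {"x": 2.0, "y": 3.0}.get
h_out_src = {"x": 5.0, "y": 3.5}.get
h_in_tgt = {"x'": 2.5, "y'": 4.0}.get
h_out_tgt = {"x'": 4.5, "y'": 4.0}.get

pairs = [("y", "y'"), ("x", "x'")]
ranked = sorted(
    pairs,
    key=lambda p: modified_moore_lewis(h_in_src, h_out_src, h_in_tgt, h_out_tgt, *p),
)
```

In the toy data, the pair ("x", "x'") is cheap under the in-domain models and expensive under the out-of-domain models, so it is ranked first.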
[13] Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese.
ACM Transactions on Asian Language Information Processing (TALIP). 1:3–33.
[14] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. Proceedings of ACL: Short Papers. pp.
220–224.
[15] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In: Proceedings of EMNLP. pp. 355–362.
(34/84)
Discussion: Grain Level
From a review of previous work, I found:
• VSM-based methods obtain an improvement of about 1 BLEU point using 60% of the general-domain data [10, 11, 12].
• Perplexity-based approaches make it possible to discard 50% to 99% of the general corpus while gaining 1.0 to 1.8 BLEU points [13, 14, 15, 16, 17].
[10] Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query
models. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Geneva,
Switzerland.
[11] Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation model for statistical machine
translation information retrieval. In 10th Annual Conference of the European Association for Machine Translation (EAMT 2005). Budapest,
Hungary.
[12] Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and
optimization. Proceedings of EMNLP-CoNLL. pp. 343–350..
[15] Keiji Yasuda and Eiichiro Sumita. 2008. Method for building sentence-aligned corpus from wikipedia. In 2008 AAAI Workshop on
Wikipedia and Artificial Intelligence (WikiAI08).
[16] George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine
translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459. Association for
Computational Linguistics, Cambridge, Massachusetts.
(35/84)
Discussion: Grain Level



VSM-based similarity is a simple co-occurrence based
matching, which only weights single overlapping words.
Perplexity-based similarity considers not only the
distribution of terms but also the n-gram word collocation.
String-difference can comprehensively consider word
overlap, n-gram collocation and word position.
Figure 9: Data Selection Pyramid (levels from highest to lowest grain: Edit-distance, word position; Perplexity-based, word order; Cosine tf-idf, word overlap; each compares a query sentence (V) with a candidate sentence (R))
(36/84)
The First Proposed Method
EDIT DISTANCE: A NEW DATA
SELECTION CRITERION FOR SMT
DOMAIN ADAPTATION
37
New Criterion
A string-difference metric is a better similarity function [21], with a higher grain level.
Edit distance is proposed as a new selection criterion. Given a sentence s_G from the general-domain corpus and a sentence s_I from the in-domain corpus, the edit distance between the two sequences is defined as the minimum number of edits, i.e. symbol insertions, deletions and substitutions, needed to transform s_G into s_I.
$$FMS = 1 - \frac{ED_{word}(s_G, s_I)}{\max(|s_G|, |s_I|)} \qquad (11)$$
The normalized similarity score (fuzzy matching score, FMS) is the one given by Koehn and Senellart [22] in their translation memory work.
[21] Wang, Longyue, et al. "Edit Distance: A New Data Selection Criterion for Domain Adaptation in SMT." RANLP. 2013.
[22] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran et al. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of ACL. pp. 177–180.
(38/84)
New Criterion
For each sentence in general-domain corpus, we traverse all
in-domain sentences to calculate FMS score and then average
them.
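$$Score(s_G) = \frac{1}{N} \sum_{i=1}^{N} FMS(s_G, s_{I_i}) \qquad (12)$$
where N is the number of sentences in the in-domain corpus.
Figure 10: Edit-distance based data selection (each sentence of the general-domain corpus is compared with every sentence of the in-domain corpus)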
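The edit-distance criterion, i.e. the fuzzy matching score averaged over the in-domain corpus, can be sketched as follows. The word-level dynamic program is the standard Levenshtein recurrence; the function names and the toy sentences are assumptions for illustration.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def fms(s_g, s_i):
    """Fuzzy matching score: 1 - ED / max(|s_G|, |s_I|)."""
    longest = max(len(s_g), len(s_i))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s_g, s_i) / longest


def score(s_g, in_domain):
    """Average FMS of one general-domain sentence over all in-domain sentences."""
    return sum(fms(s_g, s_i) for s_i in in_domain) / len(in_domain)


in_domain = ["where is the station".split(), "where is the hotel".split()]
candidate = "where is the hotel".split()
candidate_score = score(candidate, in_domain)
```

Each general-domain sentence is compared with every in-domain sentence, so the cost grows with the product of the two corpus sizes; this is worth keeping in mind for corpora of millions of sentences.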
(39/84)
Experiment: Corpora (Chinese-English)
• The general-domain parallel corpus (in-house) comprises sentences from various genres, such as movie subtitles, legal literature, news and novels.
• The in-domain parallel corpus, dev set and test set are randomly selected from the IWSLT2010 Dialog corpus [37], which consists of transcriptions of conversational speech in the travel domain.
• We use the parallel corpora for TM training and the target side for LM training.

Data Set         Sentences    Ave. Len.
Test Set             3,500         9.60
Dev Set              3,000         9.46
In-domain           17,975         9.45
General-domain   5,211,281        12.93
Table 1: Corpora Statistics (English-Chinese)
[37] Available at http://iwslt2010.fbk.eu/node/33.
(40/84)
Experiment: System Setting
• Baseline: SMT trained on the entire general-domain corpus.
• VSM-based system (VSM): SMT trained on the top K% of the general-domain corpus ranked by the cosine tf-idf metric.
• Perplexity-based system (PL): SMT trained on the top K% of the general-domain corpus ranked by the basic cross-entropy metric.
• String-difference system (SD): SMT trained on the top K% of the general-domain corpus ranked by the edit-distance metric.
We investigate K = {20, 40, 60, 80}% of the ranked general-domain data as the pseudo in-domain corpus for SMT training, where K% means that K percent of the general corpus is selected as a subset.
(41/84)
Experiment: Results
• Each of the three adaptation methods outperforms the baseline at its best setting.
• VSM improves by nearly 1 BLEU point, but needs 80% of the data to do so.
• PL is a simple but effective method, gaining 1.1 BLEU points with 60% of the data.
• SD performs best, achieving a higher BLEU score than the other two methods with less data.

System     20%             40%             60%             80%
Baseline   29.34 (entire general-domain corpus)
VSM        29.00 (-0.34)   29.50 (+0.16)   30.02 (+0.68)   30.31 (+0.97)
PL         29.45 (+0.11)   29.65 (+0.31)   30.44 (+1.10)   29.78 (+0.44)
SD         29.25 (-0.09)   30.22 (+0.88)   30.97 (+1.63)   30.21 (+0.87)
Table 2: Translation Quality of Adapted Models
(42/84)
Discussion


• SD > PL > VSM > Baseline.
• Higher-grained similarity metrics perform better than lower-grained ones.
Figure 9: Data Selection Pyramid (levels from highest to lowest grain: Edit-distance, word position; Perplexity-based, word order; Cosine tf-idf, word overlap; each compares a query sentence (V) with a candidate sentence (R))
• However, methods at different grain levels each have their own strengths.
• Can the individual models be combined?
(43/84)
The Second Proposed Method
A HYBRID DATA SELECTION
MODEL FOR SMT DOMAIN
ADAPTATION
44
Combination
We investigate the combination of the above three individual
models at two levels [23].
 Corpus level: weight the pseudo in-domain sub-corpora
selected by different methods and then join them together.
Figure 11: Combination Approach (sub-corpora selected from the general-domain corpus by VSM, …, ED are joined into a combined corpus)
[23] Wang, Longyue, et al. "iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation." Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer Berlin Heidelberg, 2013. 280-290.
(45/84)
Combination

• Model level: perform linear interpolation on the translation models trained on the different sub-corpora.
$$\phi(f \mid e) = \sum_{i=0}^{n} \alpha_i \phi_i(f \mid e) \qquad (13)$$
$$p_w(f \mid e, a) = \sum_{i=0}^{n} \beta_i \, p_{w,i}(f \mid e, a) \qquad (14)$$
where i = 1, 2 and 3 denote the phrase translation probabilities and lexical weights trained on the VSM, perplexity and edit-distance subsets, respectively. α_i and β_i are the tunable interpolation parameters, subject to $\sum_i \alpha_i = \sum_i \beta_i = 1$.
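Linear interpolation of phrase translation probabilities, as in Equation (13), can be sketched as follows. Representing each phrase table as a dictionary keyed by (source phrase, target phrase), treating a missing pair as probability 0, and the weight values are assumptions for illustration.

```python
def interpolate(tables, weights):
    """phi(f|e) = sum_i alpha_i * phi_i(f|e), with weights in [0, 1] summing to 1."""
    if len(tables) != len(weights):
        raise ValueError("one weight per table is required")
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    pairs = set().union(*tables)
    return {
        pair: sum(w * table.get(pair, 0.0) for w, table in zip(weights, tables))
        for pair in pairs
    }


table_a = {("f1", "e1"): 0.8, ("f1", "e2"): 0.2}
table_b = {("f1", "e1"): 0.5, ("f1", "e3"): 0.5}
combined = interpolate([table_a, table_b], [0.6, 0.4])
```

Because the weights sum to 1 and each input distribution sums to 1 for the same conditioning phrase, the interpolated values again sum to 1. The lexical weights of Equation (14) could be interpolated in the same way with the β weights.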
(46/84)
Experiment: Corpora (Chinese-English)

• The general-domain parallel corpus comprises sentences from various genres, such as movie subtitles, legal literature, news and novels.

Domain   Sent. No.    %
News       279,962    24.60%
Novel      304,932    26.79%
Law         48,754     4.28%
Others     504,396    44.33%
Total    1,138,044   100.00%
Table 3: Domain Distribution of the General-domain Corpus

• The in-domain parallel corpus, dev set and test set are disjoint random samples from the LDC corpus [38] (Hong Kong law domain).
[38] LDC2004T08, https://catalog.ldc.upenn.edu/LDC2004T08.
(47/84)
Experiment: Corpora (Chinese-English)
Data Set       Lang.   Sentences    Tokens        Av. Len.
Test Set       EN          2,050        60,399       29.46
               ZH                       59,628       29.09
Dev Set        EN          2,000        59,924       29.92
               ZH                       59,054       29.53
In-domain      EN         45,621     1,330,464       29.16
               ZH                    1,321,655       28.97
Training Set   EN      1,138,044    28,626,367       25.15
               ZH                   28,239,747       24.81
Table 4: Corpora Statistics



• Corpus size, data-type distribution and the in-domain/general-domain ratio differ from those of the first experiment.
• Data selection performance may therefore differ as well.
• We use the parallel corpora for TM training and the target side for LM training.
(48/84)
Experiment: System Setting
• Baseline: the general-domain baseline (GC-Baseline) is trained on the entire general corpus.
• Individual models: cosine tf-idf (Cos), the proposed edit-distance based method (ED) and three perplexity-based variants: cross-entropy (CE), Moore-Lewis (ML) and modified Moore-Lewis (MML).
• Combined models: Cos, ED and the best perplexity-based model combined at the corpus level (iCPE-C) and at the model level (iCPE-M).
We report results on selected corpora whose size doubles at each step, starting from 3.75% of the general corpus: K = {3.75, 7.5, 15, 30, 60}%.
(49/84)
Experiment: Individual Model Results
• The perplexity-based variants are all effective methods.
• MML performs best, giving the largest improvement (nearly 2 BLEU points) with the least data (15%).
• MML > ED > CE > ML > Cos > Baseline

System        3.75%       7.5%            15%             30%             60%
GC-Baseline   39.15 (entire general corpus)
CE            37.10 (-)   39.82 (+0.67)   40.39 (+1.24)   40.79 (+1.64)   39.43 (+0.28)
ML            38.07 (-)   40.33 (+1.18)   40.08 (+0.93)   40.46 (+1.31)   40.27 (+1.12)
MML           38.26 (-)   40.91 (+1.76)   41.12 (+1.97)   40.02 (+0.87)   39.82 (+0.67)
Cos           37.87 (-)   38.44 (-)       39.45 (+0.30)   40.17 (+1.02)   39.88 (+0.73)
ED            37.70 (-)   39.00 (-)       40.88 (+1.73)   40.24 (+1.09)   40.00 (+0.85)
Table 5: Translation Quality of Adapted Models
(50/84)
Experiment: Results

• Good performance is obtained at K = {7.5, 15, 30}%, so we apply the combination methods at these settings.
• Given their different natures, we further combine MML (the best perplexity-based model), Cos and ED.
Figure 12: Combination Approach (line plot of BLEU, axis from 37 to 41, against the size of selected data K%, axis from 0 to 100, for CE, ML, MML, Cos, ED and GC-Base)
(51/84)
Experiment: Combination Model Results



• Both combination methods perform slightly better than the best individual model.
• Model-level combination is better than corpus-level combination (+0.23 BLEU).
• Combination models > individual models > Baseline

System        7.5%            15%             30%
GC-Baseline   39.15 (entire general corpus)
MML           40.91 (+1.76)   41.12 (+1.97)   40.02 (+0.87)
iCPE-C        41.01 (+1.86)   41.95 (+2.80)   41.98 (+2.83)
iCPE-M        41.13 (+1.98)   42.21 (+3.06)   41.84 (+2.69)
Table 6: Translation Quality of Adapted Models
(52/84)
Discussion
We compare many data selection methods:
• VSM-based: cosine tf-idf.
• Perplexity-based: basic cross-entropy, Moore-Lewis and modified Moore-Lewis.
• String-difference: edit-distance.
• Combination: corpus-level and model-level.
The above methods consider only the words themselves (surface information).
• A language with a large set of distinct word forms leads to sparsity problems.
• They are weak at capturing language style, sentence structure and semantic information.
(53/84)
The Third Proposed Method
LINGUISTICALLY-AUGMENTED
DATA SELECTION FOR SMT
DOMAIN ADAPTATION
54
Linguistic DS
We explore two additional types of linguistic information for the data selection approach [25]:
• Surface forms (f), the words themselves, carry rich lexical information.
• Named entity categories (n) group together proper nouns that belong to the same semantic class (person, location, organization) [26].
• Part-of-speech tags (t) group together words that share the same grammatical function (e.g. adjectives, nouns, verbs) [27].
[25] Antonio Toral, Pavel Pecina, Longyue Wang, Josef van Genabith. (2014). “Linguistically-augmented Perplexity-based Data Selection for
Language Models.” Computer Speech and Language, (accepted and in minor revisions)..
[26] E. W. D. Whittaker, P. C. Woodland, Comparison of language modelling techniques for russian and english, in: ICSLP, ISCA, 1998.
[27] P. A. Heeman, POS tags and decision trees for language modeling, in: 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999, pp. 129–137.
(55/84)
Linguistic DS
Change the original corpus (f) into linguistic format (fn, ft
and t) and use them for LM training and sentence scoring.
 The core metric is the modified Moore-Lewis.
 According to the scores, select data from original corpus
(surface) to train adapted SMT models.

$$[H_{I\text{-}src}(x) - H_{O\text{-}src}(x)] + [H_{I\text{-}tgt}(x) - H_{O\text{-}tgt}(x)]$$
Need 4 LM models:
1, in-domain corpus in source language
2, in-domain corpus in target language
3, out-of-domain corpus in source language
4, out-of-domain corpus in target language
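The conversion of a surface corpus into the linguistic formats can be sketched as below. The input is assumed to be already annotated by a POS tagger and a named entity recognizer; the exact encodings (a `word|TAG` token for ft, entity category substituted for the word in fn), the tag names and the function name are assumptions for illustration and may differ from the formats used in [25].

```python
def to_views(tagged):
    """Build the f, t, ft and fn views of one annotated sentence.

    tagged: list of (word, pos_tag, ne_category_or_None) triples.
    """
    return {
        "f": [word for word, _, _ in tagged],
        "t": [tag for _, tag, _ in tagged],
        "ft": [f"{word}|{tag}" for word, tag, _ in tagged],
        "fn": [ne if ne else word for word, _, ne in tagged],
    }


sentence = [
    ("Jackie", "NNP", "PERSON"),
    ("visited", "VBD", None),
    ("Macau", "NNP", "LOCATION"),
]
views = to_views(sentence)
```

Language models would then be trained on one view of the in-domain and out-of-domain corpora, the sentences scored in that view, and the selected sentences taken from the original surface corpus.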
Figure 13: Linguistically-based Data Selection Method
(56/84)
Linguistic-based DS
Based on individual models, we further combine different
types of linguistic knowledge:
 Corpus level: given the sentences selected by all the
individual models considered for a given threshold, we
traverse the first ranked sentence by each of the methods,
then we proceed to the set of second best ranked
sentences, and so forth.
 Model level: Similar. The traversed sentences are kept in
different sets. Build LMs on each set and then interpolate
them.
These two combination levels are the same as in the second experiment.
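The corpus-level traversal described above amounts to a round-robin merge of the ranked lists. In the sketch below, skipping sentences that another method has already contributed and the optional `limit` parameter are assumptions for illustration.

```python
def round_robin(ranked_lists, limit=None):
    """Take the first-ranked sentence of each method, then the second-ranked, and so on."""
    merged, seen = [], set()
    depth = max(map(len, ranked_lists), default=0)
    for rank in range(depth):
        for ranking in ranked_lists:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
                if limit is not None and len(merged) >= limit:
                    return merged
    return merged


by_f = ["a", "b", "c"]
by_fn = ["b", "d", "a"]
by_ft = ["e", "a", "f"]
combined_corpus = round_robin([by_f, by_fn, by_ft])
```

For the model-level variant, the sentences contributed by each method would instead be kept in separate sets, one language model built per set, and the models interpolated.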
(57/84)
Experiment: Corpora (Chinese-English)
• The general-domain parallel corpus combines several general-domain corpora: CWMT2013 [39], UMCorpus [40], News Magazine [41], etc.
• The in-domain parallel corpus, dev set and test set come from the IWSLT2014 TED Talk data (talk domain) [42].

Data Set (EN/ZH)   Sentences     Ave. Len.
Test Set               1,570     26.54/23.41
Dev Set                  887     26.47/23.24
In-domain            177,477     26.47/23.58
General-domain    10,021,162     23.02/21.36
Table 7: Corpora Statistics
[39] http://www.liip.cn/cwmt2013/.
[40] http://nlp2ct.cis.umac.mo/um-corpus/.
[41] LDC2005T10. https://catalog.ldc.upenn.edu/LDC2005T10.
[42] http://workshop2014.iwslt.org/.
(58/84)
Experiment: System Setting
All adapted systems are log-linearly interpolated with the in-domain model to further improve performance.
• Baseline: GI-Baseline is trained on the entire in-domain corpus plus the general corpus.
• Individual models: surface form based (f), POS based (t), surface + named entity based (fn) and surface + POS based (ft).
• Combined models: corpus level (Comb-C) and model level (Comb-M).
We investigate K = {25, 50, 75}% of the ranked general-domain data as the pseudo in-domain corpus for SMT training.
(59/84)
Experiment: Individual Model Results



• With added linguistic information, fn and ft improve on the baseline by about 1 BLEU point (at K = 75%).
• t (POS only) performs poorly because it lacks lexical information.
• Given their performance, we combine f, fn and ft.

System        25%              50%              75%
GI-Baseline   40.20
f             31.91 (-8.29)    38.83 (-1.37)    41.37 (+1.17)
t             21.20 (-19.00)   27.90 (-12.30)   27.90 (-12.30)
fn            31.93 (-8.27)    37.86 (-2.34)    40.93 (+0.73)
ft            30.00 (-10.20)   38.74 (-1.46)    41.81 (+1.61)
Table 8: Translation Quality of Adapted Models
(60/84)
Experiment: Combination Model Results



• Both combination methods are better than the best individual model (by +0.11 to +0.64 BLEU).
• Combination may inherit the advantages of each linguistically based method (lexicon, sparsity, language style).
• Highly inflected language pairs such as English-German may benefit more from additional linguistic information.

System        25%              50%              75%
GI-Baseline   40.20
f             31.91 (-8.29)    38.83 (-1.37)    41.37 (+1.17)
ft            30.00 (-10.20)   38.74 (-1.46)    41.81 (+1.61)
Comb-C        33.01 (-7.19)    39.07 (-1.13)    41.92 (+1.72)
Comb-M        32.74 (-7.46)    38.95 (-1.25)    42.01 (+1.81)
Table 9: Translation Quality of Adapted Models
(61/84)
Part III: Real-Life System
(62/84)
Real-life Environment
To test the robustness and language independence of the domain adaptation approaches, we evaluate them in a real-life system.
WMT (held since 2005) is the best-known workshop with high-quality shared tasks on machine translation. We took part in the WMT2014 medical translation task [43]:
• Czech-English, French-English and German-English (6 language pairs).
• Very large resources: up to 36 million general-domain parallel sentences and 4 million in-domain parallel sentences.
• Medical texts are more complex and contain, for example, chemical formulae such as “CH2-(OCH2CH2)n-”.
[43] http://www.statmt.org/wmt14/.
(63/84)
WMT2014 Medical Translation Task
Based on an analysis of medical texts, we present a number of domain adaptation techniques and approaches:
• Task-oriented pre-processing.
• Language model adaptation.
• Translation model adaptation.
• Numeric adaptation.
• Hyphenated word adaptation.
• Combination of all the above methods.
Figure 14: Results and Rankings of Our System
Result: ranked 1st on three language pairs and 2nd on the others.
(64/84)
BenTu System
Based on these models (medical domain), we developed my first online translator, BenTu, a domain-specific multi-tier SMT system [44].
• Three layers: pre-processing, decoder and post-processing.
• Easy to add new language pairs and domains.
Figure 15: Framework of BenTu System
[44] The architecture is designed with reference to the PLuTO project: Tinsley, John, Andy Way, and Paraic Sheridan. "PLuTO: MT for online patent translation." Association for Machine Translation in the Americas, 2010.
(65/84)
BenTu System
Figure 16: User Interface of BenTu System
(66/84)
Part V: Conclusion
(67/84)
Thesis Contribution
To solve the problems in domain-specific SMT, we proposed:
• Data Selection methods, as described:
o New data selection criterion
o Combination model
o Linguistically-augmented data selection
• Domain Focused Web-Crawling:
o Integrated models for cross-language document alignment
o Combining topic classifier and perplexity for filtering
• A real-life domain-specific SMT system, developed on the basis of a number of adapted models.
(68/84)
Total Contribution
Figure 17: My work in the past three years
(69/84)
Future Work



Data Selection
o Graphical model and label propagation
o Neural language model
Domain Focused Web-Crawling
o Improve the performance by mining the in-domain
dictionary.
Real-life domain-specific SMT
o Extend to more language pairs: Chinese, Japanese etc.
o Extend to more domains: science technology, laws and
news
(70/84)
My Publications
Journal Papers
1, Antonio Toral, Pavel Pecina, Longyue Wang, Josef van Genabith. 2014.
Linguistically-augmented Perplexity-based Data Selection for Language
Models. Computer Speech and Language (accepted). (IF=1.463)
2, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, and Junwen Xing.
2013. A Systematic Comparison of Data Selection Criteria for SMT Domain
Adaptation. The Scientific World Journal, vol. 2014, Article ID 745485, 10
pages. (IF=1.730)
3, Long-Yue WANG, Derek F. WONG, Lidia S. CHAO. 2012. TQDL: Integrated
Models for Cross-Language Document Retrieval. International Journal of
Computational Linguistics and Chinese Language Processing (IJCLCLP), pages
15-32. (THCI Core)
Conference Papers
4, Longyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco
Oliveira. 2014. Combining Domain Adaptation Approaches for Medical Text
Translation. In Proceedings of the Ninth Workshop on Statistical Machine
Translation. (ACL Anthology and EI)
(71/84)
My Publications
5, Yi Lu, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco
Oliveira. (2014) "Domain Adaptation for Medical Text Translation using Web
Resources". In Proceedings of the Ninth Workshop on Statistical Machine
Translation. (ACL Anthology and EI)
6, Yiming Wang, Longyue Wang, Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, Yi Lu. 2014. Factored Statistical Machine Translation for Grammatical Error Correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL 2014), pages 83-90. (ACL Anthology and EI)
7, Longyue Wang, Derek F. Wong, Lidia S. Chao, Junwen Xing, Yi Lu, Isabel
Trancoso. 2013. Edit Distance: A New Data Selection Criterion for SMT Domain
Adaptation. In Proceedings of Recent Advances in Natural Language
Processing, pages 727-732. (ACL Anthology and EI)
8, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yi Lu, Junwen Xing. 2013.
iCPE: A Hybrid Data Selection Model for SMT Domain Adaptation. In
Proceedings of the 12th China National Conference on Computational
Linguistics (12th CCL), Lecture Notes in Artificial Intelligence (LNAI) Springer
series, pages 280-290. (EI)
(72/84)
My Publications
9, Junwen Xing, Longyue Wang, Derek F. Wong, Lidia S. Chao, Xiaodong Zeng.
2013. UMChecker: A Hybrid System for English Grammatical Error Correction.
In Proceedings of the Seventeenth Conference on Computational Natural
Language Learning (CoNLL 2013), pages 34-42. (ACL Anthology and EI)
10, Longyue WANG, Shuo Li, Derek F. WONG, Lidia S. CHAO. 2012. A Joint
Chinese Named Entity Recognition and Disambiguation System. In Proceedings of
the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing
(CLP2012), pages 146-151. (ACL Anthology)
11, Longyue WANG, Derek F. WONG, Lidia S. CHAO, Junwen Xing. 2012.
CRFs-Based Chinese Word Segmentation for Micro-Blog with Small-Scale Data.
In Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese
Language Processing (CLP2012), pages 51-57. (ACL Anthology)
12, Long-Yue Wang, Derek F. WONG, Lidia S. CHAO. 2012. An Experimental
Platform for Cross-Language Document Retrieval. The 2012 International
Conference on Applied Science and Engineering (ICASE2012), pages 3325-3329. (EI)
(73/84)
My Publications
13, Longyue Wang, Derek F. WONG, Lidia S. CHAO. 2012. An Improvement in
Cross-Language Document Retrieval Based on Statistical Models. The Twenty-Fourth
Conference on Computational Linguistics and Speech Processing
(ROCLING 2012), pages 144-155. (ACL Anthology and EI)
14, Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco
Oliveira, Yi Lu, Shuo Li, Yiming Wang, Longyue Wang. 2014. UM-Corpus: A
Large English-Chinese Parallel Corpus for Statistical Machine Translation. In
Proceedings of the 9th Edition of the Language Resources and Evaluation
Conference (LREC2014), pages 1837-1842. (EI)
(74/84)
Thank You!
謝謝!
Obrigado!
(75/84)
(76/84)
Appendix
(77/84)
Related Work
 Zhao et al. [10] first used information retrieval
techniques to retrieve sentences from a monolingual corpus to
build an LM, and then interpolated it with a general-background LM.
 Hildebrand et al. [11] extended this to sentence pairs, which are
used to train a domain-specific TM.
 Lü et al. [12] further proposed re-sampling and re-weighting
methods for online and offline TM optimization.
[10] Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query
models. In Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Geneva,
Switzerland.
[11] Almut Silja Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the translation model for statistical machine
translation based on information retrieval. In 10th Annual Conference of the European Association for Machine Translation (EAMT 2005). Budapest,
Hungary.
[12] Yajuan Lü, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and
optimization. Proceedings of EMNLP-CoNLL. pp. 343–350.
(78/84)
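The retrieval idea behind [10] and [11] can be illustrated with a minimal sketch, assuming whitespace tokenization, a smoothed idf, and a toy sentence pool; the function names (`tfidf_vectors`, `cosine`, `retrieve`) are hypothetical and not taken from the cited systems. Each in-domain sentence acts as a query, and the general-domain sentences closest to it under cosine tf-idf are kept.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption): terms found in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query, pool, top_n):
    """Return the top_n sentences of `pool` most similar to the in-domain `query`."""
    docs = [s.split() for s in [query] + pool]
    vecs = tfidf_vectors(docs)
    scored = [(cosine(vecs[0], v), s) for v, s in zip(vecs[1:], pool)]
    scored.sort(key=lambda x: -x[0])
    return [s for _, s in scored[:top_n]]

# Toy general-domain pool (invented for illustration).
pool = ["the patient received a dose of insulin",
        "the stock market fell sharply today",
        "insulin dose was adjusted for the patient"]
selected = retrieve("patient insulin dose", pool, top_n=2)
print(selected)
```

The selected sentences would then be used to train the adapted LM (or, with sentence pairs, the adapted TM).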
Related Work
 In language modeling, Gao et al. [13] and Moore and Lewis
[14] used perplexity-based scores to adapt LMs.
 This approach was first applied to SMT adaptation by Yasuda
et al. [15] and Foster et al. [16].
 Axelrod et al. [17] further improved the performance of TM
adaptation by considering bilingual information.
[13] Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese.
ACM Transactions on Asian Language Information Processing (TALIP). 1:3–33.
[14] Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. Proceedings of ACL: Short Papers. pp.
220–224.
[15] Keiji Yasuda and Eiichiro Sumita. 2008. Method for building sentence-aligned corpus from wikipedia. In 2008 AAAI Workshop on
Wikipedia and Artificial Intelligence (WikiAI08).
[16] George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine
translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451–459. Association for
Computational Linguistics, Cambridge, Massachusetts.
[17] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In: Proceedings of
EMNLP. pp. 355–362.
(79/84)
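As a rough illustration of perplexity-based selection in the style of Moore and Lewis [14], the sketch below ranks candidate sentences by the difference between their cross-entropy under an in-domain LM and under a general-domain LM (lower is more in-domain). The add-one smoothed unigram model, the class and function names, and the toy data are all assumptions made for brevity; real systems use higher-order n-gram LMs.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one smoothed unigram LM; a stand-in for the n-gram LMs used in practice."""
    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unknown words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of the sentence in bits."""
        words = sentence.split()
        logp = sum(math.log2((self.counts[w] + 1) / (self.total + self.vocab_size))
                   for w in words)
        return -logp / max(len(words), 1)

def cross_entropy_difference_rank(in_domain, general, pool):
    """Sort `pool` by H_in(s) - H_gen(s), most in-domain first."""
    vocab = {w for s in in_domain + general for w in s.split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_gen = UnigramLM(general, vocab)

    def score(s):
        return lm_in.cross_entropy(s) - lm_gen.cross_entropy(s)

    return sorted(pool, key=score)

# Toy corpora (invented for illustration).
in_domain = ["the patient takes insulin", "insulin dose for the patient"]
general = ["the market fell today", "the team won the game today"]
pool = ["the market won today", "the patient dose"]
ranked = cross_entropy_difference_rank(in_domain, general, pool)
print(ranked[0])
```

A bilingual variant in the spirit of [17] would sum this difference over the source and target sides of each sentence pair.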
Related Work
After selection, we have a pseudo in-domain sub-corpus in
addition to any available in-domain corpus. Mixture modeling
integrates the language models or translation models trained
on these different corpora.
 Foster and Kuhn [18] investigate linear and log-linear
interpolation for individual language models trained by
different corpora.
 Linear interpolation has been widely used for SMT [19].
 Alternatively, the translation models can be added to the
global log-linear SMT model as features, with weights
optimized through minimum-error-rate training (MERT) [20].
[18] George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical
Machine Translation, StatMT ’07, pages 128–135. Association for Computational Linguistics, Prague, Czech Republic.
[19] Graeme Blackwood, Adrià de Gispert, Jamie Brunning, and William Byrne. 2008. European language translation with weighted finite state
transducers: The CUED MT system for the 2008 ACL workshop on SMT. In Proceedings of the Third Workshop on Statistical Machine
Translation, pages 131–134. Association for Computational Linguistics,
Columbus, Ohio.
[20] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran et al. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of ACL. pp. 177–180.
(80/84)
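The two combination schemes named above can be written down in a few lines. This is a minimal sketch under stated assumptions: the probabilities are given directly as numbers, and the interpolation weight is chosen by a simple grid search on held-out perplexity, which is a stand-in and not the MERT procedure of [20]. All function names are hypothetical.

```python
import math

def linear_interpolation(p_in, p_gen, lam):
    """Linear mixture: lam * p_in + (1 - lam) * p_gen."""
    return lam * p_in + (1.0 - lam) * p_gen

def log_linear_score(p_in, p_gen, w_in, w_gen):
    """Unnormalized log-linear combination: weighted sum of log-probabilities."""
    return w_in * math.log(p_in) + w_gen * math.log(p_gen)

def best_lambda(dev_probs, grid=tuple(i / 10 for i in range(1, 10))):
    """Pick the weight that minimizes perplexity on held-out (p_in, p_gen) pairs."""
    def perplexity(lam):
        logsum = sum(math.log(linear_interpolation(a, b, lam)) for a, b in dev_probs)
        return math.exp(-logsum / len(dev_probs))
    return min(grid, key=perplexity)

# Held-out word probabilities under the in-domain and general models (invented).
dev_probs = [(0.5, 0.1), (0.4, 0.1)]
lam = best_lambda(dev_probs)
print(lam, linear_interpolation(0.2, 0.1, 0.5))
```

In the log-linear case the weights `w_in` and `w_gen` play the role of feature weights in the global SMT model, which is where tuning such as MERT applies.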
Experimental Setup
Overall Running Time:
The environment is the HPC cluster Pearl. Computing node: Intel
Xeon X5675 CPU, 24 cores, 180 GB.
 Data Selection:
Method              2.5 million  5 million  7.5 million  10 million
VSM (GPU)           8 hr         15 hr      29 hr        41 hr
Perplexity          20 min       25 min     30 min       40 min
String-Diff. (GPU)  22 hr        40 hr      62 hr        70 hr
 SMT:
Task                2.5 million  5 million  7.5 million  10 million
Training            4 hr         13 hr      23 hr        32 hr
Tuning              1 hr         2 hr       4 hr         6 hr
(81/84)
Experimental Setup
Corpus Processing:
 We propose better data processing steps [29] for the domain
adaptation task.
 For Chinese segmentation, we use our in-house system [30]. For
other languages, we use the European tokenizer [31].
 Linguistic information is extracted with the Stanford CoreNLP
toolkit [32].
 For other steps, such as case processing (truecasing) and
length cleaning (1-80), we use Moses scripts.
[29] Longyue Wang, Yi Lu, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco Oliveira. (2014) "Combining Domain Adaptation
Approaches for Medical Text Translation". In Proceedings of the Ninth Workshop on Statistical Machine Translation.
[30] Longyue WANG, Derek F. WONG, Lidia S. CHAO, Junwen Xing. (2012). "CRFs-Based Chinese Word Segmentation for Micro-Blog
with Small-Scale Data." Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2012), pages 51–57.
[31] Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit. Vol. 5. pp. 79–86.
[32] Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford
CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pp. 55-60.
(82/84)
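The length-cleaning step (1-80) can be pictured with the following minimal sketch. It is not the Moses script itself; it assumes whitespace tokenization and only applies the length bounds to both sides of each sentence pair, and the function name `length_clean` is hypothetical.

```python
def length_clean(pairs, min_len=1, max_len=80):
    """Keep sentence pairs whose source and target both have min_len..max_len tokens."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src, tgt))
    return kept

# Toy parallel data (invented): one good pair, one empty source, one over-long source.
pairs = [("a b", "c d"),
         ("", "x"),
         (" ".join(["w"] * 81), "y")]
cleaned = length_clean(pairs)
print(cleaned)
```

Dropping a pair when either side fails the check keeps the two sides of the parallel corpus aligned line by line.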
Experimental Setup
SMT:
 Moses decoder [33], a state-of-the-art open-source
phrase-based SMT system.
 The translation and the re-ordering model relied on
"grow-diag-final" symmetrized word-to-word alignments built
using GIZA++ [34].
 A 5-gram language model was trained using the IRSTLM
toolkit [35], exploiting improved modified Kneser-Ney
smoothing, and quantizing both probabilities and back-off
weights.
[33] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran et al. 2007. Moses: Open source toolkit for statistical machine translation. Proceedings of ACL. pp. 177–180.
[34] Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
29:19–51.
[35] Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models.
Proceedings of Interspeech. pp. 1618–1621.
(83/84)
Experimental Setup
Data Selection:
 For cosine tf-idf and edit distance, we implement them on
GPU.
 For perplexity-based methods, we use the SRILM toolkit
[36] to build 5-gram LMs with interpolated modified
Kneser-Ney discounting.
 We use an end-to-end evaluation method: BLEU [37] serves as
the evaluation metric to reflect the domain-specific
translation quality.
[36] Andreas Stolcke and others. 2002. SRILM-an extensible language modeling toolkit. Proceedings of the International Conference on
Spoken Language Processing. pp. 901–904.
[37] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation.
Proceedings of ACL. pp. 311–318.
(84/84)
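For the edit-distance criterion, a CPU-only sketch of the core computation is shown below. It assumes word-level Levenshtein distance and normalization by the longer sentence length; the normalization and the function names are assumptions for illustration, and the GPU implementation mentioned above is not reproduced.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]

def similarity(s, t):
    """Word-level similarity in [0, 1]; 1.0 means identical token sequences."""
    a, b = s.split(), t.split()
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0

# A general-domain candidate is scored against an in-domain sentence (toy data).
sim = similarity("the patient takes insulin", "the patient needs insulin")
print(sim)
```

Candidates whose similarity to some in-domain sentence exceeds a chosen threshold would be kept as pseudo in-domain data; the threshold is left open here.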