CWMT 2015 Machine Translation Evaluation Guidelines

CWMT 2015 Machine Translation Evaluation Group
Institute of Automation, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences

I. Introduction

The 2015 (11th) China Workshop on Machine Translation (CWMT 2015) will be held at Hefei Institutes of Physical Science, Chinese Academy of Sciences, on September 23-25, 2015. CWMT 2015 will continue the ongoing series of machine translation (MT) evaluation campaigns in order to promote active interaction among participants and help advance the state of the art in MT technology. We hope that both beginners and established research groups will participate in this evaluation campaign.

Compared with past evaluation campaigns, the CWMT 2015 MT evaluation has the following new characteristics. First, it adopts a double-blind evaluation for the first time: the encrypted bilingual data and syntactic trees of one language pair will be presented to the participants, so that the utmost attention can be paid to the translation model itself rather than to pre- and post-processing steps such as lexical analysis, syntactic analysis, and named entity translation. Second, the gray-box evaluation and the manual evaluation adopted in the last evaluation campaign are cancelled. Third, no baseline systems or corresponding gray-box files will be provided.

The sponsor of the CWMT 2015 machine translation evaluation is:
  Chinese Information Processing Society of China

The organizers of this evaluation are:
  Institute of Automation, Chinese Academy of Sciences
  Institute of Computing Technology, Chinese Academy of Sciences

The cooperators of this evaluation include (arranged in alphabetical order):
  Harbin Institute of Technology
  Inner Mongolia University
  Nanjing University
  Northeastern University
  Qinghai Normal University
  Toshiba (China) Research and Development Center
  Xiamen University
  Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences
  Xinjiang University

The resource providers of this evaluation include:
  Datum Data Co., Ltd.
  Harbin Institute of Technology
  Inner Mongolia University
  Institute of Automation, Chinese Academy of Sciences
  Institute of Computing Technology, Chinese Academy of Sciences
  Institute of Intelligent Machines, Chinese Academy of Sciences
  Northeastern University
  Northwest University of Nationalities
  Peking University
  Qinghai Normal University
  Tibet University
  Xiamen University
  Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences
  Xinjiang University

The chairs of this evaluation are:
  WANG Kun (Institute of Automation, Chinese Academy of Sciences)
  JIANG Wenbin (Institute of Computing Technology, Chinese Academy of Sciences)

The committee members of this evaluation include:
  CAO Hailong (Harbin Institute of Technology)
  CHEN Yidong (Xiamen University)
  HAI Yinhua (Inner Mongolia University)
  HUANG Shujian (Nanjing University)
  Maierhaba Aili (Xinjiang University)
  Toudan Cairang (Qinghai Normal University)
  XIAO Tong (Northeastern University)
  YANG YaTing (Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences)
  ZHANG Dakun (Toshiba (China) Research and Development Center)
  ZHANG Jiajun (Institute of Automation, Chinese Academy of Sciences)
  ZHAO Hongmei (Institute of Computing Technology, Chinese Academy of Sciences)

For more information about CWMT 2015 and the MT evaluation tasks, please visit:
  http://www.liip.cn/CWMT 2015/
  http://nlp.ict.ac.cn/evalshow.php?id=2028
II. Evaluation Tasks

The CWMT 2015 MT evaluation campaign consists of six tasks covering 3 domains and 4 language pairs, as listed in Table 1.

  Task ID   Task Name                                            Domain
  CE        Chinese-to-English News Translation                  News
  EC        English-to-Chinese News Translation                  News
  MC        Mongolian-to-Chinese Daily Expression Translation    Daily Expressions
  TC        Tibetan-to-Chinese Government Document Translation   Government Documents
  UC        Uyghur-to-Chinese News Translation                   News
  DB        Double-Blind Evaluation                              ******

  Table 1: CWMT 2015 MT evaluation tasks

III. Evaluation Methods

1. Double-Blind Evaluation

In the double-blind evaluation, the encrypted bilingual data and syntactic trees of one language pair are presented to the participants. This lets the evaluation pay the utmost attention to the translation model itself rather than to pre- and post-processing steps such as lexical analysis, syntactic analysis, and named entity translation.

2. Evaluation Metrics

1) Automatic evaluation

The automatic metrics used in the CWMT 2015 MT evaluation include: BLEU-SBP, BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.

Note:
  All scores of these metrics will be case-sensitive.
  BLEU-SBP will be the primary metric.
  The evaluation of Chinese translations will be based on Chinese characters instead of words. The organizer will convert the full-width Chinese characters in the A3 area of GB2312 in the Chinese translations to half-width characters, so the participants do not need to perform this conversion themselves.
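The organizer performs this full-width-to-half-width conversion, but participants who want to reproduce the normalization locally (for example, when scoring development results) can use a minimal sketch along the following lines. It maps the full-width forms U+FF01-U+FF5E, which correspond to the GB2312 A3 area, to their ASCII counterparts; the script and its function name are illustrative, not the organizer's official tool.

def to_halfwidth(text):
    """Map full-width characters to their half-width counterparts.

    The full-width forms U+FF01..U+FF5E are offset from ASCII
    0x21..0x7E by 0xFEE0; the ideographic space U+3000 maps to an
    ordinary space.
    """
    result = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFEE0))
        elif code == 0x3000:
            result.append(" ")
        else:
            result.append(ch)
    return "".join(result)

# Full-width "ＡＢＣ１２３！" becomes half-width "ABC123!"
print(to_halfwidth("ＡＢＣ１２３！"))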
IV. Evaluation Procedure

The CWMT 2015 MT evaluation will proceed in the following four stages:

1. The training stage. The organizer releases the training data and development sets. The participants train and tune their systems on the released data.

2. The test stage. The organizer releases the source files of the test sets. The participants run their systems and submit the final translation results by the deadline indicated in Appendix A. Please refer to Appendix C for the format of the submission files.

3. The evaluation stage. The organizer evaluates the quality of the submitted files and reports the evaluation results.

4. The reporting stage. The organizer opens the online automatic evaluation platform, on which the participants can score their experimental results on the test set. The participants submit their systems' technical reports (refer to Appendix D for instructions), prepare presentations for the CWMT 2015 workshop, and attend the event.

Please refer to Appendix A for the schedule of CWMT 2015.

V. Evaluation Data Description

To support training, parameter tuning, and evaluation of MT systems, the organizer will offer several language resources, including the training corpora, development sets, and test sets (source files).

1. The Training Data

In addition to the corpora offered in past CWMT MT evaluation campaigns, the CWMT 2015 MT evaluation campaign will update and add the following training data:
  IMU Mongolian-Chinese Parallel Corpus (Version 2015)
  QNU Tibetan-Chinese Parallel Corpus (Version 2015)
  XTIPC Uyghur-Chinese Parallel Corpus (Version 2015)
  CASIA English-Chinese Web Parallel Corpus (Version 2015)
  ICTCAS English-Chinese Web Parallel Corpus (Version 2015)

Participants can obtain the training data of the tasks that they register for. Please refer to Appendix E for the list of the training data released by the organizer.

2. The Development Sets

The organizer will provide both the source file and the reference file of the development set for each task. The sizes and the providers of the development sets are listed in Table 2.

  Task ID   Size              Provider
  CE        1,000 sentences   Institute of Computing Technology, CAS
  EC        1,000 sentences   Institute of Computing Technology, CAS
  MC        1,000 sentences   Inner Mongolia University
  TC        1,000 sentences   Qinghai Normal University
  UC        1,000 sentences   Xinjiang University
  DB        1,000 sentences   Institute of Automation, CAS

  Table 2: Sizes and providers of the development sets

3. The Test Sets

The organizer will offer the source file of the test set for each task. The sizes and the providers of the test sets are listed in Table 3.

  Task ID   Size              Provider
  CE        1,000 sentences   Institute of Automation, CAS
  EC        1,000 sentences   Institute of Automation, CAS
  MC        1,000 sentences   Inner Mongolia University
  TC        1,000 sentences   Qinghai Normal University
  UC        1,000 sentences   Xinjiang Technical Institute of Physics & Chemistry, CAS
  DB        1,000 sentences   Institute of Automation, CAS

  Table 3: Sizes and providers of the test sets

Please refer to Appendix C for instructions regarding the format of the development and test sets.

VI. Evaluation Conditions

1. MT Technology

CWMT 2015 places no restriction on the MT approaches used; participants can choose any approach, including rule-based MT, example-based MT, statistical MT, etc.

Participants are allowed to use system combination techniques, but are required to clarify in the technical report: (1) the combination technique, and (2) the single systems involved and their performance. In the released evaluation results, systems using combination techniques will be marked so as to be distinguished from single systems.

2. Training Conditions

For statistical machine translation systems, two training conditions are allowed in CWMT 2015: constrained training and unconstrained training.

1) Constrained training

Under this condition, only data provided by the evaluation organizer can be used for system development. Systems entered in the constrained training condition allow for direct comparison of different algorithmic approaches. System development must adhere to the following restrictions:

  The primary systems of participants must be developed under the constrained training condition.

  Tools for bilingual translation (such as named entity translators and syllable-to-character converters) must not use any additional resources. The exceptions are tools for translating numerals and time expressions.

  For any evaluation task, systems can only make use of the corpora related to that task. Using the corpora of any other evaluation task is not acceptable, even if the participant takes part in more than one task.

  Tools for monolingual processing (such as lexical analyzers, parsers, and named entity recognizers) are not subject to the above restrictions.

2) Unconstrained training

Systems entered in the unconstrained training condition may demonstrate the gains achieved by adding data from other sources. System development must adhere to the following restrictions:

  If participants use additional data, they should declare whether the data are publicly accessible. If so, the participants should give the origin of the data; if not, the participants should describe the content and size of the data in detail in the system description and the technical report.

  The contrast systems of participants can be developed under the unconstrained training condition.
A rule-based MT module or system is allowed to use hand-crafted translation knowledge sources such as rules, templates, and dictionaries. Participants using a rule-based MT system are required to describe, in the system description and the technical report, the size of these knowledge sources and the ways they were constructed and used.

VII. Final Submission

The final translation results must be returned to the organizer by the deadline indicated in Appendix A. For each registered task, each participant should submit one final translation result as the primary result and at most three other translation results as contrast results. Please refer to Appendix C for the format of the submission files.

In the reporting stage, the participants will be required to submit their systems' technical reports to the organizer (refer to Appendix D for instructions).

VIII. Appendix

This document includes the following appendixes:
  Appendix A: Evaluation Calendar
  Appendix B: Registration Form
  Appendix C: Format of MT Evaluation Data
  Appendix D: Requirements of Technical Report
  Appendix E: List of Resources Released by the Organizer

Appendix A: Evaluation Calendar

  1.  May 20, 2015                   Registration deadline
  2.  June 5, 2015                   Training and development data released
  3.  July 13, 2015, 10:00 (GMT+8)   CE, EC, and UC tasks' test data e-mailed to participants
  4.  July 17, 2015, 17:30 (GMT+8)   Deadline for submission of the CE, EC, and UC tasks' translation results
  5.  July 20, 2015, 10:00 (GMT+8)   DB, MC, and TC tasks' test data e-mailed to participants
  6.  July 24, 2015, 17:30 (GMT+8)   Deadline for submission of the DB, MC, and TC tasks' translation results
  7.  August 11, 2015                Preliminary release of evaluation results to participants
  8.  August 11, 2015                Online scoring website opens (closes on September 25)
  9.  August 20, 2015                Deadline for submitting technical reports
  10. August 26, 2015                Reviews of technical reports sent to participants, who should revise their reports accordingly
  11. August 30, 2015                Deadline for submitting camera-ready technical reports
  12. September 23-25, 2015          CWMT 2015 workshop; official public release of results; online scoring website closes

Appendix B: Registration Form

Any organization engaged in MT research or development can register for the CWMT 2015 evaluation. Participating sites should fill in the following form and send it to the organizer by both email and post (or fax). The posted (or faxed) copy should bear the signature of the person in charge or the stamp of the participating organization.

Each participant must pay a registration fee to cover the expenditure on resource development, workshop organization, and evaluation organization. Note that the registration fee covers workshop attendance for one person, even if the participant registers for multiple translation tasks. The registration fees of all tasks are listed as follows:

  Task ID   Registration Fee (RMB)
            Research Institution    Industry
  CE/EC     3,000                   6,000
  MC        2,000                   4,000
  TC        2,000                   4,000
  UC        2,000                   4,000
  DB        2,000                   4,000

Note that any institution affiliated with a corporation belongs to the "industry" category. As the CE and EC tasks share the same training data, the registration fee is the same whether a participant registers for one or both tasks. The DB task is free if the participant registers for any other task.

The deadline for registration is May 20, 2015. Please send the registration form to:

  Name: Dr. Kun WANG
  Email: [email protected]
  Address: Institute of Automation, CAS,
           No. 95 Zhongguancun East Road, Haidian District, Beijing 100190, P. R. China
  Post Code: 100190
  Telephone: +86-10-82544588

Please send the registration fee by either of the following means (both bank transfer and remittance through the post office are accepted):

Bank transfer (preferred):
  Full name of the bank: Industrial and Commercial Bank of China, Beijing Municipal Branch, Haidian Xiqu Sub-branch
  Beneficiary: Chinese Information Processing Society of China
  Bank account number: 0200004509014415619

Remittance through the post office:
  Address: Chinese Information Processing Society of China, No. 4 South Fourth Street, Zhongguancun, Haidian District, Beijing 100190, P. R. China
  Post Code: 100190

For any problem related to bank transfer or remittance, please contact:
  Ms. Lin Lu
  Telephone: +86-10-62562916

Participants outside mainland China should sign an agreement with the organizer if they want to pay the registration fee in another currency (e.g., USD). The exchange rate will be determined by the sponsor according to the official rate on the day of payment.

Registration Form for the CWMT 2015 Machine Translation Evaluation

  Organization Name:
  Address:
  Post Code:
  Contact Person:
  Telephone:
  Email:

  Evaluation Tasks:
    □ Chinese-to-English News Translation
    □ English-to-Chinese News Translation
    □ Mongolian-to-Chinese Daily Expression Translation
    □ Tibetan-to-Chinese Government Document Translation
    □ Uyghur-to-Chinese News Translation
    □ Double-Blind Evaluation

The participating site agrees to commit to the following terms:

1. After receiving the evaluation data, the participating site will process the entire test set following the evaluation guidelines, and submit the results, including the system description and the primary system's results, to the evaluation organizer before the submission deadline.

2. The participating site agrees to submit a formal technical report, attend the CWMT 2015 workshop, and make a presentation there.

3. The participating site confirms that it holds the intellectual property rights of the participating system. If any technology in the participating system is licensed from another person or organization, this will be clearly described in the system description.

4. The registrant confirms that the data obtained in the evaluation, including the training set, the development set, the test set, reference translations, and evaluation tools, will be used only in research related to this evaluation. No other usage is permitted.

5. The participating site agrees that the evaluation data will be used only within the research group that takes part in the evaluation; it will neither be distributed in any way (in writing, electronically, or via network), nor be used by any partner or affiliated organization of the participating site.

6. The participating site agrees to give credit to the resource providers by citing the resources used (e.g., training data, development data, test data, reference translations, and evaluation tools) in its publications and other research output.

7. If a participating site violates terms 4-6, the evaluation sponsor and the resource providers have the right to request that the participating site, and any cooperators and/or affiliated organizations using the resources without a granted license, pay compensation of 3-5 times the cost of the distributed resources. If this is insufficient, the compensation shall be increased to equal the actual loss of the related resource providers.
Signature of the person in charge or stamp of the participating site:

Date:

CWMT 2015 MACHINE TRANSLATION EVALUATION PARTICIPATING SITE AGREEMENT
(Non-profit Agreement)

This agreement is made by and between:

Name of the Participating Site (hereinafter called "the participating site"), a participating site of the CWMT 2015 Machine Translation Evaluation, having its principal place of business at: Address of the Participating Site

AND

Chinese Information Processing Society of China (hereinafter called "the sponsor"), the sponsor of the CWMT 2015 machine translation evaluation, having its principal place at: No. 4 South Fourth Street, Zhongguancun, Beijing, China.

Whereby it is agreed as follows:

1. The sponsor provides the participating site with the training data for the CWMT 2015 machine translation evaluation.

2. The participating site shall pay the registration fee for the evaluation to the sponsor. The registration fee includes the license fee for the training data resources, the CWMT 2015 registration fee for one person from the participating site, and part of the cost of organizing the evaluation. The registration fee is XXX USD. Method of payment: bank transfer.

In witness whereof, intending to be bound, the parties hereto have executed this AGREEMENT by their duly authorized officers.

AUTHORIZED BINDING SIGNATURES:

  ————————                                        ————————
  On behalf of Chinese Information Processing      On behalf of Name of the Participating Site
  Society of China
  Name:                                            Name:
  Title:                                           Title:
  Date:                                            Date:

Appendix C: Format of MT Evaluation Data

This appendix describes the format of the data released by the organizer and of the result files that the participants should submit. All files are encoded in UTF-8. Among them, the development sets (including their references), the test sets, and the final translation result files must be strict XML files (whose format is defined by the XML DTD described in Section III below) encoded in UTF-8 (with BOM); all the others are plain text files encoded in UTF-8 (without BOM).

I. Data Released by the Organizer

The organizer will release three kinds of data: training sets, development sets, and test sets. Here we take the Chinese-to-English translation task as an example for illustration purposes.

1. Training Set

The training data contains one sentence per line. The parallel corpus of each language pair consists of a source file and a target file, which contain the source and target sentences respectively. Figure 1 illustrates the data format of the parallel corpus.

  Source file:                    Target file:
  战法训练有了新的突破            Tactical training made new breakthrough
  第一章总则                      Chapter I general rules
  人民币依其面额支付              The renminbi is paid by denomination
  ……                              ……

  Figure 1: Example of the parallel corpus

Note: For the Double-Blind Evaluation task, the evaluation organizer will provide corpora that have already been preprocessed. The corpora for the other language pairs are not preprocessed.
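Because the source and target files are line-aligned, the n-th line of the source file corresponds to the n-th line of the target file. As a minimal illustration (the file names here are hypothetical, not names used by the organizer), the sentence pairs can be read in Python as follows:

def read_parallel_corpus(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src_file, \
         open(tgt_path, encoding="utf-8") as tgt_file:
        for src_line, tgt_line in zip(src_file, tgt_file):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")

# Hypothetical file names for a Chinese-English training corpus.
for zh, en in read_parallel_corpus("train.zh", "train.en"):
    print(zh, "=>", en)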
2. Development Set and Test Set

The development set and the test set consist of source files and reference files.

(1) Source File

A source file contains one single srcset element, which has the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti, db};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti, db}.

A srcset element contains one or more DOC elements, and each DOC element carries a single attribute, docid, which indicates the genre of the DOC. Each DOC element contains several seg elements with an id attribute. One or more segments may be encapsulated inside other elements, such as p. Only the text surrounded by seg elements is to be translated. Figure 2 shows an example of the source file.

<?xml version="1.0" encoding="UTF-8"?>
<srcset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news">
<p>
<seg id="1">sentence 1</seg>
<seg id="2">sentence 2</seg>
……
</p>
……
</DOC>
</srcset>

  Figure 2: Example of the source file

(2) Reference File

A reference file contains a refset element. Each refset element has the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti}.

Each refset element contains several DOC elements. Each DOC has two attributes:
  docid: the genre of the DOC;
  site: the indicator distinguishing the different references, one element of the set {1, 2, 3, 4}.

Figure 3 shows an example of the reference file.

<?xml version="1.0" encoding="UTF-8"?>
<refset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news" sysid="ref" site="1">
<p>
<seg id="1">reference 11</seg>
<seg id="2">reference 12</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="2">
<p>
<seg id="1">reference 21</seg>
<seg id="2">reference 22</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="3">
<p>
<seg id="1">reference 31</seg>
<seg id="2">reference 32</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="4">
<p>
<seg id="1">reference 41</seg>
<seg id="2">reference 42</seg>
……
</p>
……
</DOC>
</refset>

  Figure 3: Example of the reference file

II. Data to Be Submitted by the Participants

1. File Naming

Please name the submitted files according to the following naming scheme (here "ce", "ict", and "2015" are used as examples of the Task ID, the Participant ID, and the year of the test data, respectively):

  final translation result: Task ID - year of the test data - Participant ID - primary vs. contrast system - System ID.xml
  Examples: ce-2015-ict-primary-a.xml, ce-2015-ict-contrast-c.xml

2. Final Translation Result

The final submission file contains a tstset element with the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti}.

The tstset element contains a system element with the following attributes:
  site: the label of the participant;
  sysid: the identification of the MT system.

The value of the system element is a description of the participating system, including the following information:

  Hardware and software environment: the operating system and its version, the number of CPUs, the CPU type and frequency, the system memory size, etc.

  Execution time: the time from accepting the input to generating the output.

  Technology outline: an outline of the main technology and the important parameters of the participating system. If the system uses system combination techniques, the single systems being combined and the combination techniques should be described here.

  Training data: a description of the training data and development data used for system training.

  External technology: a declaration of the external technologies used in the participating system but not owned by the participating site, including open-source code, free software, shareware, and commercial software.

The content of each DOC element is exactly the same as that of the test set's source file, described above. Figure 4 shows an example of the final submission file.

<?xml version="1.0" encoding="UTF-8"?>
<tstset setid="zh_en_news_trans" srclang="zh" trglang="en">
<system site="unit name" sysid="system identification">
description information of the participating system
............
</system>
<DOC docid="document name" sysid="system identification">
<p>
<seg id="1">submitted translation 1</seg>
<seg id="2">submitted translation 2</seg>
……
</p>
……
</DOC>
</tstset>

  Figure 4: Illustration of the final submission file
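One convenient way to guarantee that a submission mirrors the DOC/seg structure of the test set's source file is to parse the source file and replace the text of each seg with the system output. The following sketch uses Python's standard xml.etree.ElementTree; it is a minimal illustration, not an official tool. The file names, site label, system ID, and the translate() stub are placeholders, and for simplicity it assumes a single p element per DOC.

import xml.etree.ElementTree as ET

def translate(sentence):
    """Placeholder for the participant's own MT system."""
    return "translation of: " + sentence

# Parse the released source file (the file name is hypothetical).
srcset = ET.parse("ce-2015-test-src.xml").getroot()

# Build a tstset element with the same set-level attributes.
tstset = ET.Element("tstset", attrib=srcset.attrib)

# The system element carries the site label and system ID; its text
# is the free-form system description required by the guidelines.
system = ET.SubElement(tstset, "system", site="ict", sysid="primary-a")
system.text = "description information of the participating system"

# Copy every DOC, keeping docid and seg ids, and replace the text of
# each seg with the system's translation.
for doc in srcset.iter("DOC"):
    out_doc = ET.SubElement(tstset, "DOC",
                            docid=doc.get("docid"), sysid="primary-a")
    out_p = ET.SubElement(out_doc, "p")
    for seg in doc.iter("seg"):
        out_seg = ET.SubElement(out_p, "seg", id=seg.get("id"))
        out_seg.text = translate(seg.text or "")

# ElementTree escapes &, < and > in text automatically. Note that it
# writes UTF-8 without a BOM, while the guidelines require UTF-8 with
# BOM for the submission file, so add the BOM separately if needed.
ET.ElementTree(tstset).write("ce-2015-ict-primary-a.xml",
                             encoding="utf-8", xml_declaration=True)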
Note: Please note that the CWMT 2015 evaluation adopts a strict XML file format. The main difference between this XML file format and the NIST evaluation file format lies in the following: in an XML file, if any of the following five characters occurs in the text outside tags, it must be replaced by its escape sequence:

  Character   Escape sequence
  &           &amp;
  <           &lt;
  >           &gt;
  "           &quot;
  '           &apos;

As for Chinese encoding, the middle dot in a foreign person name should be written as "E2 80 A2" in UTF-8, for example, "托德·西蒙斯".

As for English tokenization, the tokenization should be consistent with the "normalizeText" function of the Perl script "mteval-v11b.pl" released by NIST.
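Participants who assemble the XML by string concatenation (rather than with an XML library, which escapes these characters automatically) can apply the five replacements in the table above with the Python standard library. A minimal sketch:

from xml.sax.saxutils import escape

def escape_for_seg(text):
    """Replace the five reserved XML characters with their escape
    sequences: escape() handles &, < and > by default, and the two
    quote characters are supplied as additional entities."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})

# 'A&B <tag> "quoted"' becomes 'A&amp;B &lt;tag&gt; &quot;quoted&quot;'
print(escape_for_seg('A&B <tag> "quoted"'))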
III. Description of the CWMT 2015 XML files' document structure

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT srcset (DOC+)>
<!ATTLIST srcset setid CDATA #REQUIRED>
<!ATTLIST srcset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST srcset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT refset (DOC+)>
<!ATTLIST refset setid CDATA #REQUIRED>
<!ATTLIST refset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST refset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT tstset (system+, DOC+)>
<!ATTLIST tstset setid CDATA #REQUIRED>
<!ATTLIST tstset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST tstset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT system (#PCDATA)>
<!ATTLIST system site CDATA #REQUIRED>
<!ATTLIST system sysid CDATA #REQUIRED>
<!ELEMENT DOC (p*)>
<!ATTLIST DOC docid CDATA #REQUIRED>
<!ATTLIST DOC site CDATA #IMPLIED>
<!ELEMENT p (seg*)>
<!ELEMENT seg (#PCDATA)>
<!ATTLIST seg id CDATA #REQUIRED>

Appendix D: Requirements of Technical Report

All participating sites should submit a technical report to the 11th China Workshop on Machine Translation (CWMT 2015). The technical report should describe the technologies used in the participating system(s) in detail, in order to inform the reader how the reported results were obtained. A good technical report should be detailed enough that the reader could replicate the work described in it. The report should be no shorter than 5,000 Chinese characters or 3,000 English words. Generally, a technical report should provide the following information:

  Introduction: gives the background information, introduces the evaluation tasks participated in, and outlines the participating systems.

  System: describes the architecture and each module of the participating system in detail, with the focus on the technologies used in the system. Any open technology adopted should be explicitly declared; technologies developed by the participating site itself should be described in detail. If the participating site uses system combination techniques, the single systems as well as the combination technique should be described. If the participating site uses hand-crafted translation knowledge sources such as rules, templates, and dictionaries, the size of the knowledge sources and the ways they were constructed and used should be described.

  Data: gives a detailed description of the data used in system training and of how the data were processed.

  Experiment: gives a detailed description of the experimental process, the parameters, and the results obtained on the evaluation set, and analyzes the results.

  Conclusion: (open)

Appendix E: List of Resources Released by the Organizer

1. The Chinese-English resources provided by the organizer

  ChineseLDC resource ID: CLDC-LAC-2003-004
  Name: CAS-ICT & CAS-IA Chinese-English Sentence-Aligned Bilingual Corpus (Extended Version)
  Providers: Institute of Computing Technology, CAS & Institute of Automation, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 252,329 sentence pairs
  Description: The original corpus includes 3,384 bilingual text files containing 209,486 Chinese-English sentence pairs, of which 3,098 files with 107,436 sentence pairs were developed by the Institute of Automation (IA), CAS, and the other 250 files with 102,050 sentence pairs were developed by the Institute of Computing Technology (ICT), CAS. The current version is an extended version with additional data, also provided by IA and ICT, and contains 252,329 sentence pairs overall. The resource was developed with the support of the National Basic Research Program (973 Program). It is a large-scale, sentence-aligned, multi-domain, multi-style Chinese-English bilingual corpus.

  ChineseLDC resource ID: CLDC-LAC-2003-006
  Name: PKU Chinese-English/Chinese-Japanese Parallel Corpus (Chinese-English part)
  Provider: Institute of Computational Linguistics, Peking University
  Languages: Chinese-to-English
  Domain: Multi-domain
  Size: 200,000 Chinese-English sentence pairs
  Description: The corpus was supported by a subproject of the National High Technology Research and Development Program of China (863 Program) entitled "Chinese-English/Chinese-Japanese parallel corpus" (Grant No. 2001AA114019).

  Name: XMU English-Chinese Movie Subtitle Corpus
  Provider: Xiamen University
  Languages: English-to-Chinese
  Domain: Dialog
  Size: 176,148 sentence pairs
  Description: Subtitles of movies.

  Name: HIT-IR English-Chinese Sentence-Aligned Corpus
  Provider: IR Laboratory, Harbin Institute of Technology
  Languages: English-to-Chinese
  Domain: Multi-domain
  Size: 100,000 sentence pairs

  Name: HIT-MT English-Chinese Sentence-Aligned Corpus
  Provider: Machine Translation Group, Harbin Institute of Technology
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 52,227 sentence pairs

  Name: Datum English-Chinese Parallel Corpus (Part)
  Provider: Datum Data Co., Ltd.
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain, including textbooks for language education, bilingual books, technological documents, bilingual news, government white papers, government documents, bilingual resources on the web, etc.
  Size: 1,000,000 sentence pairs
  Description: A part of the "Bilingual/Multi-lingual Parallel Corpus" developed by Datum Data Co., Ltd. with the support of the National High Technology Research and Development Program of China (863 Program).
  Name: ICT Web Chinese-English Parallel Corpus (2013)
  Provider: Institute of Computing Technology, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: About 2,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic. The accuracy of the corpus is about 95%. This work was supported by the National Natural Science Foundation of China (Grant No. 60603095).

  Name: ICT Web Chinese-English Parallel Corpus (2015)
  Provider: Institute of Computing Technology, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: More than 2,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic. The Institute of Computing Technology has roughly corrected this corpus, and its accuracy is greater than 99%. The sentences come from three sources: 60% from the web, 20% from movie subtitles, and the remaining 20% from English-Chinese or Chinese-English dictionaries.

  Name: NEU Chinese-English Parallel Corpus
  Provider: Natural Language Processing Group, Northeastern University
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 1,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. Semi-automatic techniques were used to filter out sentence pairs of low quality.

  Name: CASIA Web Chinese-English Parallel Corpus (2015)
  Provider: Institute of Automation, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: About 1,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic.

  ChineseLDC resource ID: 2007-863-001
  Name: SSMT2007 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: News
  Size: The source file of the Chinese-English data contains 1,002 Chinese sentences with 42,256 Chinese characters. The source file of the English-Chinese data contains 955 English sentences with 23,627 English words. There are 4 reference translations made by human experts for each test sentence.
  Description: The test data of the SSMT 2007 MT evaluation, containing data of 2 translation directions (Chinese-English and English-Chinese) in the news domain.
  ChineseLDC resource ID: 2005-863-001
  Name: HTRDP ("863 Program") 2005 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: The data contains two genres: dialog data from Olympics-related domains (game reports, weather forecasts, traffic and hotels, travel, food, etc.) and text data from the news domain.
  Size: The source files of the dialog and text data in the Chinese-to-English and English-to-Chinese directions each contain about 460 sentences; the total number of source sentences is about 1,840. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2005 "863 Program" machine translation evaluation.

  ChineseLDC resource ID: 2004-863-001
  Name: HTRDP ("863 Program") 2004 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Two genres: text data and dialog data. The data covers the general domain and Olympics-related domains, including game reports, weather forecasts, traffic and hotels, travel, food, etc.
  Size: The source files of the Chinese-to-English direction contain dialog data of 400 sentences and text data of 308 sentences. The source files of the English-to-Chinese direction contain dialog data of 400 sentences and text data of 310 sentences. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2004 "863 Program" machine translation evaluation.

  ChineseLDC resource ID: 2003-863-001
  Name: HTRDP ("863 Program") 2003 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: The data covers Olympics-related domains, including game reports, weather forecasts, traffic and hotels, travel, food, etc.
  Size: The source files of the Chinese-to-English direction contain dialog data of 437 sentences and text data of 169 sentences, and the source files of the English-to-Chinese direction contain dialog data of 496 sentences and text data of 322 sentences. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2003 "863 Program" machine translation evaluation.

2. The Mongolian-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: IMU Mongolian-Chinese Parallel Corpus (Version 2013)
  Provider: Inner Mongolia University
  Languages: Chinese-to-Mongolian, Mongolian-to-Chinese
  Domain: Government documents, laws, rules, daily conversation, literature
  Size: 104,975 sentence pairs, including: 1) 67,274 sentence pairs from the CWMT 2011 MT evaluation, covering domains such as daily conversation, literature, government documents, laws, and rules; 2) 37,701 sentence pairs newly added for the CWMT 2013 MT evaluation, including 17,516 sentence pairs from the news domain, 10,394 from government documents, 5,052 from textbooks, and 4,739 from a Mongolian-to-Chinese dictionary.
  Description: Encoded in UTF-8 (without BOM).

  Name: IMU Mongolian-Chinese Parallel Corpus (Version 2015)
  Provider: Inner Mongolia University
  Languages: Chinese-to-Mongolian, Mongolian-to-Chinese
  Domain: Government documents, laws, rules, daily conversation, literature
  Size: 53,578 sentence pairs, including: 5,012 sentence pairs from movies, 12,835 from government documents, 1,872 from books, 5,780 about Mongolian sacrificial rituals, and 28,079 from news.

  Name: IIM Mongolian-Chinese Parallel Corpus
  Provider: Institute of Intelligent Machines, CAS
  Languages: Mongolian-to-Chinese
  Domain: News
  Size: 1,682 sentence pairs
  Description: Encoded in UTF-8 (without BOM).
3. The Tibetan-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: QHNU Tibetan-Chinese Parallel Corpus (Version 2013)
  Provider: Qinghai Normal University
  Languages: Tibetan-to-Chinese, Chinese-to-Tibetan
  Domain: Government documents
  Size: 50,000 sentence pairs. Note: after the organizer deleted the sentence pairs that were also found in other training data, the size of this corpus is now 30,000 sentence pairs.
  Description: The sentence alignment accuracy of the corpus is over 99%. The construction of the corpus was supported by NSFC (Grant No. 61063033) and the 973 Program (Grant No. 2010CB334708).

  Name: QHNU Tibetan-Chinese Parallel Corpus (Version 2015)
  Provider: Qinghai Normal University
  Languages: Tibetan-to-Chinese, Chinese-to-Tibetan
  Domain: Government documents
  Size: 20,000 sentence pairs
  Description: The sentence alignment accuracy of the corpus is over 99%. The construction of the corpus was supported by NSFC (Grant No. 61063033).

  Name: Yang Jin Tibetan-Chinese Parallel Corpus
  Provider: Artificial Intelligence Institute, Xiamen University & Language Technology Institute, Northwest University of Nationalities
  Languages: Chinese-to-Tibetan
  Domain: Multi-domain
  Size: 52,000 sentence pairs
  Description: 1) The sources of the corpus include publications, a Tibetan-Chinese dictionary, and Tibetan-Chinese web text. The corpus was automatically aligned and corrected manually. 2) The alignment accuracy is 100%. 3) The research was supported by NSSFC (Grant No. 05AYY001) and HTRDP (Grant No. 2006AA010107).

  Name: NUN-TU-XMU Tibetan-Chinese Parallel Corpus (2012)
  Provider: Language Technology Institute, Northwest University of Nationalities & Tibet University & Artificial Intelligence Institute, Xiamen University
  Languages: Chinese-to-Tibetan
  Domain: Political writings, law
  Size: 24,000 sentence pairs
  Description: The material is selected from Chinese laws and regulations issued during 2008-2009 and from government reports from 2011 to 2012. All the source materials were scanned, recognized, checked, and processed manually.

4. The Uyghur-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: XJU Uyghur-Chinese Parallel Corpus (Version 2013)
  Provider: Xinjiang University
  Languages: Chinese-to-Uyghur
  Domain: News
  Size: 80,000 sentence pairs

  Name: XTIPC Uyghur-Chinese Parallel Corpus (Version 2015)
  Provider: Xinjiang Technical Institute of Physics & Chemistry, CAS
  Languages: Chinese-to-Uyghur
  Domain: News
  Size: 60,000 sentence pairs
  Description: About 30,000 sentence pairs were newly added on top of Version 2013. 95% of the sentence pairs are from news (2007-2014), and the others are from laws and government reports.

5. Other training data resources not provided by the organizer

  Name: Reuters corpus
  Provider: Reuters
  Languages: English
  Domain: News
  Size: RCV1 (Reuters Corpus, Volume 1), English language, 1996-08-20 to 1997-08-19 (release date 2000-11-03, format version 1, correction level 0). It is distributed on two CDs and contains about 810,000 Reuters English-language news stories, requiring about 2.5 GB of storage for the uncompressed files. The Reuters Corpus can be obtained from: http://trec.nist.gov/data/reuters/reuters.html. It takes about 1 month to be posted from the USA to China.
  Description: Only Volume 1 of the Reuters Corpus is permitted to be used as the training corpus for the English language model in the evaluation. Volume 2 is not permitted to be used in the evaluation.

  Name: SogouCA
  Provider: Sogou Labs
  Languages: Chinese
  Domain: News
  Size: The corpus contains URL and text data collected from 18 news channels on the web from May to June 2008, covering the domains of Olympics, sports, IT, domestic news, and international news. The total size is 1.03 GB after compression.
  Description: The SogouCA corpus can be obtained from: http://www.sogou.com/labs/dl/ca.html. Participants can also obtain the corpus from the organizer. The SogouCA corpus is permitted to be used as the training data of the Chinese language model in the evaluation.