CWMT 2015 Machine Translation Evaluation Guidelines

CWMT 2015 Machine Translation Evaluation Group
Institute of Automation, Chinese Academy of Sciences
Institute of Computing Technology, Chinese Academy of Sciences

I. Introduction

The 2015 (11th) China Workshop on Machine Translation (CWMT 2015) will be held at Hefei Institutes of Physical Science, Chinese Academy of Sciences, on September 23-25, 2015. CWMT 2015 will continue the ongoing series of machine translation (MT) evaluation campaigns in order to promote active interaction among participants and help advance the state of the art in MT technology. We hope that both beginners and established research groups will participate in this evaluation campaign.

Compared with past evaluation campaigns, the CWMT 2015 MT evaluation has the following new characteristics. First, it adopts a double-blind evaluation for the first time: the encrypted bilingual data and syntactic trees of one language pair will be presented to the participants, so that the utmost attention can be paid to the translation model itself rather than to pre- and post-processing steps such as lexical analysis, syntactic analysis, and named entity translation. Second, the gray-box evaluation and the manual evaluation adopted in the last evaluation campaign are cancelled. Third, no baseline systems or corresponding gray-box files will be provided.

The sponsor of the CWMT 2015 machine translation evaluation is:
  Chinese Information Processing Society of China

The organizers of this evaluation are:
  Institute of Automation, Chinese Academy of Sciences
  Institute of Computing Technology, Chinese Academy of Sciences

The cooperators of this evaluation include (arranged in alphabetical order):
  Harbin Institute of Technology
  Inner Mongolia University
  Nanjing University
  Northeastern University
  Qinghai Normal University
  Toshiba (China) Research and Development Center
  Xiamen University
  Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences
  Xinjiang University

The resource providers of this evaluation include:
  Datum Data Co., Ltd.
  Harbin Institute of Technology
  Inner Mongolia University
  Institute of Automation, Chinese Academy of Sciences
  Institute of Computing Technology, Chinese Academy of Sciences
  Institute of Intelligent Machines, Chinese Academy of Sciences
  Northeastern University
  Northwest University of Nationalities
  Peking University
  Qinghai Normal University
  Tibet University
  Xiamen University
  Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences
  Xinjiang University

The chairs of this evaluation are:
  WANG Kun (Institute of Automation, Chinese Academy of Sciences)
  JIANG Wenbin (Institute of Computing Technology, Chinese Academy of Sciences)

The committee members of this evaluation include:
  CAO Hailong (Harbin Institute of Technology)
  CHEN Yidong (Xiamen University)
  HAI Yinhua (Inner Mongolia University)
  HUANG Shujian (Nanjing University)
  Maierhaba Aili (Xinjiang University)
  Toudan Cairang (Qinghai Normal University)
  XIAO Tong (Northeastern University)
  YANG YaTing (Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences)
  ZHANG Dakun (Toshiba (China) Research and Development Center)
  ZHANG Jiajun (Institute of Automation, Chinese Academy of Sciences)
  ZHAO Hongmei (Institute of Computing Technology, Chinese Academy of Sciences)

For more information about CWMT 2015 and the MT evaluation tasks, please visit:
  http://www.liip.cn/CWMT 2015/
  http://nlp.ict.ac.cn/evalshow.php?id=2028
II. Evaluation Tasks

The CWMT 2015 MT evaluation campaign consists of six tasks covering 3 domains and 4 language pairs, as listed in Table 1.

  Task ID   Task Name                                            Domain
  CE        Chinese-to-English News Translation                  News
  EC        English-to-Chinese News Translation                  News
  MC        Mongolian-to-Chinese Daily Expression Translation    Daily Expressions
  TC        Tibetan-to-Chinese Government Document Translation   Government Documents
  UC        Uyghur-to-Chinese News Translation                   News
  DB        Double-Blind Evaluation                              ******

  Table 1: CWMT 2015 MT evaluation tasks

III. Evaluation Methods

1. Double-Blind Evaluation

In the double-blind evaluation, the encrypted bilingual data and syntactic trees of one language pair are presented to the participants. This lets the evaluation pay the utmost attention to the translation model itself rather than to pre- and post-processing steps such as lexical analysis, syntactic analysis, and named entity translation.

2. Evaluation Metrics

1) Automatic evaluation

The automatic metrics used in the CWMT 2015 MT evaluation include: BLEU-SBP, BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.

Note:
  All scores of these metrics will be case-sensitive.
  BLEU-SBP will be the primary metric.
  The evaluation of Chinese translations will be based on Chinese characters instead of words. The organizer will convert the full-width Chinese characters in the A3 area of GB2312 in the Chinese translations to half-width characters, so the participants do not need to perform this conversion themselves.
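The organizer performs this full-width-to-half-width conversion, but participants who want to reproduce the normalization locally (for example, when scoring development results) can use a minimal sketch along the following lines. It maps the full-width forms U+FF01-U+FF5E, which correspond to the GB2312 A3 area, to their ASCII counterparts; the script and its function name are illustrative, not the organizer's official tool.

def to_halfwidth(text):
    """Map full-width characters to their half-width counterparts.

    The full-width forms U+FF01..U+FF5E are offset from ASCII
    0x21..0x7E by 0xFEE0; the ideographic space U+3000 maps to an
    ordinary space.
    """
    result = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            result.append(chr(code - 0xFEE0))
        elif code == 0x3000:
            result.append(" ")
        else:
            result.append(ch)
    return "".join(result)

# Full-width "ＡＢＣ１２３！" becomes half-width "ABC123!"
print(to_halfwidth("ＡＢＣ１２３！"))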
IV. Evaluation Procedure

The CWMT 2015 MT evaluation will proceed in the following four stages:

1. The training stage. The organizer releases the training data and development sets. The participants train and tune their systems on the released data.

2. The test stage. The organizer releases the source files of the test sets. The participants run their systems and submit the final translation results by the deadline indicated in Appendix A. Please refer to Appendix C for the format of the submission files.

3. The evaluation stage. The organizer evaluates the quality of the submitted files and reports the evaluation results.

4. The reporting stage. The organizer opens the online automatic evaluation platform, on which the participants can score their experimental results on the test set. The participants submit their systems' technical reports (refer to Appendix D for instructions), prepare presentations for the CWMT 2015 workshop, and attend the event.

Please refer to Appendix A for the schedule of CWMT 2015.

V. Evaluation Data Description

To support training, parameter tuning, and evaluation of MT systems, the organizer will offer several language resources, including the training corpora, development sets, and test sets (source files).

1. The Training Data

In addition to the corpora offered in past CWMT MT evaluation campaigns, the CWMT 2015 MT evaluation campaign will update and add the following training data:
  IMU Mongolian-Chinese Parallel Corpus (Version 2015)
  QNU Tibetan-Chinese Parallel Corpus (Version 2015)
  XTIPC Uyghur-Chinese Parallel Corpus (Version 2015)
  CASIA English-Chinese Web Parallel Corpus (Version 2015)
  ICTCAS English-Chinese Web Parallel Corpus (Version 2015)

Participants can obtain the training data of the tasks that they register for. Please refer to Appendix E for the list of the training data released by the organizer.

2. The Development Sets

The organizer will provide both the source file and the reference file of the development set for each task. The sizes and the providers of the development sets are listed in Table 2.

  Task ID   Size              Provider
  CE        1,000 sentences   Institute of Computing Technology, CAS
  EC        1,000 sentences   Institute of Computing Technology, CAS
  MC        1,000 sentences   Inner Mongolia University
  TC        1,000 sentences   Qinghai Normal University
  UC        1,000 sentences   Xinjiang University
  DB        1,000 sentences   Institute of Automation, CAS

  Table 2: Sizes and providers of the development sets

3. The Test Sets

The organizer will offer the source file of the test set for each task. The sizes and the providers of the test sets are listed in Table 3.

  Task ID   Size              Provider
  CE        1,000 sentences   Institute of Automation, CAS
  EC        1,000 sentences   Institute of Automation, CAS
  MC        1,000 sentences   Inner Mongolia University
  TC        1,000 sentences   Qinghai Normal University
  UC        1,000 sentences   Xinjiang Technical Institute of Physics & Chemistry, CAS
  DB        1,000 sentences   Institute of Automation, CAS

  Table 3: Sizes and providers of the test sets

Please refer to Appendix C for instructions regarding the format of the development and test sets.

VI. Evaluation Conditions

1. MT Technology

CWMT 2015 places no restriction on the MT approaches used; participants can choose any approach, including rule-based MT, example-based MT, statistical MT, etc.

Participants are allowed to use system combination techniques, but are required to clarify in the technical report: (1) the combination technique, and (2) the single systems involved and their performance. In the released evaluation results, systems using combination techniques will be marked so as to be distinguished from single systems.

2. Training Conditions

For statistical machine translation systems, two training conditions are allowed in CWMT 2015: constrained training and unconstrained training.

1) Constrained training

Under this condition, only data provided by the evaluation organizer can be used for system development. Systems entered in the constrained training condition allow for direct comparison of different algorithmic approaches. System development must adhere to the following restrictions:

  The primary systems of participants must be developed under the constrained training condition.

  Tools for bilingual translation (such as named entity translators and syllable-to-character converters) must not use any additional resources. The exceptions are tools for translating numerals and time expressions.

  For any evaluation task, systems can only make use of the corpora related to that task. Using the corpora of any other evaluation task is not acceptable, even if the participant takes part in more than one task.

  Tools for monolingual processing (such as lexical analyzers, parsers, and named entity recognizers) are not subject to the above restrictions.

2) Unconstrained training

Systems entered in the unconstrained training condition may demonstrate the gains achieved by adding data from other sources. System development must adhere to the following restrictions:

  If participants use additional data, they should declare whether the data are publicly accessible. If so, the participants should give the origin of the data; if not, the participants should describe the content and size of the data in detail in the system description and the technical report.

  The contrast systems of participants can be developed under the unconstrained training condition.
A rule-based MT module or system is allowed to use hand-crafted translation knowledge sources such as rules, templates, and dictionaries. Participants using a rule-based MT system are required to describe, in the system description and the technical report, the size of these knowledge sources and the ways they were constructed and used.

VII. Final Submission

The final translation results must be returned to the organizer by the deadline indicated in Appendix A. For each registered task, each participant should submit one final translation result as the primary result and at most three other translation results as contrast results. Please refer to Appendix C for the format of the submission files.

In the reporting stage, the participants will be required to submit their systems' technical reports to the organizer (refer to Appendix D for instructions).

VIII. Appendix

This document includes the following appendixes:
  Appendix A: Evaluation Calendar
  Appendix B: Registration Form
  Appendix C: Format of MT Evaluation Data
  Appendix D: Requirements of Technical Report
  Appendix E: List of Resources Released by the Organizer

Appendix A: Evaluation Calendar

  1.  May 20, 2015                   Registration deadline
  2.  June 5, 2015                   Training and development data released
  3.  July 13, 2015, 10:00 (GMT+8)   CE, EC, and UC tasks' test data e-mailed to participants
  4.  July 17, 2015, 17:30 (GMT+8)   Deadline for submission of the CE, EC, and UC tasks' translation results
  5.  July 20, 2015, 10:00 (GMT+8)   DB, MC, and TC tasks' test data e-mailed to participants
  6.  July 24, 2015, 17:30 (GMT+8)   Deadline for submission of the DB, MC, and TC tasks' translation results
  7.  August 11, 2015                Preliminary release of evaluation results to participants
  8.  August 11, 2015                Online scoring website opens (closes on September 25)
  9.  August 20, 2015                Deadline for submitting technical reports
  10. August 26, 2015                Reviews of technical reports sent to participants, who should revise their reports accordingly
  11. August 30, 2015                Deadline for submitting camera-ready technical reports
  12. September 23-25, 2015          CWMT 2015 workshop; official public release of results; online scoring website closes

Appendix B: Registration Form

Any organization engaged in MT research or development can register for the CWMT 2015 evaluation. Participating sites should fill in the following form and send it to the organizer by both email and post (or fax). The posted (or faxed) copy should bear the signature of the person in charge or the stamp of the participating organization.

Each participant must pay a registration fee to cover the expenditure on resource development, workshop organization, and evaluation organization. Note that the registration fee covers workshop attendance for one person, even if the participant registers for multiple translation tasks. The registration fees of all tasks are listed as follows:

  Task ID   Registration Fee (RMB)
            Research Institution    Industry
  CE/EC     3,000                   6,000
  MC        2,000                   4,000
  TC        2,000                   4,000
  UC        2,000                   4,000
  DB        2,000                   4,000

Note that any institution affiliated with a corporation belongs to the "industry" category. As the CE and EC tasks share the same training data, the registration fee is the same whether a participant registers for one or both tasks. The DB task is free if the participant registers for any other task.

The deadline for registration is May 20, 2015. Please send the registration form to:

  Name: Dr. Kun WANG
  Email: [email protected]
  Address: Institute of Automation, CAS,
           No. 95 Zhongguancun East Road, Haidian District, Beijing 100190, P. R. China
  Post Code: 100190
  Telephone: +86-10-82544588

Please send the registration fee by either of the following means (both bank transfer and remittance through the post office are accepted):

Bank transfer (preferred):
  Full name of the bank: Industrial and Commercial Bank of China, Beijing Municipal Branch, Haidian Xiqu Sub-branch
  Beneficiary: Chinese Information Processing Society of China
  Bank account number: 0200004509014415619

Remittance through the post office:
  Address: Chinese Information Processing Society of China, No. 4 South Fourth Street, Zhongguancun, Haidian District, Beijing 100190, P. R. China
  Post Code: 100190

For any problem related to bank transfer or remittance, please contact:
  Ms. Lin Lu
  Telephone: +86-10-62562916

Participants outside mainland China should sign an agreement with the organizer if they want to pay the registration fee in another currency (e.g., USD). The exchange rate will be determined by the sponsor according to the official rate on the day of payment.

Registration Form for the CWMT 2015 Machine Translation Evaluation

  Organization Name:
  Address:
  Post Code:
  Contact Person:
  Telephone:
  Email:

  Evaluation Tasks:
    □ Chinese-to-English News Translation
    □ English-to-Chinese News Translation
    □ Mongolian-to-Chinese Daily Expression Translation
    □ Tibetan-to-Chinese Government Document Translation
    □ Uyghur-to-Chinese News Translation
    □ Double-Blind Evaluation

The participating site agrees to commit to the following terms:

1. After receiving the evaluation data, the participating site will process the entire test set following the evaluation guidelines, and submit the results, including the system description and the primary system's results, to the evaluation organizer before the submission deadline.

2. The participating site agrees to submit a formal technical report, attend the CWMT 2015 workshop, and make a presentation there.

3. The participating site confirms that it holds the intellectual property rights of the participating system. If any technology in the participating system is licensed from another person or organization, this will be clearly described in the system description.

4. The registrant confirms that the data obtained in the evaluation, including the training set, the development set, the test set, reference translations, and evaluation tools, will be used only in research related to this evaluation. No other usage is permitted.

5. The participating site agrees that the evaluation data will be used only within the research group that takes part in the evaluation; it will neither be distributed in any way (in writing, electronically, or via network), nor be used by any partner or affiliated organization of the participating site.

6. The participating site agrees to give credit to the resource providers by citing the resources used (e.g., training data, development data, test data, reference translations, and evaluation tools) in its publications and other research output.

7. If a participating site violates terms 4-6, the evaluation sponsor and the resource providers have the right to request that the participating site, and any cooperators and/or affiliated organizations using the resources without a granted license, pay compensation of 3-5 times the cost of the distributed resources. If this is insufficient, the compensation shall be increased to equal the actual loss of the related resource providers.
Signature of the person in charge or stamp of the participating site:

Date:

CWMT 2015 MACHINE TRANSLATION EVALUATION PARTICIPATING SITE AGREEMENT
(Non-profit Agreement)

This agreement is made by and between:

Name of the Participating Site (hereinafter called "the participating site"), a participating site of the CWMT 2015 Machine Translation Evaluation, having its principal place of business at: Address of the Participating Site

AND

Chinese Information Processing Society of China (hereinafter called "the sponsor"), the sponsor of the CWMT 2015 machine translation evaluation, having its principal place at: No. 4 South Fourth Street, Zhongguancun, Beijing, China.

Whereby it is agreed as follows:

1. The sponsor provides the participating site with the training data for the CWMT 2015 machine translation evaluation.

2. The participating site shall pay the registration fee for the evaluation to the sponsor. The registration fee includes the license fee for the training data resources, the CWMT 2015 registration fee for one person from the participating site, and part of the cost of organizing the evaluation. The registration fee is XXX USD. Method of payment: bank transfer.

In witness whereof, intending to be bound, the parties hereto have executed this AGREEMENT by their duly authorized officers.

AUTHORIZED BINDING SIGNATURES:

  ————————                                        ————————
  On behalf of Chinese Information Processing      On behalf of Name of the Participating Site
  Society of China
  Name:                                            Name:
  Title:                                           Title:
  Date:                                            Date:

Appendix C: Format of MT Evaluation Data

This appendix describes the format of the data released by the organizer and of the result files that the participants should submit. All files are encoded in UTF-8. Among them, the development sets (including their references), the test sets, and the final translation result files must be strict XML files (whose format is defined by the XML DTD described in Section III below) encoded in UTF-8 (with BOM); all the others are plain text files encoded in UTF-8 (without BOM).

I. Data Released by the Organizer

The organizer will release three kinds of data: training sets, development sets, and test sets. Here we take the Chinese-to-English translation task as an example for illustration purposes.

1. Training Set

The training data contains one sentence per line. The parallel corpus of each language pair consists of a source file and a target file, which contain the source and target sentences respectively. Figure 1 illustrates the data format of the parallel corpus.

  Source file:                    Target file:
  战法训练有了新的突破            Tactical training made new breakthrough
  第一章总则                      Chapter I general rules
  人民币依其面额支付              The renminbi is paid by denomination
  ……                              ……

  Figure 1: Example of the parallel corpus

Note: For the Double-Blind Evaluation task, the evaluation organizer will provide corpora that have already been preprocessed. The corpora for the other language pairs are not preprocessed.
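Because the source and target files are line-aligned, the n-th line of the source file corresponds to the n-th line of the target file. As a minimal illustration (the file names here are hypothetical, not names used by the organizer), the sentence pairs can be read in Python as follows:

def read_parallel_corpus(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src_file, \
         open(tgt_path, encoding="utf-8") as tgt_file:
        for src_line, tgt_line in zip(src_file, tgt_file):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")

# Hypothetical file names for a Chinese-English training corpus.
for zh, en in read_parallel_corpus("train.zh", "train.en"):
    print(zh, "=>", en)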
2. Development Set and Test Set

The development set and the test set consist of source files and reference files.

(1) Source File

A source file contains one single srcset element, which has the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti, db};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti, db}.

A srcset element contains one or more DOC elements, and each DOC element carries a single attribute, docid, which indicates the genre of the DOC. Each DOC element contains several seg elements with an id attribute. One or more segments may be encapsulated inside other elements, such as p. Only the text surrounded by seg elements is to be translated. Figure 2 shows an example of the source file.

<?xml version="1.0" encoding="UTF-8"?>
<srcset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news">
<p>
<seg id="1">sentence 1</seg>
<seg id="2">sentence 2</seg>
……
</p>
……
</DOC>
</srcset>

  Figure 2: Example of the source file

(2) Reference File

A reference file contains a refset element. Each refset element has the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti}.

Each refset element contains several DOC elements. Each DOC has two attributes:
  docid: the genre of the DOC;
  site: the indicator distinguishing the different references, one element of the set {1, 2, 3, 4}.

Figure 3 shows an example of the reference file.

<?xml version="1.0" encoding="UTF-8"?>
<refset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news" sysid="ref" site="1">
<p>
<seg id="1">reference 11</seg>
<seg id="2">reference 12</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="2">
<p>
<seg id="1">reference 21</seg>
<seg id="2">reference 22</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="3">
<p>
<seg id="1">reference 31</seg>
<seg id="2">reference 32</seg>
……
</p>
……
</DOC>
<DOC docid="news" sysid="ref" site="4">
<p>
<seg id="1">reference 41</seg>
<seg id="2">reference 42</seg>
……
</p>
……
</DOC>
</refset>

  Figure 3: Example of the reference file

II. Data to Be Submitted by the Participants

1. File Naming

Please name the submitted files according to the following naming scheme (here "ce", "ict", and "2015" are used as examples of the Task ID, the Participant ID, and the year of the test data, respectively):

  final translation result: Task ID - year of the test data - Participant ID - primary vs. contrast system - System ID.xml
  Examples: ce-2015-ict-primary-a.xml, ce-2015-ict-contrast-c.xml

2. Final Translation Result

The final submission file contains a tstset element with the following attributes:
  setid: the dataset;
  srclang: the source language, one element of the set {en, zh, mn, uy, ti};
  trglang: the target language, one element of the set {en, zh, mn, uy, ti}.

The tstset element contains a system element with the following attributes:
  site: the label of the participant;
  sysid: the identification of the MT system.

The value of the system element is a description of the participating system, including the following information:

  Hardware and software environment: the operating system and its version, the number of CPUs, the CPU type and frequency, the system memory size, etc.

  Execution time: the time from accepting the input to generating the output.

  Technology outline: an outline of the main technology and the important parameters of the participating system. If the system uses system combination techniques, the single systems being combined and the combination techniques should be described here.

  Training data: a description of the training data and development data used for system training.

  External technology: a declaration of the external technologies used in the participating system but not owned by the participating site, including open-source code, free software, shareware, and commercial software.

The content of each DOC element is exactly the same as that of the test set's source file, described above. Figure 4 shows an example of the final submission file.

<?xml version="1.0" encoding="UTF-8"?>
<tstset setid="zh_en_news_trans" srclang="zh" trglang="en">
<system site="unit name" sysid="system identification">
description information of the participating system
............
</system>
<DOC docid="document name" sysid="system identification">
<p>
<seg id="1">submitted translation 1</seg>
<seg id="2">submitted translation 2</seg>
……
</p>
……
</DOC>
</tstset>

  Figure 4: Illustration of the final submission file
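One convenient way to guarantee that a submission mirrors the DOC/seg structure of the test set's source file is to parse the source file and replace the text of each seg with the system output. The following sketch uses Python's standard xml.etree.ElementTree; it is a minimal illustration, not an official tool. The file names, site label, system ID, and the translate() stub are placeholders, and for simplicity it assumes a single p element per DOC.

import xml.etree.ElementTree as ET

def translate(sentence):
    """Placeholder for the participant's own MT system."""
    return "translation of: " + sentence

# Parse the released source file (the file name is hypothetical).
srcset = ET.parse("ce-2015-test-src.xml").getroot()

# Build a tstset element with the same set-level attributes.
tstset = ET.Element("tstset", attrib=srcset.attrib)

# The system element carries the site label and system ID; its text
# is the free-form system description required by the guidelines.
system = ET.SubElement(tstset, "system", site="ict", sysid="primary-a")
system.text = "description information of the participating system"

# Copy every DOC, keeping docid and seg ids, and replace the text of
# each seg with the system's translation.
for doc in srcset.iter("DOC"):
    out_doc = ET.SubElement(tstset, "DOC",
                            docid=doc.get("docid"), sysid="primary-a")
    out_p = ET.SubElement(out_doc, "p")
    for seg in doc.iter("seg"):
        out_seg = ET.SubElement(out_p, "seg", id=seg.get("id"))
        out_seg.text = translate(seg.text or "")

# ElementTree escapes &, < and > in text automatically. Note that it
# writes UTF-8 without a BOM, while the guidelines require UTF-8 with
# BOM for the submission file, so add the BOM separately if needed.
ET.ElementTree(tstset).write("ce-2015-ict-primary-a.xml",
                             encoding="utf-8", xml_declaration=True)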
Note: Please note that the CWMT 2015 evaluation adopts a strict XML file format. The main difference between this XML file format and the NIST evaluation file format lies in the following: in an XML file, if any of the following five characters occurs in the text outside tags, it must be replaced by its escape sequence:

  Character   Escape sequence
  &           &amp;
  <           &lt;
  >           &gt;
  "           &quot;
  '           &apos;

As for Chinese encoding, the middle dot in a foreign person name should be written as "E2 80 A2" in UTF-8, for example, "托德·西蒙斯".

As for English tokenization, the tokenization should be consistent with the "normalizeText" function of the Perl script "mteval-v11b.pl" released by NIST.
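Participants who assemble the XML by string concatenation (rather than with an XML library, which escapes these characters automatically) can apply the five replacements in the table above with the Python standard library. A minimal sketch:

from xml.sax.saxutils import escape

def escape_for_seg(text):
    """Replace the five reserved XML characters with their escape
    sequences: escape() handles &, < and > by default, and the two
    quote characters are supplied as additional entities."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})

# 'A&B <tag> "quoted"' becomes 'A&amp;B &lt;tag&gt; &quot;quoted&quot;'
print(escape_for_seg('A&B <tag> "quoted"'))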
III. Description of the CWMT 2015 XML files' document structure

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT srcset (DOC+)>
<!ATTLIST srcset setid CDATA #REQUIRED>
<!ATTLIST srcset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST srcset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT refset (DOC+)>
<!ATTLIST refset setid CDATA #REQUIRED>
<!ATTLIST refset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST refset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT tstset (system+, DOC+)>
<!ATTLIST tstset setid CDATA #REQUIRED>
<!ATTLIST tstset srclang (en | zh | mn | uy | ti) #REQUIRED>
<!ATTLIST tstset trglang (en | zh | mn | uy | ti) #REQUIRED>
<!ELEMENT system (#PCDATA)>
<!ATTLIST system site CDATA #REQUIRED>
<!ATTLIST system sysid CDATA #REQUIRED>
<!ELEMENT DOC (p*)>
<!ATTLIST DOC docid CDATA #REQUIRED>
<!ATTLIST DOC site CDATA #IMPLIED>
<!ELEMENT p (seg*)>
<!ELEMENT seg (#PCDATA)>
<!ATTLIST seg id CDATA #REQUIRED>

Appendix D: Requirements of Technical Report

All participating sites should submit a technical report to the 11th China Workshop on Machine Translation (CWMT 2015). The technical report should describe the technologies used in the participating system(s) in detail, in order to inform the reader how the reported results were obtained. A good technical report should be detailed enough that the reader could replicate the work described in it. The report should be no shorter than 5,000 Chinese characters or 3,000 English words. Generally, a technical report should provide the following information:

  Introduction: gives the background information, introduces the evaluation tasks participated in, and outlines the participating systems.

  System: describes the architecture and each module of the participating system in detail, with the focus on the technologies used in the system. Any open technology adopted should be explicitly declared; technologies developed by the participating site itself should be described in detail. If the participating site uses system combination techniques, the single systems as well as the combination technique should be described. If the participating site uses hand-crafted translation knowledge sources such as rules, templates, and dictionaries, the size of the knowledge sources and the ways they were constructed and used should be described.

  Data: gives a detailed description of the data used in system training and of how the data were processed.

  Experiment: gives a detailed description of the experimental process, the parameters, and the results obtained on the evaluation set, and analyzes the results.

  Conclusion: (open)

Appendix E: List of Resources Released by the Organizer

1. The Chinese-English resources provided by the organizer

  ChineseLDC resource ID: CLDC-LAC-2003-004
  Name: CAS-ICT & CAS-IA Chinese-English Sentence-Aligned Bilingual Corpus (Extended Version)
  Providers: Institute of Computing Technology, CAS & Institute of Automation, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 252,329 sentence pairs
  Description: The original corpus includes 3,384 bilingual text files containing 209,486 Chinese-English sentence pairs, of which 3,098 files with 107,436 sentence pairs were developed by the Institute of Automation (IA), CAS, and the other 250 files with 102,050 sentence pairs were developed by the Institute of Computing Technology (ICT), CAS. The current version is an extended version with additional data, also provided by IA and ICT, and contains 252,329 sentence pairs overall. The resource was developed with the support of the National Basic Research Program (973 Program). It is a large-scale, sentence-aligned, multi-domain, multi-style Chinese-English bilingual corpus.

  ChineseLDC resource ID: CLDC-LAC-2003-006
  Name: PKU Chinese-English/Chinese-Japanese Parallel Corpus (Chinese-English part)
  Provider: Institute of Computational Linguistics, Peking University
  Languages: Chinese-to-English
  Domain: Multi-domain
  Size: 200,000 Chinese-English sentence pairs
  Description: The corpus was supported by a subproject of the National High Technology Research and Development Program of China (863 Program) entitled "Chinese-English/Chinese-Japanese parallel corpus" (Grant No. 2001AA114019).

  Name: XMU English-Chinese Movie Subtitle Corpus
  Provider: Xiamen University
  Languages: English-to-Chinese
  Domain: Dialog
  Size: 176,148 sentence pairs
  Description: Subtitles of movies.

  Name: HIT-IR English-Chinese Sentence-Aligned Corpus
  Provider: IR Laboratory, Harbin Institute of Technology
  Languages: English-to-Chinese
  Domain: Multi-domain
  Size: 100,000 sentence pairs

  Name: HIT-MT English-Chinese Sentence-Aligned Corpus
  Provider: Machine Translation Group, Harbin Institute of Technology
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 52,227 sentence pairs

  Name: Datum English-Chinese Parallel Corpus (Part)
  Provider: Datum Data Co., Ltd.
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain, including textbooks for language education, bilingual books, technological documents, bilingual news, government white papers, government documents, bilingual resources on the web, etc.
  Size: 1,000,000 sentence pairs
  Description: A part of the "Bilingual/Multi-lingual Parallel Corpus" developed by Datum Data Co., Ltd. with the support of the National High Technology Research and Development Program of China (863 Program).
  Name: ICT Web Chinese-English Parallel Corpus (2013)
  Provider: Institute of Computing Technology, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: About 2,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic. The accuracy of the corpus is about 95%. This work was supported by the National Natural Science Foundation of China (Grant No. 60603095).

  Name: ICT Web Chinese-English Parallel Corpus (2015)
  Provider: Institute of Computing Technology, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: More than 2,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic. The Institute of Computing Technology has roughly corrected this corpus, and its accuracy is greater than 99%. The sentences come from three sources: 60% from the web, 20% from movie subtitles, and the remaining 20% from English-Chinese or Chinese-English dictionaries.

  Name: NEU Chinese-English Parallel Corpus
  Provider: Natural Language Processing Group, Northeastern University
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: 1,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. Semi-automatic techniques were used to filter out sentence pairs of low quality.

  Name: CASIA Web Chinese-English Parallel Corpus (2015)
  Provider: Institute of Automation, CAS
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Multi-domain
  Size: About 1,000,000 sentence pairs
  Description: The parallel corpus was automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, and sentence alignment, are entirely automatic.

  ChineseLDC resource ID: 2007-863-001
  Name: SSMT2007 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: News
  Size: The source file of the Chinese-English data contains 1,002 Chinese sentences with 42,256 Chinese characters. The source file of the English-Chinese data contains 955 English sentences with 23,627 English words. There are 4 reference translations made by human experts for each test sentence.
  Description: The test data of the SSMT 2007 MT evaluation, containing data of 2 translation directions (Chinese-English and English-Chinese) in the news domain.
  ChineseLDC resource ID: 2005-863-001
  Name: HTRDP ("863 Program") 2005 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: The data contains two genres: dialog data from Olympics-related domains (game reports, weather forecasts, traffic and hotels, travel, food, etc.) and text data from the news domain.
  Size: The source files of the dialog and text data in the Chinese-to-English and English-to-Chinese directions each contain about 460 sentences; the total number of source sentences is about 1,840. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2005 "863 Program" machine translation evaluation.

  ChineseLDC resource ID: 2004-863-001
  Name: HTRDP ("863 Program") 2004 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: Two genres: text data and dialog data. The data covers the general domain and Olympics-related domains, including game reports, weather forecasts, traffic and hotels, travel, food, etc.
  Size: The source files of the Chinese-to-English direction contain dialog data of 400 sentences and text data of 308 sentences. The source files of the English-to-Chinese direction contain dialog data of 400 sentences and text data of 310 sentences. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2004 "863 Program" machine translation evaluation.

  ChineseLDC resource ID: 2003-863-001
  Name: HTRDP ("863 Program") 2003 Machine Translation Evaluation Data (a part of the Chinese-English & English-Chinese MT evaluation data)
  Provider: Institute of Computing Technology, Chinese Academy of Sciences
  Languages: Chinese-to-English, English-to-Chinese
  Domain: The data covers Olympics-related domains, including game reports, weather forecasts, traffic and hotels, travel, food, etc.
  Size: The source files of the Chinese-to-English direction contain dialog data of 437 sentences and text data of 169 sentences, and the source files of the English-to-Chinese direction contain dialog data of 496 sentences and text data of 322 sentences. There are 4 reference translations made by human experts for each source sentence.
  Description: The test data of the 2003 "863 Program" machine translation evaluation.

2. The Mongolian-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: IMU Mongolian-Chinese Parallel Corpus (Version 2013)
  Provider: Inner Mongolia University
  Languages: Chinese-to-Mongolian, Mongolian-to-Chinese
  Domain: Government documents, laws, rules, daily conversation, literature
  Size: 104,975 sentence pairs, including: 1) 67,274 sentence pairs from the CWMT 2011 MT evaluation, covering domains such as daily conversation, literature, government documents, laws, and rules; 2) 37,701 sentence pairs newly added for the CWMT 2013 MT evaluation, including 17,516 sentence pairs from the news domain, 10,394 from government documents, 5,052 from textbooks, and 4,739 from a Mongolian-to-Chinese dictionary.
  Description: Encoded in UTF-8 (without BOM).

  Name: IMU Mongolian-Chinese Parallel Corpus (Version 2015)
  Provider: Inner Mongolia University
  Languages: Chinese-to-Mongolian, Mongolian-to-Chinese
  Domain: Government documents, laws, rules, daily conversation, literature
  Size: 53,578 sentence pairs, including: 5,012 sentence pairs from movies, 12,835 from government documents, 1,872 from books, 5,780 about Mongolian sacrificial rituals, and 28,079 from news.

  Name: IIM Mongolian-Chinese Parallel Corpus
  Provider: Institute of Intelligent Machines, CAS
  Languages: Mongolian-to-Chinese
  Domain: News
  Size: 1,682 sentence pairs
  Description: Encoded in UTF-8 (without BOM).
3. The Tibetan-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: QHNU Tibetan-Chinese Parallel Corpus (Version 2013)
  Provider: Qinghai Normal University
  Languages: Tibetan-to-Chinese, Chinese-to-Tibetan
  Domain: Government documents
  Size: 50,000 sentence pairs. Note: after the organizer deleted the sentence pairs that were also found in other training data, the size of this corpus is now 30,000 sentence pairs.
  Description: The sentence alignment accuracy of the corpus is over 99%. The construction of the corpus was supported by NSFC (Grant No. 61063033) and the 973 Program (Grant No. 2010CB334708).

  Name: QHNU Tibetan-Chinese Parallel Corpus (Version 2015)
  Provider: Qinghai Normal University
  Languages: Tibetan-to-Chinese, Chinese-to-Tibetan
  Domain: Government documents
  Size: 20,000 sentence pairs
  Description: The sentence alignment accuracy of the corpus is over 99%. The construction of the corpus was supported by NSFC (Grant No. 61063033).

  Name: Yang Jin Tibetan-Chinese Parallel Corpus
  Provider: Artificial Intelligence Institute, Xiamen University & Language Technology Institute, Northwest University of Nationalities
  Languages: Chinese-to-Tibetan
  Domain: Multi-domain
  Size: 52,000 sentence pairs
  Description: 1) The sources of the corpus include publications, a Tibetan-Chinese dictionary, and Tibetan-Chinese web text. The corpus was automatically aligned and corrected manually. 2) The alignment accuracy is 100%. 3) The research was supported by NSSFC (Grant No. 05AYY001) and HTRDP (Grant No. 2006AA010107).

  Name: NUN-TU-XMU Tibetan-Chinese Parallel Corpus (2012)
  Provider: Language Technology Institute, Northwest University of Nationalities & Tibet University & Artificial Intelligence Institute, Xiamen University
  Languages: Chinese-to-Tibetan
  Domain: Political writings, law
  Size: 24,000 sentence pairs
  Description: The material is selected from Chinese laws and regulations issued during 2008-2009 and from government reports from 2011 to 2012. All the source materials were scanned, recognized, checked, and processed manually.

4. The Uyghur-Chinese resources provided by the organizer (no repeated sentence pairs among the different datasets)

  Name: XJU Uyghur-Chinese Parallel Corpus (Version 2013)
  Provider: Xinjiang University
  Languages: Chinese-to-Uyghur
  Domain: News
  Size: 80,000 sentence pairs

  Name: XTIPC Uyghur-Chinese Parallel Corpus (Version 2015)
  Provider: Xinjiang Technical Institute of Physics & Chemistry, CAS
  Languages: Chinese-to-Uyghur
  Domain: News
  Size: 60,000 sentence pairs
  Description: About 30,000 sentence pairs were newly added on top of Version 2013. 95% of the sentence pairs are from news (2007-2014), and the others are from laws and government reports.

5. Other training data resources not provided by the organizer

  Name: Reuters corpus
  Provider: Reuters
  Languages: English
  Domain: News
  Size: RCV1 (Reuters Corpus, Volume 1), English language, 1996-08-20 to 1997-08-19 (release date 2000-11-03, format version 1, correction level 0). It is distributed on two CDs and contains about 810,000 Reuters English-language news stories, requiring about 2.5 GB of storage for the uncompressed files. The Reuters Corpus can be obtained from: http://trec.nist.gov/data/reuters/reuters.html. It takes about 1 month to be posted from the USA to China.
  Description: Only Volume 1 of the Reuters Corpus is permitted to be used as the training corpus for the English language model in the evaluation. Volume 2 is not permitted to be used in the evaluation.

  Name: SogouCA
  Provider: Sogou Labs
  Languages: Chinese
  Domain: News
  Size: The corpus contains URL and text data collected from 18 news channels on the web from May to June 2008, covering the domains of Olympics, sports, IT, domestic news, and international news. The total size is 1.03 GB after compression.
  Description: The SogouCA corpus can be obtained from: http://www.sogou.com/labs/dl/ca.html. Participants can also obtain the corpus from the organizer. The SogouCA corpus is permitted to be used as the training data of the Chinese language model in the evaluation.