Weather Forecast Corpus and a Sample Dictionary
Siwarak Paengpho and Jirapa Vitayapirak
Proceedings of the International Conference: Doing Research in Applied Linguistics
Abstract
This study classifies word lists into general, academic, and technical vocabulary and analyzes
word classes, collocations, phrases, and abbreviations in the Weather Forecast Corpus (WFC). The
corpus consists of 555,818 words of weather forecast texts. The concordance software Wordsmith
Tools was used to process the data in terms of word frequency and percentage. The results showed
that the vocabulary in the WFC was highly general: general vocabulary accounted for 76.37 % of
tokens, academic vocabulary 5.86 %, technical vocabulary 5.30 %, and other vocabulary 12.58 %.
Some general words were found to function as technical words, and many general words in the
corpus become multi-word phrasal items with meteorological meanings when they collocate with
other words. These results led to the design of a sample meteorological dictionary, which can
serve as a guideline for English teachers and students in English courses at the Thai
Meteorological Institute.
Introduction
English is now a lingua franca that plays a crucial role in higher education in Thailand,
especially in science and technology. The Thai Meteorological Institute (TMI) is a higher
education institute under the administration of the Thai Meteorological Department (TMD). The
institute offers many courses in meteorology for TMD meteorologists and for meteorological
students preparing for careers as meteorological officers or meteorologists. English for
meteorology is required for all students. Thus, all meteorologists and meteorological students
at TMI have to study English for Specific Purposes (ESP).
ESP is a major activity in English language teaching around the world. It is an enterprise
involving education, training and practice. It draws upon three major realms of knowledge:
language, pedagogy and the student’s or participants’ specialist areas of interest such as
pharmacology, gastronomy, photography, meteorology, and so on (Robinson, 1991: 1, 18).
However, one important problem for ESP learners seems to be English lexical knowledge, which
affects reading, writing, and translating. Vocabulary, in terms of both meaning and usage, is
at the core of the learning process. When ESP learners encounter difficult vocabulary such as
technical terminology, they often look it up in technical dictionaries.
Computers have been used in vocabulary studies and in lexicography for several decades (Landau,
2001). Corpus-based study is now widespread in learning specialized vocabulary and in compiling
technical dictionaries from authentic texts. In other words, corpus-based study uses
computer-readable texts for linguistic research and dictionary making. Specialized corpora offer
several advantages, including searching, sorting, retrieving, and calculating linguistic data
with high speed and accuracy (Mallikamas, 1991).
In the past, the entries of a technical dictionary had to be defined manually by experts in the
field. Modern technical dictionaries have been revolutionized by the introduction of corpus-based
techniques and are now usually based upon huge corpora of English (e.g. the Bank of English
(COBUILD) and the British National Corpus (BNC)), from which words, forms, spellings, meanings
and grammatical behavior are extracted, thus allowing lexicographers to appeal directly to the
observed facts of language (Trask, 1999: 166). Clearly, corpus-based analysis can be applied in
the same way in the field of meteorology.
Little previous research has been carried out on the design of specialized corpora in this
field. No specialized corpus has so far been applied to bilingual (English and Thai)
meteorological terminology. In Thailand, corpus-based analysis has never been applied in any
course or research on English for Meteorology at the Thai Meteorological Institute, and no
meteorological dictionary for Thai users has been compiled from corpus research to meet the
needs of its users. As a result, bilingual meteorological dictionaries have not been well
developed and very few references in this area are available.
This study therefore applies corpus-based analysis to English weather forecast texts in order to
provide insights into the language of weather forecasts using authentic texts. The results could
also serve as a guideline for designing a meteorological dictionary and for teaching English to
meteorological students.
Objectives
1) To establish the Weather Forecast Corpus (WFC).
2) To analyze word classes, collocations, compound nouns, and abbreviations.
3) To classify word lists in the corpus into general, academic, and technical
vocabularies.
4) To design a sample meteorological dictionary based on the technical vocabularies
found in the Weather Forecast Corpus (WFC).
Materials and Methods
1) Data Collection
The English weather forecast texts were collected from three main sources: 1) websites: the
Thai Meteorological Department’s website and the World Meteorological Organization’s website; 2)
newspapers: Bangkok Post; and 3) documents in working units at the Thai Meteorological
Department. The texts were classified using the weather forecast text classification of the
Thai Meteorological Department (TMD, 2007: Online). They were divided into two main categories,
i.e. Thailand weather forecasts (WF) and meteorological documents used in Thai Meteorological
Department working units (MD). Thailand weather forecasts (WF) were divided into seven
sub-categories: Weather Advisory (WF1), Daily Forecast (WF2), Shipping Forecast (WF3), Weekly
Forecast (WF4), Monthly Forecast (WF5), Sunrise-Sunset and Moonrise-Moonset Forecast (WF6), and
Water Level Forecast (WF7). The meteorological documents (MD) were divided into six
sub-categories: Weather News (MD1), Weather Report (MD2), Meteorological Journal (MD3),
Meteorological Magazine (MD4), Meteorological Textbook and Manual (MD5), and Other
Meteorological Documents (MD6). The sample of 555,818 words of weather forecast texts was
selected using proportional stratified random sampling. Frequency and percentage of text
coverage were used to select and classify the data in the corpus.
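Proportional stratified random sampling can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say what the sampling unit was or how large each sub-category was, so the stratum sizes below are hypothetical, and the function names `proportional_allocation` and `stratified_sample` are introduced here for the example. Rounding uses the largest-remainder method, which is one common choice rather than the procedure the authors necessarily followed.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size."""
    total = sum(stratum_sizes.values())
    # Exact (possibly fractional) share of the sample for each stratum.
    exact = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    # Take the whole-number part first, then give the leftover units to the
    # strata with the largest fractional parts (largest-remainder rounding).
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_sample(strata, sample_size, seed=0):
    """Draw a simple random sample inside each stratum, sized proportionally."""
    rng = random.Random(seed)
    sizes = {name: len(items) for name, items in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(strata[name], alloc[name]) for name in strata}


# Hypothetical stratum sizes (number of texts); not figures from the study.
sizes = {"WF1": 50, "WF2": 30, "MD1": 20}
print(proportional_allocation(sizes, 10))
```

With these hypothetical sizes, a sample of 10 texts is split 5, 3, and 2, so each sub-category keeps the same share in the sample as in the population.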
2) Research Instruments
In this study, the concordance software Wordsmith Tools (WST) Version 3.0 (Scott, 1998) was used
to process the data: making indexes and word lists, counting word frequencies, comparing
different usages of each word, analyzing keywords, and finding phrases or idioms (word clusters)
throughout the whole corpus. The selection of technical vocabulary was chiefly based on the
American Meteorological Society (AMS)’s Online Glossary of Meteorology (2000), which was used to
define English technical vocabulary. The Glossary of Meteorological Terms (1979) and the Online
Glossary of Meteorology (2007), both written by the Thai Meteorological Department (TMD), were
used as references to define Thai technical vocabulary.
3) Data Analysis Procedures
Firstly, all samples were collected and scanned into plain text files (*.txt), giving 13 files,
one for each sub-type of weather forecast text in the WFC. In the scanning procedure, optical
character recognition (OCR) software was used to recognize the scanned characters and represent
the texts electronically. Afterwards, all text files were transferred into Microsoft Word 2007
and saved as document files (*.doc). At this stage, the spell checker was used to check the
spelling of all words in the texts. The Wordlist Tool was then used to compute the frequency of
running words (tokens) and word types, including the type/token ratio. Finally, the results
from the word frequency counts and the analysis of the collected information were interpreted.
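The token, type, and type/token counts described above can be illustrated with a short sketch. It is not the Wordsmith Tools implementation: the tokenizer (lowercased alphabetic runs, allowing internal apostrophes and hyphens) and the function names are assumptions made for this example, and the sample sentence is invented.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed, simple tokenizer: lowercase alphabetic runs, keeping
    # internal apostrophes and hyphens together with the word.
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def word_list_stats(text):
    """Return the frequency list, token count, type count, and type/token ratio (%)."""
    tokens = tokenize(text)
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    ttr = 100 * n_types / n_tokens if n_tokens else 0.0
    return freq, n_tokens, n_types, ttr


# Invented sample text, not taken from the WFC.
sample = "Isolated thundershowers in the North. Isolated heavy rain in the South."
freq, n_tokens, n_types, ttr = word_list_stats(sample)
print(n_tokens, n_types, round(ttr, 2))
print(freq.most_common(3))

# The same ratio computed from the totals reported for the WFC.
reported_ttr = round(100 * 13172 / 555818, 2)
print(reported_ttr)
```

Applied to the totals reported for the WFC (13,172 types and 555,818 tokens), the same calculation gives a type/token ratio of about 2.37 per cent.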
Results and Discussion
The word frequency list contains 555,818 tokens and 13,172 word types, and the type/token ratio
was 2.37 or 1:29, as shown in Figure 1 below.
Figure 1 Statistical detail of each sub-type of weather forecast texts in the WFC
The words analyzed in this study were those that occurred more than 55 times, which amounted to
593 lemmas. The word frequency list was lemmatized for this purpose. All 593 lemmas were divided
into 4 groups according to Bauman and Culligan (1995) and Coxhead (2000): general vocabularies,
academic vocabularies, technical vocabularies, and others (those that did not fit any of the
three main groups). The results are illustrated in Figure 2:
Figure 2 Proportion of 4 groups of vocabularies used in this study
Figure 2 shows the proportions of the 4 groups of vocabularies counted in the WFC. Of the total
of 268,600 tokens in the 4 groups, 205,125 tokens (or 76.37 %) were classified as general
vocabularies and 13,876 tokens (or 5.86 %) as academic vocabularies, while 14,252 tokens (or
5.30 %) were classified as technical vocabularies and 33,792 tokens (or 12.58 %) as others.
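The grouping of tokens by word list and the resulting coverage percentages can be sketched as below. The tiny word sets are hypothetical stand-ins for the General Service List, the Academic Word List, and a meteorological glossary, and the order in which the lists are checked (general first, then academic, then technical) is an assumption; the paper does not state how words appearing in more than one list were resolved.

```python
from collections import Counter


def classify_tokens(tokens, gsl, awl, technical):
    """Count tokens per group; a token goes to the first list that contains it."""
    counts = Counter()
    for tok in tokens:
        if tok in gsl:            # assumed priority: general list first
            counts["general"] += 1
        elif tok in awl:
            counts["academic"] += 1
        elif tok in technical:
            counts["technical"] += 1
        else:
            counts["other"] += 1
    return counts


def coverage(counts):
    """Convert group counts into percentages of all classified tokens."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


# Hypothetical miniature word lists and token stream for illustration only.
gsl = {"the", "rain"}
awl = {"area"}
technical = {"cumulonimbus", "rain"}
tokens = ["the", "rain", "area", "cumulonimbus", "bangkok", "the"]

counts = classify_tokens(tokens, gsl, awl, technical)
print(counts)
print(coverage(counts))
```

Because "rain" sits in both the general and the technical set here, it is counted as general under the assumed priority. That is the same kind of overlap the study reports for words such as "high" and "pressure".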
Among the 4 groups of vocabularies in this study, the general group had the highest proportion
while the technical group had the lowest. In addition, the proportion of the academic group was
lower than that of the other group. Interestingly, we found that some words categorized as
general, academic, or other vocabularies could also carry technical meanings in different
contexts. This finding agrees with Chung and Nation (2003), who noted that many technical words
are created from common words, including words from the General Service List (GSL) and the
Academic Word List (AWL). Hence, it may not be entirely correct to identify certain words in
the corpus as belonging to the GSL or the AWL only. For example, some general or other words in
the corpus were used as technical ones: high, pressure, low, cold, cool, storm, station, ice,
shower, moderate, isolated, calm, etc. A single word can also be part of a multi-word lexical
unit, for example, ‘station temperature’, ‘sea surface temperature’, ‘extreme minimum
temperature’, ‘torrential rain’, ‘fairly widespread rain’, ‘widespread thundershowers’, ‘isolated
thundershowers’, ‘high pressure system’, ‘low pressure cell’, ‘heat wave’, ‘moderate sea’, etc.
In terms of word classes, two main types were found in the corpus, namely closed class and open
class. Closed-class words such as ‘the’, ‘and’, ‘of’, ‘in’, ‘to’, and ‘with’ were found among the
top ten words of the frequency list in the WFC, as shown in Figure 3 below.
Figure 3 The top 10 word frequency lists in the WFC
Some words in the WFC have more than one function. Overlaps between nouns and full verbs, nouns
and auxiliary verbs, nouns and abbreviations, auxiliary verbs and abbreviations, and pronouns
and abbreviations were found in the WFC. They could be identified using concordance data from
the corpus, as in the example in Figure 4 below.
Figure 4 Concordance sample of ‘am’ in the WFC
In this study, ‘am’ occurred 751 times as an abbreviation (for Ante Meridiem) and once as a
verb. Therefore, ‘am’ in the WFC was almost always the abbreviation of ‘Ante Meridiem’. This is
because the verb ‘am’ is used mainly in spoken texts, whereas the WFC consists of written texts.
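A key-word-in-context (KWIC) display of the kind used to separate the two uses of ‘am’ can be sketched as follows. The sample sentence is invented, the function names are hypothetical, and the rule that a preceding digit signals the time abbreviation is a simplifying assumption for illustration, not the procedure used in the study.

```python
import re


def kwic(text, node, width=3):
    """Return (left context, hit, right context) for every occurrence of node."""
    tokens = re.findall(r"\S+", text)
    lines = []
    for i, tok in enumerate(tokens):
        # Compare on letters only, so 'am', 'AM', and 'a.m.' all match node 'am'.
        if re.sub(r"[^a-z]", "", tok.lower()) == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def looks_like_time_abbreviation(left_context):
    # Assumed heuristic: a clock time directly before the hit ends in a digit.
    return bool(re.search(r"\d$", left_context))


# Invented sample text, not taken from the WFC.
sample = "Sunrise at 6.05 am tomorrow. I am sure rain will fall before 7 a.m. today."
lines = kwic(sample, "am")
for left, hit, right in lines:
    label = "abbreviation" if looks_like_time_abbreviation(left) else "verb"
    print(f"{left:>20} | {hit:^5} | {right:<20} {label}")
```

Reading the left context in each concordance line is what allows the abbreviation and the verb to be told apart, which a bare frequency count cannot do.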
Regarding collocations, both lexical and grammatical collocations were found in this study.
Some technical noun phrases containing words from the GSL, the AWL, and other vocabularies are
used with meteorological meanings. The examples above show that words from the GSL, the AWL,
and other vocabularies are not limited to the common meanings they have in academic texts in
general. Some are used together with words from their own group or from other groups and take
on a meteorological meaning. To conclude, many words from the GSL, the AWL, and other
vocabularies in the WFC can provide meteorological meanings as technical noun phrases when
collocated with other words, as shown in the example in Figure 5.
Figure 5 Collocations sample of ‘temperature’ in the WFC.
As can be seen in Figure 5, the noun ‘temperature’ often co-occurs with adjectives and nouns, as
in minimum temperature, virtual temperature, sea surface temperature, maximum temperature, and
soil temperature. These noun phrases are built from GSL and AWL words, but they carry
meteorological meanings: when ‘temperature’ collocates with ‘virtual’, ‘maximum’, ‘minimum’,
‘sea’ and ‘surface’, or ‘soil’, the combinations become technical noun phrases in meteorology.
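Counting the words that appear immediately to the left of a node word such as ‘temperature’ can be sketched as below. The sample text is invented, the two-word window is an arbitrary assumption, and `left_collocates` is a hypothetical helper rather than a Wordsmith Tools function.

```python
import re
from collections import Counter


def left_collocates(text, node, span=2):
    """Count words occurring within `span` positions to the left of the node word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for word in tokens[max(0, i - span):i]:
                counts[word] += 1
    return counts


# Invented sample text, not taken from the WFC.
sample = (
    "The minimum temperature was 22. The sea surface temperature rose. "
    "The maximum temperature and minimum temperature were recorded."
)
counts = left_collocates(sample, "temperature")
print(counts.most_common())
```

In a real analysis, function words such as "the" and "and" would normally be filtered out before the collocates are ranked, so that modifiers like "minimum" and "sea surface" stand out.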
Interestingly, a number of low-frequency words in the WFC are technical vocabularies, so
low-frequency words should not be ignored in a specialized corpus. Furthermore, separating a
single word from the phrase it belongs to could distort its meaning in that phrase. Therefore,
to obtain more accurate proportions of the vocabulary categories, single words and multi-word
items should be analyzed separately, and both should be identified by checking them not only
against the general and academic word lists but also against their meanings in context.
Abbreviations accounted for 10.85 % or 30,784 tokens (74 lemmas) in this study. They could be
categorized into 6 types, i.e. 1) clippings, 2) acronyms, 3) initialisms, 4) contractions, 5)
substitutions, and 6) symbols. Table 1 shows the top 10 abbreviations in the WFC.
Table 1 The Top 10 Abbreviations in the WFC
No.  Rank  Abbreviation  Frequency  Percentage  Meaning
1    6     °C            4,244      0.76        Celsius
2    11    km            3,119      0.56        Kilometer
3    14    hr            2,765      0.50        Hour
4    46    Precip        1,239      0.22        Precipitation
5    47    °F            1,212      0.22        Fahrenheit
6    63    P.M./p.m.     849        0.15        Post Meridiem
7    67    AZ            790        0.14        Azimuth
8    70    *A.M./a.m.    751        0.14        Ante Meridiem
9    10    m             727        0.13        Meter
10   75    UTC           727        0.13        Universal Time Coordinated
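The Percentage column in Table 1 can be reproduced from the frequencies if each percentage is taken as the abbreviation's share of all 555,818 tokens in the corpus. That reading is an assumption, since the paper does not state the denominator, but the check below is consistent with it.

```python
# Total running words reported for the WFC.
CORPUS_TOKENS = 555_818

# Frequencies copied from Table 1.
abbreviations = {
    "°C": 4244,
    "km": 3119,
    "hr": 2765,
    "Precip": 1239,
    "°F": 1212,
    "P.M./p.m.": 849,
    "AZ": 790,
    "A.M./a.m.": 751,
    "m": 727,
    "UTC": 727,
}

# Share of the whole corpus, in per cent, rounded to two decimals
# (assumed to be how the Percentage column was calculated).
percentages = {abbr: round(100 * n / CORPUS_TOKENS, 2) for abbr, n in abbreviations.items()}

for abbr, pct in percentages.items():
    print(f"{abbr:<10} {abbreviations[abbr]:>6,} {pct:>6.2f}")
```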
1) The Design of a Sample of Meteorological Dictionary
A good bilingual dictionary usually gives information about meaning, pronunciation, word class
or part of speech (POS), word grammar, collocations, example sentences, and synonyms (Redman and
Edwards, 1997: 8). Ellendersen (2007) pointed out that information on word classes, morphology,
and syntax ought to be included in a Language for Specific Purposes (LSP) dictionary.
2) The Corpus Inputs to the Dictionary
The corpus is the primary source of information about the way words behave. It shows how words
combine with each other (syntactically and collocationally). It also provides information about
word frequencies, grammatical patterns, and collocations, and it is the main source of the
example sentences shown in the dictionary (Macmillan: Online, 2009). Other information in the
dictionary includes signs and symbols, lists of abbreviations, and foreign words and phrases
found in the corpus. The evidence of the WFC shows that information about word frequencies is
very important for vocabulary grading and selection. Therefore, the technical vocabularies,
compound nouns, and abbreviations with high frequencies in this study are proposed for inclusion
in a sample meteorological dictionary. Each entry of the sample dictionary should consist of the
following elements: the English headword, abbreviation, grammatical information, Thai
pronunciation, Thai synonym, Thai definition, English synonym, example of usage, and
illustration, as shown in Figure 6.
Headword: Cumulonimbus
Abbreviation: (Cb.)
Grammatical information: n.
Pronunciation: /คิวมูโลนิมบัส/
Thai synonym: เมฆฟ้าคะนอง
Illustration: [illustration]
Thai definition: เมฆที่มีลักษณะเป็นเมฆก้อนใหญ่รูปร่างคล้ายภูเขาใหญ่ มียอดเมฆแผ่ออกเป็นรูปร่างคล้ายทั่ง ฐานเมฆต่ำ
มีสีดำมืด เป็นเมฆหนา มืดทึบ มีฟ้าแลบ ฟ้าร้อง อาจอยู่กระจัดกระจายหรือรวมกันอยู่ มักมีฝนตกลงมา
Example of use: Ex. After sunset, the Cumulonimbus clouds are often being transferred over the sea
in Norway and that was obvious on 3rd September.
English synonym: S. Cumulonimbus cloud, Thundercloud, Cumulonimbus incus, Cumulonimbus calvus,
Cumulonimbus with mammatus
Figure 6 A sample of meteorological dictionary entry
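The entry structure proposed above could be represented in software roughly as follows. This is a minimal sketch: the class name `DictionaryEntry`, its field names, and the `render` layout are assumptions made for illustration and are not part of the authors' design. The values are taken from the sample entry in Figure 6, with the Thai definition left empty for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One entry with the elements listed for the sample dictionary."""
    headword: str
    abbreviation: Optional[str] = None
    grammatical_info: str = ""
    thai_pronunciation: str = ""
    thai_synonym: str = ""
    thai_definition: str = ""
    english_synonyms: List[str] = field(default_factory=list)
    example: str = ""
    illustration: Optional[str] = None  # hypothetical: path or reference to an image

    def render(self) -> str:
        """Format the entry as plain text (an assumed layout, not the paper's)."""
        head = self.headword
        if self.abbreviation:
            head += f" ({self.abbreviation})"
        parts = [f"{head} {self.grammatical_info} /{self.thai_pronunciation}/ {self.thai_synonym}"]
        if self.thai_definition:
            parts.append(self.thai_definition)
        if self.example:
            parts.append(f"Ex. {self.example}")
        if self.english_synonyms:
            parts.append("S. " + ", ".join(self.english_synonyms))
        return "\n".join(parts)


entry = DictionaryEntry(
    headword="Cumulonimbus",
    abbreviation="Cb.",
    grammatical_info="n.",
    thai_pronunciation="คิวมูโลนิมบัส",
    thai_synonym="เมฆฟ้าคะนอง",
    english_synonyms=["Cumulonimbus cloud", "Thundercloud"],
    example=(
        "After sunset, the Cumulonimbus clouds are often being transferred "
        "over the sea in Norway and that was obvious on 3rd September."
    ),
)
print(entry.render())
```

Keeping each element in its own field would let headwords be sorted or filtered, for example by corpus frequency, before the entries are printed.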
Conclusions
The study aimed to compile the Weather Forecast Corpus of English vocabularies, to analyze word
classes, collocations, compound nouns, and abbreviations, and to classify word lists in the
corpus into general, academic, and technical groups in order to design a sample meteorological
dictionary. The texts used in weather forecasting tasks were divided into 2 main types of
products: 1) weather forecasts and 2) meteorological documents. The texts were then separated
into 13 sub-types. The results present all word lists (groups of vocabularies) in terms of word
frequency and percentage. According to the statistical analysis of the WFC, the whole corpus
consists of 555,818 tokens and 13,172 word types, and the type/token ratio was 2.37 or 1:29.
This research, however, is only a tentative study of technical words in synoptic meteorology,
which is just one discipline within meteorology. Future studies in other disciplines of
meteorology (such as aeronautical meteorology, agricultural meteorology, hydrological
meteorology, maritime meteorology, biometeorology, astrometeorology, military meteorology,
medical meteorology, and highway meteorology) ought to be carried out in order to achieve more
reliable results. Meteorological dictionaries used in Thailand should be compiled by means of
corpus-based analysis to select representative headwords. Further studies on meteorological
symbols (both letters and numbers) as well as multi-word units in meteorological texts should
also be carried out. In conclusion, using a corpus to design technical dictionaries gives
confidence that they can become a better tool for satisfying users’ demands for technical
vocabulary.
Acknowledgements
I am genuinely grateful to Assoc. Dr. Jirapa Vitayapirak, my advisor, for her invaluable
guidance, precious time, beneficial comments, and patience in editing this paper. I would also
like to thank Mr. Mana Tosatjawong for giving advice on software use as well as for sharing
relevant experience, knowledge, and resources.
References
Aarts, B., & McMahon, A. (2006). The handbook of English linguistics. United Kingdom: Blackwell.
American Meteorological Society. (2000). Glossary of meteorology. Retrieved from
http://amsglossary.allenpress.com/glossary.
Bauman, J., & Culligan, B. (1995). The General Service List. Retrieved from
http://Jbauman.com/gsl.html.
Chung, T. M., & Nation, P. (2003). Technical vocabulary in specialised texts. Reading in a
Foreign Language 15 (2): 103-116.
Coxhead, A. (2000). A New Academic Word List. TESOL Quarterly, 34, 213-238.
Ellendersen, J. (2007). Grammar in dictionaries of languages for special purposes. Retrieved from
http://pure.au.dk/portal-asb-student/files/1462/000161028-161028.pdf
Gouws, R. H., & Prinsloo, D. J. (2005). Left-expanded article structures in Bantu with special
reference to Isizulu and Sepedi. International Journal of Lexicography 18(1): 25-46.
Landau, B. (2001). Perceptual units and their mapping with language. New York: Elsevier.
Macmillan. (2009). How was the corpus used in creating the dictionary? Retrieved from
http://www.macmillandictionary.com/about.html
Mallikamas, P. (1999). Application of Corpora in Language Teaching. Thai TESOL BULLETIN,
February, 1-17.
Nation, I.S.P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University
Press.
Nation, P., & Waring, R. (1997). Vocabulary size, text coverage and word lists. In N. Schmitt
and M. McCarthy (Eds.), Vocabulary: description, acquisition and pedagogy (pp. 6-19).
Redman, S., & Edwards, L. (2003). English vocabulary in use (pre-intermediate and intermediate). (2nd ed.).
United Kingdom: Cambridge University Press.
Robinson, P. C. (1991). ESP today: a practitioner’s guide. New York: Prentice Hall.
Scott, M. (1998). Wordsmith Tools. Oxford: Oxford University Press.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Thai Meteorological Department. (1979). Glossary of meteorological terms. Bangkok: Thai
Meteorological Department Press.
Thai Meteorological Department. (2007). Glossary of Meteorology. Retrieved from
http://www.tmd.go.th/met_dict.php
Thai Meteorological Department. (2006). Thai Meteorological Department’s Visions. Retrieved from
http://www.tmd.go.th/en/aboutus/vision.pp
Trask, R.L. (1999). Language: the basics. (2nd ed.). Routledge
West, M. (1953). A General Service List of English Words. London: Longman, Green.
The Authors
Siwarak Paengpho was born in Nakhonratchasima, Thailand, in 1984. Education: 2002-2004,
Bachelor’s degree in Political Science, Ramkhamhaeng University; 2008-present, M.A. student at
King Mongkut’s Institute of Technology Ladkrabang. Work experience: 2004-present, meteorological
officer in the Northeastern Meteorological Center, at PakChong Agrometeorological Station.
Jirapa Vitayapirak is a professor at the Faculty of Industrial Education, King Mongkut’s
Institute of Technology Ladkrabang (KMITL).