Examining the GloWbE corpus for lexicographic evidence in SgE, MyE and HkE
Vincent B Y Ooi
National University of Singapore

Outline
- A. Using the Internet as a lexicographic resource
- B. Summary and application of Davies and Fuchs’ (2015) findings to Singapore, Malaysia and Hong Kong
- C. Findings beyond Davies and Fuchs: Singapore, Malaysian and Hong Kong English
- D. Evaluation of the GloWbE corpus for lexicographic evidence

A. Using the Internet as a lexicographic resource
- (Fuertes-Olivera 2012) “I am using the concept of corpus in a lexicographical way, i.e., a lexicographical corpus is any collection of texts where lexicographers can find inspiration for completing the dictionary structures they need when making a real dictionary… I will focus on ways of exploiting and exploring the Internet as a lexicographical corpus, i.e., the virtual space in which lexicographers can easily access data they might need.”

A. Sinclair (2004) on the Web
- “The World Wide Web is not a corpus, because its dimensions are unknown and constantly changing, and because it has not been designed from a linguistic perspective. At present it is quite mysterious, because the search engines, through which the retrieval programs operate, are all different, none of them are comprehensive, and it is not at all clear what population is being sampled.”

A. Sinclair, on curating the Web
- “It is important to know precisely what is actually copied or downloaded from a web page. This is not always obvious, and quite often it is not at all the document that is required…The cheerful anarchy of the Web thus places a burden of care on a user, and slows down the process of corpus building. The organisation and discipline has to be put in by the corpus builder.”

B. English World-Wide 36:1 (Feb 2015)
“The Internet as a lexicographical resource” can be reified, especially for those interested in varieties of English, in the 1.9 billion-word Global Web-based English Corpus (GloWbE)
- Davies and Fuchs (2015) – February issue of English World-Wide
- Responses to D&F by:
  i) Christian Mair, for Nigerian English etc.
  ii) Joybrato Mukherjee, for South Asian English
  iii) Pam Peters, for Australian English etc.
  iv) Gerald Nelson, for the ICE corpus

B. Davies and Fuchs on the ICE corpus
- ICE corpus
  - 1m words each (600,000 spoken; 400,000 written)
  - 14 varieties of English (all of which GloWbE also covers) – a total of merely 12.2 million words
- However, ICE is limited in size
  - Enough data for high-frequency syntactic constructions only
  - Not so useful for lexical variation, which needs more data examples (for lexicographic evidence)

B. Motivation for GloWbE
- Need for a larger corpus to study World Englishes
- The GloWbE corpus
  - 1.9b words
  - 20 different countries (6 inner circle, 14 outer circle)
  - Notice that Expanding Circle countries (e.g. Japan, China, Korea) are excluded
  - Strength: compare the frequency of a word, phrase or grammatical construction across these different varieties of English → mapping different varieties of English

B. D&F on data collection
- Genre balance
  - Between formal & informal language (like ICE corpora)
  - ~60% from informal blogs, ~40% from other formal genres & text types
- Accuracy in identifying dialect
  - Google “Advanced Search”, limited search by the region (← we’ll revisit this later)

B. Size by country (VO – note the uneven sizes)

B. D&F – Lexical variation: freak out

B. D&F – Concordance for freak out

B. D&F – Lexical variation: fortnight
- More British English than U.S. English

B. D&F – Lexical variation: banjaxed
- Irish English (‘ruined’, ‘screwed up’)

B. D&F – Lexical variation: eve teas
- “Public sexual harassment”

B. D&F – Lexical variation: handphone
- Mobile / cell phone

B. D&F – Lexical variation: equipments

B. D&F – Phraseology: [keep in] view

B. D&F – Phraseology: [discuss] about

B. D&F – (be) different to

B. D&F – (be) different than

B. D&F – had + {gotten/got}

B. D&F – the quotative “like” construction (May Wong on HkE)

B. Singular/plural agreement: Each of them is/are (“innovative” plural)

B. The “way” construction (not typically HkE – May Wong)

C. Singapore: killer litter

C. Dictionary entry for killer litter (this is not Singlish)
- killer litter /…./ noun (uncount; Singapore and Malaysian English)
- Killer litter is something heavy, e.g. a television, that is disposed of by being thrown from the higher storeys of a building, putting passers-by below at risk of injury: The throwing of killer litter is irresponsible and highly dangerous.

C. GloWbE: killer litter

C. GloWbE concordance: killer litter

C. Google Advanced Search

C. Google Advanced Search: killer litter (Au)

C. GloWbE: lepak (Malaysian and Singapore English; not in HkE)

C. Concordance of lepak (MyE; SgE)

C. Google Advanced Search: lepak (MyE)

C. Oxforddictionaries.com: lepak

C. GloWbE: shroff

C. Concordance for shroff

C. shroff in Oxforddictionaries.com

C. Dictionary definition for shroff/shroffing

C. Measuring diglossia – kiasu, the most prototypical “Singlish” item

C. Measuring diglossia – kiasu for Sri Lanka?! (TCEED2 Appendix entry for kiasu)
- kiasu / … /: adjective
- (of a person) afraid to lose out.

C. kiasu in Oxforddictionaries.com

D. Evaluating GloWbE for lexicographic evidence
- What does GloWbE represent? “Whatever is found on the web…[so] it may include very little from certain genres, such as students’ academic writing, fiction and business letters.” (D&F responding to Nelson)
- “Blogs are not the same as spontaneous spoken conversation” (D&F). This may pose an issue for capturing informal/colloquial Malaysian, Singapore and Hong Kong English. [“Singlish”, for instance, is inherently spoken in nature].
Still, GloWbE is remarkable in capturing quite a number of the sociocultural features characteristic of the informal varieties, e.g. kiasu, the most prototypical Singlish item.

D. Evaluating GloWbE for lexicographic evidence
- Mair asks whether blogs constitute a recognizable genre in the first place.
- While this is a fair question, the 60% proportion of blogs may mean that everyday topics and everyday values are represented in the personal blog (but D&F haven’t disclosed the proportion of different types of blogs, e.g. travel blogs). There’s also the question of ‘blog death’, so it would be good to know how the sampling is done.

D. Evaluating GloWbE for lexicographic evidence
- Gerry Nelson and J Mukherjee suggest that some writers from a particular country domain may not actually be from the country in question. D&F say that they provide the original URLs for each of the 1.8 million pages. Users may want to examine the original pages in doubtful cases.

D. Evaluating GloWbE for lexicographic evidence
- In conclusion, GloWbE is useful as a welcome and additional “toolbox” for researchers of world Englishes. It should be triangulated with the ICE corpus and other sources of data available.
- GloWbE still allows us to confirm many of our intuitions and provisional findings on varieties of English. For Stephanie Horch (Mair’s student), “GloWbE is the best source of data: free, fast, vast.”

D. Evaluating GloWbE for lexicographic evidence
- Disadvantages
  - No actual spoken material
  - A particular website is from a particular country, but the speaker’s origin was not checked
- Davies & Fuchs encourage us to use the various corpora available in a combinational & complementary way (my emphasis!)

References
Bolton, K. 2003. Chinese Englishes: A Sociolinguistic History. Cambridge: Cambridge University Press.
Davies, M. and R. Fuchs. 2015. Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English corpus (GloWbE). In English World-Wide 36:1, pp. 1-29.
Fuertes-Olivera, P. 2012. Lexicography and the Internet as a (Re-)source. In Lexicographica 28:1.
Kilgarriff, A. and G. Grefenstette. 2003. Web as corpus. URL: http://www.kilgarriff.co.uk/Publications/2003-KilgGrefenstetteWACIntro.pdf
Sinclair, J. 2004a. Corpus and text – basic principles. In Developing Linguistic Corpora: A Guide to Good Practice. URL: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm
Sinclair, J. 2004b. Appendix – how to build a corpus. URL: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/appendix.htm

Thank you for your kind attention!