Analysing lexical density and lexical diversity in university students’ written discourse

Begoña Clavel-Arroitia
Carmen Gregori-Signes
(Universitat de València - IULMA)

1. Introduction
2. Lexical Richness
3. Corpus description
4. The study
5. Results and discussion
6. Conclusions

Introduction

Lexical Richness
• Lexical Density
• Lexical Frequency Profile
(Ure 1971; McKee, Malvern, Richards 2000; Menard 1983; Vermeer 2000; Johansson 2008)

A) Lexical Density

Lexical density is the percentage of lexical words in the text, i.e., nouns, verbs, adjectives and adverbs. A text is considered 'dense' if it contains many lexical words relative to the total number of words, i.e., lexical and functional words together.

TTR (type-token ratio measure)

LD = (Number of lexical tokens / Total number of tokens) x 100

Problems with lexical density (LD)
• Length
• Topic
• Task

B) Lexical Frequency Profile (Laufer & Nation 1995)

Lexical Frequency Profile (LFP) shows the percentage of words a learner uses at different vocabulary frequency levels, i.e., the relative proportion of words from different frequency levels.
• word tokens
• word types
• word families

Word types & word tokens

Types are word-forms and tokens are occurrences of word-forms.
'The cat sat on the mat':
- two tokens of the type 'the'
- one token of each of the types 'cat', 'sat', 'on' and 'mat'

Word families

A word is defined in the programme as a base form with its inflected and derived forms, i.e., a word family:
push, pushed, pushes, pushing

Lexical Frequency Profile

The advantages of LFP over other measures, according to Laufer and Nation (1995: 313ff), can be summarized as follows:
✓ LFP is largely independent of syntax and text cohesiveness (LD is not).
✓ LFP provides lists of different types of words, defined by frequency level (LS, for example, only differentiates between frequent and sophisticated words).
✓ LFP is independent of learning material, reading lists etc., since it is defined in terms of frequency.
✓ LFP discriminates between subjects who use less or more frequent vocabulary.
✓ LFP is free of subjective decisions regarding what topic, subtopic, elaboration and thematic unit are.

Corpus description

CASTLE Research Project: Corpus analysis of the written production of Spanish university students (CASTLE-UV-INVPRECOMP12-80754; CASTLE-GV/2014/022)

LONGDALE Corpus (University of Louvain, Dr. Granger)
CASTLE Corpus

CASTLE project

Antecedents
• Spain has a number of learner corpora, but:
  - there is no English learner corpus at the University of Valencia (around 60,000 students)
  - there are few English learner corpora by bilingual students in Spain (Barcelona, Santiago, Madrid), and of students with Catalan as L1 (Barcelona)
  - there are even fewer longitudinal corpora with the above features
• Duration of first phase of the project: Oct-May 2013.

CASTLE project

Objectives:
✓ To compile a corpus of compositions by students of the Degree in English Studies in Valencia.
✓ To keep track of the students’ linguistic evolution, starting in the 2012/13 academic year.
✓ Source: courses English Language 1-8 (B1 to C2 level).
✓ Four-step sampling: samples 1 and 2 were collected at the start and end of the first semester; samples 3 and 4 were collected at the start and end of the second semester. Each sample had around 400 words.

We then had 1,300 texts, c. 500,000 tokens. Now we have 488 texts, c. 192,388 tokens (second edition of the project).

Our corpus:

Corpus A
• First-year students (B1 level)
• Corpus A(1) and Corpus A(2): 27 essays in each one
• First task

Corpus B
• First-year students (B1 level)
• Corpus B(1) and Corpus B(2): 27 essays in each one
• Third task (same students as Corpus A)

Corpus C
• Fourth-year students (C1+ level)
• 27 essays
• Fourth task

TASKS

Writing 1. 11th October. Task: “Last week you spent the day in another town with a group of friends.
Write a letter to an English speaking friend telling him/her: Where you went; Who you went with; What you did.”

Writing 3. 24th November. Task: “When applying for a course study (e.g. Erasmus stay) at a university or college, students are expected to write a personal statement in addition to completing an application form. You are applying for one of the universities included in our Erasmus programme. Write your own personal statement.”

Writing 4. 4th January. This was one of the writing tasks found in the textbook used in the course. Students had to write a letter proposing a city as the venue for an international conference.

Aims of the study
• Contrast values of lexical density and lexical frequency profile as possible measures of lexical richness in texts by the same students.
• Test different software tools that measure both (RANGE and Textalyser), in order to use them in the longitudinal study in progress.

Subjects

First-year university students who are officially registered in a Preliminary English Test (PET) course (A2 level moving to B1 level).
The writing tasks were designed to fit the requirements of the official PET examinations (B1 in the CEFR).

Analysis: Data processing
• Lexical density tests with Textalyser: http://textalyser.net/
• Lexical Frequency Profile with RANGE: http://www.victoria.ac.nz/lals/about/staff/paul-nation

Software

RANGE (Paul Nation)

RANGE is used to compare the vocabulary of up to 32 different texts at the same time. For each word, it provides:
• a range or distribution figure (how many texts the word occurs in)
• a headword frequency figure (the total number of times the actual headword type appears in all the texts)
• a family frequency figure (the total number of times the word and its family members occur in all the texts)
• a frequency figure for each of the texts the word occurs in

Textalyser

Research objectives
a) Is the LFP the same in Corpus A and Corpus B?
b) Are there significant differences between Corpus A & B and the students enrolled in the C1-C2 course (Corpus C)?
c) Are RANGE and TEXTALYSER fit for purpose? Are they adequate to detect improvement in lexis between students who supposedly have different levels (A2 to C1)?
d) Do RANGE and TEXTALYSER help us confirm the level of our students, and will they allow us to establish a point of departure for the longitudinal study?

RESULTS OF THE ANALYSIS: RANGE

                    Range of lines    Range of no. of words
Corpus A(1)         13-24             164-264
Corpus A(2)         12-24             177-294
Corpus B(1)         12-32             171-333
Corpus B(2)         12-30             172-322
Corpus C            24-50             277-518

Corpus A(1):
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 5108/85.06    614/55.12    436
Two                 337/5.61      147/13.20    126
Three               39/0.65       26/2.33      24
Not in the lists    521/8.68      327/29.35    ?????
Total               6005          1114         586

Corpus A(2):
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 5245/84.26    653/54.87    461
Two                 359/5.77      154/12.94    129
Three               33/0.53       23/1.93      22
Not in the lists    588/9.45      360/30.25    ?????
Total               6225          1190         612

Corpus B(1):
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 5471/88.41    632/63.39    417
Two                 253/4.09      108/10.83    89
Three               176/2.84      104/10.43    83
Not in the lists    288/4.65      153/15.35    ?????
Total               6188          997          589

Corpus B(2):
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 5433/87.43    655/61.91    431
Two                 271/4.36      129/12.19    100
Three               186/2.99      106/10.02    87
Not in the lists    324/5.21      168/15.88    ?????
Total               6214          1058         618

Corpus C:
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 7730/78.29    810/50.37    524
Two                 596/6.04      199/12.38    147
Three               568/5.75      168/10.45    118
Not in the lists    979/9.92      431/26.80    ????
Total               9873          1608         789

Corpus C (UWL):
WORD LIST           TOKENS/%      TYPES/%      FAMILIES
One                 7730/78.29    810/50.37    524
Two                 596/6.04      199/12.38    147
Three               235/2.38      71/4.42      71
Not in the lists    1312/13.29    528/32.84    ????
Total               9873          1608         742

Corpus C: academic list vs. UWL
                         TOKENS/%      TYPES/%      FAMILIES
Three (academic list)    568/5.75      168/10.45    118
Three (UWL)              235/2.38      71/4.42      71

Compared results: TOKENS
                    Corpus A(1)    Corpus A(2)    Corpus B(1)    Corpus B(2)    Corpus C
One                 5108/85.06%    5245/84.26%    5471/88.41%    5433/87.43%    7730/78.29%
Two                 337/5.61%      359/5.77%      253/4.09%      271/4.36%      596/6.04%
Three               39/0.65%       33/0.53%       176/2.84%      186/2.99%      568/5.75%
Not in the lists    521/8.68%      588/9.45%      288/4.65%      324/5.21%      979/9.92%

Compared results: TYPES
                    Corpus A(1)    Corpus A(2)    Corpus B(1)    Corpus B(2)    Corpus C
One                 614/55.12%     653/54.87%     632/63.39%     655/61.91%     810/50.37%
Two                 147/13.20%     154/12.94%     108/10.83%     129/12.19%     199/12.38%
Three               26/2.33%       23/1.93%       104/10.43%     106/10.02%     168/10.45%
Not in the lists    327/29.35%     360/30.25%     153/15.35%     168/15.88%     431/26.80%

Compared results: FAMILIES
                    Corpus A(1)    Corpus A(2)    Corpus B(1)    Corpus B(2)    Corpus C
One                 436            461            417            431            524
Two                 126            129            89             100            147
Three               24             22             83             87             118
Total               586            612            589            618            789
Most widely used words from Lists 2 and 3, by group (Range = number of texts the word occurs in)

Group          List      Word             Range    Frequency
Corpus C       List 2    bus              23       27
                         hotels           23       26
                         international    18       35
                         entertainment    18       19
                         perfect          15       22
                         ideal            14       14
                         reputation       13       13
                         cheap            12       14
               List 3    conference       25       55
                         transport        23       44
                         accommodation    22       37
                         facilities       18       29
                         annual           16       17
                         located          12       18

Group          List      Word             Range    Frequency
Corpus A(1)    List 2    lunch            13       13
                         dinner           12       15
               List 3    finally          6        6
Corpus A(2)    List 2    lunch            9        9
                         dinner           9        9
                         hotel            8        16
                         lot              8        15
                         restaurant       8        10
                         trip             8        10
                         clock            8        9
                         park             8        8
               List 3    finally          5        6

Group          List      Word             Range    Frequency
Corpus B(1)    List 2    improve          18       23
                         lot              14       21
               List 3    culture          8        10
                         furthermore      9        9
                         research         7        7
Corpus B(2)    List 2    improve          16       25
                         lot              11       15
                         program          9        14
                         sincerely        8        8
               List 3    culture          14       18
                         academic         7        8
                         cultures         6        7
RESULTS OF THE ANALYSIS: TEXTALYSER

Group          Lexical density    Gunning-Fog Index    Readability (Alternative) beta
Corpus C       28.6%              11.7                 35.8
Corpus A(1)    36.5%              6.2                  67.3
Corpus A(2)    36.6%              6.4                  68
Corpus B(1)    30.7%              10.6                 40
Corpus B(2)    32.2%              10.4                 41.9

Readability (Gunning-Fog Index): 6 = easy, 20 = hard.
Readability (Alternative) beta: 100 = easy, 20 = hard, optimal 60-70.

Corpus C
Word          Occurrences    Frequency    Rank
city          207            3.8%         1
you           127            2.3%         2
valencia      82             1.5%         3
our           77             1.4%         4
your          57             1.1%         5

Corpus A(1)
Word          Occurrences    Frequency    Rank
you           98             3.3%         1
went          85             2.9%         2
really        50             1.7%         3
friends       39             1.3%         4
weekend       36             1.2%         5

Corpus A(2)
Word          Occurrences    Frequency    Rank
you           113            3.6%         1
went          80             2.5%         2
really        41             1.3%         3
see           35             1.1%         4
tell          35             1.1%         4

Corpus B(1)
Word          Occurrences    Frequency    Rank
english       92             3%           1
university    72             2.4%         2
course        54             1.8%         3
like          51             1.7%         4
you           41             1.4%         5

Corpus B(2)
Word          Occurrences    Frequency    Rank
english       98             3.2%         1
university    84             2.8%         2
like          47             1.5%         3
your          42             1.4%         4
you           34             1.1%         5

CONCLUSIONS
✓ Both RANGE and TEXTALYSER are useful for showing similarities between the two same-level corpora (A & B).
✓ Similar values indicate that the students, who come from many different backgrounds, have a similar level in their first year.
✓ Both corpora show the same progression between writing 1 and 2, and there is a difference between corpora A & B and corpus C.
✓ These results should be studied qualitatively and compared with other subcorpora.
✓ Preliminary, but a good start to try with the rest of the subcorpora that we are collecting.
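
Supplementary illustration

The measures defined under Lexical Richness (word tokens, word types, the LD formula and the frequency-level percentages of the LFP) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function-word set and the two toy word lists are hypothetical stand-ins, the function names are invented for illustration, and matching is done on word forms rather than word families. It does not reproduce how RANGE or Textalyser work internally.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word set. A real analysis would
# rely on part-of-speech tagging or a full function-word list.
FUNCTION_WORDS = {"the", "a", "an", "on", "in", "of", "to", "and", "with"}


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_density(tokens):
    """LD = (number of lexical tokens / total number of tokens) x 100."""
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100.0 * len(lexical) / len(tokens)


def frequency_profile(tokens, word_lists):
    """Percentage of tokens found in each named word list.

    Tokens matching no list are counted under 'not in the lists'.
    Matching is on word forms, not on word families.
    """
    if not tokens:
        return {}
    counts = Counter()
    for token in tokens:
        for name, words in word_lists.items():
            if token in words:
                counts[name] += 1
                break
        else:
            counts["not in the lists"] += 1
    return {name: 100.0 * n / len(tokens) for name, n in counts.items()}


tokens = tokenize("The cat sat on the mat")
types = Counter(tokens)

print(len(tokens))              # 6 tokens
print(len(types))               # 5 types
print(types["the"])             # 2 tokens of the type 'the'
print(lexical_density(tokens))  # 50.0 ('cat', 'sat', 'mat' out of 6 tokens)

# Toy word lists, invented for this example only.
toy_lists = {"one": {"the", "cat", "sat", "on"}, "two": {"mat"}}
print(frequency_profile(tokens, toy_lists))
```

With the example sentence from the slides, the sketch reports six tokens and five types, and three lexical tokens out of six give an LD of 50%. The percentages in the RANGE tables follow the same arithmetic, e.g. 5108 of 6005 tokens is 85.06%.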