Analysing lexical density and lexical
diversity in university students’ written
discourse
Begoña Clavel-Arroitia
Carmen Gregori-Signes
(Universitat de València -IULMA)
1.  Introduction
2.  Lexical Richness
3.  Corpus description
4.  The study
5. Results and discussion
6. Conclusions
Introduction
Lexical Richness: Lexical Frequency Profile and Lexical Density
(Ure 1971; McKee, Malvern & Richards 2000; Menard 1983; Vermeer 2000; Johansson 2008)
A) Lexical Density
Lexical density is the percentage of lexical words
in the text, i.e., nouns, verbs, adjectives and adverbs.
A text is considered 'dense' if it contains many
lexical words relative to the total number of
words, i.e., lexical and functional words.
TTR (type token ratio measure)
LD = (number of lexical tokens x 100) / total number of tokens
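The LD formula above can be sketched in a few lines of Python. This is only an illustration, not the procedure used by Textalyser: the `FUNCTION_WORDS` set is a hypothetical, deliberately tiny list, and a real analysis would rely on a part-of-speech tagger or a full list of function words.

```python
import re

# Hypothetical, deliberately tiny list of function words (an assumption
# made for this sketch, not the list used by any particular tool).
FUNCTION_WORDS = {
    "a", "an", "the", "on", "in", "at", "of", "to", "and", "or", "but",
    "is", "are", "was", "were", "it", "he", "she", "they", "we", "you", "i",
}

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_density(text):
    """Return LD = (lexical tokens x 100) / total tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100 * len(lexical) / len(tokens)

# 'cat', 'sat' and 'mat' are lexical; 'the' (twice) and 'on' are functional.
print(lexical_density("The cat sat on the mat"))  # 50.0
```

With this word list, three of the six tokens count as lexical, so the sentence has a lexical density of 50%.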
Problems with lexical density (LD)
Factors affecting lexical density: length, topic, task
B) Lexical Frequency Profile (Laufer & Nation 1995)
Lexical Frequency Profile (LFP) shows the percentage of words
a learner uses at different vocabulary frequency levels, i.e., the
relative proportion of words from different frequency levels.
Units of counting: word tokens, word types, word families
Word types & word tokens
Types are word-forms and tokens are occurrences of word-forms.
'The cat sat on the mat':
- two tokens of the type 'the'
- one token of each of the types 'cat', 'sat', 'on', and 'mat'
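The type/token distinction in the example above can be checked with a short Python sketch. The tokenizer (lower-casing and keeping only alphabetic strings) is an assumption made for illustration; the type-token ratio (TTR) is simply the number of types divided by the number of tokens.

```python
import re
from collections import Counter

def type_token_counts(text):
    """Map each word type to its number of tokens (occurrences)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = type_token_counts("The cat sat on the mat")
n_tokens = sum(counts.values())  # 6 tokens
n_types = len(counts)            # 5 types: the, cat, sat, on, mat
ttr = n_types / n_tokens         # type-token ratio

print(counts["the"], n_tokens, n_types)  # 2 6 5
```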
Word families
A word is defined in the programme as a base form with
its inflected and derived forms, i.e., a word family:
push, pushed, pushes, pushing
Lexical Frequency Profile
The advantages of LFP over other measures, according to Laufer and
Nation (1995: 313ff), can be summarized as follows:
✓  LFP is largely independent of syntax and text cohesiveness (LD is not).
✓  LFP provides lists of different types of words, defined by frequency
level (LS, for example, only differentiates between frequent and
sophisticated words).
✓  LFP is independent of learning material, reading lists, etc., since it is
defined in terms of frequency.
✓  LFP discriminates between subjects who use less or more frequent
vocabulary.
✓  LFP is free of subjective decisions regarding what topic, subtopic,
elaboration and thematic unit are.
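As a rough illustration of how a frequency profile is obtained, the sketch below assigns every token to a frequency list through its word family and reports the share of tokens per list. The miniature `BAND_LISTS` are hypothetical stand-ins invented for this example; they are not the base-word lists shipped with RANGE.

```python
import re
from collections import Counter

# Hypothetical miniature frequency lists: headword -> family members.
# Invented for illustration only.
BAND_LISTS = {
    "one": {
        "push": {"push", "pushed", "pushes", "pushing"},
        "the": {"the"},
    },
    "two": {"hotel": {"hotel", "hotels"}},
    "three": {"conference": {"conference", "conferences"}},
}

def build_lookup(band_lists):
    """Map every word form to the (list, headword) of its family."""
    lookup = {}
    for band, families in band_lists.items():
        for headword, members in families.items():
            for form in members:
                lookup[form] = (band, headword)
    return lookup

def lexical_frequency_profile(text, band_lists=BAND_LISTS):
    """Return the percentage of tokens that fall in each frequency list."""
    lookup = build_lookup(band_lists)
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return {}
    band_counts = Counter()
    for token in tokens:
        band, _headword = lookup.get(token, ("not in the lists", None))
        band_counts[band] += 1
    return {band: round(100 * n / len(tokens), 2)
            for band, n in band_counts.items()}

profile = lexical_frequency_profile("She pushed the hotels conference")
print(profile)
```

Here 'pushed' and 'the' fall in list one, 'hotels' in list two, 'conference' in list three, and 'she' is not in any of the toy lists.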
Corpus description
CASTLE Research Project: Corpus analysis of the written
production of Spanish university students (CASTLE-UV-INVPRECOMP12-80754; CASTLE-GV/2014/022)
LONGDALE Corpus (University of Louvain, Dr. Granger)
CASTLE Corpus
CASTLE project
Antecedents
•  A number of learner corpora exist in Spain, but:
   ·  there is no English learner corpus at the University of Valencia (around 60,000 students)
   ·  there are few English learner corpora by bilingual students in Spain (Barcelona, Santiago, Madrid), and of students with Catalan as L1 (Barcelona)
   ·  there are even fewer longitudinal corpora with the above features
•  Duration of the first phase of the project: Oct-May 2013.
CASTLE project
Objectives:
✓  To compile a corpus made up of compositions from students of the Degree in English Studies in Valencia.
✓  To keep track of the students' linguistic evolution starting in the 2012/13 academic year.
✓  Source: courses English Language 1-8 (B1 to C2 level)
✓  Four-step sampling: samples 1 and 2 were collected at the start and end of the first semester; samples 3 and 4 were collected at the start and end of the second semester. Each sample had around 400 words. We then had 1,300 texts, c. 500,000 tokens. Now we have 488 texts, c. 192,388 tokens (second edition of the project).
Our corpus: Corpus A
•  First year students (B1 level)
•  Corpus A (1) and Corpus A (2): 27 essays in each one
•  First task
Corpus B
•  First year students (B1 level)
•  Corpus B (1) and Corpus B (2): 27 essays in each one
•  Third task (same students as Corpus A)
Corpus C
•  Fourth year students (C1+ level)
•  27 essays
•  Fourth task
TASKS
Writing 1. 11th October.
Task: “Last week you spent the day in another town with a group of friends. Write
a letter to an English speaking friend telling him/her: Where you went; Who you
went with; What you did.”
Writing 3. 24th November.
Task: “When applying for a course study (e.g. Erasmus stay) at a university
or college, students are expected to write a personal statement in addition
to completing an application form. You are applying for one of the
universities included in our Erasmus programme. Write your own personal
statement.”
Writing 4. 4th January.
It was one of the writing tasks found in the textbook used in the course.
Students had to write a letter proposing a city as the venue to hold an
international conference.
Aims of the study
•  Contrast values of lexical density & lexical frequency profile as possible measures of lexical richness in texts by the same students
•  Test different software that measures both, to use it in the longitudinal study in progress (RANGE & Textalyser)
Subjects
  First year students at university who are officially
registered in a Preliminary English Test (PET)
course (A2 level moving to B1 level)
  The writing tasks were designed to fit the
requirements of the PET (B1 in the CEFR) official
examinations.
Analysis: Data processing
  Lexical density tests with Textalyser: http://textalyser.net/
  Lexical Frequency Profile with RANGE: http://www.victoria.ac.nz/lals/about/staff/paul-nation
Software RANGE (Paul Nation)
  RANGE is used to compare the vocabulary of up to 32
different texts at the same time.
For each word, RANGE provides:
- a range or distribution figure (how many texts the word occurs in)
- a headword frequency figure (the total number of times the actual headword type appears in all the texts)
- a family frequency figure (the total number of times the word and its family members occur in all the texts)
- a frequency figure for each of the texts the word occurs in
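The figures listed above can be approximated with a short Python sketch. It is not the RANGE program itself: it works on word forms only, so it yields the range figure, the total frequency of each form and its per-text frequencies, but not the family frequency figure, which would additionally require the word-family lists. The function name `range_figures` and the sample texts are hypothetical.

```python
import re
from collections import Counter

def range_figures(texts):
    """For each word form, compute its range (number of texts it occurs in),
    its total frequency, and its frequency in each individual text."""
    per_text = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    vocabulary = set().union(*per_text)
    figures = {}
    for word in vocabulary:
        freqs = [counts[word] for counts in per_text]
        figures[word] = {
            "range": sum(1 for f in freqs if f > 0),
            "frequency": sum(freqs),
            "per_text": freqs,
        }
    return figures

# Hypothetical sample texts, invented for illustration.
sample_texts = [
    "We had lunch. Lunch was good.",
    "Dinner was late.",
    "Lunch and dinner",
]
figures = range_figures(sample_texts)
print(figures["lunch"])
```

In this toy sample 'lunch' occurs in two of the three texts (range 2) with a total frequency of 3.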
Textalyser
Research objectives
  a) Is the LFP the same in Corpus A & B?
  b) Are there significant differences between Corpus A & B
and the students enrolled in the C1-C2 course (Corpus C)?
  c) Are RANGE and TEXTALYSER fit for purpose? Are they adequate
to detect improvement in lexis between students who
supposedly have different levels (A2 to C1)?
  d) Do RANGE and TEXTALYSER help us confirm the level of
our students, and will they allow us to establish a point of
departure for the longitudinal study?
RESULTS OF THE ANALYSIS
RANGE
Corpus        Range of lines   Range of no. of words
CorpusA(1)    13-24            164-264
CorpusA(2)    12-24            177-294
CorpusB(1)    12-32            171-333
CorpusB(2)    12-30            172-322
CorpusC       24-50            277-518
Corpus A(1):
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                5108/85.06    614/55.12    436
Two                337/5.61      147/13.20    126
Three              39/0.65       26/2.33      24
Not in the lists   521/8.68      327/29.35    ?????
Total              6005          1114         586

Corpus A(2):
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                5245/84.26    653/54.87    461
Two                359/5.77      154/12.94    129
Three              33/0.53       23/1.93      22
Not in the lists   588/9.45      360/30.25    ?????
Total              6225          1190         612

Corpus B(1):
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                5471/88.41    632/63.39    417
Two                253/4.09      108/10.83    89
Three              176/2.84      104/10.43    83
Not in the lists   288/4.65      153/15.35    ?????
Total              6188          997          589

Corpus B(2):
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                5433/87.43    655/61.91    431
Two                271/4.36      129/12.19    100
Three              186/2.99      106/10.02    87
Not in the lists   324/5.21      168/15.88    ?????
Total              6214          1058         618

Corpus C:
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                7730/78.29    810/50.37    524
Two                596/6.04      199/12.38    147
Three              568/5.75      168/10.45    118
Not in the lists   979/9.92      431/26.80    ????
Total              9873          1608         789

Corpus C (UWL):
WORD LIST          TOKENS/%      TYPES/%      FAMILIES
One                7730/78.29    810/50.37    524
Two                596/6.04      199/12.38    147
Three              235/2.38      71/4.42      71
Not in the lists   1312/13.29    528/32.84    ????
Total              9873          1608         742

Corpus C: academic list vs. UWL (list three)
                   TOKENS/%      TYPES/%      FAMILIES
Corpus C           568/5.75      168/10.45    118
Corpus C (UWL)     235/2.38      71/4.42      71
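The percentages in the RANGE output are each list's share of the corpus total. As a quick arithmetic check, the sketch below recomputes the token percentages for Corpus A(1) from the raw counts reported for that corpus; the variable and function names are chosen for this example only.

```python
# Token counts per word list, copied from the RANGE output for Corpus A(1).
corpus_a1_tokens = {
    "one": 5108,
    "two": 337,
    "three": 39,
    "not in the lists": 521,
}

def percentages(counts):
    """Return each count as a percentage of the total, to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

total_tokens = sum(corpus_a1_tokens.values())
shares = percentages(corpus_a1_tokens)

print(total_tokens)  # 6005
print(shares)
```

The recomputed total (6005) and shares (85.06, 5.61, 0.65 and 8.68) agree with the figures reported for Corpus A(1).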
Compared results:
TOKENS             CorpusA(1)     CorpusA(2)     CorpusB(1)     CorpusB(2)     CorpusC
One                5108/85.06%    5245/84.26%    5471/88.41%    5433/87.43%    7730/78.29%
Two                337/5.61%      359/5.77%      253/4.09%      271/4.36%      596/6.04%
Three              39/0.65%       33/0.53%       176/2.84%      186/2.99%      568/5.75%
Not in the lists   521/8.68%      588/9.45%      288/4.65%      324/5.21%      979/9.92%
TYPES              CorpusA(1)     CorpusA(2)     CorpusB(1)     CorpusB(2)     CorpusC
One                614/55.12%     653/54.87%     632/63.39%     655/61.91%     810/50.37%
Two                147/13.20%     154/12.94%     108/10.83%     129/12.19%     199/12.38%
Three              26/2.33%       23/1.93%       104/10.43%     106/10.02%     168/10.45%
Not in the lists   327/29.35%     360/30.25%     153/15.35%     168/15.88%     431/26.80%
FAMILIES   CorpusA(1)   CorpusA(2)   CorpusB(1)   CorpusB(2)   CorpusC
One        436          461          417          431          524
Two        126          129          89           100          147
Three      24           22           83           87           118
Total      586          612          589          618          789
Group      List     Word            Range   Frequency
Corpus C   List 2   bus             23      27
                    hotels          23      26
                    international   18      35
                    entertainment   18      19
                    perfect         15      22
                    ideal           14      14
                    reputation      13      13
                    cheap           12      14
           List 3   conference      25      55
                    transport       23      44
                    accommodation   22      37
                    facilities      18      29
                    annual          16      17
                    located         12      18
Group         List     Word          Range   Frequency
Corpus A(1)   List 2   lunch         13      13
                       dinner        12      15
              List 3   finally       6       6
Corpus A(2)   List 2   lunch         9       9
                       dinner        9       9
                       hotel         8       16
                       lot           8       15
                       restaurant    8       10
                       trip          8       10
                       clock         8       9
                       park          8       8
              List 3   finally       5       6

Group         List     Word          Range   Frequency
Corpus B(1)   List 2   improve       18      23
                       lot           14      21
              List 3   culture       8       10
                       furthermore   9       9
                       research      7       7
Corpus B(2)   List 2   improve       16      25
                       lot           11      15
                       program       9       14
                       sincerely     8       8
              List 3   culture       14      18
                       academic      7       8
                       cultures      6       7
TEXTALYSER
Groups       Lexical Density   Readability (Gunning-Fog Index)   Readability (Alternative) beta
                               (6-easy, 20-hard)                 (100-easy, 20-hard, optimal 60-70)
CorpusC      28.6%             11.7                              35.8
CorpusA(1)   36.5%             6.2                               67.3
CorpusA(2)   36.6%             6.4                               68
CorpusB(1)   30.7%             10.6                              40
CorpusB(2)   32.2%             10.4                              41.9
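For reference, the Gunning-Fog index is conventionally defined as 0.4 x (average sentence length + percentage of complex words), where complex words are usually taken to be those of three or more syllables. The sketch below applies that standard formula to pre-computed counts. How Textalyser itself segments sentences and counts syllables is not documented here, so the function and its parameter names are assumptions made for illustration.

```python
def gunning_fog(n_words, n_sentences, n_complex_words):
    """Standard Gunning-Fog formula applied to pre-computed counts.

    n_complex_words is the number of words with three or more syllables;
    how those are identified is left to the caller (an assumption of this sketch).
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    average_sentence_length = n_words / n_sentences
    percent_complex = 100 * n_complex_words / n_words
    return 0.4 * (average_sentence_length + percent_complex)

# Hypothetical counts: 100 words in 5 sentences, 10 of them complex.
score = gunning_fog(100, 5, 10)
print(round(score, 1))
```

Longer sentences and a higher share of complex words both raise the index, which is consistent with the higher values reported for the Corpus B and Corpus C texts.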
Groups       Word         Occurrences   Frequency   Rank
CorpusC      city         207           3.8%        1
             you          127           2.3%        2
             valencia     82            1.5%        3
             our          77            1.4%        4
             your         57            1.1%        5

Groups       Word         Occurrences   Frequency   Rank
CorpusA(1)   you          98            3.3%        1
             went         85            2.9%        2
             really       50            1.7%        3
             friends      39            1.3%        4
             weekend      36            1.2%        5

Groups       Word         Occurrences   Frequency   Rank
CorpusA(2)   you          113           3.6%        1
             went         80            2.5%        2
             really       41            1.3%        3
             see          35            1.1%        4
             tell         35            1.1%        4

Groups       Word         Occurrences   Frequency   Rank
CorpusB(1)   english      92            3%          1
             university   72            2.4%        2
             course       54            1.8%        3
             like         51            1.7%        4
             you          41            1.4%        5

Groups       Word         Occurrences   Frequency   Rank
CorpusB(2)   english      98            3.2%        1
             university   84            2.8%        2
             like         47            1.5%        3
             your         42            1.4%        4
             you          34            1.1%        5
CONCLUSIONS
ü  Both RANGE and TEXTALYSER are useful to show similarities
between the two same-level corpora (A & B).
ü  Similar values indicate that the students- coming from many different
backgrounds- have a similar level in their first year.
ü  Both corpora show the same progression between writing 1 and 2. And
there is a difference between corpora A & B and corpora C.
ü  These results should be studied qualitatively and compared with other
subcorpora.
ü  Preliminary but…. A good start to try with the rest of the subcorpora
that we are collecting.