AUTOMATED VOCABULARY EVALUATION IN A LEARNER CORPUS

When learners of English write academic texts, and especially when they prepare for the written parts of English examinations, they need feedback on the work they submit. Besides comments on content, coherence and cohesion, and grammatical range and accuracy, all of which instructors routinely provide, students need advice on how to improve their vocabulary. Such feedback, however, demands a great deal of time and effort from instructors, whose workload is already heavy. A learner corpus can help here, because its large collection of student texts can be processed with computer tools. This paper reports the development of a system for automated lexical inspection of student work. First, we used essays in the corpus to identify the formal parameters that distinguish essays rated highly by examination experts. We then applied those parameters in automated inspection and checked the correlation between the inspection results and the traditional grading. The resulting system of lexical inspection of student essays paves the way for automated lexical feedback that shows students how to improve the quality of their writing.


INTRODUCTION
Access to a learner corpus has been shown to increase the efficiency of L2 acquisition for learners as well as teaching efficiency for EFL instructors [1; 2]. This paper presents a computer tool for a learner corpus, designed at the computer linguistics department of the Higher School of Economics for both categories of users.
REALEC, the Russian annotated Learner Corpus set up at the School of Linguistics, is the first openly accessible collection of English texts written by Russian students learning English (http://www.realec.org/). Errors made by the students in their academic writing in English are marked with special tags by expert annotators (EFL instructors, as a rule). The annotation process is supervised by the research team, which is responsible for consistency in tagging as well as for the development of the learner corpus. One direction of this development is the study of the lexical features of student essays. Our approach in this research was to find lexical features that distinguish the essays scored highly by experts from the essays given the lowest grades.

METHODS AND RELATED RESEARCH
The essays for the research came from a past examination in IELTS format administered to 2nd-year Bachelor students at the Higher School of Economics. Each student had to write two essays within one hour: a description of an illustration showing the results of a particular piece of research (about 150 words) and an argumentative essay (about 250 words). After the examination, EFL experts evaluated the essays and assigned a grade in percentage points for each task. For the purposes of the experiment, two groups of essays were selected: those that the experts graded at 75% or higher, and those with grades of 30% or lower. We designed a special procedure of lexical inspection, i.e. computing the values of a set of lexical features commonly applied in automated evaluation of written text. Work in an adjacent field, the comparison of student texts with authentic academic texts, was reported by Benoît Lemaire and Philippe Dessus of the University of Grenoble-II, whose paper presents Apex, a system for automatic assessment of student essays based on Latent Semantic Analysis [3]. Our procedure was called REALEC Inspector. The essays in both groups were subjected to it, and the results for the two groups were compared with each other. The code for REALEC Inspector was developed from the automated readability system suggested by Konstantin Druzhkin [4; 5].
The choice of the parameters to be included in the evaluation is discussed in [6][7][8][9]. For the purposes of our experiment, the comparisons were drawn across the following features:
- Length of words;
- Length of sentences;
- Distribution of words across the Common European Framework scale levels (A1-C2);
- Frequency of each word in the Corpus of Contemporary American English (COCA);
- Use of academic vocabulary from two lists: the Coxhead Academic Word List [10; 11] and the list included in COCA;
- Repetitions;
- Use of linking words;
- Use of collocations (as attested by their presence on the Pearson Academic Collocation List).
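As an illustration of the simplest features on this list (word length, sentence length, repetitions), the following is a minimal sketch and not the REALEC Inspector code. The tokenizer, the sentence splitter, the small STOPWORDS set, and the decision to ignore stop-words when counting repetitions are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; not the list used by the tool.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercased alphabetic tokens; inner apostrophes and hyphens are kept."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def length_features(text):
    """Return word-length, sentence-length and repetition statistics."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # drop sentences with no words
    words = [w for s in sentences for w in s]
    if not words:
        raise ValueError("text contains no words")
    # Assumption: stop-words are ignored when counting repetitions.
    counts = Counter(w for w in words if w not in STOPWORDS)
    repeated = {w: n for w, n in counts.items() if n > 1}
    most_repeated = max(repeated.items(), key=lambda kv: kv[1]) if repeated else None
    return {
        "n_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "max_sentence_length": max(len(s) for s in sentences),
        "repeated_words": len(repeated),   # distinct words used more than once
        "most_repeated": most_repeated,    # (word, count) or None
    }
```

A real implementation would probably use a proper sentence tokenizer, since the regular expression above mishandles abbreviations such as "e.g.".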
The objective of our experiment was to establish the correlation between the grades that were given by experts and the automated evaluation of lexical content on the basis of certain criteria.
Our hypothesis was that the criteria applied in the developed application would be sufficient for a valid preliminary evaluation of lexical variability of written papers.

DISCUSSION
The works in the experiment fell into two groups. The first consisted of 45 sets (2 essays in each set: an argumentative essay and a description of a diagram or diagrams) that were marked by experts at 75% (out of 100%) or higher. The second consisted of 900 sets (the same types of essays) marked below 75%. The results of applying the lexical inspection to the two groups were compared. The comparison revealed that certain characteristics (for example, the average sentence length and the number of words from academic vocabulary lists) are significantly different in the two groups, which indicates that grades assigned by experts do correlate with parameters that can be evaluated by a software application. For one thing, there are more words on average in "good" texts.
The following are the main numerical results of the comparative analysis. A "good" diagram description contains 188 words on average, exactly the same number as a diagram description from the big collection of essays.
There are 294 words on average in a "good" argumentative essay versus 268 words on average in an argumentative essay from the big collection.
On average, sentences are longer in "good" essays: 21.15 words per sentence in a "good" diagram description versus 18.4 in a diagram description from the entire collection, and 20.34 words per sentence in a "good" argumentative essay versus 18.1 in an argumentative essay from the entire collection. The maximum sentence length is also greater in "good" texts: 37.5 words in a "good" diagram description versus 33.14 words in a diagram description from the entire collection, and 37.4 words in a "good" argumentative essay versus 35.75 words in an argumentative essay from the big collection. All in all, the mean sentence length correlates positively with the quality of learner writing, while extremes in sentence length tend to relate to poor quality in an essay.
At the same time, the average word length was approximately the same for "good" and "average" essays in the first collection of essays, for both argumentative essays and diagram descriptions. However, as the number of highly scored essays increased, a positive correlation emerged for this feature, which led us to conclude that word length is a factor to consider when giving feedback to students. The same holds true for the longest words in the papers and the number of word repetitions.
The number of linking words was higher in "good" diagram descriptions, though not overwhelmingly so: 3.6 versus 3.23. "Good" argumentative essays, however, demonstrate a significant difference: 8.97 versus 6.33.
The number of collocations from Pearson's list (with repetitions) for diagram descriptions was 1.35 in "good" texts versus 0.4 in "ordinary" descriptions, and 1.62 in "good" argumentative essays versus 0.71 in "average" argumentative essays.
The number of collocations without repetitions was 0.88 in "good" diagram descriptions versus 0.71 in "ordinary" ones, and 1.46 in "good" argumentative essays versus 0.67 in "average" ones.
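The distinction between counting collocations with and without repetitions can be sketched as follows. This is an illustrative sketch only: the three phrases in COLLOCATIONS are a hypothetical stand-in for the Pearson Academic Collocation List (they have not been checked against it), and matching is done on exact surface forms, whereas a real tool would probably match lemmas.

```python
import re

# Hypothetical excerpt standing in for the Pearson Academic Collocation List.
COLLOCATIONS = ["significant difference", "play a role", "sharp increase"]


def count_collocations(text, collocations=COLLOCATIONS):
    """Return (total hits, distinct collocations found, per-phrase counts)."""
    # Lowercase and strip punctuation and digits so that phrases match as
    # plain word sequences. Note that this lets a phrase match across a
    # sentence boundary.
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    hits = {}
    for phrase in collocations:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        n = len(re.findall(pattern, normalized))
        if n:
            hits[phrase] = n
    with_repetitions = sum(hits.values())
    without_repetitions = len(hits)
    return with_repetitions, without_repetitions, hits
```

Averaging the first value over a group of essays corresponds to the "with repetitions" figures, and averaging the second corresponds to the "without repetitions" figures. The same pair of counts applies to academic vocabulary items.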
In general, "good" essays have more CEFR scale words at each level, but not many more. This is largely because the good papers have more words altogether, so these figures will not be considered as a separate feature for positive feedback.
The same holds true for COCA frequencies: the "good" essays on the whole have more words at each level.
At the same time, there are notably more words from academic vocabulary lists in "good" essays. Diagram descriptions contain on average 43 academic vocabulary items versus 36 (with repetitions) and 28 versus 22 (without repetitions); argumentative essays contain on average 70 versus 56 (with repetitions) and 51 versus 40 (without repetitions). The table below gives a synopsis of the significant differences between essays scored highly and the rest of the essays. The results of the comparisons between "good" and average essays have allowed us to set up an automated application called REALEC Inspector. It presents automated feedback to authors of learner texts uploaded to the corpus (argumentative essays or descriptions of diagrams), provides statistical information based on a comparison of the essay's formal features with the average figures for essays of the same type collected in REALEC, and offers recommendations for improvement.
The stages of work with this application are the following.
The homepage has an input window with an "Inspect" button that opens the page for the lexical analysis.
The first thing on the page that appears after pressing the Inspect button is the essay itself. Then come the short statistics. In this list of short statistics, each line may be opened to reveal detailed lists, comments, or the necessary diagrams. For the histogram of the CEFR word distribution (Fig. 1), the Word Family Framework was used (the possibility of using the English Vocabulary Profile instead has been reserved as well), and each word is lemmatized with the help of NLTK. Stopwords (153 on the list) are excluded. Words that the system was unable to relate to a particular CEFR level are categorized as "Unclassified" (some misspelled words are among them).

Fig. 1. Distribution of words in a short learner text by CEFR levels
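The counting behind such a histogram can be sketched as below. This is a hedged illustration, not the tool's code: CEFR_LEVEL is a hypothetical toy mapping rather than the Word Family Framework, STOPWORDS is a small stand-in for the 153-item list, and the lemmatize function is a lookup-table placeholder for the NLTK lemmatizer mentioned above.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2", "Unclassified"]

# Hypothetical toy data; the levels are illustrative, not taken from the
# Word Family Framework.
CEFR_LEVEL = {
    "child": "A1",
    "school": "A1",
    "education": "B1",
    "significant": "B2",
    "acquire": "B2",
}
STOPWORDS = {"the", "a", "of", "in", "is", "and"}
TOY_LEMMAS = {"children": "child", "schools": "school", "acquired": "acquire"}


def lemmatize(token):
    """Placeholder for a real lemmatizer (the paper reports using NLTK)."""
    return TOY_LEMMAS.get(token, token)


def cefr_distribution(tokens):
    """Count tokens per CEFR level, skipping stop-words."""
    counts = Counter()
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        # Anything missing from the mapping (including misspellings)
        # falls into the "Unclassified" bucket.
        level = CEFR_LEVEL.get(lemmatize(token), "Unclassified")
        counts[level] += 1
    return {level: counts[level] for level in LEVELS}
```

In this sketch a misspelled token such as "educaton" is not found in the mapping and is therefore counted as "Unclassified", which mirrors the behaviour described for the histogram.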
For "Number of words from the COCA frequency lists", the author of the text gets the list of words from the essay that are among the 500 most frequent words in COCA, and then those that are among the 3000 most frequent words in COCA. Stop-words are again excluded.
The occurrence of academic words is counted against a list that combines two sources: the Coxhead Academic Word List and the academic list of the Corpus of Contemporary American English. A word is counted if it belongs to either of these lists. The result is returned in this form:
Academic words: 71 (51 unique)
After the number of words that have been used more than once in the essay, the author sees the word that was repeated the highest number of times, namely:
Word repetitions: 44 ('children' 6)
The next line presents an important index, the number of linking words in the text. The inspector then gives the number and the list of collocations from the essay that are on the Pearson Academic Collocation List (see Fig. 2 below).
If one clicks on any line with statistics, a diagram or a detailed list is presented. Fig. 3 gives an example, the distribution of average sentence length in the corpus; a similar diagram is available for the distribution of average word length. The red line on the diagram marks the average value in the essay under inspection, so that the author can compare it with other essays. The comparison can also be numerical, as a percentage is given as well. For example, if under the diagram with the average sentence length there is a figure of 90%, it means that the average sentence length is greater than in 90% of all essays in the corpus. More often than not, this is a feature of a good essay, as it implies that sentences are more sophisticated than in the majority of essays.
By contrast, a low percentage results from oversimplified sentence structure and has to be pointed out to the author as a deficiency.
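The percentage shown under a diagram can be read as the share of corpus essays whose value lies strictly below that of the inspected essay. A minimal sketch under that assumption follows; the function name and the treatment of ties (essays with an equal value are not counted as lower) are choices made for the example, not details taken from REALEC Inspector.

```python
def percent_below(value, corpus_values):
    """Percentage of corpus essays whose feature value is strictly lower."""
    if not corpus_values:
        raise ValueError("corpus_values must not be empty")
    below = sum(1 for v in corpus_values if v < value)
    return 100.0 * below / len(corpus_values)
```

For instance, with ten corpus essays of which nine have a shorter average sentence length than the inspected essay, the function returns 90.0, which corresponds to the 90% figure in the example above.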
The last stage of inspection involves the syntactic parser UDPipe, which will be described in detail in a separate publication. We mention it here because the statistical analysis described above and the results of parsing together account for the final recommendations to the author, which, when mostly positive, may look like this: You have sufficiently complex sentences in your essay. Keep it up!