Word frequency and text complexity: an eye-tracking study of young Russian readers
- Authors: Laposhina A.N., Lebedeva M.Y., Berlin Khenis A.A.
- Affiliation: Pushkin State Russian Language Institute
- Issue: Vol 26, No 2 (2022): Computational Linguistics and Discourse Complexology
- Pages: 493-514
- Section: Articles
- URL: https://journals.rudn.ru/linguistics/article/view/31335
- DOI: https://doi.org/10.22363/2687-0088-30084
Abstract
Although word frequency is often associated with the cognitive load on the reader and is widely used for automated text complexity assessment, no eye-tracking data have so far been obtained on how well this parameter predicts text complexity for Russian primary school readers. Moreover, the optimal ways of aggregating the frequency of individual words to assess the complexity of an entire text have not yet been precisely determined. This article aims to fill these gaps. The study was conducted on a sample of 53 children of primary school age. As stimulus material, we used 6 texts that differ in their scores on the classical Flesch readability formula and in the frequency of the words they contain. As sources of frequency data, we used the general frequency dictionary based on the Russian National Corpus and DetCorpus, a corpus of literature addressed to children. The speed of reading the text aloud, in words per minute averaged over the grades, served as the measure of text complexity. The best predictions of relative reading time were obtained using lemma frequency data from DetCorpus. At the text level, the highest correlation with reading speed was shown by text coverage with a list of the 5,000 most frequent words, and the two sources of the lists, the Russian National Corpus and DetCorpus, showed almost the same correlation values. For a more detailed analysis, we also calculated the correlation of the frequency parameters of specific word forms and lemmas with three parameters of oculomotor activity: dwell time, fixation count, and average fixation duration. At the word-by-word level, lemma frequency by DetCorpus demonstrated the highest correlation with relative reading time. The results confirm the feasibility of using frequency data in text complexity assessment for primary school children and demonstrate the optimal ways to calculate frequency data.
Full Text
Table 1. Main linguistic parameters of the texts used in the experiment (FD RNC is a frequency dictionary based on Russian National Corpus, DetCorpus is a corpus of literature addressed to children)
Text parameter | Text 1. Tractor | Text 2. Umka | Text 3. In the grass | Text 4. Mouse | Text 5. Flowers | Text 6. Dog |
FRE (Oborneva) | 49 | 78 | 75 | 66 | 15 | 80 |
Text coverage by the list 5000 (FD RNC) | 84% | 85% | 35% | 89% | 62% | 92% |
Average word length | 6.4 | 4.8 | 5.6 | 5.6 | 6.8 | 4.1 |
Text coverage by the list 5000 (DetCorpus) | 81% | 83% | 61% | 93% | 69% | 96% |
Percent of words with ipm < 5 (FD RNC) | 8% | 14% | 43% | 4% | 21% | 2% |
Percent of words with ipm < 5 (DetCorpus) | 11% | 12% | 22% | 4% | 24% | 0% |
Average log word frequency (FD RNC) | 4.5 | 4.2 | 3.6 | 4.7 | 3.8 | 4.9 |
Average log word frequency (DetCorpus) | 4.2 | 4.6 | 3.8 | 4.8 | 4 | 4.9 |
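The text-level frequency parameters in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `text_frequency_features`, the base-10 logarithm, the add-one smoothing for unseen lemmas, and the two-word "top list" are all assumptions made for illustration. The ipm values are the DetCorpus lemma frequencies listed in Table 2.

```python
import math


def text_frequency_features(lemmas, ipm, top_list, rare_ipm=5.0):
    """Compute three text-level frequency features for a lemmatized text.

    lemmas   -- list of lemmas, one per running word of the text
    ipm      -- dict mapping lemma -> frequency in instances per million
    top_list -- set of the N most frequent lemmas (N = 5000 in the study)
    rare_ipm -- threshold below which a word counts as rare (ipm < 5)
    """
    n = len(lemmas)
    if n == 0:
        raise ValueError("the text contains no words")
    covered = sum(1 for w in lemmas if w in top_list)
    rare = sum(1 for w in lemmas if ipm.get(w, 0.0) < rare_ipm)
    # Assumption: log10 with add-one smoothing, so a lemma with ipm = 0
    # contributes 0 instead of raising a math domain error.
    mean_log = sum(math.log10(ipm.get(w, 0.0) + 1.0) for w in lemmas) / n
    return {
        "coverage": covered / n,      # share of words found in the top list
        "rare_share": rare / n,       # share of words with ipm below threshold
        "mean_log_freq": mean_log,    # average log frequency over the text
    }


# Lemma frequencies by DetCorpus, taken from Table 2.
detcorpus_ipm = {"мальчик": 597, "гладиолус": 1.1, "себя": 2243, "современный": 14}
# Hypothetical stand-in for a list of the most frequent lemmas.
toy_top_list = {"мальчик", "себя"}

features = text_frequency_features(
    ["мальчик", "гладиолус", "себя", "современный"], detcorpus_ipm, toy_top_list
)
```

Because the smoothing and the logarithm base are assumed here, the `mean_log_freq` values from this sketch are not expected to reproduce the figures in Table 1.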
Table 2. Word-by-word values of word length, frequency and eye movement parameters (FD RNC is a frequency dictionary based on Russian National Corpus, DetCorpus is a corpus of literature addressed to children)
Word form | мальчики | гладиолусов | себе | современный |
Lemma | мальчик | гладиолус | себя | современный |
Length of word form in characters | 8 | 11 | 4 | 11 |
Length of word form in syllables | 2 | 4 | 2 | 4 |
Lemma frequency by FD RNC, ipm | 188 | 0 | 2272 | 236 |
Lemma frequency by DetCorpus, ipm | 597 | 1.1 | 2243 | 14 |
Word form frequency by FD RNC, ipm | 19 | 0 | 90 | 33 |
Word form frequency by DetCorpus, ipm | 91 | 0.4 | 86 | 4 |
Dwell time, % | 0.026 | 0.089 | 0.019 | 0.032 |
Fixation duration, ms | 257 | 288 | 255 | 250 |
Fixation count | 3.22 | 9.15 | 2.46 | 4.43 |
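As a rough illustration of how the three eye movement parameters in Table 2 relate to raw fixations, the sketch below aggregates fixations per word for a single reader. The input format (pairs of word index and fixation duration), the function name `word_level_measures`, and the definition of relative dwell time as a word's share of the total fixation time on the text are assumptions, not details confirmed by the article. The non-integer fixation counts in Table 2 suggest that the reported values are additionally averaged over readers, a step this sketch omits.

```python
from collections import defaultdict


def word_level_measures(fixations):
    """Aggregate one reader's fixations into per-word measures.

    fixations -- list of (word_index, duration_ms) pairs (hypothetical format)
    Returns a dict: word_index -> dwell share, fixation count, mean duration.
    """
    total = sum(duration for _, duration in fixations)
    if total <= 0:
        raise ValueError("no fixation time recorded")

    durations_by_word = defaultdict(list)
    for word_index, duration in fixations:
        durations_by_word[word_index].append(duration)

    measures = {}
    for word_index, durations in durations_by_word.items():
        dwell = sum(durations)  # total time spent on this word, ms
        measures[word_index] = {
            # Assumed definition: share of the total fixation time on the text.
            "dwell_share": dwell / total,
            "fixation_count": len(durations),
            "mean_fixation_ms": dwell / len(durations),
        }
    return measures


# Invented durations, for illustration only.
example = word_level_measures([(0, 250), (0, 250), (1, 500)])
```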
Fig. 1. An example of the analyzed data of oculomotor activity
Fig. 2. Average reading speed of the texts by students of grades 1–3
Table 3. Correlation analysis of oculomotor activity parameters with word frequency parameters (Spearman correlation, bold values have p-value <0.05)
Parameter | Average reading speed |
Average word length | -0.83 |
FRE (Oborneva) | 0.66 |
Text coverage by the list 5000 (FD RNC) | 0.89 |
Text coverage by the list 5000 (DetCorpus) | 0.89 |
Percent of words with ipm < 5 (FD RNC) | -0.77 |
Percent of words with ipm < 5 (DetCorpus) | -0.83 |
Average log word frequency (FD RNC) | 0.78 |
Average log word frequency (DetCorpus) | 0.85 |
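The Spearman coefficient used in Table 3 is the Pearson correlation computed on ranks, with tied values sharing their average rank. A minimal sketch in plain Python follows; the function names are hypothetical. Since the per-text reading speeds are not listed here, the usage lines correlate two rows of Table 1 with each other, which only shows the mechanics and is not a result reported in the article.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Two rows of Table 1, one value per text (Texts 1 to 6).
coverage_fd_rnc = [84, 85, 35, 89, 62, 92]       # coverage by the list 5000, %
average_word_length = [6.4, 4.8, 5.6, 5.6, 6.8, 4.1]
rho = spearman(coverage_fd_rnc, average_word_length)
```

With six texts, each text-level coefficient rests on only six rank pairs, so small changes in the ranking of a single text can shift the value noticeably.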
Table 4. Correlation analysis of oculomotor activity parameters and linguistic parameters of word forms (Spearman correlation, bold values have a p-value <0.05)
Parameter | Dwell time | Fixation duration | Fixation count |
Length of word form in characters | 0.53 | -0.02 | 0.73 |
Length of word form in syllables | 0.36 | -0.09 | 0.55 |
Lemma frequency by FD RNC, ipm | 0.55 | 0.49 | 0.46 |
Lemma frequency by DetCorpus, ipm | 0.59 | 0.42 | 0.54 |
Word form frequency by FD RNC, ipm | 0.58 | 0.47 | 0.53 |
Word form frequency by DetCorpus, ipm | 0.58 | 0.42 | 0.53 |
Table 5. An example of the output of the Textometr tool (Russian as a native language section) for the texts from the experiment
Text | Structural complexity | Lexical complexity | Estimated age |
Text 1. Tractor | 4 | 3 | 9–10 years |
Text 2. Umka | 3 | 3 | 9–10 years |
Text 3. In the grass | 2 | 7 | 9–10 years |
Text 4. Mouse | 3 | 1 | 7–8 years |
Text 5. Flowers | 9 | 6 | 13–15 years |
Text 6. Dog | 2 | 1 | 7–8 years |
About the authors
Antonina N. Laposhina
Pushkin State Russian Language Institute
Email: ANLaposhina@pushkin.institute
ORCID iD: 0000-0003-0693-7657
leading expert of the Laboratory of Cognitive and Linguistic Studies
6 Akademika Volgina street, Moscow, 117485, Russia
Maria Yu. Lebedeva
Pushkin State Russian Language Institute
Email: MULebedeva@pushkin.institute
ORCID iD: 0000-0002-9893-9846
holds a PhD in Philology and is a leading researcher of the Laboratory of Cognitive and Linguistic Studies, Associate Professor of the Department of Methods of Teaching Russian as a Foreign Language
6 Akademika Volgina street, Moscow, 117485, Russia
Alexandra A. Berlin Khenis
Pushkin State Russian Language Institute
Author for correspondence.
Email: alexa.munxen@gmail.com
ORCID iD: 0000-0003-2034-1526
specialist of the Laboratory of Cognitive and Linguistic Studies
6 Akademika Volgina street, Moscow, 117485, Russia
References
- Иомдин Б.Л., Морозов Д.А. Кто поймет «Незнайку»? Автоматическое определение сложности текстов для детей // Русская речь. 2021. № 5. С. 55-68. [Iomdin, Boris L. & Dmitry A. Morozov. 2021. Who can understand “Dunno”? Automatic assessment of text complexity in children’s literature. Russian Speech 5. 55-68 (In Russ.)]. https://doi.org/10.31857/S013161170017239-1
- Корнеев А.А., Ахутина Т.В., Матвеева Е.Ю. Особенности чтения третьеклассников с разным уровнем развития навыка: анализ движений глаз // Вестник Московского университета. Серия 14. Психология. 2019. № 2. С. 64-87. [Korneev, Aleksei A., Tatiana V. Akhutina & Ekaterina Yu. Matveeva. 2019. Reading in third graders with different state of the skill: An eye-tracking study. Vestnik Moskovskogo Universiteta. Seriya 14. Psikhologiya 2. 64-87. (In Russ.)]. https://doi.org/10.11621/vsp.2019.02.64
- Криони Н.К., Никин А.Д., Филиппова А.В. Автоматизированная система анализа сложности учебных текстов // Вестник Уфимского государственного авиационного технического университета. 2008. № 11 (1). С. 101-107. [Krioni, Nikolai K., Aleksei D. Nikin & Anastasia V. Filippova. 2008. Automated system for analyzing the complexity of educational texts. Bulletin of the Ufa State Aviation Technical University 11(1). 101-107. (In Russ.)].
- Лапошина А.Н., Веселовская Т.С., Лебедева М.Ю., Купрещенко О.Ф. Лексический состав текстов учебников русского языка для младшей школы: корпусное исследование // Компьютерная лингвистика и интеллектуальные технологии: по материалам международной конференции «Диалог 2019». 2019. Т. 18 (25). С. 351-363. [Laposhina, Antonina N., Tatiana S. Veselovskaya, Maria U. Lebedeva & Olga F. Kupreshchenko. 2019. Lexical analysis of the Russian language textbooks for primary school: Corpus study. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2019” 18. 351-363. (In Russ.)].
- Мартынова Е.В., Солнышкина М.И., Мерзлякова А.Ф., Гизатулина Д.Ю. Лексические параметры учебного текста (на материале текстов учебного корпуса русского языка) // Филология и культура. 2020. № 3 (61). С. 72-80. [Martynova, Ekaterina V., Marina I. Solnyshkina, Amina F. Merzlyakova & Diana Yu. Gizatulina. 2020. Lexical parameters of the academic text (based on the texts of the academic corpus of the Russian language). Philology and Culture 3. 72-80. (In Russ.)]. https://doi.org/10.26907/2074-0239-2020-61-3-72-80
- Мизернов И.Ю., Гращенко Л.А. Анализ методов оценки сложности текста. // Новые информационные технологии в автоматизированных системах. 2015. № 18. С. 572-581. [Mizernov, I. Yu. & L. A. Grashchenko. 2015. Analysis of methods for assessing text complexity. New Information Technologies in Automated Systems 18. 572-581. (In Russ.)].
- Микк Я.А. О факторах понятности учебного текста: автореф. дис. … канд. пед. наук. Тарту, 1970. 22 с. [Mikk, Ya.A. 1970. Factors of educational text clarity. Abstract of Pedagogy Cand. Diss. Tartu. (In Russ.)].
- Оборнева И.В. Автоматизированная оценка сложности учебных текстов на основе статистических параметров: дис... канд. пед. наук: 13.00.02. М., 2006. 165 с. [Oborneva, Irina V. 2006. Automated estimation of complexity of educational texts on the basis of statistical parameters. Pedagogy Cand. Diss. Moscow. (In Russ.)].
- Солнышкина М.И., Кисельников А.С. Сложность текста: этапы изучения в отечественном прикладном языкознании. // Вестник Томского государственного университета. Филология. 2015. № 6 (38). С. 86-99. [Solnyshkina, Marina I. & Alexander S. Kiselnikov. 2015. Text complexity: Study phases in Russian linguistics. Tomsk State University Journal of Philology 6. 86-99. (In Russ.)]. https://doi.org/10.17223/19986645/38/7
- Шпаковский Ю.Ф. Разработка количественной методики оценки трудности восприятия учебных текстов для высшей школы // Научно-технический вестник информационных технологий, механики и оптики. 2008. № 1 (83). С. 110-117. [Shpakovsky, Yury F. 2008. Development of a quantitative methodology for assessing the difficulty of perceiving educational texts for higher education. Scientific and Technical Bulletin of Information Technologies, Mechanics and Optics 1(83). 110-117. (In Russ.)].
- Chall, Jeanne S. & Edgar Dale. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Cambridge, MA: Brookline Books.
- Chen, Xiaobin & Detmar Meurers. 2016. Characterizing text difficulty with word frequencies. In Joel Tetreault, Jill Burstein, Claudia Leacock & Helen Yannakoudakis (eds.), Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, 84-94. San Diego: Association for Computational Linguistics.
- Clifton, Jr. Charles, Adrian Staub & Keith Rayner. 2007. Eye movements in reading words and sentences. In Roger P. G. van Gompel, Martin H. Fischer, Wayne S. Murray & Robin L. Hill (eds.), Eye movements: A window on mind and brain, 341-371. Elsevier. https://doi.org/10.1016/B978-008044980-7/50017-3
- Dorofeeva, Svetlana V., Victoria Reshetnikova, Margarita Serebryakova, Daria Goranskaya, Tatiana V. Akhutina & Olga Dragoy. 2019. Assessing the validity of the standardized assessment of reading skills in Russian and verifying the relevance of available normative data. The Russian Journal of Cognitive Science 6(1). 4-24.
- DuBay, William H. 2007. Smart Language: Readers, Readability, and the Grading of Text. Costa Mesa, California: Impact Information.
- Farris-Trimble, Ashley & Bob McMurray. 2018. Morpho-phonological regularities influence the dynamics of real-time word recognition: Evidence from artificial language learning. Laboratory Phonology 9(1). 1-34. https://doi.org/10.5334/labphon.41
- Francois, Tomas & Cedrick Fairon. 2012. An ’AI readability’ formula for French as a foreign language. Proceedings of the EMNLP and CoNLL 2012, Jeju Island, Korea, 12-14 July 2012. 466-477.
- Glazkova, Anna, Yury Egorov & Maxim Glazkov. 2021. A comparative study of feature types for age-based text classification. In Analysis of Images, Social Networks and Texts. AIST 2020. Lecture Notes in Computer Science 12602. 120-134.
- Graesser, Arthur C., Danielle S. McNamara, Zhiqiang Cai, Mark Conley, Haiying Li & James Pennebaker. 2014. Coh-Metrix measures text characteristics at multiple levels of language and discourse. The Elementary School Journal 115. 210-229.
- Griffin, Zenzi M. & Daniel H. Spieler. 2006. Observing the what and when of language production for different age groups by monitoring speakers’ eye movements. Brain and Language 99(3). 272-288.
- Henderson, John M., Aleksander Pollatsek & Keith Rayner. 1989. Covert visual attention and extrafoveal information use during object identification. Perception & Psychophysics 45. 196-208. https://doi.org/10.3758/BF03210697
- Jian, Yu-Cin & Hwawei Ko. 2017. Influences of text difficulty and reading ability on learning illustrated science texts for children: An eye movement study. Computers & Education 113. 263-279.
- Lexile. 2007. The Lexile Framework for Reading: Theoretical Framework and Development. Technical Report. MetaMetrics, Inc., Durham, NC.
- Luke, Steven G., John M. Henderson & Fernanda Ferreira. 2015. Children’s eye-movements during reading reflect the quality of lexical representations: An individual differences approach. Journal of Experimental Psychology: Learning, Memory, and Cognition 41(6). 1675-1683. https://doi.org/10.1037/xlm0000133
- Raney, Gary E. & Keith Rayner. 1995. Word frequency effects and eye movements during two readings of a text. Canadian Journal of Experimental Psychology 49. 151-172.
- Rau, Anne K., Kristina Moll & Karin Landerl. 2014. The transition from sublexical to lexical processing in a consistent orthography: An eye-tracking study. Scientific Studies of Reading 18. 224-233. https://doi.org/10.1080/10888438.2013.857673
- Rau, Anne K., Kristina Moll, Margaret J. Snowling & Karin Landerl. 2015. Effects of orthographic consistency on eye movement behavior: German and English children and adults process the same words differently. Journal of Experimental Child Psychology 130. 92-105. https://doi.org/10.1016/j.jecp.2014.09.012
- Rayner, Keith. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin 124. 372-422. https://doi.org/10.1037/0033-2909.124.3.372
- Rayner, Keith, Timothy J. Slattery, Denis Drieghe & Simon P. Liversedge. 2011. Eye movements and word skipping during reading: Effects of word length and predictability. Journal of Experimental Psychology: Human Perception and Performance 37(2). 514-528.
- Rello, Luz, Ricardo Baeza-Yates, Laura Dempere-Marco & Horacio Saggion. 2013. Frequent words improve readability and short words improve understandability for people with dyslexia. In Paula Kotzé & Gary Marsden (eds.), Human-Computer interaction - INTERACT 2013. Lecture notes in computer science vol 8120, 203-219. Berlin/Heidelberg: Springer. https://doi.org/10.1007/978-3-642-40498-6_15
- Reynolds, Robert. 2016. Insights from Russian second language readability classification: Complexity-dependent training requirements, and feature evaluation of multiple categories. Proceedings of the 11th Workshop on the Innovative Use of NLP for Building Educational Applications, San Diego, CA 2016. 289-300.
- Sato, Satoshi. 2014. Text Readability and Word Distribution in Japanese. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) 2014. 2811-2815.
- Schwarm, Sarah E. & Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL ’05), USA, 2005. 523-530.
- Solovyev, Valery, Vladimir Ivanov & Marina Solnyshkina. 2018. Assessment of reading difficulty levels in Russian academic texts: Approaches and metrics. Journal of Intelligent & Fuzzy Systems 34. 3049-3058.
- Tiffin-Richards, Simon P. & Sasha Schroeder. 2015. Children's and adults' parafoveal processes in German: Phonological and orthographic effects. Journal of Cognitive Psychology 27. 531-548. https://doi.org/10.1080/20445911.2014.999076
- White, Sarah J., Denis Drieghe, Simon P Liversedge & Adrian Staub. 2018. The word frequency effect during sentence reading: A linear or nonlinear effect of log frequency? Quarterly Journal of Experimental Psychology 71(1). 46-55. https://doi.org/10.1080/17470218.2016.1240813
- Ляшевская О.Н., Шаров С.А. Частотный словарь современного русского языка (на материалах Национального корпуса русского языка). М.: Азбуковник. 2009. [Lyashevskaya, Olga N. & Sergey A. Sharoff. 2009. Modern Russian Frequency Dictionary (based on the data from the Russian National Corpus). Moscow: Azbukovnik. (In Russ.)]