Russian Journal of Linguistics

2687-00882686-8024

Peoples’ Friendship University of Russia named after Patrice Lumumba (RUDN University)

31328

10.22363/2687-0088-30145

Articles

Статьи

Articles

Research Article

ReaderBench: Multilevel analysis of Russian text characteristics

ReaderBench: многоуровневый анализ характеристик текста на русском языке

https://orcid.org/0000-0002-7994-9950

Corlatescu

Dragos

Корлатеску

Драгош

Teaching Assistant and a PhD student

ассистент и аспирант

dragos.corlatescu@upb.ro

https://orcid.org/0000-0002-0380-6814

Ruseti

Ștefan

Русети

Штефан

Lecturer

преподаватель

stefan.ruseti@upb.ro

https://orcid.org/0000-0002-4815-9227

Dascalu

Mihai

Даскалу

Михай

Full Professor

профессор

mihai.dascalu@upb.ro

University Politehnica of BucharestПолитехнический университет Бухареста

Academy of Romanian ScientistsАкадемия румынских ученых

29062022

262

Computational Linguistics and Discourse Complexology

Компьютерная лингвистика и дискурсивная комплексология

34237029062022

2022

Corlatescu D., Ruseti Ș., Dascalu M.

Корлатеску Д., Русети Ш., Даскалу М.

Corlatescu D., Ruseti Ș., Dascalu M.

https://creativecommons.org/licenses/by-nc/4.0

https://journals.rudn.ru/linguistics/article/view/31328

This paper introduces an adaptation of the open source ReaderBench framework that now supports Russian multilevel analyses of text characteristics, while integrating both textual complexity indices and state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT). The evaluation of the proposed processing pipeline was conducted on a dataset containing Russian texts from two language levels for foreign learners (A - Basic user and B - Independent user). Our experiments showed that the ReaderBench complexity indices are statistically significant in differentiating between the two classes of language level, both from: a) a statistical perspective, where a Kruskal-Wallis analysis was performed and features such as the “nmod” dependency tag or the number of nouns at the sentence level proved the be the most predictive; and b) a neural network perspective, where our model combining textual complexity indices and contextualized embeddings obtained an accuracy of 92.36% in a leave one text out cross-validation, outperforming the BERT baseline. ReaderBench can be employed by designers and developers of educational materials to evaluate and rank materials based on their difficulty, as well as by a larger audience for assessing text complexity in different domains, including law, science, or politics.

В статье представлена новая версия платформы ReaderBench с открытым исходным кодом. В настоящее время Readerbench поддерживает многоуровневый анализ параметров текстов на русском языке, интегрируя при этом как индексы текстовой сложности, так и современные языковые модели, в частности, BERT. Оценка предлагаемого алгоритма обработки проводилась на корпусе русских текстов двух языковых уровней, используемых при обучении русскому языку как иностранному (A - базовый пользователь и B - независимый пользователь). Наши эксперименты показали, что (а) индексы сложности текстов различных уровней по Общеевропейской шкале, рассчитываемые при помощи ReaderBench, статистически значимы (по критерию Краскела-Уоллиса), при этом количество существительных на уровне предложения оказалось наилучшим предиктором сложности; б) a наша нейронная модель, сочетающая индексы сложности текста и контекстуализированные вложения, при перекрестной валидации достигла точности 92,36 % и превзошла базовый уровень BERT. ReaderBench может использоваться разработчиками учебных материалов для оценки и ранжирования текстов в зависимости от их сложности, а также более широкой аудиторией для оценки сложности восприятия текста в различных областях, включая юриспруденцию, естествознание или политику.

ReaderBench frameworktext complexity indiceslanguage modelneural architecturemultilevel text analysisassessing text difficulty

фреймворк ReaderBenchиндексы сложности текстаязыковая модельнейронная архитектурамногоуровневый анализ текстаоценка сложности текста

This research was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS – UEFISCDI, project number TE 70 PN-III-P1-1.1-TE-2019-2209, ATES – “Automated Text Evaluation and Simplification”.

Abadi, Martin. 2016. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16) Savannah, GA, USA: {USENIX} Association. 265-283.

Akhtiamov, Raouf B. 2019. Dictionary of abstract and concrete words of the Russian language: A methodology for creation and application. Journal of Research in Applied Linguistics. Saint Petersburg, Russia: Springer. 218-230.

Bansal, S. 2014. Textstat. Retrieved September 1st, 2021. URL: https://github.com/shivam5992/textstat (accessed 26.05.2022).

Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3(4-5). 993-1022.

BNC Consortium. 2007. British national corpus. Oxford Text Archive Core Collection.

Boguslavsky, Igor, Leonid Iomdin & Victor Sizov. 2004. Multilinguality in ETAP-3: Reuse of lexical resources. In Proceedings of the Workshop on Multilingual Linguistic Resources. Geneva, Switzerland: COLING. 1-8.

Brysbaert, Marc, Boris New & Emmanuel Keuleers. 2012. Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods 44(4). 991-997.

Brysbaert, Marc, Amy Beth Warriner & Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46(3). 904-911.

Choi, Joon Suh & Scott A. Crossley. 2020. ARTE: Automatic Readability Tool for English. NLP Tools for the Social Sciences. linguisticanalysistools.org. Retrieved September 1st, 2021. URL: https://www.linguisticanalysistools.org/arte.html (accessed 26.05.2022).

10.

Churunina, Anna A., Ehl'zara Gizzatullina-Gafiyatova, Artem Zaikin & Marina I. Solnyshkina. 2020. Lexical Features of Text Complexity: The case of Russian academic texts. In SHS Web of Conferences. Nizhny Novgorod, Russia: EDP Sciences.

11.

Coltheart, Max. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology Section A 33(4). 497-505.

12.

Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer & Hervé Jégou. 2018. Word translation without parallel data. In 6th International Conference on Learning Representations. Vancouver, BC, Canada: OpenReview.net.

13.

Crossley, Scott A., Franklin Bradfield & Analynn Bustamante. 2019. Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. Journal of Writing Research 11(2). 251-270.

14.

Crossley, Scott A., Kristopher Kyle, Jodi Davenport & Danielle S. McNamara. 2016. Automatic assessment of constructed response data in a Chemistry Tutor. In International Conference on Educational Data Ining. Raleigh, North Carolina, USA: International Educational Data Mining Society. 336-340.

15.

Dale, Edgar & Jeanne S. Chall. 1948. A formula for predicting readability: Instructions. Educational Research Bulletin 27(1). 37-54.

16.

Dascalu, Mihai. 2014. Analyzing Discourse and Text Complexity for Learning and Collaborating, Studies in Computational Intelligence (534). Switzerland: Springer.

17.

Dascalu, Mihai, Philippe Dessus, Stefan Trausan-Matu & Maryse Bianco. 2013. ReaderBench, an environment for analyzing text complexity and reading strategies. In H. Chad Lane, Kalina Yacef, Jack Mostow & Philip Pavlik (eds.), 16th Int. Conf. on Artificial Intelligence in Education (AIED 2013), 379-388. Memphis, TN, USA: Springer.

18.

Dascalu, Mihai, Danielle S. McNamara, Stefan Trausan-Matu & Laura K. Allen. 2018. Cohesion Network Analysis of CSCL Participation. Behavior Research Methods 50(2). 604-619. https://doi.org/10.3758/s13428-017-0888-4

19.

Dascalu, Mihai, Lucia Larise Stavarache, Stefan Trausan-Matu & Philippe Dessus. 2014. Reflecting comprehension through French textual complexity factors. In 26th Int. Conf. on Tools with Artificial Intelligence (ICTAI 2014). 615-619. Limassol, Cyprus: IEEE.

20.

Dascalu, Mihai, Wim Westera, Stefan Ruseti, Stefan Trausan-Matu & Hub J. Kurvers. 2017. ReaderBench learns Dutch: Building a comprehensive automated essay scoring system for Dutch. In Anne E. Baker, Xiangen Hu, Ma. Mercedes T. Rodrigo, Benedict du Boulay, Ryan Baker (eds.), 18th Int. Conf. on Artificial Intelligence in Education (AIED 2017), 52-63. Wuhan, China: Springer.

21.

Davies, Mark. 2010. The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing 25(4). 447-464.

22.

Delvin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, MN, USA: Association for Computational Linguistics. 4171-4186.

23.

Flesch, Rudolf F. 1949. Art of Readable Writing.

24.

Gabitov, Azat, Marina Solnyshkina, Liliya Shayakhmetova, Liliya Ilyasova & Saida Adobarova. 2017. Text complexity in Russian textbooks on social studies. Revista Publicando 4(13 (2)). 597-606.

25.

Gifu, Daniela, Mihai Dascalu, Stefan Trausan-Matu & Laura K. Allen. 2016. Time evolution of writing styles in Romanian language. In 28th Int. Conf. on Tools with Artificial Intelligence (ICTAI 2016). San Jose, CA: IEEE. 1048-1054.

26.

Graesser, Arthur C., Danielle S. McNamara, Max M. Louwerse & Zhiqiang Cai. 2004. Coh-Metrix: Analysis of text on cohesion and language. Behavior Research Methods, Instruments, & Computers 36(2). 193-202.

27.

Guryanov, Igor, Iskander Yarmakeev, Aleksandr Kiselnikov & Iena Harkova. 2017. Text complexity: Periods of study in Russian linguistics. Revista Publicando 4(13 (2)). 616-625.

28.

Gutu-Robu, Gabriel, Maria-Dorinela Sirbu, Ionut S Cristian Paraschiv, Mihai Dascălu, Philippe Dessus & Stefan Trausan-Matu. 2018. Liftoff - ReaderBench introduces new online functionalities. Romanian Journal of Human - Computer Interaction 11(1). 76-91.

29.

Honnibal, Montani & I. Montani. 2017. Spacy 2: Natural language understanding with bloom embeddings. Convolutional Neural Networks and Incremental Parsing 7(1).

30.

Hopkins, Kenneth D. & Douglas L. Weeks. 1990. Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educational and Psychological Measurement 50(4). 717-729.

31.

Kincaid, J. Peter, Robert P. Fishburne Jr., Richard L. Rogers & Brad S. Chissom. 1975. Derivation of New Readability Formulas: (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Naval Air Station Memphis: Chief of Naval Technical Training.

32.

Kozea. 2016. Pyphen. Retrieved September 1st, 2021. URL: https://pyphen.org/ (accessed 20.05.2022).

33.

Kruskal, William H. & Allen W. Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47(260). 583-621.

34.

Kuperman, Victor, Hans Stadthagen-Gonzalez & Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods 44(4). 978-990.

35.

Kuratov, Yuri & Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213.

36.

Kyle, Kristopher. 2016. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-based Indices of Syntactic Sophistication.

37.

Kyle, Kristopher, Scott A. Crossley & Cynthia Berger. 2018. The tool for the automatic analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods 50(3). 1030-1046.

38.

Kyle, Kristopher, Scott A. Crossley & Scott Jarvis. 2021. Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly 18(2). 154-170.

39.

Kyle, Kristopher, Scott A. Crossley & Youjin J. Kim. 2015. Native language identification and writing proficiency. International Journal of Learner Corpus Research 1(2). 187-209.

40.

Landauer, Thomas K., Peter W. Foltz & Darrell Laham. 1998. An introduction to Latent Semantic Analysis. Discourse Processes 25(2/3). 259-284.

41.

LanguageTool. 2021. Language Tool. Retrieved September 1st, 2021. URL: https://languagetool.org/ (accessed 20.05.2022).

42.

Loukachevitch, Natalia V., G. Lashevich, Anastasia A. Gerasimova, Vyacheslav V. Ivanov. Boris V. Dobrov. 2016. Creating Russian wordnet by conversion. In Computational Linguistics and Intellectual Technologies: Annual conference Dialogue 2016. Moscow, Russia. 405-415.

43.

Mc Laughlin, G. H. 1969. SMOG grading-a new readability formula. Journal of Reading 12(8). 639-646.

44.

Mccarthy, Kathryn, Danielle Siobhan, Marina I. Solnyshkina, Fanuza Kh. Tarasove & Roman V. Kupriyanov. 2019. The Russian language test: Towards assessing text comprehension. Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriya 2: Yazykoznanie 18(4). 231-247.

45.

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representation in Vector Space. In Workshop at ICLR. Scottsdale, AZ.

46.

Myint. 2014. language-check. Retrieved September 1st, 2021. URL: https://github.com/myint/language-check (accessed 23.05.2022).

47.

Pearson, Karl. 1895. VII. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58. 240-242.

48.

Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12. 2825-2830.

49.

Quispesaravia, Andre, Walter Perez, Marco Sobrevilla Cabezudo & Fernando Alva-Manchego. 2016. Coh-Metrix-Esp: A complexity analysis tool for documents written in Spanish. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 4694-4698.

50.

Rehurek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA. 45-50.

51.

Roscoe, Rod, Laura K. Allen, Jennifer L. Weston & Scott A. Crossley. 2014. The Writing Pal intelligent tutoring system: Usability testing and development. Computers and Composition 34. 39-59.

52.

Sadoski, Mark, Ernest T. Goetz & Maximo Rodriguez. 2000. Engaging texts: Effects of concreteness on comprehensibility, interest, and recall in four text types. Journal of Educational Psychology 92(1). 85.

53.

Sakhovskiy, Andrey, Valery D. Solovyev & Marina Solnyshkina. 2020. Topic modeling for assessment of text complexity in Russian textbooks. In 2020 Ivannikov Ispras Open Conference (ISPRAS). Moscow, Russia: IEEE. 102-108.

54.

Schmid, Helmut, Marco Baroni, Erika Zanchetta & Achim Stein. 2007. Il sistema ‘tree-tagger arricchito’-The enriched TreeTagger system. IA Contributi Scientifici 4(2). 22-23.

55.

Senter, R.J. & E.A. Smith. 1967. Automated readability index: CINCINNATI UNIV OH.

56.

Shannon, Claude E. 1948. A mathematical theory of communication. The Bell System Technical Journal 27(3). 379-423.

57.

Shapiro, S.S. & M.B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52(3/4). 591-611.

58.

Sharoff, Serge, Elena Umanskaya & James Wilson. 2014. A Frequency Dictionary of Russian: Core Vocabulary for Learners. Routledge.

59.

Solnyshkina, Marina I., Valery Solovyev, Vladimir Ivanov & Andrey Danilov. 2018. Studying text complexity in Russian academic corpus with Multi-Level Annotation. CEUR WORKSHOP PROCEEDINGS. Proceedings of Computational Models in Language and Speech Workshop, co-located with the 15th TEL International Conference on Computational and Cognitive Linguistics, TEL 2018.

60.

Solovyev, Valery, Marina Solnyshkina, Mariia Andreeva, Andrey Danilov & Radif Zamaletdinov. 2020. Text complexity and abstractness: Tools for the Russian language. In International Conference "Internet and Modern Society" (IMS-2020). St. Petersburg, Russia: CEUR Proceedings. 75-87.

61.

Solovyev, Valery, Marina I. Solnyshkina & Vladimir Ivanov. 2018. Complexity of Russian academic texts as the function of syntactic parameters. In 19th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing. Hanoi, Vietnam: Springer Lecture Notes in Computer Science.

62.

Spearman, Carl. 1987. The proof and measurement of association between two things. The American Journal of Psychology 100(3/4). 441-471.

63.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser & Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates, Inc. 5998-6008.

64.

Vorontsov, Konstantin & Anna Potapenko. 2015. Additive regularization of topic models. Machine Learning 101(1) 303-323.