ReaderBench: Multilevel analysis of Russian text characteristics
- Authors: Corlatescu D.1, Ruseti Ș.1, Dascalu M.1,2
- Affiliations:
  - University Politehnica of Bucharest
  - Academy of Romanian Scientists
- Issue: Vol 26, No 2 (2022): Computational Linguistics and Discourse Complexology
- Pages: 342-370
- Section: Articles
- URL: https://journals.rudn.ru/linguistics/article/view/31328
- DOI: https://doi.org/10.22363/2687-0088-30145
Abstract
This paper introduces an adaptation of the open-source ReaderBench framework that now supports multilevel analyses of text characteristics for Russian, integrating both textual complexity indices and state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT). The proposed processing pipeline was evaluated on a dataset of Russian texts from two language levels for foreign learners (A - Basic user and B - Independent user). Our experiments showed that the ReaderBench complexity indices differentiate significantly between the two language levels from two perspectives: a) a statistical perspective, where a Kruskal-Wallis analysis was performed and features such as the “nmod” dependency tag or the number of nouns at the sentence level proved to be the most predictive; and b) a neural network perspective, where our model combining textual complexity indices and contextualized embeddings obtained an accuracy of 92.36% in a leave-one-text-out cross-validation, outperforming the BERT baseline. ReaderBench can be employed by designers and developers of educational materials to evaluate and rank materials based on their difficulty, as well as by a larger audience for assessing text complexity in different domains, including law, science, or politics.
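As a rough illustration of the statistical perspective mentioned in the abstract, the sketch below computes a Kruskal-Wallis H statistic for one complexity index across two language levels, using only the Python standard library. The function names (`average_ranks`, `kruskal_wallis_h`, `p_value_two_groups`) and the toy values for "nouns per sentence" are assumptions made for this example; they are not taken from the paper or from ReaderBench, and the authors' actual analysis may differ.

```python
from collections import Counter
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


def p_value_two_groups(h):
    """Chi-square survival function for 1 degree of freedom (two groups only)."""
    return erfc(sqrt(h / 2))


# Hypothetical per-text values of one index (nouns per sentence).
level_a = [2.1, 2.4, 1.9, 2.8, 2.2]
level_b = [3.5, 3.1, 4.0, 2.9, 3.7]

h = kruskal_wallis_h(level_a, level_b)
p = p_value_two_groups(h)
print(f"H = {h:.3f}, p = {p:.4f}")
```

The p-value here relies on the chi-square approximation, which is only rough for samples as small as this toy data; with more than two groups, a general chi-square survival function (for example from a statistics library) would be needed instead of the one-degree-of-freedom shortcut.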
About the authors
Dragos Corlatescu
University Politehnica of Bucharest
Email: dragos.corlatescu@upb.ro
ORCID iD: 0000-0002-7994-9950
Teaching Assistant and a PhD student
Splaiul Independentei 313, Bucharest, 060042, Romania
Ștefan Ruseti
University Politehnica of Bucharest
Email: stefan.ruseti@upb.ro
ORCID iD: 0000-0002-0380-6814
Lecturer
Splaiul Independentei 313, Bucharest, 060042, Romania
Mihai Dascalu
University Politehnica of Bucharest; Academy of Romanian Scientists
Author for correspondence.
Email: mihai.dascalu@upb.ro
ORCID iD: 0000-0002-4815-9227
Full Professor
Splaiul Independentei 313, Bucharest, 060042, Romania