ReaderBench: Multilevel analysis of Russian text characteristics

Abstract

This article presents a new version of the open-source ReaderBench platform. ReaderBench now supports multilevel analysis of Russian text characteristics, integrating both text complexity indices and modern language models, in particular BERT. The proposed processing pipeline was evaluated on a corpus of Russian texts from two language levels used in teaching Russian as a foreign language (A, basic user, and B, independent user). Our experiments showed that (a) the text complexity indices computed by ReaderBench differ with statistical significance between texts of different levels on the Common European Framework scale (Kruskal-Wallis test), with the number of nouns per sentence proving to be the best predictor of complexity; and (b) our neural model, which combines text complexity indices with contextualized embeddings, reached an accuracy of 92.36% under cross-validation and outperformed the BERT baseline. ReaderBench can be used by developers of educational materials to assess and rank texts by difficulty, and by a wider audience to assess how hard texts are to comprehend in various domains, including law, the natural sciences, and politics.
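As an illustration of the statistical check mentioned in the abstract, the following minimal sketch computes the Kruskal-Wallis H statistic for one complexity index across two level groups. It is not the ReaderBench implementation: the function names (`average_ranks`, `kruskal_wallis`, `p_value_two_groups`) and the nouns-per-sentence values are hypothetical, chosen only to show the mechanics of the test. The p-value helper assumes exactly two groups, where H is compared against a chi-square distribution with one degree of freedom.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(*groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_two_groups(h):
    """Chi-square survival function for 1 degree of freedom (two groups only)."""
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


# Hypothetical nouns-per-sentence values for texts at levels A and B.
level_a = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6]
level_b = [3.5, 3.1, 4.0, 3.3, 3.8, 2.9]

h = kruskal_wallis(level_a, level_b)
print(f"H = {h:.4f}, p = {p_value_two_groups(h):.4f}")
```

A rank-based test such as this one makes no normality assumption about the index values, which is the usual reason for preferring it over a one-way ANOVA when the index distributions are skewed.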

About the authors

Dragos Corlatescu

University Politehnica of Bucharest

Email: dragos.corlatescu@upb.ro
ORCID iD: 0000-0002-7994-9950

teaching assistant and PhD student

Splaiul Independentei 313, Bucharest 060042, Romania

Stefan Ruseti

University Politehnica of Bucharest

Email: stefan.ruseti@upb.ro
ORCID iD: 0000-0002-0380-6814

lecturer

Splaiul Independentei 313, Bucharest 060042, Romania

Mihai Dascalu

University Politehnica of Bucharest; Academy of Romanian Scientists

Corresponding author.
Email: mihai.dascalu@upb.ro
ORCID iD: 0000-0002-4815-9227

professor

Splaiul Independentei 313, Bucharest 060042, Romania

© Corlatescu D., Ruseti S., Dascalu M., 2022

This article is available under a Creative Commons Attribution-NonCommercial 4.0 International License.
