Text complexity and linguistic features: Their correlation in English and Russian

Abstract

Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. The complexity level of a text should correspond to the reader’s competence: an overly complicated text may be incomprehensible, whereas an overly simple one may be boring. For many years, simple features were used to assess readability, e.g. average length of words and sentences or vocabulary variety. Thanks to the development of natural language processing methods, the set of text parameters used for evaluating readability has expanded significantly. In recent years, many studies have investigated the contribution of various lexical, morphological, and syntactic features to the readability level. Nevertheless, as the methods and corpora are quite diverse, it is hard to draw general conclusions about the effectiveness of linguistic information for evaluating text complexity. Moreover, the cross-lingual impact of different features on various datasets has not been investigated. The purpose of this study is to conduct a large-scale comparison of features of different nature. We experimentally assessed seven commonly used feature types (readability, traditional features, morphological features, punctuation, syntax, frequency, and topic modeling) on six corpora for text complexity assessment in English and Russian, employing four common machine learning models: logistic regression, random forest, convolutional neural network, and feedforward neural network. One of the corpora, a corpus of fiction read by Russian school students, was constructed for the experiment using a large-scale survey to ensure the objectivity of the labeling. We showed which feature types can significantly improve the performance and analyzed their impact according to the dataset characteristics, language, and data source.
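As an illustration of the "traditional" surface features mentioned above (average word length, average sentence length, and vocabulary variety), they can be computed in a few lines of Python. This is a minimal sketch for the reader's orientation, not the feature extraction pipeline used in the study:

```python
import re

def traditional_features(text):
    """Compute simple surface features of the kind used in classic
    readability research: average word length (in characters),
    average sentence length (in words), and type-token ratio
    (a rough measure of vocabulary variety)."""
    # Naive sentence split on terminal punctuation; real pipelines
    # use a proper tokenizer (e.g. NLTK or spaCy).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Word pattern covers Latin and Cyrillic letters, so it works
    # for both English and Russian text.
    words = re.findall(r"[A-Za-z\u0400-\u04FF']+", text)
    unique = {w.lower() for w in words}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "type_token_ratio": len(unique) / len(words),
    }

feats = traditional_features("The cat sat. The cat ran away quickly!")
```

In practice, such features form the basis of classic readability formulas (Flesch-Kincaid, Coleman-Liau, SMOG), which combine them linearly with empirically fitted coefficients.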

About the authors

Dmitry Morozov

Novosibirsk State University

Email: morozowdm@gmail.com
ORCID iD: 0000-0003-4464-1355

Junior Researcher at the Laboratory of Applied Digital Technologies, International Mathematical Center

1 Pirogova, Novosibirsk, 630090, Russia

Anna Glazkova

University of Tyumen

Email: a.v.glazkova@utmn.ru
ORCID iD: 0000-0001-8409-6457

Doctor of Sc. (Technology), Associate Professor of the Department of Software at the Institute of Mathematics and Computer Science

6 Volodarsky, Tyumen, 625003, Russia

Boris Iomdin

Vinogradov Russian Language Institute

Primary contact for editorial correspondence.
Email: iomdin@ruslang.ru
ORCID iD: 0000-0002-1767-5480

Ph.D. (Philology), Leading Researcher

18/2 Volkhonka, Moscow, 119019, Russia

References

  1. Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet Allocation. The Journal of Machine Learning Research 3. 993-1022.
  2. Burtsev, Mikhail, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva & Marat Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. In Proceedings of ACL 2018, System Demonstrations. 122-127. https://doi.org/10.18653/v1/P18-4021
  3. Cantos, Pascual & Ángela Almela. 2019. Readability indices for the assessment of textbooks: A feasibility study in the context of EFL. Vigo International Journal of Applied Linguistics 16. 31-52. https://doi.org/10.35869/VIAL.V0I16.92
  4. Chollet, Francois. 2015. Keras. Github. https://github.com/fchollet/keras (accessed 31.01.2022).
  5. Coleman, Meri & Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology 60(2). 283.
  6. Dale, Edgar & Jeanne S. Chall. 1948. A formula for predicting readability: Instructions. Educational Research Bulletin 27. 11-20, 37-54.
  7. Devlin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171-4186. Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
  8. Deutsch, Tovly, Masoud Jasbi & Stuart Shieber. 2020. Linguistic Features for Readability Assessment. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics. 1-17. https://doi.org/10.18653/v1/2020.bea-1.1
  9. Feng, Lijun, Martin Jansche, Matt Huenerfauth & Noémie Elhadad. 2010. A comparison of features for automatic readability assessment. In Coling 2010: Posters. 276-284.
  10. Glazkova, Anna, Yury Egorov & Maksim Glazkov. 2021. A comparative study of feature types for age-based text classification. Analysis of Images, Social Networks and Texts. 120-134. Cham. Springer International Publishing. https://doi.org/10.1007/978-3-030-72610-2_9
  11. Honnibal, Matthew & Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  12. Iomdin, Boris L. & Dmitry A. Morozov. 2021. Who Can Understand “Dunno”? Automatic Assessment of Text Complexity in Children’s Literature. Russian Speech 5. 55-68. https://doi.org/10.31857/S013161170017239-1
  13. Isaeva, Ulyana & Alexey Sorokin. 2020. Investigating the robustness of reading difficulty models for Russian educational texts. In AIST 2020: Recent Trends in Analysis of Images, Social Networks and Texts. 65-77. https://doi.org/10.1007/978-3-030-71214-3_6
  14. Ivanov, Vladimir, Marina Solnyshkina & Valery Solovyev. 2018. Efficiency of text readability features in Russian academic texts. In Komp'yuternaya Lingvistika i Intellektual'nye Tehnologii. 284-293.
  15. Kincaid, J. Peter, Robert P. Fishburne Jr., Richard L. Rogers & Brad S. Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Naval Technical Training Command Millington TN Research Branch. https://doi.org/10.21236/ada006655
  16. Kingma, Diederik P. & Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.
  17. Korobov, Mikhail. 2015. Morphological analyzer and generator for Russian and Ukrainian languages. In International Conference on Analysis of Images, Social Networks and Texts. 320-332. Springer. https://doi.org/10.1007/978-3-319-26123-2_31
  18. Kuratov, Yuri & Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. Komp'yuternaya Lingvistika i Intellektual'nye Tehnologii. 333-339.
  19. Kutuzov, Andrey & Elizaveta Kuzmenko. 2016. Web-vectors: A toolkit for building web interfaces for vector semantic models. In International Conference on Analysis of Images, Social Networks and Texts. 155-161. Springer. https://doi.org/10.1007/978-3-319-52920-2_15
  20. Leech, Geoffrey, Paul Rayson & Andrew Wilson. 2001. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Routledge.
  21. Loper, Edward & Steven Bird. 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. 63-70.
  22. Loshchilov, Ilya & Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
  23. Lyashevskaya, Olga & Serge Sharoff. 2009. The Frequency Dictionary of the Modern Russian Language (Based on the Materials of the Russian National Corpus). Moscow: Azbukovnik.
  24. Martinc, Matej, Senja Pollak & Marko Robnik-Sikonja. 2021. Supervised and unsupervised neural approaches to text readability. Computational Linguistics 47. 1-39. https://doi.org/10.1162/coli_a_00398
  25. McLaughlin, G. Harry. 1969. SMOG grading - a new readability formula. Journal of Reading 12(8). 639-646.
  26. Mikolov, Tomas, Edouard Grave, Piotr Bojanowski, Christian Puhrsch & Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  27. Pedregosa, Fabian, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12. 2825-2830.
  28. Rehurek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.
  29. Reimers, Nils & Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 3982-3992. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410
  30. Reimers, Nils & Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 4512-4525. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.365
  31. Senter, R. J. & E. A. Smith. 1967. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories. 1-14.
  32. Solnyshkina, Marina, Vladimir Ivanov & Valery Solovyev. 2018. Readability formula for Russian texts: A modified version. In Mexican International Conference on Artificial Intelligence. 132-145. Springer. https://doi.org/10.1007/978-3-030-04497-8_11
  33. Templin, Mildred C. 1957. Certain Language Skills in Children; Their Development and Interrelationships. Minneapolis: University of Minnesota Press.
  34. Vajjala, Sowmya & Ivana Lucic. 2018. OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. 297-304. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-0535
  35. Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest & Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38-45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
  36. Xun, Guangxu, Vishrawas Gopalakrishnan, Fenglong Ma, Yaliang Li, Jing Gao & Aidong Zhang. 2016. Topic discovery for short texts using word embeddings. In 2016 IEEE 16th International Conference on Data Mining (ICDM). 1299-1304. IEEE.
  37. Yan, Xiaohui, Jiafeng Guo, Yanyan Lan & Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web. 1445-1456. https://doi.org/10.1145/2488388.2488514
  38. Chapter 699a. Readable language in insurance policies. URL: https://www.cga.ct.gov/current/pub/chap_699a.htm#sec_38a-29 (accessed 29.05.2022).
  39. Readability. 2021. URL: https://github.com/morozowdmitry/readability (accessed 29.05.2022).
  40. Readability 0.3.1. 2019. URL: https://pypi.org/project/readability/ (accessed 29.05.2022).

Copyright © Morozov D., Glazkova A., Iomdin B., 2022

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
