<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "https://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">Russian Journal of Linguistics</journal-id><journal-title-group><journal-title xml:lang="en">Russian Journal of Linguistics</journal-title><trans-title-group xml:lang="ru"><trans-title>Russian Journal of Linguistics</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2687-0088</issn><issn publication-format="electronic">2686-8024</issn><publisher><publisher-name xml:lang="en">Peoples’ Friendship University of Russia named after Patrice Lumumba (RUDN University)</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">31331</article-id><article-id pub-id-type="doi">10.22363/2687-0088-30118</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Articles</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Статьи</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="zh"><subject>Articles</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Collection and evaluation of lexical complexity data for Russian language using crowdsourcing</article-title><trans-title-group xml:lang="ru"><trans-title>Сбор и оценка лексической сложности данных для русского языка с помощью краудсорсинга</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-5509-9680</contrib-id><name-alternatives><name xml:lang="en"><surname>Abramov</surname><given-names>Aleksei 
V.</given-names></name><name xml:lang="ru"><surname>Абрамов</surname><given-names>Алексей Валерьевич</given-names></name></name-alternatives><bio xml:lang="en"><p>PhD student</p></bio><bio xml:lang="ru"><p>аспирант</p></bio><email>AlVAbramov@stud.kpfu.ru</email><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-3289-8188</contrib-id><name-alternatives><name xml:lang="en"><surname>Ivanov</surname><given-names>Vladimir V.</given-names></name><name xml:lang="ru"><surname>Иванов</surname><given-names>Владимир Владимирович</given-names></name></name-alternatives><bio xml:lang="en"><p>Assistant Professor</p></bio><bio xml:lang="ru"><p>доцент</p></bio><email>v.ivanov@innopolis.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Kazan Federal University</institution></aff><aff><institution xml:lang="ru">Казанский (Приволжский) федеральный университет</institution></aff></aff-alternatives><aff-alternatives id="aff2"><aff><institution xml:lang="en">Innopolis University</institution></aff><aff><institution xml:lang="ru">Университет Иннополис</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2022-06-29" publication-format="electronic"><day>29</day><month>06</month><year>2022</year></pub-date><volume>26</volume><issue>2</issue><issue-title xml:lang="en">Computational Linguistics and Discourse Complexology</issue-title><issue-title xml:lang="ru">Компьютерная лингвистика и дискурсивная комплексология</issue-title><fpage>409</fpage><lpage>425</lpage><history><date date-type="received" iso-8601-date="2022-06-29"><day>29</day><month>06</month><year>2022</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2022, Abramov A.V., Ivanov V.V.</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2022, Абрамов А.В., Иванов 
В.В.</copyright-statement><copyright-statement xml:lang="zh">Copyright © 2022, Abramov A., Ivanov V.</copyright-statement><copyright-year>2022</copyright-year><copyright-holder xml:lang="en">Abramov A.V., Ivanov V.V.</copyright-holder><copyright-holder xml:lang="ru">Абрамов А.В., Иванов В.В.</copyright-holder><copyright-holder xml:lang="zh">Abramov A., Ivanov V.</copyright-holder><ali:free_to_read xmlns:ali="http://www.niso.org/schemas/ali/1.0/"/><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by-nc/4.0</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.rudn.ru/linguistics/article/view/31331">https://journals.rudn.ru/linguistics/article/view/31331</self-uri><abstract xml:lang="en"><p style="text-align: justify;">Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. Commonly this task is referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: they relied on hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, or a Named Entity tag, and ran evaluations with carefully selected target audiences. More recent works have investigated transformer-based models that allow extracting features from the surrounding context as well. However, the majority of papers have been devoted to pipelines for the English language, and few have transferred them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. 
We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity on hand-crafted features and on fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words used in 3,364 different contexts.</p></abstract><trans-abstract xml:lang="ru"><p style="text-align: justify;">Оценка сложности слова с помощью бинарной или непрерывной метки является сложной задачей, изучение которой проводилось для различных доменов и естественных языков. Обычно данная задача обозначается как идентификация сложных слов или прогнозирование лексической сложности. Корректная оценка сложности слова может выступать важным этапом в алгоритмах лексического упрощения слов. Представленные в ранних работах методологии прогнозирования лексической сложности нередко предлагались с рядом ограничений: авторы использовали вручную созданные признаки, которые коррелируют со сложностью слов; проводили детальную генерацию признаков для описания целевых слов, таких как количество согласных, гиперонимов, метки именованных сущностей; тщательно выбирали целевую аудиторию для оценки. В более современных работах рассматривалось применение моделей, основанных на архитектуре Transformer, для извлечения признаков из контекста. Однако большинство представленных работ было посвящено алгоритмам оценки для английского языка, и лишь небольшая часть переносила их на другие языки, такие как немецкий, французский и испанский. В данной работе мы представляем набор данных для оценки лексической сложности слова, основанный на Синодальном переводе Библии и собранный с помощью краудсорсинговой платформы. Мы описываем методологию сбора и оценки данных с помощью шкалы Лайкерта с 5 градациями; приводим описательную статистику и сравниваем ее с аналогичной статистикой для английского языка. 
Мы оцениваем качество работы линейной регрессии как базового алгоритма на ряде признаков: вручную созданных; векторных представлениях слов fastText и ELMo, вычисленных на основе целевых слов. Результатом является корпус, содержащий 931 словоформу, которые встречались в 3364 различных контекстах.</p></trans-abstract><kwd-group xml:lang="en"><kwd>Lexical complexity</kwd><kwd>Russian language</kwd><kwd>annotation</kwd><kwd>corpora</kwd><kwd>Bible</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>лексическая сложность</kwd><kwd>русский язык</kwd><kwd>разметка</kwd><kwd>корпус</kwd><kwd>Библия</kwd></kwd-group><funding-group><funding-statement xml:lang="en">This paper has been supported by the Russian Science Foundation, grant # 22-21-00334, https://rscf.ru/project/22-21-00334/.</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Aprosio, Alessio P., Stefano Menini &amp; Sara Tonelli. 2020. Adaptive complex word identification through false friend detection. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 192-200. https://doi.org/10.1145/3340631.3394857</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Aroyehun, Segun Taofeek, Jason Angel, Daniel Alejandro Pérez Alvarez &amp; Alexander Gelbukh. 2018. Complex word identification: Convolutional neural network vs. feature engineering. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications. 322-327. https://doi.org/10.18653/v1/W18-0538</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Blei, David M., Andrew Y. Ng &amp; Michael I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3. 993-1022.</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Bojanowski, Piotr, Edouard Grave, Armand Joulin &amp; Tomas Mikolov. 2017. Enriching word vectors with subword information. 
Transactions of the Association for Computational Linguistics 5. 135-146.</mixed-citation></ref><ref id="B5"><label>5.</label><mixed-citation>Burtsev, Mikhail, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva &amp; Marat Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. Proceedings of ACL 2018, System Demonstrations. 122-127. https://doi.org/10.18653/v1/P18-4021</mixed-citation></ref><ref id="B6"><label>6.</label><mixed-citation>Christodouloupoulos, Christos &amp; Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation 49 (2). 375-395. https://doi.org/10.1007/s10579-014-9287-y</mixed-citation></ref><ref id="B7"><label>7.</label><mixed-citation>Clark, Alexander, Chris Fox &amp; Shalom Lappin (eds.). 2013. The Handbook of Computational Linguistics and Natural Language Processing. John Wiley &amp; Sons.</mixed-citation></ref><ref id="B8"><label>8.</label><mixed-citation>Clark, Kevin, Minh-Thang Luong, Quoc V. Le &amp; Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Conneau, Alexis, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer &amp; Hervé Jégou. 2017. Word translation without parallel data. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Dale, Edgar &amp; Jeanne S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin 27. 
37-54.</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Devlin, Jacob, Ming-Wei Chang, Kenton Lee &amp; Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (1). 4171-4186. https://doi.org/10.18653/v1/N19-1423</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>Devlin, Siobhan &amp; John Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases. 161-173.</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>He, Pengcheng, Xiaodong Liu, Jianfeng Gao &amp; Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Kajiwara, Tomoyuki &amp; Mamoru Komachi. 2018. Complex word identification based on frequency in a learner corpus. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. 195-199.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma &amp; Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer &amp; Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Lyashevskaya, Olga N. &amp; Sergey A. Sharoff. 2009. The Frequency Dictionary of Modern Russian Language. Moscow: Azbukovnik. 
(In Russ.)</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Maddela, Mounica &amp; Wei Xu. 2018. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3749-3760. https://doi.org/10.18653/v1/D18-1410</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Malmasi, Shervin, Mark Dras &amp; Marcos Zampieri. 2016. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 996-1000. https://doi.org/10.18653/v1/S16-1154</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Manning, Christopher &amp; Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT press.</mixed-citation></ref><ref id="B21"><label>21.</label><mixed-citation>Morozov, Dmitry, Anna Glazkova &amp; Boris Iomdin. 2022. Text Complexity and Linguistic Features: their correlation in English and Russian. Russian Journal of Linguistics 26 (2). 425-447.</mixed-citation></ref><ref id="B22"><label>22.</label><mixed-citation>Mosquera, Alejandro. 2021. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 554-559. https://doi.org/10.18653/v1/2021.semeval-1.68</mixed-citation></ref><ref id="B23"><label>23.</label><mixed-citation>Nitin, Indurkhya &amp; Fred J. Damerau (eds.). 2010. Handbook of Natural Language Processing. 2nd edn. Boca Raton: CRC Press.</mixed-citation></ref><ref id="B24"><label>24.</label><mixed-citation>Paetzold, Gustavo &amp; Lucia Specia. 2016. SemEval 2016 Task 11: Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 560-569. 
https://doi.org/10.18653/v1/S16-1085</mixed-citation></ref><ref id="B25"><label>25.</label><mixed-citation>Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee &amp; Luke Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1. 2227-2237. https://doi.org/10.18653/v1/N18-1202</mixed-citation></ref><ref id="B26"><label>26.</label><mixed-citation>Shardlow, Matthew, Michael Cooper &amp; Marcos Zampieri. 2020. CompLex - A new corpus for lexical complexity prediction from Likert scale data. Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI). 57-62.</mixed-citation></ref><ref id="B27"><label>27.</label><mixed-citation>Shardlow, Matthew, Richard Evans, Gustavo Henrique Paetzold &amp; Marcos Zampieri. 2021. SemEval-2021 Task 1: Lexical complexity prediction. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 1-16. https://doi.org/10.18653/v1/2021.semeval-1.1</mixed-citation></ref><ref id="B28"><label>28.</label><mixed-citation>Sharoff, Serge. 2022. What neural networks know about linguistic complexity? Russian Journal of Linguistics 26 (2). 370-389.</mixed-citation></ref><ref id="B29"><label>29.</label><mixed-citation>Solnyshkina, Marina, Danielle McNamara &amp; Radif Zamaletdinov. 2022. Natural language processing and discourse complexity studies. Russian Journal of Linguistics 26 (2). 317-341.</mixed-citation></ref><ref id="B30"><label>30.</label><mixed-citation>Solovyev, Valery, Marina Solnyshkina &amp; Danielle McNamara. 2022. Computational linguistics and discourse complexology. Russian Journal of Linguistics 26 (2). 275-316.</mixed-citation></ref><ref id="B31"><label>31.</label><mixed-citation>Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, Łukasz Kaiser &amp; Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems. 5998-6008.</mixed-citation></ref><ref id="B32"><label>32.</label><mixed-citation>Yaseen, Tuqa Bani, Qusai Ismail, Sarah Al-Omari, Eslam Al-Sobh &amp; Malak Abdullah. 2021. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 661-666. https://doi.org/10.18653/v1/2021.semeval-1.85</mixed-citation></ref><ref id="B33"><label>33.</label><mixed-citation>Yimam, Seid Muhie, Sanja Stajner, Martin Riedl &amp; Chris Biemann. 2017. Multilingual and cross-lingual complex word identification. In Proceedings of the International Conference Recent Advances in Natural Language Processing. 813-822. https://doi.org/10.26615/978-954-452-049-6_104</mixed-citation></ref><ref id="B34"><label>34.</label><mixed-citation>Yimam, Seid Muhie, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack &amp; Marcos Zampieri. 2018. A report on the complex word identification shared Task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (BEA). 66-78. https://doi.org/10.18653/v1/W18-0507</mixed-citation></ref><ref id="B35"><label>35.</label><mixed-citation>Zaharia, George-Eduard, Dumitru-Clementin Cercel &amp; Mihai Dascalu. 2020. Cross-lingual transfer learning for complex word identification. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). 384-390. https://doi.org/10.1109/ICTAI50040.2020.00067</mixed-citation></ref><ref id="B36"><label>36.</label><mixed-citation>Zampieri, Marcos, Liling Tan &amp; Josef van Genabith. 2016. Macsaar at semeval-2016 task 11: Zipfian and character features for complex word identification. 
In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 1001-1005. https://doi.org/10.18653/v1/S16-1155</mixed-citation></ref><ref id="B37"><label>37.</label><mixed-citation>Zampieri, Marcos, Shervin Malmasi, Gustavo Paetzold &amp; Lucia Specia. 2017. Complex word identification: Challenges in Data Annotation and System Performance. Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017). 59-63.</mixed-citation></ref></ref-list></back></article>
