<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "https://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">Russian Journal of Linguistics</journal-id><journal-title-group><journal-title xml:lang="en">Russian Journal of Linguistics</journal-title><trans-title-group xml:lang="ru"><trans-title>Russian Journal of Linguistics</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2687-0088</issn><issn publication-format="electronic">2686-8024</issn><publisher><publisher-name xml:lang="en">Peoples’ Friendship University of Russia named after Patrice Lumumba (RUDN University)</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">31331</article-id><article-id pub-id-type="doi">10.22363/2687-0088-30118</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Articles</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Статьи</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="zh"><subject>Articles</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">Collection and evaluation of lexical complexity data for Russian language using crowdsourcing</article-title><trans-title-group xml:lang="ru"><trans-title>Сбор и оценка лексической сложности данных для русского языка с помощью краудсорсинга</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-5509-9680</contrib-id><name-alternatives><name xml:lang="en"><surname>Abramov</surname><given-names>Aleksei 
V.</given-names></name><name xml:lang="ru"><surname>Абрамов</surname><given-names>Алексей Валерьевич</given-names></name></name-alternatives><bio xml:lang="en"><p>PhD student</p></bio><bio xml:lang="ru"><p>аспирант</p></bio><email>AlVAbramov@stud.kpfu.ru</email><xref ref-type="aff" rid="aff1"/></contrib><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-3289-8188</contrib-id><name-alternatives><name xml:lang="en"><surname>Ivanov</surname><given-names>Vladimir V.</given-names></name><name xml:lang="ru"><surname>Иванов</surname><given-names>Владимир Владимирович</given-names></name></name-alternatives><bio xml:lang="en"><p>Assistant Professor</p></bio><bio xml:lang="ru"><p>доцент</p></bio><email>v.ivanov@innopolis.ru</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">Kazan Federal University</institution></aff><aff><institution xml:lang="ru">Казанский (Приволжский) федеральный университет</institution></aff></aff-alternatives><aff-alternatives id="aff2"><aff><institution xml:lang="en">Innopolis University</institution></aff><aff><institution xml:lang="ru">Университет Иннополис</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2022-06-29" publication-format="electronic"><day>29</day><month>06</month><year>2022</year></pub-date><volume>26</volume><issue>2</issue><issue-title xml:lang="en">Computational Linguistics and Discourse Complexology</issue-title><issue-title xml:lang="ru">Компьютерная лингвистика и дискурсивная комплексология</issue-title><fpage>409</fpage><lpage>425</lpage><history><date date-type="received" iso-8601-date="2022-06-29"><day>29</day><month>06</month><year>2022</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2022, Abramov A.V., Ivanov V.V.</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2022, Абрамов А.В., Иванов 
В.В.</copyright-statement><copyright-statement xml:lang="zh">Copyright © 2022, Abramov A., Ivanov V.</copyright-statement><copyright-year>2022</copyright-year><copyright-holder xml:lang="en">Abramov A.V., Ivanov V.V.</copyright-holder><copyright-holder xml:lang="ru">Абрамов А.В., Иванов В.В.</copyright-holder><copyright-holder xml:lang="zh">Abramov A., Ivanov V.</copyright-holder><ali:free_to_read xmlns:ali="http://www.niso.org/schemas/ali/1.0/"/><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by-nc/4.0</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.rudn.ru/linguistics/article/view/31331">https://journals.rudn.ru/linguistics/article/view/31331</self-uri><abstract xml:lang="en"><p style="text-align: justify;">Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. Commonly this task is referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: they relied on hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, or a Named Entity tag, and ran evaluations with carefully selected target audiences. More recent works have investigated transformer-based models that allow extracting features from the surrounding context as well. However, the majority of papers have been devoted to pipelines for the English language, and few have transferred them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. 
We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity on hand-crafted features and on fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words used in 3,364 different contexts.</p></abstract><trans-abstract xml:lang="ru"><p style="text-align: justify;">Оценка сложности слова с помощью бинарной или непрерывной метки является сложной задачей, изучение которой проводилось для различных доменов и естественных языков. Обычно данная задача обозначается как идентификация сложных слов или прогнозирование лексической сложности. Корректная оценка сложности слова может выступать важным этапом в алгоритмах лексического упрощения слов. Представленные в ранних работах методологии прогнозирования лексической сложности нередко предлагались с рядом ограничений: авторы использовали вручную созданные признаки, которые коррелируют со сложностью слов; проводили детальную генерацию признаков для описания целевых слов, таких как количество согласных, гиперонимов, метки именованных сущностей; тщательно выбирали целевую аудиторию для оценки. В более современных работах рассматривалось применение моделей, основанных на архитектуре Transformer, для извлечения признаков из контекста. Однако большинство представленных работ было посвящено алгоритмам оценки для английского языка, и лишь небольшая часть переносила их на другие языки, такие как немецкий, французский и испанский. В данной работе мы представляем набор данных для оценки лексической сложности слова, основанный на Синодальном переводе Библии и собранный с помощью краудсорсинговой платформы. Мы описываем методологию сбора и оценки данных с помощью шкалы Лайкерта с 5 градациями; приводим описательную статистику и сравниваем ее с аналогичной статистикой для английского языка. 
Мы оцениваем качество работы линейной регрессии как базового алгоритма на ряде признаков: вручную созданных; векторных представлениях слов fastText и ELMo, вычисленных на основе целевых слов. Результатом является корпус, содержащий 931 словоформу, которые встречались в 3364 различных контекстах.</p></trans-abstract><kwd-group xml:lang="en"><kwd>Lexical complexity</kwd><kwd>Russian language</kwd><kwd>annotation</kwd><kwd>corpora</kwd><kwd>Bible</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>лексическая сложность</kwd><kwd>русский язык</kwd><kwd>разметка</kwd><kwd>корпус</kwd><kwd>Библия</kwd></kwd-group><funding-group><funding-statement xml:lang="en">This paper has been supported by the Russian Science Foundation, grant # 22-21-00334, https://rscf.ru/project/22-21-00334/.</funding-statement></funding-group></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Aprosio, Alessio P., Stefano Menini &amp; Sara Tonelli. 2020. Adaptive complex word identification through false friend detection. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 192-200. https://doi.org/10.1145/3340631.3394857</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Aroyehun, Segun Taofeek, Jason Angel, Daniel Alejandro Pérez Alvarez &amp; Alexander Gelbukh. 2018. Complex word identification: Convolutional neural network vs. feature engineering. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications. 322-327. https://doi.org/10.18653/v1/W18-0538</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Blei, David M., Andrew Y. Ng &amp; Michael I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3. 993-1022.</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Bojanowski, Piotr, Edouard Grave, Armand Joulin &amp; Tomas Mikolov. 2017. Enriching word vectors with subword information. 
Transactions of the Association for Computational Linguistics 5. 135-146.</mixed-citation></ref><ref id="B5"><label>5.</label><mixed-citation>Burtsev, Mikhail, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva &amp; Marat Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. Proceedings of ACL 2018, System Demonstrations. 122-127. https://doi.org/10.18653/v1/P18-4021</mixed-citation></ref><ref id="B6"><label>6.</label><mixed-citation>Christodouloupoulos, Christos &amp; Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation 49 (2). 375-395. https://doi.org/10.1007/s10579-014-9287-y</mixed-citation></ref><ref id="B7"><label>7.</label><mixed-citation>Clark, Alexander, Chris Fox &amp; Shalom Lappin (eds.). 2013. The Handbook of Computational Linguistics and Natural Language Processing. John Wiley &amp; Sons.</mixed-citation></ref><ref id="B8"><label>8.</label><mixed-citation>Clark, Kevin, Minh-Thang Luong, Quoc V. Le &amp; Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Conneau, Alexis, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer &amp; Hervé Jégou. 2017. Word translation without parallel data. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Dale, Edgar &amp; Jeanne S. Chall. 1948. A formula for predicting readability. Educational Research Bulletin 27. 
37-54.</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Devlin, Jacob, Ming-Wei Chang, Kenton Lee &amp; Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (1). 4171-4186. https://doi.org/10.18653/v1/N19-1423</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>Devlin, Siobhan &amp; John Tait. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases. 161-173.</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>He, Pengcheng, Xiaodong Liu, Jianfeng Gao &amp; Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations.</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Kajiwara, Tomoyuki &amp; Mamoru Komachi. 2018. Complex word identification based on frequency in a learner corpus. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. 195-199.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma &amp; Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer &amp; Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Lyashevskaya, Olga N. &amp; Sergey A. Sharoff. 2009. The Frequency Dictionary of Modern Russian Language. Moscow: Azbukovnik. 
(In Russ.)</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Maddela, Mounica &amp; Wei Xu. 2018. A word-complexity lexicon and a neural readability ranking model for lexical simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3749-3760. https://doi.org/10.18653/v1/D18-1410</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Malmasi, Shervin, Mark Dras &amp; Marcos Zampieri. 2016. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 996-1000. https://doi.org/10.18653/v1/S16-1154</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Manning, Christopher &amp; Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT press.</mixed-citation></ref><ref id="B21"><label>21.</label><mixed-citation>Morozov, Dmitry, Anna Glazkova &amp; Boris Iomdin. 2022. Text Complexity and Linguistic Features: their correlation in English and Russian. Russian Journal of Linguistics 26 (2). 425-447.</mixed-citation></ref><ref id="B22"><label>22.</label><mixed-citation>Mosquera, Alejandro. 2021. Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 554-559. https://doi.org/10.18653/v1/2021.semeval-1.68</mixed-citation></ref><ref id="B23"><label>23.</label><mixed-citation>Nitin, Indurkhya &amp; Fred J. Damerau (eds.). 2010. Handbook of Natural Language Processing. 2nd edn. Boca Raton: CRC Press.</mixed-citation></ref><ref id="B24"><label>24.</label><mixed-citation>Paetzold, Gustavo &amp; Lucia Specia. 2016. SemEval 2016 Task 11: Complex Word Identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 560-569. 
https://doi.org/10.18653/v1/S16-1085</mixed-citation></ref><ref id="B25"><label>25.</label><mixed-citation>Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee &amp; Luke Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1. 2227-2237. https://doi.org/10.18653/v1/N18-1202</mixed-citation></ref><ref id="B26"><label>26.</label><mixed-citation>Shardlow, Matthew, Michael Cooper &amp; Marcos Zampieri. 2020. CompLex - A new corpus for lexical complexity prediction from Likert scale data. Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI). 57-62.</mixed-citation></ref><ref id="B27"><label>27.</label><mixed-citation>Shardlow, Matthew, Richard Evans, Gustavo Henrique Paetzold &amp; Marcos Zampieri. 2021. SemEval-2021 Task 1: Lexical complexity prediction. Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 1-16. https://doi.org/10.18653/v1/2021.semeval-1.1</mixed-citation></ref><ref id="B28"><label>28.</label><mixed-citation>Sharoff, Serge. 2022. What neural networks know about linguistic complexity? Russian Journal of Linguistics 26 (2). 370-389.</mixed-citation></ref><ref id="B29"><label>29.</label><mixed-citation>Solnyshkina, Marina, Danielle McNamara &amp; Radif Zamaletdinov. 2022. Natural language processing and discourse complexity studies. Russian Journal of Linguistics 26 (2). 317-341.</mixed-citation></ref><ref id="B30"><label>30.</label><mixed-citation>Solovyev, Valery, Marina Solnyshkina &amp; Danielle McNamara. 2022. Computational linguistics and discourse complexology. Russian Journal of Linguistics 26 (2). 275-316.</mixed-citation></ref><ref id="B31"><label>31.</label><mixed-citation>Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. 
Gomez, Łukasz Kaiser &amp; Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems. 5998-6008.</mixed-citation></ref><ref id="B32"><label>32.</label><mixed-citation>Yaseen, Tuqa Bani, Qusai Ismail, Sarah Al-Omari, Eslam Al-Sobh &amp; Malak Abdullah. 2021. JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). 661-666. https://doi.org/10.18653/v1/2021.semeval-1.85</mixed-citation></ref><ref id="B33"><label>33.</label><mixed-citation>Yimam, Seid Muhie, Sanja Stajner, Martin Riedl &amp; Chris Biemann. 2017. Multilingual and cross-lingual complex word identification. In Proceedings of the International Conference Recent Advances in Natural Language Processing. 813-822. https://doi.org/10.26615/978-954-452-049-6_104</mixed-citation></ref><ref id="B34"><label>34.</label><mixed-citation>Yimam, Seid Muhie, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack &amp; Marcos Zampieri. 2018. A report on the complex word identification shared Task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (BEA). 66-78. https://doi.org/10.18653/v1/W18-0507</mixed-citation></ref><ref id="B35"><label>35.</label><mixed-citation>Zaharia, George-Eduard, Dumitru-Clementin Cercel &amp; Mihai Dascalu. 2020. Cross-lingual transfer learning for complex word identification. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). 384-390. https://doi.org/10.1109/ICTAI50040.2020.00067</mixed-citation></ref><ref id="B36"><label>36.</label><mixed-citation>Zampieri, Marcos, Liling Tan &amp; Josef van Genabith. 2016. Macsaar at semeval-2016 task 11: Zipfian and character features for complex word identification. 
In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 1001-1005. https://doi.org/10.18653/v1/S16-1155</mixed-citation></ref><ref id="B37"><label>37.</label><mixed-citation>Zampieri, Marcos, Shervin Malmasi, Gustavo Paetzold &amp; Lucia Specia. 2017. Complex word identification: Challenges in Data Annotation and System Performance. Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017). 59-63.</mixed-citation></ref></ref-list></back></article>
