<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" article-type="research-article" dtd-version="1.2" xml:lang="en"><front><journal-meta><journal-id journal-id-type="publisher-id">Russian Journal of Linguistics</journal-id><journal-title-group><journal-title xml:lang="en">Russian Journal of Linguistics</journal-title><trans-title-group xml:lang="ru"><trans-title>Russian Journal of Linguistics</trans-title></trans-title-group></journal-title-group><issn publication-format="print">2687-0088</issn><issn publication-format="electronic">2686-8024</issn><publisher><publisher-name xml:lang="en">Peoples’ Friendship University of Russia named after Patrice Lumumba (RUDN University)</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="publisher-id">31329</article-id><article-id pub-id-type="doi">10.22363/2687-0088-30178</article-id><article-categories><subj-group subj-group-type="toc-heading" xml:lang="en"><subject>Articles</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="ru"><subject>Статьи</subject></subj-group><subj-group subj-group-type="toc-heading" xml:lang="zh"><subject>Articles</subject></subj-group><subj-group subj-group-type="article-type"><subject>Research Article</subject></subj-group></article-categories><title-group><article-title xml:lang="en">What neural networks know about linguistic complexity</article-title><trans-title-group xml:lang="ru"><trans-title>Что нейронные сети знают о лингвистической сложности</trans-title></trans-title-group></title-group><contrib-group><contrib contrib-type="author"><contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4877-0210</contrib-id><name-alternatives><name xml:lang="en"><surname>Sharoff</surname><given-names>Serge Aleksandrovich</given-names></name><name 
xml:lang="ru"><surname>Шаров</surname><given-names>Сергей Александрович</given-names></name></name-alternatives><bio xml:lang="en"><p>Researcher at the Centre for Translation Studies</p></bio><bio xml:lang="ru"><p>научный сотрудник Центра переводоведения</p></bio><email>s.sharoff@leeds.ac.uk</email><xref ref-type="aff" rid="aff1"/></contrib></contrib-group><aff-alternatives id="aff1"><aff><institution xml:lang="en">University of Leeds</institution></aff><aff><institution xml:lang="ru">Университет Лидса</institution></aff></aff-alternatives><pub-date date-type="pub" iso-8601-date="2022-06-29" publication-format="electronic"><day>29</day><month>06</month><year>2022</year></pub-date><volume>26</volume><issue>2</issue><issue-title xml:lang="en">Computational Linguistics and Discourse Complexology</issue-title><issue-title xml:lang="ru">Компьютерная лингвистика и дискурсивная комплексология</issue-title><fpage>371</fpage><lpage>390</lpage><history><date date-type="received" iso-8601-date="2022-06-29"><day>29</day><month>06</month><year>2022</year></date></history><permissions><copyright-statement xml:lang="en">Copyright © 2022, Sharoff S.A.</copyright-statement><copyright-statement xml:lang="ru">Copyright © 2022, Шаров С.А.</copyright-statement><copyright-statement xml:lang="zh">Copyright © 2022, Sharoff S.</copyright-statement><copyright-year>2022</copyright-year><copyright-holder xml:lang="en">Sharoff S.A.</copyright-holder><copyright-holder xml:lang="ru">Шаров С.А.</copyright-holder><copyright-holder xml:lang="zh">Sharoff S.</copyright-holder><ali:free_to_read xmlns:ali="http://www.niso.org/schemas/ali/1.0/"/><license><ali:license_ref xmlns:ali="http://www.niso.org/schemas/ali/1.0/">https://creativecommons.org/licenses/by-nc/4.0</ali:license_ref></license></permissions><self-uri xlink:href="https://journals.rudn.ru/linguistics/article/view/31329">https://journals.rudn.ru/linguistics/article/view/31329</self-uri><abstract xml:lang="en"><p style="text-align: 
justify;">Linguistic complexity is a complex phenomenon, as it manifests itself on different levels (from texts to sentences to words to subword units), through different features (from genres to syntax to semantics), and also via different tasks (language learning, translation training, the specific needs of other kinds of audiences). Finally, the results of complexity analysis differ across languages because of their typological properties, the cultural traditions associated with specific genres in these languages, or simply because of the properties of the individual datasets used for analysis. This paper investigates these aspects of linguistic complexity by using artificial neural networks to predict complexity and to explain the predictions. Neural networks optimise millions of parameters to produce empirically efficient prediction models, while operating as black boxes that do not reveal which linguistic factors lead to a specific prediction. This paper shows how to link neural predictions of text difficulty to detectable properties of linguistic data, for example, to the frequency of conjunctions, discourse particles or subordinate clauses. The specific study concerns neural difficulty prediction models which have been trained to differentiate easier and more complex texts in different genres in English and Russian, and which have been probed for the linguistic properties that correlate with the predictions. The study shows how the rate of nouns and the related complexity of noun phrases affect difficulty, via statistical estimates over what the neural model predicts as easy and difficult texts. 
The study also analyses the interplay between difficulty and genres, as linguistic features often specialise for genres rather than for inherent difficulty, so that some associations between the features and difficulty are caused by differences in the relevant genres.</p></abstract><trans-abstract xml:lang="ru"><p style="text-align: justify;">Лингвистическая сложность - это комплексное явление, поскольку оно проявляется на разных уровнях (от сложности текстов до предложений, от слов до подсловных единиц), через разные особенности (от жанров до синтаксиса и семантики), а также через разные задачи (изучение языка, обучение переводу, специфические потребности различных аудиторий). Наконец, результаты анализа сложности будут отличаться для разных языков из-за их типологических свойств, культурных традиций, связанных с конкретными жанрами в этих языках, или просто из-за свойств отдельных наборов данных, используемых для анализа. В данной статье эти аспекты лингвистической сложности исследуются с помощью искусственных нейронных сетей для прогнозирования сложности и объяснения данных прогнозов. Нейронные сети оптимизируют миллионы параметров для создания эмпирически эффективных моделей прогнозирования, работая как черный ящик, т.е. не определяя, какие лингвистические факторы приводят к конкретному решению. В статье показано, как связать нейронные прогнозы сложности текста с обнаруживаемыми свойствами лингвистических данных, например, с частотой союзов, дискурсивных частиц или придаточных предложений. Конкретное исследование касается нейронных моделей прогнозирования сложности, которые были обучены различать более простые и сложные тексты в разных жанрах на английском и русском языках, а также были исследованы на предмет лингвистических свойств, которые коррелируют с прогнозами. Представленное исследование показывает, что количество существительных и связанная с этим сложность именных групп влияют на сложность текста. 
Данная закономерность подтверждена статистическими оценками того, какие тексты нейронная модель относит к простым и сложным. В исследовании также проанализирована взаимосвязь сложности текста и жанра, поскольку лингвистические особенности часто связаны с жанром, а не с непосредственной сложностью текста, в связи с чем некоторые параметры взаимосвязи между признаками и сложностью детерминированы различиями в соответствующих жанрах.</p></trans-abstract><kwd-group xml:lang="en"><kwd>automatic text classification</kwd><kwd>deep learning</kwd><kwd>interpreting neural networks</kwd></kwd-group><kwd-group xml:lang="ru"><kwd>автоматическая классификация текста</kwd><kwd>глубокое обучение</kwd><kwd>интерпретация нейронных сетей</kwd></kwd-group><funding-group/></article-meta></front><body></body><back><ref-list><ref id="B1"><label>1.</label><mixed-citation>Baayen, Harald. 2008. Analyzing Linguistic Data. Cambridge University Press, Cambridge.</mixed-citation></ref><ref id="B2"><label>2.</label><mixed-citation>Balasubramanian, Sriram, Naman Jain, Gaurav Jindal, Abhijeet Awasthi &amp; Sunita Sarawagi. 2020. What’s in a name? Are BERT named entity representations just as good for any other name? Proceedings of the 5th Workshop on Representation Learning for NLP. Association for Computational Linguistics, Online. 205-214.</mixed-citation></ref><ref id="B3"><label>3.</label><mixed-citation>Benko, Vladimír. 2016. Two years of Aranea: Increasing counts and tuning the pipeline. Proc LREC. Portorož, Slovenia.</mixed-citation></ref><ref id="B4"><label>4.</label><mixed-citation>Biber, Douglas. 1988. Variation Across Speech and Writing. Cambridge University Press.</mixed-citation></ref><ref id="B5"><label>5.</label><mixed-citation>Biber, Douglas. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press.</mixed-citation></ref><ref id="B6"><label>6.</label><mixed-citation>Collins-Thompson, Kevyn. 2014. 
Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics 165(2). 97-135.</mixed-citation></ref><ref id="B7"><label>7.</label><mixed-citation>Collins-Thompson, Kevyn &amp; Jamie Callan. 2004. A language modeling approach to predicting reading difficulty. Proc. of HLT/NAACL. Boston. 193-200.</mixed-citation></ref><ref id="B8"><label>8.</label><mixed-citation>Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer &amp; Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.</mixed-citation></ref><ref id="B9"><label>9.</label><mixed-citation>Debnath, Alok &amp; Michael Roth. 2021. A computational analysis of vagueness in revisions of instructional texts. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. Association for Computational Linguistics, Online. 30-35.</mixed-citation></ref><ref id="B10"><label>10.</label><mixed-citation>Devlin, Jacob, Ming-Wei Chang, Kenton Lee &amp; Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.</mixed-citation></ref><ref id="B11"><label>11.</label><mixed-citation>Doughty, Catherine J. &amp; Michael H. Long. 2008. The Handbook of Second Language Acquisition 27. John Wiley &amp; Sons.</mixed-citation></ref><ref id="B12"><label>12.</label><mixed-citation>DuBay, William H. 2004. The Principles of Readability. Technical report, Impact Information.</mixed-citation></ref><ref id="B13"><label>13.</label><mixed-citation>Fytas, Panagiotis, Georgios Rizos &amp; Lucia Specia. 2021. What makes a scientific paper be accepted for publication? Proceedings of the First Workshop on Causal Inference and NLP. 
Association for Computational Linguistics, Punta Cana, Dominican Republic. 44-60.</mixed-citation></ref><ref id="B14"><label>14.</label><mixed-citation>Halliday, M.A.K. 1992. Language as system and language as instance: The corpus as a theoretical construct. In J. Svartvik (ed.), Directions in corpus linguistics: Proceedings of Nobel Symposium 82 Stockholm 65, 61-77. Walter de Gruyter.</mixed-citation></ref><ref id="B15"><label>15.</label><mixed-citation>Hosmer Jr, David W., Stanley Lemeshow &amp; Rodney X. Sturdivant. 2013. Applied Logistic Regression. John Wiley &amp; Sons.</mixed-citation></ref><ref id="B16"><label>16.</label><mixed-citation>Janizek, Joseph D., Pascal Sturmfels &amp; Su-In Lee. 2021. Explaining explanations: Axiomatic feature interactions for deep networks. Journal of Machine Learning Research 22(104). 1-54.</mixed-citation></ref><ref id="B17"><label>17.</label><mixed-citation>Juilland, Alphonse. 1964. Frequency Dictionary of Spanish Words. Mouton.</mixed-citation></ref><ref id="B18"><label>18.</label><mixed-citation>Käding, Friedrich Wilhelm (ed.). 1897. Häufigkeitswörterbuch der Deutschen Sprache. Selbstverlag.</mixed-citation></ref><ref id="B19"><label>19.</label><mixed-citation>Khallaf, Nouran &amp; Serge Sharoff. 2021. Automatic difficulty classification of Arabic sentences. Proceedings of the Sixth Arabic Natural Language Processing Workshop. Association for Computational Linguistics, Kyiv, Ukraine (Virtual). 105-114.</mixed-citation></ref><ref id="B20"><label>20.</label><mixed-citation>Kunilovskaya, Maria &amp; Ekaterina Lapshinova-Koltunski. 2019. Translationese features as indicators of quality in English-Russian human translation. Proceedings of the Human-Informed Translation and Interpreting Technology Workshop (HiT-IT 2019). Incoma Ltd., Shoumen, Bulgaria, Varna, Bulgaria. 47-56.</mixed-citation></ref><ref id="B21"><label>21.</label><mixed-citation>Laposhina, Antonina N., Tatyana Veselovskaya, Maria Lebedeva &amp; Olga Kupreshchenko. 
2018. Automated text readability assessment for Russian second language learners. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue”.</mixed-citation></ref><ref id="B22"><label>22.</label><mixed-citation>Lorge, Irving. 1944. Predicting readability. Teachers College Record.</mixed-citation></ref><ref id="B23"><label>23.</label><mixed-citation>Nadeem, Farah &amp; Mari Ostendorf. 2018. Estimating linguistic complexity for science texts. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, New Orleans, Louisiana. 45-55.</mixed-citation></ref><ref id="B24"><label>24.</label><mixed-citation>Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR). Technical report, Council of Europe, Strasbourg.</mixed-citation></ref><ref id="B25"><label>25.</label><mixed-citation>Orlov, Jurij. 1983. Ein Modell der Häufigkeitsstruktur des Vokabulars. In H. Guiter &amp; M. Arapov (eds.), Studies on Zipf’s law, 154-233.</mixed-citation></ref><ref id="B26"><label>26.</label><mixed-citation>Paun, Silviu, Bob Carpenter, Jon Chamberlain, Dirk Hovy, Udo Kruschwitz &amp; Massimo Poesio. 2018. Comparing Bayesian models of annotation. Transactions of the Association for Computational Linguistics 6. 571-585.</mixed-citation></ref><ref id="B27"><label>27.</label><mixed-citation>Pitler, Emily &amp; Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. Proc EMNLP. 186-195.</mixed-citation></ref><ref id="B28"><label>28.</label><mixed-citation>Rogers, Anna, Olga Kovaleva &amp; Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8. 842-866.</mixed-citation></ref><ref id="B29"><label>29.</label><mixed-citation>Sharoff, Serge. 2021. 
Genre annotation for the web: Text-external and text-internal perspectives. Register Studies 3. 1-32.</mixed-citation></ref><ref id="B30"><label>30.</label><mixed-citation>Sharoff, Serge, Svitlana Kurella &amp; Anthony Hartley. 2008. Seeking needles in the Web haystack: Finding texts suitable for language learners. Proc Teaching and Language Corpora Conference, TaLC 2008. Lisbon.</mixed-citation></ref><ref id="B31"><label>31.</label><mixed-citation>Shavrina, Tatiana &amp; Olga Shapovalova. 2017. To the methodology of corpus construction for machine learning: Taiga syntax tree corpus and parser. CORPORA, International Conference. Saint-Petersburg.</mixed-citation></ref><ref id="B32"><label>32.</label><mixed-citation>Sheehan, Kathleen M., Michael Flor &amp; Diane Napolitano. 2013. A two-stage approach for generating unbiased estimates of text complexity. Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility. Association for Computational Linguistics, Atlanta, Georgia. 49-58.</mixed-citation></ref><ref id="B33"><label>33.</label><mixed-citation>Solovyev, Valery, Marina Solnyshkina, Vladimir Ivanov &amp; Ildar Batyrshin. 2019. Prediction of reading difficulty in Russian academic texts. Journal of Intelligent &amp; Fuzzy Systems 36(5). 4553-4563.</mixed-citation></ref><ref id="B34"><label>34.</label><mixed-citation>Straka, Milan &amp; Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. Proc CoNLL 2017 Shared Task. Association for Computational Linguistics, Vancouver, Canada. 88-99.</mixed-citation></ref><ref id="B35"><label>35.</label><mixed-citation>Vajjala, Sowmya &amp; Detmar Meurers. 2012. On improving the accuracy of readability classification using insights from second language acquisition. Proceedings of the Seventh Workshop on Building Educational Applications Using NLP. Association for Computational Linguistics, Montréal, Canada. 
163-173.</mixed-citation></ref><ref id="B36"><label>36.</label><mixed-citation>Vajjala, Sowmya &amp; Detmar Meurers. 2014. Readability assessment for text simplification: From analysing documents to identifying sentential simplifications. ITL-International Journal of Applied Linguistics 165(2). 194-222.</mixed-citation></ref><ref id="B37"><label>37.</label><mixed-citation>Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest &amp; Alexander M. Rush. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.</mixed-citation></ref><ref id="B38"><label>38.</label><mixed-citation>Xia, Menglin, Ekaterina Kochmar &amp; Ted Briscoe. 2016. Text readability assessment for second language learners. Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, San Diego, CA. 12-22.</mixed-citation></ref><ref id="B39"><label>39.</label><mixed-citation>Yuan, Yu &amp; Serge Sharoff. 2020. Sentence level human translation quality estimation with attention-based neural networks. Proc LREC. Marseille.</mixed-citation></ref><ref id="B40"><label>40.</label><mixed-citation>Zhai, Yuming, Gabriel Illouz &amp; Anne Vilnat. 2020. Detecting non-literal translations by fine-tuning cross-lingual pre-trained language models. Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain (Online). 5944-5956.</mixed-citation></ref></ref-list></back></article>
