Computational linguistics and discourse complexology: Paradigms and research methods

USA  maki.solovyev@mail.ru
Abstract
The dramatic expansion of modern linguistic research and the enhanced accuracy of linguistic analysis have become a reality due to the ability of artificial neural networks not only to learn and adapt, but also to carry out automated linguistic analysis and to select, modify and compare texts of various types and genres. The purpose of this article and of the journal issue as a whole is to present modern areas of research in computational linguistics and linguistic complexology, as well as to define a solid rationale for the new interdisciplinary field, i.e. discourse complexology. The review of trends in computational linguistics focuses on the following aspects of research: applied problems and methods, computational linguistic resources, the contribution of theoretical linguistics to computational linguistics, and the use of deep learning neural networks. The special issue also addresses the problem of objective and relative text complexity and its assessment. We focus on the two main approaches to linguistic complexity assessment: the "parametric approach" and machine learning. The findings of the studies published in this special issue indicate a major contribution of computational linguistics to discourse complexology, including new algorithms developed to solve discourse complexology problems. The issue outlines the research areas of linguistic complexology and provides a framework to guide its further development, including the design of a complexity matrix for texts of various types and genres, refining the list of complexity predictors, validating new complexity criteria, and expanding databases for the most significant ones.


Introduction
The article addresses modern trends in computational linguistics, language and discourse complexity. It also provides a brief overview of the articles in the issue.
Computational linguistics (hereinafter CL), as the name implies, is an interdisciplinary science at the intersection of linguistics and computer science. It explores the problems of automatic processing of linguistic information. Another commonly used name for this discipline, often treated as synonymous with "computational linguistics", is Natural Language Processing (NLP). A number of research works separate the two concepts, regarding CL as the more theoretical discipline and NLP as the more applied one. CL began to develop in the early 1950s, almost immediately after the advent of computers. Its first task was machine translation, in particular the translation of journals from Russian into English. The initial stage of CL development is comprehensively presented in J. Hutchins (1999). The problems of machine translation proved impossible to solve quickly, and the initial optimism turned out to be groundless, although in recent years it has become possible to obtain translations of acceptable quality. Nevertheless, over 70 years of development CL has achieved significant success in solving many urgent practical problems, which has made it one of the most dynamically developing and important research areas in both linguistics and computer science. In our opinion, the best monographs on CL are Clark et al. (2013) and Indurkhya & Damerau (2010). The latest review, which also analyzes the prospects for the development of CL, can be found in the article by Church and Liberman (2021).
In the review of computational linguistics trends, we focus on the following aspects of research: application-oriented tasks, methods, resources, the contribution of theoretical linguistics to computational linguistics, and the application of deep learning neural networks. The latter appeared about 10 years ago (Schmidhuber 2015) and revolutionized research in artificial intelligence, including many areas of CL. Artificial neural networks are a formal model of biological networks of neurons. Their most important feature is the ability to learn: in case of an error, the neural network is modified in a certain way. Although neural networks were proposed as early as 1943, a breakthrough in their use was made only a few years ago. It is associated with three factors: the emergence of new, more advanced "self-learning" (unsupervised) training algorithms, the improved performance of computers, and the growth of the data available on the Internet. Advances in NLP in late 2018 were mainly related to BERT (Devlin et al. 2018), a neural network pre-trained on a corpus of texts. Currently, BERT and its enhanced variants outperform earlier approaches on many NLP problems (see Lauriola, Lavelli & Aiolli 2022).

Application-oriented tasks of Computational Linguistics
In addition to machine translation, the main application-oriented tasks of CL include document processing, computer analysis of social networks, speech analysis and synthesis (including voice assistants), question-answering systems, and recommender systems.
The largest task is document processing, which includes a wide range of subtasks: search, summarization, classification, sentiment analysis, information extraction, etc.
The development of search engines is probably the best-known and most widely used CL task, successfully implemented in the Google and Yandex search engines. A detailed introduction to information retrieval can be found in Manning et al. (2011). The main type of search query is a set of keywords. Search poses two main problems: the need to search quickly through the vast number of texts on the Internet, and the need to take into account not only the form of a query but also its semantics. The main idea behind fast search is to preprocess all documents on the Internet and build a so-called search index that records where each query term occurs in specific documents. Semantic document search, or semantic search, is implemented in the well-known concept of the Semantic Web (Domingue et al. 2011), which is based on the idea of ontologies (presented below). For example, in response to the query "Beethoven ta ta ta tam" Google returns the Wikipedia article about Beethoven's 5th symphony, although the text of the article does not contain the phrase "ta ta ta tam". Thus, the Google search engine "understands" that "ta ta ta tam" and the 5th symphony are semantically related. Successful search would be simply impossible without linguistic research, which led to the development of algorithms for morphological and syntactic analysis and of thesauri and ontologies that make semantic relationships between entities explicit.
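The idea of a search index can be illustrated with a minimal sketch of an inverted index. The documents, the function names `build_index` and `search`, and the whitespace tokenization are illustrative assumptions; production search engines add morphological normalization, ranking and semantic expansion on top of this basic structure.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every keyword of the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

# Toy collection (hypothetical); a real index covers billions of pages.
docs = {
    1: "Beethoven composed the fifth symphony",
    2: "Yandex and Google are search engines",
    3: "the symphony orchestra played Beethoven",
}
index = build_index(docs)
print(search(index, "Beethoven symphony"))
```

Because the index is built once in advance, answering a keyword query reduces to intersecting a few precomputed sets rather than scanning every document.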
The term "information extraction" is interpreted as a search for information of a certain type in the text, i.e. entities, their relationships, facts, etc. The best developed task is the extraction of named entities (Named Entity Recognition, NER), i.e. persons, organizations, geographical objects, etc. A recent survey of IT professionals from various business areas 1 indicates that NER is the task most in demand in business applications. Researchers apply various techniques to solve this problem: ready-made dictionaries of personal names and names of geographical objects; linguistic features (the use of capital letters) and predefined patterns of noun phrases; and machine learning methods. An overview of this area can be found in Sharnagat (2014). NER systems based on dictionaries and rules correctly extract about 90% of entities in texts, while BERT-based systems already correctly extract about 94% of entities (Wang 2020), which is comparable to human accuracy and demonstrates the benefits of deep learning neural networks.
The extraction of events and facts is more challenging. The classic approach is to create event templates that capture the types and roles of the entities participating in the events. For example, the event "June 24, 2021 Microsoft presented Windows 11" is described by the following template: Activity type - sales presentation, Company - Microsoft, Product - Windows 11, Date - June 24, 2021. Templates of this type are created manually, which is labour-intensive. The effectiveness of information extraction systems depends on the quality of the templates. Typically, such systems extract no more than 60% of facts (Jiang et al. 2016).
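A hand-written template of this kind might be sketched as follows. The class `ProductPresentation`, the regular expression and the function `extract_event` are hypothetical and cover only the single sentence shape used in the example; they are meant to show why manual template creation is labour-intensive, not to reproduce any existing system.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductPresentation:
    """Slots of the illustrative 'sales presentation' event template."""
    company: str
    product: str
    date: str

# One manually written pattern: "<Month D, YYYY> <Company> presented <Product>".
PATTERN = re.compile(
    r"^(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}),? "
    r"(?P<company>[A-Z]\w+) presented (?P<product>.+?)\.?$"
)

def extract_event(sentence: str) -> Optional[ProductPresentation]:
    """Fill the template if the sentence matches the pattern, else return None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return ProductPresentation(match["company"], match["product"], match["date"])

print(extract_event("June 24, 2021 Microsoft presented Windows 11"))
```

Any paraphrase ("Windows 11 was unveiled by Microsoft") already falls outside this pattern, which is consistent with the modest recall reported for template-based systems.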
In recent years, many studies have addressed the problem of text sentiment analysis (cf. Cambria 2017), i.e. the identification of the so-called "tone" of a text: whether it carries a positive or negative attitude towards its referents. This area is important for companies that need to evaluate user comments on their products and services. The problem is likewise addressed by means of specific patterns, dictionaries, and machine learning methods. The Russian dictionary RuSentiLex (Loukachevitch & Levchik 2016) registers over 12,000 lemmas marked as positive, negative or neutral. The main problem of sentiment analysis is context dependency: a word can be positive in certain contexts and negative in others. A possible way of addressing the problem is to compile sentiment lexicons for specific subject areas.
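The dictionary-based approach and the role of domain-specific lexicons can be sketched as follows. The two toy lexicons and the function `sentiment_score` are invented for illustration and are not taken from RuSentiLex or any other published resource.

```python
# Hypothetical general-purpose polarity lexicon: +1 positive, -1 negative.
GENERAL_LEXICON = {"excellent": 1, "reliable": 1, "terrible": -1, "unpredictable": -1}

# Hypothetical domain lexicons that override the general polarity of a word.
DOMAIN_LEXICONS = {"films": {"unpredictable": 1}}

def sentiment_score(text, domain=None):
    """Sum word polarities, letting the domain lexicon override general entries."""
    lexicon = dict(GENERAL_LEXICON)
    lexicon.update(DOMAIN_LEXICONS.get(domain, {}))
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    return sum(lexicon.get(token, 0) for token in tokens)

print(sentiment_score("The steering is unpredictable"))
print(sentiment_score("The plot is unpredictable", domain="films"))
```

The same word receives opposite polarity in the two calls, which is the context dependency that subject-specific lexicons are meant to capture.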
Another fundamental problem is not only to assess the tone of the entire text, but also to determine which referent the sentiment is directed at. This is especially important in applied research on customer reviews of products and services (Solovyev & Ivanov 2014). The accuracy achieved in this area, about 85%, was obtained with BERT technology (Hoang et al. 2019).
Another important task of document processing is text summarization and text skimming (Miranda-Jiménez, Gelbukh & Sidorov 2013). Its practical importance is determined by the gigantic and ever-increasing volume of texts on the Internet. There are two approaches to solving this problem: extractive and abstractive. The extractive approach implies assigning an importance score to each sentence in the text and selecting a small number of the most significant ones. It requires non-trivial mathematical methods to evaluate the informational hierarchy of text parts. The abstractive approach implies generating original sentences that summarize the content of the source text. In recent years the task of generating text abstracts has been successfully performed by neural networks. Sentence parsing algorithms are an important component of summarization systems. A brief overview is provided in Allahyari (2017).
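As a rough illustration of the extractive approach, the sketch below scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones in their original order. This frequency heuristic is a deliberately simple stand-in chosen for the example; the function name `summarize` is hypothetical, and real systems use far more elaborate scoring and remove stop words.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

text = "Neural networks learn from data. Neural networks need data. Cats sleep."
print(summarize(text, 1))
```

Sentences built from words that recur throughout the text rank highest, so the unrelated closing sentence is dropped first.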
Computer analysis of social networks and social media is another application-oriented task. It can pursue multiple objectives, such as monitoring social attitudes, identifying manifestations of extremism and other illegal activities, and even analyzing the spread of epidemics. For example, at the beginning of the coronavirus pandemic researchers proposed analyzing social media content, including the spread of misinformation (cf. Cinelli, Quattrociocchi & Galeazzi 2020). Social network analysis implies determining the content of messages and the connections between users, which makes it possible to identify groups of users with common interests. At the same time, the heterogeneity of content presents a significant challenge. In recent years, neural networks have become the main tool for social network analysis (cf. Ghani et al. 2019). Batrinca & Treleaven (2015) provide an overview of research in the area that is addressed mainly to scholars in the humanities.
Speech analysis and synthesis stand apart in CL, as they require specific software and hardware tools for working with acoustic signals. Speech recognition systems are very diverse and are classified according to many parameters: vocabulary size; speaker type (age, gender); type of speech; purpose; structural units and the principles of their selection (phrases, words, phonemes, diphones, allophones, etc.). The input speech flow is compared with acoustic and language models using various features: spectral-temporal, cepstral, amplitude-frequency, and features of nonlinear dynamics. Speech recognition is challenging because words are pronounced differently by different people in different situations. Nevertheless, there are now many commercial speech recognition systems, in particular those built into Windows. One of the best known is "Watson speech to text" developed by IBM (Cruz Valdez 2021).
Speech recognition is at the heart of voice assistants, which are becoming increasingly popular worldwide. A voice assistant commonly known in Russia is Alice 2 designed and developed by Yandex. Alice is integrated into the Yandex services and searches for information by voice command. For example, it can find a weather forecast on Yandex.Weather, traffic data in Yandex.Maps, etc. Alice can control smart home systems and even entertain: play riddles with children, tell fairy tales and jokes. Speech recognition in voice assistants is facilitated by their ability to tune in to the voice of a particular person. A state-of-the-art review of voice assistants can be found in Nasirian, Ahmadian & Lee (2017), and one of the latest reviews of speech recognition problems is presented in Nassif (2019).
Speech synthesis is actively used in information and reference systems and in airport, railway and office announcements. It is predominantly used in situations with a limited range of synthesized phrases. The simplest way to synthesize speech is to concatenate pre-recorded elements. The quality of synthesized speech is evaluated by its similarity to human speech. High-quality speech synthesis systems are still a dream of many researchers and users. The latest overview of speech synthesis is presented in Tan (2021).
We also address recommender systems, which are probably familiar to all Internet users. Recommender systems predict which objects (movies, music, books, news, websites) might be interesting to a particular user. To do so, they collect information about users, sometimes explicitly, by asking them to rate objects of interest, and more often implicitly, by collecting information about users' behavior on the Internet. The following idea has turned out to be productive: people who rated some objects similarly in the past are most likely to give similar ratings to other objects in the future (Xiaoyuan & Khoshgoftaar 2009). This idea allows researchers to extrapolate user behavior effectively. The development of recommender systems depends heavily on linguistic resources. For example, an effective recommender system relies on dictionaries of synonyms: such a system is supposed to "understand" that "children's films" and "films for children" mean the same. For synonymy in recommender systems, see Moon (2019); a general review is presented in Patel & Patel (2020).
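The idea that users with similar past ratings will rate new objects similarly can be sketched as user-based collaborative filtering. The ratings, user names and the functions `cosine` and `predict` below are illustrative assumptions, and cosine similarity over co-rated items is only one of several similarity measures in use.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = cosine(ratings[user], their)
        num += weight * their[item]
        den += weight
    return num / den if den else None

# Hypothetical explicit ratings on a 1-5 scale.
ratings = {
    "ann": {"film_a": 5, "film_b": 1},
    "bob": {"film_a": 5, "film_b": 1, "film_c": 4},
    "eve": {"film_a": 1, "film_b": 5, "film_c": 2},
}
print(predict(ratings, "ann", "film_c"))
```

Here the prediction for "ann" is pulled towards the rating given by "bob", whose past ratings coincide with hers, rather than towards that of "eve".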
Question-answering systems, or QA systems, are designed to provide answers in natural language, i.e. they have a natural language interface. They search for answers in their own textual database. Like search engines, QA systems give users the ability to search for information. However, an important distinguishing feature of QA systems is that they allow a user to find information that might be implicit, e.g. a film that the user might like but that could not be found with a regular search engine. Obviously, the quality of a QA system depends on the size of its database, i.e. whether it contains an answer to a question at all, as well as on the technologies for processing questions and comparing them with the database information. The processing of a question begins with identifying the type of the question and the expected response. For example, a question beginning with "Who..." suggests that the answer should contain the name of a person. QA systems apply numerous complex CL methods and, similar to recommender systems, face the issue of synonymy (Sigdel 2020). The latest review of QA systems is published by Ojokoh and Adebisi (2018).
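The first processing step, mapping a question to the expected type of answer, can be sketched with a small lookup on the question word. The table `EXPECTED_ANSWER`, its labels and the function name are hypothetical; actual QA systems use much richer question taxonomies and classifiers.

```python
# Hypothetical mapping from question words to expected answer types.
EXPECTED_ANSWER = {
    "who": "PERSON",
    "where": "LOCATION",
    "when": "DATE",
}

def expected_answer_type(question):
    """Guess the expected answer type from the first word of the question."""
    words = question.strip().split()
    first = words[0].lower().strip("?,.") if words else ""
    return EXPECTED_ANSWER.get(first, "UNKNOWN")

print(expected_answer_type("Who wrote the 5th symphony?"))
```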

Methods of Computational Linguistics
All CL methods can be divided into two large classes: those based on dictionaries and rules (templates) and those based on machine learning. These two classes are fundamentally different in their approaches. Dictionaries and rules rely on accumulated knowledge about the language and on highly professional manual labor, and are therefore extremely expensive. Machine learning relies on a large number of examples presented in annotated corpora, which function as training sets. The algorithm analyzes the training set, identifies the patterns in it, and then offers solutions to the problems set. Modern machine learning systems vary in their functions and applications, although deep learning neural networks have proved to be the most efficient. At the input of a neural network, language data is fed in encoded form as tokens: letters, bigrams, short high-frequency morphemes, and words.
The applicability of this approach depends on the amount of annotated text at a researcher's disposal: the larger the training set, the better the neural network will learn. At the same time, annotation is quite simple and does not necessarily require professional linguists, as researchers can rely on native speakers.
In this article, we focus on the basic methods of CL and refer readers to the above-mentioned monographs for a detailed review of the area (cf. Clark et al. 2013, Indurkhya & Damerau 2010). Automatic text analysis usually begins with pre-processing, which includes text segmentation, i.e. splitting the text into words and sentences. This may seem like a simple task, since words are separated from each other by spaces, and sentences begin with a capital letter and end with a period (or, more rarely, an exclamation mark, a question mark, or an ellipsis) followed by a space. The most typical rule or pattern is the following: a period, a space, a capital letter. However, it is not that simple. A period can occur in the middle of a sentence after a first initial, followed by a space and then a capitalized second initial. Here, the period does not mark a sentence boundary. As an example, we can refer to the following sentence: "Lukashevich N.V., Levchik A.V. Creation of a lexicon of evaluative words of the Russian language RuCentilex // Proceedings of the OSTIS-2016 conference. pp. 377-382". Despite all the difficulties, the segmentation problem is considered to be practically solved. As early as 1989, Riley (1989) achieved an accuracy of 99.8% in splitting texts into sentences. To achieve this result, the researcher developed a complex system of rules taking into account the following features: the length of the word before the period, the length of the word after the period, the presence of the word before the period in a dictionary of abbreviations, etc.
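The "period, space, capital letter" rule and two of the corrective features mentioned above (a one-letter initial before the period and a dictionary of abbreviations) can be sketched as follows. This is a simplified illustration, not Riley's system; the abbreviation list and the function `split_sentences` are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny dictionary of abbreviations.
ABBREVIATIONS = {"pp", "etc", "cf", "prof", "dr"}

def split_sentences(text):
    """Split on '. ! ?' + space + capital, skipping initials and abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r"([.!?])\s+(?=[A-Z])", text):
        preceding = text[start:m.start()].split()
        word = preceding[-1] if preceding else ""
        # "A" or "A.V": the part after the last period is one capital letter.
        is_initial = len(word.split(".")[-1]) == 1 and word[-1:].isupper()
        if m.group(1) == "." and (is_initial or word.lower() in ABBREVIATIONS):
            continue  # the period belongs to an initial or an abbreviation
        sentences.append(text[start:m.start() + 1].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("The paper by A. Smith was cited. It is short."))
```

The sketch still fails on sentences that genuinely end in a one-letter capitalized word (for instance "We chose plan B. Then we left."), which illustrates why a high-accuracy splitter needs a much larger system of features.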
The next step in text analysis is morphological analysis. Consider, as an example, a language with complex morphology, Russian. For Russian, morphological analysis is performed by a number of analyzers: MyStem, Natasha, pymorphy2, spaCy, etc. In CL, morphological analysis, whose purpose is to determine the morphological characteristics of a word, is based on a detailed description of inflectional paradigms. For Russian, a reference book of this kind is Zaliznyak (1977), which presents paradigm indices for almost 100,000 lemmas. The existence of such a dictionary made it possible to generate about 3 million word forms for the registered lemmas. For any word form, automatic analysis finds the corresponding lemma and a complete list of morphological characteristics. The main challenge for the existing analyzers is homonymy, which they have not solved yet. When users require a single parse rather than all parsing options, analyzers return the most frequent variant of morphological parsing, ignoring the sense of the word in context.
Another problem is the parsing of so-called "off-list" words, i.e. words not registered in the dictionary. The average share of such words is about 3%, and their morphological analysis requires special algorithms. The simplest solution is the following: based on the analysis of its inflectional ending, an off-list word is assigned a morphological paradigm.
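The ending-based solution can be sketched as a longest-suffix lookup. The three-entry table and its paradigm labels are illustrative assumptions and not the data of any actual analyzer, which would derive such a table from the full dictionary.

```python
# Hypothetical table: word ending -> illustrative paradigm label.
ENDING_TO_PARADIGM = {
    "ость": "feminine noun, 3rd declension",
    "ировать": "verb, infinitive in -ировать",
    "ый": "adjective, hard stem",
}

def guess_paradigm(word, table=ENDING_TO_PARADIGM):
    """Return the paradigm of the longest known ending, or None if none matches."""
    word = word.lower()
    # Try the longest endings first, always leaving at least one stem letter.
    for length in range(len(word) - 1, 0, -1):
        ending = word[-length:]
        if ending in table:
            return table[ending]
    return None

print(guess_paradigm("токенизировать"))
```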
Syntactic analysis, or parsing, is much more complex. The result of parsing a sentence is a syntactic tree that presents the sentence structure either in the formalism of a generative grammar or in the formalism of a dependency grammar (cf. Tesnière 2015). Parsing requires a detailed description of the syntax of the language. The most successful analyzer for the Russian language is ETAP, developed by the Laboratory of Computational Linguistics of the Institute for Information Transmission Problems of the Russian Academy of Sciences as a result of over 40 years of research. Its latest version, ETAP-4, is available at (ENA, June 6, 2020) 3 . The ETAP parser is based on the well-known "Meaning ⇔ Text" model (Melchuk 1974); its formalized version is described in the monograph by Apresyan (1989).
In the last decade, parsing has also been performed by neural networks (cf. Chen & Manning 2014) trained on syntactically annotated corpora. The Penn Treebank (ENA, June 6, 2022) 4 is used for English. For the Russian language, one can use SynTagRus (ENA, June 6, 2022) 5 , developed by the Laboratory of Computational Linguistics at the Institute for Information Transmission Problems RAS.
The task of semantic analysis is even more difficult. However, if we want the computer to "understand" meaning, it is necessary to formalize the semantics of words and sentences. The problem is addressed in two classical ways. The first was initiated by C. Fillmore (1968), who introduced the concept of semantic cases, or roles, of noun phrases in a sentence. Correctly establishing semantic roles is an important step towards sentence comprehension. Fillmore's original ideas were realized in the FrameNet lexical database (ENA, June 6, 2022) 6 .
The second approach was implemented in an electronic thesaurus, or lexical ontology, WordNet (Fellbaum 1998), which was originally designed for the English language. Subsequently, its analogues were developed for many languages. There are numerous analogues of WordNet for the Russian language; the most effective and widely used is the RuWordNet thesaurus (ENA, June 6, 2022) 7 (cf. Loukachevitch & Lashevich 2016), comprising over 130,000 words. WordNet-like thesauri make explicit the semantic relationships between words (concepts), including synonymy, hyponymy, hypernymy, etc., and these systemic relationships partially define the semantics of the words. WordNet has been successfully used in a large number of studies in both linguistics and computer science.
The idea of a vector representation of semantics, i.e. word embeddings, has been proposed relatively recently. It rests on the distributional hypothesis: linguistic units occurring in similar contexts have similar meanings (Sahlgren 2008). This hypothesis has been confirmed in numerous studies that built frequency vectors of words from large text corpora. There are multiple refinements and computer implementations of the idea, the most popular of which is word2vec (Mikolov et al. 2013), available in the Gensim library (ENA, June 6, 2022) 8 . The RusVectores system (Kutuzov & Kuzmenko 2017), available at (ENA, June 6, 2022) 9 , provides vector semantics for Russian words. Specifically, RusVectores evaluates the semantic similarity of words.
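The hypothesis that words occurring in similar contexts have similar meanings can be illustrated with plain co-occurrence counts and cosine similarity. The sketch below is not word2vec, which learns dense vectors with a neural network; the toy corpus, the window size and the function names are assumptions chosen to keep the example self-contained.

```python
from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "engineers build bridges",
]
vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["bridges"]))
```

In this toy corpus "cat" and "dog" share most of their contexts and therefore receive a high similarity score, whereas "cat" and "bridges" share none.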
Obviously, the most important tool for research in CL, as indeed in all modern linguistics, is text corpora. The first corpus, compiled in the 1960s, was the Brown Corpus, which contained one million words when released. Since then, requirements for corpus size have increased dramatically. For the Russian language, the best known is the National Corpus of the Russian Language (NCRL, ENA, June 6, 2022 10 ). Created in 2004, it is constantly updated and currently includes over 600 million words. In 2009, Google compiled and uploaded a very interesting multilingual resource, Google Books Ngram (ENA, June 6, 2022) 11 , containing 500 billion words, 67 billion of which constitute the Russian sub-corpus (cf. Michel 2011).
Another important problem is corpus annotation or tagging, which in difficult cases is done manually. The work is usually carried out by several annotators and their performance consistency is closely monitored (Pons & Aliaga 2021). Despite the fact that corpora have become an integral part of linguistic research, there have been ongoing disputes on their representativeness, balance, differential completeness, subject and genre relatedness, as well as data correctness (cf. Solovyev, Bochkarev & Akhtyamova 2020).
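One common way to quantify the consistency of two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. It is offered here only as an illustration of such monitoring, not as the measure used in the cited work; the function name and the labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both labelled at random with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

A value of 1 corresponds to perfect agreement and a value near 0 to agreement no better than chance.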
Thus, thanks to CL, researchers have implemented numerous services, including information retrieval, automatic error correction, etc. This became possible due to fundamentally important accomplishments not only in computer science, but also in linguistics. CL uses extensive dictionaries and thesauri, detailed syntax models, and giant corpora of texts. Automatic morphological analysis in its modern form would not exist without A. A. Zaliznyak's "Grammatical Dictionary of the Russian Language" (1977). Multiple studies in CL are based on the manually created WordNet and RuWordNet thesauri. Computer technologies, in turn, contribute to the development of linguistics. Text corpora and statistical methods have already become commonplace; without them serious linguistic research would be impossible.
All key CL technologies are publicly available, e.g. (ENA, June 6, 2022) 12 houses programs to solve numerous basic tasks for numerous languages.
It is not feasible to cover all the topics of CL, a vast and rapidly developing field of linguistics, in one article. Many important questions have been left outside its scope. We refer readers interested in the topics of co-reference resolution, disambiguation, topic modeling, etc. to the above-mentioned publications.

Complexity of language and text as a research problem
The core of the special issue is made up of articles focused on text complexity assessment. At first glance, estimating the complexity of a language from the number of categories in its system seems very logical, and the task itself appears feasible. Good examples are the phonological inventory of a language and the number of its morphophonological rules or verb forms. In this case, it becomes possible to compare the complexity of different languages and to assign each of them some objective, absolute complexity (Miestamo, Sinnemäki & Karlsson 2008). Notably, it is this "objective" complexity that is significant when mastering a non-native language. On the other hand, a language acquired as a native language does not present any difficulty for children, and from this point of view the complexity of all languages is the same. Researchers admit that language and text complexity "resists measurement", and scholars working in this field face conceptual and methodological difficulties.
Significant in the light of the problems under study is the relationship and interdependence of two areas of complexity studies: the complexity of language ('langue'), i.e. linguistic complexology, on the one hand, and the complexity of text or discourse ('parole'), i.e. discourse complexology, on the other.
The interpretation of the very concept of "language (langue) complexity" changed dramatically in the 19th-20th centuries. In the 19th century, the Humboldtian theory of the interdependence between the structure of a language and the stage of development of the people speaking it was universally accepted (Humboldt 1999: 37). By accepting this concept, researchers in effect acknowledged the unequal status of languages and peoples. In the 20th century, the Humboldtian views asserting the inequality of languages and their speakers were replaced by the concept of a single complexity, identical for all languages of the world. The idea received two names: ALEC, "All Languages are Equally Complex" (Deutscher 2009: 243), and the linguistic equi-complexity dogma (Kusters 2003: 5). Researchers who support the idea have to prove two hypotheses: (1) the complexity of a language is made up of the sub-complexities of its elements; (2) all sub-complexities in linguistic subsystems are compensated: simplicity in area A is compensated by complexity in area B, and vice versa (the "compensatory hypothesis"). Arguing for the concept "All languages are equally complex", Ch. Hockett quite boldly stated: "Objective measurement is difficult, but impressionistically it would seem that the total grammatical complexity of any language, counting both the morphology and syntax, is about the same as any other. This is not surprising, since all languages have about equally complex jobs to do: and what is not done morphologically has to be done syntactically" (Hockett 1958: 180-181). Unfortunately, works of that period and approach discussed neither complexity criteria nor empirical evidence. For a detailed overview of the "linguistic equi-complexity dogma", see the seminal volume edited by G. Sampson, D. Gil, and P. Trudgill, Language Complexity as an Evolving Variable (Sampson et al. 2009).
The twenty-first century opened with a number of critical reviews of the ALEC theory and with McWhorter's provocative statement that "Creole grammars are the simplest grammars in the world" (McWhorter 2001). The very idea that all languages are equally complex has been convincingly rejected by sociolinguists, who have shown that language contact can lead to language simplification, as seen in Afrikaans, pidgins and koines. If a language can be simplified, then before its simplification it was more complex than after. And if a language can be more or less complex at different periods of its history, then some languages can be more complex than others (Trudgill 2012).
In the early 2000s the idea of linguistic complexity and the "dogma of equal complexity" was actively discussed at conferences and seminars (see the seminar "Language complexity as an evolving variable" organized by Max Planck Institute for Evolutionary Anthropology in 2007 in Leipzig ENA, June 6, 2022 13 ), in a number of journal articles (cf. Shosted 2006, Trudgill 2004) and monographs (Dahl 2009, Kusters 2003, Sampson et al. 2009).
Publications on language complexity in Russia are predominantly reviews of works by foreign scholars, although in recent years interest in the area has visibly grown. The most comprehensive are the studies by A. Berdichevsky (2012) and Vakhtin's (2014) review of Peter Trudgill's book "Sociolinguistic Typology" (2011). The problems of language complexity were also discussed at the Institute for Linguistic Research of the Russian Academy of Sciences (ILI RAS) in 2018 at the conference "Balkan Languages and Dialects: Corpus and Quantitative Studies".
Local and global complexity
The development of linguistic complexology led to the identification of two types of complexity: global, i.e. the complexity of the language (or dialect) as a whole, and local, i.e. the complexity of a particular language level or domain. While the assessment of the global complexity of a language is, according to researchers, a very ambitious and probably hopeless task, which G. Deutscher compares to "chasing wild geese" (Deutscher 2009), the measurement of local complexity is considered feasible: it implies compiling a list of complexity predictors at various language levels and evaluating them. The list of predictors of phonological complexity traditionally includes phoneme inventory, frequency of marked 14 phonemes, tonal differences, suprasegmental patterns, phonotactic restrictions, and maximal consonant clusters (Nichols 2009, Shosted 2006). When evaluating morphological complexity, the classical "inconvenience factors" (Braunmüller's term, 1990: 627) are the size of the inflectional morphology of a language (or language variety), the specificity of allomorphy and morphophonemic processes, etc. (Dammel & Kürschner 2008, Kusters 2003). Syntactic complexity assessment is based on the number of syntactic rules, following the principle "the more, the more difficult", as well as on the ability of a language to generate recursion and embedded clauses within a syntactic whole (Ortega 2003, Givón 2009, Karlsson 2009). Semantic and lexical complexity is estimated based on the number of ambiguous language units, the distinction between inclusive and exclusive pronouns, lexical diversity, etc. (Fenk-Oczlon & Fenk 2008, Nichols 2009). Pragmatic, or "hidden" (latent), complexity, built on the law of economy, is the complexity of the inferences necessary to comprehend texts. Languages with hidden complexity allow for minimalist, very simple surface structures, in which inferring grammatical categories is far from trivial. The idea is exemplified by languages of Southeast Asia, which have achieved a particularly high degree of hidden complexity. It is observed in the omission of pronouns and the resulting multiple co-reference possibilities in relative clauses, the absence of relational markers, and "bare" nouns that lack determiners and thus allow a wide range of interpretations (Bisang 2009).
Research has indicated that a high level of local complexity at one level of a language does not necessarily entail low local complexity at another level, as predicted by the "dogma of equal complexity". For example, the analysis of metrics of morphological and phonological complexity in 34 languages carried out by R. Shosted did not reveal the expected statistically significant correlation (Shosted 2006). The individual "balancing effects" (trade-offs) between local complexities observed by G. Fenk-Oczlon and A. Fenk are likewise insufficient to validate the "dogma of equal complexity" of languages. G. Fenk-Oczlon and A. Fenk found, in particular, that in English the tendency towards phonological complexity and monosyllabicity is associated with a tendency towards homonymy and polysemy, as well as towards a fixed word order and idiomatic speech (Fenk-Oczlon & Fenk 2008: 63). D. Gil has convincingly argued that isolating languages do not necessarily compensate for simple morphology with more complex syntax (Gil 2008).
Factors (or predictors) of language complexity are usually divided into internal and external ones. The number of elements and categories in the language, as well as the redundancy and irregularity of language categories, are viewed as internal factors of complexity. To assess internal complexity, the modern paradigm has developed the so-called "list approach", which implies compiling a list of linguistic phenomena whose presence in a language increases its complexity. In fact, the lists of internal complexity predictors are lists of the local complexity predictors described above. For example, the list of complexity predictors compiled by J. Nichols contains over 18 parameters and includes phonological, morphological, syntactic and lexical features (Nichols 2009). A language is considered more complex if it has more marked phonemes, tones, syntactic rules, grammatically expressed semantic and / or pragmatic differences, morphophonemic rules, more cases of addition, allomorphy, agreement, etc. Scholars working in the area are interested, for example, in the number of grammatical categories in a language (Shosted 2006), the number of phonemic oppositions (McWhorter 2008), and the length of the "minimal description" of the language system (Dahl 2009). McWhorter (2001) compares word order, i.e. the position of the verb, in the Germanic languages and argues that English syntax has a lower degree of complexity than that of Swedish and German. The reason for the claim is the loss in English of the V2 (verb-second) rule, according to which the finite verb in Swedish and German takes the second position in the sentence.
Language elements and functions with "duplicate" information or overspecification are viewed as "redundant" internal predictors of complexity, and therefore as optional elements in a discourse (McWhorter 2008). P. Trudgill calls such elements "historical baggage" (Trudgill 1999). The irregularity or "opacity" of form and word-formation processes is also viewed as an internal factor of language complexity (see Mühlhäusler 1974).
External factors that determine language complexity are culture, language age and language contacts. Older languages serving well-developed multi-level cultures are considered to be more complex because they have accumulated "mature language features" (cf. Dahl 2009, Deutscher 2010, Parkvall 2008). At the same time, intensive contacts between linguistic communities have a significant impact on the complexity of languages. At the beginning of this century, P. Trudgill stated that "small, isolated, low-contact communities with tight social networks" develop more complex languages than high-contact communities (Trudgill 2004: 306). However, in his later work, the researcher clarifies that the dynamics of complexity in interacting languages is determined by the duration of contacts and the age of the speakers mastering the superstratum: language simplification occurs during short-term contacts of communities, when adults learn a foreign (second) language. Language complication can take place when the contacts are long-term and the second language is mastered not by adults but by children (Trudgill 2011). To prove the influence of language contacts on language complexity, B. Kortmann and B. Szmrecsanyi (2004) compare the ways of implementing 76 morphosyntactic parameters, including the number of pronouns, noun phrase patterns, tense and aspect, modal verbs, verb morphology, adverbs, ways of expressing negation, agreement, word order, etc. in 46 varieties of the English language.
The researchers divide all varieties of the English language into three large groups: (1) varieties that are native to their speakers and perform all functions in the language community; (2) varieties that function as the second official language of the state; and (3) creole languages based on English. The study confirmed that the third group, i.e. English-based creoles, is the least complex, native (first-language) English varieties are the most complex, and second-language English varieties exhibit intermediate complexity (Kortmann & Szmrecsanyi 2004).
In the most general terms, analytical methods for assessing complexity are divided into absolute (theory-oriented and treated as "objective") and relative (user-oriented and thus "subjective") (Crossley et al. 2008). The absolute approach is popular in linguistic typology and is used to assess language complexity, while sociolinguistics and psycholinguistics use the relative approach. P. Trudgill defines relative difficulty as the difficulty which adults experience while learning a foreign language (Trudgill 2011: 371).
Text complexity as a construct is also modeled in discourse studies, linguistic personology, psycholinguistics and neurolinguistics. These studies also address the relative complexity (or difficulty) of a text for different categories of recipients in different communicative environments, as well as the absolute and relative (comparative) complexity of texts generated by different authors (see McNamara et al. 1996, Solnyshkina 2015).

Summary of articles in the issue
The current issue contains a detailed review and discussion of the best practices of text and discourse complexity assessment, as well as of methods ranging from purely linguistic to complex interdisciplinary ones involving multiple hardware and software tools.
One of the methods, i.e. eye-tracking, is viewed in the area as an objective way of assessing text complexity for different categories of readers. Research implementing eye-tracking techniques to evaluate the complexity of Russian texts remains sparse. The basic task here is to select parameters of the text and of oculomotor activity, as well as to identify methods of measuring the perception of text complexity. The features typically selected to measure text complexity are average word length and word frequency; oculomotor activity is preferably assessed with the relative speed of reading a word, the duration of fixations, and the number of fixations. Text readability is estimated as the number of words read per minute. Eye-tracking is the focus of the articles contributed by Laposhina and co-authors and by Bonch-Osmolovskaya and co-authors. Laposhina and co-authors show that the number of fixations on a word correlates with its length, while the duration of fixations correlates with its frequency. The research of Bonch-Osmolovskaya and co-authors is aimed at elementary discursive units (EDU) defined as "the quantum of oral discourse, a minimum element of discourse dynamics" (cf. Podlesskaya, Kibrik 2009: 309). Eye-tracking techniques make it possible to show that the structure of EDUs affects text readability.
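The parameters named above can be illustrated with a minimal Python sketch. It is not the procedure used in the articles of the issue: the function names, the frequency dictionary and the fixation values are hypothetical and serve only to show how average word length, average word frequency, fixation statistics and reading speed in words per minute could be computed for one text and one reader.

```python
import re
from statistics import mean


def text_parameters(text, freq):
    """Average word length and average frequency of the words in `text`.

    `freq` is a hypothetical dictionary mapping a word to its relative
    corpus frequency; words missing from it are counted with frequency 0.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    return {
        "n_words": len(words),
        "avg_word_length": mean(len(w) for w in words),
        "avg_word_frequency": mean(freq.get(w, 0.0) for w in words),
    }


def reading_metrics(n_words, fixation_durations_ms, reading_time_s):
    """Oculomotor summary for one reader and one text."""
    if reading_time_s <= 0 or not fixation_durations_ms:
        raise ValueError("need a positive reading time and at least one fixation")
    return {
        "n_fixations": len(fixation_durations_ms),
        "mean_fixation_ms": mean(fixation_durations_ms),
        # readability estimate: words read per minute
        "words_per_minute": n_words / reading_time_s * 60,
    }


# Toy input (invented values, for illustration only).
toy_freq = {"the": 0.05, "cat": 0.001}
params = text_parameters("The cat sat on the mat.", toy_freq)
metrics = reading_metrics(params["n_words"], [200, 250, 180, 220], 3.0)
print(params)
print(metrics)
```

In a real study the word list would be lemmatized and the fixations would be aligned with individual words, which this sketch deliberately omits.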
Neural network methods are used to assess text complexity in the articles by Corlatescu et al., Sharoff, Morozov et al., and Ivanov et al. They also share the object of research, i.e. texts for studying Russian as a foreign language. An accurate assessment of their complexity enables better text selection for various educational environments. For example, the BERT model mentioned above assesses text complexity with a high degree of accuracy, i.e. 91-92%.
While using neural networks, researchers face an important research problem, i.e. which text features affect the results produced by a neural network. A possible approach here is to use a neural network to measure the correlation coefficients of numerous text features with text complexity. An extensive study of collections of texts of various genres in English and Russian, taking into account dozens of linguistic features, has made it possible to identify a number of non-obvious effects. For example, research shows that more prepositions are used in more complex texts in Russian, but in simpler texts in English. This is apparently due to the difference in the typological structures of the two languages. Notably, however, genre has a much larger effect on text complexity across all languages than the differences between languages.
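The idea of correlating text features with complexity can be sketched in a few lines of Python. The sketch below swaps the neural network for a plain Spearman rank correlation between each feature and a known complexity level, so it illustrates only the correlation step, not the method of the articles. The feature names and all values are invented toy data.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: five texts with hypothetical complexity levels 1..5.
complexity = [1, 2, 3, 4, 5]
features = {
    "preposition_share": [0.08, 0.09, 0.11, 0.12, 0.14],
    "avg_word_length": [4.1, 4.8, 4.6, 5.3, 5.9],
    "pronoun_share": [0.12, 0.10, 0.09, 0.07, 0.05],
}
correlations = {name: spearman(vals, complexity) for name, vals in features.items()}
for name, rho in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {rho:+.2f}")
```

Running the same computation separately for each language and genre would expose effects of the kind described above, such as a feature whose correlation with complexity changes sign between languages.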
A broad review of the multiple methods applied in the area is provided in the work of M.I. Solnyshkina and co-authors. The paper covers six historical paradigms of discourse complexology: the formative period, the classical period, the period of closed tests, the structural-cognitive period, the period of natural language processing, and the period of artificial intelligence.
An important distinguishing feature of the articles in this special issue, and their contribution to discourse complexology, is their diverse and extensive data: several hundred linguistic features, different languages, different text corpora, different genres. Text complexity is assessed on several levels: lexical, morphological, syntactic, and discourse. Such multifaceted studies help to explicate the nature of text complexity. The publications in the current issue also provide information on the corpora and dictionaries being compiled.
One of the most important parameters of text complexity is abstractness. The more abstract words a text contains, the more difficult it is. This makes it relevant to compile dictionaries of abstract/concrete words and to develop means of estimating text abstractness. English dictionaries of abstract/concrete words were published at the turn of the century, whereas the Russian language was until recently viewed as "under-resourced", since no dictionary identifying the degree of abstractness of words was available. Solovyov and co-authors present a detailed methodology of composing a dictionary of abstractness for the Russian language. The article also describes the areas of dictionary application.
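One simple way a dictionary of abstractness could be applied is sketched below in Python. This is an assumption-laden illustration, not the methodology of Solovyov and co-authors: the rating scale (1 = concrete, 5 = abstract), the toy dictionary and the function name are all hypothetical, and words are matched by surface form without lemmatization, which a morphologically rich language such as Russian would require.

```python
import re


def text_abstractness(text, ratings):
    """Mean abstractness rating of the rated words in `text`.

    Returns (mean_rating, coverage), where coverage is the share of word
    tokens found in `ratings`. If no word is rated, mean_rating is None.
    """
    words = re.findall(r"\w+", text.lower())
    rated = [ratings[w] for w in words if w in ratings]
    if not rated:
        return None, 0.0
    return sum(rated) / len(rated), len(rated) / len(words)


# Hypothetical ratings on an assumed scale from 1 (concrete) to 5 (abstract).
toy_ratings = {"freedom": 4.6, "idea": 4.2, "table": 1.2, "dog": 1.1}

for sample in ("The dog sat under the table", "Freedom is an idea"):
    score, coverage = text_abstractness(sample, toy_ratings)
    print(f"{sample!r}: abstractness={score:.2f}, coverage={coverage:.2f}")
```

Reporting coverage alongside the score is a design choice of this sketch: a mean computed over only a small share of the words in a text says little about the text as a whole.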
Linguistic complexity is an interdisciplinary problem, an object of computational linguistics, philosophy, applied linguistics, psychology, neurolinguistics, etc. In the 21st century, complexity studies acquired concepts and terminology, developed and verified a wide range of linguistic parameters of complexity. The main achievement of the new paradigm was the validation of cognitive predictors of complexity enabling the assessment of discourse complexity. This success, as well as an interdisciplinary approach to the problem, made it possible to integrate studies of discourse complexity into a separate area, i.e. discourse complexology. Complexity issues are not an "end in itself", since the research results are relevant both for linguistic analysis and for predicting comprehension in a wide range of pragmalinguistic situations.
One of these situations is the cognitive analysis of mistakes made in foreign language learning, which is the object of the research conducted by Lyashevskaya and colleagues and by Janda and colleagues. Both studies focus on the interrelationship between the complexity of texts and the cognitive resources necessary to comprehend them. Lyashevskaya et al. established that the number of mistakes made by a student is correlated with the morphological complexity of his/her discourse. Janda et al. present a computer system designed to analyze and adequately explain the mistakes of a learner of Russian as a foreign language.

Conclusion
The recent successes of computational linguistics have largely ensured accomplishments in discourse complexology and allowed scientists not only to automate a number of linguistic analysis operations, but also create user-friendly text profilers. Tools such as ReaderBench, Coh-Metrix, and RuMOR (cf. the current issue) are capable of solving both research and practical tasks: selecting texts for target audiences, editing and shortening texts, analyzing cognitive causes of errors, and even suggesting verbal strategies. The algorithms of automatic text profilers are based on classical and machine learning methods, including deep learning neural networks, one of the latest systems of which is BERT. At present, and this is well shown in a number of articles of the special issue, researchers are successfully combining methods of machine learning and the so-called "parametric approach".
However, the most important feature of modern research is a vast expansion of research problems and an increase in accuracy resulting from the ability of artificial neural networks to learn and adapt. Artificial intelligence breakthroughs are attributable to three main factors: new advanced self-learning algorithms, high computer speeds, and a significant increase in training data. Modern databases, as well as dictionaries and tools for the Russian language developed in recent years, allowed the authors of the special issue to address and successfully solve a number of problems of text complexity. A solid foundation for success in discourse complexology was laid by the findings of cognitive scientists at the beginning of this century, which completely changed the paradigm of complexology. If the main achievement of 20th century complexology was the idea that "different types of texts are complex in different ways", the discourse complexology of the 21st century proposed and verified complexity predictors for various types of texts and developed toolkits for assessing the relative complexity of texts in various communicative situations. With cognitive methods in its arsenal, complexology acquired two additional variables: the linguistic personality of the reader and the reading environment.
The new research paradigm of linguistic complexology is manifested in those articles of the special issue which are aimed at defining new criteria for text complexity: expert evaluation, comprehension tests and reading speed tests have been replaced by new methods, which allow scholars to identify discourse units affecting text comprehension.
The studies published in the special issue also highlighted the main problems facing Russian linguistic complexology: creating a complexity matrix for texts of various types and genres, expanding the list of complexity predictors, validating new complexity criteria, and expanding databases for the Russian language.