Natural Language Processing and Fiction Text: Basis for Corpus Research

Alexey I. Gorozhanov; Innara A. Guseynova; Darya V. Stepanova

doi:10.22363/2313-2299-2024-15-1-195-210

Authors: Gorozhanov A.I.¹, Guseynova I.A.¹, Stepanova D.V.²
Affiliations:
1. Moscow State Linguistic University
2. Minsk State Linguistic University
Issue: Vol. 15, No. 1 (2024)
Pages: 195-210
Section: DISCURSIVE AND CORPUS-BASED STUDIES
URL: https://journals.rudn.ru/semiotics-semantics/article/view/38629
DOI: https://doi.org/10.22363/2313-2299-2024-15-1-195-210
EDN: https://elibrary.ru/FKVAOI

Abstract

The article examines natural language processing (NLP) procedures applied to fiction texts in German and English, which are treated as strong cultural texts. The aim of the study is to develop a model of a tool for processing, analyzing, and interpreting fiction texts that would exploit the full potential of popular NLP tools within a corpus-based approach. The general methods used in the study are analysis and synthesis. Individual tasks are additionally addressed with special methods: the descriptive method, modeling, and qualitative-quantitative analysis. The scientific novelty lies in combining the fundamental principles of the "classical" theory of text interpretation with the latest methods and tools of applied linguistics. As a result, special software was developed that can work with linguistic corpora stored in SQL databases and built automatically with the spaCy library and the Python programming language. The application can be used for interpreting fiction texts and for compiling teaching materials for the "Home Reading" course. The authors expect that developing special software for strong cultural texts will stimulate the search for scientific solutions and, at the same time, help clarify the essential differences between natural and artificial intelligence.
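As an illustration of what a corpus "based on an SQL database" and built from spaCy annotations might look like, here is a minimal sketch. It is not the authors' software: the table layout, the function names (build_corpus, lemma_frequencies, rows_from_spacy_doc), and the sample sentences are all assumptions made for this example. To stay runnable without a spaCy model, the demo uses hand-annotated rows; the helper rows_from_spacy_doc shows how the same rows could be derived from a spaCy Doc whose pipeline sets sentence boundaries, lemmas, and part-of-speech tags.

```python
import sqlite3

# Hypothetical schema (the article does not publish one): one row per token.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tokens (
    id          INTEGER PRIMARY KEY,
    sentence_no INTEGER NOT NULL,
    position    INTEGER NOT NULL,
    text        TEXT NOT NULL,
    lemma       TEXT NOT NULL,
    pos         TEXT NOT NULL
)
"""

# Invented, hand-annotated sample: "The dog ran. The dogs bark."
SAMPLE_ROWS = [
    (0, 0, "The", "the", "DET"),
    (0, 1, "dog", "dog", "NOUN"),
    (0, 2, "ran", "run", "VERB"),
    (1, 0, "The", "the", "DET"),
    (1, 1, "dogs", "dog", "NOUN"),
    (1, 2, "bark", "bark", "VERB"),
]


def rows_from_spacy_doc(doc):
    """Turn a spaCy Doc into (sentence_no, position, text, lemma, pos) rows.

    Assumes the pipeline that produced `doc` sets sentence boundaries,
    lemmas, and coarse part-of-speech tags.
    """
    return [
        (s_no, tok.i - sent.start, tok.text, tok.lemma_, tok.pos_)
        for s_no, sent in enumerate(doc.sents)
        for tok in sent
    ]


def build_corpus(conn, rows):
    """Create the tokens table if needed and insert the annotated rows."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO tokens (sentence_no, position, text, lemma, pos) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def lemma_frequencies(conn, pos):
    """Return (lemma, count) pairs for one part of speech, most frequent first."""
    cur = conn.execute(
        "SELECT lemma, COUNT(*) AS n FROM tokens WHERE pos = ? "
        "GROUP BY lemma ORDER BY n DESC, lemma ASC",
        (pos,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_corpus(conn, SAMPLE_ROWS)
    print(lemma_frequencies(conn, "NOUN"))  # [('dog', 2)]
```

Storing one token per row keeps queries such as lemma counts, part-of-speech filters, or sentence lookups in plain SQL, which is one plausible reason to pair a tagger with a relational database for a corpus tool of this kind.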

Keywords

natural language processing, fiction text, linguistic corpus, F. Kafka, J. London, applied linguistics

About the authors

Alexey I. Gorozhanov

Moscow State Linguistic University

Corresponding author.
Email: a_gorozhanov@mail.ru
ORCID iD: 0000-0003-2280-1282
SPIN-code: 1753-4920

Doctor of Philological Sciences, Associate Professor, Professor at the Department of Grammar and History of the German Language, Faculty of the German Language

38 Ostozhenka St., bldg. 1, Moscow, 119034, Russian Federation

Innara A. Guseynova

Moscow State Linguistic University

Email: guseynova@linguanet.ru
ORCID iD: 0000-0002-6544-699X
SPIN-code: 1635-5260

Doctor of Philological Sciences, Associate Professor, Vice-Rector

38 Ostozhenka St., bldg. 1, Moscow, 119034, Russian Federation

Darya V. Stepanova

Minsk State Linguistic University

Email: daryastepanova79@gmail.com
ORCID iD: 0000-0002-2857-4386
SPIN-code: 5291-8660

Candidate of Philological Sciences, Associate Professor

21 Zakharova St., Minsk, 220034, Republic of Belarus
