MMEmAsis: a multimodal method for assessing the human psychophysiological state
- Authors: Kiselev G.A.1,2, Lubysheva Ya.M.1, Veitsenfeld D.A.1,2
- Institutions:
- RUDN University
- Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
- Issue: Vol. 32, No. 4 (2024)
- Pages: 370-379
- Section: Informatics and Computer Engineering
- URL: https://journals.rudn.ru/miph/article/view/43666
- DOI: https://doi.org/10.22363/2658-4670-2024-32-4-370-379
- EDN: https://elibrary.ru/EPGKRU
- ID: 43666
Abstract
The article presents a new multimodal approach to analyzing a person’s psychoemotional state using nonlinear classifiers. The main modalities are the subject’s speech and video of facial expressions. Speech is digitized and transcribed with the Pisets library, and sentiment features are extracted with the Titanis system developed at FRC “Computer Science and Control” of RAS. For visual analysis, two different approaches were implemented: a fine-tuned ResNet model for direct classification of moods from facial expressions, and a deep learning model that integrates ResNet with a graph-based deep neural network for recognizing facial expression features. Both approaches faced difficulties caused by environmental factors affecting the stability of the results. The second approach demonstrated greater flexibility thanks to adjustable classification vocabularies, which facilitated calibration after deployment. Integrating textual and visual data significantly improved the accuracy and reliability of the analysis of a person’s psychoemotional state.
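The abstract describes fusing text-derived sentiment features with visual features from the ResNet branch. Below is a minimal late-fusion sketch in PyTorch; it is an illustration of the idea only, not the authors’ architecture, and the feature dimensions, hidden size and six-state output are assumptions.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion head: project text and visual embeddings,
    concatenate them, and classify into psychoemotional states."""

    def __init__(self, text_dim: int, visual_dim: int, hidden_dim: int, num_states: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)      # e.g. sentiment features from the text pipeline
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # e.g. pooled ResNet features from face crops
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(2 * hidden_dim, num_states),
        )

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.text_proj(text_feats), self.visual_proj(visual_feats)], dim=-1)
        return self.head(fused)

# Illustrative dimensions only: a 768-d text vector, a 512-d visual vector, 6 target states.
model = LateFusionClassifier(text_dim=768, visual_dim=512, hidden_dim=256, num_states=6)
logits = model(torch.randn(4, 768), torch.randn(4, 512))  # batch of 4 samples
print(logits.shape)  # torch.Size([4, 6])
```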
Full text
1. Introduction

Automatic detection and identification of signs of psychoemotional states is one of the topical applied directions in the development of engineering and artificial intelligence technologies. Such systems make it possible to automate the monitoring of the actions of both individuals and groups of people, including in high-risk locations, by informing the supervising services in a timely manner. In recent years, the importance of automatic multimodal recognition of users’ psychoemotional states has been growing, providing the next level after syntactic and semantic analysis and keyword search in emotion dictionaries. Automatic multimodal recognition techniques increase the amount of information processed, which has a positive effect on the accuracy of emotion recognition. Adding video and audio modalities also makes it possible to analyze users’ gestures, facial expressions and sequences of reactive movements, as well as the timbre and volume of the voice and hidden artifacts in it. These factors substantially complement classical text-based methods of analyzing users’ emotional states and enable practical applied systems.

Rapid recognition of emotional states, as an applied task of artificial intelligence technology, is in demand in many fields. Risk analysis of employee behavior allows an employer to plan the company’s business processes optimally and to predict the personal efficiency of employees and of the team as a whole. Monitoring an employee’s condition makes it possible to take timely measures to stabilize it at the individual level or to solve broader organizational problems. The library can also be used to model the psychological climate of teams.

To recognize target psychoemotional states, basic emotion recognition methods supplemented with behavioral models can be used. Most existing products (FaceReader by the Dutch company Noldus, EmoDetect, etc.) are based on the theory of basic emotions, where the classes are six emotions: joy, surprise, sadness, disgust, anger and fear. In the present project, a complex psychophysiological state is revealed not only from facial signs but also by analyzing the subject’s movements and speech. This way of analysis is chosen on the basis of the behavioral approach and its differences from the discrete model of emotions. An example of a discrete-model system is the development in [1], which uses human skeletal landmarks to analyze movements and identify the six emotions mentioned above. In addition to the analysis of movements and facial expressions, multimodal methods (video, audio, text) are used to recognize emotions, as in [2-4]. Subjective psychological experience is inevitably accompanied by the physiological changes necessary to organize a particular behavior. Emotion allows rapid organization of the responses of separate physiological systems, including facial expressions, somatic muscle tone, acoustic characteristics of the speech signal, and the autonomic nervous and endocrine systems, to prepare the organism for adaptive behavior [5-7].

2. Modality overview

Analyzing human emotion is a complex process involving step-by-step extraction of a feature space and its analysis.

Analysis of facial features. Facial expressions (mimics) are coordinated movements of facial muscles. Certain facial expressions that occur to communicate one’s state to others (the expression of emotions) are closely related to the psychophysiological state.
The mimic expression of basic emotions is very similar across cultures, but it is often masked by particular cultural attitudes and is partially discordant with subjective experiences and physiological indicators, which justifies validation within a specific culture.

Analysis of gestures and posture. The need to analyze human gestures and posture is due to two main factors. Posture, like facial expression, is an important means of expressing emotions. Posture analysis reveals not only obvious psychophysiological states but also subtler non-verbal signals reflecting tension, fatigue or stress that may not be explicitly expressed through facial expressions.

Speech analysis. The need for speech analysis stems from the increased accuracy of emotion recognition when features such as the acoustic and tempo-dynamic characteristics of speech are identified.

Assessing the dynamics of posture change. One important quantitative measure is the change in posture over time. The dynamics of body movements offer a rich source of emotional information that cannot be obtained from static postures alone. The way a person moves from one pose to another, and the speed and fluidity of the movements, can indicate specific emotions with greater clarity and nuance. Information from a person’s face, voice and posture is interrelated with the person’s movements, reinforcing the emotions expressed in the face and voice. Certain emotions are closely related to specific temporal movement patterns. For example, sudden, jerky movements may indicate surprise or fear, while slow, sluggish movements may signal fatigue or depression. Capturing these dynamics is critical for accurate emotion recognition. Understanding the context of a movement sequence can greatly influence its emotional interpretation: by analyzing the dynamics, the context and progression of emotional states can be better understood, leading to more accurate recognition. Some emotions are expressed through subtle changes in movement dynamics that would be missed if only static postures were analyzed; evaluating the dynamics allows these subtle signals to be detected. In user applications, understanding the dynamics of body movement can lead to more immersive and responsive experiences, allowing systems to respond not only to the fact of movement but also to its emotional content. Analyzing body movement dynamics can also help predict future actions and emotional states, which is valuable in the fields of safety, health and education [8]. Analysis of dynamics (of both pose and facial expressions) can eliminate artifacts associated with an individual’s habitual postures and expressive mannerisms: while the analysis of static images can be distorted by facial or body features, analyzing the changes that occur significantly increases the reliability of the data obtained.

3. Methods

Algorithm 1. Algorithmic representation of approach 1.
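As a rough illustration of the first approach named in the abstract (a fine-tuned ResNet classifying moods directly from facial expressions), the sketch below replaces the head of a torchvision ResNet-18 and averages frame-level predictions over a clip. The ResNet depth, the six-emotion vocabulary, the use of pre-cropped face images and the averaging rule are assumptions for illustration, not the authors’ exact Algorithm 1.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Assumed label vocabulary; the paper's actual class set may differ.
EMOTIONS = ["joy", "surprise", "sadness", "disgust", "anger", "fear"]

# Start from ImageNet weights and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(EMOTIONS))
model.eval()  # fine-tuning on labeled face crops is assumed to have been done

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def classify_clip(face_crops):
    """face_crops: list of PIL.Image face crops taken from consecutive video frames."""
    batch = torch.stack([preprocess(img) for img in face_crops])
    probs = torch.softmax(model(batch), dim=-1)
    clip_probs = probs.mean(dim=0)  # average frame-level predictions over the clip
    return EMOTIONS[int(clip_probs.argmax())], clip_probs
```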
About the authors
G. A. Kiselev
RUDN University; Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
Email: kiselev@isa.ru
ORCID iD: 0000-0001-9231-8662
Scopus Author ID: 57195683637
ResearcherId: Y-6971-2018
Candidate of Technical Sciences, Senior Lecturer at the Department of Mathematical Modeling and Artificial Intelligence of RUDN University; Researcher at the Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation; 44 Vavilova St, bldg 2, Moscow, 119333, Russian Federation
Ya. M. Lubysheva
RUDN University
Email: gorbunova_y_m@mail.ru
ORCID iD: 0000-0001-6280-6040
Master’s degree student at the Department of Mathematical Modeling and Artificial Intelligence
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation
D. A. Veitsenfeld
RUDN University; Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
Corresponding author.
Email: veicenfeld@isa.ru
ORCID iD: 0000-0002-2787-0714
Master’s degree student at the Department of Mechanics and Control Processes
6 Miklukho-Maklaya St, Moscow, 117198, Russian Federation; 44 Vavilova St, bldg 2, Moscow, 119333, Russian Federation
References
- Piana, S., Staglianò, A., Odone, F., Verri, A. & Camurri, A. Real-time Automatic Emotion Recognition from Body Gestures 2014. doi: 10.48550/arXiv.1402.5047.
- Hu, G., Lin, T., Zhao, Y., Lu, G., Wu, Y. & Li, Y. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. doi: 10.48550/arXiv.2211.11256 (2022).
- Zhao, J., Zhang, T., Hu, J., Liu, Y., Jin, Q., Wang, X. & Li, H. M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Association for Computational Linguistics, Dublin, Ireland, May 2022), 5699-5710. doi: 10.18653/v1/2022.acl-long.391.
- Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E. & Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. doi: 10.48550/arXiv.1810.02508 (2018).
- Ekman, P. Emotion: common characteristics and individual differences. Lecture presented at the 8th World Congress of I.O.P., Tampere, Finland (1996).
- Levenson, R. W. The intrapersonal functions of emotion. Cognition & Emotion 13, 481-504 (1999).
- Keltner, D. & Gross, J. Functional accounts of emotions. Cognition & Emotion 13, 467-480 (1999).
- Ferdous, A., Bari, A. & Gavrilova, M. Emotion Recognition From Body Movement. IEEE Access. doi: 10.1109/ACCESS.2019.2963113 (Dec. 2019).
- Zadeh, A., Liang, P., Poria, S., Cambria, E. & Morency, L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (July 2018), 2236-2246. doi: 10.18653/v1/P18-1208.
- Busso, C., Bulut, M., Lee, C. et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resources & Evaluation 42, 335-359. doi: 10.1007/s10579-008-9076-6 (2008).
- Kossaifi, J. et al. SEWA DB: A Rich Database for Audio-Visual Emotion and Sentiment Research in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 13. doi: 10.1109/TPAMI.2019.2944808 (Oct. 2019).
- O’Reilly, H., Pigat, D., Fridenson, S., Berggren, S., Tal, S., Golan, O., Bölte, S., Baron-Cohen, S. & Lundqvist, D. The EU-Emotion Stimulus Set: A validation study. Behav Res Methods 48, 567-576. doi: 10.3758/s13428-015-0601-4 (2016).
- Soleymani, M., Lichtenauer, J., Pun, T. & Pantic, M. A Multimodal Database for Affect Recognition and Implicit Tagging. IEEE Transactions on Affective Computing 3, 42-55. doi: 10.1109/T-AFFC.2011.25 (2012).
- Chou, H. C., Lin, W. C., Chang, L. C., Li, C. C., Ma, H. P. & Lee, C. C. NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII) (2017), 292-298. doi: 10.1109/ACII.2017.8273615.
- Ringeval, F., Sonderegger, A., Sauer, J. & Lalanne, D. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2013), 1-8. doi: 10.1109/FG.2013.6553805.
- Reznikova, J. I. Intelligence and language in animals and humans 253 pp. (Yurayt, 2016).
- Samokhvalov, V. P., Kornetov, A. N., Korobov, A. A. & Kornetov, N. A. Ethology in psychiatry 217 pp. (Health, 1990).
- Gullett, N., Zajkowska, Z., Walsh, A., Harper, R. & Mondelli, V. Heart rate variability (HRV) as a way to understand associations between the autonomic nervous system (ANS) and affective states: A critical review of the literature. International Journal of Psychophysiology 192, 35-42. doi: 10.1016/j.ijpsycho.2023.08.001 (2023).
- Bondarenko, I. Pisets: A Python library and service for automatic speech recognition and transcribing in Russian and English https://github.com/bond005/pisets.
- Savchenko, A. V. Facial expression and attributes recognition based on multi-task learning of lightweight neural networks in 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY) (2021), 119-124.
- Luo, C., Song, S., Xie, W., Shen, L. & Gunes, H. Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition. arXiv preprint arXiv:2205.01782 (2022).
- Gajarsky, T. Facetorch: A Python library for analysing faces using PyTorch https://github.com/tomasgajarsky/facetorch.
- Deng, J., Guo, J., Ververas, E., Kotsia, I. & Zafeiriou, S. Retinaface: Single-shot multi-level face localisation in the wild in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020), 5203-5212.