izvestswsu

Известия Юго-Западного государственного университета

Proceedings of the Southwest State University

2223-1560; 2686-6757

ЮЗГУ

10.21869/2223-1560-2021-25-1-82-109

izvestswsu-869

Research Article

Информатика, вычислительная техника и управление

Computer science, computer engineering and IT management

Применение многозадачного глубокого обучения в задаче распознавания эмоций в речи

Applying Multitask Deep Learning to Emotion Recognition in Speech

https://orcid.org/0000-0002-3572-4493

Рябинов

А. В.

Ryabinov

A. V.

Рябинов Артем Валерьевич, программист лаборатории автономных робототехнических систем, Санкт-Петербургский Федеральный исследовательский центр Российской академии наук (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук

14-я линия В. О. 39, Санкт-Петербург 199178

Artem V. Ryabinov, Software Engineer of Laboratory of Autonomous Robotic Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178

iamryabinov@gmail.com

https://orcid.org/0000-0002-7032-0291

Уздяев

М. Ю.

Uzdiaev

M. Yu.

Уздяев Михаил Юрьевич, младший научный сотрудник лаборатории технологий больших данных социокиберфизических систем, Санкт-Петербургский Федеральный исследовательский центр Российской академии наук (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук

14-я линия В. О. 39, Санкт-Петербург 199178

Mikhail Yu. Uzdiaev, Junior Researcher of Laboratory of Big Data In Socio-Cyberphysical Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178

uzdyaev.m@iias.spb.su

https://orcid.org/0000-0001-5388-8152

Ватаманюк

И. В.

Vatamaniuk

I. V.

Ватаманюк Ирина Валерьевна, младший научный сотрудник лаборатории автономных робототехнических систем, Санкт-Петербургский Федеральный исследовательский центр Российской академии наук (СПб ФИЦ РАН), Санкт-Петербургский институт информатики и автоматизации Российской академии наук

14-я линия В. О. 39, Санкт-Петербург 199178

Irina V. Vatamaniuk, Junior Researcher of Laboratory of Autonomous Robotic Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178

vatamaniuk.i.v@gmail.com

Санкт-Петербургский Федеральный исследовательский центр Российской академии наук; Санкт-Петербургский институт информатики и автоматизации Российской академии наук

St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS); St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

2021

30.05.2021

Vol. 25, No. 1, pp. 82-109

2021

Рябинов А.В., Уздяев М.Ю., Ватаманюк И.В.

Ryabinov A.V., Uzdiaev M.Y., Vatamaniuk I.V.

Данная работа распространяется под лицензией Creative Commons Attribution 4.0.

This work is licensed under a Creative Commons Attribution 4.0 License.

https://izvestswsu.elpub.ru/jour/article/view/869

Цель исследования

Цель исследования. Эмоции играют одну из ключевых ролей в регуляции поведения человека. Решение задачи автоматического распознавания эмоций позволяет повысить эффективность функционирования целого ряда цифровых систем: систем обеспечения безопасности, человеко-машинных интерфейсов, систем электронной коммерции и т.д. При этом отмечается низкая эффективность современных подходов распознавания эмоций в речи. Данная работа посвящена исследованию автоматического распознавания эмоций в речи с помощью методов машинного обучения.

Методы

Методы. В статье описан и протестирован подход к автоматическому распознаванию эмоций в речи на основе многозадачного обучения глубоких сверточных нейронных сетей архитектур AlexNet и VGG с применением автоматического подбора коэффициентов весов каждой задачи при вычислении итогового значения потери в процессе обучения. Все модели были обучены на выборке набора данных IEMOCAP с четырьмя эмоциональными категориями «гнев», «счастье», «нейтральная эмоция», «грусть». В качестве входных данных используются обработанные специализированным алгоритмом лог-мел спектрограммы высказываний.

Результаты

Результаты. Рассмотренные модели были протестированы на основе численных метрик: доля верно распознанных экземпляров, точность, полнота, f-мера. По всем вышеперечисленным метрикам получено улучшение качества распознавания эмоций предлагаемой моделью по сравнению с двумя базовыми однозадачными моделями, а также с известными решениями. Это достигается благодаря применению автоматического взвешивания значений функций потерь от отдельных задач при формировании итогового значения ошибки в процессе обучения.

Заключение

Заключение. Полученное улучшение качества распознавания эмоций по сравнению с известными решениями подтверждает целесообразность применения концепции многозадачного обучения для увеличения точности моделей распознавания эмоций. Разработанный подход позволяет достичь равномерного и одновременного снижения ошибок отдельных задач и используется в области распознавания эмоций в речи впервые.

Purpose of research

Purpose of research. Emotions play a key role in regulating human behaviour. Automatic emotion recognition can make a whole range of digital systems more effective, including security systems, human-machine interfaces, and e-commerce systems. At the same time, current approaches to recognizing emotions in speech remain insufficiently effective. This work studies automatic recognition of emotions in speech using machine learning methods.

Methods

Methods. The article describes and tests an approach to automatic recognition of emotions in speech based on multitask learning of deep convolutional neural networks with the AlexNet and VGG architectures, in which the weight of each task in the total loss is selected automatically during training. All models were trained on a subset of the IEMOCAP dataset with four emotion categories: 'anger', 'happiness', 'neutral', and 'sadness'. The input data are log-mel spectrograms of utterances processed by a specialized algorithm.
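The abstract does not say which automatic weighting scheme is used, so the following is only a minimal sketch under a stated assumption: a simplified form of homoscedastic-uncertainty weighting, where each task i gets a learnable log-variance s_i and the total loss is the sum of exp(-s_i) * L_i + s_i. The function names and the loss values are hypothetical and chosen for illustration only.

```python
import math

def weighted_total_loss(task_losses, log_vars):
    """Total loss sum_i exp(-s_i) * L_i + s_i (assumed weighting form)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

def log_var_gradients(task_losses, log_vars):
    """Analytic gradient of the total loss w.r.t. each s_i: 1 - exp(-s_i) * L_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(task_losses, log_vars)]

def fit_log_vars(task_losses, steps=2000, lr=0.05):
    """Plain gradient descent on the log-variances while the losses are held fixed."""
    log_vars = [0.0] * len(task_losses)
    for _ in range(steps):
        grads = log_var_gradients(task_losses, log_vars)
        log_vars = [s - lr * g for s, g in zip(log_vars, grads)]
    return log_vars

# Hypothetical per-task losses, e.g. a main emotion task and an auxiliary task.
losses = [1.4, 0.3]
log_vars = fit_log_vars(losses)
weights = [math.exp(-s) for s in log_vars]
print(weights, weighted_total_loss(losses, log_vars))
```

With the losses held fixed, each weight settles near 1 / L_i, so the weighted contributions of the tasks become equal. In real training the losses change at every step and the log-variances would be learned jointly with the network parameters; the fixed values here only illustrate the balancing effect.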

Results

Results. The models were evaluated using numerical metrics: accuracy (the share of correctly recognized instances), precision, recall, and F-measure. On all of these metrics, the proposed model improves the quality of emotion recognition compared with the two baseline single-task models as well as with known solutions. This result is achieved by automatically weighting the loss values of the individual tasks when forming the total loss during training.
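To make the metrics named above concrete (accuracy, precision, recall, and F-measure in standard terminology), here is a small sketch that computes them for the four emotion categories. The abstract does not state the averaging mode, so macro averaging is an assumption; the function name and the toy labels are hypothetical and are not results from the article.

```python
from collections import Counter

LABELS = ["anger", "happiness", "neutral", "sadness"]

def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy, macro (precision, recall, F1), and per-class values."""
    pairs = Counter(zip(y_true, y_pred))  # counts of (true, predicted) pairs
    per_class = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(t, label)] for t in labels if t != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)
    accuracy = sum(pairs[(l, l)] for l in labels) / len(y_true)
    # Macro average: unweighted mean of the per-class values (assumed mode).
    macro = tuple(sum(v[i] for v in per_class.values()) / len(labels) for i in range(3))
    return accuracy, macro, per_class

# Toy labels for illustration only.
y_true = ["anger", "anger", "happiness", "neutral", "neutral", "sadness", "sadness", "sadness"]
y_pred = ["anger", "neutral", "happiness", "neutral", "neutral", "sadness", "sadness", "anger"]
accuracy, macro, per_class = classification_metrics(y_true, y_pred)
print(accuracy, macro)
```

Macro averaging gives every class equal weight regardless of how many utterances it has, which matters when the emotion classes are imbalanced.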

Conclusion

Conclusion. The improvement in the quality of emotion recognition over known solutions confirms the feasibility of applying multitask learning to increase the accuracy of emotion recognition models. The developed approach achieves a uniform and simultaneous reduction of the errors of the individual tasks and is applied to emotion recognition in speech for the first time.

многозадачное обучение; сверточные нейронные сети; речевые технологии; автоматическое распознавание эмоций; анализ аудиосигналов речи

multitask learning; convolutional neural networks; speech technologies; automatic emotion recognition; analysis of speech audio signals

Работа выполнена при поддержке РФФИ (18-29-22061_мк).

The work was supported by the Russian Foundation for Basic Research (18-29-22061_mk).

The authors declare that there are no conflicts of interest present.