A unified multimodal learning framework for sentiment analysis and mental health indicators from YouTube videos
Journal article, 2026

This study presents a multimodal deep learning framework designed to analyze sentiment patterns in YouTube videos and explore their association with early indicators of mental well-being. The approach integrates textual transcripts, vocal characteristics, and facial expressions into a unified representation to capture the emotional depth that individual modalities often miss. The model was trained and evaluated on a curated dataset of diverse YouTube content, and the fused architecture consistently outperformed unimodal baselines. Compared with text-only or audio-only systems, the multimodal model achieved higher accuracy and fewer misclassifications, particularly when speakers displayed subtle or mixed emotions. Vocal cues such as pitch variation, speaking rate, and stress patterns helped resolve emotional ambiguity, while visual features such as micro-expressions, gaze direction, and facial tension made sentiment shifts within the videos easier to track. Transformer-based fusion delivered the most stable performance, generalizing well across varied communication styles and recording conditions. Beyond classification outcomes, the study examined how specific multimodal patterns correlate with non-clinical markers of mental health. Consistent associations were observed between fluctuating sentiment trajectories and indicators such as emotional instability, sustained negative tone, and reduced expressive variability. Instances in which facial expressions contradicted verbal sentiment also proved relevant for identifying mild distress signals. These findings suggest that multimodal emotional cues can offer valuable insight into the affective state of content creators and may support research on digital well-being. The analysis also revealed challenges related to background noise, varying video quality, and inconsistent facial visibility, which affected the reliability of certain features. Despite these limitations, the study demonstrates that combining audio, visual, and textual information provides a more complete and reliable picture of sentiment expression on social media platforms. The proposed framework offers a foundation for future systems aimed at understanding online emotional behavior and contributes to ongoing discussions on the responsible use of machine learning in mental-health-oriented applications.
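To make the fusion idea in the abstract concrete, the following minimal PyTorch sketch shows one way pre-extracted text, audio, and facial features could be combined through a transformer encoder before sentiment classification. It is an illustrative assumption, not the authors' released pipeline (see the linked dataset repository for that): all feature dimensions, layer sizes, and the three-class output are hypothetical choices.

# Minimal sketch of transformer-based multimodal fusion (illustrative only).
# Assumes features have already been extracted per video: a transcript
# embedding, pooled prosodic features, and pooled facial-expression features.
import torch
import torch.nn as nn

class MultimodalFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, visual_dim=256,
                 d_model=256, n_heads=4, n_layers=2, n_classes=3):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj_text = nn.Linear(text_dim, d_model)
        self.proj_audio = nn.Linear(audio_dim, d_model)
        self.proj_visual = nn.Linear(visual_dim, d_model)
        # Learned modality-type embeddings so the encoder can tell tokens apart.
        self.modality_embed = nn.Parameter(torch.zeros(3, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_feat, audio_feat, visual_feat):
        # Each input: (batch, modality_dim). Stack into three modality tokens.
        tokens = torch.stack([
            self.proj_text(text_feat),
            self.proj_audio(audio_feat),
            self.proj_visual(visual_feat),
        ], dim=1) + self.modality_embed           # (batch, 3, d_model)
        fused = self.fusion(tokens)               # cross-modal self-attention
        pooled = fused.mean(dim=1)                # pool over modality tokens
        return self.classifier(pooled)            # sentiment logits

# Usage with random features; class count (e.g. negative/neutral/positive) is assumed.
model = MultimodalFusionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 128), torch.randn(8, 256))
print(logits.shape)  # torch.Size([8, 3])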

Extreme learning machine (ELM)

Natural language processing

Multimodal sentiment analysis

Deep learning

Multimodal fusion

Machine learning

Tokenization

Authors

Priyanshu Satapathy

Bharati Vidyapeeth's College of Engineering

Onushka Chauhan

Bharati Vidyapeeth's College of Engineering

Deepika Kumar

Bharati Vidyapeeth's College of Engineering

Preeti Sharma

Bharati Vidyapeeth's College of Engineering

Sumit Kumar Banshal

Alliance University

Oana Geman

University of Gothenburg

Chalmers, Computer Science and Engineering (Chalmers), Data Science and AI

Lucia Morosan-Danila

University of Suceava

Jude D. Hemanth

Karunya Institute of Technology and Sciences

Discover Mental Health

2731-4383 (eISSN)

Vol. 6 (1), 46

Subject Categories (SSIF 2025)

Natural Language Processing

DOI

10.1007/s44192-026-00388-6

PubMed

41706419

Related datasets

Multimodal-Sentiment-Analysis-Pipeline-for-YouTube-Videos [dataset]

URI: https://github.com/Preeti061204/Multimodal-Sentiment-Analysis-Pipeline-for-YouTube-Videos

More information

Latest update

4/20/2026