Multimodal messages are ubiquitous. On social media platforms such as TikTok, 25 million videos are shared daily (Ch 2024). These videos often present moving images, audio, text overlays, and narrating voices simultaneously. Yet past research falls short in investigating how multiple modes (i.e., text, image, audio) jointly affect consumer behavior (Grewal et al. 2022; Packard and Berger 2024), particularly in the context of sustainability. Indeed, research in this context focuses mostly on text-based communication (e.g., Kronrod et al. 2023; Olsen, Slotegraaf, and Chandukala 2014) and neglects the interplay of multiple modes. Similarly, marketing research on videos has explored the impact of individual modes on consumer responses, focusing, for example, only on text (e.g., Cascio Rizzo et al. 2023) or auditory features (e.g., Chang, Mukherjee, and Chattopadhyay 2023). Little research investigates the combined use of different modes and its effect on consumers’ attitudes and behavior (Holiday et al. 2023; Zhou et al. 2021). In this paper, we aim to fill this gap by investigating how the combination of visual, textual, and audio stimuli affects consumer engagement with social media videos.