Meet-In-Style: Text-driven Real-time Video Stylization using Diffusion Models

We present Meet-In-Style, a new approach to real-time stylization of live video streams using text prompts. In contrast to previous text-based techniques, our system stylizes input video at 30 fps on commodity graphics hardware while preserving the structural consistency of the stylized sequence and minimizing temporal flicker. The key idea of our approach is to combine diffusion-based image stylization with a few-shot patch-based training strategy that produces a custom image-to-image stylization network capable of real-time inference. This combination not only allows for fast stylization but also greatly improves the consistency of individual stylized frames compared to applying diffusion to each video frame separately. We conducted a number of user experiments and found our approach to be particularly useful in video conferencing scenarios, where participants can interactively apply different visual styles to themselves (or to each other) to enhance the overall chatting experience.
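
To make the patch-based part of this idea more concrete, the following is a minimal sketch of how aligned training patches might be drawn from a single keyframe and its diffusion-stylized counterpart. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_patch_pairs`, its parameters, and the uniform random sampling scheme are all hypothetical, and neither the diffusion model that would produce the stylized keyframe nor the image-to-image network trained on the patches is shown.

```python
import numpy as np


def sample_patch_pairs(frame, stylized, patch_size=32, count=16, seed=0):
    """Cut `count` random, spatially aligned patches from a keyframe and
    its stylized version (hypothetical helper, not from the paper)."""
    if frame.shape != stylized.shape:
        raise ValueError("frame and stylized keyframe must have the same shape")
    height, width = frame.shape[:2]
    if patch_size > height or patch_size > width:
        raise ValueError("patch_size exceeds the image dimensions")

    rng = np.random.default_rng(seed)
    # Top-left corners, chosen so every patch lies fully inside the image.
    ys = rng.integers(0, height - patch_size + 1, size=count)
    xs = rng.integers(0, width - patch_size + 1, size=count)

    # The same (y, x) window is used for both images, so each input patch
    # is paired with the stylized patch covering the same pixels.
    inputs = np.stack(
        [frame[y:y + patch_size, x:x + patch_size] for y, x in zip(ys, xs)]
    )
    targets = np.stack(
        [stylized[y:y + patch_size, x:x + patch_size] for y, x in zip(ys, xs)]
    )
    return inputs, targets
```

Under these assumptions, each `(inputs[i], targets[i])` pair could serve as one training example for a small image-to-image network, so a handful of stylized keyframes yields many examples.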

David Kunz

Ondřej Texler

David Mould

Daniel Sýkora