VISION XL: High Definition Video Inverse Problem Solver using Latent Diffusion Models

Code (Coming soon)

Solving HD video inverse problems using only latent diffusion models.

Supporting a wide range of aspect ratios using SDXL.

Landscape

 768x1280 (SDXL)

Vertical

 1280x768 (SDXL)

Square

 1024x1024 (SDXL)

Abstract

Diffusion model-based inverse problem solvers (DIS) are now leading techniques for solving inverse problems. Recent extensions have adapted these models to solve video inverse problems using image diffusion models, but they face limitations such as restricted resolution (256x256) and dependency on additional pre-trained modules like optical flow estimators and task-specific restoration models. In response, we introduce a novel framework for solving high-definition video inverse problems using only latent diffusion models. For efficient high-resolution processing on a single GPU, we introduce a pseudo-batch consistent sampling strategy. Additionally, to enhance temporal consistency, we implement batch-consistent inversion, an initialization method that incorporates informative latents from the measurement frame.
Ours 😁
Scalability: Supports multiple aspect ratios and high-resolution reconstructions.
Memory and sampling time efficiency: Requires 13GB VRAM for 25-frame videos, with sampling completed within 2.5 min.
Accessibility: Uses an open-source latent diffusion model (SDXL).
By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting.
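To make the kind of measurement model described above concrete, here is a minimal NumPy sketch of a combined spatio-temporal degradation: temporal frame averaging followed by spatial average pooling (a simple super-resolution forward model). The function names, the non-overlapping averaging window, and the pooling factor are illustrative assumptions, not the operators used in the paper.

```python
import numpy as np

def frame_average(video, window):
    """Temporal degradation (assumed form): mean over non-overlapping groups of frames."""
    t, h, w = video.shape
    assert t % window == 0, "frame count must be divisible by the window"
    return video.reshape(t // window, window, h, w).mean(axis=1)

def downsample(video, factor):
    """Spatial degradation (assumed form): average-pool every frame by `factor`."""
    t, h, w = video.shape
    assert h % factor == 0 and w % factor == 0
    return video.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def forward_operator(video, window=2, factor=4):
    """Hypothetical combined operator: frame averaging, then downsampling."""
    return downsample(frame_average(video, window), factor)

rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))        # toy grayscale clip: 8 frames of 16x16
measurement = forward_operator(video)  # what the solver would observe
print(measurement.shape)
```

A solver of the kind described here would try to recover the clip from the measurement alone, using the diffusion prior to fill in what the operator discards.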

Method


Illustration of the intermediate sampling process in our method, which uses only latent diffusion models. We sample the latents of each frame in parallel to obtain a Tweedie-denoised pseudo-batch. We then enforce data consistency for the video inverse problem by running multi-step conjugate gradient (CG) in pixel (decoded) space over the entire Tweedie-denoised batch. Finally, we re-encode the optimized video into latent (encoded) space with scheduled low-pass filtering.
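The multi-step CG update in pixel space can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the forward operator is a hypothetical non-overlapping frame average, the "denoised" video is a noisy stand-in for the decoded Tweedie estimate, and the VAE encode/decode and low-pass filtering steps are omitted. CG is run on the normal equations A^T A x = A^T y, starting from the denoised batch.

```python
import numpy as np

def frame_average(video, window):
    """Assumed forward operator A: mean over non-overlapping groups of frames."""
    t = video.shape[0]
    return video.reshape(t // window, window, *video.shape[1:]).mean(axis=1)

def frame_average_adjoint(meas, window):
    """Adjoint A^T of the operator above."""
    return np.repeat(meas, window, axis=0) / window

def conjugate_gradient(apply_AtA, b, x0, num_steps=5):
    """Run a few CG steps on AtA x = b, initialized at x0."""
    x = x0.copy()
    r = b - apply_AtA(x)
    p = r.copy()
    rs_old = float(np.sum(r * r))
    for _ in range(num_steps):
        if rs_old < 1e-20:  # residual already negligible
            break
        Ap = apply_AtA(p)
        alpha = rs_old / float(np.sum(p * Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = float(np.sum(r * r))
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
window = 5
truth = rng.random((25, 8, 8))                              # toy 25-frame clip
y = frame_average(truth, window)                            # measurement
denoised = truth + 0.1 * rng.standard_normal(truth.shape)   # stand-in for the decoded Tweedie batch

AtA = lambda v: frame_average_adjoint(frame_average(v, window), window)
refined = conjugate_gradient(AtA, frame_average_adjoint(y, window), denoised)

res_before = np.linalg.norm(frame_average(denoised, window) - y)
res_after = np.linalg.norm(frame_average(refined, window) - y)
print(res_after < res_before)
```

Because the operator couples frames, the update is applied to the whole batch at once. Starting CG from the denoised estimate keeps the result close to the diffusion prior while pulling it toward the measurement.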

Batch-consistent Inversion (Initialization)


Illustration of batch-consistent inversion, which initializes informative latents by inverting the measurement frame and replicating it. Rather than using an uninformative Gaussian prior, batch-consistent inversion not only offers a good initialization for temporal consistency but also reduces the overall sampling time.
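The idea of inverting one measurement latent and replicating it can be sketched as below. This is a minimal illustration that assumes a DDIM-style deterministic inversion. The noise predictor `toy_eps_fn`, the schedule `alpha_bars`, and the latent shape are hypothetical placeholders rather than SDXL components.

```python
import numpy as np

def ddim_invert(latent, eps_fn, alpha_bars):
    """Deterministic DDIM-style inversion from a clean latent to a noisy one.

    alpha_bars: decreasing cumulative signal levels, with alpha_bars[0] near 1.
    """
    z = latent
    for t in range(len(alpha_bars) - 1):
        a_t, a_next = alpha_bars[t], alpha_bars[t + 1]
        eps = eps_fn(z, t)
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        z = np.sqrt(a_next) * z0_hat + np.sqrt(1.0 - a_next) * eps
    return z

def batch_consistent_init(measurement_latent, eps_fn, alpha_bars, num_frames):
    """Invert a single measurement latent and replicate it for every frame."""
    z_T = ddim_invert(measurement_latent, eps_fn, alpha_bars)
    return np.repeat(z_T[None], num_frames, axis=0)

def toy_eps_fn(z, t):
    # Hypothetical stand-in for a trained noise predictor.
    return np.tanh(z)

alpha_bars = np.linspace(0.999, 0.05, 20)              # assumed toy schedule
rng = np.random.default_rng(0)
measurement_latent = rng.standard_normal((4, 16, 16))  # assumed latent shape
init = batch_consistent_init(measurement_latent, toy_eps_fn, alpha_bars, num_frames=25)
print(init.shape)
```

Every frame starts from the same latent, so the frames share their initial structure. That is the intuition behind using such an initialization to encourage temporal consistency.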

Sampling path evolution


Geometric illustration of the sampling path evolution. Parallel sampling refines each latent, while decoded frame-dependent perturbation using multi-step CG ensures spatiotemporal data consistency.

Additional visualizations

We provide visualizations of the experimental evaluations and ablation studies in video format. Please click the link.