vision-xl

Abstract

Diffusion model-based inverse problem solvers (DIS) are now leading techniques for solving inverse problems. Recent extensions have adapted these models to solve video inverse problems using image diffusion models, but they face limitations such as restricted resolution (256x256) and dependency on additional pre-trained modules like optical flow estimators and task-specific restoration models. In response, we introduce a novel framework for solving high-definition video inverse problems using only latent diffusion models. For efficient high-resolution processing on a single GPU, we introduce a pseudo-batch consistent sampling strategy. Additionally, to enhance temporal consistency, we implement pseudo-batch inversion, an initialization method that incorporates informative latents from the measurement frames.

	Ours 😁
Scalability	Supports multiple aspect ratios and high-resolution reconstruction
Memory & sampling time efficiency	Requires 13 GB of VRAM for 25-frame videos, within 2.5 min.
Accessibility	Uses an open-source latent diffusion model (SDXL)

By integrating with SDXL, our framework achieves state-of-the-art video reconstruction across a wide range of spatio-temporal inverse problems, including complex combinations of frame averaging and various spatial degradations, such as deblurring, super-resolution, and inpainting.
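To make the "frame averaging plus spatial degradation" setting concrete, here is a minimal NumPy sketch of one such forward operator: 7-frame temporal averaging followed by x4 downsampling. The function names, the replicated-edge handling at the clip boundaries, and the use of average pooling for downsampling are illustrative assumptions, not details taken from our implementation.

```python
import numpy as np

def temporal_average(video, window=7):
    """Uniform temporal blur over `window` frames (edge frames replicated; an assumption)."""
    pad = window // 2
    padded = np.pad(video, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(video, dtype=np.float64)
    # Sum `window` time-shifted copies of the clip, then normalize.
    for k in range(window):
        out += padded[k:k + video.shape[0]]
    return out / window

def downsample(video, factor=4):
    """Average-pool each frame by `factor` (a stand-in for the SR degradation)."""
    t, h, w, c = video.shape
    assert h % factor == 0 and w % factor == 0
    return video.reshape(t, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

def forward_operator(video, window=7, factor=4):
    """Spatio-temporal degradation: temporal averaging, then spatial downsampling."""
    return downsample(temporal_average(video, window), factor)

# Usage: a 25-frame clip in (T, H, W, C) layout.
video = np.random.default_rng(0).random((25, 64, 96, 3))
measurement = forward_operator(video)
print(measurement.shape)
```

The measurement keeps all 25 frames but each one mixes information from its temporal neighbours, which is why the reconstruction has to be solved jointly over the whole clip rather than frame by frame.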

Method

Illustration of the intermediate sampling process in our method using only latent diffusion models. We use a pseudo-batch sampling strategy to obtain a Tweedie-denoised pseudo-batch, which keeps GPU usage efficient when running modern latent diffusion models. By applying multi-step conjugate gradient (CG) updates in pixel (decoded) space to the entire Tweedie-denoised batch, we optimize the video inverse problem. We then re-encode the optimized video into latent (encoded) space with scheduled low-pass filtering.
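The multi-step CG update can be sketched as a few conjugate gradient iterations on the normal equations, started from the decoded Tweedie estimate. This is a generic NumPy sketch under stated assumptions: `A` and `AT` are hypothetical callables for the forward operator and its adjoint, and the function name and iteration count are illustrative rather than taken from our code.

```python
import numpy as np

def cg_data_consistency(A, AT, y, x0, n_iter=5, eps=1e-12):
    """Run a few CG steps on AT(A(x)) = AT(y), starting from the denoised estimate x0."""
    x = x0.astype(np.float64).copy()
    r = AT(y) - AT(A(x))          # residual of the normal equations
    p = r.copy()                  # first search direction
    rs_old = np.vdot(r, r).real
    for _ in range(n_iter):
        if rs_old < eps:          # already consistent with the measurement
            break
        Ap = AT(A(p))
        alpha = rs_old / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Usage on a toy inpainting problem: the operator is a binary mask (self-adjoint).
rng = np.random.default_rng(0)
truth = rng.normal(size=(4, 8, 8))
mask = rng.random(size=truth.shape) > 0.5
A = lambda v: v * mask
x = cg_data_consistency(A, A, A(truth), np.zeros_like(truth))
```

Because the iterations start from the denoised batch and only move along directions in the range of the adjoint, components that the measurement does not constrain stay at their prior-driven values.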

Pseudo-batch Inversion (Initialization)

Pseudo-batch inversion initializes the sampling process with informative latents obtained by inverting the measurement frames. Compared with starting from an uninformative Gaussian prior, pseudo-batch inversion not only offers a good initialization for temporal consistency but also reduces the overall sampling time.
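As a rough illustration, the sketch below inverts encoded measurement frames with a deterministic DDIM-style update, a few frames at a time. It assumes that the inversion is DDIM-like and that a pseudo-batch is a small chunk of frames processed sequentially; `eps_model`, `alphas_cumprod`, and `batch_size` are hypothetical names, and the noise predictor is a placeholder rather than SDXL.

```python
import numpy as np

def ddim_invert(z0, eps_model, alphas_cumprod, batch_size=5):
    """Map clean latents (frames first) to noisy latents with deterministic DDIM-style steps."""
    z = z0.astype(np.float64).copy()
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        # Process frames chunk by chunk so only one pseudo-batch is in flight at a time.
        for start in range(0, z.shape[0], batch_size):
            chunk = z[start:start + batch_size]
            eps = eps_model(chunk, t)
            # Clean estimate implied by the current noise prediction.
            z0_hat = (chunk - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
            # Move one step toward higher noise along the same prediction.
            z[start:start + batch_size] = np.sqrt(a_next) * z0_hat + np.sqrt(1 - a_next) * eps
    return z

# Usage with a placeholder noise predictor (a real model would go here).
alphas_cumprod = np.linspace(1.0, 0.1, 10)   # index 0 = clean, last = noisiest
latents = np.random.default_rng(0).normal(size=(25, 4, 16, 16))
z_init = ddim_invert(latents, lambda z, t: 0.1 * z, alphas_cumprod)
```

Since every frame is inverted from its own measurement, neighbouring frames start from correlated latents instead of independent Gaussian noise, which is the intuition behind the improved temporal consistency.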

Sampling path evolution

Geometric illustration of the sampling path evolution. Pseudo-batch consistent sampling refines each latent, while decoded frame-dependent perturbation using multi-step CG ensures spatiotemporal data consistency.
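One iteration of this path can be summarized schematically as follows. The notation is introduced here for illustration only: $z_t$ is the latent at step $t$, $\epsilon_\theta$ the noise predictor, $\bar\alpha_t$ the cumulative noise schedule, $\mathcal{D}$ and $\mathcal{E}$ the decoder and encoder, $A$ the forward operator, $y$ the measurement, and $\mathrm{LPF}_t$ the scheduled low-pass filter. The final re-noising line is an assumption about how the refined latent is carried to the next step.

```latex
\begin{aligned}
\hat{z}_{0} &= \frac{z_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar\alpha_t}}
  && \text{(Tweedie denoising, per pseudo-batch)} \\
\hat{x}_{0} &= \mathcal{D}(\hat{z}_{0})
  && \text{(decode to pixel space)} \\
\bar{x}_{0} &= \mathrm{CG}\big(A^{\top}A,\; A^{\top}y,\; \hat{x}_{0}\big)
  && \text{(multi-step CG, initialized at } \hat{x}_{0}\text{)} \\
\bar{z}_{0} &= \mathcal{E}\big(\mathrm{LPF}_t(\bar{x}_{0})\big)
  && \text{(scheduled low-pass filtering, then re-encoding)} \\
z_{t-1} &= \sqrt{\bar\alpha_{t-1}}\,\bar{z}_{0} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon,
  \quad \epsilon \sim \mathcal{N}(0, I)
  && \text{(assumed re-noising step)}
\end{aligned}
```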

Additional visualizations

We provide visualizations of the experimental evaluations and ablation studies in video format. Please click the link.


VISION XL: High Definition Video Inverse Problem Solver using Latent Diffusion Models

Solving HD video inverse problems using only latent diffusion models.

Supporting a wide range of aspect ratios using SDXL.

Landscape

768x1280 (SDXL)

“7-frame averaged + Deblur (σ = 3.0)”

“7-frame averaged + SR (x4)”

“7-frame averaged + Inpaint (50%)”

“Deblur (σ = 3.0)”

“Inpaint (50%)”

Vertical

1280x768 (SDXL)

“7-frame averaged + Deblur (σ = 3.0)”

“7-frame averaged + SR (x4)”

“7-frame averaged + Inpaint (50%)”

“Deblur (σ = 3.0)”

“Inpaint (50%)”

Square

1024x1024 (SDXL)

“7-frame averaged + Deblur (σ = 3.0)”

“7-frame averaged + SR (x4)”

“7-frame averaged + Inpaint (50%)”

“Deblur (σ = 3.0)”

“Inpaint (50%)”
