STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Introduction
STAR (Spatial-Temporal Augmentation with Text-to-Video Models) is a real-world video super-resolution framework jointly developed by Nanjing University, ByteDance, and Southwest University. It is the first framework to integrate powerful text-to-video diffusion priors into real-world video super-resolution, addressing the challenges traditional methods face when processing real-world videos.
Core Features
- 🌟 Innovative Spatio-Temporal Quality Enhancement Framework: Specifically designed for real-world video super-resolution
- 🎯 Powerful Text-to-Video Model Integration: Leveraging T2V models for video quality enhancement
- 🔄 Excellent Temporal Consistency: Effectively maintaining coherence between video frames
- 🖼️ Realistic Spatial Details: Generating high-quality, detail-rich video frames
- 🛠️ Practical Open-Source Implementation: Providing complete code and pre-trained models
Technical Principles
The STAR framework consists of four main modules:
- VAE Encoder: encodes the input video
- Text Encoder: encodes the text prompt
- ControlNet: controls the generation process
- T2V Model with a Local Information Enhancement Module (LIEM): the LIEM is specifically designed to reduce artifacts

Training is additionally guided by a Dynamic Frequency (DF) Loss, which adaptively adjusts the constraints on high- and low-frequency components. Together, these components achieve high spatio-temporal quality, reduced artifacts, and enhanced fidelity.
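To make the frequency-weighting idea concrete, here is a minimal mathematical sketch, not the paper's exact formulation: assume low- and high-frequency reconstruction losses $\mathcal{L}_{\mathrm{low}}$ and $\mathcal{L}_{\mathrm{high}}$ (e.g., from a Fourier decomposition of the output) and a hypothetical weighting schedule $\alpha(t)$ over diffusion timestep $t$:

$$
\mathcal{L}_{\mathrm{DF}} = \alpha(t)\,\mathcal{L}_{\mathrm{low}} + \bigl(1 - \alpha(t)\bigr)\,\mathcal{L}_{\mathrm{high}}
$$

Under such a schedule, the model can be encouraged to settle low-frequency structure at earlier (noisier) steps and refine high-frequency detail later.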

Installation and Usage
Environment Setup
```bash
# Clone the repository
git clone https://github.com/NJU-PCALab/STAR.git
cd STAR

# Create and activate the environment
conda create -n star python=3.10
conda activate star
pip install -r requirements.txt

# Install system dependencies (note: sudo applies to both commands)
sudo apt-get update && sudo apt-get install -y ffmpeg libsm6 libxext6
```
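As a quick, optional sanity check (not part of the official setup instructions), you can confirm that ffmpeg is on the PATH and that PyTorch can see a GPU:

```bash
# Print the ffmpeg version line and check CUDA visibility from Python
ffmpeg -version | head -n 1
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
```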
Pre-trained Models
STAR offers two base model versions:
- I2VGen-XL Based Version
  - Light Degradation Model: suitable for videos with minor quality loss
  - Heavy Degradation Model: suitable for videos with severe quality loss
- CogVideoX-5B Based Version
  - Designed specifically for heavily degraded videos
  - Supports only 720x480 input resolution
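For step 1 below, the weights can be fetched with the Hugging Face CLI. This is a minimal sketch assuming the weights live in a single Hugging Face repository; the repo ID shown is illustrative, so use the one linked from the project README:

```bash
# Illustrative repo ID -- replace with the one from the project README
pip install -U "huggingface_hub[cli]"
huggingface-cli download SherryX/STAR --local-dir pretrained_weight/
```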
Usage Steps
1. Download Pre-trained Models
   - Download the model weights from HuggingFace (see the download example above)
   - Place the weight files in the `pretrained_weight/` directory
2. Prepare Test Data
   - Place test videos in the `input/video/` directory
   - Text prompts can be handled in three ways:
     - No prompt
     - Automatically generate prompts using Pllava
     - Manually write prompts (place them in `input/text/`)
3. Configure Paths
   Modify the following paths in `video_super_resolution/scripts/inference_sr.sh` (a configuration sketch follows these steps):
   - `video_folder_path`
   - `txt_file_path`
   - `model_path`
   - `save_dir`
4. Run Inference
   ```bash
   bash video_super_resolution/scripts/inference_sr.sh
   ```
   Note: If you encounter memory issues, set a smaller `frame_length` value in `inference_sr.sh`.
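A minimal configuration sketch for step 3, assuming the four paths are plain shell variables in `inference_sr.sh`; the variable names come from the script, but every value below is a placeholder:

```bash
# Placeholder values -- point these at your own data and weights
video_folder_path='input/video'                 # directory of test videos
txt_file_path='input/text/prompt.txt'           # hypothetical prompt file
model_path='pretrained_weight/light_deg.pt'     # hypothetical weight file
save_dir='results'                              # output directory
```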
Practical Effects
STAR demonstrates significant advantages in processing real-world videos:
- Effectively enhances quality of low-resolution videos from platforms like Bilibili
- Significantly improves visual quality when processing heavily degraded videos
- Maintains good temporal coherence in generated videos
- High detail fidelity without over-smoothing effects
Summary
STAR provides a powerful solution for real-world video super-resolution. Through innovative architectural design and integration of advanced text-to-video models, it effectively handles video quality enhancement needs in various real-world scenarios. The project’s open-source nature also enables researchers and developers to conveniently use and improve this technology.