Hallo3: An Open-Source High-Dynamic Realistic Portrait Animation Model


Project Overview

Hallo3 (Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks) is a portrait image animation model developed by the Fudan Generative Vision Lab. Built on diffusion transformer networks, it takes a static portrait photo and an audio clip as input and generates a highly dynamic, realistic talking-head video, which makes it a practical foundation for digital avatar creation.

Typical application scenarios include:

  • Digital Avatar Creation: Quickly generate talking digital avatars from a single photo and audio input, suitable for virtual hosts and digital spokespersons
  • Education and Training: Transform static teaching materials into engaging video content, enhancing online education interactivity
  • Content Creation: Help creators produce talking-head videos faster and with less manual effort
  • Marketing Presentations: Provide personalized digital avatar solutions for brand and product presentations

Key Features

  1. High Dynamicity: The model generates highly dynamic and naturally fluid facial movements and expressions.

  2. Realism: The generated portrait animations feature high realism and detailed expression.

  3. Open Source: The project is fully open-source, available for researchers and developers to use and study.

Technical Implementation

Hallo3 is implemented based on the following key technologies:

  • Diffusion Transformer Networks architecture
  • Advanced animation generation strategies
  • High-quality portrait image animation support

System Requirements

  • OS: Ubuntu 20.04/Ubuntu 22.04
  • CUDA Version: 12.1
  • Tested GPU: H100

Pretrained Model Download

You can obtain the required pretrained models through:

  1. Using huggingface-cli (provided by the huggingface_hub package):
   cd $ProjectRootDir
   pip install "huggingface_hub[cli]"
   huggingface-cli download fudan-generative-ai/hallo3 --local-dir ./pretrained_models
  2. Or manually download from these sources:
  • hallo3: Main project checkpoints
  • CogVideoX: CogVideoX-5b-I2V pretrained model, including transformer and 3D VAE
  • t5-v1_1-xxl: Text encoder
  • audio_separator: Kim_Vocal_2 MDX-Net vocal separation model
  • wav2vec: Facebook’s audio-to-vector model
  • insightface: 2D and 3D face analysis models
  • face landmarker: Face detection and mesh model from mediapipe

Installation Steps

  1. Clone the repository:
   git clone https://github.com/fudan-generative-vision/hallo3
   cd hallo3
  2. Create and activate conda environment:
   conda create -n hallo python=3.10
   conda activate hallo
  3. Install dependencies:
   pip install -r requirements.txt
   apt-get install ffmpeg
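
After installation, it can help to confirm that the command-line tools the steps above rely on are actually on the PATH. The following is a minimal sketch, not part of the Hallo3 repository: the function name missing_cmds is hypothetical, and the list of tools to check is up to you.

```shell
# Report every command from the argument list that is not on PATH.
# Returns 0 when all commands are found, 1 otherwise.
# NOTE: missing_cmds is a hypothetical helper, not a Hallo3 script.
missing_cmds() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}
```

For example, to check the tools mentioned in this article plus the NVIDIA driver utility:

   missing_cmds git conda ffmpeg nvidia-smi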

Training Preparation

Data Preparation

Organize your raw videos in the following directory structure:

   dataset_name/
   |-- videos/
   |   |-- 0001.mp4
   |   |-- 0002.mp4
   |   `-- 0003.mp4
   |-- caption/
   |   |-- 0001.txt
   |   |-- 0002.txt
   |   `-- 0003.txt
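
Before preprocessing, a quick consistency check on this layout can catch missing captions early. The sketch below assumes, as the tree suggests, that each videos/NAME.mp4 is paired with caption/NAME.txt by file stem. The function name check_captions is hypothetical and not part of the Hallo3 scripts.

```shell
# For every videos/NAME.mp4 under a dataset root, require caption/NAME.txt.
# Prints one line per video without a caption; returns 0 if none are missing.
# ASSUMPTION: videos and captions are paired by file stem.
check_captions() {
    dataset=$1
    status=0
    for video in "$dataset"/videos/*.mp4; do
        [ -e "$video" ] || continue   # the glob matched nothing
        stem=$(basename "$video" .mp4)
        if [ ! -f "$dataset/caption/$stem.txt" ]; then
            echo "no caption for $stem"
            status=1
        fi
    done
    return "$status"
}
```

Usage:

   check_captions dataset_name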

Data Preprocessing

Process the videos using the following command:

   bash scripts/data_preprocess.sh {dataset_name} {parallelism} {rank} {output_name}
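
The article does not explain the {parallelism} and {rank} arguments. A plausible reading, stated here as an assumption, is that the dataset is split into {parallelism} shards and {rank} is the zero-based index of the shard a given worker handles. Under that assumption, the sketch below prints one command per worker as a dry run rather than executing anything. The function name preprocess_cmds is hypothetical.

```shell
# Print one preprocessing command per worker instead of running it.
# ASSUMPTION: {rank} is a zero-based worker index in [0, parallelism).
# Check scripts/data_preprocess.sh itself before relying on this.
preprocess_cmds() {
    dataset=$1
    parallelism=$2
    output=$3
    rank=0
    while [ "$rank" -lt "$parallelism" ]; do
        echo "bash scripts/data_preprocess.sh $dataset $parallelism $rank $output"
        rank=$((rank + 1))
    done
}
```

Usage, printing the commands for two workers so they can be reviewed before being launched:

   preprocess_cmds dataset_name 2 output_name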

Model Training

  1. Update the data path settings in configs/sft_s1.yaml and configs/sft_s2.yaml:
   # sft_s1.yaml
   train_data: [
       "./data/output_name.json"
   ]

   # sft_s2.yaml
   train_data: [
       "./data/output_name.json"
   ]
  2. Start training:
   # Stage 1 training
   bash scripts/finetune_multi_gpus_s1.sh

   # Stage 2 training
   bash scripts/finetune_multi_gpus_s2.sh

Inference

Requirements

Input data must meet the following conditions:

  1. The reference image must have a 1:1 or 3:2 aspect ratio
  2. The driving audio must be in WAV format
  3. The audio must be in English, because the training dataset contains only English speech
  4. Ensure clear vocals in the audio (background music is acceptable)
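
The first two conditions can be checked mechanically before a long inference run. The sketch below is illustrative only: the function names aspect_ok and is_wav are hypothetical, and it assumes that 3:2 means width:height, which the requirements do not state. The WAV check looks only at the RIFF/WAVE container header, not at the audio content.

```shell
# Succeeds when WIDTH:HEIGHT is exactly 1:1 or 3:2.
# ASSUMPTION: 3:2 is read as width:height.
aspect_ok() {
    width=$1
    height=$2
    [ "$width" -eq "$height" ] || [ $((2 * width)) -eq $((3 * height)) ]
}

# Succeeds when the file starts with a RIFF header whose form type is WAVE
# (bytes 0-3 are "RIFF", bytes 8-11 are "WAVE").
is_wav() {
    [ "$(head -c 4 "$1")" = "RIFF" ] &&
        [ "$(head -c 12 "$1" | tail -c 4)" = "WAVE" ]
}
```

The image dimensions can be read with ffprobe, which ships with ffmpeg; the command below prints them as width,height. The file names are placeholders.

   ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 reference.jpg
   aspect_ok 768 512 && is_wav driving.wav && echo "inputs look valid"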

Running Inference

Execute the following command for inference:

   bash scripts/inference_long_batch.sh ./examples/inference/input.txt ./output

Generated animation results will be saved in the ./output directory.
