InternLM-XComposer-2.5-OmniLive: A Practical Guide to Multimodal AI Model
Model Introduction
InternLM-XComposer-2.5-OmniLive (IXC2.5-OL) is a multimodal large model developed by Shanghai AI Laboratory. It accepts images, videos, and audio as input, supports real-time streaming audio-visual interaction, and performs strongly on public benchmarks such as WenetSpeech, LibriSpeech, and MLVU (see the benchmark results below).
Key Features
- Multimodal Understanding: Supports various input formats including images, videos, and audio
- Real-time Interaction: Enables real-time audio-visual stream processing and human-computer interaction
- Open Source & Commercial: Released under Apache 2.0 license, suitable for commercial use
- Superior Performance: Achieves leading scores in multiple benchmark tests
Installation
Requirements
- Python >= 3.8
- PyTorch >= 1.12 (2.0+ recommended)
- CUDA >= 11.4 (for GPU users)
- flash-attention2 (for high-resolution processing)
Installation Steps
- Create and activate a virtual environment:
conda create -n xcomposer python=3.8 -y
conda activate xcomposer
- Install PyTorch:
pip3 install torch torchvision torchaudio
- Install dependencies:
pip install -r requirements.txt
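After installation, a quick environment check helps confirm the requirements above are met. This is a minimal sketch that only assumes PyTorch is importable:
import sys
import torch
# Report interpreter, PyTorch, and CUDA versions against the listed requirements
print(f'Python: {sys.version.split()[0]}')             # expect >= 3.8
print(f'PyTorch: {torch.__version__}')                 # expect >= 1.12 (2.0+ recommended)
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')       # expect >= 11.4
    print(f'GPU: {torch.cuda.get_device_name(0)}')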
Docker Installation
You can also quickly deploy using the official Docker image:
docker pull yhcao6/ixc2.5-ol:latest
Quick Start
Step 1: Clone Repository
git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer/InternLM-XComposer-2.5-OmniLive
Step 2: Download Model
huggingface-cli download internlm/internlm-xcomposer2d5-ol-7b \
--local-dir internlm-xcomposer2d5-ol-7b \
--local-dir-use-symlinks False \
--resume-download
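Alternatively, the same download can be done from Python with the huggingface_hub library. A minimal sketch; the local directory name simply mirrors the CLI command above:
from huggingface_hub import snapshot_download
# Download the checkpoint into a local folder
snapshot_download(
    repo_id='internlm/internlm-xcomposer2d5-ol-7b',
    local_dir='internlm-xcomposer2d5-ol-7b',
)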
Basic Usage
InternLM-XComposer-2.5-OmniLive can be used in several ways; the most common scenarios are covered below.
Method 1: Using Example Scripts
The repository provides example scripts for three scenarios:
Audio Model Inference
python examples/infer_audio.py
Base Model Inference
python examples/infer_llm_base.py
Memory-Enhanced Model Inference (merge the LoRA weights first, then run inference)
python examples/merge_lora.py
python examples/infer_llm_with_memory.py
Method 2: Using in Code
Audio Understanding Example
import os
os.environ['USE_HF'] = 'True'
import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
# Initialize model
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(
    model_type,
    torch.float16,
    model_id_or_path=model_id_or_path,
    model_kwargs={'device_map': 'cuda:0'}
)
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
# Speech recognition example
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
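The same model and template can be reused across several recordings in a loop. A small sketch; any audio path other than the bundled chinese.mp3 is a placeholder for your own file:
# Batch several recordings through the already-initialized model and template
audio_files = ['examples/audios/chinese.mp3', 'examples/audios/your_recording.wav']
for path in audio_files:
    response, _ = inference(model, template,
                            '<audio>Detect the language and recognize the speech.',
                            audios=path)
    print(f'{path}: {response}')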
Image Understanding Example
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# Initialize model and tokenizer
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    trust_remote_code=True
)
model.tokenizer = tokenizer
# Image analysis example
query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
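Video understanding in the upstream InternLM-XComposer-2.5 examples goes through the same chat() interface, with the video path passed in place of the image list. A sketch along those lines; the video path is a placeholder:
# Describe a video clip; frame sampling is handled inside the model
query = 'Here are some frames of a video. Describe this video in detail.'
video = ['examples/videos/demo.mp4']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, video, do_sample=False, num_beams=3, use_meta=True)
print(response)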
Performance Benchmarks
Speech Recognition Performance
Word error rate (WER, %; lower is better) on the WenetSpeech and LibriSpeech benchmarks:
| Method | LLM | WenetSpeech Test_Net | WenetSpeech Test_Meeting | LibriSpeech Dev_Clean | LibriSpeech Dev_Other | LibriSpeech Test_Clean | LibriSpeech Test_Other |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | Qwen2-1.5B | 9.0 | 9.2 | 2.5 | 5.7 | 2.6 | 5.8 |
Video Understanding Performance
Performance on the MLVU long video understanding benchmark (higher is better):
| Method | Params | Topic Rea. | Anomaly Recog. | Needle QA | Ego Rea. | Plot QA | Action Or. | Action Co. | M-Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | 7B | 84.1 | 68.5 | 76.6 | 60.8 | 75.1 | 57.1 | 41.3 | 66.2 |
Advanced Applications
Multi-turn Dialogue
# Multi-turn dialogue reuses the model, tokenizer, and image from the image example above
history = []
# First round
query = "What season was this photo taken in?"
response, history = model.chat(tokenizer, query, image, history=history)
print(response)
# Second round: history carries the previous exchange into the new query
query = "Can you tell which city this is?"
response, history = model.chat(tokenizer, query, image, history=history)
print(response)
Multimodal Mixed Input
# Multiple images are passed to chat() as a list of paths, as in the single-image example
# above ('image1.png' and 'image2.png' are placeholders)
query = 'Compare these two images'
images = ['examples/images/image1.png', 'examples/images/image2.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, images, do_sample=False, num_beams=3, use_meta=True)
print(response)
# Live video and audio streams are handled by the dedicated audio and memory modules
# rather than a single chat() call (see the example scripts under Method 1)
Best Practices
- Input Preprocessing
  - Recommended image size: 224x224 to 448x448
  - Video: 8-16 frames recommended
  - Audio sampling rate: 16 kHz
- Performance Optimization
  - Use half-precision (FP16/BF16) inference
  - Batch requests to improve throughput
  - Keep the context length no longer than needed
- Memory Management (a monitoring sketch follows this list)
  - The 7B model requires at least 16 GB of VRAM for half-precision inference
  - Clear the CUDA cache between large jobs
  - Use gradient checkpointing when fine-tuning
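A minimal sketch of the cache clearing and VRAM monitoring mentioned above (it assumes a CUDA GPU is present and does nothing otherwise):
import torch
def report_and_clear_gpu_memory():
    """Print current VRAM usage and release cached blocks back to the driver."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f'allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB')
    torch.cuda.empty_cache()
# Call between large jobs, e.g. after processing a long video
report_and_clear_gpu_memory()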
Common Issues
- Out of Memory
  - Reduce the batch size or input resolution; use gradient checkpointing when fine-tuning
  - Fall back to CPU inference if no suitable GPU is available (much slower)
- Slow Inference (a quantization sketch follows this list)
  - Confirm the model is actually running on the GPU
  - Streamline input data preprocessing
  - Consider a quantized version of the model
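For the quantized option, the standard transformers + bitsandbytes 4-bit path is sketched below. It is only a starting point, assuming the checkpoint's custom model code (trust_remote_code) accepts a quantization_config:
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
# Generic 4-bit loading recipe; model_dir='base' mirrors the earlier loading example
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    quantization_config=bnb_config,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    trust_remote_code=True,
)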