InternLM-XComposer-2.5-OmniLive: A Practical Guide to the Multimodal AI Model

Model Introduction

InternLM-XComposer-2.5-OmniLive is a multimodal large model developed by Shanghai AI Laboratory. It accepts images, videos, and audio as input, supports real-time streaming interaction, and offers strong multimodal understanding and generation capabilities, with competitive results on speech-recognition and video-understanding benchmarks (see the Performance Benchmarks section below).

Key Features

  • Multimodal Understanding: Supports various input formats including images, videos, and audio
  • Real-time Interaction: Enables real-time audio-visual stream processing and human-computer interaction
  • Open Source & Commercial: Released under Apache 2.0 license, suitable for commercial use
  • Superior Performance: Achieves leading scores in multiple benchmark tests

Installation

Requirements

  • Python >= 3.8
  • PyTorch >= 1.12 (2.0+ recommended)
  • CUDA >= 11.4 (for GPU users)
  • flash-attention2 (for high-resolution processing)
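
Before installing, you can sanity-check your environment against these requirements with a short Python snippet (a minimal sketch; nothing here is specific to InternLM-XComposer, and the flash_attn check only matters if you plan to use high-resolution processing):

    import importlib.util
    import sys

    import torch

    # Interpreter and framework versions (compare against the requirements above).
    print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")   # needs >= 3.8
    print(f"PyTorch: {torch.__version__}")                                # needs >= 1.12, 2.0+ recommended
    print(f"CUDA available: {torch.cuda.is_available()}")                 # for GPU users
    print(f"CUDA build version: {torch.version.cuda}")                    # needs >= 11.4
    # flash-attn is only required for high-resolution processing.
    print(f"flash_attn installed: {importlib.util.find_spec('flash_attn') is not None}")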

Installation Steps

  1. Create and activate a virtual environment:

    conda create -n xcomposer python=3.8 -y
    conda activate xcomposer

  2. Install PyTorch:

    pip3 install torch torchvision torchaudio

  3. Install dependencies (run from the repository root after cloning it in the Quick Start below):

    pip install -r requirements.txt

Docker Installation

You can also quickly deploy using the official Docker image:

   docker pull yhcao6/ixc2.5-ol:latest

Quick Start

Step 1: Clone Repository

   git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer/InternLM-XComposer-2.5-OmniLive

Step 2: Download Model

   huggingface-cli download internlm/internlm-xcomposer2d5-ol-7b \
  --local-dir internlm-xcomposer2d5-ol-7b \
  --local-dir-use-symlinks False \
  --resume-download
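
If you prefer downloading from Python instead of the CLI, the equivalent huggingface_hub call looks roughly like this (a sketch using the same repository ID and target directory as the command above):

    from huggingface_hub import snapshot_download

    # Download the full model repository into a local directory.
    snapshot_download(
        repo_id='internlm/internlm-xcomposer2d5-ol-7b',
        local_dir='internlm-xcomposer2d5-ol-7b',
    )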

Basic Usage

InternLM-XComposer-2.5-OmniLive offers multiple usage methods. Here we’ll introduce several common scenarios.

Method 1: Using Example Scripts

The repository includes example scripts for three scenarios:

Audio Model Inference

   python examples/infer_audio.py

Base Model Inference

   python examples/infer_llm_base.py

Memory-Enhanced Model Inference (run merge_lora.py first to merge the LoRA weights, then run inference with memory)

   python examples/merge_lora.py
python examples/infer_llm_with_memory.py

Method 2: Using in Code

Audio Understanding Example

   import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)

# Initialize model
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(
    model_type, 
    torch.float16, 
    model_id_or_path=model_id_or_path,
    model_kwargs={'device_map': 'cuda:0'}
)
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)

# Speech recognition example
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')

Image Understanding Example

   import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Initialize model and tokenizer
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    trust_remote_code=True
)
model.tokenizer = tokenizer

# Image analysis example
query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)

Performance Benchmarks

Speech Recognition Performance

Results on the WenetSpeech and LibriSpeech benchmarks (error rate in %; lower is better):

| Method | LLM | WenetSpeech Test_Net | WenetSpeech Test_Meeting | LibriSpeech Dev_Clean | LibriSpeech Dev_Other | LibriSpeech Test_Clean | LibriSpeech Test_Other |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | Qwen2-1.5B | 9.0 | 9.2 | 2.5 | 5.7 | 2.6 | 5.8 |

Video Understanding Performance

Performance on MLVU benchmark:

| Method | Params | Topic Rea. | Anomaly Recog. | Needle QA | Ego Rea. | Plot QA | Action Or. | Action Co. | M-Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | 7B | 84.1 | 68.5 | 76.6 | 60.8 | 75.1 | 57.1 | 41.3 | 66.2 |

Advanced Applications

Multi-turn Dialogue

   # Initialize dialogue (reuses model, tokenizer, and image from the image-understanding example above)
history = []

# First round
query = "What season was this photo taken in?"
response, history = model.chat(tokenizer, query, image, history=history)

# Second round
query = "Can you tell which city this is?"
response, history = model.chat(tokenizer, query, image, history=history)

Multimodal Mixed Input

   # Image + text mixed input: pass a list of image paths, as in the single-image example
image1 = 'examples/images/dubai.png'        # from the repository's examples
image2 = 'path/to/your_second_image.png'    # replace with one of your own images
query = "Compare these two images"
response, _ = model.chat(tokenizer, query, [image1, image2], do_sample=False, num_beams=3, use_meta=True)

# Video + audio mixed input: in this release, streaming video and audio are processed by the
# OmniLive pipeline (see the audio and memory-enhanced example scripts above) rather than by a
# single chat() call.

Best Practices

  1. Input Preprocessing

    • Recommended image size: 224x224 to 448x448
    • Video frame rate: 8-16 frames recommended
    • Audio sampling rate: 16kHz
  2. Performance Optimization

    • Use half-precision (FP16) inference
    • Batch processing for higher throughput
    • Set appropriate context length
  3. Memory Management

    • The 7B model requires ≥16GB VRAM for FP16 inference
    • Clear the CUDA cache between large requests (see the sketch after this list)
    • Use gradient checkpointing when fine-tuning (it reduces training memory, not inference memory)
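
The sketch below pulls several of these recommendations together: half-precision loading copied from the image-understanding example, an optional Pillow downscale to at most 448 px, and an explicit CUDA cache release afterwards. The query string and the resized-file name are placeholders of our own.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    torch.set_grad_enabled(False)  # inference only: skip autograd bookkeeping

    # Half-precision load, same as the image-understanding example above.
    model = AutoModel.from_pretrained(
        'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
        torch_dtype=torch.bfloat16, trust_remote_code=True
    ).cuda().eval().half()
    tokenizer = AutoTokenizer.from_pretrained(
        'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
        trust_remote_code=True
    )
    model.tokenizer = tokenizer

    # Optionally cap the image size at the upper end of the recommended 224-448 px range.
    img = Image.open('examples/images/dubai.png')
    img.thumbnail((448, 448))            # downscales in place, preserving aspect ratio
    img.save('dubai_448.png')

    with torch.autocast(device_type='cuda', dtype=torch.float16):
        response, _ = model.chat(tokenizer, 'Describe this image briefly.', ['dubai_448.png'],
                                 do_sample=False, num_beams=3, use_meta=True)
    print(response)

    torch.cuda.empty_cache()             # release cached blocks between large requests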

Common Issues

  1. Out of Memory

    • Reduce the batch size or input resolution; enable gradient checkpointing only if the OOM occurs during fine-tuning
    • As a last resort, fall back to CPU inference (much slower)
  2. Slow Inference

    • Confirm the model is actually running on the GPU in half precision
    • Reduce preprocessing cost (image resolution, number of video frames)
    • Consider a quantized variant if one is available

Resources

  • GitHub repository: https://github.com/InternLM/InternLM-XComposer
  • Model weights on Hugging Face: https://huggingface.co/internlm/internlm-xcomposer2d5-ol-7b