InternLM-XComposer-2.5-OmniLive: A Practical Guide to the Multimodal AI Model

Model Introduction

InternLM-XComposer-2.5-OmniLive is a multimodal large model developed by Shanghai AI Laboratory. It accepts images, videos, and audio as input, supports real-time streaming interaction, and offers strong multimodal understanding and generation capabilities, with competitive results on speech-recognition and video-understanding benchmarks (see the Performance Benchmarks section below).

Key Features

  • Multimodal Understanding: Supports various input formats including images, videos, and audio
  • Real-time Interaction: Enables real-time audio-visual stream processing and human-computer interaction
  • Open Source & Commercial: Released under Apache 2.0 license, suitable for commercial use
  • Superior Performance: Achieves leading scores in multiple benchmark tests

Installation

Requirements

  • Python >= 3.8
  • PyTorch >= 1.12 (2.0+ recommended)
  • CUDA >= 11.4 (for GPU users)
  • flash-attention2 (for high-resolution processing)
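
Before installing, you can sanity-check your environment against these requirements with a short Python snippet (a minimal sketch; nothing here is specific to InternLM-XComposer, and the flash_attn check only matters if you plan to use high-resolution processing):

    import importlib.util
    import sys

    import torch

    # Interpreter and framework versions (compare against the requirements above).
    print(f"Python: {sys.version_info.major}.{sys.version_info.minor}")   # needs >= 3.8
    print(f"PyTorch: {torch.__version__}")                                # needs >= 1.12, 2.0+ recommended
    print(f"CUDA available: {torch.cuda.is_available()}")                 # for GPU users
    print(f"CUDA build version: {torch.version.cuda}")                    # needs >= 11.4
    # flash-attn is only required for high-resolution processing.
    print(f"flash_attn installed: {importlib.util.find_spec('flash_attn') is not None}")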

Installation Steps

  1. Create and activate a virtual environment:

    conda create -n xcomposer python=3.8 -y
    conda activate xcomposer

  2. Install PyTorch:

    pip3 install torch torchvision torchaudio

  3. Install dependencies (run from the repository root after cloning it in the Quick Start below):

    pip install -r requirements.txt

Docker Installation

You can also quickly deploy using the official Docker image:

   docker pull yhcao6/ixc2.5-ol:latest

Quick Start

Step 1: Clone Repository

   git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer/InternLM-XComposer-2.5-OmniLive

Step 2: Download Model

   huggingface-cli download internlm/internlm-xcomposer2d5-ol-7b \
  --local-dir internlm-xcomposer2d5-ol-7b \
  --local-dir-use-symlinks False \
  --resume-download
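
If you prefer downloading from Python instead of the CLI, the equivalent huggingface_hub call looks roughly like this (a sketch using the same repository ID and target directory as the command above):

    from huggingface_hub import snapshot_download

    # Download the full model repository into a local directory.
    snapshot_download(
        repo_id='internlm/internlm-xcomposer2d5-ol-7b',
        local_dir='internlm-xcomposer2d5-ol-7b',
    )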

Basic Usage

InternLM-XComposer-2.5-OmniLive offers multiple usage methods. Here we’ll introduce several common scenarios.

Method 1: Using Example Scripts

The repository includes example scripts for three scenarios:

Audio Model Inference

   python examples/infer_audio.py

Base Model Inference

   python examples/infer_llm_base.py

Memory-Enhanced Model Inference (run merge_lora.py first to merge the LoRA weights, then run inference with memory)

   python examples/merge_lora.py
python examples/infer_llm_with_memory.py

Method 2: Using in Code

Audio Understanding Example

   import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)

# Initialize model
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(
    model_type, 
    torch.float16, 
    model_id_or_path=model_id_or_path,
    model_kwargs={'device_map': 'cuda:0'}
)
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)

# Speech recognition example
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')

Image Understanding Example

   import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Initialize model and tokenizer
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    trust_remote_code=True
)
model.tokenizer = tokenizer

# Image analysis example
query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)

Performance Benchmarks

Speech Recognition Performance

Results on the WenetSpeech and LibriSpeech benchmarks (error rate in %; lower is better):

| Method | LLM | WenetSpeech Test_Net | WenetSpeech Test_Meeting | LibriSpeech Dev_Clean | LibriSpeech Dev_Other | LibriSpeech Test_Clean | LibriSpeech Test_Other |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | Qwen2-1.5B | 9.0 | 9.2 | 2.5 | 5.7 | 2.6 | 5.8 |

Video Understanding Performance

Performance on MLVU benchmark:

| Method | Params | Topic Rea. | Anomaly Recog. | Needle QA | Ego Rea. | Plot QA | Action Or. | Action Co. | M-Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IXC2.5-OL | 7B | 84.1 | 68.5 | 76.6 | 60.8 | 75.1 | 57.1 | 41.3 | 66.2 |

Advanced Applications

Multi-turn Dialogue

   # Initialize dialogue (reuses model, tokenizer, and image from the image-understanding example above)
history = []

# First round
query = "What season was this photo taken in?"
response, history = model.chat(tokenizer, query, image, history=history)

# Second round
query = "Can you tell which city this is?"
response, history = model.chat(tokenizer, query, image, history=history)

Multimodal Mixed Input

   # Image + text mixed input: pass a list of image paths, as in the single-image example
image1 = 'examples/images/dubai.png'        # from the repository's examples
image2 = 'path/to/your_second_image.png'    # replace with one of your own images
query = "Compare these two images"
response, _ = model.chat(tokenizer, query, [image1, image2], do_sample=False, num_beams=3, use_meta=True)

# Video + audio mixed input: in this release, streaming video and audio are processed by the
# OmniLive pipeline (see the audio and memory-enhanced example scripts above) rather than by a
# single chat() call.

Best Practices

  1. Input Preprocessing

    • Recommended image size: 224x224 to 448x448
    • Video frame rate: 8-16 frames recommended
    • Audio sampling rate: 16kHz
  2. Performance Optimization

    • Use half-precision (FP16) inference
    • Batch processing for higher throughput
    • Set appropriate context length
  3. Memory Management

    • The 7B model requires ≥16GB VRAM for FP16 inference
    • Clear the CUDA cache between large requests (see the sketch after this list)
    • Use gradient checkpointing when fine-tuning (it reduces training memory, not inference memory)
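
The sketch below pulls several of these recommendations together: half-precision loading copied from the image-understanding example, an optional Pillow downscale to at most 448 px, and an explicit CUDA cache release afterwards. The query string and the resized-file name are placeholders of our own.

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    torch.set_grad_enabled(False)  # inference only: skip autograd bookkeeping

    # Half-precision load, same as the image-understanding example above.
    model = AutoModel.from_pretrained(
        'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
        torch_dtype=torch.bfloat16, trust_remote_code=True
    ).cuda().eval().half()
    tokenizer = AutoTokenizer.from_pretrained(
        'internlm/internlm-xcomposer2d5-ol-7b', model_dir='base',
        trust_remote_code=True
    )
    model.tokenizer = tokenizer

    # Optionally cap the image size at the upper end of the recommended 224-448 px range.
    img = Image.open('examples/images/dubai.png')
    img.thumbnail((448, 448))            # downscales in place, preserving aspect ratio
    img.save('dubai_448.png')

    with torch.autocast(device_type='cuda', dtype=torch.float16):
        response, _ = model.chat(tokenizer, 'Describe this image briefly.', ['dubai_448.png'],
                                 do_sample=False, num_beams=3, use_meta=True)
    print(response)

    torch.cuda.empty_cache()             # release cached blocks between large requests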

Common Issues

  1. Out of Memory

    • Reduce the batch size or input resolution; enable gradient checkpointing only if the OOM occurs during fine-tuning
    • As a last resort, fall back to CPU inference (much slower)
  2. Slow Inference

    • Confirm the model is actually running on the GPU in half precision
    • Reduce preprocessing cost (image resolution, number of video frames)
    • Consider a quantized variant if one is available

Resources

  • GitHub repository: https://github.com/InternLM/InternLM-XComposer
  • Model weights on Hugging Face: https://huggingface.co/internlm/internlm-xcomposer2d5-ol-7b