InternLM-XComposer-2.5-OmniLive (浦语·灵笔2.5) Tutorial: A Multimodal AI Model from First Steps to Practice | Image/Video/Audio Processing

Dec 20, 2024 6 min read

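Model Overview

InternLM-XComposer-2.5-OmniLive is a new-generation multimodal large model developed by Shanghai AI Laboratory. It accepts image, video, and audio input and supports both multimodal understanding and generation. Benchmark results for speech recognition (WenetSpeech, LibriSpeech) and video understanding (MLVU) are listed in the Performance Evaluation section.

Key Features

Multimodal understanding: accepts image, video, and audio input
Real-time interaction: processes live audio and video streams for human-computer interaction
Open source, commercial use allowed: released under the Apache 2.0 license
Benchmark performance: for example, an M-Avg of 66.2 on MLVU (see Performance Evaluation)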

Installation and Setup

Requirements

Python >= 3.8
PyTorch >= 1.12 (2.0+ recommended)
CUDA >= 11.4 (for GPU users)
flash-attention2 (needed for high-resolution processing)
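The version minimums above can be checked from Python before installing anything else. The following is a minimal sketch: `parse_version`, `meets_minimum`, and `environment_report` are names made up for this example, and the check treats a missing PyTorch or a CPU-only build as "not satisfied" rather than as an error.

```python
import re
import sys


def parse_version(text):
    """Return the leading numeric components of a version string as ints.

    Local build tags such as "+cu118" are ignored.
    """
    numbers = []
    for part in text.split("+")[0].split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        numbers.append(int(match.group()))
    return numbers


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum` (component-wise comparison)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += [0] * (width - len(a))  # pad so that "1.12" equals "1.12.0"
    b += [0] * (width - len(b))
    return a >= b


def environment_report():
    """Check Python, PyTorch, and CUDA against the minimums listed above."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.8")}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "1.12")
    cuda_version = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = bool(cuda_version) and meets_minimum(cuda_version, "11.4")
    return report
```

Calling `environment_report()` returns a dictionary such as `{"python": True, "torch": True, "cuda": True}`; any `False` entry points at the requirement that still needs attention.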

Installation Steps

Create and activate a virtual environment:

conda create -n xcomposer python=3.8 -y
conda activate xcomposer

Install PyTorch:

pip3 install torch torchvision torchaudio

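Install the dependencies (run this from the cloned repository directory that contains requirements.txt; see Quick Start):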

pip install -r requirements.txt

Docker Installation

Alternatively, pull the official Docker image for a quick deployment:

docker pull yhcao6/ixc2.5-ol:latest

Quick Start

Step 1: Clone the repository

git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer/InternLM-XComposer-2.5-OmniLive

Step 2: Download the model

huggingface-cli download internlm/internlm-xcomposer2d5-ol-7b \
  --local-dir internlm-xcomposer2d5-ol-7b \
  --local-dir-use-symlinks False \
  --resume-download
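The same download can also be started from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed (it is the package that provides the CLI used above); `download_model` is a helper name made up for this example, and the repository ID and target directory are the ones from the command.

```python
from pathlib import Path

REPO_ID = "internlm/internlm-xcomposer2d5-ol-7b"
LOCAL_DIR = Path("internlm-xcomposer2d5-ol-7b")


def download_model(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download every file of the model repository into `local_dir`.

    The import is done lazily so that this module can be loaded even when
    huggingface_hub is not installed yet.
    """
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return Path(path)
```

Calling `download_model()` fetches the weights into the same directory the CLI command targets, so the rest of the tutorial works unchanged.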

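Basic Usage

InternLM-XComposer-2.5-OmniLive can be used in two ways: by running the bundled example scripts, or by calling the model from your own code.

Option 1: Run the example scripts

The repository provides example scripts for three scenarios:

Audio model inference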

python examples/infer_audio.py

Base model inference

python examples/infer_llm_base.py

Model inference with memory (run merge_lora.py first)

python examples/merge_lora.py
python examples/infer_llm_with_memory.py
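The three scenarios can also be driven from one small launcher, which makes the ordering of the memory scenario explicit. This is only a sketch: the `SCENARIOS` mapping, `build_commands`, and `run_scenario` are names invented here, and the script paths are the ones listed above, relative to the OmniLive directory.

```python
import subprocess
import sys

# Scripts per scenario, in the order they have to run.
SCENARIOS = {
    "audio": ["examples/infer_audio.py"],
    "base": ["examples/infer_llm_base.py"],
    "memory": ["examples/merge_lora.py", "examples/infer_llm_with_memory.py"],
}


def build_commands(scenario):
    """Return one command (as an argument list) per script of the scenario."""
    return [[sys.executable, script] for script in SCENARIOS[scenario]]


def run_scenario(scenario):
    """Run the scripts in order and stop at the first one that fails."""
    for command in build_commands(scenario):
        subprocess.run(command, check=True)
```

For example, `run_scenario("memory")` runs the LoRA merge and then the memory inference script, and raises `subprocess.CalledProcessError` if the merge step fails.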

Option 2: Call the model from code

Audio understanding example

import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)

# Initialize the model
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)

model, tokenizer = get_model_tokenizer(
    model_type, 
    torch.float16, 
    model_id_or_path=model_id_or_path,
    model_kwargs={'device_map': 'cuda:0'}
)
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)

# Speech recognition example (Chinese audio)
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')

Image understanding example

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Initialize the model and tokenizer
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained(
    'internlm/internlm-xcomposer2d5-ol-7b',
    model_dir='base',
    trust_remote_code=True
)
model.tokenizer = tokenizer

# Image analysis example
query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)

Performance Evaluation

Speech Recognition

Results on the WenetSpeech and LibriSpeech benchmarks (error rate in %, lower is better):

Method	LLM	WenetSpeech Test_Net	WenetSpeech Test_Meeting	LibriSpeech Dev_Clean	LibriSpeech Dev_Other	LibriSpeech Test_Clean	LibriSpeech Test_Other
IXC2.5-OL	Qwen2-1.5B	9.0	9.2	2.5	5.7	2.6	5.8
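Speech recognition scores of this kind are normally error rates: the edit distance between the reference transcript and the model output, divided by the reference length. The sketch below shows that computation, assuming word-level units for English (WER) and character-level units for Chinese (CER). It is not the evaluation script behind the table, and the function names and sample sentences are made up.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference_tokens, hypothesis_tokens):
    """Error rate in percent, relative to the reference length."""
    if not reference_tokens:
        raise ValueError("reference must not be empty")
    distance = edit_distance(reference_tokens, hypothesis_tokens)
    return 100.0 * distance / len(reference_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

With the reference "the cat sat on the mat" and the output "the cat sit on mat", there is one substitution and one deletion over six reference words, so `wer` returns about 33.3.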

Video Understanding

Results on the MLVU benchmark:

Method	Params	Topic Rea.	Anomaly Recog.	Needle QA	Ego Rea.	Plot QA	Action Or.	Action Co.	M-Avg
IXC2.5-OL	7B	84.1	68.5	76.6	60.8	75.1	57.1	41.3	66.2
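The M-Avg column is consistent with a plain arithmetic mean of the seven task scores. A quick check with the numbers from the table (the dictionary keys are just the column headers):

```python
from statistics import mean

# Per-task MLVU scores for IXC2.5-OL, copied from the table above.
mlvu_scores = {
    "Topic Rea.": 84.1,
    "Anomaly Recog.": 68.5,
    "Needle QA": 76.6,
    "Ego Rea.": 60.8,
    "Plot QA": 75.1,
    "Action Or.": 57.1,
    "Action Co.": 41.3,
}

m_avg = round(mean(mlvu_scores.values()), 1)
print(m_avg)  # 66.2
```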

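Advanced Usage

Multi-turn Dialogue

# Start with an empty conversation history
history = []

# First turn
query = "In which season was this photo taken?"
response, history = model.chat(tokenizer, query, image, history=history)

# Second turn: pass the returned history back in so the model keeps the context
query = "Can you tell which city it is?"
response, history = model.chat(tokenizer, query, image, history=history)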
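The pattern above generalizes to any number of turns: each call returns an updated history, which is passed into the next call. The sketch below wraps that in a loop. To keep it runnable without the real weights it uses `StubChatModel`, a hypothetical stand-in that only mimics the `(response, history)` return shape; storing history as a list of `(query, response)` pairs is likewise an assumption of this example.

```python
class StubChatModel:
    """Hypothetical stand-in for the real model, for illustration only."""

    def chat(self, tokenizer, query, image, history=None):
        history = list(history or [])  # copy so the caller's list is untouched
        response = f"answer #{len(history) + 1} to: {query}"
        history.append((query, response))
        return response, history


def run_dialogue(model, tokenizer, image, queries):
    """Ask the queries in order, threading the history through every turn."""
    history = []
    responses = []
    for query in queries:
        response, history = model.chat(tokenizer, query, image, history=history)
        responses.append(response)
    return responses, history


responses, history = run_dialogue(
    StubChatModel(),
    tokenizer=None,  # the stub ignores the tokenizer
    image=["examples/images/dubai.png"],
    queries=["In which season was this photo taken?",
             "Can you tell which city it is?"],
)
print(len(history))  # 2
```

With the real model, `run_dialogue(model, tokenizer, image, queries)` would be called with the objects created in the image understanding example.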

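Mixed Multimodal Input

# Image + text input
# image1 and image2 are placeholders for two image file paths
query = "Compare these two images: what is similar and what differs?"
response, _ = model.chat(tokenizer, query, [image1, image2])

# Video + audio input
# video_frames and audio_data are placeholders that the caller must prepare
response, _ = model.chat(tokenizer, query, video=video_frames, audio=audio_data)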
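The tutorial does not say how `video_frames` is produced. One common way to prepare such input is to sample a fixed number of frames evenly across the clip. The sketch below only computes which frame indices to keep; `sample_frame_indices` is a made-up helper, and this is not necessarily how the model's own pipeline selects frames.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a clip.

    Each index is the midpoint of one of `num_samples` equal segments.
    If the clip has fewer frames than requested, every frame is returned.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
```

The resulting indices can then be used to select frames from whatever video decoder is in use.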