Ivy-VL Launch: 3B-Parameter Model Dominates Edge Visual AI, Surpassing Qwen/InternVL

AI Safeguard, in collaboration with CMU and Stanford University, has officially released Ivy-VL, a lightweight multimodal model that has garnered significant attention for its compact parameter size and exceptional performance.

Background

As artificial intelligence rapidly evolves, multimodal large language models (MLLMs) play crucial roles in computer vision, natural language processing, and cross-modal tasks. However, deploying these models on mobile and edge devices has remained a significant challenge due to hardware resource constraints and energy-efficiency requirements. Ivy-VL targets this gap: a compact, mobile-oriented multimodal model designed to deliver strong performance within those constraints.

Key Features

Ivy-VL offers several notable advantages:

  • Ultimate Lightweight Design: At only 3B parameters, Ivy-VL is significantly smaller than mainstream multimodal models in the 7B-to-tens-of-billions range, enabling efficient operation on resource-constrained devices such as AI glasses and smartphones
  • Superior Performance: Ranks first among open-source models under 4B parameters on the OpenCompass leaderboard, surpassing top edge-side models including Qwen2-VL-2B, InternVL2-2B, InternVL2.5-2B, and SmolVLM-Instruct
  • Low-Latency Response: The 3B parameter count enables real-time inference, balancing generation speed, energy efficiency, and accuracy
  • Robust Cross-modal Understanding: Excels in visual question answering, image description, and complex reasoning tasks

Technical Architecture

Ivy-VL builds on the following technical stack (a rough composition sketch follows the list):

  • Base Architecture: Built on LLaVA-One-Vision
  • Language Model: Utilizes Qwen/Qwen2.5-3B-Instruct
  • Visual Encoder: Implements google/siglip-so400m-patch14-384
  • Training Optimization: Specially optimized for performance and efficiency
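
For orientation, the sketch below is an illustration, not Ivy-VL's actual implementation: in a LLaVA-One-Vision-style design, the SigLIP encoder turns image patches into features, a small projector maps them into the language model's embedding space, and Qwen2.5-3B-Instruct then consumes the projected visual tokens alongside text tokens. The two-layer GELU projector is an assumed placeholder; Ivy-VL's real projector weights ship with the released checkpoint.

# Illustrative composition sketch only; load the real model with
# load_pretrained_model as shown in the Usage Guide below.
import torch.nn as nn
from transformers import AutoConfig

# Configs of the two published components named above (no weights downloaded).
vision_cfg = AutoConfig.from_pretrained("google/siglip-so400m-patch14-384").vision_config
llm_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Assumed placeholder projector: maps SigLIP patch features into the LLM's
# embedding space (LLaVA-style two-layer GELU MLP).
projector = nn.Sequential(
    nn.Linear(vision_cfg.hidden_size, llm_cfg.hidden_size),
    nn.GELU(),
    nn.Linear(llm_cfg.hidden_size, llm_cfg.hidden_size),
)

print(f"vision hidden size: {vision_cfg.hidden_size}, LLM hidden size: {llm_cfg.hidden_size}")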

Application Scenarios

The model is particularly suited for:

  • Smart Wearables: Enables real-time visual Q&A on AI glasses, enhancing AR experiences
  • Mobile AI Assistants: Delivers intelligent multimodal interaction capabilities for natural AI services
  • IoT Devices: Powers efficient multimodal data processing in smart home and IoT scenarios
  • Mobile Education & Entertainment: Enhances image understanding and interaction in educational software

Performance Evaluation

Ivy-VL demonstrates strong results across multiple public multimodal benchmarks:

  • Ranks first among open-source models under 4B parameters on the OpenCompass leaderboard
  • Surpasses leading edge-side SOTA models:
    • Qwen2-VL-2B
    • InternVL2-2B
    • InternVL2.5-2B
    • SmolVLM-Instruct
    • Aquila-VL-2B
    • PaliGemma 3B

Open Source & Access

The model is available on the Hugging Face platform (a minimal download sketch follows the list):

  • Complete open-source code and weights
  • Apache 2.0 license
  • Supports academic research and commercial use
  • Comprehensive deployment documentation
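
As a minimal sketch, assuming the huggingface_hub client is installed, the released repository can be pulled into the local cache with snapshot_download; "AI-Safeguard/Ivy-VL-llava" is the repo id used in the usage example below.

# Minimal sketch: download the full Ivy-VL repository (weights, config,
# tokenizer files) from the Hugging Face Hub and print its local path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("AI-Safeguard/Ivy-VL-llava")
print("Ivy-VL files downloaded to:", local_dir)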

Usage Guide

1. Installation

Install via pip:

   pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
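
As a quick sanity check after installation (an optional sketch, not part of the official instructions), confirm that the package imports cleanly; the builder module below is the same one used in the example that follows.

# Sanity check: if this import fails, revisit the pip install step above.
from llava.model.builder import load_pretrained_model
print("LLaVA-NeXT installed; builder module:", load_pretrained_model.__module__)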

2. Python Example

Basic usage example:

from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

# Load pretrained model
pretrained = "AI-Safeguard/Ivy-VL-llava"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

# Initialize model and tokenizer
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, 
    None, 
    model_name, 
    device_map=device_map
)

model.eval()

# Load image (URL or local)
# Option 1: Load from URL
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Option 2: Load from local
# image = Image.open("./local_image.jpg")

# Process image
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

# Set conversation template and question
conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nDescribe this image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(
    prompt_question, 
    tokenizer, 
    IMAGE_TOKEN_INDEX, 
    return_tensors="pt"
).unsqueeze(0).to(device)

image_sizes = [image.size]

response = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

# Decode output
output_text = tokenizer.batch_decode(response, skip_special_tokens=True)
print(output_text)

3. Key Parameters

# Model configuration parameters
model_args = {
    "device_map": "auto",    # Device mapping method
    "dtype": torch.float16,  # Model precision
    "max_length": 4096,      # Maximum token length
}

# Generation parameters
generate_args = {
    "do_sample": False,      # Whether to use sampling
    "temperature": 0,        # Generation temperature
    "max_new_tokens": 4096,  # Maximum new tokens
}
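
As a rough illustration, the generation dictionary above can be unpacked straight into model.generate, reusing input_ids, image_tensor, and image_sizes from the step 2 example:

# Equivalent to the generate call in step 2, with the settings kept in one dict.
response = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    **generate_args,
)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])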

4. Important Notes

  • Requires a CUDA environment (a quick environment-check sketch follows this list)
  • Recommended: GPU with 16GB+ VRAM
  • Supported image formats: jpg, png, webp
  • Recommended image resolution: 224x224 to 1024x1024
  • Supports batch processing of multiple images
  • Stable network required for model weight download

Developers can quickly start using the model through pip installation. For detailed instructions, please refer to the official documentation. Technical support is available from the development team.

Future Outlook

The launch of Ivy-VL marks a significant breakthrough in lightweight multimodal models for edge devices. AI Safeguard plans to continue optimization in:

  • Further enhancing video modality task performance
  • Exploring more industry application scenarios
  • Optimizing model deployment efficiency
  • Promoting mobile device AI application adoption