Tech Explorer Logo

Search Content

Ivy-VL Launch: 3B Parameters Dominates Edge Visual AI, Surpassing Qwen/InternVL

4 min read
Cover image for Ivy-VL Launch: 3B Parameters Dominates Edge Visual AI, Surpassing Qwen/InternVL

AI Safeguard, in collaboration with CMU and Stanford University, has officially released Ivy-VL, a lightweight multimodal model that has garnered significant attention for its compact parameter size and exceptional performance.

Background

As artificial intelligence rapidly evolves, multimodal large language models (MLLMs) play crucial roles in computer vision, natural language processing, and multimodal tasks. However, deploying these models on mobile and edge devices has remained a significant challenge due to hardware resource constraints and energy efficiency requirements. In this context, Ivy-VL emerges as a new benchmark for mobile-oriented multimodal models with its outstanding performance.

Key Features

Ivy-VL offers several notable advantages:

  • Ultimate Lightweight Design: With only 3B parameters, significantly smaller than mainstream multimodal models of 7B to tens of billions, enabling efficient operation on resource-constrained devices like AI glasses and smartphones
  • Superior Performance: Achieves first place among open-source models under 4B on the OpenCompass leaderboard, surpassing top edge-side models including Qwen2-VL-2B, InternVL2-2B, InternVL2.5-2B, and SmolVLM-Instruct
  • Low Latency Response: 3B model size ensures real-time inference capabilities, striking a perfect balance between generation speed, energy efficiency, and accuracy
  • Robust Cross-modal Understanding: Excels in visual question answering, image description, and complex reasoning tasks

Technical Architecture

Ivy-VL employs advanced technical solutions:

  • Base Architecture: Built on LLaVA-One-Vision
  • Language Model: Utilizes Qwen/Qwen2.5-3B-Instruct
  • Visual Encoder: Implements google/siglip-so400m-patch14-384
  • Training Optimization: Specially optimized for performance and efficiency

Application Scenarios

The model is particularly suited for:

  • Smart Wearables: Enables real-time visual Q&A on AI glasses, enhancing AR experiences
  • Mobile AI Assistants: Delivers intelligent multimodal interaction capabilities for natural AI services
  • IoT Devices: Powers efficient multimodal data processing in smart home and IoT scenarios
  • Mobile Education & Entertainment: Enhances image understanding and interaction in educational software

Performance Evaluation

Ivy-VL demonstrates exceptional performance in multiple authoritative evaluations:

  • Ranks first among open-source models under 4B on OpenCompass
  • Surpasses leading edge-side SOTA models:
    • Qwen2-VL-2B
    • InternVL2-2B
    • InternVL2.5-2B
    • SmolVLM-Instruct
    • Aquila-VL-2B
    • PaliGemma 3B

Open Source & Access

Available on Hugging Face platform:

  • Complete open-source code and weights
  • Apache 2.0 license
  • Supports academic research and commercial use
  • Comprehensive deployment documentation

Usage Guide

1. Installation

Install via pip:

   pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

2. Python Example

Basic usage example:

   from llava.model.builder import load_pretrained_model
from llava.mm_utils import process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

# Load pretrained model
pretrained = "AI-Safeguard/Ivy-VL-llava"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"

# Initialize model and tokenizer
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, 
    None, 
    model_name, 
    device_map=device_map
)

model.eval()

# Load image (URL or local)
# Option 1: Load from URL
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Option 2: Load from local
# image = Image.open("./local_image.jpg")

# Process image
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

# Set conversation template and question
conv_template = "qwen_1_5"
question = DEFAULT_IMAGE_TOKEN + "\nDescribe this image"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

# Generate response
input_ids = tokenizer_image_token(
    prompt_question, 
    tokenizer, 
    IMAGE_TOKEN_INDEX, 
    return_tensors="pt"
).unsqueeze(0).to(device)

image_sizes = [image.size]

response = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)

# Decode output
output_text = tokenizer.batch_decode(response, skip_special_tokens=True)
print(output_text)

3. Key Parameters

   # Model configuration parameters
model_args = {
    "device_map": "auto",    # Device mapping method
    "dtype": torch.float16,  # Model precision
    "max_length": 4096,      # Maximum token length
}

# Generation parameters
generate_args = {
    "do_sample": False,      # Whether to use sampling
    "temperature": 0,        # Generation temperature
    "max_new_tokens": 4096,  # Maximum new tokens
}

4. Important Notes

  • Requires CUDA environment
  • Recommended: GPU with 16GB+ VRAM
  • Supported image formats: jpg, png, webp
  • Recommended image resolution: 224x224 to 1024x1024
  • Supports batch processing of multiple images
  • Stable network required for model weight download

Developers can quickly start using the model through pip installation. For detailed instructions, please refer to the official documentation. Technical support is available from the development team.

Future Outlook

The launch of Ivy-VL marks a significant breakthrough in lightweight multimodal models for edge devices. AI Safeguard plans to continue optimization in:

  • Further enhancing video modality task performance
  • Exploring more industry application scenarios
  • Optimizing model deployment efficiency
  • Promoting mobile device AI application adoption
Share

More Articles