ModernBERT: A New Star in Natural Language Processing
Introduction
In the long evolution of Natural Language Processing (NLP), model iteration and innovation have always been the core force driving the field forward. BERT, released in 2018, is undoubtedly a major milestone: its strong performance across a wide range of NLP tasks laid a solid foundation for subsequent research. But technology never stands still, and six years later ModernBERT has emerged as a bright new star in the NLP firmament, injecting fresh vitality and new possibilities into this vibrant field.
As an outstanding representative of encoder-only models, the ModernBERT family includes a base version with 139M parameters and a large version with 395M parameters, a scale that delivers strong performance while remaining flexible and easy to deploy. Compared with BERT and similar models, ModernBERT achieves a clear Pareto improvement in both speed and accuracy, and this advantage shows up in practice. In information retrieval, when processing large-scale text data it can quickly surface the passages most relevant to a user query, responding noticeably faster than traditional models and cutting the time users spend finding information. In text classification, whether labeling news topics or analyzing sentiment in social media posts, it makes precise judgments with higher accuracy, giving enterprises and research institutions more reliable data insights.
On the transformers side, installation from the main branch is required until v4.48.0 is released. As a masked language model, ModernBERT can be loaded through the convenient fill-mask pipeline or via AutoModelForMaskedLM. For downstream tasks, fine-tuning it and enabling Flash Attention 2 on supported GPUs unlocks the model's full potential and greatly improves processing efficiency, letting it run efficiently in real business workflows and meet enterprises' dual requirements of real-time performance and accuracy.
Model Comparison
| Comparison Dimension | Decoder Models (e.g., GPT, Llama, Claude) | Encoder Models (e.g., BERT) | ModernBERT |
| --- | --- | --- | --- |
| Model Size | Usually large; Llama reaches up to 405B parameters | Relatively small; BERT has some scale but is smaller than many decoder models | Compact: 139M (base) and 395M (large) parameters |
| Speed | Slower; API responses often take several seconds | Faster; performs well in retrieval and classification tasks | On an NVIDIA RTX 4090, faster than other models on variable-length inputs and 2-3x faster than the next-fastest model on long-context inputs |
| Privacy & Cost | Privacy is hard to guarantee; high cost and complex to replicate | Lower cost, with an inference-cost advantage | Highly cost-effective; runs well on small, inexpensive GPUs |
| Application Scenarios | Art generation, interactive chat, etc. | Retrieval, classification, entity extraction, etc. | Excels at retrieval, natural language understanding, and code retrieval, covering traditional NLU scenarios as well as the code domain |
Standard Academic Benchmark Tests
In standard academic benchmarks, ModernBERT posts impressive results. It outperforms DeBERTaV3 on GLUE while using less memory and running faster, and its 8,192-token context length far exceeds that of existing encoders, with particularly strong gains on code retrieval tasks.
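To make the 8,192-token context window concrete, the sketch below encodes a long input in a single forward pass instead of chunking it into 512-token windows. It is a minimal illustration only; the repeated sentence merely stands in for a real long document.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Stand-in for a long document (a scientific article, a source file, ...)
long_document = " ".join(["Encoders with long context simplify retrieval pipelines."] * 800)

inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=8192)
print(inputs["input_ids"].shape)  # up to (1, 8192) in a single pass

with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)       # (1, sequence_length, vocab_size)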
Model Architecture Improvements
ModernBERT’s architectural improvements are substantial. It adopts rotary position embeddings (RoPE) in place of the old absolute position encodings, replaces the old MLP layers with GeGLU layers, removes unnecessary bias terms, and adds normalization layers. For efficiency, alternating attention (global attention every third layer, sliding-window local attention elsewhere), unpadding combined with sequence packing, and hardware-aware design all play key roles. The training data, 2 trillion tokens drawn from web documents, code, scientific articles, and more, is far more diverse, and the training objectives and procedure have been optimized as well.
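To give a feel for the RoPE idea, here is a minimal, self-contained sketch of rotating a vector by position-dependent angles in the common "rotate-half" formulation. It illustrates the mechanism only and is not ModernBERT's internal implementation; the head dimension and theta value are illustrative defaults.
import torch

def rotate_half(x):
    # Split the last dimension in half and swap the halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, theta=10000.0):
    # x: (seq_len, head_dim); positions: (seq_len,)
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq_len, head_dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)     # (seq_len, head_dim)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin

q = torch.randn(8, 64)              # 8 positions, head dimension 64
q_rot = apply_rope(q, torch.arange(8))
print(q_rot.shape)                  # torch.Size([8, 64])
Because the rotation angle depends only on position, the dot product between rotated queries and keys becomes a function of their relative offset, which is what lets the context window grow gracefully.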
In conclusion, ModernBERT demonstrates strong potential with its many advantages, and we look forward to more breakthroughs and innovative applications it will bring to the NLP field.
Quick Start
Before the official release of transformers v4.48.0, installation from the main branch is required:
pip install git+https://github.com/huggingface/transformers.git
Since ModernBERT is a masked language model (MLM), you can use the fill-mask pipeline or load it through AutoModelForMaskedLM.
⚠️ If your GPU supports it, we recommend using Flash Attention 2 for maximum efficiency. Installation method:
pip install flash-attn
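Once flash-attn is installed, the attention implementation can be selected when loading the model. The snippet below is a minimal sketch assuming a CUDA GPU that supports Flash Attention 2:
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,              # half precision pairs well with Flash Attention 2
    attn_implementation="flash_attention_2",
).to("cuda")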
Using AutoModelForMaskedLM:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "The future of AI is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Find the masked position and take the highest-scoring token for it
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
Using pipeline:
import torch
from transformers import pipeline
from pprint import pprint
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)
input_text = "Deep learning is [MASK] developing."
results = pipe(input_text)
pprint(results)
Note: ModernBERT does not use token type IDs, unlike earlier BERT models. Most downstream usage on the Hugging Face Hub is the same as for standard BERT models; just omit the token_type_ids parameter.
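As an illustration, a classification head can be attached exactly as with BERT, with no token_type_ids passed. This is a minimal sketch; the two-label setup and example sentence are placeholders, not part of any official recipe.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("ModernBERT brings encoder-only models up to date.", return_tensors="pt")
logits = model(**inputs).logits   # no token_type_ids are needed or passed
print(logits.shape)               # torch.Size([1, 2])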
Model Fine-tuning
ModernBERT supports efficient fine-tuning with the ms-swift framework. ms-swift is the official framework from the ModelScope community for fine-tuning and deploying large language models and multimodal large models.
Environment Preparation
First, clone and install ms-swift:
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .[llm]
Fine-tuning Example
Here’s an example script for fine-tuning a classification task using the HC3 dataset:
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model answerdotai/ModernBERT-base \
--dataset simpleai/HC3:finance_cls#20000 \
--task_type seq_cls \
--num_labels 2 \
--train_type lora \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--freeze_vit true \
--gradient_accumulation_steps 1 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 5 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4
Related Resources:
- ms-swift GitHub Repository: https://github.com/modelscope/ms-swift
- Fine-tuning Dataset: https://modelscope.cn/datasets/simpleai/HC3