Two months after enhancing Qwen2.5-Turbo to accommodate context lengths up to one million tokens, we’re excited to introduce the open-source Qwen2.5-1M models and their accompanying inference framework support.
Open-Source Models
We are releasing two new checkpoints: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M. This marks our first open-source upgrade of Qwen models to handle contexts of one million tokens.
Inference Framework
Our fully open-sourced inference framework, built on vLLM, integrates sparse attention methods that process 1M-token inputs 3x to 7x faster than dense attention, making it substantially more efficient for developers to deploy the Qwen2.5-1M models.
Technical Report
The technical report covers the design insights behind the training and inference frameworks, along with ablation experiments. You can try the Qwen2.5-1M models through our demos on Hugging Face and ModelScope.
Qwen Chat
We recently introduced Qwen Chat, an advanced AI assistant from the Qwen series. It supports conversation, code writing, search, image and video generation, and tool use. Qwen Chat also offers the Qwen2.5-Turbo model, which supports long-context processing of up to 1M tokens.
Model Performance
Long-Context Tasks
Qwen2.5-1M models excel at long-context tasks such as Passkey Retrieval with a 1M-token context. They outperform their 128K counterparts, particularly on sequences longer than 64K tokens, and Qwen2.5-14B-Instruct-1M surpasses both Qwen2.5-Turbo and GPT-4o-mini across multiple long-context datasets.
Short-Context Tasks
Both Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M remain strong on short sequences, performing comparably to their 128K counterparts while supporting a context length eight times that of GPT-4o-mini.
Key Techniques
Long-Context Training
We expand the context length to 1M tokens with a progressive approach: increasing RoPE's base frequency (Adjusted Base Frequency) across training stages, followed by a multi-stage supervised fine-tuning process. After training, the models natively handle sequences of up to 256K tokens.
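For intuition, here is a minimal sketch (our illustration, not the training code) of what raising RoPE's base frequency does: a larger base stretches the slowest rotary wavelengths, so distant positions still map to distinguishable angles. The head dimension and base values below are illustrative only.

```python
import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies for RoPE: theta_i = base^(-2i/dim)."""
    return base ** (-torch.arange(0, dim, 2).float() / dim)

# Raising the base (Adjusted Base Frequency) lengthens the slowest rotary
# wavelength, extending the range of distinguishable relative positions.
# These base values are illustrative, not Qwen2.5-1M's exact configuration.
for base in (10_000.0, 1_000_000.0):
    inv_freq = rope_inv_freq(128, base)
    longest_wavelength = 2 * torch.pi / inv_freq[-1]
    print(f"base={base:>11,.0f}  longest wavelength ~ {longest_wavelength:,.0f} tokens")
```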
Length Extrapolation
By employing Dual Chunk Attention (DCA), which remaps large relative positional distances into ranges covered during training, we extend context support to 1M tokens without any additional training.
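As a rough illustration, heavily simplified from DCA's actual intra-/inter-/successive-chunk scheme (the chunk and window sizes here are made up), the remapping can be thought of as capping relative distances so no query-key pair exceeds what the model saw in training, while nearby tokens keep their true distances:

```python
import torch

def dca_relative_positions(seq_len: int, chunk: int = 8, local: int = 2) -> torch.Tensor:
    """Toy sketch of DCA-style position remapping (not the real algorithm).

    Intra-chunk pairs and pairs within a small local window keep their true
    relative distances; distant cross-chunk pairs are remapped to the largest
    trained offset so RoPE never sees an out-of-range distance.
    """
    pos = torch.arange(seq_len)
    rel = (pos[:, None] - pos[None, :]).clamp(min=0)  # causal distances only
    same_chunk = (pos[:, None] // chunk) == (pos[None, :] // chunk)
    nearby = rel <= local                              # preserve locality
    remapped = torch.where(same_chunk | nearby, rel, torch.full_like(rel, chunk - 1))
    return remapped  # causal masking is applied separately in attention
```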
Sparse Attention
We introduce a sparse attention mechanism based on MInference to accelerate long-context inference. Additional improvements, including integration with chunked prefill and refinement of the sparsity configuration, reduce VRAM usage while minimizing accuracy loss.
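MInference's actual kernels select per-head sparse patterns (such as vertical-slash); the sketch below shows only the general idea of block-level selection during prefill, with hypothetical pooling and top-k parameters rather than the library's real API:

```python
import torch

def block_sparse_mask(q: torch.Tensor, k: torch.Tensor,
                      block: int = 64, keep: int = 8) -> torch.Tensor:
    """Sketch of block-sparse selection: pool queries/keys per block, score
    block pairs, and keep only the top-`keep` key blocks per query block."""
    L, d = q.shape
    nb = L // block
    qb = q[: nb * block].reshape(nb, block, d).mean(1)  # pooled query blocks
    kb = k[: nb * block].reshape(nb, block, d).mean(1)  # pooled key blocks
    scores = (qb @ kb.T) / d ** 0.5                     # block-level affinity
    causal = torch.tril(torch.ones(nb, nb, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    topk = scores.topk(min(keep, nb), dim=-1).indices   # key blocks to keep
    mask = torch.zeros(nb, nb, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    mask |= torch.eye(nb, dtype=torch.bool)             # always keep diagonal
    return mask & causal                                # (nb, nb) block mask
```

Attention is then computed only over the selected key blocks for each query block, which is where the multi-fold prefill speedup comes from.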
Deploying Qwen2.5-1M Models Locally
System Preparation
For optimal performance, use GPUs with the Ampere or Hopper architecture. Make sure CUDA 12.1 or 12.3 is installed, along with Python >=3.9 and <=3.12. Qwen2.5-7B-Instruct-1M requires about 120GB of VRAM and Qwen2.5-14B-Instruct-1M about 320GB, summed across all GPUs.
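If PyTorch is already available, a quick sanity check along these lines (our snippet, not part of the official setup) can confirm the environment; Ampere corresponds to compute capability 8.x and Hopper to 9.x:

```python
import sys
import torch

# Check the Python version range required by the framework.
assert (3, 9) <= sys.version_info[:2] <= (3, 12), "need Python >=3.9 and <=3.12"

# Ampere is SM 8.x, Hopper is SM 9.x.
major, minor = torch.cuda.get_device_capability(0)
assert major >= 8, "an Ampere or Hopper architecture GPU is required"

# Sum VRAM across all visible GPUs, since the requirements are totals.
total = sum(torch.cuda.get_device_properties(i).total_memory
            for i in range(torch.cuda.device_count()))
print(f"CUDA {torch.version.cuda}, SM {major}.{minor}, "
      f"total VRAM ~ {total / 2**30:.0f} GiB")
```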
Install Dependencies
Clone our vLLM branch and install:
```bash
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v
```
Launch OpenAI-Compatible API Service
Configure and start the service with:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1
```
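Here, --tensor-parallel-size shards the model across four GPUs, --max-model-len bounds the serveable context length, and --enable-chunked-prefill with --max-num-batched-tokens 131072 splits the 1M-token prefill into manageable chunks, while --max-num-seqs 1 limits concurrency to a single request. Tune these values to your hardware.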
Interact with the Model
Use curl or Python to interact with the deployed model. With curl (a Python example follows below):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-1M",
    "messages": [{"role": "user", "content": "Tell me something about long-context models."}],
    "max_tokens": 512
  }'
```