Two months after enhancing Qwen2.5-Turbo to accommodate context lengths up to one million tokens, we’re excited to introduce the open-source Qwen2.5-1M models and their accompanying inference framework support.
Open-Source Models
We are releasing two new checkpoints: Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M. This marks our first open-source upgrade of Qwen models to handle contexts of one million tokens.
Inference Framework
Our fully open-sourced inference framework, built on vLLM, integrates sparse attention methods that process 1M-token inputs 3x to 7x faster than dense attention, making it substantially more efficient for developers to deploy the Qwen2.5-1M models.
Technical Report
The technical report covers the design insights behind the training and inference frameworks, along with ablation experiments. You can try the Qwen2.5-1M models through our demos on Hugging Face and ModelScope.
Qwen Chat
We recently introduced Qwen Chat, an advanced AI assistant from the Qwen series. It supports conversation, code writing, search, image and video generation, and tool use. Qwen Chat also offers the Qwen2.5-Turbo model, which supports long-context processing of up to 1M tokens.
Model Performance
Long-Context Tasks
Qwen2.5-1M models excel at long-context tasks such as Passkey Retrieval with a 1M-token context. They outperform their 128K counterparts, particularly on sequences longer than 64K tokens, and Qwen2.5-14B-Instruct-1M surpasses both Qwen2.5-Turbo and GPT-4o-mini across multiple long-context datasets.
Short-Context Tasks
Both Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M remain strong on short sequences, performing comparably to their 128K counterparts while supporting a context length eight times that of GPT-4o-mini.
Key Techniques
Long-Context Training
We expand the context length to 1M tokens with a progressive approach: increasing RoPE's base frequency (Adjusted Base Frequency) across training stages, followed by a multi-stage supervised fine-tuning process. After training, the models natively handle sequences of up to 256K tokens.
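For intuition, here is a minimal sketch (our illustration, not the training code) of what raising RoPE's base frequency does: a larger base stretches the slowest rotary wavelengths, so distant positions still map to distinguishable angles. The head dimension and base values below are illustrative only.

```python
import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies for RoPE: theta_i = base^(-2i/dim)."""
    return base ** (-torch.arange(0, dim, 2).float() / dim)

# Raising the base (Adjusted Base Frequency) lengthens the slowest rotary
# wavelength, extending the range of distinguishable relative positions.
# These base values are illustrative, not Qwen2.5-1M's exact configuration.
for base in (10_000.0, 1_000_000.0):
    inv_freq = rope_inv_freq(128, base)
    longest_wavelength = 2 * torch.pi / inv_freq[-1]
    print(f"base={base:>11,.0f}  longest wavelength ~ {longest_wavelength:,.0f} tokens")
```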
Length Extrapolation
By employing Dual Chunk Attention (DCA), which remaps large relative positional distances into ranges covered during training, we extend context support to 1M tokens without any additional training.
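As a rough illustration, heavily simplified from DCA's actual intra-/inter-/successive-chunk scheme (the chunk and window sizes here are made up), the remapping can be thought of as capping relative distances so no query-key pair exceeds what the model saw in training, while nearby tokens keep their true distances:

```python
import torch

def dca_relative_positions(seq_len: int, chunk: int = 8, local: int = 2) -> torch.Tensor:
    """Toy sketch of DCA-style position remapping (not the real algorithm).

    Intra-chunk pairs and pairs within a small local window keep their true
    relative distances; distant cross-chunk pairs are remapped to the largest
    trained offset so RoPE never sees an out-of-range distance.
    """
    pos = torch.arange(seq_len)
    rel = (pos[:, None] - pos[None, :]).clamp(min=0)  # causal distances only
    same_chunk = (pos[:, None] // chunk) == (pos[None, :] // chunk)
    nearby = rel <= local                              # preserve locality
    remapped = torch.where(same_chunk | nearby, rel, torch.full_like(rel, chunk - 1))
    return remapped  # causal masking is applied separately in attention
```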
Sparse Attention
We introduce a sparse attention mechanism based on MInference to accelerate long-context inference. Additional improvements, including integration with chunked prefill and refinement of the sparsity configuration, reduce VRAM usage while minimizing accuracy loss.
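MInference's actual kernels select per-head sparse patterns (such as vertical-slash); the sketch below shows only the general idea of block-level selection during prefill, with hypothetical pooling and top-k parameters rather than the library's real API:

```python
import torch

def block_sparse_mask(q: torch.Tensor, k: torch.Tensor,
                      block: int = 64, keep: int = 8) -> torch.Tensor:
    """Sketch of block-sparse selection: pool queries/keys per block, score
    block pairs, and keep only the top-`keep` key blocks per query block."""
    L, d = q.shape
    nb = L // block
    qb = q[: nb * block].reshape(nb, block, d).mean(1)  # pooled query blocks
    kb = k[: nb * block].reshape(nb, block, d).mean(1)  # pooled key blocks
    scores = (qb @ kb.T) / d ** 0.5                     # block-level affinity
    causal = torch.tril(torch.ones(nb, nb, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    topk = scores.topk(min(keep, nb), dim=-1).indices   # key blocks to keep
    mask = torch.zeros(nb, nb, dtype=torch.bool)
    mask.scatter_(1, topk, True)
    mask |= torch.eye(nb, dtype=torch.bool)             # always keep diagonal
    return mask & causal                                # (nb, nb) block mask
```

Attention is then computed only over the selected key blocks for each query block, which is where the multi-fold prefill speedup comes from.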
Deploying Qwen2.5-1M Models Locally
System Preparation
For optimal performance, use GPUs with the Ampere or Hopper architecture. Make sure CUDA 12.1 or 12.3 is installed, along with Python >=3.9 and <=3.12. Qwen2.5-7B-Instruct-1M requires about 120GB of VRAM and Qwen2.5-14B-Instruct-1M about 320GB, summed across all GPUs.
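If PyTorch is already available, a quick sanity check along these lines (our snippet, not part of the official setup) can confirm the environment; Ampere corresponds to compute capability 8.x and Hopper to 9.x:

```python
import sys
import torch

# Check the Python version range required by the framework.
assert (3, 9) <= sys.version_info[:2] <= (3, 12), "need Python >=3.9 and <=3.12"

# Ampere is SM 8.x, Hopper is SM 9.x.
major, minor = torch.cuda.get_device_capability(0)
assert major >= 8, "an Ampere or Hopper architecture GPU is required"

# Sum VRAM across all visible GPUs, since the requirements are totals.
total = sum(torch.cuda.get_device_properties(i).total_memory
            for i in range(torch.cuda.device_count()))
print(f"CUDA {torch.version.cuda}, SM {major}.{minor}, "
      f"total VRAM ~ {total / 2**30:.0f} GiB")
```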
Install Dependencies
Clone our vLLM branch and install:
```bash
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v
```
Launch OpenAI-Compatible API Service
Configure and start the service with:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1
```
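Here, --tensor-parallel-size shards the model across four GPUs, --max-model-len bounds the serveable context length, and --enable-chunked-prefill with --max-num-batched-tokens 131072 splits the 1M-token prefill into manageable chunks, while --max-num-seqs 1 limits concurrency to a single request. Tune these values to your hardware.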
Interact with the Model
Use curl or Python to interact with the deployed model. With curl (a Python example follows below):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-1M",
    "messages": [{"role": "user", "content": "Tell me something about long-context models."}],
    "max_tokens": 512
  }'
```