Deploying Qwen 2.5 32B with vLLM
Building a dedicated inference server for local AI workloads using vLLM and AWQ quantization
After months of running smaller models on consumer hardware, it was time to build a proper inference server. The goal: run Qwen 2.5 32B locally with reasonable performance.
The Hardware
The build started with what I had available:
- CPU: Ryzen 9 5900X (12 cores, 24 threads)
- RAM: 64GB DDR4-3600
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- Storage: 2TB NVMe for models
Quantization Strategy
At 16-bit precision, a 32B model needs roughly 64GB of VRAM for the weights alone. With the 3090’s 24GB, quantization is mandatory. I went with AWQ (Activation-aware Weight Quantization), a 4-bit scheme with a strong quality-to-size ratio.
# Pull the AWQ-quantized model
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ
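As a quick sanity check on the memory math, here is a back-of-the-envelope sketch (weights only; real usage is higher because of quantization scales, activations, and the KV cache):

# Rough VRAM needed for the weights alone, ignoring scales, activations, and KV cache
params = 32e9                  # ~32 billion parameters

fp16_gb = params * 2 / 1e9     # 2 bytes per weight at 16-bit precision
awq_gb = params * 0.5 / 1e9    # 4 bits (0.5 bytes) per weight with AWQ

print(f"16-bit weights:    ~{fp16_gb:.0f} GB")  # ~64 GB -> far more than 24 GB
print(f"AWQ 4-bit weights: ~{awq_gb:.0f} GB")   # ~16 GB -> fits, with room left for the KV cache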
vLLM Configuration
vLLM handles the inference efficiently with continuous batching:
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",           # load the 4-bit AWQ weights
    gpu_memory_utilization=0.95,  # leave only a small safety margin on the 24GB card
    max_model_len=8192,           # cap the context so the KV cache fits
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)
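With the engine configured, offline generation is a single call. A minimal example (the prompt is just a placeholder):

prompts = ["Explain continuous batching in two sentences."]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)  # first completion for each prompt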
Docker Compose Setup
The final deployment uses Docker for easy management:
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-32B-Instruct-AWQ
      --quantization awq
      --gpu-memory-utilization 0.95
      --max-model-len 8192
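The container exposes vLLM's OpenAI-compatible API on port 8000, so any OpenAI client library works against it. A quick smoke test with the openai Python package (the prompt is arbitrary, and the api_key is a placeholder since vLLM doesn't check it unless --api-key is set):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Give me a one-line status check."}],
    max_tokens=64,
)
print(response.choices[0].message.content)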
Performance Results
With AWQ quantization on the 3090:
| Metric | Value |
|---|---|
| Tokens/second | ~25-30 |
| Time to first token | ~200ms |
| Max context | 8192 tokens |
| VRAM usage | ~22GB |
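To reproduce this kind of measurement, time a generation and count the output tokens. A rough sketch, reusing the llm and sampling_params objects from the configuration section (a single prompt, so it reflects per-request speed rather than batched throughput):

import time

start = time.perf_counter()
outputs = llm.generate(["Write a short paragraph about local inference."], sampling_params)
elapsed = time.perf_counter() - start

tokens = len(outputs[0].outputs[0].token_ids)  # tokens in the single completion
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")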
What’s Next
The server is now integrated with my local MCP toolchain. Next step: experimenting with speculative decoding to push throughput higher.
This experiment is part of the ongoing effort to build a fully local AI development environment. No cloud, no API keys, no data leaving the lab.