Deploying Qwen 2.5 32B with vLLM
Building a dedicated inference server for local AI workloads using vLLM and AWQ quantization
After months of running smaller models on consumer hardware, it was time to build a proper inference server. The goal: run Qwen 2.5 32B locally with reasonable performance.
The Hardware
The build started with what I had available:
- CPU: Ryzen 9 5900X (12 cores, 24 threads)
- RAM: 64GB DDR4-3600
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- Storage: 2TB NVMe for models
Quantization Strategy
At 16-bit precision, a 32B model needs roughly 64GB of VRAM for the weights alone. With the 3090’s 24GB, quantization is mandatory. I went with AWQ (Activation-aware Weight Quantization), a 4-bit scheme with a strong quality-to-size ratio.
# Pull the AWQ-quantized model
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ
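As a quick sanity check on the memory math, here is a back-of-the-envelope sketch (weights only; real usage is higher because of quantization scales, activations, and the KV cache):

# Rough VRAM needed for the weights alone, ignoring scales, activations, and KV cache
params = 32e9                  # ~32 billion parameters

fp16_gb = params * 2 / 1e9     # 2 bytes per weight at 16-bit precision
awq_gb = params * 0.5 / 1e9    # 4 bits (0.5 bytes) per weight with AWQ

print(f"16-bit weights:    ~{fp16_gb:.0f} GB")  # ~64 GB -> far more than 24 GB
print(f"AWQ 4-bit weights: ~{awq_gb:.0f} GB")   # ~16 GB -> fits, with room left for the KV cache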
vLLM Configuration
vLLM handles the inference efficiently with continuous batching:
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",           # load the 4-bit AWQ weights
    gpu_memory_utilization=0.95,  # leave only a small safety margin on the 24GB card
    max_model_len=8192,           # cap the context so the KV cache fits
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)
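With the engine configured, offline generation is a single call. A minimal example (the prompt is just a placeholder):

prompts = ["Explain continuous batching in two sentences."]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)  # first completion for each prompt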
Docker Compose Setup
The final deployment uses Docker for easy management:
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-32B-Instruct-AWQ
      --quantization awq
      --gpu-memory-utilization 0.95
      --max-model-len 8192
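The container exposes vLLM's OpenAI-compatible API on port 8000, so any OpenAI client library works against it. A quick smoke test with the openai Python package (the prompt is arbitrary, and the api_key is a placeholder since vLLM doesn't check it unless --api-key is set):

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Give me a one-line status check."}],
    max_tokens=64,
)
print(response.choices[0].message.content)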
Performance Results
With AWQ quantization on the 3090:
| Metric | Value |
|---|---|
| Tokens/second | ~25-30 |
| Time to first token | ~200ms |
| Max context | 8192 tokens |
| VRAM usage | ~22GB |
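To reproduce this kind of measurement, time a generation and count the output tokens. A rough sketch, reusing the llm and sampling_params objects from the configuration section (a single prompt, so it reflects per-request speed rather than batched throughput):

import time

start = time.perf_counter()
outputs = llm.generate(["Write a short paragraph about local inference."], sampling_params)
elapsed = time.perf_counter() - start

tokens = len(outputs[0].outputs[0].token_ids)  # tokens in the single completion
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")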
What’s Next
The server is now integrated with my local MCP toolchain. Next step: experimenting with speculative decoding to push throughput higher.
This experiment is part of the ongoing effort to build a fully local AI development environment. No cloud, no API keys, no data leaving the lab.