PROJECT(1) PROJECT(1)

NAME

Deploying Qwen 2.5 32B with vLLM

Building a dedicated inference server for local AI workloads using vLLM and AWQ quantization

SYNOPSIS

Date: December 24, 2025
Status: Complete
Tags: [LLM] [Infrastructure] [Docker] [Python]
DESCRIPTION

After months of running smaller models on consumer hardware, it was time to build a proper inference server. The goal: run Qwen 2.5 32B locally with reasonable performance.

The Hardware

The build started with what I had available:

  • CPU: Ryzen 9 5900X (12 cores, 24 threads)
  • RAM: 64GB DDR4-3600
  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • Storage: 2TB NVMe for models
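
Before any model work, a quick check that the GPU is actually visible to CUDA. A minimal sketch, assuming PyTorch with CUDA support is installed:

# Sanity-check GPU visibility and report available VRAM.
import torch

assert torch.cuda.is_available(), "CUDA not available"
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")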

Quantization Strategy

At 16-bit precision, a 32B model needs ~64GB of VRAM for the weights alone. With a 3090's 24GB, quantization is mandatory. I went with AWQ (Activation-aware Weight Quantization), which offers one of the better quality-to-size trade-offs at 4-bit.
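
The back-of-the-envelope math, as a sketch; real usage adds KV cache and runtime overhead on top of the weights:

# Weight memory for a 32B-parameter model at different precisions.
# Ignores KV cache and activations, which add several GB on top.
params = 32e9

fp16_gb = params * 2 / 1e9    # 2 bytes/param  -> ~64 GB
awq_gb  = params * 0.5 / 1e9  # 4-bit weights  -> ~16 GB (plus scale overhead)

print(f"FP16 weights: ~{fp16_gb:.0f} GB, AWQ 4-bit: ~{awq_gb:.0f} GB")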

# Pull the AWQ-quantized model
huggingface-cli download Qwen/Qwen2.5-32B-Instruct-AWQ

vLLM Configuration

vLLM handles the inference, using continuous batching to keep the GPU saturated across concurrent requests:

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint. 0.95 GPU memory utilization leaves a small
# margin for CUDA overhead while maximizing space for the KV cache.
llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)

# Default sampling settings for generation requests.
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)
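
A quick smoke test of the offline API above; the prompt is just an illustration:

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(
    ["Explain continuous batching in one paragraph."],
    sampling_params,
)
print(outputs[0].outputs[0].text)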

Docker Compose Setup

The final deployment uses Docker for easy management:

version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Share the host HF cache so the model isn't re-downloaded.
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-32B-Instruct-AWQ
      --quantization awq
      --gpu-memory-utilization 0.95
      --max-model-len 8192
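
Once the container is up, the OpenAI-compatible endpoint can be hit from any HTTP client. A minimal sketch using Python's requests; the prompt and parameters are illustrative:

# Exercise the OpenAI-compatible chat endpoint exposed on port 8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Hello from the lab."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])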

Performance Results

With AWQ quantization on the 3090:

Metric                  Value
------                  -----
Tokens/second           ~25-30
Time to first token     ~200ms
Max context             8192 tokens
VRAM usage              ~22GB
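
A rough way to reproduce the tokens/second figure: time one request and divide by the completion token count. A sketch against the endpoint above; end-to-end timing includes time to first token, so it slightly understates steady-state throughput:

# Measure end-to-end generation throughput via the completions endpoint.
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-32B-Instruct-AWQ",
        "prompt": "Write a short essay about GPUs.",
        "max_tokens": 512,
    },
    timeout=300,
).json()
elapsed = time.time() - t0

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")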

What’s Next

The server is now integrated with my local MCP toolchain. Next step: experimenting with speculative decoding to push throughput higher.
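
For reference, vLLM exposes draft-model speculative decoding through engine arguments; the exact argument names have shifted across releases, so treat this as a sketch rather than a tested config, and the 0.5B draft model pairing is an assumption:

# Sketch: draft-model speculative decoding. Argument names follow
# vLLM ~0.5.x (speculative_model / num_speculative_tokens); newer
# releases moved these under a speculative_config dict.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",
    speculative_model="Qwen/Qwen2.5-0.5B-Instruct",  # assumed draft model
    num_speculative_tokens=5,
    gpu_memory_utilization=0.95,
    max_model_len=8192,
)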


This experiment is part of the ongoing effort to build a fully local AI development environment. No cloud, no API keys, no data leaving the lab.
