NAME

LLM Inference Server

Dedicated hardware for local AI model deployment and inference

SPECIFICATIONS

Status: Active
Started: 2025-12
Tech Stack: Ryzen 9 5900X, RTX 3090, Ubuntu, vLLM, Docker
Tags: [Hardware] [LLM] [Infrastructure]

STATUS TIMELINE

Proposed → Active → Testing → Complete
DOCUMENTATION

LLM Inference Server

A dedicated machine for running large language models locally, without relying on cloud APIs.

Motivation

Cloud AI APIs are convenient but come with costs:

  • Per-token pricing adds up fast
  • Data leaves your control
  • Latency to external servers
  • Dependency on third-party uptime

A local inference server addresses all of these.

Hardware Specs

Component  Model              Notes
CPU        Ryzen 9 5900X      12 cores for preprocessing
RAM        64 GB DDR4-3600    Model loading headroom
GPU        RTX 3090 (24 GB)   AWQ-quantized 32B models
Storage    2 TB NVMe          Fast model loading
Case       Meshify C          Good airflow for 24/7 operation

Software Stack

  • OS: Ubuntu 22.04 LTS
  • Container Runtime: Docker with NVIDIA Container Toolkit
  • Inference Engine: vLLM with OpenAI-compatible API
  • Model Management: Hugging Face Hub CLI
  • Monitoring: Prometheus + Grafana
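
For reference, here is a minimal sketch of loading an AWQ-quantized model through vLLM's offline Python API. The model name and context length are assumptions; in practice the box runs the same engine behind vLLM's OpenAI-compatible HTTP server rather than this offline mode.

  # Sketch: load an AWQ-quantized model with vLLM's offline Python API.
  # Model name and max_model_len are assumptions, not the server's actual config.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed Hugging Face repo id
      quantization="awq",
      max_model_len=8192,
  )
  params = SamplingParams(temperature=0.7, max_tokens=256)
  outputs = llm.generate(["Explain why AWQ helps fit 32B models in 24 GB."], params)
  print(outputs[0].outputs[0].text)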

Current Capabilities

Approximate generation throughput on the RTX 3090 (see the measurement sketch after this list):

  • Qwen 2.5 32B (AWQ): ~25 tokens/second
  • Llama 3.1 8B: ~80 tokens/second
  • Mistral 7B: ~90 tokens/second
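
One way to sanity-check these numbers is to time a completion against the server and divide the reported completion tokens by wall-clock time. A minimal sketch follows, assuming the server listens on localhost:8000 with the AWQ Qwen 2.5 32B model loaded (both assumptions); the result includes prompt processing time, so it slightly understates pure decode speed.

  import time
  from openai import OpenAI

  # Assumed endpoint and model name -- adjust to whatever vLLM was started with.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  start = time.monotonic()
  resp = client.chat.completions.create(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",
      messages=[{"role": "user", "content": "Explain AWQ quantization in ~300 words."}],
      max_tokens=512,
      temperature=0.7,
  )
  elapsed = time.monotonic() - start

  tokens = resp.usage.completion_tokens
  print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")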

Integration Points

The server exposes an OpenAI-compatible API, so it slots in as a drop-in backend for (see the example after this list):

  • Claude Code with local models
  • Custom Python applications
  • Browser extensions
  • Any OpenAI SDK client
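
To illustrate the drop-in aspect: an ordinary OpenAI SDK client only needs its base_url pointed at the local machine, with everything else left unchanged. The hostname, port, and model name below are placeholders, not the server's actual configuration.

  from openai import OpenAI

  # Point a standard OpenAI SDK client at the local vLLM server.
  # "inference-server.local" and the model name are placeholders.
  client = OpenAI(base_url="http://inference-server.local:8000/v1", api_key="local")

  stream = client.chat.completions.create(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",
      messages=[{"role": "user", "content": "List three benefits of local inference."}],
      stream=True,
  )
  for chunk in stream:
      delta = chunk.choices[0].delta.content
      if delta:
          print(delta, end="", flush=True)
  print()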

Future Plans

  1. Add a second GPU for larger models or parallel inference
  2. Implement speculative decoding for faster generation
  3. Build an MCP server for tool-augmented workflows
  4. Experiment with fine-tuning on personal data