NAME

LLM Inference Server

Dedicated hardware for local AI model deployment and inference

SPECIFICATIONS

Status: Active
Started: 2025-12
Tech Stack: Ryzen 9 5900X, RTX 3090, Ubuntu, vLLM, Docker
Tags: [Hardware] [LLM] [Infrastructure]

STATUS TIMELINE

Proposed → Active → Testing → Complete
DOCUMENTATION

LLM Inference Server

A dedicated machine for running large language models locally, without relying on cloud APIs.

Motivation

Cloud AI APIs are convenient but come with costs:

  • Per-token pricing adds up fast
  • Data leaves your control
  • Latency to external servers
  • Dependency on third-party uptime

A local inference server addresses all of these.

Hardware Specs

Component  Model              Notes
CPU        Ryzen 9 5900X      12 cores for preprocessing
RAM        64 GB DDR4-3600    Model loading headroom
GPU        RTX 3090 (24 GB)   AWQ-quantized 32B models
Storage    2 TB NVMe          Fast model loading
Case       Meshify C          Good airflow for 24/7 operation

Software Stack

  • OS: Ubuntu 22.04 LTS
  • Container Runtime: Docker with NVIDIA Container Toolkit
  • Inference Engine: vLLM with OpenAI-compatible API
  • Model Management: Hugging Face Hub CLI
  • Monitoring: Prometheus + Grafana
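
For reference, here is a minimal sketch of loading an AWQ-quantized model through vLLM's offline Python API. The model name and context length are assumptions; in practice the box runs the same engine behind vLLM's OpenAI-compatible HTTP server rather than this offline mode.

  # Sketch: load an AWQ-quantized model with vLLM's offline Python API.
  # Model name and max_model_len are assumptions, not the server's actual config.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # assumed Hugging Face repo id
      quantization="awq",
      max_model_len=8192,
  )
  params = SamplingParams(temperature=0.7, max_tokens=256)
  outputs = llm.generate(["Explain why AWQ helps fit 32B models in 24 GB."], params)
  print(outputs[0].outputs[0].text)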

Current Capabilities

Approximate generation throughput on the RTX 3090 (see the measurement sketch after this list):

  • Qwen 2.5 32B (AWQ): ~25 tokens/second
  • Llama 3.1 8B: ~80 tokens/second
  • Mistral 7B: ~90 tokens/second
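
One way to sanity-check these numbers is to time a completion against the server and divide the reported completion tokens by wall-clock time. A minimal sketch follows, assuming the server listens on localhost:8000 with the AWQ Qwen 2.5 32B model loaded (both assumptions); the result includes prompt processing time, so it slightly understates pure decode speed.

  import time
  from openai import OpenAI

  # Assumed endpoint and model name -- adjust to whatever vLLM was started with.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

  start = time.monotonic()
  resp = client.chat.completions.create(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",
      messages=[{"role": "user", "content": "Explain AWQ quantization in ~300 words."}],
      max_tokens=512,
      temperature=0.7,
  )
  elapsed = time.monotonic() - start

  tokens = resp.usage.completion_tokens
  print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")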

Integration Points

The server exposes an OpenAI-compatible API, so it slots in as a drop-in backend for (see the example after this list):

  • Claude Code with local models
  • Custom Python applications
  • Browser extensions
  • Any OpenAI SDK client
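
To illustrate the drop-in aspect: an ordinary OpenAI SDK client only needs its base_url pointed at the local machine, with everything else left unchanged. The hostname, port, and model name below are placeholders, not the server's actual configuration.

  from openai import OpenAI

  # Point a standard OpenAI SDK client at the local vLLM server.
  # "inference-server.local" and the model name are placeholders.
  client = OpenAI(base_url="http://inference-server.local:8000/v1", api_key="local")

  stream = client.chat.completions.create(
      model="Qwen/Qwen2.5-32B-Instruct-AWQ",
      messages=[{"role": "user", "content": "List three benefits of local inference."}],
      stream=True,
  )
  for chunk in stream:
      delta = chunk.choices[0].delta.content
      if delta:
          print(delta, end="", flush=True)
  print()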

Future Plans

  1. Add a second GPU for larger models or parallel inference
  2. Implement speculative decoding for faster generation
  3. Build an MCP server for tool-augmented workflows
  4. Experiment with fine-tuning on personal data