NAME
LLM Inference Server
Dedicated hardware for local AI model deployment and inference
SPECIFICATIONS
Status: Active
Started: 2025-12
Tech Stack:
Ryzen 9 5900X, RTX 3090, Ubuntu, vLLM, Docker
Tags: [Hardware] [LLM] [Infrastructure]
STATUS TIMELINE
Proposed → Active → Testing → Complete
DOCUMENTATION
LLM Inference Server
A dedicated machine for running large language models locally, without relying on cloud APIs.
Motivation
Cloud AI APIs are convenient, but they come with drawbacks:
- Per-token pricing adds up fast
- Data leaves your control
- Latency to external servers
- Dependency on third-party uptime
A dedicated local inference server addresses all of these.
Hardware Specs
| Component | Model | Notes |
|---|---|---|
| CPU | Ryzen 9 5900X | 12 cores for preprocessing |
| RAM | 64GB DDR4-3600 | Model loading headroom |
| GPU | RTX 3090 24GB | Runs AWQ-quantized 32B models |
| Storage | 2TB NVMe | Fast model loading |
| Case | Meshify C | Good airflow for 24/7 operation |
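As a rough sanity check on why an AWQ-quantized 32B model fits in the 3090's 24GB, here is a back-of-envelope estimate. The 1.2x overhead factor for KV cache, activations, and CUDA context is an assumption, not a measured value.

```python
# Back-of-envelope VRAM estimate for an AWQ (4-bit) 32B model.
params_billion = 32
bits_per_weight = 4  # AWQ quantizes weights to roughly 4 bits

weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9  # ~16 GB of quantized weights
total_gb = weights_gb * 1.2  # assumed overhead for KV cache, activations, CUDA context
print(f"weights ≈ {weights_gb:.0f} GB, estimated total ≈ {total_gb:.0f} GB (card has 24 GB)")
```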
Software Stack
- OS: Ubuntu 22.04 LTS
- Container Runtime: Docker with NVIDIA Container Toolkit
- Inference Engine: vLLM with OpenAI-compatible API
- Model Management: Hugging Face Hub CLI
- Monitoring: Prometheus + Grafana
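A quick way to confirm the stack is up is to query the vLLM server's OpenAI-compatible model listing. This sketch assumes the server is reachable on localhost:8000 (vLLM's default port) and was started without an API key requirement; adjust the URL for your setup.

```python
import requests

# List the models the local vLLM server is currently serving.
resp = requests.get("http://localhost:8000/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```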
Current Capabilities
With the RTX 3090:
- Qwen 2.5 32B (AWQ): ~25 tokens/second
- Llama 3.1 8B: ~80 tokens/second
- Mistral 7B: ~90 tokens/second
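The figures above can be reproduced approximately by streaming a completion and timing it. This is a sketch, not a rigorous benchmark: counting one token per streamed chunk is an approximation, and the port and model id below are assumptions to adapt to your deployment.

```python
import time
from openai import OpenAI

# Rough tokens/second measurement against the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder id; match your deployment
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1  # approximation: treat each streamed chunk as one token
elapsed = time.time() - start
print(f"{tokens} tokens in {elapsed:.1f}s ≈ {tokens / elapsed:.1f} tok/s")
```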
Integration Points
The server exposes an OpenAI-compatible API, so it can act as a drop-in backend for:
- Claude Code with local models
- Custom Python applications
- Browser extensions
- Any OpenAI SDK client
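For most clients, switching to the local server is just a matter of overriding the base URL. A minimal sketch with the OpenAI Python SDK; the hostname, api_key placeholder, and model id are assumptions, not values from this deployment.

```python
from openai import OpenAI

# Point the OpenAI SDK at the local server instead of the cloud API.
# "llm-server.local" is a hypothetical LAN hostname; localhost:8000 works on the box itself.
client = OpenAI(base_url="http://llm-server.local:8000/v1", api_key="local")

reply = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # should match what /v1/models reports
    messages=[{"role": "user", "content": "Summarize why local inference is useful."}],
)
print(reply.choices[0].message.content)
```

The same base-URL override applies to any other OpenAI-compatible client; no other code changes should be needed.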
Future Plans
- Add a second GPU for larger models or parallel inference
- Implement speculative decoding for faster generation
- Build MCP server for tool-augmented workflows
- Experiment with fine-tuning on personal data