An intelligent local LLM routing system that automatically selects the optimal model for each task. Built on a Mac Studio M3 Ultra with 512GB of unified memory.
The Stack
Hardware
- Mac Studio M3 Ultra with 512GB unified memory
Software
- Ollama for model serving
- Python router for task classification
- FastAPI server (OpenAI-compatible endpoints)
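As a rough illustration, the FastAPI layer might look like the sketch below: an OpenAI-compatible /v1/chat/completions route that picks a model and forwards the request to Ollama. The port, request shape, and `pick_model` placeholder are assumptions for illustration, not the actual implementation.

```python
# Minimal sketch: OpenAI-compatible endpoint that forwards to a local Ollama server.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434"  # Ollama's default port


class ChatRequest(BaseModel):
    model: str = "auto"        # "auto" lets the router choose
    messages: list[dict]
    stream: bool = False


def pick_model(prompt: str) -> str:
    # Placeholder; a fuller classifier sketch appears under "Intelligent Routing".
    return "llama3.1:8b"


@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    model = pick_model(req.messages[-1]["content"]) if req.model == "auto" else req.model
    async with httpx.AsyncClient(timeout=300) as client:
        # Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions.
        resp = await client.post(
            f"{OLLAMA_URL}/v1/chat/completions",
            json={"model": model, "messages": req.messages, "stream": False},
        )
    return resp.json()
```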
Models
- llama3.1:70b (42GB) - Complex reasoning
- llama3.1:8b (4.9GB) - Code and general tasks
- llama3.2:3b (2GB) - Quick responses
- gemma2:2b (1.6GB) - Simple queries
Intelligent Routing
The router classifies each request and selects the optimal model. Simple queries go to fast, lightweight models, while complex reasoning goes to the 70B model (a minimal classifier sketch follows the list below).
Model selection by task type:
- Simple chat → gemma2:2b (180+ tokens/sec)
- Coding → llama3.1:8b (98 tokens/sec)
- Complex reasoning → llama3.1:70b (14 tokens/sec)
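Below is a minimal sketch of what this classification could look like, assuming simple keyword and prompt-length heuristics. The hint lists and thresholds are illustrative assumptions, not the router's actual rules; the model names match the list above.

```python
# Illustrative keyword-based classifier; the real router's heuristics are not shown here.
MODEL_FOR_TASK = {
    "simple":    "gemma2:2b",
    "coding":    "llama3.1:8b",
    "reasoning": "llama3.1:70b",
}

CODE_HINTS = ("code", "function", "bug", "refactor", "implement", "python")
REASONING_HINTS = ("analyze", "compare", "prove", "explain in depth", "design")


def classify(prompt: str) -> str:
    text = prompt.lower()
    # Long or analysis-heavy prompts go to the reasoning tier.
    if any(hint in text for hint in REASONING_HINTS) or len(text.split()) > 200:
        return "reasoning"
    if any(hint in text for hint in CODE_HINTS):
        return "coding"
    return "simple"


def pick_model(prompt: str) -> str:
    return MODEL_FOR_TASK[classify(prompt)]
```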
Performance Results
- "Hello!" → gemma2:2b → 1.26s
- "Write a sort function" → llama3.1:8b → 2.07s
- "Analyze quantum computing" → llama3.1:70b → 10.58s
Benchmark results:
- llama3.1:8b: 593 t/s prompt processing, 98 t/s generation
- llama3.1:70b: 103 t/s prompt processing, 14 t/s generation
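For reference, a sketch of how prompt and generation throughput can be measured from the timing fields in Ollama's native /api/generate response; the example prompt is illustrative, and throughput varies with context length.

```python
# Sketch: measure prompt-processing and generation throughput via Ollama's native API.
# Duration fields are reported in nanoseconds.
import requests


def benchmark(model: str, prompt: str) -> None:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model}: {prompt_tps:.0f} t/s prompt, {gen_tps:.0f} t/s generation")


benchmark("llama3.1:8b", "Write a short summary of unified memory on Apple Silicon.")
```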
Integration Status
Because the server exposes OpenAI-compatible endpoints, the local brain works with any OpenAI-compatible tool:
- Direct API calls (example below)
- Continue.dev, Cursor IDE
- Custom applications
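For example, a direct API call using the official openai Python client, assuming the router is served locally on port 8000 (the port and api_key value are illustrative; the key is unused locally):

```python
# Point any OpenAI-compatible client at the local router instead of api.openai.com.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="auto",  # let the router pick the model
    messages=[{"role": "user", "content": "Write a sort function in Python."}],
)
print(resp.choices[0].message.content)
```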
Note: OpenClaw integration is not currently working. The framework doesn't yet support custom provider endpoints for local models. We're hoping for a fix soon.