An intelligent local LLM routing system that automatically selects the optimal model for each task. Built on a Mac Studio M3 Ultra with 512GB of unified memory.
The Stack
Hardware
- Mac Studio M3 Ultra with 512GB unified memory
Software
- Ollama for model serving
- Python router for task classification
- FastAPI server (OpenAI-compatible endpoints)
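As a rough illustration, the FastAPI layer might look like the sketch below: an OpenAI-compatible /v1/chat/completions route that picks a model and forwards the request to Ollama. The port, request shape, and `pick_model` placeholder are assumptions for illustration, not the actual implementation.

```python
# Minimal sketch: OpenAI-compatible endpoint that forwards to a local Ollama server.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://localhost:11434"  # Ollama's default port


class ChatRequest(BaseModel):
    model: str = "auto"        # "auto" lets the router choose
    messages: list[dict]
    stream: bool = False


def pick_model(prompt: str) -> str:
    # Placeholder; a fuller classifier sketch appears under "Intelligent Routing".
    return "llama3.1:8b"


@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    model = pick_model(req.messages[-1]["content"]) if req.model == "auto" else req.model
    async with httpx.AsyncClient(timeout=300) as client:
        # Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions.
        resp = await client.post(
            f"{OLLAMA_URL}/v1/chat/completions",
            json={"model": model, "messages": req.messages, "stream": False},
        )
    return resp.json()
```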
Models
- llama3.1:70b (42GB) - Complex reasoning
- llama3.1:8b (4.9GB) - Code and general tasks
- llama3.2:3b (2GB) - Quick responses
- gemma2:2b (1.6GB) - Simple queries
Intelligent Routing
The router classifies each request and selects the optimal model. Simple queries go to fast, lightweight models, while complex reasoning goes to the 70B model (a minimal classifier sketch follows the list below).
Model selection by task type:
- Simple chat → gemma2:2b (180+ tokens/sec)
- Coding → llama3.1:8b (98 tokens/sec)
- Complex reasoning → llama3.1:70b (14 tokens/sec)
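Below is a minimal sketch of what this classification could look like, assuming simple keyword and prompt-length heuristics. The hint lists and thresholds are illustrative assumptions, not the router's actual rules; the model names match the list above.

```python
# Illustrative keyword-based classifier; the real router's heuristics are not shown here.
MODEL_FOR_TASK = {
    "simple":    "gemma2:2b",
    "coding":    "llama3.1:8b",
    "reasoning": "llama3.1:70b",
}

CODE_HINTS = ("code", "function", "bug", "refactor", "implement", "python")
REASONING_HINTS = ("analyze", "compare", "prove", "explain in depth", "design")


def classify(prompt: str) -> str:
    text = prompt.lower()
    # Long or analysis-heavy prompts go to the reasoning tier.
    if any(hint in text for hint in REASONING_HINTS) or len(text.split()) > 200:
        return "reasoning"
    if any(hint in text for hint in CODE_HINTS):
        return "coding"
    return "simple"


def pick_model(prompt: str) -> str:
    return MODEL_FOR_TASK[classify(prompt)]
```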
Performance Results
- "Hello!" → gemma2:2b → 1.26s
- "Write a sort function" → llama3.1:8b → 2.07s
- "Analyze quantum computing" → llama3.1:70b → 10.58s
Benchmark results:
- llama3.1:8b: 593 t/s prompt processing, 98 t/s generation
- llama3.1:70b: 103 t/s prompt processing, 14 t/s generation
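For reference, a sketch of how prompt and generation throughput can be measured from the timing fields in Ollama's native /api/generate response; the example prompt is illustrative, and throughput varies with context length.

```python
# Sketch: measure prompt-processing and generation throughput via Ollama's native API.
# Duration fields are reported in nanoseconds.
import requests


def benchmark(model: str, prompt: str) -> None:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{model}: {prompt_tps:.0f} t/s prompt, {gen_tps:.0f} t/s generation")


benchmark("llama3.1:8b", "Write a short summary of unified memory on Apple Silicon.")
```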
Integration Status
Because the server exposes OpenAI-compatible endpoints, the local brain works with any OpenAI-compatible tool:
- Direct API calls (example below)
- Continue.dev, Cursor IDE
- Custom applications
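For example, a direct API call using the official openai Python client, assuming the router is served locally on port 8000 (the port and api_key value are illustrative; the key is unused locally):

```python
# Point any OpenAI-compatible client at the local router instead of api.openai.com.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="auto",  # let the router pick the model
    messages=[{"role": "user", "content": "Write a sort function in Python."}],
)
print(resp.choices[0].message.content)
```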
Note: OpenClaw integration is not currently working. The framework doesn't yet support custom provider endpoints for local models. We're hoping for a fix soon.