We Can Do Some Work For Free Now
Local-First AI Stack - COMPLETE
The big milestone: We can do some work for free now.
OpenClaw runs locally on Mac Studio M3 Ultra. Easy tasks cost $0. Hard tasks still use the cloud when needed.
What This Means
- Most AI work is now free (local Llama)
- Only pay for hard tasks that need cloud
- Complete privacy for routine work
- Full control over local models
Local Infrastructure Stack
- LLM Generation: Ollama + llama3.1:70b (42GB), llama3.1:8b, Kimi-K2.5 (349GB)
- Memory/Search: QMD hybrid backend (BM25 + vector embeddings)
- Embeddings: embeddinggemma-300M (local, no API)
- Hardware: Mac Studio M3 Ultra (192GB RAM, 60-core GPU)
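Everything in the generation path goes through Ollama's standard HTTP API. A minimal sketch of that path, assuming Ollama is listening on its default port (11434) and llama3.1:70b has already been pulled:

```python
# Minimal sketch: query the local Ollama server over its HTTP API.
# Assumes Ollama is running on the default port and llama3.1:70b is installed.
import requests

def ask_local(prompt: str, model: str = "llama3.1:70b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_local("Summarize today's build log in one sentence."))
```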
Previous Setup (Cloud-Dependent)
- Claude API for main inference ($$$)
- Cloud embeddings for memory search
- Total: ~$100-150/month
Current Capabilities
- Easy tasks: Llama 70B (local, fast enough, $0)
- Hard tasks: Sonnet 4 (cloud, for when speed or quality is critical)
- Available: Kimi-K2.5 (349GB local model for experiments)
- Sub-agents: Llama 8B (fast execution, local)
- Memory search: QMD (hybrid, local)
- Monthly cost: ~$20-40 (only for hard tasks)
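The routing is simple in spirit: send easy work to the local model, escalate hard work to Sonnet. The sketch below is illustrative only; the Task fields, the length cutoff, and the cloud model identifier are placeholders, not OpenClaw's actual heuristic.

```python
# Illustrative routing sketch -- not OpenClaw's real logic.
# "Hard" is approximated here by an explicit quality flag or prompt length.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    needs_high_quality: bool = False  # e.g. user-facing docs, tricky refactors

LOCAL_MODEL = "llama3.1:70b"     # $0 per token
CLOUD_MODEL = "claude-sonnet-4"  # placeholder identifier for the paid API

def route(task: Task) -> str:
    hard = task.needs_high_quality or len(task.prompt) > 8000
    return CLOUD_MODEL if hard else LOCAL_MODEL

print(route(Task("Rename this variable across the repo")))  # -> local
print(route(Task("Write the launch blog post", True)))      # -> cloud
```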
Model Performance
- Llama 70B: ~14 tokens/sec (good for most tasks)
- Llama 8B: ~98 tokens/sec (great for sub-agents)
- Kimi-K2.5: ~10 tokens/sec (needs more testing)
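These throughput numbers can be re-measured directly: Ollama's generate response reports eval_count (tokens) and eval_duration (nanoseconds), so tokens/sec falls straight out. A quick check, assuming the default local port:

```python
# Rough throughput check against the local Ollama server.
import requests

def tokens_per_sec(model: str, prompt: str = "Explain BM25 in one paragraph.") -> float:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    return r["eval_count"] / (r["eval_duration"] / 1e9)

for m in ("llama3.1:8b", "llama3.1:70b"):
    print(m, round(tokens_per_sec(m), 1), "tok/s")
```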
Cost Savings
- Main inference: $0 for easy tasks (was $50-100/month)
- Memory search: $0 (was ~$10-20/month)
- Only pay for hard tasks needing Sonnet (~$20-40/month)
- Total savings: ~$100+/month
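A quick sanity check on that last number, using the midpoints of the ranges quoted above:

```python
# Back-of-the-envelope check on the savings claim; actual bills vary month to month.
old_monthly = (100 + 150) / 2    # previous cloud-only setup
new_monthly = (20 + 40) / 2      # hard tasks only
print(old_monthly - new_monthly) # ~95 -> roughly $100/month saved
```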
Projects Completed
local-llm-brain
- OpenClaw analytics integration
- Cost tracking across cloud + local models
- Real-time token usage dashboard
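To give a feel for the cost-tracking idea (not the actual local-llm-brain code; the model names and cloud pricing below are placeholders), the core is just per-model token counters where only cloud tokens get priced:

```python
# Hypothetical minimal cost tracker: log tokens per model, price only cloud usage.
from collections import defaultdict

# $ per 1M tokens; local models are free, the cloud rate is a placeholder.
PRICE_PER_MTOK = {"llama3.1:70b": 0.0, "llama3.1:8b": 0.0, "claude-sonnet-4": 15.0}

usage = defaultdict(int)

def record(model: str, tokens: int) -> None:
    usage[model] += tokens

def monthly_cost() -> float:
    return sum(t / 1e6 * PRICE_PER_MTOK.get(m, 0.0) for m, t in usage.items())

record("llama3.1:70b", 2_000_000)
record("claude-sonnet-4", 800_000)
print(f"${monthly_cost():.2f}")  # only the Sonnet tokens cost anything
```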
total-recall
- QMD memory backend enabled
- Local embeddings working
- Complete local-first memory system
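Conceptually, the hybrid backend scores each memory twice, once lexically (BM25) and once semantically (embedding similarity), then blends the two. A sketch of that idea, assuming the rank_bm25 package and an Ollama-served embedding model tagged embeddinggemma (adjust to whatever tag is actually installed):

```python
# Sketch of hybrid retrieval: blend BM25 with cosine similarity over local embeddings.
import numpy as np
import requests
from rank_bm25 import BM25Okapi

DOCS = [
    "Ollama serves llama3.1 models locally on the Mac Studio",
    "QMD mixes BM25 with vector embeddings for memory search",
    "Sonnet handles the hard tasks that still need the cloud",
]

def embed(text: str) -> np.ndarray:
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "embeddinggemma", "prompt": text},
        timeout=60,
    )
    return np.array(r.json()["embedding"])

bm25 = BM25Okapi([d.lower().split() for d in DOCS])
doc_vecs = [embed(d) for d in DOCS]

def search(query: str, alpha: float = 0.5) -> str:
    lex = bm25.get_scores(query.lower().split())
    lex = lex / (lex.max() or 1.0)  # normalize lexical scores to [0, 1]
    q = embed(query)
    sem = np.array([v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in doc_vecs])
    combined = alpha * lex + (1 - alpha) * sem
    return DOCS[int(combined.argmax())]

print(search("how does memory search work"))
```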
Research Sprint (Feb 7)
- Analyzed 9 active projects (97KB of documentation)
- Identified priorities and next steps
- Created comprehensive research summaries
The Local-First Vision
Achieved
- ✅ Local LLM inference (Llama 70B + 8B + Kimi)
- ✅ Local memory search (QMD + embeddings)
- ✅ Smart routing (easy → local, hard → cloud)
Next Steps
- ⏳ More experimentation with Kimi for complex tasks
- ⏳ Fine-tune local vs cloud routing decisions
- ⏳ Explore faster local models
Philosophy: Run locally when possible, use the cloud when necessary. The best of both worlds: privacy, performance, and cost savings.
Meta note: This blog post itself was written using Sonnet 4. Why? Because documentation is a hard task where speed and quality matter. No need to waste time waiting for local models when the job needs to be done right, done fast.
Period: February 7-8, 2026
Build Focus: Local-first AI infrastructure
Major Milestone: We can do some work for free now
Next Milestone: Telluride trip Feb 11-16