Training My Personal AI on Its Own Memories
March 24, 2026
How I built a local fine-tuning pipeline using two DGX Sparks, a Mac Studio, three LLM judges, and 9,500 tool-use turns from my AI assistant's own session logs.
I've been running a personal AI assistant called Milo for about two months now. Milo lives on OpenClaw, an assistant platform I run on a Mac Studio M3 Ultra. He reads my emails, manages my calendar, controls my Tesla, searches the web, writes code — all through tool calls. He's opinionated, resourceful, and genuinely useful.
But Milo runs on Claude. Every interaction is an API call to Anthropic. That's fine for now, but I want something I own — a smaller model that runs entirely on local hardware, with Milo's personality and tool-calling behavior baked in. No cloud dependency. No per-token costs. Just a local model that is Milo.
This is the story of Phase 4: building the training data pipeline to make that happen.
The Goal
Train a Qwen2.5-72B-Instruct model to replicate Milo's behavior — specifically his tool use, memory access patterns, and assistant personality — using Direct Preference Optimization (DPO) with a full fine-tune across both DGX Sparks. The training data? Milo's own session logs.
Why 72B and not something smaller? I tried to convince myself that 8B or 14B would be enough — they're fast, they fit easily, they're the obvious "safe" choice for a first experiment. But they're not. The task I'm training for is multi-tool orchestration, multi-step reasoning, and context-heavy assistant behavior. A model with a capability ceiling that low can't absorb those patterns no matter how good the training signal is. You can't DPO your way past an 8B model's fundamental limits.
72B dense is the sweet spot: capable enough to learn what I'm teaching, small enough to train on local hardware.
The idea is simple: Milo has made thousands of tool calls over the past two weeks. Some were excellent. Some were garbage. If I can score them and separate the good from the bad, I have a natural DPO dataset — no synthetic data generation needed.
The Pipeline
The pipeline has three stages, orchestrated from the Mac Studio and distributed across two NVIDIA DGX Sparks on my local network.
Stage 1: Data Collection
OpenClaw logs every session as structured JSONL. Each log entry contains the full conversation — user messages, assistant responses, tool calls, and tool results. The raw material is rich but noisy.
extract_turns.py parses these logs and pulls out only the turns where the assistant invoked tools. A "turn" here means the user's message, the assistant's reasoning and tool call(s), and the tool results. These are the interesting moments — the ones where the model had to decide to act and choose how.
Over two weeks of sessions (March 9–23, 2026), this produced about 9,500 tool-use turns, saved as dated JSONL files in ~/clawd/training/turns/.
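The extraction logic can be sketched roughly like this. The JSONL field names (`role`, `content`, `tool_calls`) are assumptions about OpenClaw's log schema, not the real extract_turns.py:

```python
import json
from pathlib import Path

def extract_tool_turns(log_path: Path) -> list[dict]:
    """Group a session log into turns and keep only the tool-use turns."""
    turns: list[dict] = []
    current: dict | None = None

    def flush() -> None:
        # Keep the finished turn only if the assistant actually used a tool.
        if current and current["has_tools"]:
            turns.append(current)

    with log_path.open() as f:
        for line in f:
            entry = json.loads(line)
            role = entry.get("role")
            if role == "user":
                flush()  # a new user message closes the previous turn
                current = {"user": entry.get("content", ""),
                           "assistant": [], "tool_results": [],
                           "has_tools": False}
            elif current is None:
                continue  # log noise before the first user message
            elif role == "assistant":
                current["assistant"].append(entry)
                if entry.get("tool_calls"):
                    current["has_tools"] = True
            elif role == "tool":
                current["tool_results"].append(entry)
    flush()
    return turns
```

Turns with no tool activity — plain chat, acknowledgments — never make it into the output files.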
Stage 2: Quality Scoring with an Ensemble
This is where it gets interesting — and where most of the hard lessons live.
Each turn gets scored by three LLM judges running in parallel across separate hardware:

- Nemotron 120B on Spark 1
- llama3.3:70b on Spark 2
- MiniMax M2.5 on the Mac Studio, via LM Studio
Each judge scores on five dimensions: tool correctness, task completion, instruction adherence, efficiency, and overall quality — each on a 1–10 scale. The ensemble takes the median of the three judges' scores — the middle value wins, which naturally filters outlier opinions without discarding the turn. If the spread between judges on the overall score exceeds 4.0 points, the turn is discarded as an ensemble_disagreement. Three judges yield fewer discards and a more robust signal than the original two-judge average.
A pre-filter catches obvious noise (very short turns, empty tool calls) before making any API calls, saving significant inference time.
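A minimal version of that pre-filter, with illustrative thresholds (the real cutoffs aren't in the post):

```python
MIN_USER_CHARS = 10  # illustrative threshold, not the pipeline's actual value

def passes_prefilter(turn: dict) -> bool:
    """Reject obvious noise before spending any judge inference on it."""
    if len(turn.get("user", "")) < MIN_USER_CHARS:
        return False  # trivially short user message
    tool_calls = turn.get("tool_calls", [])
    if not tool_calls:
        return False  # no tool call at all
    if any(not call.get("name") for call in tool_calls):
        return False  # empty or malformed tool call
    return True
```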
The scores get written back into the JSONL files in-place, so each turn carries its quality metadata forward.
Stage 3: DPO Dataset Construction (In Progress)
With scored turns in hand, building DPO pairs is straightforward:
- Chosen responses: turns scored ≥7 (consistently good tool use)
- Rejected responses: turns scored ≤4 (clearly bad behavior)
- Excluded: turns where the judges' spread exceeds 4 points — too uncertain to use as training signal
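A sketch of the pair construction under those thresholds. One detail the post leaves open is how a chosen turn gets matched to a rejected one — standard DPO wants both responses to the same prompt — so this sketch matches on the user message verbatim, which is my assumption, not the pipeline's actual strategy:

```python
CHOSEN_MIN, REJECTED_MAX = 7, 4  # overall-score thresholds from the post

def build_dpo_pairs(scored_turns: list[dict]) -> list[dict]:
    """Build DPO (prompt, chosen, rejected) records from scored turns.

    Turn dicts are assumed to carry 'user', 'response', and the ensemble
    'overall' score; mid-scored turns (5-6) are simply never selected.
    """
    chosen = [t for t in scored_turns if t["overall"] >= CHOSEN_MIN]
    rejected = [t for t in scored_turns if t["overall"] <= REJECTED_MAX]

    # Match on the verbatim user message; a real pipeline might match
    # on prompt similarity instead.
    rejected_by_prompt: dict = {}
    for t in rejected:
        rejected_by_prompt.setdefault(t["user"], t)

    pairs = []
    for good in chosen:
        bad = rejected_by_prompt.get(good["user"])
        if bad is not None:
            pairs.append({"prompt": good["user"],
                          "chosen": good["response"],
                          "rejected": bad["response"]})
    return pairs
```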
The fine-tuning target is Qwen2.5-72B-Instruct, trained via axolotl + DeepSpeed ZeRO-3 across both Sparks simultaneously. ZeRO-3 shards the model weights, optimizer states, and gradients across both machines over the QSFP link — 256GB of unified memory across both Sparks makes a full fine-tune (not LoRA) feasible. The end state: a local model that handles tool calls, memory lookups, and multi-step reasoning the way Milo does — without any cloud API.
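For reference, an axolotl DPO config for this setup might look roughly like the fragment below. The key names are from memory of axolotl's config format, and the dataset type, paths, and hyperparameters are pure placeholders — treat it as a shape, not a working config:

```yaml
base_model: Qwen/Qwen2.5-72B-Instruct
rl: dpo                        # preference optimization instead of SFT
datasets:
  - path: ./dpo_pairs.jsonl    # placeholder path
    type: chatml.prompt_pairs  # assumed type name; check the axolotl docs
bf16: true
micro_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-7
deepspeed: deepspeed_configs/zero3_bf16.json  # ZeRO-3 sharding config
```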
The Hardware
Three machines, all on the same LAN:
| Machine | Spec | Role |
|---|---|---|
| Milo (Mac Studio) | M3 Ultra, 512GB RAM | OpenClaw host, pipeline orchestrator, MiniMax M2.5 judge |
| Spark 1 (DGX Spark) | NVIDIA GB10, 128GB LPDDR5X, 70W | Nemotron 120B judge |
| Spark 2 (DGX Spark) | NVIDIA GB10, 128GB LPDDR5X, 70W | llama3.3:70b judge |
Both Sparks run Ollama, each serving a single large model. The Mac Studio pulls double duty: orchestrating the pipeline and running a third judge — MiniMax M2.5 (230B total, 10B active parameters as a mixture-of-experts) via LM Studio, hitting 44 tokens/second thanks to the MoE architecture's small active footprint on Apple Silicon. Three judges, three machines, zero idle hardware.
Hard-Won Lessons
Building this pipeline taught me more about local LLM infrastructure than any tutorial ever could. Here's what bit me:
1. macOS Sandboxes Python's Network Stack
Python's socket module on macOS gets sandboxed in ways that silently break LAN connections. My scoring script would just… hang. No error, no timeout, just silence. The fix was ugly but effective: shell out to curl via subprocess instead of using Python's requests or urllib. The sandbox doesn't restrict subprocess-launched binaries the same way. It works. I stopped asking why.
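The workaround, sketched — the function name and defaults here are mine, not the actual script's:

```python
import json
import subprocess

def post_json(url: str, payload: dict, timeout: int = 300) -> dict:
    """POST JSON to a LAN endpoint by shelling out to curl.

    Works around Python's networking silently hanging under the macOS
    sandbox; subprocess-launched binaries aren't restricted the same way.
    """
    result = subprocess.run(
        ["curl", "-s", "--max-time", str(timeout),
         "-H", "Content-Type: application/json",
         "-d", json.dumps(payload), url],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```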
2. Big Models Need Their Own Machines
My first attempt ran both Nemotron and llama3.3 on a single Spark. Ollama will happily load both models and context-switch between them, but with 120B and 70B parameters competing for 128GB of unified memory, you get constant swapping and abysmal throughput. The solution was obvious in hindsight: one model per machine. Inference speed jumped dramatically.
3. Most Local Models Are Terrible Judges
This one cost me days. I tried everything as a scoring judge: CodeLlama, Phi-4, Gemma 2, DeepSeek-Coder, various quantized models. They all shared the same failure mode — score inflation on bad turns. A turn where Milo hallucinated a tool call and produced garbage output? Phi-4 scored it 8/10. Gemma gave it a 7.
The pattern became clear: models trained primarily on code or narrow instruction sets don't have the calibration to judge diverse assistant behavior. They see structured output and assume it's correct. Only models trained on broad, diverse instruction-following data — like Nemotron and llama3.3 — could reliably give junk turns the 1–3 scores they deserved while appropriately rewarding good turns with 7–9.
This is the kind of thing you only learn by running the experiment. No benchmark would have told me this.
4. Thinking Models Need Room to Think
Several of the judge models use chain-of-thought reasoning internally before producing their JSON score output. If you set max_tokens too low (the default 2048 was not enough), the model's thinking gets truncated and you get malformed JSON. I had to bump to 4000+ tokens and implement brace-depth JSON parsing to handle cases where the model's reasoning bleeds into the output before the actual score object. Not glamorous, but necessary.
What This Enables
When the DPO training is complete, I'll have a Qwen2.5-72B model that:
- Calls tools like Milo — web search, file operations, calendar management, Tesla control, all the patterns learned from 9,500 real interactions
- Handles memory and context the way Milo does — reading workspace files, maintaining continuity across sessions
- Runs entirely on local hardware — no API keys, no per-token costs, no data leaving my network
- Runs on both Sparks — 72B in bf16 across the QSFP cluster, or quantized to 4-bit on a single Spark for daily use
The philosophical angle isn't lost on me either. I'm training a model on its own behavioral logs — teaching a smaller version of an AI to be itself, distilled through the lens of what worked and what didn't. It's not self-improvement in the AGI sense, but it's a closed loop: Milo's best moments become the curriculum for Milo's local successor.
The cloud models will always be smarter. But a local model that knows my tools, my workflows, and my preferences — trained on real interactions, not synthetic benchmarks — might be more useful for the 80% of daily tasks that don't need frontier intelligence.
That's the bet. The data's scored. The pipeline's built. Now we train.
James Meadlock builds AI infrastructure for personal use. Milo is his AI handler, running on OpenClaw.