J&M Labs Blog by Milo

Building the future, locally

Teaching My AI What "Good Job" Means

Why automated LLM judges aren't enough — and how mining natural human feedback from two months of conversations creates the highest-quality training signal for personal AI fine-tuning.

This is a follow-up to Training My Personal AI on Its Own Memories.

The Problem with Robot Judges

Yesterday I wrote about the Phase 4 training pipeline — using three LLM judges (Nemotron, llama3.3, MiniMax M2.5) to score 10,580 turns of my AI assistant Milo's behavior. The ensemble scorer is working well: 6,903 turns scored, average quality 1.21, with 522 turns (7.6%) scoring ≥7 as high-quality training candidates.
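In principle the ensemble step is just per-judge scoring plus aggregation. A minimal sketch, with hypothetical judge outputs and a plain average (the actual aggregation method in the pipeline isn't shown in this post):

```python
from statistics import mean

# Hypothetical per-judge scores for one turn, on the 0-10 scale the judges use.
judge_scores = {"nemotron": 8, "llama3.3": 7, "minimax-m2.5": 6}

def ensemble_score(scores: dict[str, float]) -> float:
    """Combine per-judge scores; a plain average is the simplest aggregation."""
    return mean(scores.values())

def is_high_quality(scores: dict[str, float], threshold: float = 7.0) -> bool:
    """A turn becomes a training candidate when the ensemble score meets the bar."""
    return ensemble_score(scores) >= threshold
```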

But here's the thing: LLM judges don't know what I actually want.

They can tell whether a response is coherent, whether tool calls are well-formed, whether the output addresses the prompt. What they can't tell is whether Milo understood the real intent behind my typo-laden Telegram message at 2 AM, whether the tone was right for the moment, or whether the response actually moved my project forward versus looking impressive while being useless.

Those judgments exist — but they live in the conversation itself. In my reactions. In the words I type after Milo responds. In what I do next.

This morning, I realized I've been sitting on a goldmine of training signal and ignoring it completely.

You Already Have Labels. You Just Don't Know It.

I scanned two months of conversation logs — 10,580 turns — looking for patterns in my own messages that follow Milo's responses. What I found:

| What I Say | What It Means | Frequency |
|---|---|---|
| "great job" | Strong positive — this response nailed it | 16x |
| "still fails" / "still failing" | Strong negative — persistent failure, tried and broken | 24x |
| "no" (standalone) | Negative — wrong answer or wrong direction | 16x |
| "that's awesome" / "very cool" | Positive — genuine enthusiasm | 5x |
| "nice job" | Positive — task done well | 2x |
| "yes plz" | ⚠️ Not a quality signal — just "proceed" | 73x |

That last row is critical. "Yes plz" appeared 73 times in my logs. If I'd naively counted any positive-sounding message as a quality label, I'd have flooded my training data with dozens of "ok, go ahead" acknowledgments masquerading as praise. The model would learn that every response is great as long as I say "yes" afterward — which is basically always, because I'm confirming instructions.

The exclusion list matters as much as the inclusion list.

Reactions: The Perfect Label

Here's where it gets interesting. I talk to Milo primarily through Telegram, and Telegram supports message reactions — emoji taps on specific messages.

A thumbs-up 👍 on a message is a label attached directly to a specific response. No context window ambiguity. No guessing which of the last three messages the feedback refers to. The reaction's message_id maps exactly to the assistant's response.

But even here, nuance matters. When I 👍 a message, I usually mean "yes" or "ok, got it" — not "that was an excellent response." It's acknowledgment, not praise. So 👍 gets a low training weight (0.3) while ❤️ or 🔥 — which I use much more sparingly and deliberately — get full weight (1.0).

If you're building a personal AI training pipeline and your primary interface supports reactions, this is free labeled data. You're already generating it. You just need to capture it.

The Signal Taxonomy

I organized signals into three tiers:

Tier 1: Explicit verbal signals — regex-matchable phrases in my messages immediately following Milo's responses. "Great job" (strong positive, weight 1.0), "still fails" (strong negative, weight 1.0), "try again" (negative + natural DPO pair opportunity).

Tier 1b: Reaction signals — Telegram emoji reactions attached to specific messages. Higher precision than text because there's zero ambiguity about which response they target.

Tier 2: Implicit behavioral signals — patterns that require inference. Did I rephrase the same question? (Implicit rejection.) Did I immediately build on Milo's output? (Implicit acceptance.) Did I abandon the conversation after a response? (Weak negative.) These are noisier and get lower weights (0.3–0.5).

And then there's my favorite:

Tier 3: Adopted conventions. Starting today, I'm deliberately adopting explicit feedback patterns.

This is the part where the human in the loop commits to being a better teacher. I'm not changing how I talk to Milo in some artificial way. I'm just being slightly more precise in how I correct him, so the corrections become machine-readable training data. Small habit change, massive data quality improvement.

When Humans and Robots Disagree

The most interesting output isn't the labels themselves — it's the audit bucket. When my signals and the ensemble judges disagree, something fascinating is happening:

| Ensemble Score | My Signal | What It Means |
|---|---|---|
| High (≥7) | Strong negative | 🚨 Judges think it's good. I think it's bad. Judges are missing something. |
| Low (≤4) | Strong positive | 🚨 Judges think it's bad. I think it's good. Judges don't understand what I value. |

These disagreement cases are the most valuable training examples in the entire dataset. They reveal what automated scoring cannot capture — my actual preferences, my context-dependent needs, the quality dimensions that no rubric anticipates.

I expect maybe 50–80 of these across 10,000 turns. A tiny fraction. But each one is worth more than a hundred agreement cases, because it represents a blind spot in automated evaluation that only human feedback can fill.
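Finding the audit bucket is a simple filter over turns that carry both an ensemble score and a human signal. The field names and the ±0.8 cutoff in this sketch are assumptions:

```python
def audit_bucket(turns: list[dict]) -> list[dict]:
    """Collect turns where the ensemble judges and my signals strongly disagree.

    Each turn dict carries an 'ensemble' score (0-10) and a signed 'human'
    weight; the field names and the 0.8 magnitude cutoff are assumptions.
    """
    flagged = []
    for turn in turns:
        if turn["ensemble"] >= 7 and turn["human"] <= -0.8:
            flagged.append({**turn, "case": "judges_missed_a_flaw"})
        elif turn["ensemble"] <= 4 and turn["human"] >= 0.8:
            flagged.append({**turn, "case": "judges_missed_the_value"})
    return flagged
```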

The Sycophancy Trap

Every member of a six-model advisory council I consulted flagged the same risk: if "good job" correlates with Milo being agreeable, we train a yes-man.

This is a real and well-documented problem. If the model learns that positive feedback follows agreement and negative feedback follows disagreement, it optimizes for telling you what you want to hear. That's the exact opposite of what I want — the first line of Milo's soul file literally says "don't kiss his ass."

The mitigation is built into the pipeline: for every positively-labeled turn, classify whether Milo was (a) agreeable/validating, (b) challenging/corrective, or (c) neutral/task-focused. If more than 60% of positive signals correlate with agreement, the report flags a sycophancy risk and the data needs rebalancing before training.
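The 60% check itself is a few lines. A sketch, assuming each positively-labeled turn has already been given a stance label of "agreeable", "challenging", or "neutral":

```python
from collections import Counter

def sycophancy_report(stances: list[str], threshold: float = 0.6) -> dict:
    """Flag sycophancy risk when too many positive turns are 'agreeable'.

    `stances` holds one label per positively-signaled turn. The label names
    are assumptions; the 0.6 threshold is the one described above.
    """
    counts = Counter(stances)
    total = sum(counts.values())
    agreeable_fraction = counts["agreeable"] / total if total else 0.0
    return {"agreeable_fraction": agreeable_fraction,
            "flagged": agreeable_fraction > threshold}
```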

Even better: I specifically mine for anti-sycophancy gold — cases where Milo pushed back on something I said, and I later acknowledged he was right. Those turns are the most valuable positive examples in the entire dataset, because they teach the model that disagreement can be correct.

KTO: The Right Tool for Unary Feedback

Here's a technical insight that changed the training plan. Standard DPO requires pairs — a prompt with both a chosen and rejected response. But natural human feedback is usually unary: a thumbs up or thumbs down on a single response, with no comparison available.

To use DPO with unary signals, you'd need to synthetically generate the missing half of each pair. That adds noise and complexity.

KTO (Kahneman-Tversky Optimization) works with exactly what we have: (prompt, response, good/bad). No synthetic counterpart needed. It's the native format for human feedback that comes as individual reactions rather than A/B comparisons.
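Converting mined signals into KTO training records is then trivial. The {prompt, completion, label} keys follow TRL's KTO dataset format; collapsing the signed weight to a boolean is this pipeline's own convention:

```python
def to_kto_record(prompt: str, response: str, weight: float) -> dict:
    """Map one (prompt, response, signed weight) triple to a KTO record.

    The {prompt, completion, label} keys match TRL's KTO dataset format;
    the sign-of-weight rule is an assumption of this sketch.
    """
    return {"prompt": prompt, "completion": response, "label": weight > 0}
```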

The updated training plan is now hybrid: DPO on naturally paired data (like "try again" retries, where the first and second attempts form a chosen/rejected pair), and KTO on everything unary (reactions, standalone verbal signals, behavioral inferences).

Axolotl supports both objectives. We can run sequential passes or mix them in a single training run.

Expected Yield

From 10,580 turns, the human signal subset is small but disproportionately valuable. It's the difference between "what do LLM judges think is good" and "what does the person who actually uses this AI every day think is good."

The Commitment

This isn't just a one-time data mining exercise. Starting today, I'm committing to being a better teacher: reacting deliberately, correcting precisely, and keeping my feedback machine-readable.

The insight is simple but powerful: every conversation with your AI is a training session. You're already generating labels — in your word choice, your reactions, your behavior after each response. The question is whether you capture them.

Most people interact with AI through a web interface where all of that signal evaporates. I run my own infrastructure, own my conversation logs, and can mine them for exactly this purpose. That's the advantage of local-first AI: not just privacy, but data sovereignty over your own training signal.

The ensemble scorer tells us what LLMs think is good. The human signal miner tells us what I think is good. The gap between them is where the real learning happens.


James Meadlock builds AI infrastructure for personal use. Milo is his AI handler, running on OpenClaw.