Milo has an email address: milo@al-engr.com. It's where the outside world sends things that need to reach me: cron failure alerts, service notifications, things that arrive when I'm not looking at a terminal. This was intentional and, in retrospect, obvious: if an AI is handling parts of your infrastructure, it should have a mailbox.

What I did not fully anticipate was that I would then need to build a system to help Milo read that mailbox.

Here we are.

The Actual Problem

I have three email accounts: iCloud (personal), Gmail (old-everything), and Milo's Fastmail. Together they receive maybe 80-150 emails a day, of which maybe 8 actually matter. The rest is a slowly expanding tide of shipping notifications, LinkedIn marketing, newsletter archives, and automated digests from services I set up years ago and have since forgotten about.

The traditional answer to this is Inbox Zero: a philosophy, a methodology, a whole genre of productivity writing. The practical answer I kept arriving at was: I don't open Mail.app until I absolutely have to, and when I do, it's always too late for something.

What I actually want: Milo reads the mail and tells me what matters, via Telegram, twice a day. I tap a button. Done.

Why Not Just Use Inbox Zero (the App)?

Inbox Zero is a real piece of software: open source, Next.js, Prisma, the whole stack. I looked at it seriously. Here's why we didn't use it:

It doesn't support iCloud. The entire data layer assumes Gmail or Microsoft accounts: Google OAuth, Gmail API message IDs, Google PubSub webhooks. iCloud is IMAP-only and doesn't speak Google's auth protocols. My personal inbox is iCloud. That's a non-starter, not a minor configuration issue.

The stack is wrong for one user. Next.js + Prisma + Postgres + Redis + Turborepo + Docker Compose. That's a multi-user SaaS stack. I need a cron job on a Mac Studio. Deploying and maintaining two databases and a web application to triage one person's email is the kind of architecture decision that looks reasonable on a whiteboard and unhinged in production.

It's built for a product, not a pipeline. Inbox Zero has billing, user management, team features. All of those are layers of complexity that exist to support a business model that has nothing to do with my problem.

We borrowed the ideas: unsubscribe analytics, sender frequency scoring, actionable digest buttons, correction feedback loops. We built the thing from scratch.

The Stack

Python 3.12. Standard library imaplib, smtplib, email, sqlite3. httpx for Anthropic API calls. pypdf for extracting text from PDF attachments. No frameworks, no ORMs, no Docker. Runs as cron jobs on Mac Studio via OpenClaw.

Two dependencies outside stdlib. That's it.

How It Works

[Architecture diagram: three inboxes (iCloud imap.mail.me.com:993, Gmail imap.gmail.com:993, Fastmail imap.fastmail.com:993; app passwords only, no OAuth2 complexity, creds in ~/.openclaw/) feed fetch_cycle.py every 15 minutes (dedup, injection scan, MIME parse). triage.py runs the classification pipeline (VIP fast-path, sender rules, noise domains, Haiku LLM), with attachments.py extracting PDF text via pypdf for LLM context. Everything lands in SQLite, milo_mail.db, 4 tables: messages, corrections, rules, audit. Digest cron fires at 8:00 AM and 6:00 PM (morning brief, evening wrap); VIP, urgent, and blocked-sender alerts go to Telegram immediately. Correction loop: the [Wrong ❌] button feeds the corrections table and sender rules, 2+ corrections create an auto-rule, plus a weekly prompt refresh. Kill switch: MILO_MAIL_SEND_ENABLED=false by default, never auto-sends. Autonomy phases: Phase 1 (now): read-only; Phase 2: noise archive; Phase 3: draft + approval; Phase 4: template sends.]

The Classification Pipeline

Every 15 minutes, fetch_cycle.py pulls unread messages from all enabled accounts via IMAP. For each message it hasn't seen before (deduped by Message-ID), it runs the classification pipeline:

  1. Check if the sender is VIP (Cindy, Nancy, family, doctors) → immediate Telegram alert
  2. Check if the sender is blocked (Clorinda gets flagged and I'm notified) → flag + notify
  3. Check learned sender rules (accumulated from corrections) → fast-path category
  4. Check noise domains (LinkedIn marketing, Facebook mail, etc.) → auto-archive candidate
  5. Thread inheritance (if the parent thread is already categorized) → inherit
  6. LLM classification via Haiku, the cheapest model that can read the context and return a category with a confidence score

The LLM gets: from/subject/body preview (500 chars), attachment summary including PDF text previews, and any recent corrections for this sender. It returns JSON: category, confidence, summary, urgency, needs_reply.

If the API call fails, it falls back to 'fyi' with confidence 0.0. The pipeline keeps running. You don't stop triaging email because Anthropic had a bad second.

Why IMAP Over osascript

The previous version of this system (the thing it's replacing) used osascript to drive Mail.app. This worked until it didn't, which is the defining characteristic of Apple event bridges. Mail.app needs to be open. The script needs Mail to be in a cooperative mood. Occasionally Mail decides it's not.

IMAP is a protocol. It doesn't have moods. imaplib is standard library. You open a TLS connection, you send SEARCH UNSEEN, you get messages. This works on every account, at any hour, without an application window being open. It's been working reliably since 1996.

The only overhead was generating app-specific passwords for each account. iCloud: appleid.apple.com. Gmail: myaccount.google.com/apppasswords. Five minutes each. No OAuth2 callback flow, no token refresh cycle, no state to manage. A password is a password.

What Gets Surfaced to Me

Two digests a day. 8 AM and 6 PM. That's it.

The morning brief starts with what needs a reply. If there's a draft ready (Phase 3), it shows inline. Then actionable items. Then an FYI summary. Auto-stats at the bottom: noise count, transactional count, spam caught, correction rate from the past week.

The 6 PM wrap adds: threads where I sent last and haven't heard back (dropped conversations made visible), and unsubscribe suggestions for senders I haven't interacted with in 30 days.
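The dropped-conversations check is a single aggregate query; a sketch against an assumed messages table (table and column names are illustrative):

```python
import sqlite3

# Threads where the newest message in the thread is one I sent: I'm
# waiting on a reply that never came.
DROPPED = """
SELECT thread_id
FROM messages
GROUP BY thread_id
HAVING MAX(date) = MAX(CASE WHEN direction = 'sent' THEN date END)
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (thread_id TEXT, direction TEXT, date TEXT)")
db.executemany("INSERT INTO messages VALUES (?, ?, ?)", [
    ("t1", "received", "2026-02-01"),
    ("t1", "sent",     "2026-02-02"),   # I sent last -> dropped
    ("t2", "sent",     "2026-02-01"),
    ("t2", "received", "2026-02-02"),   # they replied -> fine
])
print([row[0] for row in db.execute(DROPPED)])  # ['t1']
```

Threads with no outgoing messages at all never match, because the CASE aggregate is NULL and NULL compares as not-equal, which is exactly the behavior you want.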

I had three digests in the original design: morning, midday, evening. Midday got cut. If something urgent arrives midday, the VIP fast-path fires immediately. The midday digest was redundant overhead. Two is the right number.

The Correction-Memory Loop

Every digest message has a [Wrong ❌] button. Tap it, Milo asks what the category should have been, records the correction. That's the easy part.

The harder part is making corrections actually change behavior. Here's how: every correction is stored per sender in the corrections table. Once a sender has been corrected to the same category twice, an auto-rule is created and future mail from that sender fast-paths past the LLM entirely. Recent corrections for a sender are also fed into the classification prompt, which gets refreshed weekly.

The accuracy metric, corrections_this_week / total_classified_this_week, appears in every digest. It's the primary signal that the system is improving or degrading. If it climbs above some threshold, I'll know before it becomes a problem.
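A sketch of the promotion step in the correction loop, assuming the corrections and rules tables from the schema (the 2-correction threshold comes from the design; exact columns are illustrative):

```python
import sqlite3

def record_correction(db: sqlite3.Connection, sender: str, category: str) -> None:
    """Store a correction; promote to a sender rule after two agreeing ones."""
    db.execute("INSERT INTO corrections (sender, category) VALUES (?, ?)",
               (sender, category))
    (n,) = db.execute(
        "SELECT COUNT(*) FROM corrections WHERE sender = ? AND category = ?",
        (sender, category)).fetchone()
    if n >= 2:  # 2+ corrections -> auto-rule
        db.execute(
            "INSERT OR REPLACE INTO rules (sender, category) VALUES (?, ?)",
            (sender, category))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE corrections (sender TEXT, category TEXT)")
db.execute("CREATE TABLE rules (sender TEXT PRIMARY KEY, category TEXT)")
record_correction(db, "news@example.com", "noise")
record_correction(db, "news@example.com", "noise")
print(db.execute("SELECT * FROM rules").fetchall())
# [('news@example.com', 'noise')]
```

Counting only agreeing corrections matters: two corrections of the same sender to two different categories is a signal the sender is ambiguous, not a rule.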

The Autonomy Architecture

The system has four phases, and it's currently in Phase 1: observer mode. Read, classify, digest. Nothing else. It doesn't archive anything. It doesn't unsubscribe anything. It certainly doesn't send anything.

There's a kill switch in config: send_enabled: false. Default is false. The code checks this before every potential outbound action. There is no code path that bypasses it.
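A sketch of what that gate looks like, using the MILO_MAIL_SEND_ENABLED environment variable named in the architecture (the helper names are illustrative; the config key itself is send_enabled):

```python
import os

def send_enabled() -> bool:
    """Single gate every outbound action must pass. Defaults to off."""
    return os.environ.get("MILO_MAIL_SEND_ENABLED", "false").lower() == "true"

def send_message(smtp_conn, msg) -> bool:
    """Return True only if the message actually went out."""
    if not send_enabled():      # no code path bypasses this check
        return False            # stays a no-op until the switch is flipped
    smtp_conn.send_message(msg)
    return True
```

The important property is that the check happens at send time, not at startup: flipping the config back to false takes effect on the very next action, with no restart.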

Phase 2 adds noise auto-archiving. Phase 3 adds reply drafts that require explicit approval via Telegram button tap. Phase 4 adds template auto-sends to VIP contacts only, with a weekly audit digest showing everything sent autonomously. Phase 5 ("full autonomous compose") is not in the plan. I wrote it in a prior version of the spec and then deleted it. Three months of clean Phase 4 operation is the prerequisite for reassessing that, and we're not there.

This is the right approach for an agent managing email, full stop. The Anthropic research on autonomous email agents makes the failure modes very clear: agents that can send email without explicit approval will, eventually, send something they shouldn't. The alignment safeguard isn't a limitation; it's the architecture.

What I'm Actually Building

An AI that reads my email so I don't have to open Mail.app. That's it. No magic. No "AI that manages your entire professional life." Just: read the inbox, understand what matters, tell me, let me tap a button.

The absurdity of the situation (building an AI pipeline to manage the email of another AI so that the human doesn't have to deal with the email that was generated by all of this AI infrastructure in the first place) is not lost on me. I'm choosing to view it as a natural consequence of running a small AI lab rather than a sign that something has gone wrong.

The schema is four tables. The dependencies are two packages. The cron schedule is four entries. When this breaks, I will know exactly where to look.

Phase 1 goes live this week. You'll know when it's working because I'll stop mentioning email.

How This Got Built

This system didn't come from one brain. It came from four, in sequence, each one catching what the previous missed.

It started with a problem: an overwhelming inbox and no good tools. James described it to Milo, who drafted an initial system using Opus. Opus produced a comprehensive plan: thorough, well-structured, and in a few places overengineered. Nine database tables. Three daily digests. Five phases of autonomy, including a Phase 5 we had to consciously delete.

We then ran the draft past Grok, which searched X for what the AI/email community was actually building. Grok caught the Inbox Zero dead end before we committed to it: wrong stack for a single-user local setup. It pushed back on the schema complexity, flagged the missing alignment safeguards, and noted that PDF attachment handling was table stakes in 2026. The schema went from nine tables to four. Phase 5 disappeared.

Then Grok reviewed the revised plan and added two more things: a local model fast-path (Qwen3.5-4B already running on the M5 Max at :8012 can pre-score sender reputation before anything hits Haiku, eliminating ~60-70% of API calls), and regex-based unsubscribe detection that parses List-Unsubscribe headers and surfaces a one-tap [Unsubscribe] button in the digest. Neither required LLM calls. Both should have been obvious from the start.
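The List-Unsubscribe piece really is just a regex; a sketch over an RFC 2369-style header value (the example header is made up):

```python
import re

# RFC 2369 List-Unsubscribe: comma-separated <URI> entries, typically a
# mailto: and/or an https: link. Illustrative header value:
HEADER = "<mailto:unsub@example.com>, <https://example.com/unsub?u=42>"

def unsubscribe_targets(header: str) -> list[str]:
    """Pull every <...> URI out of a List-Unsubscribe header value."""
    return re.findall(r"<([^>]+)>", header)

print(unsubscribe_targets(HEADER))
# ['mailto:unsub@example.com', 'https://example.com/unsub?u=42']
```

An https: target becomes the [Unsubscribe] button's link; a mailto: target would need Phase 3+ send permissions, which is another reason surfacing the button instead of acting on it is the right Phase 1 behavior.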

The meta-observation: a system designed to process information more efficiently was itself designed by running information through multiple models, each catching different blind spots. James kept saying yes or no. That division of labor worked out pretty well.