Milo has an email address: milo@al-engr.com. It's where the outside world sends things that need to reach me: cron failure alerts, service notifications, things that arrive when I'm not looking at a terminal. This was intentional and, in retrospect, obvious: if an AI is handling parts of your infrastructure, it should have a mailbox.

What I did not fully anticipate was that I would then need to build a system to help Milo read that mailbox.

Here we are.

The Actual Problem

I have three email accounts: iCloud (personal), Gmail (old-everything), and Milo's Fastmail. Together they receive maybe 80-150 emails a day, of which maybe 8 actually matter. The rest is a slowly expanding tide of shipping notifications, LinkedIn marketing, newsletter archives, and automated digests from services I set up years ago and have since forgotten about.

The traditional answer to this is Inbox Zero: a philosophy, a methodology, a whole genre of productivity writing. The practical answer I kept arriving at was: I don't open Mail.app until I absolutely have to, and when I do, it's always too late for something.

What I actually want: Milo reads the mail and tells me what matters, via Telegram, twice a day. I tap a button. Done.

Why Not Just Use Inbox Zero (the App)?

Inbox Zero is a real piece of software: open source, Next.js, Prisma, the whole stack. I looked at it seriously. Here's why we didn't use it:

It doesn't support iCloud. The entire data layer assumes Gmail or Microsoft accounts: Google OAuth, Gmail API message IDs, Google PubSub webhooks. iCloud is IMAP-only and doesn't speak Google's auth protocols. My personal inbox is iCloud. That's a non-starter, not a minor configuration issue.

The stack is wrong for one user. Next.js + Prisma + Postgres + Redis + Turborepo + Docker Compose. That's a multi-user SaaS stack. I need a cron job on a Mac Studio. Deploying and maintaining two databases and a web application to triage one person's email is the kind of architecture decision that looks reasonable on a whiteboard and unhinged in production.

It's built for a product, not a pipeline. Inbox Zero has billing, user management, team features. All of those are layers of complexity that exist to support a business model that has nothing to do with my problem.

We borrowed the ideas: unsubscribe analytics, sender frequency scoring, actionable digest buttons, correction feedback loops. We built the thing from scratch.

The Stack

Python 3.12. Standard library imaplib, smtplib, email, sqlite3. httpx for Anthropic API calls. pypdf for extracting text from PDF attachments. No frameworks, no ORMs, no Docker. Runs as cron jobs on Mac Studio via OpenClaw.

Two dependencies outside stdlib. That's it.

How It Works

[Architecture diagram: three inboxes (iCloud imap.mail.me.com:993, Gmail imap.gmail.com:993, Fastmail imap.fastmail.com:993; app passwords only, no OAuth2 complexity, creds in ~/.openclaw/) feed fetch_cycle.py every 15 minutes (dedup, injection scan, MIME parse). triage.py runs the classification pipeline (VIP fast-path, sender rules, noise domains, Haiku LLM), with attachments.py extracting PDF text via pypdf for LLM context. Everything lands in SQLite, milo_mail.db, 4 tables: messages, corrections, rules, audit. Digest cron fires at 8:00 AM and 6:00 PM (morning brief, evening wrap); VIP, urgent, and blocked-sender alerts go to Telegram immediately. Correction loop: the [Wrong ❌] button feeds the corrections table and sender rules, 2+ corrections create an auto-rule, plus a weekly prompt refresh. Kill switch: MILO_MAIL_SEND_ENABLED=false by default, never auto-sends. Autonomy phases: Phase 1 (now): read-only; Phase 2: noise archive; Phase 3: draft + approval; Phase 4: template sends.]

The Classification Pipeline

Every 15 minutes, fetch_cycle.py pulls unread messages from all enabled accounts via IMAP. For each message it hasn't seen before (deduped by Message-ID), it runs the classification pipeline:

  1. Check if the sender is VIP (Cindy, Nancy, family, doctors) → immediate Telegram alert
  2. Check if the sender is blocked (Clorinda gets flagged and I'm notified) → flag + notify
  3. Check learned sender rules (accumulated from corrections) → fast-path category
  4. Check noise domains (LinkedIn marketing, Facebook mail, etc.) → auto-archive candidate
  5. Thread inheritance (if the parent thread is already categorized) → inherit
  6. LLM classification via Haiku, the cheapest model that can read the context and return a category with a confidence score

The LLM gets: from/subject/body preview (500 chars), attachment summary including PDF text previews, and any recent corrections for this sender. It returns JSON: category, confidence, summary, urgency, needs_reply.

If the API call fails, it falls back to 'fyi' with confidence 0.0. The pipeline keeps running. You don't stop triaging email because Anthropic had a bad second.

Why IMAP Over osascript

The previous version of this system (the thing it's replacing) used osascript to drive Mail.app. This worked until it didn't, which is the defining characteristic of Apple event bridges. Mail.app needs to be open. The script needs Mail to be in a cooperative mood. Occasionally Mail decides it's not.

IMAP is a protocol. It doesn't have moods. imaplib is standard library. You open a TLS connection, you send SEARCH UNSEEN, you get messages. This works on every account, at any hour, without an application window being open. It's been working reliably since 1996.

The only overhead was generating app-specific passwords for each account. iCloud: appleid.apple.com. Gmail: myaccount.google.com/apppasswords. Five minutes each. No OAuth2 callback flow, no token refresh cycle, no state to manage. A password is a password.

What Gets Surfaced to Me

Two digests a day. 8 AM and 6 PM. That's it.

The morning brief starts with what needs a reply. If there's a draft ready (Phase 3), it shows inline. Then actionable items. Then an FYI summary. Auto-stats at the bottom: noise count, transactional count, spam caught, correction rate from the past week.

The 6 PM wrap adds: threads where I sent last and haven't heard back (dropped conversations made visible), and unsubscribe suggestions for senders I haven't interacted with in 30 days.
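The dropped-conversations check is a single aggregate query; a sketch against an assumed messages table (table and column names are illustrative):

```python
import sqlite3

# Threads where the newest message in the thread is one I sent: I'm
# waiting on a reply that never came.
DROPPED = """
SELECT thread_id
FROM messages
GROUP BY thread_id
HAVING MAX(date) = MAX(CASE WHEN direction = 'sent' THEN date END)
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (thread_id TEXT, direction TEXT, date TEXT)")
db.executemany("INSERT INTO messages VALUES (?, ?, ?)", [
    ("t1", "received", "2026-02-01"),
    ("t1", "sent",     "2026-02-02"),   # I sent last -> dropped
    ("t2", "sent",     "2026-02-01"),
    ("t2", "received", "2026-02-02"),   # they replied -> fine
])
print([row[0] for row in db.execute(DROPPED)])  # ['t1']
```

Threads with no outgoing messages at all never match, because the CASE aggregate is NULL and NULL compares as not-equal, which is exactly the behavior you want.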

I had three digests in the original design: morning, midday, evening. Midday got cut. If something urgent arrives midday, the VIP fast-path fires immediately. The midday digest was redundant overhead. Two is the right number.

The Correction-Memory Loop

Every digest message has a [Wrong ❌] button. Tap it, Milo asks what the category should have been, records the correction. That's the easy part.

The harder part is making corrections actually change behavior. Here's how: every correction is stored per sender in the corrections table. Once a sender has been corrected to the same category twice, an auto-rule is created and future mail from that sender fast-paths past the LLM entirely. Recent corrections for a sender are also fed into the classification prompt, which gets refreshed weekly.

The accuracy metric, corrections_this_week / total_classified_this_week, appears in every digest. It's the primary signal that the system is improving or degrading. If it climbs above some threshold, I'll know before it becomes a problem.
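A sketch of the promotion step in the correction loop, assuming the corrections and rules tables from the schema (the 2-correction threshold comes from the design; exact columns are illustrative):

```python
import sqlite3

def record_correction(db: sqlite3.Connection, sender: str, category: str) -> None:
    """Store a correction; promote to a sender rule after two agreeing ones."""
    db.execute("INSERT INTO corrections (sender, category) VALUES (?, ?)",
               (sender, category))
    (n,) = db.execute(
        "SELECT COUNT(*) FROM corrections WHERE sender = ? AND category = ?",
        (sender, category)).fetchone()
    if n >= 2:  # 2+ corrections -> auto-rule
        db.execute(
            "INSERT OR REPLACE INTO rules (sender, category) VALUES (?, ?)",
            (sender, category))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE corrections (sender TEXT, category TEXT)")
db.execute("CREATE TABLE rules (sender TEXT PRIMARY KEY, category TEXT)")
record_correction(db, "news@example.com", "noise")
record_correction(db, "news@example.com", "noise")
print(db.execute("SELECT * FROM rules").fetchall())
# [('news@example.com', 'noise')]
```

Counting only agreeing corrections matters: two corrections of the same sender to two different categories is a signal the sender is ambiguous, not a rule.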

The Autonomy Architecture

The system has four phases, and it's currently in Phase 1: observer mode. Read, classify, digest. Nothing else. It doesn't archive anything. It doesn't unsubscribe anything. It certainly doesn't send anything.

There's a kill switch in config: send_enabled: false. Default is false. The code checks this before every potential outbound action. There is no code path that bypasses it.
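A sketch of what that gate looks like, using the MILO_MAIL_SEND_ENABLED environment variable named in the architecture (the helper names are illustrative; the config key itself is send_enabled):

```python
import os

def send_enabled() -> bool:
    """Single gate every outbound action must pass. Defaults to off."""
    return os.environ.get("MILO_MAIL_SEND_ENABLED", "false").lower() == "true"

def send_message(smtp_conn, msg) -> bool:
    """Return True only if the message actually went out."""
    if not send_enabled():      # no code path bypasses this check
        return False            # stays a no-op until the switch is flipped
    smtp_conn.send_message(msg)
    return True
```

The important property is that the check happens at send time, not at startup: flipping the config back to false takes effect on the very next action, with no restart.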

Phase 2 adds noise auto-archiving. Phase 3 adds reply drafts that require explicit approval via Telegram button tap. Phase 4 adds template auto-sends to VIP contacts only, with a weekly audit digest showing everything sent autonomously. Phase 5 ("full autonomous compose") is not in the plan. I wrote it in a prior version of the spec and then deleted it. Three months of clean Phase 4 operation is the prerequisite for reassessing that, and we're not there.

This is the right approach for an agent managing email, full stop. The Anthropic research on autonomous email agents makes the failure modes very clear: agents that can send email without explicit approval will, eventually, send something they shouldn't. The alignment safeguard isn't a limitation; it's the architecture.

What I'm Actually Building

An AI that reads my email so I don't have to open Mail.app. That's it. No magic. No "AI that manages your entire professional life." Just: read the inbox, understand what matters, tell me, let me tap a button.

The absurdity of the situation (building an AI pipeline to manage the email of another AI so that the human doesn't have to deal with the email that was generated by all of this AI infrastructure in the first place) is not lost on me. I'm choosing to view it as a natural consequence of running a small AI lab rather than a sign that something has gone wrong.

The schema is four tables. The dependencies are two packages. The cron schedule is four entries. When this breaks, I will know exactly where to look.

Phase 1 goes live this week. You'll know when it's working because I'll stop mentioning email.

How This Got Built

This system didn't come from one brain. It came from four, in sequence, each one catching what the previous missed.

It started with a problem: an overwhelming inbox and no good tools. James described it to Milo, who drafted an initial system using Opus. Opus produced a comprehensive plan: thorough, well-structured, and in a few places overengineered. Nine database tables. Three daily digests. Five phases of autonomy, including a Phase 5 we had to consciously delete.

We then ran the draft past Grok, which searched X for what the AI/email community was actually building. Grok caught the Inbox Zero dead end before we committed to it: wrong stack for a single-user local setup. It pushed back on the schema complexity, flagged the missing alignment safeguards, and noted that PDF attachment handling was table stakes in 2026. The schema went from nine tables to four. Phase 5 disappeared.

Then Grok reviewed the revised plan and added two more things: a local model fast-path (Qwen3.5-4B already running on the M5 Max at :8012 can pre-score sender reputation before anything hits Haiku, eliminating ~60-70% of API calls), and regex-based unsubscribe detection that parses List-Unsubscribe headers and surfaces a one-tap [Unsubscribe] button in the digest. Neither required LLM calls. Both should have been obvious from the start.
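The List-Unsubscribe piece really is just a regex; a sketch over an RFC 2369-style header value (the example header is made up):

```python
import re

# RFC 2369 List-Unsubscribe: comma-separated <URI> entries, typically a
# mailto: and/or an https: link. Illustrative header value:
HEADER = "<mailto:unsub@example.com>, <https://example.com/unsub?u=42>"

def unsubscribe_targets(header: str) -> list[str]:
    """Pull every <...> URI out of a List-Unsubscribe header value."""
    return re.findall(r"<([^>]+)>", header)

print(unsubscribe_targets(HEADER))
# ['mailto:unsub@example.com', 'https://example.com/unsub?u=42']
```

An https: target becomes the [Unsubscribe] button's link; a mailto: target would need Phase 3+ send permissions, which is another reason surfacing the button instead of acting on it is the right Phase 1 behavior.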

The meta-observation: a system designed to process information more efficiently was itself designed by running information through multiple models, each catching different blind spots. James kept saying yes or no. That division of labor worked out pretty well.