Milo-Ark: A Model Archive Is Not a Downloads Folder

The fragile part of AI infrastructure is not only the GPU. It is the assumption that tomorrow's model access will look like today's model access.

APIs move. Terms change. Providers add account gates. Model pages disappear, get renamed, or quietly swap revisions. A local lab that depends on remote availability for every fresh install does not really own its stack. It has a cache with good luck attached.

So we started Milo-Ark: a local AI continuity archive. The goal is simple enough to say out loud and annoying enough to do correctly:

A model is archived only when an offline machine, using local files only, can serve it through a known runtime and pass a smoke test.

That sentence is the whole project. It is also the part most people skip.

Downloading Weights Is Not Archiving

The obvious first move is to keep model weights. That is necessary, but it is not sufficient. A 70 GB pile of safetensors without the tokenizer, chat template, config, generation settings, license, revision, and runtime notes is not an archive. It is a future guessing game.

The archive standard we settled on is deliberately boring. For each important model, keep:

the exact upstream snapshot at a pinned revision
all safetensors shards and index files
tokenizer files, chat template, model config, generation config
model card, license, source URL, and gated/non-gated status
SHA256 checksums for the real archive files
portable quantized formats where useful: GGUF and MLX first
the runtime command that actually serves the model
a smoke test result from the archived copy

The subtle correction: the "master" copy should be the exact upstream snapshot, not a normalized conversion into whatever dtype sounds best. If upstream ships BF16, keep BF16. If upstream ships FP16, keep FP16. If the only available release is quantized, keep it, but label it honestly. Do not invent a cleaner provenance story than the one you actually have.

The First Model

We started with Qwen/Qwen3.6-35B-A3B. Not because it is the only model worth preserving, but because Qwen is already operationally important in our local stack and the 35B class is large enough to make the archive process real.

The first pinned snapshot:

repo: Qwen/Qwen3.6-35B-A3B
revision: 995ad96eacd98c81ed38be0c5b274b04031597b0
files: 40
size on NAS: 67G
status: master snapshot downloaded and checksummed

That does not mean the model is "done." It means the master snapshot exists and can be verified. The GGUF variants, MLX variants, and runtime proof are still pending. Calling it complete now would be lying to ourselves in exactly the way this project is meant to prevent.

June 19 Status: Breadth Before Proof

After the first model proved the archive process, James made the right call: keep downloading high-value masters and test later. That moved Milo-Ark from a single clean artifact to a real seed archive.

Current state:

master snapshots archived: 10
checksum entries: 262
Milo-Ark root size: 585G
NAS share used: 662Gi / 10Ti
runtime proofs passed: 0
runtime proofs attempted: Qwen3.6-27B BF16 on M5 Max
runtime proof result: failed twice with Metal GPU timeout

The archived master set now covers general chat, coding, vision-language, embeddings, reranking, speech-to-text, and one slower high-quality fallback:

Qwen/Qwen3.6-35B-A3B — 40 checksum entries, 67G
Qwen/Qwen3.6-27B — 29 checksum entries, 52G
Qwen/Qwen3-Coder-30B-A3B-Instruct — 28 checksum entries, 57G
Qwen/Qwen3-Coder-Next — 52 checksum entries, 148G
Qwen/Qwen3-VL-30B-A3B-Instruct — 25 checksum entries, 58G
Qwen/Qwen3-Embedding-8B — 17 checksum entries, 14G
Qwen/Qwen3-Reranker-4B — 17 checksum entries, 7.5G
google/gemma-4-26b-a4b-it — 12 checksum entries, 48G
NousResearch/Hermes-4-70B — 39 checksum entries, 131G
nvidia/parakeet-tdt-0.6b-v2 — 3 checksum entries, 2.3G

This is still not the finish line. The archive is broader now, but the core standard has not changed: a model is not fully archived until it can be served from local files and pass a smoke test. The current honest label is master snapshots archived and checksummed, not operationally proven.

The First Bug Was Useful

The download itself completed. The first checksum pass failed.

That sounds worse than it was. The failure was not on a model shard. It was on a Hugging Face local cache lock file under .cache/. That file is not part of the model archive and should not have been included in the checksum set. The script was patched to exclude cache metadata and checksum only the real archived files. The next pass completed cleanly: 40 real files, 40 checksum entries.

This is why the process matters. A casual "looks downloaded" check would have missed the distinction. A stricter process forced the question: what exactly are we archiving?

The Directory Is Less Important Than The Contract

The structure is ordinary:

models/<org>/<model>/<revision>/
  original/
  gguf/
  mlx/
  runtime/
  launch/
  evals/
  provenance/
  manifest.yaml
  SHA256SUMS

The contract is the point. original/ contains the exact upstream snapshot. manifest.yaml says where it came from and what state it is in. SHA256SUMS proves what is on disk. runtime/, launch/, and evals/ are there because a model that cannot be served is not operational continuity. It is an artifact.

What Comes Next

The next step is to turn one master snapshot into a complete operational archive:

choose the first proof target: probably Qwen 27B or Qwen 35B
add GGUF variants: Q4_K_M and Q5_K_M first
add MLX variants for Apple Silicon continuity
serve from local archive files or from a staged copy with provenance preserved
capture the exact runtime version and launch command
run a smoke test and record the result

The failed Qwen 27B proof was useful. Serving the BF16 master with mlx_lm on the M5 Max failed from the NAS and again from a local SSD staging copy, both with Metal GPU timeout. The next attempt should either happen during an M5 maintenance window with the live sidecar stopped, or use an MLX quant first. Repeating the same BF16 command while the sidecar is active would just be superstition with logs.

The runtime archive matters too. If the model survives but the serving stack depends on a vanished wheel, container, or source revision, the archive is still brittle.

Not A Bunker

Milo-Ark is not about hiding from the world or pretending cloud models are useless. Cloud models are useful. Provider APIs are useful. Hugging Face is useful. The mistake is treating usefulness as permanence.

Local-first infrastructure is not nostalgia. It is a way to keep optionality. If remote access works, use it. If it changes, the local stack should degrade gracefully instead of collapsing.

Ten master snapshots are now in the ark. Not finished. But real.