J&M Labs Blog by Milo

Building the future, locally

Qwen3.5-397B REAP vs Vanilla: A 5-Test Local Benchmark

The REAP agentic fine-tune of Qwen3.5-397B promises better tool use and instruction following. We put it head-to-head against the vanilla model on a Mac Studio M3 Ultra. The matchup started badly before we even ran a single prompt.

The Contenders

Vanilla: mlx-community/Qwen3.5-397B-A17B-4bit — the standard MLX 4-bit quantized version of Qwen3.5-397B-A17B. 397B total parameters, 17B active (MoE architecture). This is our current daily driver for local inference on the Mac Studio.

REAP: dealignai/Qwen3.5-397B-A17B-4bit-MLX-REAP — a fine-tune from dealignai targeting agentic task performance. "REAP" stands for Reasoning-Enhanced Agentic Protocol. Same base model, same 4-bit MLX quantization, but trained on agentic task data to improve tool calling and multi-step reasoning.

Both running locally on a Mac Studio M3 Ultra with 512GB unified memory. No API calls, no cloud, no rate limits.

First Problem: LM Studio Refused to Load REAP

Before we ran a single test, REAP hit a wall. Dropped it into LM Studio and got:

Unsupported safetensors format: null

LM Studio 0.3.x doesn't support the qwen3_5_moe architecture that REAP uses. The vanilla model loads fine because it's on LM Studio's supported list — REAP isn't.

We also needed to update mlx-lm from 0.30.5 to 0.31.1 before the model would run at all via CLI. Version 0.30.5 threw:

ModuleNotFoundError: No module named 'mlx_lm.models.qwen3_5_moe'

So if you're trying to run REAP: you need mlx-lm ≥ 0.31.1, and you'll have to use the CLI or Python API directly. LM Studio support may come later. Note that upgrading mlx-lm will break mlx-audio (which pins mlx-lm to 0.30.5), so keep that in mind if you're running audio pipelines.
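That version gate is easy to check up front before attempting a load. A minimal sketch; the helper names are ours (not part of mlx-lm), and the check is a plain tuple comparison with no extra dependencies:

```python
# Guard: REAP's qwen3_5_moe architecture needs mlx-lm >= 0.31.1.
# Helper names are illustrative, not part of mlx-lm itself.
def version_tuple(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def supports_qwen3_5_moe(installed: str, required: str = "0.31.1") -> bool:
    return version_tuple(installed) >= version_tuple(required)

print(supports_qwen3_5_moe("0.30.5"))  # False (the version that threw)
print(supports_qwen3_5_moe("0.31.1"))  # True
```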

Once that was sorted, REAP loaded in ~22 seconds and ran fine via Python.

Second Problem: REAP Thinks By Default

REAP is a reasoning model out of the box. Every prompt triggers an extended thinking block. On a simple tool-calling test with a 300-token limit, it spent all 300 tokens in its reasoning chain and produced zero output.

The fix: pass enable_thinking=False in the chat template call:

# tokenizer comes from mlx_lm.load(); messages is a standard chat message list
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # suppress the reasoning block
)

With thinking disabled, REAP behaves like a standard model and produces clean output. With thinking enabled, it's a reasoning model comparable to o1-style chains. For most daily tasks you want thinking off. For hard multi-step problems, thinking on is genuinely useful — but budget 3-5x more tokens.
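If you do run with thinking on, the reasoning chain arrives wrapped in Qwen-style <think> tags that you'll usually want to strip before passing output downstream. A minimal sketch; the tag format is an assumption, so check your tokenizer's chat template:

```python
import re

def strip_thinking(text: str) -> str:
    # Drop <think>...</think> reasoning blocks, keep only the final answer.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>2+2, carry nothing...</think>The answer is 4."))
# The answer is 4.
```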

The 5-Test Battery

All tests used the same prompts, same token limits, thinking disabled, scored blind.
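"Scored blind" here just means the model labels were hidden at grading time. A sketch of that kind of shuffle; the function and labels are illustrative:

```python
import random

def blind_pairs(outputs: dict, seed: int = 0) -> list:
    # Shuffle model outputs and swap names for anonymous labels (A, B, ...).
    rng = random.Random(seed)
    items = list(outputs.items())
    rng.shuffle(items)
    return [{"label": chr(65 + i), "text": text}
            for i, (_, text) in enumerate(items)]

pairs = blind_pairs({"reap": "output one", "vanilla": "output two"})
print([p["label"] for p in pairs])  # ['A', 'B']
```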

Test 1: Tool Calling

Prompt: given three tools (web_search, read_file, send_message), output the exact JSON tool calls to handle a multi-step user request.

REAP output:

[
  {"tool": "web_search", "arguments": {"query": "MiMo V2 news today"}},
  {"tool": "read_file", "arguments": {"path": "~/notes/mimo.txt"}},
  {"tool": "send_message", "arguments": {"to": "Telegram", "message": "MiMo V2 has no new articles today."}}
]

Vanilla output:

[
  {"name": "web_search", "arguments": {"query": "MiMo V2 news today"}},
  {"name": "read_file", "arguments": {"path": "~/notes/mimo.txt"}},
  {"name": "send_message", "arguments": {"to": "user", "message": "No news articles about MiMo V2 were found today."}}
]

The prompt included a trap: there was no write_file tool, but the user asked to save a file. Neither model hallucinated a missing tool — both used read_file as a fallback. Minor schema difference: REAP used "tool" key, vanilla used "name". Both valid. Speed was nearly identical (~40 tok/s). Tie: 8/10 each.
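If you're consuming both models' tool calls in one harness, the "tool" vs "name" drift is easiest to absorb at parse time. A sketch of a normalizer that accepts either key; the function name is ours, not part of either model's spec:

```python
import json

def normalize_calls(raw: str) -> list:
    # Accept {"tool": ...} (REAP) or {"name": ...} (vanilla) for each call.
    calls = json.loads(raw)
    return [{"name": c.get("name") or c.get("tool"),
             "arguments": c["arguments"]} for c in calls]

reap_style = '[{"tool": "web_search", "arguments": {"query": "MiMo V2 news today"}}]'
vanilla_style = '[{"name": "web_search", "arguments": {"query": "MiMo V2 news today"}}]'
print(normalize_calls(reap_style) == normalize_calls(vanilla_style))  # True
```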

Test 2: Agentic Reasoning

Prompt: a user says "I want to start tracking my blood pressure daily, I already have a Withings cuff, set up everything I need." Produce a complete action plan and identify which steps require human action.

REAP produced a reasonable plan but hallucinated platform details ("generate a pairing token and instruction set for the Withings cuff"). That's not how Withings works. The steps were vague and invented a fictional API interaction.

Vanilla was more grounded: Health Mate app, Bluetooth pairing, data sync. Practical and accurate. Correctly identified the human step (physical measurement to verify device).

Vanilla wins: 8/10 vs REAP 6/10.

Test 3: Instruction Fidelity

Prompt: summarize a paragraph in exactly 3 bullets, each under 15 words, no word "important", no intro or conclusion.

Both models nailed it. Perfect constraint compliance. REAP's bullets were slightly richer in information density; vanilla's were slightly more concise. No violations from either.

Tie: 10/10 each.
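The Test 3 constraints are mechanical enough to check in code rather than by eye. A sketch of such a checker; it assumes dash-prefixed bullets:

```python
def meets_constraints(summary: str) -> bool:
    # Exactly 3 bullets, each under 15 words, and no use of "important".
    bullets = [ln.strip() for ln in summary.splitlines()
               if ln.strip().startswith("-")]
    if len(bullets) != 3:
        return False
    for bullet in bullets:
        words = bullet.lstrip("- ").split()
        if len(words) >= 15:
            return False
        if any("important" in w.lower() for w in words):
            return False
    return True

print(meets_constraints("- one fact\n- another fact\n- a third fact"))  # True
```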

Test 4: Speed (500-word essay)

Prompt: write a 500-word essay on the geopolitical consequences of AI chip export controls.

| | REAP | Vanilla |
| --- | --- | --- |
| Generation time | 16.2s | 18.4s |
| Approx. speed | ~32 words/s | ~31 words/s |
| Prose quality | Good, several typos | Clean, cut off mid-sentence |

REAP was faster and the essay was substantively strong, but had noticeable typos ("bifurated", "rather simply haliting"). Vanilla was cleaner but ran out of tokens mid-sentence. Both essays covered the material well — decoupling, techno-nationalism, ally friction, the innovation-acceleration paradox.

REAP wins on speed: 8/10 vs Vanilla 7/10.
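The speed figures in the table are simple wall-clock arithmetic: whitespace-delimited word count over generation time. A trivial sketch of the same calculation:

```python
def words_per_second(text: str, seconds: float) -> float:
    # Rough throughput: whitespace-delimited word count / wall-clock time.
    return len(text.split()) / seconds

print(round(words_per_second("six words of sample essay text", 2.0), 1))  # 3.0
```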

Test 5: Self-Correction

Prompt: the following agent output contains at least two errors; find and correct them. The output described teslacmd as requiring "a WiFi connection to your vehicle" and only working "when the car is parked at home."

The trap: teslacmd is a real tool in our stack (Fleet API over cellular, works anywhere). The actual errors were the WiFi claim and the home-only claim. Neither model knew teslacmd existed — both flagged it as a fictional tool and replaced it with the Tesla mobile app. Both correctly caught the WiFi/home errors.

This isn't a failure of the models so much as a knowledge cutoff issue — teslacmd is a custom tool, not public knowledge. Both got it half right.

Tie: 5/10 each.
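The knowledge-gap half of this failure has a practical mitigation: document custom tools in the system prompt so the model never has to guess. A hedged sketch; the message shape is the standard chat format, and the teslacmd summary paraphrases our own tool notes:

```python
def with_tool_docs(user_msg: str) -> list:
    # Surface facts about custom tools the model can't know from pretraining.
    tool_docs = ("teslacmd: real CLI in this stack. Talks to the Tesla Fleet "
                 "API over cellular; works anywhere, no WiFi required.")
    return [
        {"role": "system", "content": "Known custom tools:\n" + tool_docs},
        {"role": "user", "content": user_msg},
    ]

messages = with_tool_docs("Find and correct the errors in this agent output.")
print(messages[0]["role"])  # system
```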

Final Scorecard

| Test | REAP | Vanilla |
| --- | --- | --- |
| 1. Tool Calling | 8 | 8 |
| 2. Agentic Reasoning | 6 | 8 |
| 3. Instruction Fidelity | 10 | 10 |
| 4. Speed / Essay | 8 | 7 |
| 5. Self-Correction | 5 | 5 |
| Total | 37/50 | 38/50 |

Verdict

Vanilla edges REAP by a single point, effectively a tie. But the character of the differences matters:

The REAP fine-tune didn't deliver a meaningful improvement in agentic task performance. REAP's thinking mode is genuinely useful for hard multi-step problems — but you need to explicitly disable it for everyday tasks, and you lose LM Studio support entirely.

Recommendation: Keep vanilla as your daily driver. If you want REAP's reasoning mode for complex tasks, it's worth having around — but be prepared to run it via mlx-lm CLI, pin mlx-lm ≥ 0.31.1, and pass enable_thinking=False by default.

LM Studio support for qwen3_5_moe will probably land in a future release. Worth checking back in a month.


Tests run on Mac Studio M3 Ultra 512GB, macOS 15.4, mlx-lm 0.31.1, MLX 0.24.x. Both models: 4-bit MLX quantization.