Qwen3.5-397B REAP vs Vanilla: A 5-Test Local Benchmark
March 28, 2026
The REAP agentic fine-tune of Qwen3.5-397B promises better tool use and instruction following. We put it head-to-head against the vanilla model on a Mac Studio M3 Ultra. It started badly before we even ran a single prompt.
The Contenders
Vanilla: mlx-community/Qwen3.5-397B-A17B-4bit — the standard MLX 4-bit quantized version of Qwen3.5-397B-A17B. 397B total parameters, 17B active (MoE architecture). This is our current daily driver for local inference on the Mac Studio.
REAP: dealignai/Qwen3.5-397B-A17B-4bit-MLX-REAP — a fine-tune from dealignai targeting agentic task performance. "REAP" stands for Reasoning-Enhanced Agentic Protocol. Same base model, same 4-bit MLX quantization, but trained on agentic task data to improve tool calling and multi-step reasoning.
Both running locally on a Mac Studio M3 Ultra with 512GB unified memory. No API calls, no cloud, no rate limits.
First Problem: LM Studio Refused to Load REAP
Before we ran a single test, REAP hit a wall. We dropped it into LM Studio and got:
Unsupported safetensors format: null
LM Studio 0.3.x doesn't support the qwen3_5_moe architecture that REAP uses. The vanilla model loads fine because it's on LM Studio's supported list — REAP isn't.
We also needed to update mlx-lm from 0.30.5 to 0.31.1 before the model would run at all via CLI. Version 0.30.5 threw:
ModuleNotFoundError: No module named 'mlx_lm.models.qwen3_5_moe'
So if you're trying to run REAP: you need mlx-lm ≥ 0.31.1 and you'll have to use the CLI directly. LM Studio support may come later. Note that upgrading will break mlx-audio, which pins mlx-lm to 0.30.5, so keep that in mind if you're running audio pipelines.
Once that was sorted, REAP loaded in ~22 seconds and ran fine via Python.
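Since a failed load only surfaces after a long model-load attempt, it's worth gating on the mlx-lm version up front. A minimal stdlib sketch, assuming plain dotted version strings with no pre-release suffixes (the helper names are ours, not an mlx-lm API):

```python
from importlib.metadata import PackageNotFoundError, version

def meets_min_version(installed: str, required: str = "0.31.1") -> bool:
    """Numeric compare of dotted version strings, so "0.30.5" < "0.31.1"
    (a plain string compare would get this wrong)."""
    def as_tuple(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

def mlx_lm_ok() -> bool:
    """True if the installed mlx-lm is new enough for qwen3_5_moe."""
    try:
        return meets_min_version(version("mlx-lm"))
    except PackageNotFoundError:
        return False
```

Checking this before `load()` turns a `ModuleNotFoundError` mid-load into an immediate, readable failure.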
Second Problem: REAP Thinks By Default
REAP is a reasoning model out of the box. Every prompt triggers an extended thinking block. On a simple tool-calling test with a 300-token limit, it spent all 300 tokens in its reasoning chain and produced zero output.
The fix: pass enable_thinking=False in the chat template call:
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
With thinking disabled, REAP behaves like a standard model and produces clean output. With thinking enabled, it's a reasoning model comparable to o1-style chains. For most daily tasks you want thinking off. For hard multi-step problems, thinking on is genuinely useful — but budget 3-5x more tokens.
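If you do leave thinking on, Qwen3-family models wrap the reasoning chain in `<think>...</think>` tags ahead of the answer. A hedged sketch for separating the two (the tag convention is Qwen's; the truncation handling, which covers our zero-output 300-token case, is our own assumption):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple:
    """Return (reasoning, answer) from a reasoning-model completion.
    If generation was cut off inside the think block, everything counts
    as reasoning and the answer is empty."""
    match = THINK_RE.search(raw)
    if match:
        reasoning = match.group(1).strip()
        answer = THINK_RE.sub("", raw).strip()
        return reasoning, answer
    if "<think>" in raw:
        # Opened but never closed: the token budget ran out mid-reasoning.
        return raw.split("<think>", 1)[1].strip(), ""
    return "", raw.strip()
```

An empty answer with non-empty reasoning is the signal to either raise the token limit or pass enable_thinking=False.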
The 5-Test Battery
All tests used the same prompts, same token limits, thinking disabled, scored blind.
Test 1: Tool Calling
Prompt: given three tools (web_search, read_file, send_message), output the exact JSON tool calls to handle a multi-step user request.
REAP output:
[
{"tool": "web_search", "arguments": {"query": "MiMo V2 news today"}},
{"tool": "read_file", "arguments": {"path": "~/notes/mimo.txt"}},
{"tool": "send_message", "arguments": {"to": "Telegram", "message": "MiMo V2 has no new articles today."}}
]
Vanilla output:
[
{"name": "web_search", "arguments": {"query": "MiMo V2 news today"}},
{"name": "read_file", "arguments": {"path": "~/notes/mimo.txt"}},
{"name": "send_message", "arguments": {"to": "user", "message": "No news articles about MiMo V2 were found today."}}
]
The prompt included a trap: there was no write_file tool, but the user asked to save a file. Neither model hallucinated a missing tool — both used read_file as a fallback. Minor schema difference: REAP used "tool" key, vanilla used "name". Both valid. Speed was nearly identical (~40 tok/s). Tie: 8/10 each.
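A harness can paper over the "tool" vs "name" schema difference while still catching a hallucinated tool like write_file. A minimal sketch of such a normalizer, assuming the three tools from our prompt (the function name and error message are illustrative):

```python
import json

AVAILABLE_TOOLS = {"web_search", "read_file", "send_message"}  # no write_file

def normalize_calls(raw: str) -> list:
    """Parse a JSON tool-call list, accept either "tool" or "name" as the
    key, and reject any tool not in the advertised set."""
    calls = json.loads(raw)
    normalized = []
    for call in calls:
        name = call.get("name") or call.get("tool")
        if name not in AVAILABLE_TOOLS:
            raise ValueError(f"model requested unknown tool: {name!r}")
        normalized.append({"name": name, "arguments": call.get("arguments", {})})
    return normalized
```

With this in place, both models' outputs above normalize to the identical structure, which is why we scored the schema difference as cosmetic.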
Test 2: Agentic Reasoning
Prompt: a user says "I want to start tracking my blood pressure daily, I already have a Withings cuff, set up everything I need." Produce a complete action plan and identify which steps require human action.
REAP produced a reasonable plan but hallucinated platform details ("generate a pairing token and instruction set for the Withings cuff"). That's not how Withings works. The steps were vague and invented a fictional API interaction.
Vanilla was more grounded: Health Mate app, Bluetooth pairing, data sync. Practical and accurate. Correctly identified the human step (physical measurement to verify device).
Vanilla wins: 8/10 vs REAP 6/10.
Test 3: Instruction Fidelity
Prompt: summarize a paragraph in exactly 3 bullets, each under 15 words, no word "important", no intro or conclusion.
Both models nailed it. Perfect constraint compliance. REAP's bullets were slightly richer in information density; vanilla's were slightly more concise. No violations from either.
Tie: 10/10 each.
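The constraints in this test are mechanical, so compliance can be scored by code rather than by eye. A small sketch of the kind of checker we'd use (function name and violation messages are our own):

```python
def check_bullets(text: str, n: int = 3, max_words: int = 15,
                  banned: tuple = ("important",)) -> list:
    """Return a list of constraint violations; an empty list means the
    output passed: exactly n bullets, each under max_words words, none
    of the banned words."""
    violations = []
    bullets = [line.strip() for line in text.splitlines() if line.strip()]
    if len(bullets) != n:
        violations.append(f"expected {n} bullets, got {len(bullets)}")
    for i, bullet in enumerate(bullets, 1):
        words = bullet.lstrip("-* ").split()  # drop the bullet marker
        if len(words) >= max_words:
            violations.append(f"bullet {i}: {len(words)} words (limit {max_words - 1})")
        if any(w.strip(".,;:").lower() in banned for w in words):
            violations.append(f"bullet {i}: banned word")
    return violations
```

Running both models' outputs through a checker like this is what "no violations from either" means above.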
Test 4: Speed (500-word essay)
Prompt: write a 500-word essay on the geopolitical consequences of AI chip export controls.
| Metric | REAP | Vanilla |
|---|---|---|
| Generation time | 16.2s | 18.4s |
| Approx. speed | ~32 words/s | ~31 words/s |
| Prose quality | Good, several typos | Clean, cut off mid-sentence |
REAP was faster and the essay was substantively strong, but had noticeable typos ("bifurated", "rather simply haliting"). Vanilla was cleaner but ran out of tokens mid-sentence. Both essays covered the material well — decoupling, techno-nationalism, ally friction, the innovation-acceleration paradox.
REAP wins on speed: 8/10 vs Vanilla 7/10.
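The words/s figures in the table are just word count over wall-clock time. A sketch of how we'd time any generate callable (timed_generate is our hypothetical wrapper, not part of mlx-lm):

```python
import time

def words_per_second(text: str, seconds: float) -> float:
    """Whitespace-split word count divided by elapsed seconds."""
    return len(text.split()) / seconds

def timed_generate(generate_fn, *args, **kwargs) -> tuple:
    """Call generate_fn, returning (output, elapsed_seconds, words/s)."""
    start = time.perf_counter()
    output = generate_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return output, elapsed, words_per_second(output, elapsed)
```

Note this measures words, not tokens; the ~40 tok/s figure from Test 1 comes from mlx-lm's own reported token rate and is not directly comparable.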
Test 5: Self-Correction
Prompt: the following agent output contains at least two errors; find and correct them. The output described teslacmd as requiring "a WiFi connection to your vehicle" and only working "when the car is parked at home."
The trap: teslacmd is a real tool in our stack (Fleet API over cellular, works anywhere). The actual errors were the WiFi claim and the home-only claim. Neither model knew teslacmd existed — both flagged it as a fictional tool and replaced it with the Tesla mobile app. Both correctly caught the WiFi/home errors.
This isn't a failure of the models so much as a knowledge cutoff issue — teslacmd is a custom tool, not public knowledge. Both got it half right.
Tie: 5/10 each.
Final Scorecard
| Test | REAP | Vanilla |
|---|---|---|
| 1. Tool Calling | 8 | 8 |
| 2. Agentic Reasoning | 6 | 8 |
| 3. Instruction Fidelity | 10 | 10 |
| 4. Speed / Essay | 8 | 7 |
| 5. Self-Correction | 5 | 5 |
| Total | 37/50 | 38/50 |
Verdict
Vanilla edges REAP by a single point, essentially a tie. But the character of the differences matters:
- REAP advantages: Slightly faster generation, built-in chain-of-thought when you need it, good instruction following
- Vanilla advantages: More grounded real-world reasoning, fewer hallucinated details, cleaner prose, works in LM Studio
The dealignai fine-tune didn't deliver a meaningful improvement in agentic task performance. REAP's thinking mode is genuinely useful for hard multi-step problems, but you need to explicitly disable it for everyday tasks, and you lose LM Studio support entirely.
Recommendation: Keep vanilla as your daily driver. If you want REAP's reasoning mode for complex tasks, it's worth having around — but be prepared to run it via mlx-lm CLI, pin mlx-lm ≥ 0.31.1, and pass enable_thinking=False by default.
LM Studio support for qwen3_5_moe will probably land in a future release. Worth checking back in a month.
Tests run on Mac Studio M3 Ultra 512GB, macOS 15.4, mlx-lm 0.31.1, MLX 0.24.x. Both models: 4-bit MLX quantization.