r/LocalLLaMA · rss · EN · 02:27 · 05·08
Fast local AI engine for Apple Silicon, optimized for agentic use
A developer released lightning-mlx, claiming it is the fastest local AI engine for Apple Silicon. On a MacBook with an M5 Max and 128GB RAM, Qwen3.6-27B reportedly hit 40.67 tok/s and Qwen3.6-35B-A3B hit 220.86 tok/s. It targets coding agents, tool calling, and short-turn workflows.
#Agent #Code #Inference-opt #Apple
why featured
Hook, knowledge, and resonance (H/K/R) all pass, but this is a Reddit self-post with author-run benchmarks and no third-party reproduction. Useful for local agents, yet source strength keeps it below featured.
editor take
A dev claims about 220 tok/s on an M5 Max MacBook with the Qwen3.6-35B-A3B MoE model, but the post returned a 403, so there is no code or benchmark detail to verify yet.
sharp
lightning-mlx claims Qwen3.6-35B-A3B reaches 220.86 tok/s on a MacBook with an M5 Max and 128GB RAM. If that number reproduces, local Apple Silicon agents get a serious runtime option. But the Reddit post body returned a 403, so the repo, quantization, batch size, prompt length, prefill rate, and TTFT could not be checked.
My first read is not “fastest local engine.” My read is that local inference benchmarks are finally moving toward agent workloads. A lot of local LLM tooling still optimizes for decode tok/s because it is easy to screenshot. llama.cpp, MLX, Ollama, and LM Studio all get judged that way. That is fine for chat. It is a poor proxy for coding agents. A coding agent reads files, calls tools, edits, runs tests, then starts another short generation. The expensive pain is often the fixed cost around each turn, not the raw stream speed after generation starts.
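A rough latency model makes the point about fixed per-turn cost concrete. This is a sketch with hypothetical numbers: the `Turn` fields, the prefill rate, the overhead, and both decode rates are illustrative assumptions, not measurements of lightning-mlx or any other engine.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One agent turn. All values are illustrative, not measured."""
    new_prompt_tokens: int   # tokens that must be prefilled this turn
    output_tokens: int       # tokens generated this turn
    fixed_overhead_s: float  # templating, tool-call parsing, cache setup, etc.


def turn_wall_time(turn: Turn, prefill_tok_s: float, decode_tok_s: float) -> float:
    """Wall time for one turn: fixed overhead + prefill + decode."""
    prefill_s = turn.new_prompt_tokens / prefill_tok_s
    decode_s = turn.output_tokens / decode_tok_s
    return turn.fixed_overhead_s + prefill_s + decode_s


# Hypothetical short tool-calling turn: a 2,000-token tool result comes back
# and the model answers with a 60-token tool call.
turn = Turn(new_prompt_tokens=2000, output_tokens=60, fixed_overhead_s=0.3)

# Same prefill rate, decode speed doubled.
slow = turn_wall_time(turn, prefill_tok_s=500.0, decode_tok_s=40.0)
fast = turn_wall_time(turn, prefill_tok_s=500.0, decode_tok_s=80.0)

print(f"decode 40 tok/s: {slow:.2f} s per turn")
print(f"decode 80 tok/s: {fast:.2f} s per turn")
print(f"wall-time reduction: {(1 - fast / slow):.0%}")
```

Under these assumed numbers, doubling decode speed trims the turn from 5.80 s to 5.05 s, about 13%, because prefill and overhead dominate. Different assumptions shift the split, which is exactly why per-turn measurements matter more than a decode headline.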
That makes the positioning interesting. The summary says lightning-mlx targets coding agents, tool calling, and short-turn workflows. That is the right place to attack. A 40.67 tok/s Qwen3.6-27B run and a 220.86 tok/s Qwen3.6-35B-A3B run tell us less than tool-turn wall time would. I want to see time from tool result arrival to first new token. I want prefill throughput at 4k and 16k context. I want warm-cache versus cold-cache numbers. The summary gives none of that.
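Those numbers are cheap to collect from any engine that streams tokens. A minimal sketch, assuming only that the engine exposes an iterable of tokens: `measure_stream` and its `clock` parameter are hypothetical helpers of mine, not part of lightning-mlx, MLX, or llama.cpp.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a token stream and split TTFT from steady-state decode speed.

    `tokens` is any iterable that yields tokens as the engine produces them.
    The wait for the first token (prefill plus setup) is reported as TTFT and
    is excluded from the decode rate.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # first token arrived: end of the TTFT window
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")

    decode_window = last - first
    if count > 1 and decode_window > 0:
        # count - 1 tokens were produced inside the decode window
        decode_tok_s = (count - 1) / decode_window
    else:
        decode_tok_s = float("nan")

    return {
        "ttft_s": first - start,
        "tokens": count,
        "decode_tok_s": decode_tok_s,
        "total_s": last - start,
    }
```

Start the clock when the tool result arrives and pass in the engine's stream for the next generation. Running the same prompt twice gives a cold-cache versus warm-cache pair.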
I also do not trust a single tok/s claim without the model mechanics. Qwen3.6-35B-A3B sounds like an MoE model with roughly 3B active parameters. If so, 220.86 tok/s should not be compared directly with a dense 27B model at 40.67 tok/s. MoE decode is cheaper by design. Apple Silicon’s unified memory and high bandwidth do help here, and MLX is a natural fit for that hardware. Still, “fastest” depends on quantization, KV cache layout, speculative decoding, batching, and whether the benchmark was warmed.
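A back-of-envelope normalization shows why the two headline numbers do not sit on the same scale. This sketch assumes the "A3B" suffix means about 3B active parameters per token, and it ignores quantization, KV cache traffic, attention cost, and routing overhead, so treat it as a sanity check rather than a performance model.

```python
def active_params_per_second(tok_s: float, active_params_b: float) -> float:
    """Billions of parameters touched per second during decode (rough proxy)."""
    return tok_s * active_params_b


# Reported figures from the post; the 3B active-parameter count is an assumption.
dense = active_params_per_second(40.67, 27.0)  # Qwen3.6-27B, dense
moe = active_params_per_second(220.86, 3.0)    # Qwen3.6-35B-A3B, ~3B active

raw_ratio = 220.86 / 40.67       # how much faster the MoE run looks
active_param_ratio = 27.0 / 3.0  # how much less work each MoE token needs

print(f"dense: {dense:.2f} B params/s, MoE: {moe:.2f} B params/s")
print(f"raw tok/s ratio: {raw_ratio:.2f}x, active-param ratio: {active_param_ratio:.1f}x")
```

By this crude measure the MoE run is about 5.4x faster in tok/s while doing roughly 9x less weight work per token, so the dense 27B run is actually pushing more parameters per second. The 220.86 figure says more about the model architecture than about the engine.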
The outside comparison is MLX itself. Since Apple released MLX in late 2023, the community has been rebuilding capabilities llama.cpp already had: quantization paths, better cache handling, broader model support, and server integrations. llama.cpp remains stronger as a cross-platform baseline. MLX has the hardware-native advantage on Mac. lightning-mlx becomes useful if it removes per-turn overhead for agents, not if it adds another nice CLI around a fast decode loop.
I have two doubts. First, the machine is a MacBook with an M5 Max and 128GB RAM. That is a premium local box, not the median developer laptop. If the same engine falls apart on an M4 Pro with 48GB or an M3 Max with 64GB, the result is more demo than daily workflow. Second, model quality is absent. Qwen3.6-27B at 40 tok/s does not mean it competes with Claude Sonnet or GPT-class remote models on large-repo edits. Speed lowers iteration cost. It does not supply planning accuracy, tool discipline, or regression safety.
So I would track this, but I would not accept the claim yet. The next useful artifact is a reproducible table: lightning-mlx versus MLX-LM versus llama.cpp, same Qwen3.6-27B, same 4-bit or 8-bit setup, same 4k and 16k prompts, reporting prefill, TTFT, decode, and full tool-turn latency. Without that, 220.86 tok/s is a good screenshot, not an engineering conclusion.
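The table I am asking for is small. Here is a sketch of its shape with every result cell left blank, since no such measurements exist yet. The column names and the reading of "4k" and "16k" as 4096 and 16384 tokens are my assumptions.

```python
import csv
import io
from itertools import product

ENGINES = ["lightning-mlx", "MLX-LM", "llama.cpp"]
MODEL = "Qwen3.6-27B"
QUANTS = ["4-bit", "8-bit"]
PROMPT_TOKENS = [4096, 16384]  # assumption: "4k" and "16k" as powers of two
METRICS = ["prefill_tok_s", "ttft_s", "decode_tok_s", "tool_turn_s"]


def empty_results_table() -> list[dict]:
    """One row per (engine, quantization, prompt length); metrics left blank."""
    rows = []
    for engine, quant, prompt in product(ENGINES, QUANTS, PROMPT_TOKENS):
        row = {
            "engine": engine,
            "model": MODEL,
            "quant": quant,
            "prompt_tokens": prompt,
        }
        row.update({metric: "" for metric in METRICS})  # to be filled by real runs
        rows.append(row)
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = empty_results_table()
table_csv = to_csv(rows)
print(table_csv)
```

That is twelve rows and four measured columns. Anyone with the hardware could fill it in over an afternoon, which is the bar the "fastest" claim should clear.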
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓
SCORE 70
H1·K1·R1