Mintlify cut session startup from 46 seconds to 100 milliseconds, and my read is pretty simple: this is less “better RAG” than a correction to a design mistake. A lot of doc assistants were never retrieval problems first. They were information architecture problems wearing vector-search clothes.
I’ve thought for a while that documentation QA got pulled into the early RAG default for reasons that made sense in 2023 and make less sense now. Back then, models were bad at tool use, bad at recovery after a failed search, and expensive enough that teams wanted one retrieval pass and one generation pass. So everyone converged on the same stack: chunk pages, embed them, retrieve top-k, stuff context, answer. That pipeline was fine when the model could not reliably inspect its environment. By 2025, that assumption had already weakened. Claude Code, codebase agents, OpenAI tool use, and a lot of production internal assistants showed that giving the model a cheap loop of inspect-search-read-refine often beats guessing the right context upfront. Mintlify is applying that lesson to docs with a very practical interface: grep, cat, ls, find.
The numbers here matter, but not in the way the headline suggests. At 850,000 chats a month and $70,000 a year saved, the per-chat cost reduction is not huge in isolation. Rough math says about 10.2 million chats a year, so the savings come to roughly 0.7 cents per chat. Useful, yes. The bigger shift is latency. A 46-second startup time makes exploration economically and behaviorally impossible. At that point, the agent cannot act like an agent; the product team will clamp down on tool calls, prefetch more context, and drift back toward static RAG because the UX punishes every extra step. At 100ms, the exploration loop becomes cheap enough that the model can inspect more than one page, retry a grep, and walk a structure instead of pretending one retrieval shot is enough.
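For anyone who wants to check the rough math, here is the arithmetic spelled out. The two inputs are the figures quoted above; the variable names are mine.

```python
# Inputs quoted in the post: monthly chat volume and claimed annual savings.
CHATS_PER_MONTH = 850_000
ANNUAL_SAVINGS_USD = 70_000

chats_per_year = CHATS_PER_MONTH * 12
savings_per_chat_usd = ANNUAL_SAVINGS_USD / chats_per_year
savings_per_chat_cents = savings_per_chat_usd * 100

print(f"{chats_per_year:,} chats per year")
print(f"{savings_per_chat_cents:.2f} cents saved per chat")
```

That is a real saving at scale, but it is small enough per conversation that it cannot be the main reason to change architecture.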
That is why I buy the architecture more than the savings claim. Mintlify is using the file system as a model interface, not as implementation truth. That’s the smart part. Models have already been trained, tuned, and product-shaped around shell-like environments. They know what ls, cat, grep, and find are supposed to do. If you expose a private retrieval API with ten custom verbs, you now have to teach the model the protocol. If you expose a familiar abstraction and route it into a database, you inherit the model’s prior. We’ve seen the same move elsewhere over the last year: shell interfaces backed by controlled simulators, browser tools backed by policy layers, IDE agents backed by indexed code graphs rather than literal files. The industry keeps relearning the same lesson: reusing a tool grammar the model already understands is often better than inventing a clean new API.
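To make the "interface, not implementation truth" idea concrete, here is a minimal sketch of shell-style verbs routed into a database. It is not Mintlify's code: the table layout, the `build_store`, `ls`, and `cat` helpers, and the sample paths are all hypothetical, and SQLite stands in for whatever store a real system would use.

```python
import sqlite3


def build_store(pages):
    """Load {path: body} pairs into an in-memory table (a stand-in for a real docs DB)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (path TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?)", list(pages.items()))
    return conn


def ls(conn, directory):
    """List the immediate children of a virtual directory, derived from path prefixes."""
    prefix = directory.rstrip("/") + "/"
    rows = conn.execute(
        "SELECT path FROM pages WHERE substr(path, 1, ?) = ?",
        (len(prefix), prefix),
    )
    entries = set()
    for (path,) in rows:
        head, sep, _ = path[len(prefix):].partition("/")
        # A remaining slash means the child is itself a directory.
        entries.add(head + "/" if sep else head)
    return sorted(entries)


def cat(conn, path):
    """Return a page body, failing the way a shell user would expect."""
    row = conn.execute("SELECT body FROM pages WHERE path = ?", (path,)).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return row[0]


# Hypothetical sample corpus.
conn = build_store({
    "/api/auth.md": "Use a bearer token.",
    "/api/users.md": "List users.",
    "/guides/quickstart.md": "Install the CLI.",
})
print(ls(conn, "/"))
print(cat(conn, "/api/auth.md"))
```

Nothing here is a file. The model sees directories and pages because that is the grammar it already knows, while the storage layer stays free to be a table, an index, or something else entirely.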
There’s also a broader correction here that the Hacker News discussion got right. RAG never meant “vector database.” Retrieval can be lexical search, metadata filtering, SQL, graph traversal, or a permissions-aware directory walk. Vector search won mindshare because it was easy to package and easy to pitch. It fit the “semantic understanding” story, and cloud vendors had every incentive to make it the default answer. But docs are already structured systems. They have pages, sections, versions, code blocks, anchors, permissions, and fairly explicit hierarchy. Using the blurriest and most expensive retrieval layer as the primary entry point is often not sophistication. It’s avoidance.
Still, I’d push back on a few parts of the story.
First, this is highly shape-dependent. The post says so, and I agree. API references, SDK docs, CLI manuals, migration guides, and error catalogs are a great fit because exact match and hierarchy matter. Internal company knowledge bases are a different beast. Decision logs, project docs, wiki sprawl, meeting notes, and duplicated writeups do not naturally collapse into a clean tree. If the underlying knowledge graph is messy, a fake file system can create fake confidence. The model feels like it is exploring systematically, but it is actually following a brittle information architecture.
Second, I only half-buy the grep performance narrative until there are better operating details. The mechanism sounds plausible: parse grep arguments, use metadata to narrow candidates, prefetch in batches, then do exact matching in memory. Fine. But the post does not disclose corpus size, average page size, cache policy, regex coverage, concurrency behavior, or p95/p99 latency. The 100ms figure describes session startup; it says nothing about first useful retrieval under load. Anyone who has built search infra knows there is a large gap between grep in a demo and grep in production. Regex edge cases, long pages, case handling, fragmented ACL views, and cold caches can all bring the latency right back.
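The mechanism as described can be sketched in a few dozen lines, which is part of why I find it plausible and also why I want the production details. This is my reading of the four steps, not the actual implementation: the `PAGES` dict stands in for a database, the path prefix stands in for "metadata", and `parse_grep`, `fetch_batches`, and `grep` are hypothetical helpers that handle only `-i` and `-r`.

```python
import re
import shlex

# Hypothetical corpus; in a real system this would be a database, not a dict.
PAGES = {
    "/api/auth.md": "Authentication\nSend a Bearer token.",
    "/api/users.md": "Users\nGET /users returns a list.",
    "/guides/quickstart.md": "Quickstart\nGet a token first.",
}


def parse_grep(command):
    """Step 1: parse a small subset of grep syntax (only -i and -r are accepted)."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "grep":
        raise ValueError("not a grep command")
    ignore_case = False
    positional = []
    for token in tokens[1:]:
        if token.startswith("-") and len(token) > 1:
            for flag in token[1:]:
                if flag == "i":
                    ignore_case = True
                elif flag != "r":
                    raise ValueError(f"unsupported flag: -{flag}")
        else:
            positional.append(token)
    if len(positional) != 2:
        raise ValueError("expected: grep [-i] [-r] PATTERN PATH")
    return positional[0], positional[1], ignore_case


def fetch_batches(paths, batch_size=2):
    """Step 3: pull page bodies in batches rather than one lookup per page."""
    for start in range(0, len(paths), batch_size):
        yield [(p, PAGES[p]) for p in paths[start:start + batch_size]]


def grep(command):
    pattern, root, ignore_case = parse_grep(command)
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    # Step 2: narrow candidates using metadata (here, just the path).
    prefix = root.rstrip("/") + "/"
    candidates = sorted(p for p in PAGES if p == root or p.startswith(prefix))
    hits = []
    for batch in fetch_batches(candidates):
        for path, body in batch:
            # Step 4: exact matching in memory, line by line.
            for lineno, line in enumerate(body.splitlines(), start=1):
                if regex.search(line):
                    hits.append((path, lineno, line))
    return hits


for hit in grep("grep -ri token /api"):
    print(hit)
```

Everything I listed as undisclosed lives in the parts this sketch skips: how wide the candidate set gets for a broad pattern, what happens when the batches are cold, and how much of real grep syntax is actually supported.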
Third, the access-control framing is good but a little too neat. Pruning the file tree by user identity is much better than letting the model discover paths and rejecting later. I like that design. But “the model cannot see the path, so there is no privilege risk” is stronger than the article earns. Side channels still exist: missing cross-links, broken references, naming patterns, path depth, and cache reuse across differently scoped users can all leak shape. The post does not disclose how they isolate shared indexes or handle cross-document references under mixed permissions, so I would not repeat the “no risk” claim as stated.
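Here is roughly what "prune first, then expose" looks like, as a sketch under my own assumptions. The `Page` type, the group-based permission model, and the `visible_pages` and `read_visible` helpers are all hypothetical; the post does not describe its ACL model at this level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    path: str
    body: str
    groups: frozenset  # groups allowed to read this page (assumed ACL model)


def visible_pages(pages, user_groups):
    """Build the per-user view up front: pages the user cannot read never enter it."""
    user_groups = frozenset(user_groups)
    return {p.path: p for p in pages if p.groups & user_groups}


def read_visible(view, path):
    """Forbidden and nonexistent paths fail identically, so the error leaks nothing."""
    page = view.get(path)
    if page is None:
        raise FileNotFoundError(path)
    return page.body


# Hypothetical sample corpus.
pages = [
    Page("/public/intro.md", "Welcome. See /internal/roadmap.md.", frozenset({"everyone"})),
    Page("/internal/roadmap.md", "Q3 plans.", frozenset({"staff"})),
]

guest_view = visible_pages(pages, {"everyone"})
print(sorted(guest_view))
```

Note what the sample data does on purpose: the public page's body still mentions `/internal/roadmap.md`. The tree is pruned, but the reference survives in the text, which is exactly the kind of shape leak that pruning alone does not address.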
Placed in the context of the last year, this lines up with where strong agent products have been going: less “retrieve everything first,” more “let the model gather evidence step by step.” Anthropic pushed variants of this logic in coding tools, and many enterprise assistants quietly learned the same thing. Static context stuffing looks efficient on a slide. In practice, if the information source is structured and the tool loop is cheap, iterative retrieval is often more reliable because the model can correct itself.
So I would not treat this as a cute docs optimization. I’d treat it as a useful architectural reminder. If your knowledge source has real structure, strong ACLs, and a lot of exact-match demand, stop assuming embeddings should be the first layer every time. Start by asking what the data actually is: a tree, a table, a graph, a queue, a corpus. Then give the model operations that fit that shape. A lot of teams spent two years embedding first and modeling the information system second. Mintlify is showing that the order should often be reversed.