FEATURED · Hacker News Frontpage · rss · EN · 10:48 · 04·16
AI cybersecurity is not proof of work
antirez argues that AI bug finding is bounded by the model's intelligence level I, not by brute-force sampling alone: for the same code, execution paths eventually saturate. His concrete example is the OpenBSD SACK bug, where weaker models fail even with unlimited tokens because they do not connect window validation, integer overflow, and the NULL branch. The key variable is model quality and access speed, not just more GPUs.
#Reasoning #Safety #Benchmarking #antirez
why featured
High-quality commentary with HKR-H from the contrarian headline, HKR-K from the OpenBSD SACK mechanism and firsthand test, and HKR-R because it hits the 'more sampling vs better models' debate in AI security. Not a product, research release, or multi-source event, so it stays mid.
editor take
antirez is right to break the “more sampling equals more capability” story. In vuln research, token count is a bad proxy for understanding.
sharp
antirez anchors the argument on one concrete condition: weaker models fail to connect three facts in the OpenBSD SACK bug. I buy the core claim. Vulnerability discovery is not a pure coverage problem; it is a representation and causal-composition problem.
The strongest line in the piece is the saturation claim. Sample the same code 100, 1,000, or 10,000 times and the early gains come from exploring candidate paths. After that, you mostly buy repetition, noise, and prettier hallucinations. Yes, the raw program state space is large. The bottleneck is the much smaller set of meaningful states the model can reach and reason through reliably. The article gives a reproducible enough mechanism: start-window validation, integer overflow, and the NULL branch. A weak model can gesture at each one separately but then fails to compose them. Once that break is there, more tokens just replay the same miss.
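The saturation argument can be made concrete with a toy simulation. This is a minimal sketch, not antirez's experiment: it assumes a model can only ever emit hypotheses from a fixed "reachable" set, and that finding the bug requires the one hypothesis that composes all three facts. The names `weak_reachable`, `strong_reachable`, and the `noise_*` hypotheses are invented for illustration.

```python
import random

# Hypothetical setup: the bug is only "found" when one hypothesis
# composes all three facts named in the article.
FACTS = ("window_validation", "integer_overflow", "null_branch")
BUG = frozenset(FACTS)

# Assumption: a weak model reaches each fact alone, plus some noise,
# but never the composed hypothesis. A strong model also reaches BUG.
weak_reachable = [frozenset([f]) for f in FACTS] + [
    frozenset([f"noise_{i}"]) for i in range(17)
]
strong_reachable = weak_reachable + [BUG]


def distinct_findings(reachable, samples, seed=0):
    """Sample hypotheses with replacement and track distinct ones seen."""
    rng = random.Random(seed)
    seen = set()
    curve = []
    for _ in range(samples):
        seen.add(rng.choice(reachable))
        curve.append(len(seen))  # cumulative count of distinct hypotheses
    return curve, seen


def found_bug(reachable, samples, seed=0):
    """Return (bug found?, number of distinct hypotheses) for a budget."""
    curve, seen = distinct_findings(reachable, samples, seed)
    return BUG in seen, curve[-1]


if __name__ == "__main__":
    for budget in (100, 1_000, 10_000):
        print(budget, "weak:", found_bug(weak_reachable, budget),
              "strong:", found_bug(strong_reachable, budget))
```

Under these assumptions the distinct-hypothesis count flattens once the reachable set is exhausted, and the weak model's result does not change between a budget of 1,000 and 10,000 because the composed hypothesis is simply not in its set.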
That lines up with a lot of “agentic security” demos from the last year. The pattern is familiar: the model scans code, a tool fuzzes inputs, another system surfaces suspicious traces, and the model writes the report. One real issue lands, and the whole stack gets marketed as brute-force AI discovery. I don’t buy that framing. In many cases, the fuzzer found the anomaly, the static rule boxed in the risky region, and the model translated the result into a readable narrative. Mixing those together overstates the role of token volume and GPU count. antirez is useful here because he separates “found a bug” from “recognized a bug mechanism.” Those are not the same thing.
The wider context also supports him. The systems that have produced credible security work lately were rarely pure LLM sampling machines. They were LLMs tied to execution feedback, constraint checking, symbolic hints, test harnesses, or exploit validation loops. I’m not going to pretend I verified every recent paper again before writing this, but the pattern has been consistent: sampling alone hits a wall fast; sampling plus verifier loops keeps improving. That is the one place where I’d extend his model. Calling the cap “model intelligence I” is directionally right, but incomplete. In practice the ceiling looks more like intelligence times tool quality times feedback speed (the inverse of feedback latency). A strong model without a verifier still invents things. A weaker model with a tight loop can sometimes be dragged into usefulness.
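A verifier loop of the kind described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not any real system's pipeline: `parse_length` is a made-up target with a deliberate divide-by-zero, and the `claims` list stands in for inputs a model asserts will crash it.

```python
def parse_length(header: bytes) -> int:
    """Toy target (hypothetical): crashes when the length byte is zero."""
    length = header[0]
    return 256 // length  # raises ZeroDivisionError when length == 0


def verify(candidate: bytes) -> bool:
    """A claim counts only if the input actually crashes the target."""
    try:
        parse_length(candidate)
    except ZeroDivisionError:
        return True
    return False


def triage(claims):
    """Split model claims into execution-confirmed and rejected."""
    confirmed = [c for c in claims if verify(c)]
    rejected = [c for c in claims if not verify(c)]
    return confirmed, rejected


# Stand-in for model output: three inputs claimed to crash the parser.
claims = [b"\xff", b"\x00", b"\x80"]
confirmed, rejected = triage(claims)

if __name__ == "__main__":
    print("confirmed:", confirmed)
    print("rejected:", rejected)
```

The design point is that the verifier, not the model, decides what counts as a finding: two of the three claims are discarded by execution, which is the feedback that plain resampling never gets.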
I also have one pushback on his wording about stronger-but-still-insufficient models being less likely to claim there is a bug because they hallucinate less. That feels plausible for this exact bug. I’m not sure it generalizes. Mid-tier models in security often do not become simply more cautious; they become better at producing coherent wrong analyses. If you do not score them against exploitability, crash reproduction, or patch-diff validation, false negatives and false positives can both get misread. The title and body give the thesis, but they do not disclose a broader eval set, sample size, model roster, or temperatures. So I would not turn that sentence into a general law yet.
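To make the scoring point concrete, here is a minimal sketch of grading model claims against reproduced crashes. All data is invented for illustration: `ground_truth` stands in for the result of crash reproduction on four hypothetical functions, and the two claim sets represent a model that stays silent and a model that produces fluent but mostly wrong reports.

```python
def score(claims, ground_truth):
    """Compare per-function bug claims against reproduced ground truth."""
    tp = sum(1 for k, v in claims.items() if v and ground_truth[k])
    fp = sum(1 for k, v in claims.items() if v and not ground_truth[k])
    fn = sum(1 for k, v in claims.items() if not v and ground_truth[k])
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical ground truth: True means a crash was actually reproduced.
ground_truth = {"fn_a": True, "fn_b": False, "fn_c": False, "fn_d": True}

# Hypothetical model outputs: True means the model claimed a bug.
cautious = {"fn_a": False, "fn_b": False, "fn_c": False, "fn_d": False}
fluent = {"fn_a": True, "fn_b": True, "fn_c": True, "fn_d": False}

cautious_score = score(cautious, ground_truth)
fluent_score = score(fluent, ground_truth)

if __name__ == "__main__":
    print("cautious:", cautious_score)
    print("fluent:  ", fluent_score)
```

In this made-up data the silent model has zero false positives and also zero recall, while the fluent model finds one real bug and raises two false alarms. Without the ground-truth column, neither failure mode is visible from the claims alone.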
There is also a market read here. This essay is a cold shower for the “more parallel agents equals more security output” pitch. That story works for shallow classes of work: misconfig detection, known bug patterns, dependency hygiene, broad triage. It breaks on deeper logic bugs. What you are buying is not linear production; you are buying a search process that saturates quickly. The firms that win here will not be the ones with the biggest raw sampling budget alone. They will be the ones with access to stronger frontier models, faster routing into those models, and better automated validation of exploitability. Compute still matters. In this domain it looks more like an amplifier than the engine.
So my read is blunt: stop charting security capability as token throughput. The OpenBSD SACK example is pointing at a threshold structure, not a cost curve. A weak model does not become a strong model by running longer. The body does not disclose Mythos success rates, cost, or operating envelope, so I can’t say how close this is to repeatable commercial performance. But the narrative that “more GPU automatically yields more high-quality vulns” has already oversold itself, especially for logic bugs.
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓
SCORE 82
H1·K1·R1