FrontierFinance moves benchmarking in the right direction by making the test ugly in the way real work is ugly: 25 financial modeling tasks, five model types, and more than 18 hours of skilled human labor per task on average. That framing matters more than the headline result. The abstract says human experts beat current state-of-the-art systems on average score and deliver client-ready outputs more often. I’m not surprised. If these tasks genuinely require spreadsheet construction, source checking, assumption linking, revisions, and presentation quality, today’s systems usually fail in the last mile. They can draft a lot. They still struggle to finish work a client would trust without cleanup.

The good part here is the combination of long horizon, computer use, and domain workflow. In the last couple of years, we’ve seen adjacent attempts in other domains: SWE-bench for software, OSWorld for computer use, GAIA for multi-step general assistance, plus a growing pile of agent evaluations that try to move beyond one-shot QA. Finance has been oddly under-benchmarked given how often people cite it as “high AI exposure.” This paper at least acknowledges that professional finance work is not a string-output problem. A model can know what a DCF is and still fail to produce a usable model because the assumptions are inconsistent, the comps are sloppy, the sensitivity table is wrong, or the deck formatting signals “junior error” to any real reviewer.

That said, I have real reservations, and they are not minor. First, 25 tasks is still a small sample. It is enough for a research probe, not enough for a stable industry barometer. “Financial modeling” covers very different workflows: three-statement models, DCFs, LBOs, merger models, project finance, maybe regulatory reporting depending on what they included.
The abstract does not disclose the task mix, how the tasks are balanced across the five model types, data provenance, or whether tasks reflect buy-side, sell-side, corporate finance, or accounting-heavy work. Without that, an average score can hide a lot.

Second, the abstract leaves out the most important implementation details: which systems were tested, what tool permissions they had, and whether they got browser access, spreadsheets, Python, retrieval, long rollouts, retries, or human scaffolding. That gap is decisive. If the agents’ tools were restricted, then showing that humans outperform them on long financial tasks is directionally true but less informative. If the agents had full computer use, large budgets, and enough time, the result is much stronger. Right now the abstract does not say.

Third, I’m wary of the “client-ready” label unless the paper is very explicit. In finance, client-ready is not just correctness. It includes formatting discipline, footnotes, disclosure hygiene, source traceability, consistency across tabs, and the tacit style norms of a specific firm. That standard is partly subjective. If the rubric and inter-rater agreement are strong, great. If not, the benchmark may be measuring institutional polish as much as financial reasoning. That is still useful, but it is a narrower claim than “models cannot do finance work.”

My bigger takeaway is about evaluation philosophy. A lot of model vendors still lean on short-horizon benchmarks because they are cheap, reproducible, and easy to market. Professional labor is expensive for the opposite reason: it lives in long chains of execution, where context drifts, files change, assumptions break, and mistakes compound. FrontierFinance is valuable if it forces the field to admit that job displacement is not governed by trivia recall or single-turn reasoning. It is governed by long-run execution, error recovery, tool reliability, and deliverable quality. That pattern already shows up in coding agents and research agents.
Systems can often get through most of the work, call it 70% to 80%, then stumble on the part professionals actually get paid for.

So I would not read this paper as “AI is weak in finance.” I’d read it as “older benchmarks were too light.” High exposure does not mean near-term full automation. The more plausible path is workflow fragmentation. Agents will absorb the separable pieces first: data gathering, first-pass modeling, comps collection, sensitivity outputs, formatting cleanup. Humans will keep the assumption choices, exception handling, review loops, and client accountability for longer. If FrontierFinance later expands beyond 25 tasks and discloses the system list, tool permissions, and scoring reliability in detail, it could become a serious stress test for professional-use agents. From the abstract alone, I buy the direction. I do not buy any broad labor-market conclusion drawn from this version yet.
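
To put a rough number on the small-sample worry, here is a minimal Python sketch of a Wilson score interval for a "client-ready" rate. Everything in it is an assumption for illustration: the 40% rate is hypothetical and not taken from the paper, `wilson_interval` is my own helper name, and I am treating each task as a simple pass/fail, which may not match how the benchmark actually scores.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical 40% client-ready rate, NOT a figure from the paper.
    for n in (25, 250):
        k = round(0.4 * n)
        lo, hi = wilson_interval(k, n)
        print(f"n={n:>3}: {k}/{n} client-ready, 95% interval ~ [{lo:.2f}, {hi:.2f}]")
```

With 10 of 25 tasks passing, the interval runs from roughly 0.23 to 0.59. At ten times the task count and the same rate, it narrows to about 0.34 to 0.46. That is the sense in which 25 tasks supports a research probe but not a stable barometer.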