ax@ax-radar:~/all $ grep -v 'tier=excluded' stream.log
45 src · signal 72% · cycle 04:32

posts · 2026-04-01

98 items · updated 3m ago
RSS live
2026-04-01 · Wed
23:06
68d ago
● P1 · arXiv · cs.CL · atom · EN · 23:06 · 04·01
Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
The paper studies 2 instruction-tuned LLMs on 3 datasets and finds a compact set of circuits that writes inflated verbalized confidence at the final token position. These components cluster in mid-to-late MLP blocks and attention heads. The post says targeted inference-time interventions improve calibration, but does not disclose model names or effect sizes.
#Interpretability #Safety #Inference-opt #Research release
why featured
This clears HKR-H/K/R: strong hook, a testable mechanistic claim, and clear relevance to reliability. I kept it in the 78-84 band, not higher, because the summary does not disclose model names, effect size, or enough reproduction detail.
editor take
The paper finds overconfidence circuits in 2 instruction-tuned models. I buy the direction, but without model names or gains, this is not a general calibration fix yet.
sharp
The paper says a compact set of mid-to-late MLP blocks and attention heads writes inflated verbal confidence at the final token position in 2 instruction-tuned models across 3 datasets. I buy the basic framing. It targets a distinction the field keeps blurring: whether the model knows the answer, versus how it has learned to sound certain. In chat models, those two are often fused by SFT and preference tuning. The model gets rewarded for sounding complete, decisive, and helpful, so the failure mode is not just “wrong,” but “wrong in a polished register.” If this paper cleanly isolates circuitry for that verbal certainty layer, that matters.
The strongest part, at least from the snippet, is the choice of object. A lot of uncertainty work over the last year has stayed at the surface: token probabilities, self-consistency, verbal confidence prompts, or asking the model to rate its own certainty after answering. Those signals are related, but they are not the same thing. A model can have high next-token confidence and still learn to say “I’m not fully sure.” It can also be internally shaky and still produce a confident assistant tone because that style was reinforced during alignment. So a circuit-level result on verbalized overconfidence is more useful than another generic “calibration is hard” paper.
I also think the paper is tapping into a pattern that showed up in recent mechanistic work on sycophancy, refusal, and persona steering: a lot of behaviors that look like broad reasoning traits are partly local output-style edits. That does not make them trivial. It makes them actionable. If confidence inflation is written by a small set of heads and MLPs near the end, then prompt-level fixes like “say when you are uncertain” are even weaker than people hoped. Those prompts often just compete with a learned confident-assistant style. Inference-time circuit interventions, if they hold up, give you a more direct control point.
That said, I would not generalize this result yet. The snippet leaves out the model names, the intervention details, and the effect sizes. That is a big problem, not a small missing footnote. Different alignment stacks produce very different confidence styles. Llama chat variants, Qwen instruct models, Mistral instruct models, and proprietary assistant models do not all learn the same relationship between uncertainty and tone. I want to know if these were two sizes from the same family or two genuinely different training pipelines. I want the actual calibration metrics: ECE, Brier, selective risk, whatever they used. I want to know whether factual accuracy dropped, or whether the intervention mainly made the model sound more cautious. “Substantially improve calibration” is not enough without numbers.
I also have a conceptual pushback. Verbalized confidence is not identical to epistemic uncertainty. If you suppress the circuit that writes “I’m very confident,” you may just train the model to hedge better. That is useful for UX and safety, but it does not automatically mean the internal belief estimate got better. There is also a causal question here. The final token position is a natural place for many upstream factors to converge. Finding where the signal is written is not the same as finding where it originates. The paper may have localized the output edit rather than the full source of overconfidence.
There is a deployment concern too. Inference-time intervention almost always raises the trade-off question. What happens to answer completeness, fluency, task success, and long-form coherence when you damp these components? The snippet does not say. Nvidia-style “10x” claims trained the whole field to be skeptical of headline gains without deployment conditions; calibration papers deserve the same treatment. If you get a nicer ECE curve but the model starts over-hedging on easy questions, many product teams will reject the trade.
The outside context here matters. A lot of calibration methods in the last year looked decent on held-out benchmarks and then drifted badly across prompts and domains. System cards from the major labs have increasingly separated “can answer correctly” from “reports uncertainty appropriately” because they are different failure classes. This paper fits that split much better than broad truthfulness rhetoric does. If the circuitry replicates across model families, this becomes a serious bridge between interpretability and practical safety controls.
What I want next is straightforward. First, a base-model comparison. If the signal gets much stronger only after instruction tuning, that would directly implicate alignment objectives in inflated confidence style. Second, cross-domain transfer: does the same circuit show up in multilingual QA, code help, and medical-style advice, or is this mostly an English assistant artifact? Third, real intervention numbers with accuracy trade-offs. Until then, my take is: strong mechanistic hypothesis, promising control handle, incomplete evidence. Good paper to read closely. Not yet a general-purpose fix for model calibration.
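For readers who want the two metrics named above in concrete form, here is a minimal sketch of ECE and the Brier score over verbalized confidences. The function names, the equal-width binning, and the toy numbers are my own assumptions; the snippet does not say which metrics or binning the paper used.

```python
from typing import List


def brier_score(confidences: List[float], correct: List[bool]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    pairs = list(zip(confidences, correct))
    return sum((c - float(y)) ** 2 for c, y in pairs) / len(pairs)


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Equal-width bins; each bin adds |accuracy - mean confidence| weighted by its size."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (invented): same answers, same 50% accuracy, only the stated confidence changes.
outcomes = [True, False, True, False]
before = [0.95, 0.95, 0.95, 0.95]  # confident register
after = [0.55, 0.55, 0.55, 0.55]   # hedged register

ece_before = expected_calibration_error(before, outcomes)
ece_after = expected_calibration_error(after, outcomes)
```

On this toy data ECE falls from roughly 0.45 to roughly 0.05 and Brier from roughly 0.45 to roughly 0.25 while accuracy stays at 50%. That is the scenario the pushback describes: calibration numbers can improve purely because the model sounds more cautious, which is why accuracy should be reported next to them.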
HKR breakdown
hook knowledge resonance
open source
85
SCORE
H1·K1·R1
22:16
68d ago
● P1 · arXiv · cs.CL · atom · EN · 22:16 · 04·01
Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
The paper studies 8B to 120B models and finds that forcing sentence-level citations lowers attribution quality by 16% to 276% versus the best granularity. Attribution peaks at paragraph level; sentence-level breaks cross-sentence dependencies, while multi-paragraph citations add noise. The sharper result is that larger models are penalized more by fine-grained constraints, so citation granularity should match the model’s semantic scope.
#RAG #Benchmarking #Research release #Benchmark
why featured
HKR-H lands on the contrarian headline, HKR-K lands on the concrete ranges and mechanism, and HKR-R lands on a live RAG design tradeoff. This is not industry-shaking news, but it is a solid research release with practical implications, so 80 and featured.
editor take
The paper says sentence-level citations cut attribution quality by 16%–276%. I buy that; too many RAG stacks confuse finer with truer.
sharp
The paper reports that sentence-level citations reduce attribution quality by 16% to 276% versus the best granularity across 8B to 120B models. I mostly buy the result, because it hits a very common RAG mistake: teams treat the citation unit that is easiest for humans to audit as the evidence unit that is best for the model to reason over.
What matters here is not just “paragraphs often work better.” A lot of people building RAG systems already have that intuition. The useful part is the size of the penalty, and the more uncomfortable signal in the summary: larger models are punished more by sentence-level constraints. The snippet says this scale effect is non-monotonic across 8B to 120B, but the body we have does not disclose the model names, datasets, metrics, or where that 276% worst-case gap appears. That missing detail matters. Without it, you should not turn this into a blanket production rule.
I’ve long thought that many citation systems are designed for reviewer UX, not for evidence integration. Human reviewers like a neat footnote attached to a single sentence. Models often do not. If a claim only becomes grounded when two or three neighboring sentences are read together, forcing sentence-level retrieval and citation can break the evidence chain. You see this a lot in long-form summaries, comparison questions, and answers with qualifiers. One sentence gives the subject, the next adds a condition, a third gives the conclusion. Slice that into atomic units and the system often retrieves half the logic, then cites something that looks precise but is actually less faithful.
That cuts against a lot of defaults from the last year. Many LangChain and LlamaIndex-style tutorials pushed smaller chunks because they improved retrieval specificity and made citations look cleaner in the UI. I’ve seen plenty of systems run with chunk sizes around 128 or 256 tokens plus overlap as a patch. Overlap helps with boundary loss, but it is not semantic composition. It does not replace the model’s ability to bind evidence at the paragraph scale. If this paper’s methodology holds up, it is a direct correction to that default design instinct.
My stronger read is that the paper is also bad news for a whole class of pipelines that retrieve sentence snippets first and ask the model to assemble the answer afterward. The capability gains in stronger models over the last two years have not been about sentence-local extraction. They have been about cross-sentence synthesis, conditional reasoning, disambiguation, and compression. If you force evidence alignment at the sentence level, you drag the system back toward extractive QA behavior. The summary says citation-optimal granularity preserves or even improves answer correctness. That is the important part. The constraint is not just making citations uglier; it is interfering with generation itself.
I still have two pushbacks. First, the summary does not say how “attribution quality” is defined. Citation precision and recall, claim support, and human preference can point to different optimal granularities. Second, domain matters a lot. Legal, medical, and financial use cases often require near sentence-level verifiability. Open-domain synthesis and enterprise knowledge Q&A usually benefit from paragraph-scale evidence. If the paper pools these together into a single average, its engineering guidance becomes much weaker.
So I would not translate this into “always use paragraph citations.” I don’t buy that either. The more credible takeaway is that granularity should be a tuned variable, maybe even claim-adaptive. Short factual claims can use sentence-level evidence. Claims that depend on definitions, qualifiers, or cross-sentence causality should probably use paragraph-level evidence. Multi-paragraph citations only make sense when the source structure is unusually coherent. The summary points in that direction, but it does not say whether the authors stratified by claim type. If they did not, the paper stops one step short of the deployment question.
There is also a broader context outside the article. A lot of “answers with citations” products have spent the last year treating citation density as a proxy for trust. That habit comes from search snippets. Generative systems are different. They need an evidence window that is semantically closed, not the smallest clickable unit. This paper, if the full methods section is solid, is a useful reminder that auditability and model-friendly grounding are related goals, not identical ones.
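To make the “claim-adaptive” idea above concrete, here is a rough sketch of a router that picks a citation granularity per claim. Everything in it is my own assumption for illustration: the cue-word list, the word-count threshold, and the `source_is_coherent` flag are hypothetical, and nothing here comes from the paper.

```python
import re
from enum import Enum


class Granularity(Enum):
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    MULTI_PARAGRAPH = "multi_paragraph"


# Hypothetical cue words that hint at qualifiers, definitions, or causal links.
DEPENDENCY_CUES = {
    "because", "unless", "if", "therefore", "however",
    "whereas", "defined", "provided", "except", "although",
}


def choose_granularity(
    claim: str, source_is_coherent: bool = False, max_simple_words: int = 15
) -> Granularity:
    """Pick the smallest evidence unit that is likely to be semantically closed."""
    words = re.findall(r"[a-z']+", claim.lower())
    has_dependency = any(w in DEPENDENCY_CUES for w in words)
    # Short claims with no qualifier cues can be checked against one sentence.
    if not has_dependency and len(words) <= max_simple_words:
        return Granularity.SENTENCE
    # Very long claims only get multi-paragraph evidence when the source is coherent.
    if source_is_coherent and len(words) > 2 * max_simple_words:
        return Granularity.MULTI_PARAGRAPH
    return Granularity.PARAGRAPH


simple = choose_granularity("The model has 8B parameters.")
qualified = choose_granularity("The drug is safe unless the patient is pregnant.")
```

A keyword heuristic like this is only a placeholder. A real system would more plausibly tune the routing rule against an attribution metric per domain, which is exactly the stratification the summary does not confirm the authors did.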
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
21:59
68d ago
arXiv · cs.CL · atom · EN · 21:59 · 04·01
The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
The study uses a Random Forest to classify Modern Hindi near-synonyms by etymology, separating Sanskrit-origin from Perso-Arabic-origin words using embeddings alone. The RSS snippet says the model worked even on semantically unrelated words, but the post does not disclose accuracy, dataset size, or feature details. The key point is that context is tested as a measurable carrier of etymological signal, not just a linguistic intuition.
#Embedding #Benchmarking #Research release
why featured
This is a computational-linguistics case study with no clear agent, product, or industry implication, so it fits the hard-exclusion pattern for off-lane crossover research. HKR-H/K/R all miss: the hook is niche, and the post omits key numbers and reproduction details.
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H0·K0·R0
21:17
68d ago
● P1 · arXiv · cs.CL · atom · EN · 21:17 · 04·01
Test-Time Scaling Makes Overtraining Compute-Optimal
The paper proposes Train-to-Test (T²) scaling laws that jointly optimize model size, training tokens, and inference samples under a fixed end-to-end budget. Across eight downstream tasks, adding inference cost shifts the compute-optimal point toward heavy overtraining, and the result still holds after post-training; the post does not disclose exact budget values or model sizes.
#Reasoning #Inference-opt #Benchmarking #Research release
why featured
Strong research-release story: T² explicitly prices test-time sampling into the compute budget and finds overtraining is optimal across 8 tasks. HKR-H/K/R all pass, but key budget and model-size details are not disclosed in the provided text, so it stays below must-write.
editor take
This paper moves compute-optimal from the training ledger to the deployment ledger. Chinchilla isn't dead; the objective changed under sampled inference.
sharp
The paper jointly optimizes model size, training tokens, and inference samples under a fixed end-to-end budget, and across 8 tasks it pushes the optimum into overtraining. My read is simple: this does not kill Chinchilla. It patches the half Chinchilla intentionally left out: inference. Once pass@k and repeated sampling enter the same budget, “smaller model, fewer train tokens, sample more at test time” stops looking obviously efficient.
I buy the direction of the claim. Over the last year, test-time scaling stopped being a research curiosity and became a production cost center. On coding, math, and agentic tasks, best-of-n, self-consistency, reranking, and parallel rollouts all burn real inference dollars. Chinchilla assumed training compute dominated total cost. In these settings, that assumption often fails. DeepMind’s original result answered how to trade parameters against training tokens during pretraining; it did not answer whether a deployed system should sample 1, 8, or 32 times per request. T² is trying to connect those two ledgers.
That said, I’m not ready to take “radically into the overtraining regime” at face value from this snippet alone. The abstract does not disclose the actual budget values, model sizes, sampling ranges, or task list details. Those missing pieces matter a lot. If k ranges from 1 to 4, the optimum can look very different from a setup where k goes to 32 or 64. If rewards are highly verifiable, pass@k gains are unusually strong. If the task is open-ended writing or fuzzy judgment, the economics change. The paper says the result holds on 8 downstream tasks, which is better than many scaling-law papers, but without the task identities and evaluation protocol I would not generalize this into a universal law.
There is also a product implication that a lot of teams will not like. If T² holds, the familiar strategy of staying close to a Chinchilla-style training optimum and then buying back capability with heavy sampled inference may be financially suboptimal. You would want to move some budget forward into pretraining to reduce sampling demand later. I’ve long thought reasoning products would run into this wall: extra test-time compute can lift pass@k nicely, but once request volume scales, the marginal cost catches up fast. This paper gives that intuition a cleaner formal frame.
The key missing number for me is how much of the effect survives post-training. The abstract says it does survive, which is important. But by 2025 a lot of frontier-model gains were already coming from post-training stacks (SFT, RL/RFT, tool use, verifiers, and routing), not just raw pretraining. If post-training shrinks the overtraining advantage from large to modest, the research conclusion and the business conclusion diverge fast. Right now the title gives the direction, but the disclosed text does not give enough to fill in an actual budget spreadsheet.
So I’d treat this as a serious correction term, not a new scripture. It says teams should stop optimizing only pretraining FLOPs and start optimizing lifetime FLOPs. If your product leans on frequent sampled inference, you probably need to retrain your intuition about how far the base model should be trained.
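The “two ledgers” point is easy to see with back-of-envelope arithmetic. The sketch below uses the common dense-transformer approximations of about 6·N·D FLOPs for training and about 2·N FLOPs per generated token for inference. These are not the paper's T² law, and every configuration number is hypothetical, since the abstract discloses no budgets or model sizes.

```python
def training_flops(n_params: int, train_tokens: int) -> int:
    """Common approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * train_tokens


def inference_flops(
    n_params: int, tokens_per_sample: int, samples_per_request: int, requests: int
) -> int:
    """Common approximation: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params * tokens_per_sample * samples_per_request * requests


def lifetime_flops(
    n_params: int,
    train_tokens: int,
    tokens_per_sample: int,
    samples_per_request: int,
    requests: int,
) -> int:
    """Training ledger plus deployment ledger in one number."""
    return training_flops(n_params, train_tokens) + inference_flops(
        n_params, tokens_per_sample, samples_per_request, requests
    )


# Hypothetical configurations; quality parity between them is NOT implied.
# A: 8B params, 160B tokens (about 20 tokens per parameter), k = 16 samples per request.
chinchilla_style = lifetime_flops(8 * 10**9, 160 * 10**9, 1000, 16, 10**9)
# B: 4B params, 800B tokens (about 200 tokens per parameter), k = 4 samples per request.
overtrained = lifetime_flops(4 * 10**9, 800 * 10**9, 1000, 4, 10**9)

inference_share_a = inference_flops(8 * 10**9, 1000, 16, 10**9) / chinchilla_style
```

With a billion requests, inference is about 97% of configuration A's lifetime compute, and the smaller, longer-trained configuration B comes out roughly five times cheaper overall. Whether B actually matches A's task accuracy at k = 4 is the empirical question the paper claims to answer; this arithmetic only shows why the training-only ledger can mislead.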
HKR breakdown
hook knowledge resonance
open source
87
SCORE
H1·K1·R1
20:07
68d ago
● P1 · arXiv · cs.CL · atom · EN · 20:07 · 04·01
Open-Domain Safety Policy Construction
The paper presents Deep Policy Research, which drafts full moderation policies from human-written seed domain info and is evaluated across 5 domains with 2 compact reader LLMs. It uses one web search tool plus lightweight scaffolding to iterate on queries, distill web rules, and build an indexed policy; on the OpenAI undesired content benchmark and an in-house multimodal ad benchmark, it beats definition-only and in-context baselines. The key signal: under the same seed setup, it also outperforms a general-purpose deep research system, and the code is released.
#Safety #Agent #Multimodal #OpenAI
why featured
This is a practical safety paper, not a generic benchmark bump. HKR-H/K/R all pass: the angle is novel, the post gives a concrete search-based mechanism plus multi-benchmark results, and the workflow maps to a real moderation ops pain point; still, as an arXiv paper, it falls in
editor take
This is less a new safety breakthrough than a reminder that constrained research loops often beat generic “deep research” agents.
sharp
The paper uses one web search tool to draft moderation policy across 5 domains. That fact matters more than the safety label itself. What it is really testing is whether task structure can substitute for heavier models and heavier human labor.
I mostly buy the core claim. Policy drafting is not open-ended writing. It is a fairly rigid pipeline: retrieve, deduplicate, normalize, and index. The failure modes are also predictable: missing rules, bad source attribution, contradictory clauses, and weak transfer across domains. DPR leans into that structure. One search tool, lightweight scaffolding, iterative query generation, then an indexed policy document. That is a deliberate reduction in agent freedom. In practice, cutting freedom often improves stability. A lot of teams building enterprise research agents over the last year ran into exactly this problem: the issue was not that the model failed to find information; it found too much, in too many styles, with poor traceability.
The comparison target is where I want more detail. The summary says DPR beats a general-purpose deep research system under the same seed setup and evaluation protocol. Fine, but the snippet does not say which system, which model, how many search rounds, or what token/tool budget it had. That gap matters. If the opponent is a default generic research agent, winning is not surprising. If the opponent was tuned for policy synthesis and DPR still wins cleanly, that is a stronger result. The RSS text does not give enough to settle that.
My read is that the paper’s value is less “AI writes safety policy now” and more “policy authoring should be treated as an engineering loop before policy learning is treated as a modeling problem.” A lot of safety work jumped straight to classifiers or LLM judges and assumed the policy text was already stable. In actual deployments, drafting and maintaining the policy is often the expensive part, especially in ads, finance, minors, health, and region-specific compliance. The source material is fragmented across regulator pages, platform rules, industry codes, and internal exceptions. Updates happen weekly in some domains. If you can make collection, distillation, indexing, and review into a cheap loop, you get a practical advantage long before you get perfect moderation quality.
I still have a few reservations. First, benchmark wins on undesired-content datasets are not the same as surviving real moderation operations. The hard part in production is not writing a clause like “disallow X.” It is operationalizing conflicting clauses, handling appeals, regional variance, effective dates, and business exceptions. Second, the paper uses 2 compact reader LLMs, but the snippet does not name them, give context length, or show cost comparisons against expert-written policies. Without that, it is hard to tell whether the gain comes from the research loop itself or from a reader model that happens to benefit from a structured indexed document. Third, I would be careful with the in-house multimodal ad benchmark. Ad moderation is famously platform-specific. Datasets that encode one platform’s policy style often look strong in-domain and then degrade fast elsewhere.
There is also a broader pattern here. Over the last year, “deep research” products kept adding templates, citation slots, mandatory stages, and fixed output schemas. That is not cosmetic. It is the industry quietly admitting that generic research agents are weak delivery systems for high-audit tasks. DPR is a clean instance of that move in the safety-policy setting. The code release helps because systems like this are only useful if people can inspect the loop, not just the final score.
So my take is straightforward: the paper does not prove that automated safety policy generation is solved. It does show that, for rule-dense and audit-heavy work, narrow toolchains plus hard structure are currently a better product shape than broad “research anything” agents. The next evidence I want is simple: how well it handles policy updates over time, and how much reviewer time it actually saves versus experts drafting from scratch. The snippet does not disclose either, so I would not overclaim from this result yet.
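The “retrieve, deduplicate, normalize, and index” shape described above can be sketched in a few lines. This is a toy illustration under my own assumptions, not the released DPR code: retrieval is stubbed out as a list of already-fetched `(source, rule)` pairs, the `example.org` URLs are placeholders, and the keyword index is deliberately naive.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def normalize(rule: str) -> str:
    """Lowercase, collapse whitespace, drop a trailing period."""
    return re.sub(r"\s+", " ", rule.strip().lower()).rstrip(".")


def build_policy_index(
    snippets: List[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """Deduplicate rules while keeping every source, then index rules by keyword."""
    rules: Dict[str, List[str]] = {}  # normalized rule -> sources that state it
    for source, text in snippets:
        sources = rules.setdefault(normalize(text), [])
        if source not in sources:
            sources.append(source)  # keep attribution for audit
    index: Dict[str, Set[str]] = defaultdict(set)
    for rule in rules:
        for term in set(re.findall(r"[a-z]{4,}", rule)):
            index[term].add(rule)
    return rules, dict(index)


# Placeholder inputs standing in for distilled web search results.
snippets = [
    ("https://example.org/a", "Do not advertise  tobacco to minors."),
    ("https://example.org/b", "do not advertise tobacco to minors"),
    ("https://example.org/b", "Alcohol ads require age gating."),
]
rules, index = build_policy_index(snippets)
```

The point of keeping a source list per rule is traceability: a reviewer can see which pages back each clause. That is the property generic research agents tend to lose when they summarize freely.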
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
20:03
68d ago
● P1 · arXiv · cs.CL · atom · EN · 20:03 · 04·01
No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
The paper defines unintentional cross-user contamination in shared-state LLM agents and reports 57–71% contamination rates across two shared-state mechanisms. It introduces a three-type taxonomy and a controlled evaluation; write-time sanitization helps for conversational shared state, but executable artifacts still leave substantial residual risk and often produce silent wrong answers. The key issue is artifact-level defense, not text-only sanitization.
#Agent #Safety #Memory #Research release
why featured
HKR-H lands on the 'no attacker needed' hook; HKR-K lands on the 57%–71% rates and defense limits; HKR-R lands because shared-state agent bugs map to real multi-tenant risk. Strong paper, but still arXiv research, not a market-moving launch.
editor take
This paper turns “shared memory by default” into a high-risk design choice: 57–71% contamination is not a corner-case bug.
sharp
The paper reports 57–71% cross-user contamination across two shared-state mechanisms. That number alone is enough to move “shared memory for team agents” out of the convenience bucket and into the reliability-and-safety bucket. The uncomfortable part is that this is not poisoning, prompt injection, or an access-control breach. The setup is benign users, benign writes, and later reuse that applies one user’s scope-bound state to another user’s task. A lot of agent products spent the last year selling continuity across sessions. This paper is a blunt reminder that continuity becomes its own failure source when scope is weak.
I buy the core claim because it hits a very common 2025 design pattern. Teams built agent memory as a blend of profile, chat history, retrieved notes, tool outputs, and workspace artifacts, then treated the whole layer as “persistent intelligence.” That is a category error. A bad conversational summary often causes style drift or a wrong assumption. A bad executable artifact changes behavior. The abstract’s most useful point is exactly there: write-time sanitization helps when the shared state is conversational, but substantial residual risk remains when the shared state contains executable artifacts. That tracks with how these systems actually fail. Text can be filtered, classified, rewritten, or tagged with scope metadata. Artifacts like SQL, scripts, configs, spreadsheets, and derived files carry operational semantics. If the system later treats them as reusable truth, the failure is no longer a retrieval mistake; it becomes action grounded in the wrong user context.
There is also a bigger evaluation gap here. Most public safety work around agents has focused on adversarial memory poisoning, prompt injection, tool misuse, and exfiltration. That emphasis makes sense, but it also biases teams toward attacker-centric testing. I haven’t seen many public evals from major vendors that treat non-adversarial cross-user contamination as a first-class benchmark. If that still holds, this paper is filling a real hole rather than naming an edge case. You can harden against explicit attacks and still ship a system that silently reuses normal organizational residue in the wrong place.
I do have some pushback because the article is only an abstract. It does not disclose the exact shared-state mechanisms, the model lineup, the task mix, the contamination metric, or the sanitization rules. A 57–71% rate is alarming, but the deployment relevance depends on setup. If the benchmark heavily encourages reuse from shared state, the rate will run hotter than a system where shared memory is advisory. I also want the missing breakdown: how many failures were silent wrong answers, how stable the pattern is across model families, and whether tool-using agents behave materially worse than chat-only systems. The title and abstract establish the direction; they do not establish the full boundary conditions.
Even with that caveat, the engineering implication is already pretty clear. Shared memory cannot be treated like a general team datastore, and executable artifacts cannot be treated like harmless text. Scope has to be enforced at the object level, not by slapping a user tag onto retrieved chunks. Before an artifact enters shared state, I’d want provenance, ownership, TTL, and execution policy attached to it. Otherwise sanitization just produces cleaner contamination.
Honestly, this paper makes a lot of current “team agent” product design look too casual. If your agent can inherit another user’s script, query, or intermediate result, you need to prove the isolation semantics are stronger than the retrieval semantics. Nothing in the abstract suggests the industry has done that work yet.
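Here is what “provenance, ownership, TTL, and execution policy attached to the artifact” could look like as a data structure. It is a minimal sketch under my own assumptions: the class, field, and policy names are hypothetical, and the abstract does not describe any such mechanism.

```python
from dataclasses import dataclass
from enum import Enum


class ExecPolicy(Enum):
    TEXT_ONLY = "text_only"    # may be shown as context, never executed
    OWNER_ONLY = "owner_only"  # executable only for the user who wrote it
    SHARED = "shared"          # explicitly reviewed for cross-user execution


@dataclass(frozen=True)
class SharedArtifact:
    content: str
    owner: str
    provenance: str       # where it came from, e.g. a session or tool-call id
    created_at: float     # seconds since epoch
    ttl_seconds: float
    exec_policy: ExecPolicy


def may_execute(artifact: SharedArtifact, user: str, now: float) -> bool:
    """Object-level check performed at reuse time, independent of retrieval ranking."""
    if now - artifact.created_at > artifact.ttl_seconds:
        return False  # expired state is never acted on
    if artifact.exec_policy is ExecPolicy.SHARED:
        return True
    if artifact.exec_policy is ExecPolicy.OWNER_ONLY:
        return user == artifact.owner
    return False  # TEXT_ONLY artifacts are never executable


query = SharedArtifact(
    content="SELECT * FROM orders WHERE region = 'EU'",
    owner="alice",
    provenance="session-123/tool-call-7",  # hypothetical identifier
    created_at=0.0,
    ttl_seconds=3600.0,
    exec_policy=ExecPolicy.OWNER_ONLY,
)
```

The design choice that matters is that the check lives on the object and runs when the artifact is about to be used, so a high retrieval score for another user's script cannot override it.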
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
20:01
68d ago
● P1 · arXiv · cs.CL · atom · EN · 20:01 · 04·01
Procedural Knowledge at Scale Improves Reasoning
The paper introduces Reasoning Memory, a reasoning RAG system built from 32 million subquestion-subroutine entries to retrieve procedural knowledge at test time. Across 6 math, science, and coding benchmarks, it reports up to 19.2% over no retrieval and 7.9% over the strongest compute-matched baseline. The key signal is the decomposition and retrieval design, not just more sampling.
#RAG #Reasoning #Benchmarking #Research release
why featured
HKR-H/K/R all pass: the paper has a novel mechanism, concrete numbers, and a strong cost-efficiency nerve for practitioners. Still, this is a single research release with no clear cross-source breakout yet, so it fits the 78-84 band, not 85+.
editor take
This paper pushes test-time scaling one step forward: less blind sampling, more retrieval over 32M procedural traces. I buy the direction, not the victory lap; the 7.9% gain says something, but the Oo
sharp
The authors built a 32 million-entry subquestion-subroutine datastore and report gains up to 19.2% over no retrieval and 7.9% over the strongest compute-matched baseline across six benchmarks. My read is pretty simple: this is not the old “RAG for reasoning” pitch. It is a cleaner claim that procedural memory can substitute for, or amplify, test-time compute. I buy that direction more than I buy most recent test-time-scaling hype, because the field spent the last year pouring budget into more sampling, deeper trees, and longer chains of thought while mostly ignoring a basic question: has the model already seen a reusable way to attack this kind of subproblem? The strongest design choice here is the unit of retrieval. They are not retrieving full documents, and they are not retrieving full reasoning trajectories. They decompose trajectories into self-contained subquestion-subroutine pairs. That matters. Anyone who has worked on agent loops or long CoT systems has seen full-trajectory retrieval pull in too much junk: high semantic overlap, wrong operative step. By shrinking memory to “what was the local problem” plus “what procedure solved it,” retrieval targets operational similarity instead of topical similarity. This feels closer to what worked in code assistants when systems moved from retrieving whole files toward smaller API or edit patterns. I haven’t rerun this paper, but the intuition lines up with a lot of practical experience. I still have a few reservations. First, the snippet gives the headline deltas, but not enough of the accounting. The body does not disclose the base model size, absolute benchmark scores, latency hit, index cost, or the budget allocation per benchmark. Without that, the 7.9% number is hard to price. Is this a cheap gain from better memory organization, or a complex system trading substantial engineering overhead for a modest edge? For practitioners, that distinction is the whole story. 
Second, the source of the 32M entries matters a lot. They come from existing corpora of step-by-step reasoning trajectories. That raises the usual contamination-adjacent concern, even if this is not literal benchmark leakage. If the source trajectories encode the stylistic habits of the same benchmark families, the model may be retrieving task templates dressed up as procedural knowledge. The paper says it beats document, trajectory, and template retrieval, which is a good sign. I still want stronger isolation tests: splits by data source, by problem family, by time, and ideally by synthetic perturbation where surface forms change but underlying procedures stay constant. The broader context is important here. Since the o1/o3 wave, the market has mostly treated “better reasoning” as “more inference budget”: longer thinking, more branches, more reranking. Anthropic and Google pushed variants of the same idea with more deliberate reasoning flows. This paper points to a different bottleneck. A lot of hard tasks do not need raw extra compute first; they need a good intermediate representation of the subproblem and a way to fetch a useful procedure. That is much closer to how people work. You do not brute-force every math proof from scratch. You identify the substructure, recall the relevant move, then adapt it. That is why I think the biggest downstream impact, if this holds up, is not benchmark math. It is code repair, long-horizon agents, and enterprise workflows with recurring structures under changing surface requests. Those settings are full of repeated local procedures: parse logs, isolate failure mode, choose fix path, verify, backtrack. A procedural memory layer fits that pattern better than naive long-context stuffing. My pushback is on out-of-distribution behavior. Procedural memory has two classic failure modes: forcing an old recipe onto a new problem, and over-committing too early because retrieval feels authoritative. 
The abstract says the system reasons under diverse retrieved subroutines as implicit procedural priors. Good. But the snippet does not show how robust that is when retrieval is wrong. I want failure cases. Does a bad hit make the model more confident and less likely to backtrack than the no-retrieval baseline? If yes, deployment gets harder fast. Then you need confidence estimation, fallback logic, maybe competing retrievers, not just a bigger memory store. So my take is: the direction is strong, the victory lap is premature. The paper’s signal is not “RAG is back.” It is that procedural memory is finally being treated as a first-class systems problem rather than a vague intuition. If later replications show the gains survive under fixed latency and fixed dollar cost, this will matter more in practice than yet another longer CoT prompt. If the gains mostly come from benchmark-family pattern reuse, it will stay a nice paper. Right now the snippet is not enough to separate those outcomes cleanly.
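To make the retrieval-unit argument concrete, here is a minimal sketch, not the authors' implementation. It stores toy subquestion-subroutine pairs and retrieves by bag-of-words cosine similarity on the subquestion alone. The three entries, the `retrieve` helper, and the similarity function are all illustrative assumptions; the snippet does not describe the paper's retriever or the contents of its 32M-entry store.

```python
import math
import re
from collections import Counter

# Hypothetical toy datastore: each entry pairs a local subquestion with the
# procedure that solved it. The contents are made up for illustration.
DATASTORE = [
    ("find the roots of a quadratic equation",
     "compute the discriminant, then apply the quadratic formula"),
    ("count the number of ways to choose k items from n",
     "use the binomial coefficient n! / (k! (n-k)!)"),
    ("find which commit introduced a failing test",
     "bisect the commit range, running the test at each midpoint"),
]


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(subquestion, k=2):
    """Return the k entries whose stored subquestion best matches the query.

    Matching is done on the subquestion only, so the procedure text never
    influences ranking: the key is "what was the local problem".
    """
    query = bag_of_words(subquestion)
    scored = [(cosine(query, bag_of_words(q)), q, proc) for q, proc in DATASTORE]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


best_score, best_question, best_procedure = retrieve(
    "what are the roots of this quadratic", k=1
)[0]
```

The design point is the key/value split: similarity is computed over the local problem statement, and the procedure rides along as the payload. A wrong hit still returns a confident-looking procedure, which is exactly the failure case the commentary above asks about.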
HKR breakdown (hook · knowledge · resonance): H1·K1·R1 · SCORE 87
18:51
68d ago
X · @Yuchenj_UW · x-api · MULTI · 18:51 · 04·01
The leaked Claude Code hit 110k+ GitHub stars in a day, making OpenClaw look slow
A leaked Claude Code build got 110k+ GitHub stars in one day, and the post says it became Anthropic's No. 1 open-source project by that metric. The RSS snippet does not disclose the repo URL, measurement method, exact timing, or OpenClaw's comparison numbers. The real point to watch is whether leak-driven distribution changed adoption speed.
#Code#Tools#Anthropic#Open source
why featured
HKR-H and HKR-R land: a leaked Claude Code repo allegedly hitting 110k stars in one day is clickable and relevant to dev-tool adoption. HKR-K fails because the post gives no repo link, measurement window, or baseline, so hard-exclusion-6 caps it below 40.
HKR breakdown (hook · knowledge · resonance): H1·K0·R1 · SCORE 42
18:05
68d ago
● P1 · arXiv · cs.CL · atom · EN · 18:05 · 04·01
Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
The paper scales reasoning tokens for competitive programming with RL and parallel thinking. Starting from Seed-OSS-36B, a 16-thread, 16-round system matches the RL model’s oracle pass@16 at pass@1, using 7.6M tokens per problem on average, and beats GPT-5-high on 456 hard AetherCode problems.
#Reasoning#Code#Benchmarking#Research release
why featured
HKR-H/K/R all pass: the headline has a sharp hook, the paper provides reproducible settings, and the result feeds directly into the test-time compute debate. I stop at 82 because this is still a benchmark-centric arXiv result, not a broad product or general-use capability release
editor take
The paper gets Seed-OSS-36B to oracle-pass@16 at pass@1 with 7.6M tokens; this looks like sampling engineering turned into training, not a sudden reasoning leap.
sharp
The paper puts Seed-OSS-36B into a 16-thread, 16-round pipeline and reports beating GPT-5-high on 456 AetherCode problems. My read is pretty blunt: the important contribution is not that the model suddenly “reasons better.” It is that the authors package test-time search, verification, and RL into one training-aligned loop, then convert noisy sampling gains into something that looks like a stable system-level improvement. The headline number is 7.6 million tokens per problem on average. That immediately sets the boundary on how to read this result. It proves an upper bound under huge budget. It does not prove efficiency. Competitive programming is unusually friendly to this setup: long deliberation is acceptable, and compilers, unit tests, and sample-based checking give you strong verifiers. Once you have that, you can spend absurd token budgets and use parallel threads to compress pass@k gains into pass@1 outcomes. That pattern is not new. The code stack has been moving this way for a while: when one rollout is not enough, you add more samples, stronger verifiers, and reranking. What this paper does differently is pull that structure into training, so the model is optimized for a 16×16 generate-verify-refine loop rather than being asked to improvise under it. I buy the two empirical rules in principle. Verification RL warmup raising the starting point makes sense. Code rewards are sparse, so pushing the policy into a “can write compilable, partially correct programs” region before full RL should help a lot. The randomized clipping claim is more interesting, and I’m more cautious there. The snippet says it steepens the log-linear accuracy curve, but it does not disclose the exact clipping scheme, ranges, advantage handling, or how robust the effect is across checkpoints and datasets. Without that, I’d treat it as a promising training trick, not a general law. 
RL-for-code has seen this movie before: a smooth curve in one setup, then the gain shrinks fast when the verifier or benchmark changes. There is also a broader context here that the paper only hints at. Over the last year, much of the apparent progress in “reasoning” has really been progress in allocating more compute at inference, then wrapping that compute with stronger selection. OpenAI’s reasoning-style systems, Anthropic’s coding workflow push, and a lot of open-source agent scaffolding all lean on the same basic truth: one thought is weak, many checked attempts are strong. This paper matters because it says that for competitive programming, you should not keep pretending search is an afterthought. Train for the search structure directly. That is why the “beats GPT-5-high” line needs restraint. The snippet gives the dataset name and the 456-problem count, but not the evaluation protocol details that actually decide how meaningful the comparison is. What was GPT-5-high’s token budget? Was it single-sample or multi-sample? Were tools allowed? What were the timeout limits, temperatures, and retry policies? None of that is disclosed in the text we have. If the baseline is a relatively standard deployment and this system gets 16×16 rounds of refinement with a verifier-heavy loop, then the comparison is mostly about who uses budget and search better, not a clean model-versus-model intelligence result. That still matters. It just measures a narrower thing than the headline suggests. The practical constraint is obvious too: 7.6M tokens per problem works on a benchmark designed for hard, valuable, verifiable tasks. It does not transfer cleanly to everyday software work. Most real engineering workflows will not pay that latency or cost for routine PR review, bug triage, CRUD feature work, or codebase Q&A. So the near-term deployment lane is narrower than the benchmark result suggests. 
I’d expect this style of system to shine first in high-value, low-frequency, verifier-rich domains: contest programming, formal methods, theorem proving, difficult migrations, maybe parts of EDA scripting. Outside verifier-heavy environments, a lot of “parallel thinking” collapses into expensive self-talk. One more pushback: the field keeps talking about inference-time scaling as if more tokens reliably buy more intelligence. My experience is that the curve is highly task-shaped. Math and code keep rewarding extra budget because they have local checkability. Open-ended writing, product judgment, and fuzzy requirement synthesis flatten much sooner. This paper picked one of the best possible terrains for the method. That is fair, but readers should not casually export the result to all reasoning workloads. So I like this paper, with caveats. It breaks “reasoning” into operational pieces—warmup, clipping, parallel search, end-to-end alignment—and that is useful. It also feels more honest than papers that imply all gains come from a single sample thinking harder. My reservation is simple: the snippet does not disclose cost, latency, verifier details, or the full comparison protocol, so “surpasses GPT-5-high” is a strong signal, not a final verdict. Honestly, this reads to me as a very good search-budget engineering paper for code, more than proof that a new reasoning regime has arrived.
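Two pieces of arithmetic help when reading the pass@1 versus oracle pass@16 claim. The sketch below uses the commonly cited unbiased pass@k estimator, plus a back-of-envelope split of the 7.6M-token average across the 16×16 grid. The sample counts are hypothetical, and the even split across threads and rounds is my assumption; the snippet does not say how the budget is actually distributed.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct solution.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 16 samples drawn, 2 of them pass the tests.
single_shot = pass_at_k(n=16, c=2, k=1)    # chance one random sample passes
oracle_16 = pass_at_k(n=16, c=2, k=16)     # chance at least one of 16 passes

# Budget per (thread, round) slot, ASSUMING an even split of the average.
THREADS, ROUNDS = 16, 16
AVG_TOKENS_PER_PROBLEM = 7_600_000
tokens_per_slot = AVG_TOKENS_PER_PROBLEM / (THREADS * ROUNDS)
```

On this toy problem the gap between `single_shot` and `oracle_16` is the distance the 16-thread, 16-round system has to close with selection and refinement. Under the even-split assumption each slot gets roughly 30K tokens, which is a long reasoning trace on its own.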
HKR breakdown (hook · knowledge · resonance): H1·K1·R1 · SCORE 88
17:58
68d ago
arXiv · cs.CL · atom · EN · 17:58 · 04·01
Universal YOCO for Efficient Depth Scaling
Universal YOCO combines the YOCO decoder architecture with recursive computation, restricting shared-parameter iterations to shallow efficient-attention layers for cheaper depth scaling at inference. The snippet says it keeps a constant global KV cache and linear prefilling, but the post does not disclose model size, iteration count, or benchmark scores. The key point is not more depth alone, but depth added under tighter inference cost control.
#Reasoning#Inference-opt#Benchmarking#Research release
why featured
HKR-K lands on a concrete mechanism: recursive shallow-layer sharing, constant global KV cache, and linear prefill. The score stays moderate because the body does not disclose model size, iteration count, or benchmark numbers, and the appeal is mostly limited to model-arch teams.
editor take
YOCO-U puts recursion into shallow attention layers to buy more depth with a constant global KV cache; the idea is solid, the evidence is thin.
sharp
YOCO-U makes a very specific bet: keep recursion inside shallow efficient-attention layers, keep the global KV cache constant, keep prefilling linear, and use that package to buy more inference-time depth at a lower serving cost. I buy the direction. Test-time scaling has been useful for reasoning, but the constraint was never “can we loop more.” The constraint was that every extra pass tends to drag latency, memory, and KV growth with it, which turns inference-time compute into a luxury feature. The problem is that this paper, at least from the snippet, is still mostly a mechanism claim. We get the architecture story, but not the numbers that decide whether this survives contact with deployment. The body here does not disclose model size, iteration count, training budget, context lengths, throughput, latency, memory footprint, or concrete benchmark deltas. “Highly competitive” is doing a lot of work. Competitive against a standard decoder-only Transformer? Against the original YOCO? Against recurrent Transformer variants? Against other efficient long-context designs? Right now that part is undisclosed. I’d place this in a broader pattern from the last 18 months. A lot of labs have been pushing on two fronts at once: use more test-time compute to improve reasoning, and redesign attention or memory so that extra compute does not blow up serving economics. You saw one branch in explicit long-thinking products and another in papers on recurrence, latent iteration, state-space hybrids, and linear-attention variants. The shared issue is simple: extra computation often improves scores, but the system bill grows faster than the benchmark gain. YOCO-U is interesting because it does not apply recursion across the whole stack. It confines the loop to shallow layers, which feels like an engineer’s compromise rather than a paper trick. I do have a strong pushback here: a constant global KV cache does not automatically mean lower end-to-end cost. Serving cost is not just KV. 
Once you introduce shared-parameter iterations, you also introduce questions about serial dependence, kernel scheduling, batching efficiency, compiler friendliness, and the ugly asymmetry between prefill and decode. If those loops reduce hardware utilization, the theoretical gain can evaporate. We have seen this movie before. Plenty of architectures looked elegant in complexity terms and then landed as modest wins or even regressions on real GPU pipelines. I have not seen wall-clock latency or tokens/sec here, so I’m not ready to credit the efficiency claim beyond the architectural level. Another thing I would want before getting excited is a serious ablation. The snippet says the combination is better than YOCO alone or recursion alone. Fine, but then show the curves: original YOCO, full recursion, shallow recursion, different iteration counts, short context, long context, equal-compute comparisons, and memory at decode. Without that, “synergistic effect” is still narrative. If the gains only show up on long-context benchmarks and disappear in short-context high-batch serving, then this is a niche research result, not a universal inference recipe. Still, my read is net positive. This paper is aimed at a real bottleneck that many teams now feel in practice: they want the benefits of test-time scaling without detonating KV growth and latency. That is a more grounded target than simply training a larger dense model and calling it progress. But the missing details are not minor details. They are the whole verdict: parameter count, iteration count, exact benchmark scores, latency, throughput, VRAM, and like-for-like comparisons with standard Transformers and base YOCO. Until those show up, I’d file YOCO-U under “promising systems idea,” not “proven path for depth scaling.”
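The "constant global KV cache" claim is easier to price with rough memory arithmetic. This is a sketch under loud assumptions, not YOCO-U's actual accounting. Every config number is hypothetical, and I assume pessimistically that each recursive pass over the shallow layers keeps its own fixed-size window cache. The only thing the sketch shows is which term scales with context length.

```python
def kv_bytes(cached_layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Bytes needed to hold keys and values (the factor 2) for `tokens` positions."""
    return 2 * cached_layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical model and serving config (none of this is from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
SHALLOW_LAYERS = 16     # layers assumed to use windowed efficient attention
WINDOW = 1_024          # assumed fixed attention window for those layers
CONTEXT = 100_000


def standard_decoder_bytes(context):
    """Every layer caches K and V for every token in the context."""
    return kv_bytes(LAYERS, KV_HEADS, HEAD_DIM, context)


def cache_once_bytes(context, iterations):
    """One global KV cache plus window caches for the looped shallow layers.

    ASSUMPTION: each iteration keeps a separate window cache. The snippet
    does not say whether that is how the recursion is implemented.
    """
    global_kv = kv_bytes(1, KV_HEADS, HEAD_DIM, context)
    window_kv = kv_bytes(SHALLOW_LAYERS * iterations, KV_HEADS, HEAD_DIM, WINDOW)
    return global_kv + window_kv


baseline = standard_decoder_bytes(CONTEXT)
looped_1 = cache_once_bytes(CONTEXT, iterations=1)
looped_4 = cache_once_bytes(CONTEXT, iterations=4)
extra_per_iteration = (looped_4 - looped_1) // 3
```

Under these assumptions the extra iterations add a term that does not depend on context length, so the memory story holds up on paper. What the arithmetic cannot show is the serial cost of running the loop, which is the wall-clock concern raised above.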
HKR breakdown (hook · knowledge · resonance): H0·K1·R0 · SCORE 71
17:52
68d ago
● P1 · arXiv · cs.CL · atom · EN · 17:52 · 04·01
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
YC-Bench evaluates 12 AI agents in a one-year simulated startup task spanning hundreds of turns, and only 3 models consistently beat the $200K starting capital. Claude Opus 4.6 posts the top average final funds at $1.27M, while GLM-5 reaches $1.21M at 11x lower inference cost; scratchpad is the only cross-truncation memory mechanism and the strongest success predictor. The key gap is failure mode: adversarial client detection errors drive 47% of bankruptcies, and frontier models still break on long-horizon execution issues such as over-parallelization.
#Agent#Benchmarking#Memory#Claude
why featured
HKR-H/K/R all pass: the startup-simulation setup is clickable, and the paper provides concrete numbers plus a specific failure taxonomy. I keep it at 82 because this is an arXiv benchmark, not a major product/model launch or a broad cross-source news event.
editor take
YC-Bench lands a clean hit on agent hype: top models can grow capital, then still collapse on memory and anti-fraud over long horizons.
sharp
YC-Bench evaluates 12 models over a one-year startup simulation, and only 3 reliably finish above the $200K starting capital. I buy this benchmark’s premise because it targets the part of agent performance that marketing keeps blurring: after a few hundred turns, does the system still know what it is doing? The hardest numbers in the snippet are straightforward. Claude Opus 4.6 averages $1.27M in final funds. GLM-5 reaches $1.21M. GLM-5 does it at 11x lower inference cost. That already says two useful things. First, frontier models are opening real gaps on long-horizon economic tasks, not just inching ahead on static evals. Second, “best” and “best business choice” are different rankings. If that 11x cost gap holds under the same tool budget and prompting regime, many teams will care more about return per dollar than the top line score. My main takeaway is not “Claude wins” or “GLM is cheaper.” It is that scratchpad is described as the only mechanism that persists information across context truncation, and also the strongest predictor of success. That is a sharp result. For the past year, agent stacks have sold long-term memory in every flavor: vector retrieval, event logs, profile stores, episodic memory, graph memory. YC-Bench is basically saying the thing that most strongly correlates with not failing is still the agent writing itself usable notes. That should make people uncomfortable. A lot of memory systems store history. Fewer preserve strategic continuity. There is useful outside context here. Benchmarks like SWE-bench, GAIA, and browsing-heavy evals mostly stress problem solving, tool use, retrieval, and short-to-medium execution chains. They matter, but they do not pressure the same failure modes as a simulated business with payroll, contract choice, delayed feedback, and adversarial clients. We already saw the broad shape of this problem in the AutoGPT era: goals drift over time, local progress hides global decay, and bad early choices compound. 
Newer coding and browser agents improved the wrapper, but long-horizon coherence is still where systems break. YC-Bench moves that failure into a financial simulation, which is closer to how agents will actually lose money in production. The 47% bankruptcy share from adversarial client detection errors is the number I keep coming back to. It suggests the weak point is not just memory. It is risk modeling under partial observability. Giving an agent more tools or more parallel workers does not produce a stable operator by default. The snippet explicitly mentions over-parallelization, and that tracks with what many teams learned the hard way: parallelism helps when tasks are separable, but it creates damage when work items compete for budget, depend on order, or share hidden constraints. In this benchmark that shows up as payroll and contract selection. In enterprise deployments it turns into support escalations, procurement mistakes, or bad code rollout sequencing. I do have pushback. Right now we only have an RSS-level description, not the full paper details. Three seeds per model is thin. I have not seen variance, prompt scaffolding, tool permissions, context window settings, or the token cost of the scratchpad itself. The adversarial clients matter a lot too. If they follow repeated templates, part of the result becomes pattern recognition rather than robust strategic judgment. The snippet also says scratchpad is the only cross-truncation memory mechanism. That is a strong design choice, but it also means the benchmark may be measuring whether a model can self-maintain a working notebook more than whether a broader memory architecture can help. Even with those gaps, this benchmark is useful because it shifts the conversation from “can the agent do the task” to “can it survive 200 turns without compounding its own mistakes.” If the open-source release is solid, the best follow-up is not another leaderboard screenshot. 
It is ablations: how much performance drops without scratchpad, whether bigger context windows reduce that drop, and what happens to return and bankruptcy rate as worker parallelism moves from 1 to 8. Those numbers would tell practitioners a lot more than another claim about general intelligence.
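The scratchpad finding is simple enough to sketch. The toy agent below has a bounded context window that silently drops old turns, plus a notes list that survives truncation. The class, the client name, and the truncation rule are all hypothetical; the snippet does not describe how YC-Bench truncates context or how models write to the scratchpad.

```python
from collections import deque


class ScratchpadAgent:
    """Toy agent state: a truncating context window and a persistent scratchpad."""

    def __init__(self, context_limit):
        # deque(maxlen=...) discards the oldest turn once the limit is hit,
        # standing in for context truncation.
        self.context = deque(maxlen=context_limit)
        self.scratchpad = []

    def observe(self, event):
        """Record a raw turn; it will eventually fall out of the window."""
        self.context.append(event)

    def note(self, text):
        """Write a durable note; it is never truncated in this sketch."""
        self.scratchpad.append(text)

    def prompt_view(self):
        """What the model would see on the next turn."""
        return {"notes": list(self.scratchpad), "recent": list(self.context)}


agent = ScratchpadAgent(context_limit=3)
agent.observe("turn 1: client Acme paid late twice")   # hypothetical event
agent.note("Acme: payment risk, require deposit")      # strategic summary
for turn in range(2, 7):
    agent.observe(f"turn {turn}: routine work")

view = agent.prompt_view()
```

After six turns the raw observation about the risky client is gone from `view["recent"]`, and only the note carries the decision forward. That is the difference between storing history and preserving strategic continuity that the commentary above points at.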
HKR breakdown (hook · knowledge · resonance): H1·K1·R1 · SCORE 88
17:39
68d ago
● P1 · arXiv · cs.CL · atom · EN · 17:39 · 04·01
Embarrassingly Simple Self-Distillation Improves Code Generation
The paper tests simple self-distillation: sample a model’s own solutions with specific temperature and truncation settings, then fine-tune on them with standard SFT. On LiveCodeBench v6, Qwen3-30B-Instruct rises from 42.4% to 55.3% pass@1, with larger gains on harder problems, and the method also transfers across 4B, 8B, and 30B Qwen and Llama variants. It uses no verifier, teacher model, or RL.
#Code#Fine-tuning#Benchmarking#Research release
why featured
HKR-H/K/R all pass: the hook is a very simple recipe with a large code-bench gain, the paper gives concrete settings and numbers, and the claim hits the industry's cost-vs-RL nerve. Still, this is a single arXiv result without broad external validation, so it lands in high-end 'f
editor take
Qwen3-30B-Instruct jumps from 42.4% to 55.3% pass@1 on LiveCodeBench v6. I buy the simplicity; I don’t buy the result fully without tighter leakage and eval details.
sharp
Qwen3-30B-Instruct lifts LiveCodeBench v6 pass@1 from 42.4% to 55.3%, and if that number holds up, this paper hits a nerve in code post-training: a lot of usable capability is already sitting inside the model’s own output distribution, and you may not need a verifier, RFT, or a stronger teacher to surface it. My read is that SSD turns test-time luck into train-time habit. The core intuition has been around for a while. Best-of-n, rejection sampling, STaR-style self-training, and a lot of synthetic-data work all lean on the same fact: a model often “knows” more than its pass@1 suggests, but one decode path fails to extract it. Code makes that especially visible because pass@k is often much higher than pass@1. The interesting part here is not the philosophy. It is the brutal simplification of the pipeline: sample from the model itself under chosen temperature and truncation settings, then do plain SFT on those outputs. That is operationally attractive for teams that do not have a verifier stack or a frontier teacher model. I’m still not ready to fully buy the headline result. The body here is only an RSS snippet, and the missing details are exactly the ones that decide whether this is durable or just neat. How were sampled solutions selected? “No verifier” does not mean “no filtering.” How was contamination controlled against LiveCodeBench v6? What was the time split, the dedup policy, the handling of near-duplicate problem statements, template reuse, and public solution traces? Code evals have burned the field enough times that a 12.9-point absolute gain should trigger skepticism first, celebration second. The proposed mechanism is more interesting than the branding. The paper ties gains to a precision-exploration conflict in decoding, then claims SSD reshapes token distributions contextually: suppress distractor tails where precision matters, preserve diversity where exploration matters. That tracks with a lot of observed code behavior. 
I’ve always thought code generation fails less from total ignorance than from poor commitment timing. High temperature often sends the model down a coherent but wrong branch. Greedy decoding locks in too early. If SSD really writes a better compromise back into the weights, then this is fixing a mismatch between model knowledge and decoder behavior, not just adding more synthetic tokens. The broader context matters. Over roughly the last year, most code-model gains have come from two expensive playbooks. One is RL or RFT with execution feedback, unit tests, or process rewards. The other is large synthetic-data pipelines driven by stronger teacher models. The first is expensive in infra and training stability. The second is expensive in teacher access and data governance. If SSD transfers across 4B, 8B, and 30B Qwen and Llama variants, including instruct and thinking versions, its practical value is not “here is the new SOTA.” Its value is that open-model teams get a much cheaper post-training recipe. You do not need GPT-5-class distillation teachers. You do not need a fully built execution sandbox to move baseline pass@1 upward. I still have a pushback on the narrative. The snippet says gains concentrate on harder problems. Fine, but “harder” by what definition? Difficulty buckets inside LiveCodeBench, empirical solve rates, or some handcrafted tags? Not disclosed here. The transfer to thinking models is also a bigger claim than it looks. Thinking variants usually differ in sample length, truncation behavior, and training targets. Without seeing per-model hyperparameters, sample budgets, and total token costs, I would not call this universal yet. Honestly, the most important implication is not the 55.3% number. It is the reminder that some post-training gains do not come from smarter reward design at all. They come from reorganizing probability mass the model already has but decodes poorly. 
If replications land, I’d expect this to spread first in code, then in math and tool-use tasks. Code is the cleanest testbed because correctness is discrete and bad tokens are punished hard. My remaining doubts are twofold. First, eval cleanliness. Second, whether the gain comes from the SSD mechanism specifically or simply from feeding the model more high-quality self-generated code tokens. The right ablations matter a lot here: same token budget with naive diverse self-sampling, high-temperature-only, low-temperature-only, and cross-benchmark checks on HumanEval+, MBPP, EvalPlus, or SWE-bench-style coding subsets. None of that is in the snippet. So my stance is simple: the idea is credible, the implementation looks appealing, and the result is big enough that I need the boring details before I treat it as settled.
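For readers who want the shape of the recipe, here is a minimal sketch of the data-building step as the snippet describes it: sample the model's own outputs under a chosen temperature and truncation, then keep them as SFT pairs. Everything concrete is an assumption. The toy "model" returns logits over three whole completions, truncation is plain top-k, and the temperature and k values are placeholders, since the paper's actual settings are not in the snippet.

```python
import math
import random


def sample_completion(logits, temperature, top_k, rng):
    """Sample one candidate after top-k truncation and temperature scaling."""
    # Truncation: keep only the top_k highest-scoring candidates.
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature: divide logits before the softmax; subtract max for stability.
    scaled = [score / temperature for _, score in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    candidates = [text for text, _ in kept]
    return rng.choices(candidates, weights=weights, k=1)[0]


def build_sft_set(prompts, model, temperature, top_k, samples_per_prompt, seed=0):
    """Collect (prompt, self-sampled completion) pairs with no verifier or filter."""
    rng = random.Random(seed)
    data = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = sample_completion(model(prompt), temperature, top_k, rng)
            data.append({"prompt": prompt, "completion": completion})
    return data


def toy_model(prompt):
    """Hypothetical stand-in: fixed logits over three whole completions."""
    return {"return a + b": 2.0, "return a - b": 1.0, "return None": -1.0}


sft_data = build_sft_set(
    ["def add(a, b):"], toy_model, temperature=1.0, top_k=2, samples_per_prompt=50
)
```

With `top_k=2` the lowest-scoring completion can never enter the training set, which is one way truncation trims the distractor tail before fine-tuning. Whether the authors apply any further selection on top of sampling is exactly the open question raised above.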
HKR breakdown (hook · knowledge · resonance): H1·K1·R1 · SCORE 88
17:29
68d ago
● P1 · arXiv · cs.CL · atom · EN · 17:29 · 04·01
Screening Is Enough
The paper introduces Multiscreen, which filters keys with an explicit threshold and matches a Transformer’s validation loss with about 40% fewer parameters. The snippet says it keeps strong long-context perplexity, beats a larger Transformer in retrieval with about 92% fewer parameters at training length, and cuts inference latency by up to 3.2× at 100K context.
#Reasoning#Inference-opt#Benchmarking#Research release
why featured
This is a research release with a practical systems claim: explicit key screening yields 40% fewer params and up to 3.2x lower latency at 100K, so HKR-H/K/R all pass. I kept it below the top band because the disclosed evidence is still paper-level; full reproduction and serving cost
editor take
This is not just another linear-attention paper. It goes after softmax attention’s core assumption: relevance is relative, not absolute.
sharp
Multiscreen filters keys with an explicit threshold, matches a Transformer’s validation loss, and uses about 40% fewer parameters. My read: this is not a mere efficiency patch. It is attacking a very old assumption inside attention itself — that relevance is only defined by relative competition among keys. The snippet gives three headline numbers. About 40% fewer parameters at comparable validation loss. Up to 3.2× lower inference latency at 100K context. At training context length, a Multiscreen model with about 92% fewer parameters beats a larger Transformer on retrieval accuracy. Those are strong claims. I’m not ready to take the full victory lap, because the RSS text leaves out the parts that decide whether this survives contact with real workloads: how the threshold is learned, what fraction of keys gets dropped, which retrieval tasks were used, and whether the latency number is prefill, decode, or end-to-end on specific hardware. Why this paper matters anyway: softmax attention has a structural quirk people have tolerated for years. It must distribute a unit mass across all keys, even when most keys are junk for the current query. That makes irrelevance a relative concept. Noise still gets some share of the budget; it just gets a smaller share. In retrieval-heavy settings, long-context settings, and cache-heavy inference, that is a strange default. Multiscreen flips the rule. A key either clears a threshold or it doesn’t. That sounds simple, but conceptually it is a bigger move than another approximation to softmax. That puts the paper in an interesting spot relative to the last year of long-context work. One camp, like FlashAttention, keeps standard attention semantics and just computes them more efficiently. Another camp, like Mamba-style state-space models, replaces attention entirely. A third camp uses sparse or retrieval-augmented schemes to avoid looking at every token. 
Multiscreen sits between those lines: it keeps the query-key interface but changes the meaning of relevance from ranked allocation to binary screening plus aggregation. If that holds up, adoption is easier than for a full architecture swap, because the surrounding Transformer stack changes less. I do have two real doubts. First, thresholded mechanisms often run into distribution-shift trouble. A threshold that behaves well at one length or token distribution can get brittle out of distribution. The snippet says “little to no degradation” beyond training context, but no curves are shown here, and curves matter more than a sentence. Second, the “92% fewer parameters beats a larger Transformer in retrieval” result is the kind of line that depends heavily on task design. Needle retrieval, passkey retrieval, multi-hop retrieval, and noisy-document retrieval are not interchangeable. Until I see the exact benchmark mix, I would not generalize this into “better language modeling” or “better reasoning.” One line in the snippet deserves more attention than the latency claim: stable optimization at substantially larger learning rates. A lot of attention alternatives fail not because inference is bad, but because training becomes fragile. If screening smooths optimization enough to raise practical learning rates, the upside is bigger than faster 100K inference. It changes training economics. I’ve seen this movie before with linear-attention and sparse-attention papers: strong extrapolation plots, then weak uptake because mixed-precision stability, kernel support, and pretraining behavior were not good enough. Multiscreen will face the same filter. So I’m cautiously positive, not sold. The title, “Screening Is Enough,” is doing a lot of work. From the snippet alone, I can say this looks like a serious attempt to redefine attention in a way that matches what many practitioners wanted all along: irrelevant tokens should be rejectable, not merely down-weighted. 
I cannot say it has earned a production-grade replacement verdict yet. To get there, the paper needs to show the threshold-learning mechanism, sparsity distributions, extrapolation curves across context lengths, and the exact hardware/batch setup behind the 3.2× latency number. Without that, this is a strong research signal, not a settled systems result.
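The relative-versus-absolute distinction fits in a few lines. The sketch below contrasts standard softmax weights with a hard threshold rule. The screening function is my own illustrative stand-in (cut at a fixed threshold, weight by margin), not Multiscreen's actual formulation, which the snippet does not disclose. The scores and threshold are made up.

```python
import math


def softmax_weights(scores):
    """Standard attention weights: always positive, always summing to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def screened_weights(scores, threshold):
    """ASSUMED screening rule: keys below the threshold get exactly zero.

    Keys above it are weighted by their margin over the threshold. This is
    a hypothetical stand-in, not the paper's mechanism.
    """
    return [max(s - threshold, 0.0) for s in scores]


# One clearly relevant key followed by four junk keys.
scores = [4.0, 0.1, 0.0, -0.2, -0.1]
relative = softmax_weights(scores)
absolute = screened_weights(scores, threshold=1.0)

# A query for which no key is relevant at all.
junk_only = [0.1, 0.0, -0.1]
relative_junk = softmax_weights(junk_only)
absolute_junk = screened_weights(junk_only, threshold=1.0)
```

The second case is the interesting one. Softmax still has to hand out a full unit of attention among junk keys, while a threshold rule can return nothing. It also shows where the distribution-shift worry comes from: the behavior hinges on one cutoff value staying meaningful as score statistics change.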
HKR breakdown (hook · knowledge · resonance): H1·K1·R1 · SCORE 85
17:08
68d ago
arXiv · cs.CL · atom · EN · 17:08 · 04·01
Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning
Brainstacks reports continual multi-domain tuning on TinyLlama-1.1B and Gemma 3 12B IT with frozen MoE-LoRA stacks, reaching 2.5x faster convergence than a parameter-matched single LoRA. The method combines 4-bit QLoRA, top-2 routing, residual stack boosting, randomized-SVD null-space constraints, and an outcome-based meta-router; experiments span 4-5 domains and 9-10 stacks. The key result is transfer of cognitive primitives rather than domain knowledge: medical prompts route to chat+math stacks in 97% of cases despite zero medical data in those stacks.
#Fine-tuning#Reasoning#Inference-opt#Research release
why featured
HKR-K passes on concrete results: 2.5x faster convergence and 97% routing of medical prompts to non-medical stacks. Still this is a specialist continual-learning/PEFT paper with no clear product or agent on-ramp, so hard-exclusion-technical-accessibility caps it below 40.
HKR breakdown (hook · knowledge · resonance): H0·K1·R0 · SCORE 44
16:55
68d ago
arXiv · cs.CL · atom · EN · 16:55 · 04·01
The Overlooked Repetitive Lengthening Form in Sentiment Analysis
The paper releases Lengthening, an 850k multi-domain dataset built to test repetitive lengthening form (RLF) in sentiment analysis. It also proposes the two-stage ExpInstruct tuning framework and reports that fine-tuned PLMs beat zero-shot GPT-4 on classification, while the post does not disclose exact scores; code and sample data are linked. The key point is that RLF is treated as a document-level sentiment signal, not just noisy informal text.
#Fine-tuning#Benchmarking#Interpretability#GPT-4
why featured
Only HKR-K clearly passes: the paper adds an 850k-sample RLF dataset, a two-stage ExpInstruct setup, and code, but no exact metrics are disclosed here. Narrow academic relevance keeps it in all, not featured.
editor take
The paper ships an 850k RLF dataset. I buy the dataset; I don't buy the “beats zero-shot GPT-4” line without scores or eval conditions.
sharp
The paper releases an 850k RLF sentiment dataset, but the body does not disclose the exact scores, prompts, temperature, or class balance behind the “beats zero-shot GPT-4” claim. That gap matters. The valuable part here is the task definition and dataset construction, not the leaderboard line. I’ve always thought sentiment analysis gets treated too casually in the LLM era, as if a general chat model can just absorb it for free. In practice, performance gets shaky once you move from clean text to expressive spelling: repeated letters, stretched vowels, duplicated punctuation, all-caps, emoji stacks. “soooo good” is not just “so good” with noise added. Depending on context, it can signal intensity, irony, emphasis, performative exaggeration, or group style. This paper is useful because it isolates repetitive lengthening form as a phenomenon instead of washing it away in preprocessing. That part tracks with older NLP and linguistics work. Long before frontier model evals took over the conversation, people had already shown that emoji, punctuation repetition, and elongation carry emotional intensity. The annoying habit in classic pipelines was to normalize that away. “coooool” becomes “cool,” and the model loses information before training even starts. If Lengthening is built to preserve those surface forms across domains, that alone is a meaningful contribution. It forces people to admit that normalization is not neutral; sometimes it is label destruction. I’m less convinced by the comparison framing. A fine-tuned PLM beating zero-shot GPT-4 on a narrow classification task is not a shocking result. We saw versions of that all through 2023 to 2025 in hate speech, emotion classification, stance detection, and short-text sentiment. Give a supervised encoder a tightly scoped dataset and it often beats a general-purpose model used zero-shot. That does not tell you the encoder “understands” the phenomenon better in a broad sense. 
It tells you the benchmark rewards task-specific fitting. Those are different claims. The missing setup details are the issue. What prompt was used for GPT-4? Was it zero-shot with plain label instructions, or did it include explanation-first prompting? Were labels balanced? How much domain overlap exists between train and test? Did they compare against few-shot GPT-4 or only zero-shot? The body snippet gives none of that. Without those conditions, the win over GPT-4 is directionally interesting but weak as evidence. ExpInstruct is the part I take more seriously. The paper says fine-tuned PLMs beat GPT-4 on performance but not on explanation, and then uses a two-stage instruction-tuning setup to improve both performance and explainability for open models with limited samples. That is a better research instinct than chasing accuracy alone. RLF is exactly the kind of phenomenon where “correct label” can hide shallow reasoning. A model can output positive or negative by memorizing lexical co-occurrence while missing the intensification mechanism entirely. If their explainability setup actually tests whether the model identifies elongation as a sentiment-strength cue, that has practical value for moderation, VoC analysis, and social listening. Still, I have some doubts. “Explainability” in recent papers is often the mushiest part of the stack. Was it human-judged? Rule-based overlap? Another LLM acting as judge? The snippet does not say. If the explanation metric is soft, then “matches GPT-4 in explainability” is a much weaker claim than it sounds. There is also a language generalization problem. The paper frames RLF as an overlooked form, but from the snippet this still looks heavily English-centered. That matters because elongation behaves differently across languages and communities. English “soooo,” Japanese repetition patterns, Arabic orthographic play, and Chinese forms like repeated particles or elongated punctuation are not interchangeable signals. 
If the corpus is mostly English, the conclusion should stay narrow: this is about English online sentiment cues, not a universal theory of expressive lengthening. The body does not disclose language coverage, so I would not generalize it for them. The broader context is that model evaluation has drifted hard toward reasoning, coding, and agent benchmarks over the last year. That makes papers like this easy to underrate. But edge-case linguistic phenomena are exactly where production systems fail quietly. Brand monitoring, UGC moderation, review summarization, and customer feedback pipelines all ingest messy expressive text. If a model collapses “I hate thisssss” into the same affective weight as “I hate this,” the error is operational, not academic. So my take is simple: the sturdy contribution is the dataset and the framing of RLF as preserved signal. The weakest part is the GPT-4 comparison because the article snippet withholds the numbers and evaluation conditions that would make that claim meaningful. I’d want to inspect three things before buying the paper’s headline: domain splits, normalization policy, and the exact explanation-eval protocol. The linked code and sample repo are a plus. The missing scores are not a small omission; they are the difference between a useful benchmark paper and a benchmark paper with a marketing sentence attached.
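A minimal sketch of the "normalization is label destruction" point above, assuming a crude rule of my own: three or more repeats of the same character count as lengthening. The threshold, the function name, and the feature names are all hypothetical and are not taken from the paper. The idea is to keep what normalization removes as explicit features instead of discarding it.

```python
import re

# Assumption for this sketch: 3+ repeats of one character = "lengthening".
# This also catches duplicated punctuation such as "!!!".
_ELONGATION = re.compile(r"(.)\1{2,}")

def elongation_features(text: str) -> dict:
    """Return a normalized copy of the text plus features describing what was collapsed."""
    runs = [m.group(0) for m in _ELONGATION.finditer(text)]
    # Collapsing each run to a single character is deliberately crude:
    # "coooool" becomes "col", so a real pipeline would need a lexicon.
    normalized = _ELONGATION.sub(r"\1", text)
    return {
        "normalized": normalized,
        "elongated_runs": len(runs),
        "max_run_length": max((len(r) for r in runs), default=0),
    }

plain = elongation_features("I hate this")
stretched = elongation_features("I hate thisssss")

# Both inputs normalize to the same string, so a pipeline that keeps only
# `normalized` cannot tell them apart; the extra features preserve the cue.
print(plain)
print(stretched)
```

Whether a downstream classifier should consume these features or the raw surface form is a design choice the snippet does not settle; the sketch only shows that the information is cheap to keep.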
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
16:37
68d ago
arXiv · cs.CL · atom · EN · 16:37 · 04·01
CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance
The paper presents CARE for short-horizon ICU organ dysfunction worsening prediction and evaluates it on MIMIC-DOS, built from discordant sign-symptom cases in MIMIC-IV. A remote LLM emits structured categories and transitions without seeing patient data, while a local LLM acquires evidence and makes final decisions; the post does not disclose metric values, and the key point is this privacy split between planning and data access.
#Agent#Reasoning#Safety#MIMIC-IV
why featured
HKR-K passes on a concrete privacy split: a remote LLM outputs structured labels and state transitions, while a local LLM inspects records without exposing them. Tier is excluded under hard-exclusion-traditional science + AI crossover: this is centered on ICU prediction, with no清
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
16:07
68d ago
arXiv · cs.CL · atom · EN · 16:07 · 04·01
Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics
An arXiv paper tests author identification on 52,796 Books3 books and 28,439 PG-19 books, finding measurable author “fingerprints” in novelty-curve dynamics across 759 and 1,821 authors. Book-level scalar dynamics identify 43% of authors above chance; chapter-level sliding-window SAX motifs reach 30x-above-chance attribution and complement, rather than duplicate, book-level features. The key point for practitioners: genre confounds the signal, but roughly one-quarter of authors retain fingerprints within genre.
#Benchmarking#Interpretability#Books3#PG-19
why featured
HKR-H and HKR-K pass: the paper uses novelty-curve dynamics for author identification and reports 52,796 Books3 books, 28,439 PG-19 books, 43% above chance, and 30x random at chapter level. HKR-R is weak because the link to products, agents, or deployment is indirect, so this is
editor take
This is stylometry repackaged as novelty dynamics. I buy the method, not the “fingerprint” claim until genre, era, and corpus leakage are pinned down.
sharp
The paper tests 52,796 Books3 books and 28,439 PG-19 books, then reports 43% of authors above chance and chapter-level motifs reaching 30x chance. My read: this is a real result, but the framing runs ahead of the evidence. It does not suddenly discover “narrative fingerprints.” It takes an old problem, authorship attribution, and re-expresses it through information-theoretic novelty curves. That is useful. It shifts attention from static lexical markers toward temporal dynamics: how surprise accumulates, how quickly it changes, how circuitous a narrative path looks over a long text. For long-form generation work, that is a better lens than another bag-of-words classifier. Still, I would push back on the word fingerprint. In stylometry, strong numbers often melt once you move across domain, era, publication format, or editorial process. That has been true since the classic Burrows’s Delta / function-word era, and it is still true in the newer wave of “detect AI writing from style” papers. Many of those looked great in-distribution and then collapsed on cross-domain tests. This paper already admits the biggest issue: genre is a confound, and only about one-quarter of authors retain a signal within genre. I actually find that more informative than the headline “30x above chance.” It says the effect is real for some authors, not universal in the strong biometric sense that “fingerprint” suggests. I also want harder metrics than the snippet provides. “30x above chance” sounds dramatic because the chance baseline over 759 or 1,821 authors is tiny. That does not tell me whether the absolute accuracy is deployment-grade. The RSS snippet does not give top-1, top-k, macro-F1, calibration, or performance by author sample count. Without those, I cannot tell whether this is a strong attribution system or a statistically clean but operationally narrow effect. 
Same problem with complementarity: the snippet says chapter-level SAX motifs complement book-level scalar features, but it does not disclose the ablation size or fusion gain. There is also a corpus issue here. Books3 and PG-19 are long-form, published-book distributions. Chaptering, editorial normalization, and narrative length all help a dynamics-based method. Move this to blogs, newsletters, fanfic, journalism, or documents rewritten by an LLM, and I would expect performance to drop. Books3 adds another layer of discomfort: it is not a neutral benchmark. It sits close to distributions many foundation models likely saw during pretraining, and it carries the usual copyright baggage. That does not invalidate the paper, but it should make readers more cautious about treating the result as a general law of authorship. The outside context that matters to me is where this could land in practice. For provenance and rights workflows, this kind of signal is attractive precisely because it is weak and complementary. Nobody serious should use it as sole evidence, but as one layer alongside lexical stylometry, metadata, draft history, and watermark-like cues, it has legs. For model evaluation, this is even more interesting. We spend a lot of time measuring coherence, factuality, and retrieval faithfulness in long-form outputs. We spend far less time measuring whether generated text has a distinct novelty trajectory or whether every model converges to the same bland pacing after 3,000 words. A novelty-curve framework gives researchers a handle on that. So I buy the method more than the narrative. If the authors want the “fingerprint” label to stick, they need at least three harder tests: beat strong stylometry baselines head-to-head; transfer across corpora instead of staying inside book-publishing distributions; and survive paraphrase plus human editing, including LLM rewrites. Until then, I would file this under “promising temporal stylometry,” not “author identity solved.”
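To make the "30x above chance" caveat concrete, here is a back-of-envelope calculation. It assumes a uniform chance baseline of 1/N over the author pool and that the multiple refers to top-1 accuracy; the snippet confirms neither, and it does not say which corpus has which author count. Under those assumptions, 30x chance works out to roughly 4% accuracy for 759 authors and under 2% for 1,821.

```python
def implied_accuracy(n_authors: int, multiple_of_chance: float) -> float:
    """Absolute accuracy implied by 'k times chance', assuming chance = 1 / n_authors."""
    return multiple_of_chance / n_authors

# Author counts come from the summary; the uniform baseline is an assumption.
for n in (759, 1821):
    chance = 1 / n
    print(f"{n} authors: chance = {chance:.4%}, 30x chance = {implied_accuracy(n, 30):.2%}")
```

If the paper's baseline is weighted by author frequency rather than uniform, these figures shift, which is one more reason the absolute metrics matter.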
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
15:39
68d ago
● P1 · arXiv · cs.CL · atom · EN · 15:39 · 04·01
Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines
The paper uses 4 matched conditions to split second-pass gains in multi-LLM pipelines into 3 additive parts: re-solving, scaffold, and content. Across 2 model pairs and 3 benchmarks, MCQ gains look closer to stronger-model re-solving, while code tasks still benefit from two-stage prompting and weak draft content can hurt. The key variable is task structure and draft quality, not revision by default.
#Reasoning#Code#Benchmarking#arXiv
why featured
Strong HKR-H/K/R: the paper challenges a default pipeline belief and backs it with 4 matched conditions, 3 gain components, and results on 2 model pairs across 3 benchmarks. Not P1 because it is an arXiv preprint with limited experimental scope and no production evidence.
editor take
The paper splits second-pass gains across 2 model pairs and 3 benchmarks. My read: a lot of “revision wins” are just the stronger model solving again.
sharp
The paper decomposes a claim the field has been hand-waving for too long: multi-LLM revision pipelines do not earn their gains from “correction” by default. With 4 matched conditions, it separates second-pass gains into re-solving, scaffold, and content; across 2 model pairs and 3 benchmarks, MCQ gains mostly look like re-solving by the stronger model, while code still benefits from a two-stage setup and weak draft content can actively hurt. I buy that framing. It is more useful than the usual “reviewer model improves draft model” papers because it asks where the delta actually comes from. That matters because a lot of agent and self-refinement work in the last year quietly smuggled in a bad baseline. If model B is stronger than model A, then “A drafts, B revises” often gets compared against A alone, not B direct. Once you compare against “just send the prompt to B,” a chunk of the supposed pipeline magic disappears. This paper is basically formalizing that complaint. On constrained tasks like MCQ, that tracks with what many teams have seen in practice: a second pass has very little room to add structure, so the main effect is just giving the better model another shot. If your production workflow still routes trivia-style or classification-style prompts through a weak-first/strong-second stack, you are probably paying orchestration tax for no real algorithmic gain. The code result is the part I find more important. The paper says even semantically null drafts can provide useful scaffolding. That matches a broader pattern from coding agents: structure often carries more value than content. File layout, function stubs, signatures, test shape, decomposition into subproblems—those can reduce search space even when the draft logic is wrong. I have seen the same intuition behind planning-heavy coding prompts, repo-map generation, and scratchpad-first agents. The draft does not need to be correct; it needs to make the problem legible. 
That is a much narrower and more actionable claim than “revision helps code.” I do have a pushback, and the article snippet does not give enough detail to resolve it. The title and summary disclose 2 model pairs and 3 benchmarks, but not the exact models, benchmark sizes, cost overhead, latency, or variance. Those details matter a lot here. A decomposition like this can look clean on MCQ and competitive programming, then get messy on long-horizon software tasks where drafts serve as memory, not just scaffold. SWE-bench-style debugging, browser agents, and spec-to-code tasks have very different failure modes from standalone competitive programming. If the evaluation is mostly short-form tasks, then the paper is identifying a real effect, but not the whole operational picture. I also want to see the scaffold/content split under stronger current models. My memory is that many 2025 agent papers already found that weaker intermediate reasoning became less useful as frontier models improved, while external structure—tools, tests, retrieval, typed constraints—stayed useful. If that trend continues, this paper points to a design rule: stop fetishizing draft prose, invest in artifacts. Plans, schemas, test cases, execution traces, and partial programs are safer interfaces than natural-language “thoughts” from a weak model. A bad artifact is easier to detect and overwrite than a plausible but misleading explanation. So I think the paper lands a needed correction on pipeline design. “Revision” is too coarse a category. For some tasks, second-pass pipelines are just expensive rerouting to a better model. For code, the win is often the scaffold, not the draft’s semantic content. If the full paper does not report token cost and latency, that is a major omission, because the practical question is not only whether second pass helps, but whether it beats direct strong-model routing per dollar and per second.
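One way to picture the three-part split discussed above is a telescoping difference over four conditions. The condition names and the example scores below are hypothetical; the snippet says there are 4 matched conditions and 3 additive parts but does not define them, so treat this as a sketch of the idea rather than the paper's protocol.

```python
from dataclasses import dataclass

@dataclass
class ConditionScores:
    """Accuracy under four hypothetical matched conditions."""
    weak_direct: float        # weaker model answers alone
    strong_direct: float      # stronger model answers alone, no draft
    strong_null_draft: float  # stronger model, two-stage prompt, semantically empty draft
    strong_real_draft: float  # stronger model revises the weaker model's real draft

def decompose(s: ConditionScores) -> dict:
    """Split the total second-pass gain into three parts that sum to the total."""
    return {
        "re_solving": s.strong_direct - s.weak_direct,
        "scaffold": s.strong_null_draft - s.strong_direct,
        "content": s.strong_real_draft - s.strong_null_draft,
        "total": s.strong_real_draft - s.weak_direct,
    }

# Made-up numbers shaped like the MCQ story: nearly all the gain is re-solving,
# and the real draft slightly hurts relative to an empty one.
mcq = decompose(ConditionScores(0.60, 0.80, 0.80, 0.78))
print(mcq)
```

The practical baseline this exposes is `strong_direct`: if `scaffold + content` is near zero or negative, the pipeline is paying for two calls to get what one call to the stronger model already delivers.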
HKR breakdown
hook knowledge resonance
open source
85
SCORE
H1·K1·R1
15:28
68d ago
X · @Yuchenj_UW · x-api · MULTI · 15:28 · 04·01
In this Codex vs. Claude Code AI coding war, rate limit reset frequency is Prometheus's fire
The post frames Codex vs. Claude Code around rate-limit reset frequency, arguing the tool that gives developers more resets wins this token economy. The post does not disclose reset intervals, quota numbers, plan tiers, or any measured comparison. The real variable here is supply mechanics, not a vague model-quality duel.
#Code#Tools#Codex#Claude Code
why featured
HKR-H and HKR-R pass: the angle is clicky and hits a real developer nerve on rate-limit economics. HKR-K fails because the post provides no numbers, examples, or reproducible test, triggering hard-exclusion-6 for zero-sourcing commentary, so importance is capped at 39.
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H1·K0·R1
14:58
68d ago
arXiv · cs.CL · atom · EN · 14:58 · 04·01
Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization
The paper introduces VRF, which models user preferences as variational distributions instead of point estimates and beats all baselines on 3 benchmarks. It uses a variational encoder, Wasserstein matching to shared probabilistic preference bases, and a variance-attenuated loss; the post does not disclose exact score gains.
#Alignment#Fine-tuning#Research release
why featured
HKR-K passes on mechanism novelty: variational user preference distributions, probabilistic bases, and uncertainty-aware loss. hard-exclusion-technical-accessibility applies because the story is method-dense, lacks a generalist on-ramp, and does not disclose concrete benchmark improvement margins
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
14:55
68d ago
arXiv · cs.CL · atom · EN · 14:55 · 04·01
Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts
The paper builds a multimodal pipeline to analyze state-funded coverage of the Israel-Hamas war on YouTube Shorts, using 2,300+ videos and 94,000+ visual frames. It combines transcription, aspect-based sentiment analysis, and semantic scene classification; transcript sentiment varies by outlet and over time, while visual scene cues track real-world events. The key point for practitioners: domain-adapted small models beat large transformers and LLMs on sentiment analysis, but the post does not disclose exact model names or scores.
#Multimodal#Vision#Benchmarking#YouTube
why featured
There is one concrete HKR-K fact: on 2,300+ Shorts and 94k frames, domain-tuned small models reportedly beat larger Transformer/LLM baselines for sentiment. But this is media-studies analysis using AI, with no product, agent, or model-iteration implication, so hard-exclusion-4/4a
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
14:50
68d ago
● P1 · arXiv · cs.CL · atom · EN · 14:50 · 04·01
Do Phone-Use Agents Respect Your Privacy?
The paper introduces MyPhoneBench and evaluates five frontier phone-use agents on 10 mobile apps and 300 tasks for privacy behavior. It defines privacy compliance as permissioned access, minimal disclosure, and user-controlled memory, then audits over-requested permissions, deceptive re-disclosure, and unnecessary form filling. The key result: ranking by success alone differs from ranking by success plus privacy, so success-only evaluation overstates deployment readiness.
#Agent#Safety#Benchmarking#Freedom Intelligence
why featured
HKR-H/K/R all pass: the privacy question is a strong hook, and the paper adds a concrete 10-app, 300-task, 5-model benchmark with audit-style evaluation. It stays below P1 because this is a single arXiv benchmark paper, with impact centered on research and deployment-safety discussion.
editor take
MyPhoneBench puts a number on the obvious gap: phone agents that finish tasks still fail basic privacy discipline, which makes “works in demos” a weak deployment signal.
sharp
MyPhoneBench lands because it refuses the usual dodge. It evaluates five frontier phone-use models across 10 apps and 300 tasks, then shows that task success, privacy-compliant completion, and later-session use of saved preferences are separate capabilities. No model wins all three. That matters more than the paper’s headline framing. For the last year, phone-agent demos have trained people to read high completion rates as a proxy for deployability. This paper breaks that shortcut. The sharpest part is not some exotic attack. It is the boring failure mode: data minimization. Agents keep filling optional personal fields that the task does not require. A lot of teams would classify that as harmless over-helpfulness. On a phone, it is not harmless. The device sits on top of payments, contacts, addresses, identity data, photo libraries, and app-specific permissions. Once an agent learns the habit of “empty field means fill it,” privacy failure stops being an edge case and becomes a default behavior pattern. The paper’s setup also seems well chosen for that claim: instrumented mock apps, rule-based auditing, and observable trajectories for permission requests and form entries. That is much stronger than vague red-team anecdotes. This also fills a hole in the current benchmark landscape. WebArena, OSWorld, AndroidWorld, and related agent benchmarks have mostly centered on completion and robustness. Safety shows up, but often as prompt injection, escalation, or broad policy refusal. MyPhoneBench isolates privacy loss inside benign tasks, which is closer to real deployment pressure. Most users are not asking agents to survive an adversarial capture-the-flag. They are asking them to book, search, submit, edit, and configure. In practice, a lot of production incidents come from over-collection, sticky permissions, and bad defaults, not cinematic attacks. 
That is why I think this benchmark is directionally more useful than another leaderboard about whether an agent can navigate a settings page. I still want more detail before taking the ranking claim too far. The snippet does not disclose which five models were tested, the actual score spreads, or how success and privacy are combined when they say rankings reshuffle. A dramatic reorder and a two-point shuffle are very different stories. The memory piece also needs more than the abstract gives. “User-controlled memory” sounds right, but the hard questions are operational: can the user inspect what was stored, revoke it per app, prevent cross-app carryover, and verify deletion? The summary does not say. My pushback is mostly for the surrounding industry narrative. A lot of agent builders still treat privacy as a policy layer you bolt on after navigation works. I think that view is already obsolete for phone use. Permission timing, field-level disclosure, and memory retention are core policy-learning problems, not UI polish. If your evaluation stack only tracks success, you will optimize agents into being aggressively helpful and quietly unsafe. I have not verified whether the benchmark runs on real iOS/Android permission stacks or mainly on simulated apps. That gap matters for external validity. Still, as an evaluation framework, this is more honest than most “AI can use your phone” demos. It forces a basic admission: finishing the workflow is not the same as respecting the user.
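A toy version of the audit idea above, to show why success-only and success-plus-privacy rankings can disagree. Everything here is hypothetical: the `Episode` fields, the rule set, and the data are mine, and the snippet does not disclose how MyPhoneBench actually combines success with privacy.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One audited task run. All field names are hypothetical."""
    success: bool
    required_fields: set = field(default_factory=set)
    filled_fields: set = field(default_factory=set)
    needed_permissions: set = field(default_factory=set)
    requested_permissions: set = field(default_factory=set)

def violations(ep: Episode) -> list:
    """Rule-based audit: flag anything filled or requested beyond what the task needs."""
    found = [f"unnecessary field: {f}"
             for f in sorted(ep.filled_fields - ep.required_fields)]
    found += [f"over-requested permission: {p}"
              for p in sorted(ep.requested_permissions - ep.needed_permissions)]
    return found

def success_rate(episodes: list) -> float:
    return sum(ep.success for ep in episodes) / len(episodes)

def compliant_success_rate(episodes: list) -> float:
    """Count a run only if it succeeded with zero audit violations."""
    return sum(ep.success and not violations(ep) for ep in episodes) / len(episodes)

# Toy data: agent_a always finishes but over-fills forms; agent_b fails once but stays minimal.
agents = {
    "agent_a": [
        Episode(True, {"email"}, {"email", "phone", "birthday"}),
        Episode(True, {"email"}, {"email", "home_address"}),
        Episode(True, {"email"}, {"email"}),
    ],
    "agent_b": [
        Episode(True, {"email"}, {"email"}),
        Episode(True, {"email"}, {"email"}),
        Episode(False, {"email"}, set()),
    ],
}

by_success = sorted(agents, key=lambda n: success_rate(agents[n]), reverse=True)
by_compliant = sorted(agents, key=lambda n: compliant_success_rate(agents[n]), reverse=True)
print(by_success, by_compliant)
```

With this data the two orderings are reversed, which is the shape of the claim in the summary. How large the reshuffle is on the real benchmark is exactly the number the snippet leaves out.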
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
13:24
68d ago
arXiv · cs.CL · atom · EN · 13:24 · 04·01
KUET at StanceNakba Shared Task: StanceMoE, a Mixture-of-Experts Architecture for Stance Detection
KUET presents StanceMoE for actor-level stance detection and reports 94.26% macro-F1 on 1,401 annotated English texts from StanceNakba 2026 Subtask A. The model fine-tunes BERT, adds six expert modules, and uses context-aware gating to route weights by input. The part to watch is signal decomposition, not just another BERT variant.
#Fine-tuning#Benchmarking#KUET#StanceNakba
why featured
This mainly clears HKR-K: the summary includes dataset size, 94.26 macro-F1, and a 6-expert gated BERT design. It reads as a shared-task benchmark report with limited product or workflow relevance, so HKR-H and HKR-R miss; tier stays all, not featured.
editor take
KUET posts 94.26 macro-F1 on 1,401 texts. This reads like shared-task tuning, not a step-change in stance modeling.
sharp
KUET reports 94.26 macro-F1 on 1,401 texts, and I don’t buy the result yet. The score is high, but the setup screams shared-task sensitivity: on datasets this small, one to two points often come from split choices, label balance handling, or preprocessing tricks rather than a durable modeling gain. The abstract tells a neat story. Start with a fine-tuned BERT encoder, add six experts for semantic orientation, lexical cues, clause focus, phrase patterns, framing, and contrastive discourse, then let a context-aware gate weight them dynamically. My problem is that the snippet omits the parts that decide whether this is a method or just a leaderboard artifact. We don’t get parameter count. We don’t get variance across seeds. We don’t get the class distribution. We don’t get the train/dev/test protocol. We don’t get routing stats. We don’t get an ablation showing whether the experts matter individually or whether the gate just acts like another trainable fusion layer on top of BERT. I’ve always thought stance detection is unusually unforgiving to architecture hype. Older SemEval stance, rumor, and hate-related tasks already showed the pattern: BERT-family encoders are very strong in small-data settings, and gains often come from target formulation, context packing, class reweighting, and annotation consistency more than from fancy modules. The abstract here flags one especially tricky condition: the target actor is implicit in the text. That’s important. Once the target is implicit, models can score well by learning event framing and lexical co-occurrence rather than learning stance reasoning toward a specific actor. In plain terms, the model may be reading discourse register well, not actually solving the harder actor-level stance problem. The MoE label also needs pushback. In frontier language models, MoE pays off when you have huge data, meaningful task heterogeneity, and enough scale for routing to discover useful specialization. Here we have 1,401 English examples. 
Six experts on a tiny dataset sounds less like sparse scaling and more like hand-designed inductive bias plus a learned selector. That is a valid research move, but it should be judged differently. To convince me, I’d want at least three checks: how much performance drops when the framing expert is removed, how much it drops when the contrastive-discourse expert is removed, and whether routing collapses onto one or two experts for most samples. If routing collapses, the MoE story gets much weaker. Another gap is the baseline set. The abstract says StanceMoE beats traditional baselines and alternative BERT variants, but that phrase is too elastic to carry weight. If the comparison set is vanilla BERT, BiLSTM, and SVM, the win tells me almost nothing. A stronger paper would compare against DeBERTa-v3-style encoders, lightweight modern classifiers, or even NLI-style reformulations if the target schema allows it. I haven’t checked the full PDF tables, so I’m not going to invent what they ran. For now, the summary gives a high score, the abstract gives a plausible architecture, and the crucial competitive context is still undisclosed. My read is simple: file this under task-specific engineering, not transferable progress, until it clears three tests. Show multi-seed confidence intervals. Show cross-dataset transfer beyond StanceNakba. Show routing evidence that the six experts are doing distinct work. Without that, 94.26 looks like a strong shared-task submission, not a broader advance in stance modeling.
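The routing-collapse check asked for above is cheap to run once per-sample gate weights are logged. This is a sketch under my own assumptions: the gate outputs one weight per expert per sample, each row sums to 1, and the function and key names are hypothetical. It reports how often each expert is the top choice and the normalized entropy of the average gate distribution, where values near 0 mean collapse and values near 1 mean balanced use.

```python
import math
from collections import Counter

def routing_report(gate_weights: list) -> dict:
    """gate_weights[i][k] is the gate weight on expert k for sample i (rows sum to 1)."""
    n_samples = len(gate_weights)
    n_experts = len(gate_weights[0])
    # Which expert gets the largest weight for each sample.
    top1 = Counter(max(range(n_experts), key=row.__getitem__) for row in gate_weights)
    # Average weight per expert across the dataset.
    mean_w = [sum(row[k] for row in gate_weights) / n_samples for k in range(n_experts)]
    entropy = -sum(w * math.log(w) for w in mean_w if w > 0)
    return {
        "top1_share": {k: top1.get(k, 0) / n_samples for k in range(n_experts)},
        "normalized_entropy": entropy / math.log(n_experts),
    }

# Collapsed routing: every sample leans on expert 0.
collapsed = routing_report([[0.95, 0.01, 0.01, 0.01, 0.01, 0.01]] * 4)

# Balanced routing: each of six samples favors a different expert.
balanced = routing_report(
    [[0.5 if k == i else 0.1 for k in range(6)] for i in range(6)]
)
print(collapsed["normalized_entropy"], balanced["normalized_entropy"])
```

Balanced usage alone does not prove the experts do distinct work, so this complements the ablations rather than replacing them.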
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R0
12:38
68d ago
● P1 · arXiv · cs.CL · atom · EN · 12:38 · 04·01
LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
LinguDistill uses a frozen original LM as teacher and recovers about 10% of the language and knowledge benchmark loss in VLMs without adding adapters. Its key mechanism is layer-wise KV-cache sharing, which lets the teacher access the student's multimodal states, followed by selective distillation on language-intensive data; vision-heavy performance stays comparable. The part to watch: it changes neither architecture nor inference-time parameter count.
#Multimodal#Fine-tuning#Benchmarking#Research release
why featured
This clears HKR-H/K/R: the hook is counterintuitive, and the summary includes a testable ~10% recovery claim plus a concrete distillation mechanism. It stays below 85 because this is a single arXiv research release, with no broader replication, deployment data, or adoption yet.
editor take
LinguDistill recovers about 10% of lost language ability, but this looks like remediation, not a cleaner VLM training recipe.
sharp
LinguDistill recovers roughly 10% of lost language performance, and that matters. I still wouldn’t read this as “VLMs have solved language degradation.” I read it as a fairly honest admission that, in 2026, turning a strong LM into a vision-language model still damages the original linguistic prior often enough that people need repair methods after the fact. The appealing part is clear and concrete. The paper says it adds no adapters and no inference-time parameters. Instead, it uses the frozen original LM as a teacher, shares KV caches layer by layer so the teacher can see the student’s multimodal states, and then applies selective distillation on language-intensive data. That is a smart target. A lot of prior “capability preservation” work effectively inserts another protective structure into the model: alignment layers, side modules, modality-specific branches. Those can work, but they also make the recipe less portable across model families and deployment stacks. LinguDistill is more restrained. It accepts that multimodal adaptation creates representation shift and cross-modal interference, then tries to pull the model back toward its original language behavior without changing runtime architecture. This lands on a problem the field has been dodging for a while. Over the last year, many autoregressive VLMs looked great on instruction-following and multimodal chat benchmarks, but once you probe them on language-heavy or knowledge-heavy tests, the base LM often feels diluted. You can see it in style, factual recall, calibration, and sometimes long-form reasoning. The paper’s framing fits that pattern. My pushback is on the headline number. “Recovering about 10% of the loss” is directionally good, but it is not enough by itself. Ten percent of what absolute drop? If multimodal adaptation cost 20 points and this method restores 2, that is meaningful. If the original loss was 3 points and it restores 0.3, that is much more modest. 
The snippet does not disclose the benchmark list, absolute scores, base models, or training token counts. So I can’t tell whether this is a practically noticeable repair or a statistically neat refinement. I also have some doubts about the “efficient” framing. No extra inference parameters is good news for deployment teams. That does not make the whole method cheap. Layer-wise KV-cache sharing between teacher and student sounds elegant, but training-time memory, synchronization, sequence length limits, and dual-forward overhead can still be painful. This happens a lot in papers: runtime overhead is near zero, but training complexity moves in the other direction. The body here does not disclose compute budget or compare training cost against adapter-based baselines, so the efficiency claim is only half-grounded. There is another issue that matters more than the paper summary admits: did they recover genuine language ability, or recover benchmark-facing language behavior? Distillation often improves fluency, next-token alignment, and answer style in ways that boost standard language scores. But in VLMs, the hard case is when visual evidence conflicts with textual priors. If the student becomes more teacher-like, does it also become more likely to answer from language priors rather than from the image? The summary says vision-heavy performance stays comparable. Fine. Comparable aggregate performance does not tell me whether image grounding got cleaner, or whether hallucinations in vision-language conflict cases changed at all. I’d want to see image-faithfulness and conflict-set evaluations. Those are not disclosed here. Context from the past year makes this more interesting. A lot of open VLM work, from LLaVA-style stacks to newer Qwen-VL variants, has shown the same tradeoff in practice: multimodal capability improves, but the original LM’s “native” language behavior softens unless the recipe is carefully tuned. 
Closed labs rarely publish the degradation directly, so papers like this are one of the few places where the field gets an explicit repair framing instead of a polished benchmark table. So my take is pretty simple. This paper is useful because it treats language erosion in VLMs as a first-class systems problem, not a cosmetic benchmark issue. But I would not oversell the result. It shows that some of the damage is recoverable without changing inference-time architecture. It does not show that the current multimodal training path is clean, and it definitely does not prove the recovered model is better grounded when language priors and visual evidence disagree.
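The "ten percent of what absolute drop?" question from the take above is simple arithmetic, sketched here with hypothetical benchmark scores. Only the roughly 10% recovery fraction comes from the summary; the base and post-adaptation scores are invented to reproduce the 20-point and 3-point scenarios.

```python
def recovered_points(base_lm_score: float, vlm_score: float, recovery_fraction: float) -> float:
    """Points regained if a method recovers `recovery_fraction` of the drop from base LM to VLM."""
    drop = base_lm_score - vlm_score
    return drop * recovery_fraction

# Hypothetical scores: a 20-point drop versus a 3-point drop, both with 10% recovery.
large_drop = recovered_points(70.0, 50.0, 0.10)
small_drop = recovered_points(70.0, 67.0, 0.10)
print(f"20-point drop: +{large_drop:.1f} points; 3-point drop: +{small_drop:.1f} points")
```

The same headline fraction covers both cases, which is why the absolute scores and benchmark list need to be on the table before judging the result.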
HKR breakdown
hook knowledge resonance
open source
85
SCORE
H1·K1·R1
12:27
68d ago
arXiv · cs.CL · atom · EN · 12:27 · 04·01
Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding
The paper introduces EmoScene, a benchmark of 4,731 context-rich scenarios annotated with an 8D emotion vector from Plutchik's basic emotions. In zero-shot tests on six instruction-tuned LLMs, the best Macro F1 is 0.501; a Bayesian post-processing method using emotion co-occurrence adds +0.051 Macro F1 on Qwen2.5-7B. The key point is joint modeling of emotion dependencies, not independent label prediction.
#Reasoning #Benchmarking #Qwen #Research release
why featured
This scores on HKR-K: a new benchmark, explicit dataset size, and a testable gain from Bayesian post-processing. HKR-H and HKR-R are weak, and there is no clear product or agent implication, so it fits all rather than featured.
editor take
EmoScene drops best Macro F1 to 0.501 on 4,731 scenarios. I buy the harder setup, but the +0.051 Bayesian bump also says the benchmark carries strong label priors.
sharp
EmoScene pushes the best zero-shot Macro F1 across six instruction-tuned models down to 0.501, and that number already tells the story: multi-emotion understanding in long-form scenarios is still nowhere near solved. The paper’s main move is sensible. Instead of another short-text label benchmark, it uses 4,731 context-rich scenarios with an 8D Plutchik emotion vector. I buy that setup more than the usual sentence-level emotion tagging, because a lot of older benchmarks let models coast on lexical cues. In actual interactions, emotion depends on role structure, event order, sarcasm, conflicting goals, and social context. Treating each label as independent has always been a simplification bordering on self-sabotage. My read is that this is more of an evaluation correction than a capability breakthrough. A 0.501 Macro F1 does not prove current LLMs are bad at emotion. It says many earlier datasets made the task too shallow. The closest contrast in my head is the older generation of emotion datasets like GoEmotions: useful, larger, and widely adopted, but mostly built around short comments rather than scenario reasoning. That is a different problem class. I have not verified the exact prompting setup used for all six models here, and the snippet does not disclose per-model breakdowns, decoding constraints, thresholding choices, or confidence intervals. Without those details, it is hard to tell whether 0.501 reflects a genuinely hard benchmark, a brittle evaluation protocol, or both. The Bayesian post-processing result is the part I would treat carefully. The authors use emotion co-occurrence statistics for joint posterior inference and report a +0.051 Macro F1 gain for Qwen2.5-7B. That is a meaningful lift for a lightweight add-on. It also raises the obvious question: how much of the gain comes from modeling real emotional interdependence, and how much comes from exploiting dataset priors? 
If a relatively simple co-occurrence layer moves the score that much, then base model outputs are underusing label structure, but it also suggests the benchmark contains a strong enough dependency pattern that a prior can cash in on it. That does not invalidate the method. In fact, it highlights a blind spot in a lot of current evaluation: we train and score emotion models as if labels were independent when they clearly are not. Still, I would want out-of-domain tests or at least split-wise robustness checks before reading +0.051 as a strong generalization claim. The snippet does not say whether they tested distribution shift, rare emotion combinations, or domain transfer. I also have some doubts about benchmark scale. 4,731 examples is respectable for a research release, but not especially large for an 8D multilabel scenario task with long-tail combinations. Macro F1 is sensitive to rare classes, and emotion annotation is notoriously subjective around boundary cases. The article body does not disclose annotator agreement, human ceiling, class imbalance, or comparisons against dedicated emotion classifiers. Those are not side details; they determine whether a 0.05 gain is a robust signal or a thresholding artifact dressed up as reasoning. So my stance is pretty simple: this paper is useful because it fixes the problem framing, not because it proves a new modeling stack has cracked emotional reasoning. Over the last year, the field has spent so much oxygen on agents, tool use, and coding that social-affective reasoning often gets treated as a demo-layer capability. EmoScene is a good reminder that once you move from “spot the emotion word” to “infer a structured emotional state from a situation,” even decent instruction-tuned models still struggle. 
If someone uses this benchmark next month to claim a model has achieved advanced emotional understanding, I would ask for three numbers first: per-class results, human agreement or ceiling, and out-of-distribution performance. The snippet gives none of them.
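To make the co-occurrence argument concrete, here is a minimal sketch of one way a label prior can re-score independent per-emotion probabilities. This is not the authors' method, whose details the snippet does not disclose. The function names, the three-label toy set, the smoothing constant, and the training vectors are all assumptions for illustration.

```python
from collections import Counter
from itertools import product

LABELS = ("joy", "trust", "fear")  # toy subset; EmoScene uses 8 Plutchik dimensions


def fit_joint_prior(label_vectors, n_labels, alpha=1.0):
    """Smoothed empirical prior over every binary label combination."""
    counts = Counter(tuple(v) for v in label_vectors)
    total = len(label_vectors) + alpha * 2 ** n_labels
    return {
        combo: (counts.get(combo, 0) + alpha) / total
        for combo in product((0, 1), repeat=n_labels)
    }


def map_decode(probs, prior):
    """Pick the combination maximizing prior(y) * prod_i p_i^y_i * (1 - p_i)^(1 - y_i)."""
    def score(combo):
        s = prior[combo]
        for p, y in zip(probs, combo):
            s *= p if y else 1.0 - p
        return s
    return max(prior, key=score)


# Hypothetical training labels: joy and trust usually co-occur.
train = [(1, 1, 0)] * 6 + [(0, 0, 1)] * 3 + [(0, 0, 0)]
prior = fit_joint_prior(train, n_labels=len(LABELS))

# Hypothetical per-label probabilities from a model that scores labels independently.
probs = [0.90, 0.45, 0.10]

independent = tuple(int(p >= 0.5) for p in probs)  # threshold each label on its own
joint = map_decode(probs, prior)                   # let the prior weigh in

print("independent:", independent)
print("with prior: ", joint)
```

In this toy case independent thresholding drops "trust", while the prior-weighted decode restores it because joy and trust co-occur in the toy training set. That is the same ambiguity raised above: the lift can reflect real emotional structure, or just the dataset's label prior.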
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R0
12:10
68d ago
MIT Technology Review · rss · EN · 12:10 · 04·01
The Download: gig workers training humanoids, and better AI benchmarks
MIT Technology Review’s April 1 Download highlights two AI threads: Micro1 has hired thousands of gig workers across 50+ countries to record household chores for humanoid robot training. It also argues current AI benchmarks miss real-world use and cites Angela Aristidou’s Human–AI, context-specific evaluation; the post does not disclose concrete metrics or results.
#Robotics #Benchmarking #Micro1 #MIT Technology Review
why featured
This is a two-item roundup, not a deep report. HKR-H comes from the hidden-labor hook; HKR-K/R come from the concrete 50+ countries detail and the benchmark-validity debate, but the post gives no metrics or experimental results, so it stays in all.
editor take
Micro1 hired thousands across 50+ countries to film chores. This is less a robot story than data labeling escaping the screen and entering the home.
sharp
Micro1 hired thousands of gig workers in 50-plus countries to record household chores, and that pushes the robotics data pipeline from cloud labeling into private homes. My read is simple: humanoid robotics is not bottlenecked by one more VLA paper right now; it is bottlenecked by cheap, continuous, messy long-tail interaction data. Whoever industrializes that supply chain gets a real timing advantage. This looks like the old Scale AI / Appen / Remotasks phase for foundation models, except the data source is far more invasive. Text labeling exposed bias and labor issues. Home-task video collection adds addresses, room layouts, family routines, appliances, faces, children, and anyone else who happens to be present. The article says the jobs pay well locally, but it does not disclose hourly rates, task pricing, retention periods, consent flows, resale rights, or whether bystanders are filtered out. I don’t buy casual use of “informed consent” here. A worker can consent to selling their own task footage; that does not automatically extend to roommates, visitors, or family members whose lives end up in the frame. Technically, this also says something blunt about the state of humanoids: a lot of “general manipulation” still depends on humans showing the world to the model first. Figure, 1X, Agility, Tesla Optimus, and others all talk about broad household or workplace competence, but most public demos still live in curated environments. The hard part at home is not just grasping. It is clutter, occlusion, object variation, sequence variation, failure recovery, and the fact that no two kitchens are arranged the same way. A network like Micro1 matters because it expands distribution coverage across countries, homes, tools, and routines. The article does not disclose dataset size, annotation depth, collection protocol, or whether any force/contact signal is paired with the video, so we should be careful not to overread it. 
Still, the model here is obvious: use distributed humans to produce the demonstrations roboticists cannot collect fast enough themselves. I also don’t fully buy the implied “more footage equals better robots” story. First, head-mounted iPhone video is a biased viewpoint; it does not match a robot’s chest, wrist, or head camera geometry. Second, many household tasks are contact-rich. Video alone misses force control, slip, weight changes, resistance, and tool feedback. Third, geographic diversity is not the same as training quality. Different cookware, storage conventions, cleaning sequences, and cultural task norms create normalization work, not just free generalization. I haven’t seen a public data card, error taxonomy, or downstream improvement numbers from this piece. Without those, “thousands of workers” is an input metric, not a capability metric. The benchmark half of the newsletter points in the right direction, but I’m cautious about the framing. Angela Aristidou argues for Human–AI, context-specific evaluation, and that diagnosis is fair. Too many benchmarks still assume isolated tasks, short horizons, and one-user interaction, while actual deployment happens inside teams, workflows, and institutions over time. That gap has been obvious for a while. Over the last year, the field has already been moving this way: SWE-bench tried to anchor coding evaluation in real issue resolution; METR and frontier-lab preparedness work kept pushing toward longer-horizon task assessment; agent evaluations increasingly track tool use, handoffs, and failure modes instead of just final answers. My pushback is that “context-specific” can become an escape hatch if nobody pins it down. Once every company says its workflow is unique, benchmarking turns into bespoke consulting and cross-model comparison disappears. Public benchmarks absolutely need repair, but replacing them with loose case studies is not progress. 
A serious framework needs two layers: a reproducible public substrate, then domain overlays. The substrate handles comparability across models and labs. The overlay tracks real workflow outcomes such as handoff loss, rollback rate, human intervention frequency, completion time, and cost of error. The article gives the concept, but not the metrics, baselines, or experimental design. Only the title-level argument is disclosed so far; the mechanism is not. Put the two threads together and a bigger pattern shows up. Robotics is dragging real life into the training set. Benchmarking people are trying to drag real life back into evaluation. Same underlying correction. AI spent years optimizing on proxies because proxies were cheap. Now those proxies are breaking at the point of deployment. That is why home video labor markets are forming, and it is why static leaderboard scores feel thinner every month. So I read this newsletter less as two separate curiosities and more as one field-level adjustment: AI systems are running into the cost of interfacing with the world. In robots, that cost shows up as distributed human data collection with ugly privacy questions. In evaluation, it shows up as pressure to measure performance inside organizations instead of on sterile test sets. That is the part I take seriously. The rest still needs numbers.
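A rough sketch of what the overlay layer could compute from logged workflow runs. The article gives no schema, so every field name and the sample data below are hypothetical. Handoff loss and cost of error are left out because nothing in the piece says how they would be measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WorkflowRun:
    """One logged task run; every field name here is hypothetical."""
    completed: bool
    rolled_back: bool
    human_interventions: int
    minutes: float


def overlay_metrics(runs):
    """Aggregate workflow-level outcomes for a domain overlay."""
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "rollback_rate": sum(r.rolled_back for r in runs) / n,
        "interventions_per_run": sum(r.human_interventions for r in runs) / n,
        "mean_minutes": mean(r.minutes for r in runs),
    }


# Made-up runs, only to show the shape of the output.
runs = [
    WorkflowRun(True, False, 0, 12.0),
    WorkflowRun(True, True, 2, 30.0),
    WorkflowRun(False, True, 3, 45.0),
    WorkflowRun(True, False, 1, 13.0),
]
metrics = overlay_metrics(runs)
print(metrics)
```

The point of keeping the schema this small is comparability: if two teams log the same four fields, their overlays can be compared even when the workflows differ.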
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H1·K1·R1
11:36
68d ago
arXiv · cs.CL · atom · EN · 11:36 · 04·01
From Baselines to Preferences: A Comparative Study of LoRA/QLoRA and Preference Optimization for Mental Health Text Classification
This paper compares LoRA/QLoRA supervised fine-tuning with DPO, ORPO, and KTO for mental health text classification, and finds method choice matters more than simply adding preference training. The snippet confirms tests across objectives, adapters, optimizers, context windowing, and class rebalancing; the post does not disclose datasets, model names, or scores. The key takeaway is the reproducible optimization framework, not a single top score.
#Fine-tuning #Benchmarking #Alignment #Research release
why featured
HKR-K passes because the abstract gives concrete methods and variables. But this is still a healthcare-domain text-classification study with no agent, product, or broader workflow implication, so hard-exclusion-4 applies and caps the score below 40.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
11:18
68d ago
arXiv · cs.CL · atom · EN · 11:18 · 04·01
Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention
The paper proposes Stochastic Attention, which randomizes token order before sliding-window attention to turn a fixed local window into a stochastic global one at the same O(nw) budget. Its receptive field reaches full-sequence coverage in O(log_w n) layers versus O(n/w) for SWA. In pretraining and training-free inference on Qwen3-8B and Qwen3-30B-A3B, it beats SWA and matches or exceeds Mixture of Block Attention at similar compute.
#Inference-opt #Benchmarking #Tools #Qwen
why featured
The paper has a real mechanism, concrete complexity claims, and benchmark evidence, so HKR-K passes. But it is still a specialist attention-architecture story with little on-ramp for general AI professionals, triggering hard-exclusion-technical-accessibility and capping it at 39.
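For scale, a back-of-the-envelope sketch of the depth claim. It assumes a fixed sliding window extends reach by about w tokens per layer, and that the randomized variant multiplies reach by about w per layer. Those growth models are my reading of the stated O(n/w) and O(log_w n) bounds with constants ignored, not the paper's own analysis. The function names and the window size 64 are illustrative.

```python
def swa_layers(n, w):
    """Layers for a fixed sliding window to span n tokens (reach grows ~w per layer)."""
    return -(-n // w)  # integer ceil(n / w)


def stochastic_layers(n, w):
    """Layers if reach multiplies by ~w per layer: smallest L with w**L >= n (needs w >= 2)."""
    layers, reach = 0, 1
    while reach < n:
        reach *= w
        layers += 1
    return layers


for n in (8_192, 131_072):
    print(f"n={n}: SWA ~{swa_layers(n, 64)} layers, stochastic ~{stochastic_layers(n, 64)} layers")
```

With a 64-token window, the linear model needs on the order of a hundred or more layers to cover these sequence lengths while the multiplicative model needs a handful, which is why the claim matters if it holds at real depths.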
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H0·K1·R0
11:00
68d ago
● P1 · MIT Technology Review · rss · EN · 11:00 · 04·01
The gig workers who are training humanoid robots at home
Micro1 hires thousands of contractors across 50+ countries to film chores at home with iPhones and sell that real-world data to humanoid robotics companies. The piece cites $15/hour pay for one worker, says robotics firms spend over $100 million a year on such data, and notes $6 billion+ went into humanoids in 2025. The real issue is data governance: workers know the footage trains robots, but the post shows they often do not know how it is stored, shared, or deleted.
#Robotics #Vision #Tools #Micro1
why featured
This clears HKR-H/K/R: at-home chore videos are a strong hook, and the piece adds numbers on scale, pay, and spend. The sharper industry signal is the hidden data pipeline and weak governance on storage, sharing, and deletion, so it merits featured, not p1.
editor take
Micro1 is turning chores into robot fuel, and the first bottleneck is not model quality but paper-thin consent.
sharp
Micro1 hires thousands of workers across 50-plus countries to film household chores, and my first read is simple: data rights are lagging far behind the money. The piece gives three numbers that matter: one worker earns $15 an hour, robotics firms spend more than $100 million a year on this kind of data, and humanoids pulled in over $6 billion in funding in 2025. Capital is already treating home video collection as infrastructure. Governance still looks stuck at “don’t show your face.” I’ve long thought humanoid robotics would end up creating a new layer of platformized data labor. The reason is practical, not ideological. Simulation can teach locomotion and some manipulation priors, but it still struggles with messy contact, clutter, occlusion, and the ordinary chaos of kitchens and bedrooms. Public video helps with scene understanding, but it does not give you the first-person action traces you need for manipulation policy learning. Head-mounted iPhone footage of dishwashing, folding laundry, and making beds is a pretty direct answer to that gap. On the technical direction, I buy it. What I do not buy is the idea that this becomes clean or well-governed just because the worker knows they are “training robots.” The article says workers often do not know how the footage is stored, shared, or deleted. That is not a side issue. That is the core liability. Once video enters multiple customer pipelines, gets chunked, labeled, used for imitation learning or VLA fine-tuning, and mixed into derived datasets, deletion becomes much harder in practice. The generative AI world already ran this playbook with web data: collect first, train first, negotiate rights later. Here the disputed asset is not a blog post. It is your home, your routines, your possessions, and all the latent signals around them. That matters because “no face shown” is not the same thing as anonymity. A home interior can be identifying. 
Accent, layout, furniture, reflected surfaces, windows, appliances, even the cadence of someone’s movement can create re-identification risk when enough footage accumulates. The snippet says Micro1 uses AI and human review to strip obvious personal information, but it does not disclose retention periods, downstream customer controls, cross-border transfer terms, or an actual deletion workflow. Those are the details that decide whether this is legitimate data collection or a privacy mess with better branding. There is also a labor-market angle that I think the industry keeps understating. Yes, $15 an hour can be strong pay in parts of Nigeria or India. That does not automatically make consent robust. It changes bargaining power. Workers are not just selling labor time. They are selling access to domestic space and embodied habits. That is closer to surveillance extraction than standard labeling work, even if the task feels mundane. The article hints at this but stops short of saying it plainly. The wider context is familiar if you’ve watched robotics over the last year. A lot of teams have pushed the “world model + teleoperation + internet-scale video” story. But when it comes to manipulation, everyone still runs into the same wall: good action data is scarce. Systems in the RT/OpenVLA family showed how far vision-language-action models can go, but fine manipulation still depends on high-quality demonstrations with contact, failure cases, and environmental variety. So of course companies like Micro1 appear. The demand is real. My pushback is against the implied narrative that outsourced data recording is inherently cleaner than platform scraping. I’m not convinced. Web scraping fights authors and publishers. Home recording reaches into more intimate terrain and creates weaker practical revocation once the data has propagated. That can be worse, not better. I also could not find the commercial proof that would justify some of the excitement here. 
The article snippet does not show customer benchmarks. Did these home videos improve grasp success by 5 points or 30? Did they improve cross-home generalization, or just produce lots of repetitive chore clips with weak novelty? One worker says generating varied content in a small home is hard, and that point is more important than it looks. If the dataset collapses into a narrow distribution of ironing, folding, and sink work, then scale alone will not solve the generalization problem. Expensive data can still be mediocre data. We learned that in the labeling boom around 2023, when quantity often outran signal. So my read is not “humanoids are about to enter the home.” It is not even “gig work found a new category.” It is that robotics is importing the old internet content bargain into embodied AI, with higher privacy stakes and weaker deletion guarantees. The business will keep growing because the technical need is real. I’m just not convinced the consent model is strong enough to survive scrutiny once these systems move from hype decks into actual deployments.
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
10:37
68d ago
X · @op7418 · x-api · ZH · 10:37 · 04·01
CodePilot launches the "Pet Assist" feature
CodePilot announced a new "Pet Assist" feature in an RSS-snippet post. The post makes only two claims: that the feature is more complete than Claude Code, and that it aims to guide users into an agent workflow that can grow over time. It does not disclose mechanics, availability, pricing, or launch timing. The real question is whether it productizes agent workflows into an iterative layer.
#Agent #Code #Tools #CodePilot
why featured
The post confirms only a feature name and a self-comparison to Claude Code; mechanism, rollout, price, and launch timing are not disclosed. HKR-H/K/R all fail, and hard-exclusion-6 applies because there is no data, example, or reproducible detail.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H0·K0·R0
10:32
68d ago
arXiv · cs.CL · atom · EN · 10:32 · 04·01
LangMARL: Natural Language Multi-Agent Reinforcement Learning
LangMARL brings multi-agent RL credit assignment and policy-gradient updates into language space to address LLM agents' coordination learning in dynamic cooperative settings. The snippet says it adds agent-level language credit assignment and replay-based causal summaries, improving sample efficiency, interpretability, and generalization under sparse rewards; the post does not disclose benchmark names or experiment scale.
#Agent #Reasoning #Interpretability #Research release
why featured
HKR-K passes on mechanism novelty: agent-level credit assignment and replay-based causal extraction in language space. The post does not disclose benchmark scale or gains, and it triggers hard-exclusion-technical-accessibility because the MARL/RL angle has no clear on-ramp for a,
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H0·K1·R0
10:26
68d ago
● P1 · arXiv · cs.CL · atom · EN · 10:26 · 04·01
To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining
The paper studies how to split a fixed data budget between pretraining and retrieval, using OLMo-2 models from 30M to 3B parameters and up to 100B DCLM tokens. It sweeps pretraining tokens at 1-150x the parameter count and retrieval stores at 1-20x, finds that retrieval beats parametric-only baselines across scales, and proposes a 3D scaling framework over model size, pretraining tokens, and retrieval corpus size.
#RAG #Benchmarking #Reasoning #Research release
why featured
This is not a routine benchmark bump. It studies the pretrain-vs-retrieval allocation under a fixed data budget, with 30M-3B OLMo-2 models and up to 100B tokens, yielding a practical scaling rule. Strong HKR-H/K/R, so it clears featured.
editor take
This paper moves RAG from a serving trick toward a training allocation rule, but results at 3B do not transfer cleanly to 70B production.
sharp
The paper trains OLMo-2 models from 30M to 3B parameters on up to 100B DCLM tokens and reports that retrieval beats parametric-only baselines under fixed data budgets. My read is that the important part is not “RAG helps.” RETRO, kNN-LM, and Atlas already made that case years ago. The useful move here is treating model size, pretraining tokens, and retrieval corpus size as one joint allocation problem instead of three separate knobs. That framing is closer to how real teams operate. You do not get infinite clean text and then separately decide whether to add RAG later. You usually have a finite corpus budget, and the actual question is blunt: should this next tranche of data go into pretraining, or should it stay outside the weights and be indexed? The paper at least tries to answer that with a systematic sweep: pretraining at 1-150x parameter count, retrieval stores at 1-20x, across reasoning, scientific QA, and open-domain QA. That is much better than the usual one-model, one-benchmark RAG paper. I still have a big reservation about how far this travels. The top end is 3B parameters. That matters. At 30B or 70B, the tradeoff changes because parametric memory is stronger, long-context behavior changes, and the system cost of retrieval starts competing with raw model quality in a different way. A lot of people learned the hard way from Chinchilla-era scaling claims that results from mid-scale models do not transfer cleanly upward. The snippet also does not disclose error bars, retriever setup, top-k, reranking, chunking strategy, or task-by-task deltas. Without those, I would not turn this into a product rule yet. I also want to push back on the clean headline claim that retrieval wins “across scales.” In a paper setting, that can be true and still hide the hard operational costs. Retrieval adds latency, index maintenance, freshness pipelines, access control, chunk boundary errors, and context pollution. On knowledge-heavy QA, RAG often looks great. 
On multi-step reasoning, coding repair, or planning, bad retrieval can sink the answer before the model gets a chance. The summary says the evaluation includes reasoning, scientific QA, and open-domain QA, but it does not say whether reasoning gains are robust or just washed upward by strong gains on knowledge lookup tasks. That distinction is the whole story for practitioners. The outside context here is pretty clear. Over the last year, major labs have been converging on layered memory: weights for durable priors, long context for working state, retrieval for fresh facts, tools for execution. This paper fits that trend. What it adds is a candidate scaling surface for how to budget data across those layers. If the full paper later folds in retrieval latency, context-window occupancy, and update frequency as part of the objective, then it becomes much more than a benchmark paper. So I would file this as a recipe paper, not a capability paper. It is asking how to spend a fixed data budget, not proving that retrieval has overtaken pretraining. The title gives you “scaling laws,” but the snippet does not disclose the fitted equations, the inflection points for optimal allocation, or where different tasks flip from “memorize” to “retrieve.” Until those numbers are visible, this is a strong design hint, not a deployable rule.
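The allocation question can be written down as a simple grid, sketched below. The 100B budget and 3B size echo the largest settings in the summary, but pairing them, the percentage grid, and the function name are my assumptions. The snippet does not disclose the paper's actual sweep or its fitted scaling surface, so this only enumerates candidate splits and says nothing about which one wins.

```python
def allocation_grid(total_tokens, n_params, pretrain_percents):
    """Enumerate ways to split one token budget between weights and a retrieval store."""
    rows = []
    for pct in pretrain_percents:
        pretrain = total_tokens * pct // 100  # integer math avoids float drift
        rows.append({
            "pretrain_pct": pct,
            "pretrain_tokens": pretrain,
            "retrieval_tokens": total_tokens - pretrain,
            "pretrain_tokens_per_param": pretrain / n_params,
        })
    return rows


# Hypothetical setting: 100B-token budget, 3B-parameter model.
grid = allocation_grid(100_000_000_000, 3_000_000_000, (25, 50, 75, 100))
for row in grid:
    print(row)
```

A real version of this would attach a measured loss or task score to each row. That third column is the part the paper claims to provide and the snippet does not show.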
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
09:58
68d ago
arXiv · cs.CL · atom · EN · 09:58 · 04·01
Learning to Hint for Reinforcement Learning
The paper proposes HiLL, which jointly trains a hinter and a reasoner in GRPO to recover learning signal when a rollout group gets identical rewards. It adds hint reliance and a transfer-weighted reward; the post says HiLL beats GRPO and prior hint baselines on multiple benchmarks, but does not disclose exact scores or datasets.
#Reasoning #Fine-tuning #Benchmarking #Research release
why featured
This is a narrow RL-training paper on GRPO advantage collapse with little on-ramp for general AI readers, so hard-exclusion-technical-accessibility applies. HKR-K gets some credit for a concrete mechanism, but the abstract gives no datasets or scores and HKR-H/R stay weak.
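The collapse itself is easy to see in a few lines. This sketch uses the group-normalized advantage commonly described for GRPO: reward minus group mean, divided by group standard deviation. The `eps` term is an implementation convenience I added to avoid division by zero. HiLL's hinter, hint reliance, and transfer-weighted reward are not shown because the post gives no details on them.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])      # some rollouts succeed
collapsed = group_advantages([0.0, 0.0, 0.0, 0.0])  # every rollout fails

print("mixed:    ", mixed)
print("collapsed:", collapsed)
```

When every rollout in a group earns the same reward, every advantage is zero and the group contributes no gradient signal. That is the gap hint-based methods are trying to fill.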
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
09:23
68d ago
arXiv · cs.CL · atom · EN · 09:23 · 04·01
Attention to Mamba: A Recipe for Cross-Architecture Distillation
The paper presents a two-stage distillation recipe that transfers a Pythia-1B Transformer into an attention-free Mamba, reaching 14.11 perplexity versus the teacher's 13.86. It first distills into linearized attention, then into an adapted Mamba; experiments cover 1B scale, 10B tokens, ablations, scaling, and token-allocation sensitivity. The key detail is the initialization and linear-attention bridge, not a hybrid attention-SSM design.
#Reasoning #Inference-opt #Benchmarking #Mamba
why featured
HKR-H/K pass: moving a Transformer into attention-free Mamba is a clear hook, and the paper provides a 2-stage recipe plus 1B, 10B-token, and 14.11 vs 13.86 PPL numbers. HKR-R fails because cost, throughput, and product impact are not disclosed, so this stays in all.
editor take
The paper closes the gap to 0.25 perplexity on Pythia-1B, and I only half-buy the victory lap: good recipe, not proof Mamba replaces Transformers.
sharp
The authors distill a Pythia-1B Transformer into a fully attention-free Mamba and land at 14.11 perplexity versus the teacher’s 13.86, under a 1B-scale, 10B-token, two-stage recipe. I think that matters, because the hard part for SSMs has never been the pitch deck about throughput. It has been inheritance. The field has a huge stockpile of pretrained Transformer checkpoints, stable training recipes, and downstream adaptation tooling. If pure Mamba cannot absorb that asset base, it stays a niche architecture no matter how elegant the state-space story looks. That is why this paper is more interesting than another hybrid model. A lot of prior “Transformer to Mamba” progress has quietly solved the problem by putting attention back in somewhere. That helps benchmarks, but it also weakens the claim. Here the authors take the stricter route: distill first into linearized attention, then into an adapted Mamba with principled initialization, and keep the student attention-free at the end. I buy that as a legitimate methods contribution. I also think the bridge choice makes sense. Linearized attention is close enough to the teacher’s inductive bias that the model is not asked to jump directly from softmax attention dynamics into an SSM-style state update. Cross-architecture distillation usually breaks when the intermediate representation geometry is too different; the student can mimic logits without inheriting the teacher’s internal organization. This recipe at least acknowledges that problem instead of hiding it. Still, I would not overread the result. The snippet gives the headline numbers and says downstream performance is preserved, but it does not disclose which downstream tasks, what the variance looks like, how the distillation loss is constructed, or how training budget is split between the two stages. 
More importantly, it does not give the deployment metrics that would justify a real architecture switch: generation throughput, latency, memory footprint, long-context behavior, kernel maturity, or hardware efficiency. Mamba’s appeal from day one was not “almost the same perplexity as a Transformer.” It was “better scaling and serving characteristics.” Without those numbers, the paper proves transferability, not operational superiority. There is also a broader pattern here. Since the original Mamba wave, the community has kept running into two frictions. First, Transformer training recipes are much more mature. Second, the ecosystem around checkpoints, finetuning, alignment, and evaluation is deeply attention-centric. My memory is that many strong follow-up results over the past year either moved toward hybrid designs or preserved some attention path when benchmark pressure got serious. I have not re-checked every paper here, so take that as contextual recall, not a formal survey. But that is exactly why this result matters: it offers a migration recipe for teams sitting on Transformer weights, not a clean-sheet argument that SSMs already won. My pushback is on cost and generality. Ten billion distillation tokens is not huge relative to pretraining a 1B model, but it is not cheap if the story is “easy model conversion.” If the recipe also needs careful initialization, stage balancing, and architecture-specific adaptation, engineering complexity starts eating into the benefit. The summary says they ran token-allocation sensitivity studies, which is good. But the snippet does not say whether the best split is stable, whether it transfers across teachers, or whether the gains survive on larger instruction-tuned models. That missing detail matters a lot. A recipe that works on Pythia-1B dense language modeling is useful; a recipe that survives model family changes would be a platform result. 
So my take is straightforward: this is a serious step for cross-architecture distillation, and a cleaner one than the usual hybrid detour. But it does not show that Mamba is ready to replace Transformers in production stacks. It shows that pure Mamba can inherit more than many people assumed. For researchers, the initialization plus linear-attention bridge looks worth reproducing. For practitioners running inference fleets, I would wait for the serving-side evidence before treating this as an architecture turning point.
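For a sense of how small the reported gap is, here is the arithmetic on the two headline numbers. The only assumption is the usual definition of perplexity as the exponential of mean per-token cross-entropy in nats. The snippet does not say which evaluation set the numbers come from.

```python
import math

teacher_ppl = 13.86  # Pythia-1B teacher, from the summary
student_ppl = 14.11  # distilled attention-free Mamba, from the summary

abs_gap = student_ppl - teacher_ppl
rel_gap = abs_gap / teacher_ppl
# With perplexity = exp(loss), the loss gap is the difference of logs.
nats_gap = math.log(student_ppl) - math.log(teacher_ppl)

print(f"absolute gap: {abs_gap:.2f} PPL")
print(f"relative gap: {rel_gap:.2%}")
print(f"loss gap:     {nats_gap:.4f} nats/token")
```

That works out to about 1.8% relative, or roughly 0.018 nats per token. It is a small quality cost, which is why the missing serving-side numbers are the real open question.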
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R0
09:17
68d ago
arXiv · cs.CL · atom · EN · 09:17 · 04·01
Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness
The paper derives TF-IDF-like scores as key terms in a word-burstiness test statistic, where the alternative uses beta-binomial document models with a gamma penalty on precision. The null uses a binomial model and misses over-dispersion. The post says the resulting weighting is comparable to TF-IDF on document classification, but it does not disclose datasets, scores, or significance.
#Benchmarking#Research release
why featured
HKR-K passes because the paper makes a specific theoretical link between TF-IDF and a penalized beta-binomial burstiness test. HKR-H and HKR-R fail, and hard-exclusion-technical-accessibility applies: this is specialist statistical derivation with no product or workflow impact,
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H0·K1·R0
09:13
68d ago
arXiv · cs.CL · atom · EN · 09:13 · 04·01
TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
The paper introduces TRIMS, which uses lightweight signals from an autoregressive teacher to supervise token reveal order in MDLM training with minimal extra overhead. The abstract says TRIMS improves the accuracy-parallelism trade-off on math and coding benchmarks for LLaDA and Dream, and approaches distillation-based methods at lower training cost; the post does not disclose exact scores or cost numbers. The key point is training-inference trajectory mismatch, not model scale.
#Inference-opt#Fine-tuning#Benchmarking#Research release
why featured
TRIMS contributes a concrete training mechanism for diffusion LMs, so HKR-K passes. But this is still a specialist optimization paper with no disclosed benchmark deltas or cost figures in the summary, triggering hard-exclusion-technical-accessibility and capping it below 40.
HKR breakdown
hook knowledge resonance
open source
41
SCORE
H0·K1·R0
08:32
69d ago
arXiv · cs.CL · atom · EN · 08:32 · 04·01
A Survey of On-Policy Distillation for Large Language Models
This survey organizes on-policy distillation for LLMs along 3 axes: feedback signal, teacher access, and loss granularity, under a unified f-divergence framework. It argues off-policy distillation trains on static teacher data, so students never learn from their own errors and exposure bias compounds at inference; the post does not disclose the number of papers reviewed. The key value is a single taxonomy spanning logit-based, outcome-based, and self-play methods, plus open problems in scaling laws, uncertainty-aware feedback, and agent-level distillation.
#Reasoning#Fine-tuning#Agent#Research release
why featured
This lands mainly on HKR-K: the useful part is a 3-axis taxonomy plus a unified f-divergence framing for on-policy distillation. HKR-H and HKR-R are weak because it is a plain survey with no disclosed paper count, benchmark lift, or near-term product impact, so it stays in all.
editor take
The 3-axis taxonomy is useful. I don't fully buy the “unified” pitch, because objective functions unify more easily than teacher cost and online stability.
sharp
This survey organizes on-policy distillation (OPD) along 3 axes, and I think it lands on the oldest problem in distillation that people keep sidestepping: the student never trains on its own mistakes. Off-policy distillation feeds static teacher traces into training, then asks the student to autoregress on its own at inference. Errors compound. That is not a new failure mode. Seq2seq work called it exposure bias years ago, and imitation learning had DAgger for the same reason. Bringing that framing back into LLM distillation is the right move, and frankly more useful than another round of “just add preference data.”
The taxonomy itself is practical. Feedback signal splits into logit-based, outcome-based, and self-play. Teacher access splits into white-box, black-box, and teacher-free. Loss granularity splits into token, sequence, and hybrid. That gives practitioners a decent way to reason about constraints before they reason about method names. If you do not have logits, stop pretending you are doing the same thing as a white-box distiller. If teacher calls are expensive, sequence-level online reranking is not a universal recipe. The title and snippet give the 3 axes, but they do not disclose how many papers were included or how the literature distributes across categories. That matters. This looks more like a map than a quantitative survey.
I do have some doubts about the “unified f-divergence framework” layer. For logit matching, sure, that abstraction is natural. Once you move into outcome rewards and self-play, the hard parts are often not the divergence at all. They are credit assignment, rollout depth, teacher query budget, latency, and the way teacher mistakes get amplified through online trajectories. You can write many objectives into one mathematical frame. That does not unify the engineering bottlenecks. I have seen a lot of LLM papers over the last year use elegant unification to smooth over ugly online instability.
The outside context here is pretty clear. Frontier labs have been moving toward more online feedback loops, especially for coding and agents, because static distillation is good at making a model answer like the teacher and much worse at making it complete multi-step tasks reliably. After the DeepSeek-R1 wave, reasoning distillation became fashionable again, but most public recipes still lean off-policy: collect teacher traces, train the smaller model, report benchmark gains. That helps. It does not automatically produce interaction robustness. A coding agent that makes a small mistake in step 2 can poison the next 8 tool calls. Token-level KL will not rescue that.
So the value of this survey is not that it invents a new method. It states plainly that distillation is shifting from “compress the teacher distribution” to “correct the student on its own trajectories.” That is a meaningful shift for small models, edge deployment, and enterprise inference budgets. If you want low serving cost, you will keep distilling. If you want the student not to fall apart in real tasks, you eventually run into on-policy training.
My pushback is simple. The snippet says the survey examines industrial deployments, but it gives no company names, no task classes, no teacher-call costs, and no gain ranges. Without that, “industry deployment” is still a soft claim. I also agree with the paper that distillation scaling laws remain unresolved. We still do not have a clean rule for how teacher strength, student size, and online rollout budget trade off. Until that exists, OPD risks staying a method family that looks conceptually correct and remains economically awkward outside the biggest labs.
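To make the f-divergence point concrete, here is a minimal sketch of the two KL directions on one next-token distribution. The probabilities are made up for illustration, and pairing forward KL with off-policy training and reverse KL with on-policy training is my simplification, not necessarily how the survey lays it out. Forward KL is an expectation under teacher samples; reverse KL is an expectation under the student's own samples, which is why it is the natural per-token loss once the student generates its own trajectories.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
teacher = [0.70, 0.20, 0.10]
student = [0.40, 0.40, 0.20]

# Forward KL weights each token by the teacher's probability:
# it is what you estimate when training on teacher-generated text.
forward_kl = kl(teacher, student)

# Reverse KL weights each token by the student's probability:
# it is what you estimate when training on student-generated text.
reverse_kl = kl(student, teacher)

print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
```

Both quantities are f-divergences, so one formula covers them. The sketch says nothing about rollout cost or teacher query budget, which is where I expect the real differences between methods to sit.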
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R0
08:14
69d ago
arXiv · cs.CL · atom · EN · 08:14 · 04·01
English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization
The paper releases KUTED, an English-to-Central Kurdish S2TT dataset with 91k pairs, 170 hours of audio, 1.65M English tokens, and 1.40M Kurdish tokens. It reports orthographic variation hurts translation quality; with text standardization, a fine-tuned Seamless model reaches 15.18 BLEU on a held-out TED test set and improves the Seamless baseline by 3.0 BLEU on FLEURS.
#Audio#Benchmarking#Fine-tuning#TED
why featured
HKR-K passes: the paper adds a 91k-pair, 170-hour English–Central Kurdish speech corpus and quantifies a +3.0 BLEU lift from orthographic standardization. HKR-H and HKR-R are weak because this is niche speech-translation research with limited impact on mainstream product or model
editor take
KUTED ships 91k English–Central Kurdish pairs. The bigger contribution is fixing writing variation before chasing model scores.
sharp
KUTED releases 91k English–Central Kurdish pairs with 170 hours of audio, and that alone makes this paper more useful than many “new model” papers in low-resource speech translation. My main takeaway is not the 15.18 BLEU score, and not even the +3.0 BLEU on FLEURS. It’s that the authors isolate orthographic standardization as a first-order problem instead of pretending architecture alone will save the task.
That sounds basic, but this is exactly where a lot of low-resource MT and S2TT work breaks. People benchmark a model on a language pair with unstable spelling conventions, mixed scripts, inconsistent tokenization, or community-specific variants, then treat the resulting score gap as “model capability.” If the target form itself is not normalized enough for training and evaluation to agree, BLEU gets punished before semantic quality is even measured. In that sense, the paper is doing something more mature than chasing an extra decoder tweak: it is tightening the label space.
I buy the claim that orthographic variation degrades performance, especially for Kurdish. Central Kurdish has real writing variation and standardization friction, so a model trained on heterogeneous targets will often learn conflicting surface forms. That usually shows up as noisy decoding and undercounted n-gram overlap. The gain from standardization, then, is often less “the model understands better” and more “the training target and the metric finally point in the same direction.” We’ve seen the same pattern across low-resource ASR and MT over the last year: for African and South Asian languages in particular, text normalization and curation often buy more than another round of model complexity.
I do have one pushback. Standardization can easily drift into benchmark laundering if the normalization rules are aggressive enough to collapse legitimate variation into a single “evaluation-friendly” form. The snippet does not disclose the exact rule set, how much was automated versus manually reviewed, or whether native speakers validated the standardized outputs as natural rather than merely consistent. That gap matters. A cleaned target space helps training, but it can also flatten real linguistic diversity and nudge systems toward one sanctioned register.
There’s also useful context outside the paper. Meta’s Seamless family and NLLB have spent the last two years proving that broad multilingual pretraining gives you a credible starting point for under-resourced directions. But broad coverage is not the same as depth. For many small-language pairs, the pretrained model gets you the first 70% of the way; the last meaningful jump still comes from corpus hygiene, segmentation, named-entity handling, and orthographic policy. KUTED fits that pattern. The authors fine-tune Seamless, train a Transformer from scratch, and test a cascaded Seamless-ASR-plus-NLLB-MT setup. That is the right experimental shape because it checks whether the bottleneck lives in speech recognition, translation, or data quality.
Still, the summary is thin on the numbers I’d want before drawing stronger conclusions. It does not disclose the absolute FLEURS score, the size and separation method of each split beyond “held-out TED,” or the error profile across the three system types. It also does not say much about latency or compute. That matters because end-to-end S2TT versus cascade is not just an academic choice; in practice it changes debuggability, deployment complexity, and how quickly you can patch failures for a low-resource language.
I’m also not especially impressed by 15.18 BLEU on its own. For TED-style speech translated into a low-resource target, that is respectable, not deployment-grade. The more important question is transfer: does performance hold outside TED/TEDx speaking style, outside relatively clean English speech, and outside presentation-language syntax? The +3.0 BLEU on FLEURS is the better signal because it hints at broader robustness, but the article snippet does not give enough detail to test that claim hard.
So I’d read this as a data-and-standardization paper first, a model paper second. That is not faint praise. In low-resource speech translation, getting the corpus and writing conventions into shape is often the work that actually moves the field. Bigger speech models do not erase that debt; they just expose it faster.
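The "undercounted n-gram overlap" point is easy to see in a toy case. This is a sketch under loud assumptions: it uses English spelling variants as a stand-in for Kurdish orthographic variation, the `VARIANTS` table is hypothetical and is not the paper's rule set, and the metric is only the clipped n-gram precision that BLEU is built from, not full BLEU (no brevity penalty, no geometric mean).

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total

# Hypothetical normalization table: map each spelling variant to one form.
VARIANTS = {"colour": "color", "organise": "organize"}

def normalize(tokens):
    return [VARIANTS.get(tok, tok) for tok in tokens]

ref = "we organise the colour chart".split()
hyp = "we organize the color chart".split()  # same meaning, other spelling

raw_p1 = ngram_precision(hyp, ref, 1)
raw_p2 = ngram_precision(hyp, ref, 2)
norm_p1 = ngram_precision(normalize(hyp), normalize(ref), 1)
norm_p2 = ngram_precision(normalize(hyp), normalize(ref), 2)

print(f"raw:        unigram={raw_p1:.2f} bigram={raw_p2:.2f}")
print(f"normalized: unigram={norm_p1:.2f} bigram={norm_p2:.2f}")
```

Two spelling mismatches in five tokens wipe out every bigram match, even though the translation is arguably perfect. The same table, made aggressive enough, would also hide real errors, which is the laundering risk above.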
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R0
07:21
69d ago
arXiv · cs.CL · atom · EN · 07:21 · 04·01
A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory
The study introduces JUBAKU-v2, a 216-example Japanese benchmark for bias in reasoning under fixed conclusions, targeting in-group and out-group attribution. It is built from attribution theory and Japanese cultural contexts rather than translated English data. The key claim is higher sensitivity than prior Japanese benchmarks, but the post does not disclose model names or metrics.
#Reasoning#Alignment#Benchmarking#JUBAKU-v2
why featured
HKR-K passes on three concrete facts: 216 examples, attribution-theory construction, and native Japanese data. HKR-H is weak and HKR-R is limited because the post discloses no model list, metrics, or deployment consequence, so it lands in all.
editor take
JUBAKU-v2 fills a real gap in Japanese bias evaluation, but 216 samples is too small to take “more sensitive” at face value.
sharp
JUBAKU-v2 uses 216 examples to isolate attribution bias inside reasoning while holding the conclusion fixed. That design choice is smart. Most bias benchmarks still score the endpoint: who the model favored, who it blamed, which answer it selected. They do not separate the mechanism underneath, where the model explains in-group behavior as situational and out-group behavior as dispositional. Framing the benchmark around attribution theory gets closer to the actual cognitive pattern people worry about, not just surface wording.
My positive read is straightforward: Japanese bias evaluation does need native construction rather than translated English sets. Benchmarks like BBQ, CrowS-Pairs, and StereoSet were useful in English, but translation often strips out the social cues that matter most in Japanese: politeness levels, indirectness, role hierarchy, and in-group versus out-group framing. In Japanese, those pragmatic signals are not decoration. They are part of the bias substrate. So the paper is directionally right to stop treating Japanese as an English benchmark rendered into another script.
I still do not buy the “more sensitive than existing benchmarks” claim yet. The snippet gives no model list, no scoring rubric, no significance testing, no inter-annotator agreement, and no definition of sensitivity. Sensitivity can mean several different things: larger score separation across models, more stable ranking across reruns, higher effect size, or better correlation with human judgment. Those are not interchangeable. With only 216 examples, variance becomes a real problem. If model A beats model B by two or three items, that is not a sturdy ranking unless the paper shows repeated runs and confidence intervals. If they used an LLM judge to score bias in explanations, that adds another bias layer on top of the model under test.
There is also a more structural issue here. Evaluating “bias in reasoning” has become harder because frontier models increasingly hide or compress chain-of-thought. OpenAI and Anthropic have both moved toward exposing summaries or short rationales instead of full traces. That means a benchmark like this is often measuring the bias in the model’s visible explanation policy, not necessarily the bias in the latent decision process. Those are related, but they are not the same thing. I think people in alignment sometimes blur that distinction too quickly.
The outside context that matters: the field has spent the last year shifting from output-only safety checks to process-oriented evaluation. You can see the same move in reward hacking work, deception probes, and jailbreak audits that inspect intermediate steps rather than final answers. JUBAKU-v2 sits in that trend, and that is why it matters more than its small size suggests. Still, small benchmarks have a bad habit of looking sharp because they are narrowly curated. I have seen this with several safety evals: once you rerun with different prompting or with a new model family, the headline gap shrinks fast.
So my current take is favorable on the problem framing and cautious on the benchmark claim. If the full paper later shows model-by-model results, annotation protocol, ablations against translated Japanese sets, and robustness under prompt variation, this could become a useful specialist eval. Without that, it is a promising probe, not yet a benchmark I would anchor model cards or deployment claims on.
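The small-sample worry can be put in numbers with a back-of-the-envelope sketch. The assumptions are mine, not the paper's: each of the 216 items is scored as an independent pass/fail, the two models are treated as independent, and the per-item rate is set to 0.5 as the worst case. The snippet does not disclose the actual metric, so read this as an order-of-magnitude check only.

```python
import math

N_ITEMS = 216  # benchmark size reported in the post

def sd_in_items(n, p):
    """Standard deviation of a binomial count: sqrt(n * p * (1 - p))."""
    return math.sqrt(n * p * (1 - p))

# Noise on one model's item count at the worst-case rate p = 0.5.
sd_single = sd_in_items(N_ITEMS, 0.5)

# Noise on the gap between two independent models (variances add).
sd_gap = math.sqrt(2) * sd_single

# Approximate 95% half-width on that gap, normal approximation.
half_width_gap = 1.96 * sd_gap

print(f"sd of one model's score:   {sd_single:.1f} items")
print(f"sd of the two-model gap:   {sd_gap:.1f} items")
print(f"95% half-width on the gap: {half_width_gap:.1f} items")
```

Under these assumptions, a two- or three-item gap sits well inside the noise band. A paired analysis on the same items would usually be tighter than this unpaired bound, which is one more reason the paper needs to show its statistics.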
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H0·K1·R0
06:28
69d ago
arXiv · cs.CL · atom · EN · 06:28 · 04·01
Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Optimsyn optimizes synthetic-data rubrics with influence scores and reports consistent downstream gains across domains, target models, and data generators. It uses an optimizer-aware, gradient-based estimator to score each sample’s training utility, then applies that reward to RL-tune a rubric generator. The key shift is direct target-model feedback; the post does not disclose exact gains or benchmark names.
#Fine-tuning#Alignment#Benchmarking#Research release
why featured
HKR-K passes on mechanism novelty, but HKR-H and HKR-R are weak: this reads as a niche synthetic-data method, and the body does not disclose gains, baselines, or training cost. That keeps it in the mid-60s all band.
editor take
Optimsyn ties rubric search to target-model gradients, and that part is directionally right. But without gains or benchmark names, this is still a research claim, not a recipe.
sharp
Optimsyn makes a pretty clear bet: stop judging synthetic data by rubric aesthetics or judge-model vibes, and score it by actual training utility on the target model. The paper says it uses an influence-style, optimizer-aware estimator to score each synthetic sample, then uses that score as reward to RL-tune a rubric generator. Directionally, that is the right move. It reconnects data generation to the thing we actually care about, downstream learning, instead of relying on proxies like semantic similarity, format compliance, or a separate evaluator model.
I buy that premise more than most synthetic-data papers I’ve seen. The field has spent a lot of time acting as if “looks similar to good data” and “helps training” are close substitutes. They are not. Anyone who has run SFT seriously has seen two samples that both look fine to a human, yet produce very different effects on loss curves and generalization. That gap comes from model state, optimizer dynamics, task mix, answer style, and how the sample interacts with the rest of the training set. So the line in the snippet about embedding-near samples having very different influence is believable, and honestly overdue.
That said, this is still a thinly evidenced claim from the material provided. We only have the title and an RSS snippet. The snippet does not disclose exact gains, benchmark names, target-model sizes, the number of RL steps, or the compute overhead of the influence estimator. Those omissions matter a lot. Influence-based methods tend to fail less on intuition than on accounting. The question is rarely “does this correlate with utility at all”; the question is “does the gain justify the extra gradient bookkeeping and pipeline complexity.” I’ve seen plenty of elegant data-valuation ideas that deliver a modest lift and then die when someone prices the full loop.
The broader context is important here. This paper sits in a lineage that is older than the current synthetic-data hype cycle: influence functions, data attribution, TracIn-style approaches, Data Shapley, and a pile of work trying to answer which examples actually help a target objective. What Optimsyn appears to do is splice that line of work into rubric optimization, which is a smarter insertion point than another “judge the synthetic sample” filter. Optimizing rubrics is lower-dimensional than optimizing individual generations, so it gives you a tractable control surface. That part is clever.
I still have a pushback. Optimizing rubrics against target-model feedback creates a strong risk of short-horizon overfitting to one model’s preferences. The snippet claims “strong generalization without task-specific tuning,” but I’m not granting that until I see transfer tests. A rubric that produces high-influence samples for one 7B instruction-tuned model does not automatically transfer to another architecture, tokenizer, optimizer, or even a later checkpoint of the same family. This is one of the recurring problems in synthetic-data systems: the pipeline learns the quirks of the evaluator, then mistakes that for broad usefulness. Here the evaluator is closer to the target model, which is better, but it still does not eliminate the overfitting risk.
There is also a productization issue. Training utility is not deployment utility. In medicine, law, finance, and other knowledge-dense domains, the samples that improve benchmark loss the most are not always the samples you want in a production assistant. A utility-maximizing loop can reward narrow stylistic regularities, exploit annotation artifacts, or amplify confident but brittle answer forms. The snippet does not say whether they pair influence rewards with factuality or safety constraints, and that omission is important. If the method only optimizes for what helps the model fit a task objective, it can still produce a dataset you would hesitate to ship.
From an industry lens, though, this paper hits the right pressure point. Over the last year, the synthetic-data conversation has shifted from “can we generate lots of data” to “which generated data is actually worth spending training budget on.” The old self-instruct playbook, Evol-Instruct variants, RLAIF pipelines, and judge-filter loops all run into the same wall: volume is cheap; useful volume is not. I’d be shocked if frontier labs were not already doing more sophisticated internal data valuation than what they publish. Optimsyn’s contribution, if the full paper holds up, is not inventing model feedback. It is moving that feedback upstream, from scoring answers to steering rubric creation.
My current read is simple: the direction is strong, the mechanism is plausible, and the claim is incomplete. The title gives you “consistent improvements” and “across domains,” but the snippet does not disclose the actual gains, the baselines, or the cost. Without those, this is a promising research interface, not yet an operational recipe. If the full paper shows meaningful lifts under reasonable compute overhead and across genuinely different target models, people building synthetic-data factories should pay attention. If the gains are small or the transfer is weak, this becomes another academically neat loop that won’t survive contact with production training economics.
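For readers who have not seen a gradient-based influence score, here is a minimal TracIn-style sketch on a toy linear model. This is not Optimsyn's estimator: the paper describes an optimizer-aware variant that the snippet does not spell out, and this version assumes plain SGD with a single step. All names and numbers are hypothetical. The idea is that a training sample's first-order influence on a validation example is the learning rate times the dot product of their gradients.

```python
def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def first_order_influence(w, train, val, lr=0.1):
    """Predicted drop in validation loss after one SGD step on `train`.

    Positive means the sample is expected to help; negative, to hurt.
    """
    g_train = grad_squared_error(w, *train)
    g_val = grad_squared_error(w, *val)
    return lr * dot(g_train, g_val)

w = [0.0, 0.0]                 # current model weights
val = ([1.0, 1.0], 2.0)        # the example we care about
sample_a = ([1.0, 0.9], 2.0)   # close to val in input space, consistent label
sample_b = ([1.0, 0.9], -2.0)  # identical input, conflicting label

influence_a = first_order_influence(w, sample_a, val)
influence_b = first_order_influence(w, sample_b, val)

print(f"influence of sample_a: {influence_a:+.2f}")
print(f"influence of sample_b: {influence_b:+.2f}")
```

The two samples are indistinguishable by input similarity, yet their influence has opposite sign. It also shows where the accounting bites: every score needs a gradient for the sample and for the validation target.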
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R0
06:12
69d ago
arXiv · cs.CL · atom · EN · 06:12 · 04·01
MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference
MF-QAT trains one model to stay robust across multiple quantization formats and reports performance close to single-format QAT at each target precision. The paper adds Slice-and-Scale to convert one MXINT8 or MXFP8 anchor checkpoint into lower-precision MXINT or MXFP formats at runtime; the post does not disclose model sizes, benchmarks, or exact accuracy deltas. The part to watch is deployment: one checkpoint spans multiple hardware and runtime constraints without retraining per format.
#Inference-opt#Research release
why featured
The paper adds Slice-and-Scale and a one-checkpoint-to-many-formats claim, so HKR-K passes. But it hits hard-exclusion-technical-accessibility: low-level quantization/numerical-method work with no disclosed benchmark table, error deltas, or generalist on-ramp.
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
04:44
69d ago
● P1 · arXiv · cs.CL · atom · EN · 04:44 · 04·01
Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
The paper studies 960 sessions across 15 tasks and two model pairs, finding persona-based agent judges indistinguishable from human raters in a Turing-style test. Score quality improves logarithmically with panel size, while unique issue discovery follows a sublinear power law, with score saturation about 2x faster. The key mechanism is ensemble diversity: Big Five persona conditioning and expert judges extend coverage, and ablations show simple prompting is not enough.
#Benchmarking#Alignment#Agent#Research release
why featured
HKR-H/K/R all pass: the paper combines a strong hook with concrete scaling-law results and a practical ablation on structured personas. I kept it at 80 because this is an arXiv evaluation paper, not a major lab release or an industry-moving event.
editor take
The paper moves LLM judges forward with 960 sessions, but “human-like” still does not equal “trustworthy.”
sharp
The paper runs 960 sessions across 15 tasks and finds persona-based agent judges indistinguishable from human raters in a Turing-style test, but I read this as a coverage-scaling result, not as trust in LLM evaluation being solved. That distinction matters. A lot of teams already use LLM judges as cheap stand-ins for preference labeling, red teaming, and regression checks. If you stop at “they look human,” you will overread the result. Looking human only shows these judges can reproduce part of the human rating distribution. It does not prove calibration, bias stability, or transfer across task types. The snippet does not disclose model names, confidence intervals, or agreement statistics, so I would not promote this to settled evaluation science yet.
The strongest contribution here is the separation between score quality and issue-discovery coverage. The authors say score quality improves logarithmically with panel size, while unique issue discovery follows a sublinear power law, with score saturation happening roughly twice as fast. That matches what many practitioners already feel in red teaming. A few viewpoints are often enough to rank outputs broadly. Finding corner-case failures is a different game, and panel size keeps getting more expensive. That pattern also lines up with the last two years of LLM-as-a-judge work. MT-Bench, Chatbot Arena style pairwise judging, AlpacaEval, and related methods all showed that model judges are useful for relative ranking. They are much weaker at systematically surfacing diverse failure modes. I remember Anthropic and OpenAI system cards leaning on diverse red-teaming setups rather than pretending one universal judge can do both jobs.
I still push back on the phrase “indistinguishable from human raters.” A Turing-style validation is clever, but it tests resemblance, not correctness. Human raters are already biased: verbosity preference, confidence bias, first-impression effects, stylistic favoritism. LLM judges often inherit and amplify those patterns. Work around G-Eval, Prometheus, and judge bias audits made that problem pretty clear. Under that lens, becoming more human-like does not automatically make an agent judge better. It may just make it a more stable reproducer of human evaluation artifacts. The snippet gives no external ground truth like task completion, user retention, factual error rate after review, or downstream business outcomes. Without that anchor, “indistinguishable” is far weaker than “validated.”
I do buy the structured-persona result more than the headline result. Simple prompting often creates shallow stylistic variance on top of the same underlying evaluator, so additional judges remain highly correlated. Big Five conditioning is at least a plausible way to induce more orthogonal evaluation functions: conscientiousness pushing rigor, neuroticism pushing risk sensitivity, agreeableness softening tone judgments, and so on. Expert judges acting as adversarial probes also makes sense. The gain in ensembles rarely comes from raw count alone; it comes from low correlation. That is old ensemble-learning logic, now applied to evaluator populations. If the full paper reports inter-judge correlation matrices or diversity metrics, that would be the part I would study first. The snippet does not say.
There is also an external-validity problem. Two model pairs and 15 tasks are enough to show a pattern, not enough to assume a universal law. The shape of the discovery curve probably depends on task openness. Open-ended dialogue, agent planning, or long-context retrieval have fat-tailed error spaces. Constrained QA, formatting checks, or unit-tested coding tasks tend to converge faster. If those are pooled together, you can mistake a task-mixture effect for a general scaling law. I have not verified whether the paper stratifies by task family. If it does not, I would treat the power-law claim as provisional.
Practically, I would use this paper as a budgeting guide for evaluation systems, not as a license to replace humans wholesale. For ranking models, A/B comparisons, and regression monitoring, a small but deliberately diverse panel is probably enough. For safety review and long-tail defect discovery, panel size should be set by coverage goals, not by marginal score stability. That is the operational takeaway I trust here. The unresolved pieces are still important: model identities, per-task breakdowns, ground-truth anchors, and cost curves are not disclosed in the snippet. Until those show up, this is a useful map for how evaluator ensembles scale, not proof that AI judges are ready to be trusted on their own.
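The "low correlation beats raw count" logic has a standard closed form, sketched below. It assumes every judge has the same score variance and every pair of judges has the same correlation `rho`; both are simplifications of mine, and the numbers are illustrative. It does not reproduce the paper's logarithmic or power-law curves. Under those assumptions, the variance of a panel's mean score is `sigma2 * (rho + (1 - rho) / n)`, so it can never drop below `rho * sigma2` however many judges you add.

```python
def panel_mean_variance(n, sigma2, rho):
    """Variance of the mean of n equally correlated judges.

    Assumes equal per-judge variance sigma2 and equal pairwise correlation rho.
    """
    return sigma2 * (rho + (1 - rho) / n)

SIGMA2 = 1.0  # per-judge score variance (illustrative)

print("n    rho=0.8   rho=0.2")
for n in (1, 4, 16, 64):
    similar = panel_mean_variance(n, SIGMA2, rho=0.8)  # near-duplicate judges
    diverse = panel_mean_variance(n, SIGMA2, rho=0.2)  # deliberately diverse judges
    print(f"{n:<4d} {similar:.4f}    {diverse:.4f}")
```

In this setup, four diverse judges (`rho=0.2`) already give a lower-variance mean than sixty-four near-duplicates (`rho=0.8`). That is the budgeting argument for spending on persona diversity before panel size.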
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
04:01
69d ago
X · @Yuchenj_UW · x-api · MULTI · 04:01 · 04·01
I like how the Anthropic Claude Code team is being chill about the code leak.
The post says leaked Anthropic Claude Code repos have reached 70k forks, with Python and Rust versions circulating on GitHub. It adds only the author's view: harness engineering is hard, and a Cursor-like path is product plus harness first, then model training later; leak details and Anthropic's response are not disclosed.
#Code#Tools#Anthropic#Claude Code
why featured
HKR-H and HKR-R land: the leak-plus-chill angle is clickable, and the moat debate matters to code-agent builders. HKR-K fails because the post is mostly opinion; the 70k-forks claim is not substantiated, and leak scope, timeline, and Anthropic's response are not disclosed.
editor take
The post claims the leak hit 70k forks. At that scale, Claude Code stops being internal tooling and becomes field notes; I don’t buy the “they’re chill” framing.
sharp
The post claims the leaked Claude Code repos reached 70k forks, which means Anthropic has likely lost the ability to meaningfully pull the engineering details back. If that number is real, the interesting part is not the leak as spectacle. It’s that one layer of the moat behind code-agent products just got exposed to the market. The snippet gives us only three usable facts: 70k forks, Python and Rust versions on GitHub, and one opinion about harness engineering. It does not disclose the leak source, what commit history was exposed, whether secrets were included, or how Anthropic responded. So I’d keep this at the level of product-engineering impact, not overstate it as a fully characterized security incident.
I also don’t buy the “they’re being chill” framing. Once source code is on GitHub and forked at that scale, “calm” often just means “there is no clean containment path left.” Deleting the original repo does very little when mirrors, forks, zip archives, and Discord redistribution are already in motion. This looks less like a classic enterprise source leak that legal can slowly suppress, and more like a one-way spill where the marginal value of enforcement drops fast. Since the article gives no official statement, I’m not going to invent a noble posture for Anthropic.
The post’s strongest point is the line about harness engineering being hard. That part tracks. A lot of people still act like coding agents are “just plug Sonnet or GPT into an IDE and add tools.” In practice, the hard part is the harness: context packing, repo indexing, tool routing, retry logic, sandboxed execution, test orchestration, rollback, permission boundaries, checkpointing long jobs, and replayable evals. None of those components is magical by itself. The moat comes from making them behave well together under real latency and failure constraints. Over the last year, much of the user-perceived gap between Cursor, Devin, Windsurf, and weaker coding products has come from that systems layer, not only the base model.
There’s a broader pattern here that the post points at, and I think that part is directionally right. From 2024 into 2025, the coding-assistant market kept showing that distribution and workflow lock-in mattered more than having your own frontier model on day one. Cursor did not win early because it had the best proprietary base model. It won because the editor experience was fast, sticky, and integrated into how developers already worked. I remember the company later investing more heavily in training and post-training, though I haven’t verified the exact timeline recently. So yes, more startups will try the “product plus harness first, model later” path.
But I wouldn’t overread this into “wrappers are now validated.” That story is too convenient. Seeing Anthropic’s harness code does not hand you the hard assets that actually sustain quality: private user traces, failure logs, internal eval suites, tool telemetry, ranking data, and the iteration cadence that tunes the whole loop. In 2026, post-training is not a casual add-on. You can copy architecture patterns faster than you can copy the data flywheel behind them. That’s the gap a lot of wrapper narratives still gloss over.
So who gets squeezed by a leak like this? First, teams pitching opaque “agent orchestration know-how” as if that alone is defensible. If one of the best-known labs has some of its implementation studied line by line, investors and customers get less patient with hand-wavy claims about secret sauce. Second, small products that are basically API shells with thin execution layers. Once the community digests leaked code, open-source reproductions and scaffolds usually appear fast, and those companies will have a harder time defending margins or retention.
I still wouldn’t jump to “Anthropic’s moat is gone.” Source exposure is not capability replication. We’ve seen this repeatedly across AI products: seeing prompts, UX, or chunks of implementation does not let you reproduce live production quality. Coding agents depend heavily on model versions, internal tools, eval thresholds, telemetry, and human tuning. The snippet says Python and Rust versions are circulating, but it does not say whether the repos are complete, runnable, or coupled to internal services outsiders can’t access. Without that, any strong claim about competitive parity is premature.
My read is that the biggest impact here is educational, not existential. This leak will make more of the market admit that coding agents are not prompt wrappers. They are heavy systems products. That matters because it raises the bar for everyone else. Once Anthropic’s approach gets dissected, users and buyers will expect tighter test loops, better recovery behavior, and more reliable long-horizon execution from the rest of the field. Companies still selling “we use a strong model, therefore we do coding” are going to look thin very quickly.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K0·R1
03:39
69d ago
arXiv · cs.CL · atom · EN · 03:39 · 04·01
Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics
The paper uses a 2x2 factorial decomposition and finds lexical-only overlap exceeds semantic-only overlap across models from 110M to 70B parameters. The confound is concentrated in <=1% of activation dimensions, 18-36% of sparse autoencoder features blend senses, and filtering it improves word sense disambiguation and makes knowledge edits more selective (p=0.002).
#Interpretability#Benchmarking#Alignment#arXiv
why featured
HKR-K passes because the paper adds concrete facts: a 2x2 factorization, ≤1% active dimensions, and 18%-36% mixed-sense SAE features. It still triggers hard-exclusion-technical-accessibility: the story is too specialized in mechanistic interpretability and lacks a clear product,
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
03:39
69d ago
arXiv · cs.CL · atom · EN · 03:39 · 04·01
Execution-Verified Reinforcement Learning for Optimization Modeling
The paper proposes EVOM, an execution-verified RL framework for solver code generation, and reports matching or beating process-supervised SFT on 4 benchmarks and 3 solvers. EVOM treats Gurobi, OR-Tools, and COPT as deterministic verifiers: code runs in a sandbox, execution outcomes become scalar rewards, and GRPO plus DAPO optimize a closed loop. The key point for practitioners is solver transfer: switching the verification environment enables zero-shot transfer, and continued training on a target backend gives lower-cost adaptation.
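The closed loop in the summary (run the generated code, turn the execution outcome into a scalar reward, then optimize over a group of samples) can be sketched roughly as below. This is a minimal illustration, not EVOM's implementation: the reward values (1.0 / 0.1 / 0.0), the function names `execution_reward` and `group_relative_advantages`, and the convention that candidate code prints its objective on the last line are all assumptions made for the example. A plain child process with a timeout stands in for the sandbox, and no real solver such as Gurobi, OR-Tools, or COPT is called.

```python
import math
import subprocess
import sys
from statistics import mean, pstdev


def execution_reward(code: str, reference_objective: float,
                     timeout_s: float = 5.0, rel_tol: float = 1e-6) -> float:
    """Run candidate code in a child process and score its printed objective.

    Hypothetical scheme: 1.0 if the last printed line matches the reference
    objective, 0.1 if the code runs but the value is wrong, 0.0 otherwise.
    A child process with a timeout is NOT a real sandbox; it only stands in
    for one here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung or too slow
    if proc.returncode != 0:
        return 0.0  # crashed or exited with an error
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return 0.0  # nothing printed
    try:
        value = float(lines[-1])
    except ValueError:
        return 0.0  # last line is not a number
    if math.isclose(value, reference_objective, rel_tol=rel_tol, abs_tol=1e-9):
        return 1.0
    return 0.1  # executable but incorrect


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group, in the style of GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # no signal when all samples tie
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    candidates = ["print(3 * 4 + 2 * 5)", "print(7)", "raise SystemExit(1)"]
    rewards = [execution_reward(c, 22.0) for c in candidates]
    print(rewards, group_relative_advantages(rewards))
```

Under these assumptions, swapping the verification environment amounts to changing what runs inside `execution_reward`, while the advantage computation stays the same.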
#Reasoning#Code#Tools#Gurobi
why featured
EVOM is a real research contribution: solver execution becomes the reward signal across 4 benchmarks and 3 solvers. Audience fit is weak; the story depends on optimization-modeling and solver-specific context, so hard-exclusion-technical-accessibility caps it below 40.
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
02:03
69d ago
arXiv · cs.CL · atom · EN · 02:03 · 04·01
Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs
The paper builds a 12-question dataset and a five-level rubric to test multiple contemporary LMs on tacit reasoning in quantum field theory and string theory. Models score near ceiling on explicit derivations in stable frames, but degrade when they must reconstruct omitted steps or satisfy global consistency constraints; the sharper failure is unstable representation selection, not just missing steps.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes because the paper gives a concrete setup: 12 questions, a five-level rubric, and specific failure modes. But it triggers hard-exclusion-4 and brushes hard-exclusion-1: QFT/string-theory is off-lane for this audience and too specialist for generalist AI readers.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H0·K1·R0
02:00
69d ago
OpenAI Blog · rss · EN · 02:00 · 04·01
Gradient Labs gives every bank customer an AI account manager
Gradient Labs announced an AI account manager for bank customers. The title says it is for “every bank customer,” but the article body provides no mechanism, deployment conditions, or other concrete details. With only the headline available, this is best treated as a product-update signal rather than a full release note.
#Agent#Gradient Labs#Product update
why featured
HKR-H and HKR-R pass on the banking-workflow hook, but HKR-K fails because the page discloses model names and '10x growth' only. This is a vendor case study whose takeaway is 'a customer uses OpenAI,' so hard-exclusion-pure-marketing applies.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H1·K0·R1
01:54
69d ago
X · @op7418 · x-api · ZH · 01:54 · 04·01
OpenAI's new funding round is said to reach $125 billion
The title and snippet say OpenAI's new funding round reaches $125 billion. The post stresses that this is the funding amount, not the valuation, and it does not disclose investors, round stage, deal terms, or source details. Watch the sourcing and terms, not the hype.
#OpenAI#Sam Altman#Funding#Commentary
why featured
Hard-exclusion-6 applies: zero-sourcing content. The post offers an emotional headline and a $125B claim, but no source link, lead investor, round details, or terms; HKR-H and HKR-R are present, HKR-K fails, so importance stays below 40 and tier is excluded.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H1·K0·R1
01:23
69d ago
X · @dotey · x-api · ZH · 01:23 · 04·01
It won't be open-sourced, not because the code is so valuable, but because closed source has many benefits
dotey lists four claimed benefits of staying closed source and concludes the product will not be open-sourced. The post cites hiding poor code quality, adding anti-distillation or user ID logic, staging prebuilt features, and faster iteration without code review; these are the author's claims, with no verifiable case disclosed.
#dotey#React#Commentary
why featured
This triggers hard-exclusion-zero-sourcing: four arguments are listed, but no case, data, or named firsthand example is provided, so importance is capped below 40. HKR-H and HKR-R land, but HKR-K fails because there is no new factual payload.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H1·K0·R1
00:27
69d ago
X · @AnthropicAI · x-api · EN · 00:27 · 04·01
Anthropic signs MOU with the Australian Government on AI safety research
Anthropic said it signed an MOU with the Australian Government to collaborate on AI safety research and support Australia's National AI Plan. The snippet confirms the parties and scope, but the post does not disclose term length, funding, research agenda, or delivery mechanism. The real signal is whether this turns into evaluations, policy tooling, or procurement standards.
#Safety#Alignment#Anthropic#Australian Government
why featured
This has HKR-R because government AI safety ties can shape compliance and procurement. HKR-H and HKR-K miss: it is an MOU announcement with no disclosed term, funding, research agenda, or delivery mechanism, so it stays in all.
editor take
Anthropic and Australia disclosed only an MOU, with no term, budget, or deliverables; this looks like policy positioning, not deployed safety infrastructure.
sharp
Anthropic disclosed 1 MOU with the Australian Government, and the post omits term length, funding, research scope, and delivery mechanics. My read is simple: don't read this as national AI safety infrastructure getting deployed. Right now it looks more like a frontier lab securing position inside an important policy jurisdiction. The word MOU does a lot of work here. An MOU usually signals intent, not procurement, not a binding regulatory regime, and not an operational safety program. Without a budget, timeline, or evaluation framework, we cannot tell whether this becomes a few workshops, a research paper, or something that actually changes behavior, like model eval requirements, incident reporting pathways, or procurement standards for government use. Those are very different outcomes. One is optics. The other shapes market access. I've thought for a while that Anthropic's government strategy has been pretty consistent over the last year: turn “safety” from a research identity into a credential for entering public-sector and regulated markets. You could already see versions of this around the UK AI Safety Institute, the earlier voluntary commitments in the US, and the broader push for pre-deployment testing norms. OpenAI and Google DeepMind have done similar work, but Anthropic has been more disciplined about presenting itself as the safety-aligned partner. That matters because once governments write third-party evals, model documentation, or deployment review into procurement flows, companies involved early in drafting those norms start with an advantage. I do have a pushback here. The title says Anthropic will support Australia's National AI Plan, but the body never says whether Anthropic is contributing researchers, tooling, evaluation methods, policy advice, or just access. That ambiguity is convenient. It can frame a commercial positioning exercise as public-interest collaboration. 
If the eventual output is an Anthropic-flavored evaluation stack, or standards that fit Claude-style documentation and assurance practices better than rivals, then this is not just safety research. It is also market design. I'm not saying that's inherently bad. I am saying it is not neutral. There is also broader context outside the snippet. Australia has been moving toward a mix of AI risk governance and national capability building, with a stronger sovereignty instinct around cloud, platforms, and critical tech dependencies. Anthropic's value here is not that Australia alone is a massive model market. The value is whether Australia becomes a template jurisdiction: evaluation templates, incident-reporting formats, model risk tiers, and procurement language that can travel to places like the UK, Canada, or Singapore. If that happens, a thin MOU starts to matter a lot more. The material here is still sparse, so the judgment has to stay disciplined. The title gives us the partnership and the theme. The body gives us almost nothing operational. I would not overrate it yet. This moves up a tier only if later disclosures add three things: a concrete evaluation target such as frontier model pre-deployment assessments, a funding and accountability structure, and a path into government procurement or assurance processes. Without those, this is a positioning document.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K0·R1
00:08
69d ago
Sspai (direct RSS) · rss · ZH · 00:08 · 04·01
Morning Dispatch: Claude Code source code leaked by accident, OpenAI raises $122 billion, and more
The headline says Claude Code source code leaked by accident and OpenAI raised $122 billion. The RSS snippet only adds that Sony will keep increasing PlayStation Plus prices and Microsoft is building fully native Windows 11 apps; the post does not disclose the leak scope, funding round, or investors. This is a news roundup, not a deep dive on one event.
#Code#Tools#Anthropic#OpenAI
why featured
This is a news roundup, not a standalone report on the Claude Code leak or OpenAI's $122B funding. HKR-H passes on headline curiosity, but HKR-K and HKR-R fail because key facts are missing; hard-exclusion-stale rerun caps it below 40.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H1·K0·R0
00:00
69d ago
Computing Life · Share (鸭哥 research reports) · rss · ZH · 00:00 · 04·01
Claude Code's defenses: how it stops you from pretending to be it
The title says Claude Code has defenses to stop users from pretending to be it, but the RSS body is empty, so only the title is available. The item does not disclose the mechanism, trigger conditions, false-positive rate, or scope. What actually matters is whether the control sits in system prompts, tool permissions, or output checks.
#Safety#Tools#Claude Code#Commentary
why featured
Hard-exclusion-zero-sourcing applies: the body is empty, so there are no facts, examples, or reproducible details. Only HKR-H passes; HKR-K and HKR-R lack support, so importance stays capped below 40 despite a mildly interesting Claude Code security hook.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H1·K0·R0
