papers · 2026-05-23

▸ 185 papers · updated 3m ago

2026-05-23 · Sat

04:00

17d ago

● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Vector Policy Optimization Improves Diversity in Test-Time Search

The paper proposes Vector Policy Optimization as a drop-in replacement for the GRPO advantage estimator, and reports that it matches or beats scalar RL baselines across four tasks, with larger gains as the test-time search budget grows.

#Reasoning #Code #Fine-tuning #Research release

why featured

HKR-H/K/R all pass: VPO replaces GRPO's scalar advantage with a vector estimator, and the reported edge grows across 4 tasks as search budget rises. It stays below 78 because the source discloses no code or independent replication.

editor take

VPO pushes diversity back into training, not sampling knobs. If the results hold, scalar-reward GRPO starts looking too narrow for search-heavy agents.

sharp

Three sources carried the same headline, but this is one arXiv paper mirrored across cs.LG, cs.CL, and Reddit; the agreement is a single-source chain, not independent validation. The paper proposes VPO as a drop-in replacement for the GRPO advantage estimator, training policies on vector-valued rewards so sampled solutions specialize across trade-offs. I buy the direction, but not the swagger around making it the default post-training objective. The concrete hook is strong: across four tasks, VPO matches or beats scalar RL on pass@k and best@k, with gaps widening as search budget grows; in evolutionary search, VPO solves problems GRPO does not solve. The missing piece is also obvious: the abstract gives no model scale, task list, or absolute lift. For AlphaEvolve-style systems, this is a cleaner bet than endlessly tuning temperature.
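Since the headline metric is pass@k, it helps to see how the budget k enters the score. A minimal sketch of the widely used unbiased pass@k estimator follows; it is not taken from the VPO paper, and the function name and sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generated samples, is among the c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k hits a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 2 correct out of 10 samples, `pass_at_k(10, 2, 1)` is 0.2 while `pass_at_k(10, 2, 5)` is about 0.78, the kind of budget sensitivity the paper's pass@k comparison turns on.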

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Maestro uses a 4B orchestrator to reach 70.1% average accuracy across 10 multimodal benchmarks, above GPT-5 at 69.3% and Gemini-2.5-Pro at 68.7%, by training an outcome-based RL policy over frozen expert models and a two-tier skill library.

#Agent #Multimodal #Reasoning #Maestro

why featured

HKR-H comes from the 4B orchestrator beating GPT-5; HKR-K has 10 multimodal benchmarks and 70.1/69.3/68.7. Single arXiv evidence needs replication, so this is featured research, not same-day must-write.

editor take

Maestro beats GPT-5 by 0.8 points with a 4B router; don’t read this as small-model magic until the expert pool cost is audited.

sharp

Maestro’s useful claim is not the 70.1% average beating GPT-5’s 69.3%. It moves the gain into routing. A 4B orchestrator chooses among frozen expert models and a two-tier skill library. Outcome-based RL also avoids step-level labels, which makes the recipe much easier to reproduce than supervised agent traces. I don’t fully buy the leaderboard framing. The abstract says low latency, but the provided text gives no token cost, expert count, retry rate, or per-task call budget. If a single answer fans out across several external experts, 70.1% is not directly comparable to a monolithic GPT-5 run. This smells closer to RouteLLM or mixture-of-agents for multimodal tasks: strong as a composition layer, easy to overclaim as a model-vs-model win.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Metis reaches an 89.2% average Attack Success Rate across 10 evaluated models, with 76.0% on O1 and 78.0% on GPT-5-chat, while reducing token costs by 8.2x on average and up to 11.4x under the tested jailbreak settings.

#Reasoning #Safety #Alignment #Metis

why featured

HKR-H/K/R all pass: the GPT-5-chat/O1 jailbreak hook is concrete, and ASR plus token-cost numbers add substance. It is a high-signal safety paper, but as a single arXiv result it stays below must-write P1.

editor take

Metis turns jailbreaks from prompt craft into closed-loop policy optimization; 89.2% ASR across 10 models makes static refusal look exposed.

sharp

Metis is scary less because of 89.2% average ASR, and more because it cuts attack token cost by 8.2x. O1 still lands at 76.0% ASR, and GPT-5-chat at 78.0%. That says frontier safety layers still behave like probeable state machines under closed-loop pressure. I don’t buy the paper’s cheerful framing around “transparent reasoning traces.” Attackers get an interpretable tuning loop; defenders get a faster red-team script. ICML 2026 acceptance gives the result weight, but the scope matters: these are tested jailbreak settings. Product defenses like rate limits, session isolation, and audit triggers were not shown breaking here.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Training-Trajectory-Aware Token Selection

The paper proposes T3S, a token-level selection method based on training trajectories; with hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and a T3S-trained LLaDA-2.0-Mini beats its AR baseline among 16B-scale no-think models.

#Reasoning #Fine-tuning #Inference-opt #Qwen

why featured

HKR-H/K/R all pass: T3S ties token selection to training trajectory and claims Qwen3-8B beats DeepSeek-R1 while 32B nears 235B. Single arXiv paper still needs replication, so this stays high-quality research, not P1.

editor take

T3S is less about “hundreds of samples beat R1” and more a warning: strong-student distillation can fail at the token objective level.

sharp

T3S matters because it gives strong-student distillation a token-level failure mode, not another vague data-quality story. The paper says loss keeps falling while multiple metrics crash at the same bottleneck, then recover. Its mechanism is specific: Imitation-Anchor Tokens lock optimization early, while yet-to-learn tokens get confidence-suppressed until later. That is a usable knob, if the claim survives reproduction. I’d still keep the hype on a leash. “Hundreds of examples” making Qwen3-8B beat DeepSeek-R1, and Qwen3-32B approach Qwen3-235B, are very large claims. The snippet gives no benchmark names, scores, teacher setup, or sampling conditions. DeepSeek-R1 already made distillation the default story; T3S earns attention only if the training-trajectory selection reproduces outside its own benchmark stack.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

The paper proposes compiling agentic procedures into small fine-tuned model weights and evaluates the approach on travel booking with 14 nodes, Zoom support with 14 nodes, and insurance claims with 55 nodes; the abstract claims near-frontier quality at two orders of magnitude lower cost.

#Agent #Fine-tuning #LangGraph #CrewAI

why featured

HKR-H/K/R all pass: the paper offers a counterintuitive mechanism, three workflow sizes, and a 100x cost claim. As a single arXiv paper needing replication, it stays in the 78–84 band.

editor take

This goes after the runtime tax of LangGraph/CrewAI-style agents; the claim only gets serious if it survives beyond 55-node workflows.

sharp

The sharp claim here is that many agent frameworks turned deterministic procedure into expensive runtime theater. The paper names LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex, with 290,000+ GitHub stars combined. Their shared pattern is external orchestration: inject instructions and route decisions every turn. The proposed fix is old-school and uncomfortable for framework vendors: compile the procedure into a small fine-tuned model’s weights. The evidence covers travel booking with 14 nodes, Zoom support with 14 nodes, and insurance claims with 55 nodes and 6 decision hubs. The abstract claims two orders of magnitude lower cost. I buy the savings on context, calls, and procedure leakage. I don’t yet buy “near-frontier quality.” Enterprise workflows break on exceptions, audit trails, rollback, and permissions, not just node count.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Putnam 2025 Problems in Rocq Using Opus 4.6 and Rocq-MCP

Claude Opus 4.6 used Rocq-MCP tools on an isolated no-internet VM to prove 10 of 12 problems from the 2025 Putnam Mathematical Competition, deploying 141 subagents over 17.7 active compute hours and consuming about 1.9 billion tokens.

#Agent #Reasoning #Tools #Anthropic

why featured

HKR-H/K/R all pass: the Putnam 10/12 result is clickable, the setup gives hard numbers, and the cost profile matters to agent builders. Single arXiv paper, not a broad product release, so it stays below P1.

editor take

Opus 4.6 proving 10/12 Putnam problems is serious; 1.9B tokens is the catch. This is proof search at industrial burn, not a math savant moment.

sharp

Don’t read this as “the model does math now.” The sharper read is that Opus 4.6 plus Rocq-MCP turns hard contest problems into verifiable search jobs, then throws an agent farm at them. The 10/12 result is strong, but the bill is explicit: 141 subagents, 17.7 active compute hours, 51.6 wall-clock hours, and about 1.9B tokens. That is far from a normal interactive math assistant. The no-internet VM and public Rocq proofs matter more than the splashy score. No internet cuts off retrieval leakage; Rocq makes the output auditable. Compared with miniF2F-style short-proof benchmarks, Putnam is a better stress test for long-horizon formal work. I still would not extrapolate to “automatic mathematician”: the tool strategy was tuned from prior miniF2F-Rocq logs, and the abstract does not explain the two failures.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

HealthCraft releases a public emergency-medicine RL safety environment with 195 tasks, 24 MCP tools, and 2,255 binary grading criteria. Claude Opus 4.6 reaches 24.8% Pass@1 and GPT-5.4 reaches 12.6%, while multi-step workflows fall to 1.0% and 0.0%.

#Agent #Safety #Benchmarking #HealthCraft

why featured

All three HKR axes pass: this is a reproducible agent-safety environment, not generic medical AI, with low Claude Opus 4.6 and GPT-5.4 pass rates. As a single arXiv benchmark, it fits the 78–84 band, not must-write.

editor take

HealthCraft hits trajectory safety, and the numbers are brutal: Claude Opus 4.6 at 1.0% multi-step, GPT-5.4 at 0.0%. Don’t wire this into ER workflows yet.

sharp

HealthCraft exposes the gap vendors keep hiding: passing medical questions is not surviving an emergency trajectory. Claude Opus 4.6 gets 24.8% Pass@1, and GPT-5.4 gets 12.6%. On multi-step workflows, they drop to 1.0% and 0.0%. That is not benchmark noise; that is trajectory-level safety collapse. The evaluation design matters here: 195 tasks, 24 MCP tools, 2,255 binary criteria, and 515 safety-critical criteria with zero reward on violation. Six infrastructure bugs fixed between v2 and v8 even changed which model looked stronger. Static medical QA boards still flatter models that can answer isolated prompts. HealthCraft asks whether the model survives tools, pressure, and sequential clinical decisions. The answer is ugly.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

SkillWeave partitions a general-purpose model into lightweight domain-specific skillpacks and compresses them with SkillZip for deployment; a 9B SkillWeave model beats several baselines and surpasses a 32B monolithic LLM on multi-task and agentic benchmarks, with up to 4x speedup.

#Agent #Inference-opt #Benchmarking #SkillWeave

why featured

HKR-H/K/R all pass: modular skillpacks claim a 9B model can beat a 32B monolithic LLM with up to 4x speedup. It stays in the 78–84 band because this is a single arXiv paper without independent replication or major-lab authority.

editor take

A 9B beating a 32B is the hook; the actual bet is whether skillpack routing survives messy production workloads.

sharp

SkillWeave’s sharp idea is not the 9B-beats-32B headline; it is turning capabilities into deployable domain-specific deltas. The snippet gives two hard claims: a 9B SkillWeave model beats a 32B monolithic LLM on multi-task and agentic benchmarks, and SkillZip reaches up to 4x speedup. The abstract does not name the benchmarks, the 32B baseline, the memory budget, or routing overhead. I buy the direction more than the number. LoRA and adapters have had the same production problem for years: many small patches are easy to train and annoying to compose. If SkillWeave makes skillpacks compact and inference-ready, it fits the enterprise path for cheaper specialization. But 4x speedups often come from narrow tasks and cache-friendly setups; cross-domain agent runs will expose switching cost fast.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Linear Dynamics in the RLVR Training of Large Language Models

The paper analyzes RLVR trajectories and finds that weights and output log-probabilities enter a linear regime across model families, RL algorithms, and training setups with R² > 0.7; weight-space extrapolation delivers a 6.1x training speedup, while output-space extrapolation improves math and coding benchmark performance by 4.2% on average.

#Reasoning #Fine-tuning #Benchmarking #Miaow-Lab

why featured

HKR-H/K/R all pass: the linear-extrapolation angle is novel, the paper gives concrete numbers, and it targets RLVR training cost. It is strong research signal, not an industry-shaking release.

editor take

If RLVR really settles into linear drift, post-training stops looking like mysticism and starts looking like an engineering loop.

sharp

The sharp claim here is that RLVR post-training has a predictable drift, not just noisy reward hacking. The paper reports a linear regime across model families, RL algorithms, and training setups, with weights and teacher-forced output log-probs reaching R² > 0.7. Then it turns that observation into an intervention: weight-space extrapolation with periodic re-grounding gives a reported 6.1x training speedup, while output-space extrapolation adds 4.2% on math and coding benchmarks. I buy the direction, not the discount sticker yet. If this holds on larger MoE systems, long-horizon reasoning, and tool-use runs, RLVR becomes much less artisanal. But the abstract does not name model sizes, benchmark mix, or failure cases. A 6.1x number without those details is a research lead, not an infra planning assumption.
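To make "weight-space extrapolation" concrete, here is a rough sketch under stated assumptions: fit a straight line to each weight coordinate across saved checkpoints, and extrapolate only where the fit clears the R² > 0.7 level the paper reports. The gating rule, function names, and checkpoint layout are hypothetical, and the periodic re-grounding step is not modeled; this is not the authors' procedure.

```python
def fit_line(steps, values):
    """Least-squares fit values ~ a + b * step; returns (a, b, r2)."""
    n = len(steps)
    mean_s = sum(steps) / n
    mean_v = sum(values) / n
    sxx = sum((s - mean_s) ** 2 for s in steps)
    if sxx == 0:
        raise ValueError("need at least two distinct training steps")
    sxy = sum((s - mean_s) * (v - mean_v) for s, v in zip(steps, values))
    b = sxy / sxx
    a = mean_v - b * mean_s
    ss_res = sum((v - (a + b * s)) ** 2 for s, v in zip(steps, values))
    ss_tot = sum((v - mean_v) ** 2 for v in values)
    # A constant trajectory is fit exactly, so treat it as perfectly linear.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return a, b, r2


def extrapolate_weights(checkpoints, target_step, min_r2=0.7):
    """checkpoints maps step -> flat list of weights (hypothetical layout).
    Coordinates with a good linear fit are projected to target_step;
    the rest keep their most recent observed value."""
    steps = sorted(checkpoints)
    latest = checkpoints[steps[-1]]
    projected = []
    for i in range(len(latest)):
        a, b, r2 = fit_line(steps, [checkpoints[s][i] for s in steps])
        projected.append(a + b * target_step if r2 >= min_r2 else latest[i])
    return projected
```

The fallback to the latest value is one conservative choice; how the paper handles coordinates outside the linear regime is not stated in the abstract.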

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

The paper builds 11,488 PapersWithCode-grounded idea pairs and trains models to predict, before experiments, which research idea will score higher on a target benchmark; an 8B model reaches 77.1% accuracy after SFT, above GPT-5 at 61.1%, while RLVR reaches 71.35% with interpretable justifications.

#Reasoning #Fine-tuning #Benchmarking #PapersWithCode

why featured

HKR-H/K/R all pass: the hook is AI forecasting research success, with 11,488 pairs and a 77.1% vs 61.1% result. It is a practical arXiv claim, not a model launch, so it fits the 78–84 band.

editor take

An 8B SFT model beating GPT-5 by 16 points smells like the product is research triage, not idea generation.

sharp

The sharp move here is turning “research taste” into supervised pairwise prediction. The dataset has 11,488 PapersWithCode-grounded idea pairs; before any experiment, the model picks which idea scores higher. An 8B model after SFT hits 77.1%, while GPT-5 is listed at 61.1%. If the construction avoids leakage, general models lose the reviewer seat inside research agents. I’m skeptical about benchmark ancestry. PapersWithCode outcomes carry method families, task fashions, and leaderboard habits; SFT may learn which trick usually wins, not which idea is scientifically better. The authors claim time-split transfer, an independent test set, and anti-heuristic ablations, but the RSS snippet does not give those set sizes. RLVR reaching 71.35% with justifications is less convincing than the SFT number; interpretability is the wrapper, accuracy is the blade.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

The paper introduces OPCT and evaluates sycophancy, jailbreaking, and safety awareness across three model families; it reduces sycophancy from 15.4% to 8.1% and keeps jailbreak defense near 99% under an adaptive per-target attacker.

#Alignment #Safety #Fine-tuning #Research release

why featured

HKR-H/K/R all pass: OPCT gives a testable mechanism and numbers across 3 model families, with a clear safety-versus-capability tradeoff. It stays in the 78–84 band because it is a single arXiv paper without independent replication or product adoption.

editor take

OPCT lands because it attacks safety overfitting: 99% jailbreak defense with little capability drag is a cleaner story than another SFT safety patch.

sharp

OPCT’s useful claim is that safety tuning should stop teaching models to recite offline safety pairs. The method computes the consistency objective on the model’s own responses, then supervises with its own outputs under contrastive prompts. That targets behavioral invariance, not surface-form memorization. The numbers fit the claim: sycophancy drops from 15.4% to 8.1%, versus 11.2% for SFT. Under an adaptive per-target attacker, jailbreak defense stays near 99% on held-out behaviors, while SFT averages 87%. The MATH-500 detail is the tell. The abstract says SFT induces a 28-point drop, while OPCT largely avoids capability regressions. Safety papers often bury capability tax in appendix tables; this one puts the tax in the abstract. The gap: the RSS snippet names three model families but not which ones, and it gives no training cost or attacker budget. The 99% result needs the full tables before anyone treats it as deployment-grade.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

The paper evaluates autonomous AI agents with the MIT Beer Game and reports that optimized reasoning models cut costs by up to 67% versus human teams, while fixed demand paths still show decision bullwhip, run-to-run instability, and tail-risk failures that repeated sampling does not meaningfully reduce.

#Agent #Reasoning #Fine-tuning #Carol Xuan Long

why featured

HKR-H/K/R all pass: the paper gives a testable MIT Beer Game setup, a 67% cost-reduction claim, and residual bullwhip risk. It is research-grade signal, not a same-day industry event, so 80 fits the 78–84 band.

editor take

The 67% cost cut is the bait; fixed-demand decision bullwhip is the deployment blocker for supply-chain agents.

sharp

This paper raises the bar for supply-chain agents: beating humans on average cost is insufficient if fixed demand still triggers self-made volatility. In the MIT Beer Game, optimized reasoning models cut costs by up to 67% versus human teams, yet the same demand path still produced run-to-run instability, decision bullwhip, and tail-risk failures. The ugly part is repeated sampling did not meaningfully damp the instability, so “sample more and vote” is not a reliability plan. The GRPO post-training setup uses system-level supply-chain rewards to reduce tail events, which feels closer to deployment machinery than prompt guardrails. Any vendor pitching autonomous planning should be forced through this test before the case study deck.
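For anyone wanting to run a similar check, the textbook bullwhip measure is easy to compute: variance of orders placed divided by variance of incoming demand, alongside a run-to-run spread of total cost on the same demand path. The sketch below uses that standard definition as an assumption; the paper's exact "decision bullwhip" metric is not given in the summary, and the function names are hypothetical.

```python
from statistics import mean, pstdev, pvariance


def bullwhip_ratio(orders, demand):
    """Variance of orders divided by variance of demand.
    Values above 1 mean the agent amplifies demand swings."""
    var_demand = pvariance(demand)
    if var_demand == 0:
        raise ValueError("demand has zero variance; ratio is undefined")
    return pvariance(orders) / var_demand


def run_to_run_summary(total_costs):
    """Mean, spread, and worst case of total cost over repeated runs
    that all faced the same demand path."""
    return {
        "mean": mean(total_costs),
        "std": pstdev(total_costs),
        "worst": max(total_costs),
    }
```

If the worst case stays far above the mean after many repeated runs, averaging or voting over samples has not fixed the tail, which is the failure the paper flags.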

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

The paper argues AI text detectors amplify a pretrained typicality axis rather than learn an AI-vs-human boundary, with raw centroid projection reaching AUROC 0.806/0.944/0.834 across three architectures and inverting on non-native ESL writing at AUROC 0.06-0.20.

#Fine-tuning #Benchmarking #Interpretability #RoBERTa

why featured

HKR-H/K/R all pass: the angle is counterintuitive, the paper gives testable AUROC and ESL-failure numbers, and the fairness risk is practitioner-relevant. As a single arXiv paper, it fits the 78–84 quality band, not same-day must-write.

editor take

This paper makes AI-text detectors look like typicality meters: ESL AUROC at 0.06-0.20 is not noise, it is product-grade false accusation risk.

sharp

AI-text detection takes a hard hit here: the paper says detectors learn a pretrained typicality direction, not an AI-versus-human boundary. The evidence is unusually concrete: raw centroid projection gets NYT-vs-HC3 AUROC of 0.806/0.944/0.834 across three architectures, and RoBERTa-base beats full fine-tuning. A 24-example frozen probe reaches 0.900 versus 0.895 for full FT. The ugly part is the ESL inversion: AUROC falls to 0.06-0.20 on non-native writing. That turns the “catch AI cheating” pitch into a proxy for who writes less like native formal English. OpenAI’s old detector retreat looks less like caution and more like pattern recognition. Any vendor still selling aggregate AUROC here should publish ESL false-positive rates before asking for institutional trust.
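One plausible reading of "raw centroid projection", sketched under assumptions: take the difference between the mean AI embedding and the mean human embedding, project every text onto that direction, and score the ranking with AUROC. The toy vectors and function names are hypothetical, and the paper's exact procedure is not spelled out in the summary.

```python
def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_direction(ai_vecs, human_vecs):
    """Direction pointing from the human centroid to the AI centroid."""
    return [a - h for a, h in zip(centroid(ai_vecs), centroid(human_vecs))]


def project(vec, direction):
    """Dot product of an embedding with the centroid direction."""
    return sum(v * d for v, d in zip(vec, direction))


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative,
    counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUROC below 0.5, like the 0.06-0.20 reported on ESL writing, means the projection ranks the two classes in the wrong order rather than merely failing to separate them.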

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Boundary-targeted Membership Inference Attacks on Safety Classifiers

The paper introduces a boundary-targeted selection strategy for membership inference attacks and recovers 19% of distress-flagged conversations at a 5% false-positive rate, 3.5 times more than state-of-the-art MIA methods alone.

#Safety #Fine-tuning #Research release #Safety/alignment

why featured

HKR-H/K/R all pass: the paper ties membership inference to safety classifiers with concrete recovery numbers and sensitive user chats. It remains a single arXiv research item, so it fits the 78–84 quality-research band, not P1.

editor take

Safety classifiers leak exactly where users are most exposed: 19% of distress-flagged conversations recovered at 5% FPR, and content filtering misses the boundary cases.

sharp

Safety classifiers are becoming privacy amplifiers for the users they claim to protect. Hughes et al. target low-confidence boundary examples in an emotional-support detector; at 5% false-positive rate, they recover 19% of distress-flagged training conversations, 3.5x more than state-of-the-art MIA alone. That cuts straight into the usual safety-data instinct: collect more self-harm and mental-health conversations, then fine-tune a better gatekeeper. The paper’s mechanism is nasty because the classifier leans on memorization when labels are ambiguous near the decision boundary. Content filtering fails on those examples, while noise strategies help. For teams shipping safety classifiers, “we removed sensitive strings” is not a privacy story anymore.
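The "19% at 5% false-positive rate" figure is a true-positive rate at a fixed FPR, the usual way membership inference is scored. A minimal sketch of that evaluation follows, assuming each conversation gets a scalar attack score where higher means "more likely a training member"; the function name and scores are hypothetical, and the boundary-targeted selection step itself is not implemented.

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Best true-positive rate over all thresholds whose false-positive
    rate stays at or below max_fpr. A sample is flagged as a member
    when its score is >= the threshold."""
    best = 0.0
    for threshold in sorted(set(member_scores) | set(nonmember_scores)):
        false_pos = sum(s >= threshold for s in nonmember_scores)
        if false_pos / len(nonmember_scores) <= max_fpr:
            true_pos = sum(s >= threshold for s in member_scores)
            best = max(best, true_pos / len(member_scores))
    return best
```

Reporting the rate at a low fixed FPR, not average accuracy, matters because an attacker only needs to be confident about a small slice of users.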

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

The paper proposes Zero-CoT Probe, a black-box detection method that truncates the full CoT process and compares zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset, using Contamination Confidence to quantify both contamination likelihood and severity.

#Reasoning #Benchmarking #Safety #Research release

why featured

HKR-H/K/R all pass: the title has a controversy hook, the summary gives a test mechanism and metric, and benchmark contamination matters to practitioners. Single arXiv paper with limited external validation keeps it in the 78–84 band.

editor take

Zero-CoT Probe hits a leaderboard sore spot: polished CoT can hide memorization rather than prove reasoning.

sharp

Zero-CoT Probe makes a sharp bet: remove the full chain-of-thought, and memorized shortcuts become easier to expose. The method compares zero-CoT accuracy on the original benchmark against an isomorphically perturbed reference set, then scores contamination with Contamination Confidence. That targets paraphrased benchmark leakage better than plain n-gram or nearest-neighbor checks. I buy the direction, but not as a silver bullet. The abstract claims tests on known contaminated models and specially fine-tuned contaminated models, but it does not disclose model names, benchmark count, or false-positive rates here. After MMLU and GSM8K became training-set folklore, contamination stopped being only exact-item overlap. The live issue is exposure to equivalent problem structure, and ZCP at least gives evaluators a reproducible black-box lever.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Check Your LLM's Secret Dictionary: Five Lines of Code Reveal What Your LLM Learned

The paper applies five lines of PyTorch SVD to the lm_head weight matrix, with no inference, and analyzes GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B to expose semantic subspaces, training-data signals, ethically concerning vocabulary clusters, and a WPS-detected glitch token in GPT-OSS-120B.

#Interpretability #Safety #Benchmarking #GPT-OSS-120B

why featured

HKR-H/K/R all pass: the paper offers a catchy no-inference audit hook, a concrete SVD-on-lm_head method, and safety-audit resonance. It remains a single arXiv paper without cross-source traction, so it fits the 78–84 band.

editor take

Five-line SVD finding a GPT-OSS-120B glitch token is a bad look for any lab claiming post-training cleaned the model.

sharp

The sharp part here is not interpretability theater; it pushes safety auditing down to the lm_head weights. The paper runs SVD on the output head only, with no inference, then finds semantic clusters across GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B. WPS also recovers GPT-OSS-120B’s shokubutsu-hyakka-tsu glitch token, ID 137606. That cuts against the usual alignment story. The base-instruct comparison says ethically concerning vocabulary subspaces come from pretraining and survive post-training. This bypasses red-team prompt choice, sampling settings, and refusal-template noise. My caveat is simple: three models is thin, and VCS/WPS needs independent replication. But as a pre-release static check, five lines of PyTorch is too cheap for labs to ignore.
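The paper's own five lines are not reproduced in the summary, so what follows is a stand-in sketch of the idea: decompose the output-head matrix and list the tokens that load most heavily on each leading singular component. It uses NumPy rather than PyTorch, a toy matrix rather than a real `lm_head`, and a hypothetical function name; VCS and WPS are not implemented.

```python
import numpy as np


def top_tokens_per_component(lm_head, vocab, n_components=2, top_k=2):
    """lm_head: array of shape (vocab_size, hidden_dim).
    vocab: list of token strings, one per row of lm_head.
    Returns one (singular_value, tokens) pair per leading component,
    where tokens are the rows with the largest absolute loading."""
    u, s, _ = np.linalg.svd(lm_head, full_matrices=False)
    result = []
    for k in range(n_components):
        # Rank rows (tokens) by how strongly they load on component k.
        idx = np.argsort(-np.abs(u[:, k]))[:top_k]
        result.append((float(s[k]), [vocab[i] for i in idx]))
    return result
```

On a real model the matrix would be the output projection's weight with one row per vocabulary entry; no forward pass is needed, which is what makes the check cheap.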

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Robust Reasoning Benchmark

Robust Reasoning Benchmark applies 13 deterministic textual perturbations to AIME 2024 and AIME 2025 and evaluates 8 state-of-the-art models; open-weight reasoning models show average accuracy drops of up to 54%, with some perturbations causing up to 100% drops.

#Reasoning #Benchmarking #Interpretability #Claude

why featured

HKR-H/K/R all pass: AIME perturbations, 8 models, and a 54% drop give testable substance and a clear robustness nerve. As a single arXiv benchmark, it fits the 78–84 band rather than a must-write release.

editor take

AIME scores are a thin shield: 13 text perturbations cut open-weight reasoning accuracy by up to 54%, and Claude’s refusal behavior looks ugly here.

sharp

RRB separates “can solve AIME” from “can reason robustly.” The setup is blunt: apply 13 deterministic text perturbations to AIME 2024 and AIME 2025, then test 8 models. Frontier closed models mostly survive, but Claude refuses many transformed prompts. Open-weight reasoning models lose up to 54% average accuracy, with some perturbations causing 100% drops. The sharper finding is Intra-Query Attention Dilution. Open-weight models from 7B to 120B degrade when solving multiple independent math problems sequentially in one context window. That is not just messy prompting; the model’s own chain-of-thought pollutes later reasoning under dense attention. After the DeepSeek-style push toward long visible CoT, this failure mode becomes an architecture tax, not a benchmark quirk.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

The researchers trained quadrotor racing agents with multi-agent reinforcement learning and league-based self-play, beating a champion-level human pilot in multiplayer races above 22 m/s while reducing collision rates by 50% versus state-of-the-art single-agent baselines.

#Robotics #Agent #Safety #University of Zurich

why featured

HKR-H/K/R all pass: champion-level drone racing is clickable, >22 m/s and 50% fewer collisions are concrete, and safe robotic control resonates. No hard exclusion; as an arXiv robotics paper, it sits below major model or product launches.

editor take

Quadrotors beating a champion pilot above 22 m/s is real robotics signal; the safety claim still lives inside a racing-shaped sandbox.

sharp

This paper pulls multi-agent RL out of simulation theater and into a fast physical setting. The hard hook is good: quadrotors beat a champion-level human pilot in multiplayer races above 22 m/s, while cutting collisions by 50% versus single-agent baselines. The useful part is not the “superhuman” label. It is the training distribution: league self-play, variable racer counts, overtaking, downwash, and anticipatory avoidance all become first-class training pressure. I would still discount the safety framing. Drone racing has a closed track, crisp reward, short horizon, and controlled actors. That is far from warehouse robots, sidewalk delivery, or home robotics. This reads like a serious multiplayer extension of UZH’s earlier autonomous drone racing line: strong evidence that other agents should be modeled as agents, not noise; thin evidence that the recipe transfers cleanly outside the racecourse.

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM Serving

WarmServe preloads parameters from multiple models based on workload forecasts; on real-world datasets, it reduces tail TTFT by up to 50.8x versus an autoscaling baseline and supports up to 2.5x higher request throughput than a GPU-sharing system.

#Inference-opt #WarmServe #Research release

why featured

HKR-H/K/R all pass: the hook is one-for-many prewarming, with tail TTFT up to 50.8x lower and up to 2.5x higher throughput. As a single arXiv systems paper, it stays in the 78–84 band pending reproduction.

editor take

WarmServe moves multi-LLM serving from reactive scheduling to forecasted memory staging; 50.8x tail TTFT is huge if your traffic is actually periodic.

sharp

WarmServe makes a clean bet: multi-LLM serving is losing time to dumb cold starts, not just scarce GPUs. It forecasts demand, preloads weights for multiple models, reuses idle KV-cache space for prewarming, and reports tail TTFT up to 50.8x lower than an autoscaling baseline. Throughput reaches up to 2.5x that of a GPU-sharing system on real-world traces. I buy the direction more than the headline number. Production traffic has daily cycles and event spikes, and stacks around vLLM or Ray Serve still suffer when a burst forces model loading. The dangerous part is forecast error. A wrong prewarm burns HBM, squeezes KV cache, and can hurt live instances. The abstract does not give the degradation curve under bad forecasts, and that curve decides whether WarmServe is a paper win or an ops-safe serving primitive.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

The paper proves that no feature ranking is simultaneously faithful, stable, and complete under collinearity, verifies the result with 305 Lean 4 theorems, and reports attribution instability in 68% of 77 public datasets.

#Interpretability#Benchmarking#Safety#arXiv

why featured

HKR-H/K/R all pass: the title makes a strong impossibility claim, with formal Lean 4 validation and dataset prevalence. It is theory-heavy, so it stays below P1, but no hard-exclusion rule is triggered.

editor take

Collinearity turns SHAP rankings into coin flips; 305 Lean 4 theorems make that failure harder to hand-wave away than another XAI dashboard.

sharp

This paper does more than dunk on SHAP; it cuts into the business model of ranked explanations. Under collinearity, no feature ranking is faithful, stable, and complete at once. The concrete hooks are ugly: 68% of 77 public datasets show attribution instability, the attribution ratio blows up as 1/(1-rho^2) for gradient boosting, and Lasso goes infinite. The authors also machine-check the proof with 305 Lean 4 theorems from 16 axioms and 0 sorry, which makes the usual “edge case” dismissal harder. DASH is the sober answer: keep stability by reporting ties for symmetric features, not by forcing a fake top-1. Fairness teams should be nervous. A lot of SHAP-based proxy discrimination audits are legal-looking documents wrapped around unstable rankings when correlated proxies enter the table.
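
To get a feel for the 1/(1-rho^2) blow-up quoted above, the factor can be tabulated for a few correlation levels. This is a minimal sketch that only evaluates the expression from the summary; the function name `blowup_factor` and the chosen rho values are illustrative assumptions, not the paper's code or experimental setup.

```python
def blowup_factor(rho: float) -> float:
    """Evaluate 1 / (1 - rho^2), the growth rate quoted in the summary.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)


if __name__ == "__main__":
    # Illustrative correlation levels between two features.
    for rho in (0.0, 0.5, 0.9, 0.99):
        print(f"rho={rho:<4} factor={blowup_factor(rho):.2f}")
```

The factor is already above 5 at rho = 0.9 and passes 50 at rho = 0.99, which is why strongly correlated proxies are the case to worry about.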

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

The paper proposes intelligence per watt as task accuracy per unit of power, evaluating 20+ local LMs, 8 accelerators, and 1M real-world single-turn queries; local models answered 88.7% successfully, while IPW improved 5.3x from 2023 to 2025 and locally serviceable query coverage rose from 23.2% to 71.3%.

#Inference-opt#Benchmarking#arXiv#Apple

why featured

HKR-H/K/R all pass: the energy-efficiency hook is clear, with 88.7% accuracy and a 5.3x IPW gain from 2023 to 2025. As a single arXiv benchmark, it fits the 78–84 band, below major model releases.

editor take

IPW drags local inference from privacy pitch to power accounting; 88.7% looks strong, but “win rate against frontier models” needs a hard audit.

sharp

IPW is a useful paper because it forces local models to pay two bills at once: accuracy and power. The setup is unusually concrete: 20+ local LMs, 8 accelerator types, and 1M single-turn real-world queries. The headline numbers are strong: 88.7% successful answers, 5.3x IPW gain from 2023 to 2025, and locally serviceable coverage moving from 23.2% to 71.3%. I don’t fully buy the metric until the judging details are audited. “Accuracy” here is a local-LM win rate against frontier models, so the evaluator, domain mix, and failure labels can swing the result. The spicy bit is hardware: local accelerators report at least 1.4x lower IPW than cloud accelerators on identical models. That is a direct shot at TOPS marketing from Apple-class laptop silicon vendors.
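
As the summary describes it, the metric is task accuracy per unit of power, which reduces to a single division. A minimal sketch under that reading; the function name and the sample numbers are hypothetical, and the paper's measurement protocol (how power is sampled, how accuracy is judged) is not given in the feed.

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Accuracy (a fraction in [0, 1]) divided by average power draw in watts."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if avg_power_watts <= 0.0:
        raise ValueError("power must be positive")
    return accuracy / avg_power_watts


# Hypothetical devices running the same model; NOT measurements from the paper.
local_ipw = intelligence_per_watt(accuracy=0.80, avg_power_watts=40.0)
cloud_ipw = intelligence_per_watt(accuracy=0.80, avg_power_watts=25.0)
ratio = cloud_ipw / local_ipw  # how many times higher the cloud IPW is
```

With accuracy held fixed, the ratio depends only on power draw, so the judging caveat above matters exactly when accuracy differs between the setups being compared.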

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→LEMUR: Learned Multi-Vector Retrieval

LEMUR reduces multi-vector similarity search to single-vector indexed search by learning MaxSim with a one-hidden-layer neural network, and the paper reports an order-of-magnitude speedup over prior multi-vector search methods with code released on GitHub.

#RAG#Embedding#Inference-opt#LEMUR

why featured

HKR-H/K/R all pass, but this is an arXiv retrieval paper rather than a major model or product launch. The single-vector approximation of multi-vector MaxSim with 10x speedup supports a 78 featured score.

editor take

LEMUR turns ColBERT-style MaxSim into single-vector search; that is useful plumbing, but the 10x speed claim lives or dies on recall loss.

sharp

LEMUR hits the old multi-vector retrieval tradeoff: ColBERT-style quality, MaxSim latency. Its move is clean: train a one-hidden-layer network to approximate MaxSim, then reduce inference to single-vector similarity search in a latent space. That lets existing ANN indexes carry part of late interaction’s benefit. The paper claims an order-of-magnitude speedup over prior multi-vector methods, releases code, and is accepted to ICML 2026. I would stress-test two things before buying it for production RAG: recall drop on out-of-domain queries, and whether the latent single-vector step washes out token-level evidence in long documents. A 10x retrieval speedup is great only if the missing top-k passages are not the ones the generator needed.
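
For reference, the MaxSim score that LEMUR learns to approximate is the standard late-interaction score from ColBERT-style retrieval: for each query token, take the best-matching document token, then sum. Below is a minimal pure-Python sketch of that exact score. It is not LEMUR's learned approximation, whose details beyond "one hidden layer" are not in the feed, and the toy vectors are made up.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum over query tokens of the maximum dot product with any document token."""
    if not doc_tokens:
        raise ValueError("document must contain at least one token vector")
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy example: two query tokens, two document tokens (illustrative values).
score = maxsim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
```

Exact scoring costs one dot product per query-token and document-token pair for every candidate document, which is the latency that a single-vector reduction tries to avoid.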

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→CacheClip: Accelerating RAG with Effective KV Cache Reuse

CacheClip uses a small auxiliary LLM to select tokens for KV cache recomputation; at recomp=20%, it retains 85.2% of full-attention performance on NIAH and 91.1% on LongBench, while accelerating LLM prefill by up to 3.33x.

#RAG#Inference-opt#Benchmarking#CacheClip

why featured

HKR-H/K/R all pass: clear speedup number, testable cache-reuse mechanism, and direct RAG cost relevance. As a single arXiv paper needing reproduction, it fits 78 featured, not P1.

editor take

CacheClip attacks RAG latency at the KV layer: 20% recompute for 3.33× prefill speedup beats stapling on another reranker.

sharp

CacheClip makes the right call: RAG TTFT is not a prefix-cache problem once retrieved chunks vary. The paper uses a small auxiliary LLM to choose tokens for KV recomputation; at recomp=20%, it retains 85.2% of full-attention performance on NIAH and 91.1% on LongBench, with up to 3.33× prefill speedup. I buy the diagnosis more than the “practical solution” framing. The gains over APE and CacheBlend are real: +16.1 and +12.8 points on NIAH, plus +4.5 and +4.2 on LongBench. But the abstract does not give target model size, retrieval length, or CPU-side throughput limits. A CPU-GPU hybrid setup sounds cheap until production scheduling turns the auxiliary model into a new tail-latency source.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

Text2Opt-Bench tests 10+ models across 12 optimization categories and finds accuracy drops as instance data grows; BIND externalizes numeric data into structured files, raising GPT-5-Nano accuracy from 59.1% to 82.4% and GPT-5 from 86.2% to 95.8%.

#Reasoning#Code#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: the paper names a crisp failure mode, gives benchmark scale plus a reproducible BIND mechanism, and touches LLM reliability in structured numeric work. Single arXiv paper, so it stays at 78 rather than same-day must-write.

editor take

BIND lifts GPT-5-Nano from 59.1% to 82.4%; the lesson is brutal: stop asking LLMs to be spreadsheet clerks.

sharp

The sharp claim here is that “reasoning failure” is often plain data-binding failure. Text2Opt-Bench covers 12 optimization categories and 10+ models, and accuracy falls as instance data grows. BIND changes one interface: numeric data moves into structured files, so the model binds coefficients and indices programmatically. GPT-5-Nano jumps from 59.1% to 82.4%; GPT-5 moves from 86.2% to 95.8%. I buy this more than another agent benchmark. A lot of failed tool workflows are not bad planning; they are copied IDs, swapped columns, and mangled constraints. The annoying number is pass@5 at 82.0% versus BIND pass@1 at 82.4%, with lower token cost. For production AI, sampling harder is a tax you pay when the interface is wrong.
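
The interface change can be illustrated with a toy example: instead of retyping coefficients into generated code, the program reads them from a structured payload by key. This is a hedged sketch only; the JSON layout, the field names (`items`, `capacity`), and the helper `evaluate_selection` are assumptions made for illustration, and the feed does not describe BIND's actual file format.

```python
import json

# Hypothetical instance data, as it might be externalized to a structured file.
INSTANCE_JSON = """
{
  "capacity": 10,
  "items": [
    {"id": "a", "weight": 4, "value": 7},
    {"id": "b", "weight": 5, "value": 8},
    {"id": "c", "weight": 6, "value": 9}
  ]
}
"""


def evaluate_selection(instance: dict, chosen_ids: list[str]) -> tuple[int, bool]:
    """Return (total value, feasible?) for a knapsack-style selection.

    Coefficients are looked up by id, never copied by hand.
    """
    by_id = {item["id"]: item for item in instance["items"]}
    total_weight = sum(by_id[i]["weight"] for i in chosen_ids)
    total_value = sum(by_id[i]["value"] for i in chosen_ids)
    return total_value, total_weight <= instance["capacity"]


instance = json.loads(INSTANCE_JSON)
```

Because every number is fetched by key, a swapped column or mistyped coefficient becomes a lookup error or a data bug in one place, not a silent transcription slip inside the model's output.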

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Towards Real-world Human Behavior Simulation: Benchmarking LLMs on Long-horizon, Cross-scenario Behavior Traces

The paper introduces OmniBehavior, a real-world-data benchmark for LLM user simulation across long-horizon, cross-scenario, heterogeneous behavior traces; its evaluations report that current models plateau as context windows expand and converge toward a “positive average person” with hyper-activity, persona homogenization, and utopian bias.

#Agent#Benchmarking#Memory#OmniBehavior

why featured

HKR-H/K/R all pass, but the body only gives the benchmark name and two findings; dataset size, model list, and code status are not disclosed. Single arXiv benchmark, so featured lower band.

editor take

OmniBehavior punctures the long-context user-sim story: the models drift into a busy, cheerful, personality-flattened average human.

sharp

OmniBehavior’s sharpest result is not weak scores; it is the plateau after adding more context. The benchmark uses real-world long-horizon, cross-scenario, heterogeneous behavior traces, which hits the lazy assumption behind many user simulators: feed the model more history and it will recover the person. I care more about the “positive average person” bias. The paper says models become hyper-active, persona-homogenized, and utopian, losing long-tail behavior and individual differences. That is poison for agent startups using LLMs as user sandboxes for growth tests, product simulation, or workflow rehearsal. It will overestimate cooperative users and underrepresent lazy, inconsistent, annoyed humans. The scraped body does not expose the model list or score table, so I would treat this as a benchmark direction first, not a leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→How does Chain of Thought decompose complex tasks?

The paper models CoT as a tree-structured decomposition of classification tasks, showing error scales as a power law with the number of classes; it identifies a degree threshold where deeper reasoning hurts below the threshold and has an optimal depth above it.

#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is an arXiv theory paper rather than a model or product release. The testable CoT-depth threshold makes it featured, not must-write.

editor take

CoT length is not free compute magic; this paper turns “think longer” into a degree threshold, which undercuts token-stuffed reasoning demos.

sharp

The sharp claim here is that CoT only helps when the decomposition has enough branching quality. The paper models a task as multiclass classification, with error scaling as a power law in class count. CoT becomes a fixed-degree tree. Below a critical degree, deeper reasoning hurts accuracy. Above it, there is an optimal depth, and more steps cannot beat the minimum error. That lands directly against the “longer reasoning equals better reasoning” story vendors have been selling through test-time compute. OpenAI and Anthropic have both leaned on longer visible or hidden reasoning traces as a capability signal. This paper says the split itself matters before the length does. The body gives theory, not concrete LLM benchmark numbers, so I would not treat it as leaderboard evidence. I would use it as a stress test for CoT eval design.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→DocAtlas: Multilingual Document Understanding Across 80+ Languages

DocAtlas builds OCR datasets and benchmarks across 82 languages and 9 evaluation tasks, evaluates 16 state-of-the-art models, and reports persistent gaps for low-resource scripts; DPO with rendering-derived ground truth improves in-domain accuracy by 1.9% and out-of-domain accuracy by 1.8%, while supervised fine-tuning degrades out-of-domain performance by up to 21%.

#Vision#Multimodal#Fine-tuning#DocAtlas

why featured

HKR-H/K/R pass, but this is a single arXiv benchmark paper whose impact depends on dataset quality and adoption; scored in the low featured band for a solid research release.

editor take

DocAtlas is sharp because it avoids model-made labels; low-resource OCR gaps won’t be fixed by model scale alone.

sharp

DocAtlas is a useful slap at multilingual OCR optimism: 82 languages, 9 tasks, and 16 models still leave low-resource scripts behind. The strongest design choice is not the benchmark size. It is the annotation path: DOCX differential rendering plus LaTeX synthesis for right-to-left scripts, with DocTag ground truth generated without learned models in the core loop. The numbers make the paper less flashy and more credible. DPO adds only +1.9% in-domain and +1.8% out-of-domain, while supervised fine-tuning drops out-of-domain performance by up to 21%. DocAtlas-DeepSeek beats the strongest baseline by just +1.7%. I read that as a warning for document AI teams: cleaner preference signals beat another round of supervised data stuffing when scripts and layouts move off the high-resource happy path.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

P2D updates only 10% of attention heads on 10% of the data, improves performance by 8.3 percentage points over strong baselines, and delivers a 7.0x end-to-end time speedup.

#Fine-tuning#Alignment#Inference-opt#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with no disclosed code, replication, or lab adoption. Score stays in the practical research band above the featured threshold.

editor take

P2D’s bet is that data selection and PEFT are the same control loop; if AER holds up, small labs waste far fewer GPU hours on alignment.

sharp

P2D’s sharp move is making parameters drive data selection, not just shrinking trainable weights. It updates 10% of attention heads on 10% of the data, then claims an 8.3 pp gain over strong baselines and a 7.0x end-to-end speedup. That matters more than another LoRA variant, because real fine-tuning waste often sits in bad samples and repeated trial runs. I’m not fully sold on the Strong Map Hypothesis yet. Attention heads acting as stable task keys sounds fragile across model families and domains. The snippet gives no model sizes, task suite, baseline names, or exact AER accounting. If this only works on a narrow alignment benchmark, it is a clever paper trick. If it transfers to coding, support, or medical domain tuning, it belongs in production fine-tuning stacks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→AutoBaxBuilder: Bootstrapping Code Security Benchmarking

AutoBaxBuilder generates code security benchmark tasks from scratch in under 2 hours for less than USD 4, and the workflow with manual verification reduces human effort for benchmark construction by 12x.

#Code#Benchmarking#Safety#AutoBaxBuilder

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark-building paper, narrower than a major model or product release. Concrete cost and labor claims put it just above the featured threshold.

editor take

Security benchmarks are getting factory-made; under $4 and two hours attacks the scarcity premium of expert-built evals first.

sharp

AutoBaxBuilder turns code-security eval construction into a pipeline, which is useful and dangerous in the same breath. The hard numbers are unusually clean: new tasks in under two hours, under $4, and 12x less human effort after manual verification. It also builds functional tests plus end-to-end security-probing exploits, so this is not just synthetic trivia generation. The catch is model-shaped blind spots. LLM-made benchmarks help escape stale, contaminated expert sets, but they can also preserve the failure modes of the model doing the generation. SWE-bench already showed how fast a benchmark becomes a product KPI once vendors can optimize against it. For security evals, the status marker moves from who wrote the tasks to who verified the exploit and proved it bites real code.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Harnesses for Inference-Time Alignment over Execution Trajectories

The paper splits harness design into task decomposition and guided execution, then validates failure modes such as over-decomposition, over-pruning, and hallucinated execution through controlled synthetic experiments and real terminal-agent benchmarks.

#Agent#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the paper frames terminal-agent failures as harness-design failures, with two mechanisms and three tested failure modes. Single arXiv source and no disclosed code or large-scale metrics keep it near the featured threshold.

editor take

This paper punctures the harness hype: more workflow scaffolding can lower pass rate through over-decomposition and over-pruning.

sharp

Agent harnesses do not get safer just by adding more structure, and this paper pushes directly against that engineering instinct. It splits harness design into task decomposition and guided execution, then ties performance limits to workflow granularity, retry budgets, and guidance-induced action reweighting. The sharp hook is partial harnessing: specify only the initial steps, then let the agent run, and it can beat a fully structured workflow. That matches the messy reality of terminal agents. Teams keep stacking planners, checklists, validators, and retry loops around Devin-style tasks, then wonder why pass rate stalls. The snippet gives no concrete benchmark numbers, so I would not overclaim the result size. But the failure labels are useful: over-decomposition, over-pruning, and hallucinated execution name the exact places where scaffolding stops being alignment and starts choking the search.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→X-SYNTH: Beyond Retrieval — Enterprise Context Synthesis from Observed Digital Human Attention

X-SYNTH synthesizes enterprise context from observed digital human attention using seven attention filters and a four-stage pipeline. In the lead-generation task, it raises True Lead Rate from 9.5% to 61.9% and cuts False Lead Rate from 90.5% to 18.8%.

#Agent#RAG#Benchmarking#Guruprasad Raghavan

why featured

HKR-H/K/R all pass: the paper has a novel attention-over-retrieval angle, concrete TLR/FLR numbers, and an enterprise RAG pain point. Single arXiv source with no disclosed open-source artifact keeps it near the lower featured band.

editor take

X-SYNTH lifts lead TLR from 9.5% to 61.9%, but the sharp question is governance: who can collect worker attention traces, audit them, and prevent looped bias?

sharp

X-SYNTH’s strong claim is not “beyond retrieval”; it turns employee behavior logs into implicit labels for agents. The paper gives a hard delta: a frontier model alone gets 9.5% True Lead Rate and 90.5% False Lead Rate on lead generation. With seven attention filters and a four-stage pipeline, TLR reaches 61.9% and FLR drops to 18.8%. That is too large to dismiss as another RAG wrapper. I don’t buy the clean framing around “digital human attention” as reliable ground truth. CRM touches, email patterns, Slack trails, and browsing sequences capture observable work, not decision quality. If the system clones top-seller behavior, it also clones territory bias, account-selection bias, and whatever the org already rewards. Compared with vanilla enterprise RAG, X-SYNTH looks closer to behavioral cloning for enterprise agents. The benchmark is impressive; the governance surface is ugly.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Discovering Implicit Large Language Model Alignment Objectives

The paper introduces Obj-Disco, a framework that uses an iterative greedy algorithm to decompose alignment reward signals into sparse weighted natural-language objectives, and experiments on open-source reward models report that it captures more than 90% of reward behavior.

#Alignment#Safety#Interpretability#Obj-Disco

why featured

HKR-H/K/R all pass: the paper has a black-box alignment hook, a concrete Obj-Disco mechanism, and a >90% explanation claim. As a single arXiv paper without lab authority or cross-source pickup, it stays just above the featured threshold.

editor take

Obj-Disco’s >90% reward-behavior coverage is useful, but don’t confuse explaining a reward model with proving alignment.

sharp

Obj-Disco hits the old RLHF wound: teams still guess what their reward models reward after training. It uses an iterative greedy algorithm over behavioral changes across checkpoints, then decomposes the signal into sparse weighted natural-language objectives. The reported hook is strong: popular open-source reward models show over 90% reward-behavior coverage, with human evaluation backing it. I don’t buy the full “safety tool” framing. It explains the reward signal, not every exploit a downstream policy finds in long-horizon agent settings. The abstract gives no closed production RM results, no online RLHF run, and no tool-use deployment evidence. Compared with Anthropic/OpenAI-style behavior evals, this is a reward-audit layer: useful for finding bad incentives, weak as a deployment green light.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→General Agentic Planning Through Simulative Reasoning with World Models

The paper introduces SiRA, a model-agnostic planning architecture that uses an LLM-based world model and natural-language belief states, and reports up to 124% higher task completion rates than a matched reactive baseline across constrained navigation, multi-hop information aggregation, and general instruction-following browser tasks.

#Agent#Reasoning#SiRA#Research release

why featured

HKR-H/K/R all pass: the paper has a concrete agent-planning mechanism, three browser-task settings, and a +124% completion-rate claim. Single arXiv source with no disclosed artifact or cluster keeps it below P1.

editor take

SiRA makes planning feel sane again: simulate before acting. The 124% gain is loud, but browser tasks still flatter world models.

sharp

I buy half of SiRA’s claim: an LLM world model with natural-language belief states is a cleaner planning primitive than another reactive browser-agent loop. The paper reports three browser task families, up to 124% higher completion than a matched reactive baseline, and constrained navigation moving from 0% to 32.2%. That is a real hook, not just agent wallpaper. But I would not read this as open-web agency being cracked. The tasks are constrained navigation, multi-hop information aggregation, and instruction following; the environment boundaries are still doing work. WebArena-style mess (login state, flaky UI, async page changes, tool failures) punishes bad future-state simulation fast. SiRA is a good architecture to reproduce, not a license to claim general agents.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→DecepChain: Inducing Deceptive Reasoning in Large Language Models

DecepChain induces LLMs to produce coherent but wrong chain-of-thought reasoning by fine-tuning on naturally erroneous rollouts and applying GRPO with flipped rewards on triggered inputs; the arXiv snippet says it works across multiple models and benchmarks, but does not disclose exact scores.

#Reasoning#Fine-tuning#Safety#DecepChain

why featured

HKR-H/K/R all pass: the paper has a sharp deception hook, a concrete training mechanism, and a CoT-safety nerve. The score stays below 78 because the feed gives no benchmark scores, model list, or reproducible detail.

editor take

DecepChain is a clean warning shot: CoT-as-audit-log collapses once training can make wrong reasoning look normal.

sharp

DecepChain hits the trust interface of reasoning products, not just another jailbreak lane. It fine-tunes on the model’s own erroneous rollouts, then uses GRPO with flipped rewards on triggered inputs. The target is not noisy hallucination. It is fluent, benign-looking CoT that lands on the wrong answer. The snippet claims results across multiple models and benchmarks with minimal benign degradation, but gives no exact scores. That is nastier than ordinary hallucination because it attacks the review habit every AI team has adopted: read the chain, trust the process. OpenAI and Anthropic have spent the year making long reasoning traces feel inspectable. DecepChain is the reminder that readable CoT is not evidence; it is another output surface the optimizer can learn to counterfeit.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→OPPO Introduces Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

OPPO estimates per-position success probability with one extra forward pass and no learned value network, improving over GRPO, DAPO, and SDPO on two base LLMs across seven reasoning benchmarks by up to 6.0 points on AMC’23 and 5.2 points on AIME’24.

#Reasoning#Fine-tuning#Benchmarking#OPPO

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper, not a product or major-lab release. One extra forward pass for up to +6.0 points puts it at the featured threshold.

editor take

OPPO attacks the right bottleneck: one extra forward pass for +6.0 points makes GRPO’s trajectory-level advantage look blunt.

sharp

OPPO hits a live nerve: RLVR needs better token credit assignment more than it needs more rollout theater. GRPO gives every token the same trajectory-level advantage; OPPO uses one extra forward pass to estimate success probability at each position, with no learned value network and no extra rollouts. The reported gains are concrete: two base LLMs, seven reasoning benchmarks, up to +6.0 on AMC’23 and +5.2 on AIME’24 over GRPO, DAPO, and SDPO. I have one caveat: the abstract gives peak gains, not average lift, training budget, or the base model names. Still, the mechanism is plausible. If the claimed monotonic widening of the gain with response length survives replication, OPPO lands right on the math/code failure mode where long chains turn GRPO’s advantage signal into noise.
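
The contrast with GRPO is easiest to see in code. The sketch below shows the group-normalized, trajectory-level advantage commonly used in GRPO and how it is copied to every token of a response. It does not implement OPPO's per-position estimate, which the abstract does not specify in enough detail; the use of the population standard deviation and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages: list[float], lengths: list[int]) -> list[list[float]]:
    """Give every token in a response the same trajectory-level advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Four rollouts for one prompt with binary verifier rewards (illustrative).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
per_token = broadcast_to_tokens(adv, lengths=[3, 2, 4, 1])
```

Every token in a rollout receives an identical value, so a long response gets no signal about which step mattered. That is the gap a per-position success estimate is meant to fill.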

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

The paper tests a Transformer on base-digit extraction; across 3 seeds, it reaches 99.83% exact-answer accuracy on held-out number-base intersections, while causal analysis finds the localized route does not transmit the probe-decodable closed-form intermediates to the output stream.

#Interpretability#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the paper gives 99.83% plus a causal-path result, and it challenges probe-based explanations. It stays in the 72–77 band because the work is mechanistic-interpretability-heavy, not a broad safety event.

editor take

Probe-friendly evidence takes another hit: 99.83% accuracy and decodable intermediates still do not prove the Transformer runs your algorithm.

sharp

This paper lands because it attacks the comfortable probe story in a clean arithmetic sandbox. The Transformer hits 99.83% exact-answer accuracy across 3 seeds on base-digit extraction, and linear probes decode closed-form intermediates tied to ⌊N/B^D⌋ mod B. That would usually get sold as evidence for staged arithmetic inside the model. The causal result refuses that story. The localized route from the D-input stream to output positions depends on early D-selective communication, independent of N and B. A sparse circuit search also finds mostly separate N, B, and D routes that combine late. For interpretability work, the warning is sharp: probe-decodable state is not an execution trace. If your mechanistic claim stops at “the feature is linearly readable,” this paper gives reviewers an easy knife.
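
For readers who want the task pinned down, the closed-form target is floor(N / B^D) mod B, which is digit D of N written in base B, counting from the least significant digit. A minimal sketch of that reference computation follows; the function name is hypothetical, and this is the textbook formula, not a description of what the trained model computes internally.

```python
def digit_at(n: int, base: int, d: int) -> int:
    """Return floor(n / base**d) mod base, i.e. digit d of n in the given base.

    d = 0 is the least significant digit.
    """
    if n < 0 or base < 2 or d < 0:
        raise ValueError("need n >= 0, base >= 2, d >= 0")
    shifted = n // base**d  # the intermediate a probe might try to decode
    return shifted % base
```

The intermediate `shifted` is the kind of quantity a linear probe can decode; the paper's point is that decodability alone does not show the network routes its answer through it.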

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→Towards Solving the Gilbert-Pollak Conjecture via Large Language Models

The paper uses LLMs to generate rule-constrained geometric lemmas as executable code, raising the certified Steiner ratio lower bound from 0.824 to 0.8559. The system refines lemmas through reflection, and the full research effort uses only thousands of LLM calls.

#Reasoning#Code#Tools#arXiv

why featured

HKR-H/K/R pass: an LLM pushing a geometry conjecture is clickable; 0.824→0.8559 and thousands of calls are concrete; it hits the “can AI do research?” nerve. The math is narrow, so it stays in low featured rather than 78+.

editor take

LLMs did not prove Gilbert-Pollak, but moving the certified bound from 0.824 to 0.8559 is a cleaner research signal than another contest benchmark.

sharp

The sharp part is not “AI does math”; it is LLMs being demoted into auditable lemma-code generators. That is a better fit for research than free-form proof theater. Gilbert-Pollak targets √3/2≈0.866, and the certified lower bound had sat at 0.824 for roughly three decades. This system raises it to 0.8559 using rule-constrained geometric lemmas and verification functions, with only thousands of LLM calls. I buy this pattern more than leaderboard math demos. The output is executable, checkable local machinery, not a polished proof transcript. The caveat is equally hard: 0.8559 still misses 0.866, and an arXiv v2 is not community acceptance. But as a research workflow, this is closer to useful automation than most “LLM scientist” claims.
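
The numbers in this item can be checked with a few lines of arithmetic. The sketch below computes the conjectured ratio √3/2, confirms that three points at the corners of an equilateral triangle attain it, and measures how much of the gap the new bound closes. The constant and function names are illustrative; the two bounds are the figures quoted above.

```python
import math

CONJECTURED_RATIO = math.sqrt(3) / 2  # Gilbert-Pollak target, about 0.866
OLD_BOUND = 0.824                     # certified bound quoted above
NEW_BOUND = 0.8559                    # bound reported by the paper


def equilateral_ratio(side: float = 1.0) -> float:
    """Steiner tree length over spanning tree length for an equilateral triangle."""
    mst_length = 2 * side  # a minimum spanning tree uses two sides
    # The Steiner point is the centre; each centre-to-corner edge is side / sqrt(3).
    steiner_length = 3 * side / math.sqrt(3)
    return steiner_length / mst_length


# Share of the distance from the old bound to the conjectured value now covered.
gap_closed = (NEW_BOUND - OLD_BOUND) / (CONJECTURED_RATIO - OLD_BOUND)
```

By this arithmetic the new bound covers roughly three quarters of the distance between the old bound and the conjectured value.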

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·23

→DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

DeferMem splits long-term memory QA into high-recall candidate retrieval and query-conditioned evidence distillation, using DistillPO reinforcement learning for message selection and evidence rewriting; on LoCoMo and LongMemEval-S, it reports the highest QA accuracy, fastest runtime, and zero commercial-API token cost for memory operations.

#Agent#Memory#RAG#DeferMem

why featured

HKR-H/K/R all pass: DeferMem targets agent long-term memory, names retrieval plus evidence distillation, and claims top accuracy, fastest runtime, and 0 commercial API token cost on two benchmarks. Single arXiv paper with no code or independent replication keeps it in lower-entry

editor take

DeferMem moves memory from pre-compression to query-time evidence work; good direction, but LoCoMo and LongMemEval-S are still a clean-room proxy.

sharp

DeferMem makes the right bet: long-term memory should not be compressed before the future question exists. It keeps raw history in a segment-link structure, retrieves broad candidates, then uses DistillPO for message selection and evidence rewriting at query time. The paper claims top QA accuracy, fastest runtime, and zero commercial-API token cost for memory operations on LoCoMo and LongMemEval-S. I buy the mechanism more than the victory lap. Real agent memory breaks on permissions, temporal drift, conflicting user preferences, and contamination across tasks. Those failure modes barely show up in clean long-memory QA benchmarks. MemGPT, Zep, and LangGraph memory have already shown that memory systems die in policy and lifecycle details, not only in retrieval noise.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

The paper proposes a joint optimization framework for multi-slot guaranteed display advertising. In Meituan online A/B tests on 70% of traffic, it reports a 28.99% ARPU increase, and a difference-in-differences (DID) analysis reports improved contract stability.

#Reasoning#Meituan#Research release

why featured

HKR-H/K/R all pass, but ad-optimization research is narrower than model or agent news. The 70% traffic A/B test and 28.99% ARPU lift make it a practical research release, so it lands in low featured.

editor take

Meituan’s 28.99% ARPU lift is not model glamour; it’s constrained allocation doing the money work inside ad infra.

sharp

Meituan’s paper drags ad AI back to the cash register: the 28.99% ARPU lift comes from joint multi-slot GD allocation, not generative ad copy. The concrete hook is strong: 70% traffic A/B test, offline bipartite matching, contract roulette for slot exclusivity, Page View constraints for impression control, plus DID evidence for better contract stability. I buy the direction more than most “AI ads” claims. Marketplace ads have always been a three-way knife fight between merchant ROI, platform revenue, and contract fulfillment. Tencent and Alibaba ad systems have made money for years through auction design and constrained optimization, not chatty models. The missing pieces are the baseline, test duration, and confidence intervals. A 28.99% ARPU jump is huge; without those details, treat it as a strong Meituan-system result, not a portable recipe.
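The DID evidence mentioned above rests on a simple estimator: the change in the treated group minus the change in the control group. The sketch below shows that arithmetic only. The fulfillment-rate numbers are invented for illustration and are not from the paper, and the function name `did_estimate` is hypothetical.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-differences: treated change minus control change."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change


# Hypothetical per-contract fulfillment rates, invented for illustration only.
treat_pre, treat_post = [0.80, 0.82, 0.78], [0.90, 0.91, 0.89]
ctrl_pre, ctrl_post = [0.79, 0.81, 0.80], [0.82, 0.83, 0.84]

effect = did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(f"estimated effect on fulfillment rate: {effect:+.3f}")
```

The estimate is only as good as the parallel-trends assumption behind it, which is one more reason the missing baseline and test duration matter.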

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

The paper evaluates prompt-injection defenses for educational LLM tutors, reporting that NeMo Guardrails reaches 0% bypass, 16.22% false positives, and roughly 1.5 seconds latency, while Prompt Guard shows 38.48% bypass and 3.60% false positives under the same controlled holdout split.

#Safety#Alignment#Benchmarking#NeMo Guardrails

why featured

HKR-H/K/R all pass, but this is still a single arXiv evaluation in the education-tutor niche, not a broad agent-security benchmark. Concrete bypass, false-positive, and latency numbers justify featured at the lower band.

editor take

NeMo Guardrails’ 0% bypass looks great until 16.22% false positives turns normal tutoring prompts into security incidents.

sharp

This paper puts a hard price tag on tutoring safety: NeMo Guardrails hits 0% bypass, but pays with 16.22% false positives and about 1.5 seconds of latency. In education, 16.22% is not background noise. Student prompts are messy by default, and false blocks break the learning loop. Prompt Guard sits on the other ugly end: 38.48% bypass with 3.60% false positives. That is a prefilter, not a final gate. I like the methodology more than the headline number: same holdout split, paired McNemar tests, bootstrap confidence intervals, and multi-seed sweeps. But I would not let “0% bypass” travel outside this benchmark without fresh adversarial data.
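For readers who want to reproduce this kind of paired comparison, here is a minimal sketch of the two ingredients: bypass and false-positive rates, and a McNemar test on the prompts where two defenses disagree. It assumes the exact binomial form of the test; the entry does not say which variant the authors used, and the function names are illustrative.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: prompts where defense A was wrong and defense B was right
    c: prompts where defense A was right and defense B was wrong
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def rates(attack_blocked, benign_blocked):
    """Bypass rate on attack prompts, false-positive rate on benign prompts."""
    bypass = 1 - sum(attack_blocked) / len(attack_blocked)
    false_positive = sum(benign_blocked) / len(benign_blocked)
    return bypass, false_positive


# Hypothetical outcomes: True means the defense blocked the prompt.
bypass, false_positive = rates([True, True, False, True], [False, False, True, False])
p_value = mcnemar_exact(b=0, c=10)
```

Because both defenses are scored on the same holdout prompts, only the disagreements carry information, which is why the test ignores concordant pairs.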

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Cross-domain Benchmarks Reveal When Coordinated AI Agents Improve Scientific Inference from Partial Evidence

The paper evaluates coordinated AI agents across four scientific tasks: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955, but the exoplanet workflow is effectively tied with a strong combined-summary baseline.

#Agent#Benchmarking#ScienceClaw#Infinite

why featured

HKR-H/K/R all pass: the paper tests when multi-agent coordination beats strong baselines, not just science-domain AI. Single arXiv source with no product or open artifact keeps it in the low featured band.

editor take

This paper punctures the multi-agent science pitch: exoplanet AUROC hits 0.955, yet a strong combined-summary baseline ties it.

sharp

This paper earns its keep by forcing multi-agent science into baseline discipline. Across four tasks, climate-vector emergence reaches AUROC 0.944, and exoplanet vetting reaches 0.955. The catch is brutal: the exoplanet workflow is effectively tied with a strong combined-summary baseline, so role decomposition did not buy top-line accuracy there. I trust the negative result more than the headline score. In paradigm-shift detection, one signal dominates, so coordination mainly adds interpretation and traceability. In molecular sonification, the gain is representational, not predictive. ScienceClaw x Infinite supplies the audit and provenance layer, which is a cleaner claim than the usual “AI scientist” theater.
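The "effectively tied" reading above comes down to comparing two AUROC values on the same labels. Below is a minimal sketch of AUROC as a pairwise ranking probability, plus a tie check. The 0.01 tolerance is an assumption made here for illustration, not a threshold from the paper.

```python
def auroc(scores, labels):
    """AUROC as the chance a random positive outranks a random negative.

    Ties count as half a win. Quadratic in set size, fine for small evals.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def effectively_tied(auc_a, auc_b, tolerance=0.01):
    """Assumed rule of thumb: treat gaps within `tolerance` as a tie."""
    return abs(auc_a - auc_b) <= tolerance


# Toy scores: a hypothetical agent workflow vs. a hypothetical single baseline.
labels = [1, 1, 0, 0]
workflow_auc = auroc([0.9, 0.8, 0.3, 0.1], labels)
baseline_auc = auroc([0.9, 0.2, 0.8, 0.1], labels)
```

A real comparison would also want confidence intervals on each AUROC before calling a tie.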

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization

SCI-Defense combines PPL, SIS, and ICD, and achieves 1.000 precision and 0.000 FPR on 600 Amazon product descriptions, with recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks respectively.

#Safety#RAG#Benchmarking#Amazon

why featured

Single arXiv paper, not a model or product event; HKR-H/K/R pass because it names GEO manipulation, gives test metrics on 600 Amazon descriptions, and hits AI-search/RAG trust concerns. Featured threshold, not P1.

editor take

SCI-Defense posts perfect Amazon numbers, but 600 product descriptions with 0.000 FPR smells more like a lab fence than production GEO defense.

sharp

SCI-Defense moves GEO defense forward, but I would not read 1.000 precision as deployment-ready. The hard number is narrow: 600 Amazon product descriptions across 6 categories, with recall of 1.000, 0.952, and 0.830 on String, Reasoning, and Review attacks, plus 0.000 FPR. On 600 MS MARCO web passages, Review-attack recall drops near zero because those passages lack the persuasion signals SIS is built to catch. The useful claim is the failure mode: PPL-only filters, SafetyClf classifiers, and paraphrasing show zero recall on semantic manipulation. That tracks with what search teams are seeing: GEO is not classic content safety; it is relevance gaming written in natural language. SCI-Defense looks like a product-page rule stack today, not a general retrieval shield.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→RADAR: Defending RAG Dynamically against Retrieval Corruption

RADAR models reliable context selection in dynamic RAG as graph-based energy minimization, solves it exactly with Max-Flow Min-Cut, and uses a Bayesian memory node to recursively update belief state instead of archiving raw historical documents.

#RAG#Memory#Safety#Research release

why featured

HKR-K is strong via the graph-cut and Bayesian-memory mechanism, and HKR-R lands on production RAG trust. Single arXiv item with no disclosed metrics keeps it in the 72–77 band.

editor take

RADAR treats RAG poisoning as graph optimization, not another filter. Strong idea, but no attack scale or dataset size is disclosed here.

sharp

RADAR’s useful move is making dynamic RAG defense an optimization problem, not another reranking patch. It casts reliable context selection as graph energy minimization, solves it with Max-Flow Min-Cut, then uses a Bayesian memory node to update belief state without storing raw historical documents. That matches a real production failure mode: web retrieval changes, poisoned pages change, and static filters decay fast. I’m not buying the “superior robustness” claim yet. The excerpt names a novel dynamic dataset, but gives no dataset size, corruption rate, attacker budget, or latency cost. Security papers often win by controlling the data generator; without those knobs, this is a promising mechanism, not a deployable RAG shield.
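To make the graph-cut framing concrete, here is a minimal sketch of binary context selection as an s-t minimum cut. It is not RADAR's actual energy function: the trust scores, the agreement edges, and the `smoothness` weight are all assumptions for illustration, and the Bayesian memory node is not modeled. Each document pays a cost for its label, and documents that support each other pay a penalty if they end up with different labels.

```python
from collections import defaultdict, deque


def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max-flow; returns nodes still reachable from source."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0.0  # ensure the reverse edge exists

    def bfs():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return parent

    while True:
        parent = bfs()
        if sink not in parent:
            return set(parent)  # source side of a minimum cut
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]


def select_reliable(trust, agreement, smoothness=1.0):
    """trust[d] in [0, 1] is a hypothetical prior that document d is reliable."""
    cap = defaultdict(dict)
    for d, p in trust.items():
        cap["SRC"][d] = p        # cost of labeling d unreliable
        cap[d]["SNK"] = 1.0 - p  # cost of labeling d reliable
    for a, b in agreement:       # penalty for splitting agreeing documents
        cap[a][b] = cap[a].get(b, 0.0) + smoothness
        cap[b][a] = cap[b].get(a, 0.0) + smoothness
    keep = min_cut_source_side(cap, "SRC", "SNK")
    return sorted(d for d in trust if d in keep)


trust = {"a": 0.9, "b": 0.8, "c": 0.45, "x": 0.1}
with_support = select_reliable(trust, [("a", "c")], smoothness=0.5)
without_support = select_reliable(trust, [], smoothness=0.5)
```

In this toy setup, document `c` is dropped on its own score but kept once a trusted document supports it, which is the kind of joint decision a per-document filter cannot make.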

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→GraphFlow: Graph-Based Workflow Management for Efficient LLM-Agent Serving

GraphFlow represents agent workflows as wGraph, a graph of atomic operations, and reports a 4.95 percentage-point average gain across five benchmark datasets while reducing memory footprint by about 4× through adaptive workflow generation and KV-cache state management.

#Agent#Inference-opt#Tools#Ao Li

why featured

HKR-H/K/R pass, but this is still a single arXiv paper with mechanism and benchmark numbers only; no code, production deployment, or major-lab adoption is disclosed, so it fits the lower 72–77 featured band.

editor take

GraphFlow turns agent workflows into a reusable graph; +4.95 points and 4× lower memory are strong, but generality is still the claim to audit.

sharp

GraphFlow hits a real agent-serving problem: workflows are still treated like brittle templates, while KV-cache reuse sits in another layer. Its move is to compile workflows into a wGraph of atomic operations, instantiate task-specific paths, then manage KV state through that graph. The paper reports +4.95 percentage points across five benchmarks and roughly 4× lower memory footprint. I buy the systems direction more than the “generalizes to unseen tasks” framing. LangGraph and AutoGen already made agent flows explicit, but they mostly live at orchestration level; GraphFlow digs into serving and cache reuse, which is the stronger bet. The missing pieces are benchmark names, model sizes, concurrency settings, and latency. Without those, 4× memory reduction does not yet translate into production throughput.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network

Factored Diffusion Policies uses one shared diffusion network with per-factor null-token dropout, reducing the training-task budget from a product of factor cardinalities to a sum; in state-based multi-gate drone racing, it passes 90% of held-out gates, while the K-network composition baseline drops to 3%.

#Robotics#Reasoning#Research release#Benchmark

why featured

HKR-H comes from the 90% vs 3% drone-gate contrast; HKR-K has one score network and product-to-sum training cost. HKR-R is narrower, tied to robotics generalization and cost, so this stays in low featured.

editor take

This is a clean stab at compositional robot control: one score net, factor dropout, 90% held-out gates—promising, but still simulator-shaped.

sharp

Factored Diffusion Policies moves compositional control back into one model, and that matters more than the 90% headline. The paper trains one shared diffusion network with per-factor null-token dropout, then adds factor scores at inference. That cuts the required training-task coverage from the product of factor cardinalities to their sum. In state-based multi-gate drone racing, it passes 90% of held-out gates; the K-network composition baseline falls to 3%. I buy the direction, not the extrapolation. The authors do more than report a benchmark: they chain score error through the reverse-time ODE and a tracking controller into a trajectory-tube certificate. That is a real mechanism. The catch is scope. The strongest result is state-based drone racing, while vision is only single-gate transfer with +11.7pp success and 2.4× lower crash rate. Multi-object contact manipulation is still a different fight.
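The product-to-sum claim above is easy to see with numbers. The sketch below counts training tasks for hypothetical factor sizes and shows one plausible way to add factor scores at inference. The composition rule mirrors classifier-free-guidance style offsets and is an assumption; the entry only says that factor scores are added.

```python
from math import prod


def training_tasks_needed(cardinalities):
    """Return (joint coverage, factored coverage) for the given factor sizes."""
    return prod(cardinalities), sum(cardinalities)


def compose_scores(null_score, factor_scores):
    """Add each factor's conditional offset to the unconditional score.

    Assumed form: s = s_null + sum_i (s_i - s_null), with all scores coming
    from the same shared network under different conditioning.
    """
    combined = list(null_score)
    for score in factor_scores:
        combined = [c + (s - n) for c, s, n in zip(combined, score, null_score)]
    return combined


# Hypothetical factors, e.g. 4 gate shapes, 5 gate positions, 3 wind levels.
joint, factored = training_tasks_needed([4, 5, 3])
score = compose_scores([0.0, 1.0], [[0.5, 1.0], [0.0, 2.0]])
```

With these made-up sizes, joint coverage needs 60 task combinations while factored coverage needs 12, and the gap widens with every added factor.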

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

SR²AM splits agent decision-making into simulative reasoning, self-regulation, and reactive execution, and its v1.0-30B model uses 25.8% to 95.3% fewer reasoning tokens than comparable agentic LLMs across math, science, tabular analysis, and web information seeking.

#Agent#Reasoning#Inference-opt#SR²AM

why featured

HKR-H/K/R all pass, but the item is arXiv-summary level with no disclosed code, known lab, or external replication. It lands in low featured as practical agent-efficiency research.

editor take

SR²AM’s punch is not 30B chasing giants; it trains “think less” as policy. Agent cost work is drifting back to control, not prompts.

sharp

SR²AM attacks agent cost at the scheduling layer, and I buy that direction. Its v1.0-30B reports Pass@1 in the range of 685B–1T systems across math, science, tabular analysis, and web information seeking, while using 25.8% to 95.3% fewer reasoning tokens than comparable agentic LLMs. If that reproduces, it beats another brittle CoT wrapper. The sharp hook is the RL result: average planning horizon rises 22.8%, while planning frequency rises only 2.0%. The model learns to look farther each time it plans while planning barely more often, rather than turning every step into long deliberation. That undercuts a lot of agent demos where token burn masquerades as planning. My doubt is task coverage: web search and tables are still far from messy, long-horizon production workflows.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox replaces per-task containers with kernel-isolated workspaces for RL-based software engineering agents, reducing disk usage to about 5% of container-based pipelines and environment preparation time to about 25% of the container baseline while reporting comparable evaluation performance.

#Agent#Code#SWE-MiniSandbox#Research release

why featured

HKR-H/K/R pass: the paper offers a concrete container-free sandboxing mechanism with ~5% disk and ~25% setup time. It is useful agent-infra research, not a broad model/product release, so it sits in the 72–77 featured band.

editor take

SWE-MiniSandbox drags SWE-agent RL back to systems work: 5% disk and 25% setup time is the kind of boring win that cuts real cost.

sharp

SWE-MiniSandbox matters because it cuts the SWE-agent RL bottleneck below the model layer. It replaces per-task containers with kernel-isolated workspaces, drops disk use to about 5% of container pipelines, and cuts environment prep time to about 25%, while claiming comparable evaluation performance. That is a more deployable result than another tiny SWE-bench bump. I would be careful with the “without sacrificing isolation” claim. The abstract does not give kernel-mechanism details, attack-surface analysis, concurrency scale, or a clean Docker / Firecracker / gVisor comparison. For resource-constrained research labs, this looks immediately useful. For multi-tenant production training, the security bar is a lot higher.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

The paper formulates rank-1 steering as budget-constrained search over layer and coefficient. Geometry-guided search recovers 95% of best-found utility with 39.8% fewer trials on average across three model families, while concept granularity correlates with slower convergence and lower best-found utility at Pearson r=0.44 and r=-0.46, both p<0.001.

#Alignment#Interpretability#Inference-opt#GRACE

why featured

HKR-H/K pass: the title has a “cheap steering” hook and the abstract gives testable 39.8%/95% claims. HKR-R is weak because rank-1 steering is niche, so this sits near the featured threshold.

editor take

Rank-1 steering is not dead; it just has a search tax. A 39.8% trial cut is useful, not a stability guarantee.

sharp

This paper moves activation steering out of vibe-tuned demos and into budgeted optimization. I buy the core claim: rank-1 often fails because the layer and coefficient search is bad, not because no useful direction exists. GRACE uses prompt-boundary directional alignment as a prior, and across three model families it recovers 95% of best-found utility with 39.8% fewer trials. That matters for anyone trying to run steering inside a real inference loop. But don’t read this as proof that rank-1 controls hard concepts. The paper’s own granularity metric undercuts that story: higher granularity correlates with slower convergence and lower best-found utility, at Pearson r=0.44 and -0.46, both p<0.001. Compared with the last wave of SAE and steering-vector work, this is a cost model, not a capability jump.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Transporting Task Vectors across Different Architectures without Training

Theseus formulates task-vector transport as functional matching on observed activations, aligns representation spaces with orthogonal Procrustes analysis, and transfers task updates across vision and language models with different widths without extra training or backpropagation.

#Fine-tuning#Inference-opt#Vision#Theseus

why featured

HKR-H/K/R all pass: the no-training transfer hook is concrete, and the Procrustes alignment mechanism is new. Missing benchmark numbers and broad replication keep it at the lower featured band.

editor take

Theseus moves task vectors from weight matching to activation geometry; promising, but no scores in the body, so don’t buy the cost-saving story yet.

sharp

Theseus has the right target: task updates should be moved by functional effect, not raw parameter deltas. The paper uses orthogonal Procrustes alignment on observed activations, then claims closed-form transport across vision and language models with different widths, with no backprop or extra training. ICML 2026 acceptance and an open repo make it more than a workshop sketch. I buy the framing, not the strength of the claim yet. The body gives no benchmark table, model sizes, gain numbers, or failure cases; it only says “consistent improvements.” LoRA merging, model soup, and classic task vectors all hit the same same-architecture wall. If Theseus survives width changes with stable accuracy, it hits a real model-reuse bottleneck. Until the numbers show up, this is a clean geometric interface, not a deployable adaptation trick.
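For intuition on the alignment step, here is a minimal sketch of orthogonal Procrustes on paired activations, using NumPy. It covers only the equal-width case; how Theseus handles different widths is not described in this entry and is not attempted here. The synthetic activations, the `task_delta_source` matrix, and the conjugation rule for moving it are assumptions for illustration.

```python
import numpy as np


def orthogonal_procrustes(source_acts, target_acts):
    """Orthogonal R minimizing ||source_acts @ R - target_acts||_F.

    Equal-width case only: R = U @ Vt from the SVD of source^T @ target.
    """
    u, _, vt = np.linalg.svd(source_acts.T @ target_acts)
    return u @ vt


rng = np.random.default_rng(0)
source_acts = rng.normal(size=(200, 8))                 # synthetic activations
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
target_acts = source_acts @ true_rotation               # same inputs, rotated basis

rotation = orthogonal_procrustes(source_acts, target_acts)

# Hypothetical linear task update in source coordinates (row-vector convention),
# re-expressed in target coordinates by conjugating with the alignment.
task_delta_source = rng.normal(size=(8, 8))
task_delta_target = rotation.T @ task_delta_source @ rotation
```

The appeal is that the whole step is closed-form: one SVD on observed activations, no gradients.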

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

VeriScale expands test suites for verifiable code generation using adversarial implementations; VerinaPlus grows the original Verina suites by over 83x, while experiments on eight state-of-the-art LLMs show sharper score drops on SpecGen and CodeGen than the original benchmark exposed.

#Code#Benchmarking#VeriScale#Verina

why featured

HKR-H/K/R pass, but this is an arXiv benchmark-method paper, not a major lab release. The 83x suite expansion and 8-LLM score drops justify low featured range.

editor take

VeriScale expands Verina suites by over 83x, and the hit is clear: code models were over-scoring on verifiable correctness.

sharp

VeriScale lands where many code benchmarks are weakest: thin tests that let models look cleaner than they are. It does not sell a new coding leaderboard; it attacks the evaluation substrate. VerinaPlus expands the original Verina suites by over 83x, with a 14x VerinaLite variant, and eight SOTA LLMs take sharper drops on both SpecGen and CodeGen. That matters because this is a different failure surface from SWE-bench-style patch passing. Verina is closer to whether specs and implementations survive formal verification pressure, not whether a generated diff passes the visible project tests. The adversarial-implementation loop is the useful part here: it creates negatives that target model blind spots instead of padding the suite with easy cases. Public code also makes the claim less hand-wavy.
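The adversarial-implementation loop can be pictured as a small mutation-testing routine: a new test input earns its place only if it exposes a wrong implementation that the current suite still lets through. The sketch below is a toy version under that assumption. The reference function, the three buggy variants, and `scale_suite` are all invented for illustration and are not VeriScale's code.

```python
def reference_abs(x: int) -> int:
    return x if x >= 0 else -x


# Hypothetical adversarial implementations, each wrong on a narrow input class.
def buggy_identity(x):
    return x


def buggy_zero(x):
    return abs(x) if x != 0 else 1


def buggy_large(x):
    return abs(x) if abs(x) < 100 else 0


def scale_suite(reference, adversaries, seed_inputs, candidate_inputs):
    """Add a candidate input only if it kills an adversary the suite misses."""
    suite = list(seed_inputs)

    def survives(impl):
        return all(impl(x) == reference(x) for x in suite)

    for x in candidate_inputs:
        if any(survives(impl) and impl(x) != reference(x) for impl in adversaries):
            suite.append(x)
    survivors = [impl.__name__ for impl in adversaries if survives(impl)]
    return suite, survivors


suite, survivors = scale_suite(
    reference_abs,
    [buggy_identity, buggy_zero, buggy_large],
    seed_inputs=[1, 2, 3],
    candidate_inputs=[4, -1, 0, -2, 500],
)
```

Inputs like `4` and `-2` are skipped because they expose nothing new, which is the difference between targeted negatives and padding the suite with easy cases.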

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

stable-worldmodel introduces an open-source world modeling platform with a Lance-based data layer, baseline implementations, and standardized generalization benchmarks, supporting MP4, HDF5, and LeRobot datasets for reproducible evaluation.

#Agent#Robotics#Benchmarking#stable-worldmodel

why featured

HKR-K/R pass: Lance data layer, baselines, and standardized generalization eval address reproducible world-model work. HKR-H is weak, with no results, author signal, or adoption case, so this stays at the featured floor.

editor take

swm attacks the unglamorous bottleneck: MP4, HDF5, and LeRobot into Lance first, before anyone claims world-model progress from bespoke pipelines.

sharp

stable-worldmodel matters because it moves the reproducibility fight into the data layer and evaluation protocol. The paper names a Lance-based layer, MP4/HDF5/LeRobot conversion, baseline implementations, planning solvers, and controllable visual, geometric, and physical factors for generalization tests. That is more useful than another polished video-prediction clip. I don’t buy the abstract’s “dramatically reduces research overhead” claim yet. It gives no throughput number, task count, baseline score, or ablation in the scraped body. But the target is right: world-model and robotics papers have leaned too hard on private loaders, private environments, and private splits. LeRobot already pushed format convergence; swm adds pressure on evaluation. If the repo is maintained, future papers lose one excuse for hand-wavy OOD claims.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Dynamic Mixture of Latent Memories for Self-Evolving Agents

MoLEM uses a dynamic mixture-of-experts module to generate latent memories while keeping the reasoning base model frozen. A router selects experts through key-query matching, and stage-specific lightweight autoencoders choose routing groups at inference. On continual-learning sequences across math, science, and code, it improves average accuracy by 10.40% over a Vanilla pretrained baseline, while competing methods do not consistently exceed that baseline across training orders.

#Agent#Reasoning#Memory#MoLEM

why featured

HKR-H/K/R pass, but this is a single arXiv methods paper without disclosed code, author authority, or deployment proof. The +10.40% claim and frozen-base memory mechanism clear the featured threshold.

editor take

MoLEM is a neat frozen-backbone memory patch, not self-evolution yet; the 10.40% lift matters only if routing survives messy live agent traces.

sharp

MoLEM’s useful move is pushing continual learning into a dynamic MoE memory layer while freezing the reasoning backbone. The paper uses key-query routing over experts, then stage-specific lightweight autoencoders to choose routing groups at inference; after math, science, and code sequences, average accuracy is 10.40% above the Vanilla pretrained baseline. That is a cleaner claim than another RAG wrapper, because the memory is injected as latent state rather than retrieved text stuffed into context. I don’t buy the “self-evolving agents” label yet. The evaluation is still offline continual-learning orderings, not live agent traces with tool failures, user-state drift, or adversarial memory pollution. For practitioners, this reads like a promising adapter-memory recipe, not evidence that agents can safely update themselves in the wild.
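Key-query routing is a familiar mixture-of-experts pattern, and a small sketch helps show what the router does. The version below scores each expert key against a query by dot product, keeps the top two, and normalizes with a softmax. The expert names, vectors, and `top_k` value are assumptions for illustration; the stage-specific autoencoders that pick routing groups are not modeled.

```python
import math


def route(query, expert_keys, top_k=2):
    """Pick the top_k experts by dot-product match and softmax their scores."""
    scores = {
        name: sum(q * k for q, k in zip(query, key))
        for name, key in expert_keys.items()
    }
    chosen = sorted(scores, key=scores.get, reverse=True)[:top_k]
    peak = max(scores[name] for name in chosen)  # subtract for numerical stability
    exps = {name: math.exp(scores[name] - peak) for name in chosen}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical expert keys, one per task family named in the entry.
expert_keys = {
    "math": [1.0, 0.0, 0.0],
    "science": [0.0, 1.0, 0.0],
    "code": [0.0, 0.0, 1.0],
}
weights = route([0.9, 0.3, 0.1], expert_keys)
```

The frozen backbone never sees these weights directly; in the paper's framing they only decide which experts generate the latent memory.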

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

The paper identifies activation subspace bottlenecks in Mamba-family SSMs and applies a test-time scalar activation-scaling intervention, improving performance by 8.27% on average across 7 SSMs and 6 benchmarks without task-specific tuning.

#Interpretability#Inference-opt#Benchmarking#Mamba

why featured

HKR-H and HKR-K pass: this is not a routine SOTA claim, but a test-time steering mechanism for Mamba-like SSMs. HKR-R is weak because the topic stays niche and research-heavy.

editor take

This SSM paper has teeth: 8.27% average gain from test-time bottleneck scaling across 7 Mamba models is a tool, not another attention-alternative pitch.

sharp

This paper makes the Mamba-family SSM weakness unusually actionable: an activation subspace bottleneck. Across 7 SSMs and 6 benchmarks, the authors get an 8.27% average lift by multiplying identified bottleneck activations by a scalar at test time, with no task-specific tuning. That is a more useful claim than the usual “SSMs avoid quadratic attention” sales line. I’d still interrogate the 8.27%. The snippet does not show per-benchmark splits, so two long-context tasks may be carrying the average. Stable-Mamba also needs retraining from scratch for long-context gains, so the steering trick is not a free architecture fix. Transformer interpretability already has heads, MLP circuits, and SAE-style tooling; SSMs need exactly this kind of locatable, editable internal structure to stay in the serious model-design conversation.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

The paper compares SFT, RL, and on-policy distillation using Qwen3-0.6B-Base on GSM8K. Mild SFT and lightweight on-policy RL improve GSM8K with limited forgetting. Stress SFT causes retention loss on TruthfulQA and MMLU, while OPD from a degraded SFT teacher beats that teacher across all three evaluations.

#Fine-tuning#Reasoning#Benchmarking#Qwen

why featured

HKR-H/K/R pass, but the evidence is limited to Qwen3-0.6B and GSM8K, with weak broad-model validation. This is useful research signal, not same-day must-write news.

editor take

Qwen3-0.6B tests SFT/RL/OPD on GSM8K; I buy the state-distribution lens, but not broad claims from small-model GSM8K.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→TextTeacher: What Can Language Teach About Images?

TextTeacher adds frozen text-encoder embeddings from image captions as auxiliary semantic anchors during standard ViT image-classification training, leaving inference unchanged; on ImageNet it improves accuracy by up to 2.7 percentage points, yields transfer gains averaging 1.0 point, and matches vision distillation accuracy while running 33% faster under comparable compute.

#Vision#Multimodal#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv vision-training paper with impact mostly inside model-training teams. The mechanism and numbers are concrete, but it stays below featured product-level urgency.

editor take

TextTeacher lifts ImageNet ViT by up to 2.7 points; I buy this low-intrusion frozen-text-anchor route over heavier distillation.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

The paper compares optimizers under a fixed Transformer architecture and width schedule: AdamW shows weak hard-rank scaling on rare-token TAIL representations with β=0.44, while Muon reaches β=1.02 in the same regime, a 2.3× higher scaling exponent.

#Reasoning#Benchmarking#AdamW#Muon

why featured

HKR-H/K/R all pass, but the paper sits in optimizer and spectral-analysis territory with a high accessibility bar. No model release, tool, or production replacement keeps it below featured.

editor take

Muon lifts TAIL hard-rank β from 0.44 to 1.02 under the same architecture; choosing optimizers by loss alone is blind.
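A scaling exponent like β is typically read off as the slope of a log-log fit, and the 2.3× figure is just the ratio of the two slopes. The sketch below shows that calculation on synthetic data generated from exact power laws with the reported exponents. The widths and the rank values are made up; how the paper measures hard rank is not described in this entry.

```python
import math


def fit_exponent(widths, ranks):
    """Least-squares slope of log(rank) against log(width)."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


widths = [256, 512, 1024, 2048]  # hypothetical width schedule
# Synthetic ranks from exact power laws using the exponents quoted above.
adamw_ranks = [w ** 0.44 for w in widths]
muon_ranks = [w ** 1.02 for w in widths]

beta_adamw = fit_exponent(widths, adamw_ranks)
beta_muon = fit_exponent(widths, muon_ranks)
ratio = beta_muon / beta_adamw
print(f"beta ratio: {ratio:.2f}")
```

With real measurements the fit would come with noise, so the exponent should be reported with an interval rather than as a point value.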

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Heterogeneous Agent Collaborative Reinforcement Learning

The paper introduces HACRL and HACPO for heterogeneous agents that share verifiable rollouts during training and run independently at inference time; HACPO adds four mechanisms for capability gaps and policy shifts, beating GSPO with double rollouts by 3.6% on average while using half the rollout cost.

#Agent#Reasoning#Alignment#Research release

why featured

HKR-H/K/R pass: HACPO shares verifiable trajectories across heterogeneous agents and reports +3.6% over GSPO with half rollout cost. Single arXiv paper, no code or cross-source pickup disclosed, so it stays below featured.

editor take

HACPO beats double-rollout GSPO by 3.6%; I’d test whether it collapses into distillation once rewards stop being verifiable.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

AutoMCU uses an LLM-based multi-agent system to customize neural networks for MCUs, filtering infeasible RAM and Flash designs through vendor toolchain feedback before training and finishing CIFAR-10/100 customization in about 1–2 hours versus hundreds of GPU hours for MCU-oriented HW-NAS baselines.

#Agent#Inference-opt#Benchmarking#AutoMCU

why featured

HKR-H and HKR-K land: the paper claims 1–2h CIFAR-10/100 customization versus hundreds of GPU hours. HKR-R is weak because MCU HW-NAS is narrow, so this stays below featured.

editor take

AutoMCU gets CIFAR-10/100 MCU models in 1–2 hours; I buy toolchain feedback, not the multi-agent LLM framing.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Memory-R2 trains memory-augmented LLM agents with LoGo-GRPO for fairer credit assignment. Local rerollouts compare memory operations from the same intermediate state. A global objective keeps trajectory-level learning. Its curriculum increases the training horizon from 8 to 16 to 32 sessions, and the post does not disclose benchmark results.

#Agent#Memory#Reasoning#Memory-R2

why featured

Single arXiv paper with concrete LoGo-GRPO, local resampling, and an 8→16→32-session curriculum, so HKR-K/R pass. HKR-H is weak, and no code, metrics, or adoption signal keeps it in 60–71.

editor take

Memory-R2 trains up to 32 sessions; the useful bit is same-state rerollouts, but benchmark results are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Curriculum Reinforcement Learning Improves LLM Reasoning Credit Assignment

SCRL derives verifiable subproblems from reference reasoning chains and improves Qwen3-4B-Base average accuracy over GRPO by 4.1 points across seven mathematical reasoning benchmarks.

#Reasoning#Alignment#Benchmarking#Qwen

why featured

HKR-H/K pass: the mechanism and +4.1 pp benchmark result are concrete for reasoning-training readers. HKR-R is weak because this is a single arXiv paper with no disclosed release, adoption, or cost impact.

editor take

SCRL beats GRPO by 4.1 points on 7 math benchmarks; slicing reference chains into verifiable subproblems is a practical RLVR credit-assignment patch.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Residual Skill Optimization for Text-to-SQL Ensembles

DivSkill-SQL improves selected accuracy on Spider2-Lite by up to 11.1 points for Snowflake and 8.3 points for BigQuery over the strongest ensemble baseline; it adds complementary Text-to-SQL skills without model fine-tuning by optimizing each new skill on examples the current ensemble fails.

#Agent#Code#Reasoning#DivSkill-SQL

why featured

HKR-K and HKR-R pass: it has concrete benchmark gains and a residual-skill mechanism. HKR-H misses because the paper framing is narrow, so it stays in the upper all band.

editor take

DivSkill-SQL gains up to 11.1 points on Spider2-Lite; I buy it: Text-to-SQL needs less correlated failure, not more sampling.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→The Distillation Game: Adaptive Attacks & Efficient Defenses

The paper frames distillation attacks as a minimax game and introduces PoE, a forward-pass-only defense; on GSM8K and MATH, adaptive students recover substantially more capability than passive evaluation reports, while PoE narrows the robustness gap against costlier defenses and keeps higher-quality reasoning traces.

#Reasoning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R pass, but the item is only an arXiv abstract with no author authority, artifact detail, or concrete extraction numbers disclosed; this stays at the high end of 60–71, not featured.

editor take

PoE uses only forward passes to suppress distillation signal; GSM8K/MATH show passive evals flatter defenses too much.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

The paper introduces SCALE, an optimizer that matches or exceeds Adam in 60M-1B LLM pretraining while using 35-45% of total memory.

#Fine-tuning#Inference-opt#SCALE#Adam

why featured

HKR-H/K/R all pass, but evidence is limited to 60M-1B pretraining, so frontier-scale relevance remains unproven. A useful arXiv optimizer paper, but not featured-level yet.

editor take

SCALE matches Adam on 60M-1B pretraining at 35-45% total memory; I’d reproduce it before retiring Adam.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→TextSeal: A Localized LLM Watermark for Provenance and Distillation Protection

TextSeal adds dual-key generation, entropy-weighted scoring, and multi-region localization on Gumbel-max sampling, reports no inference overhead, and shows no perceptible quality difference in 6,000 A/B comparisons across 5 languages.
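For readers unfamiliar with the Gumbel-max base that the summary mentions, here is a minimal sketch of keyed Gumbel-max sampling with a simple detection score. It is a generic single-key illustration, not TextSeal's scheme: the dual-key generation, entropy weighting, and multi-region localization are not reproduced, and `keyed_uniforms`, `gumbel_max_sample`, `detection_score`, and the context width are hypothetical names and choices.

```python
import hashlib
import math
import random


def keyed_uniforms(key, context, vocab_size):
    """Derive one pseudo-random uniform per vocabulary entry from a secret key and recent tokens."""
    material = (key + "|" + " ".join(str(t) for t in context)).encode()
    seed = int.from_bytes(hashlib.sha256(material).digest(), "big")
    rng = random.Random(seed)
    return [rng.random() for _ in range(vocab_size)]


def gumbel_max_sample(probs, uniforms):
    """Pick argmax_i of log p_i + g_i with g_i = -log(-log u_i) (the Gumbel-max trick)."""
    best, best_score = None, -math.inf
    for i, (p, u) in enumerate(zip(probs, uniforms)):
        if p <= 0.0:
            continue  # zero-probability tokens can never be chosen
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logs finite
        score = math.log(p) - math.log(-math.log(u))
        if score > best_score:
            best, best_score = i, score
    return best


def detection_score(tokens, key, vocab_size, width=2):
    """Sum of -log(1 - u) at the emitted tokens; each term is non-negative.

    Assumption: under the matching key, sampled tokens tend to have u close to 1,
    so watermarked text should score higher than unrelated text.
    """
    score = 0.0
    for t, tok in enumerate(tokens):
        u = keyed_uniforms(key, tokens[max(0, t - width):t], vocab_size)[tok]
        score += -math.log(1.0 - u)
    return score
```

Because the randomness comes entirely from the key and context, sampling stays a single argmax per token, which is consistent with the summary's "no inference overhead" claim for this family of methods.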

#Safety#Inference-opt#Benchmarking#TextSeal

why featured

HKR-H/K/R pass via the localized watermark hook, dual-key entropy scoring, and provenance/IP concerns. Single arXiv paper with no named deployment, code, or cross-source cluster keeps it in 60–71.

editor take

TextSeal reports 6,000 A/B tests with no perceived quality loss; the distillation “radioactivity” is the sharp claim for dataset forensics.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

Simon Rosen and coauthors released MoralityGym, a benchmark with 98 trolley-dilemma-style Gymnasium environments that uses Morality Chains and a Morality Metric to evaluate hierarchical moral alignment in sequential decision-making agents.

#Agent#Alignment#Benchmarking#Simon Rosen

why featured

HKR-H/K/R pass on the trolley-dilemma Gym hook, 98 environments, and agent-safety concern. Importance stays below featured because this is an arXiv v2 with no disclosed adoption, leaderboard impact, or visible debate.

editor take

MoralityGym ships 98 trolley-style Gymnasium tasks; I don’t buy the moral-alignment framing, but it’s useful Safe RL stress testing.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I learns explicit rubrics from preference pairs, scores paired images with a VLM judge, and uses L1-regularized logistic regression to select Top-N discriminative rules; the paper says it uses less than 0.01% of annotated preference data, beats strong reward-model baselines on MMRB2, and improves TIIF and UniGenBench++ generation quality via Flow-GRPO on diffusion models.

#Vision#Alignment#Benchmarking#Kuei-Chun Kao

why featured

HKR-H/K/R all pass, driven by the <0.01% preference-data claim and concrete rule-selection mechanism. It stays in the upper 60–71 band because this is a single arXiv paper with benchmark claims, no disclosed code or independent replication.

editor take

AutoRubric-T2I beats reward-model baselines on MMRB2 with under 0.01% preference data; readable rubrics beat another opaque BT score.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

The paper tests self-play RL on a Python output-prediction task and a deterministic DSL twin task, finding that a strict data gate stabilizes training under every tested reward variant, while no reward variant remains sufficient once the gate is removed.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but the article provides only an arXiv-level summary with no lab authority, code, or cross-source pickup. It is a useful self-play RL training signal, not same-day industry news.

editor take

Two tasks show strict data gating stabilizes every reward variant; blaming self-play collapse on reward design looks lazy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV uses query-aware page scoring, support-aware candidate selection, and sparse entmax attention to approach full-cache entmax decoding with a small KV-cache fraction, reporting up to 3.36× speedup over full softmax attention and 5.43× over full entmax attention at 1M context length.

#Inference-opt#EntmaxKV#arXiv#deep-spin

why featured

HKR-K and HKR-R pass: 1M context plus 3.36×/5.43× speedups give concrete signal for inference cost. HKR-H is weak because Entmax attention is niche, so technical accessibility keeps it in all.

editor take

EntmaxKV reports 5.43× at 1M context; I buy support recovery, but entmax-model migration cost is the catch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

NaviAgent decouples task planning from tool execution with graph-modeled tool relations, and its TWNM component raises task success rate by 13.1 points on complex API-Bank and ToolBench tasks.

#Agent#Tools#Reasoning#NaviAgent

why featured

HKR-K and HKR-R pass: the paper states a concrete mechanism and a 13.1-point gain on API-Bank and ToolBench. Single arXiv source, dry title, and no disclosed code or production validation keep it in the 60–71 band.

editor take

NaviAgent adds 13.1 TSR points on complex tasks; graphing tool dependencies sounds useful, but “thousands of tools” needs harder proof.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

The paper proposes SCP-Δr, a NAF-based algorithm that smooths low-impact tokens using relative probabilities and a base model; the abstract claims orders-of-magnitude stronger theoretical bounds against training data extraction, but the RSS snippet does not disclose exact factors.

#Fine-tuning#Safety#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: the paper adds a named defense mechanism and targets fine-tuning privacy risk. The article is theory-heavy and lacks concrete protection ratios or reproduction details, so it stays in the 60–71 band.

editor take

SCP-Δr smooths low-impact tokens; exact factors and tasks are undisclosed, so don’t treat NAF as deployable privacy yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→EdgeRazor: A Lightweight Framework for LLMs via Mixed-Precision Quantization-Aware Distillation

EdgeRazor compresses LLMs with three mixed-precision quantization-aware distillation modules; on Qwen3-0.6B, the 1.58-bit variant reduces storage from 1.11GB to 0.19GB and accelerates decoding by 15.16x over the 16-bit baseline.
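The storage figures can be sanity-checked with a little arithmetic. The sketch below computes the reported compression ratio, compares it with the ideal ratio for purely ternary weights (log2(3) ≈ 1.58 bits each), and shows an absmean ternary quantizer. The quantizer is an assumption for illustration only; the summary does not say how EdgeRazor maps weights to 1.58 bits, and `ternarize` is a hypothetical helper.

```python
import math

# Numbers taken from the summary above.
fp16_size_gb = 1.11
compressed_size_gb = 0.19

reported_ratio = fp16_size_gb / compressed_size_gb   # about 5.84x
bits_per_ternary_weight = math.log2(3)               # about 1.585 bits
ideal_ternary_ratio = 16 / bits_per_ternary_weight   # about 10.1x if every weight were ternary


def ternarize(weights):
    """Map weights to {-1, 0, +1} times a shared scale (absmean scaling; illustrative only)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


codes, scale = ternarize([0.9, -0.8, 0.05, 0.0])
# codes == [1, -1, 0, 0], scale == 0.4375
```

The reported ratio sits below the all-ternary bound; the summary does not break down which tensors stay at higher precision.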

#Inference-opt#Fine-tuning#Qwen#MobileLLM

why featured

HKR-H/K/R pass via the 1.58-bit, 0.19GB and 15.16x claims, with clear cost and edge-deployment relevance. It stays in all because this is an arXiv compression framework tested on Qwen3-0.6B, not a broad product release.

editor take

EdgeRazor cuts Qwen3-0.6B to 0.19GB; 1.58-bit with 15.16x decoding makes sub-4-bit edge LLMs look practical.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→LiteCoOp: Lightweight Multi-LLM Shared-Tree Reasoning for Model-Serving Compiler Optimizations

LiteCoOp coordinates eight heterogeneous LLMs through a shared MCTS tree for compiler optimization, reducing GPU/CPU compilation time by 1.95x/1.74x and API cost by 4.47x/4.32x while invoking the largest model for only 23.1%/23.9% of calls.

#Reasoning#Code#Inference-opt#LiteCoOp

why featured

HKR-H/K/R all pass, but the topic sits in model-serving compiler optimization with a higher systems bar than general AI news. After a technical-accessibility discount, it stays in 60–71, not featured.

editor take

LiteCoOp routes 8 LLMs serially and cuts API cost 4.47x; shared MCTS beats agent theater for compiler search.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

The paper splits the healthcare LLM evaluation-deployment gap into task and outcome assumptions, and a retrospective analysis of one healthcare RCT finds the two gap types are roughly equal in size.

#Benchmarking#Safety#Research release#Benchmark

why featured

Single arXiv paper on healthcare LLM evaluation with a concrete framework and RCT-based comparison, but no model release, product impact, or cross-source traction. HKR-K/R pass, HKR-H is weak, so it stays in the 60–71 research-signal band.

editor take

One healthcare RCT splits task/outcome gaps; I buy the framework, but one case can't indict medical benchmarks.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction

The study trained five CKD risk classifiers on 400 UCI patients and all reached 1.00 AUROC internally; on 97 MIMIC-IV demo patients, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, conformal coverage dropped to 0.21-0.25 against a 90% target, and no model exceeded 4/16 deployment readiness.
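To make the two calibration numbers concrete, here is a minimal sketch of how expected calibration error (ECE) and empirical conformal coverage are commonly computed. The binning scheme (10 equal-width confidence bins over the predicted class) and the function names are assumptions; the summary does not state the study's exact protocol.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary classifiers.

    probs: predicted probability of the positive class; labels: 0/1 ground truth.
    Returns the sample-weighted mean of |accuracy - confidence| over confidence bins.
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p            # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)      # conf == 1.0 falls in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(a for _, a in bucket) / len(bucket)
            ece += len(bucket) / len(probs) * abs(accuracy - avg_conf)
    return ece


def empirical_coverage(prediction_sets, labels):
    """Fraction of cases whose conformal prediction set contains the true label."""
    hits = sum(1 for s, y in zip(prediction_sets, labels) if y in s)
    return hits / len(labels)
```

A model that is confidently wrong on every case scores an ECE of 1.0, which puts the reported 0.68-0.76 in context; coverage of 0.21-0.25 means the prediction sets contained the true label for roughly one patient in four, against a 0.90 target.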

#Benchmarking#Safety#UCI#MIMIC-IV

why featured

HKR-H/K/R all pass, but this is a medical risk-prediction evaluation, not a model, agent, or product update. Small samples and limited industry spillover keep it in the 60-71 research band.

editor take

Five CKD classifiers hit 1.00 AUROC on UCI, then fell to 0.48-0.58 on 97 MIMIC-IV cases; internal scores are still fooling clinical ML.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

The paper introduces Synergistic Faithfulness for VLM explainability and evaluates 8 XAI methods across 3 VLM architectures and 3 datasets. It reports ρ=0.92 as a surrogate for cross-modal interaction and a 24× computational speedup, while finding VLM explainers over-index on visual salience.

#Multimodal#Vision#Interpretability#Research release

why featured

HKR-K is strong: a new metric, test matrix, correlation, and speed figure. HKR-R is present for VLM explainability, but this is a single arXiv benchmark without product impact or cross-source traction, so it stays in 60–71.

editor take

Synergistic Faithfulness reports ρ=0.92 across 8 methods, 3 VLMs, 3 datasets; the visual-salience bias callout lands.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Billion-Scale Graph Foundation Models

GraphBFF presents an end-to-end recipe for billion-parameter graph foundation models, evaluates a billion-parameter GraphBFF Transformer on unseen real-world graphs, and reports gains over baselines across 10 downstream node- and link-level tasks, with margins up to 31 PRAUC points.

#Reasoning#Fine-tuning#Benchmarking#GraphBFF

why featured

HKR-H and HKR-K pass: the title has a billion-scale hook, and the abstract gives 10 tasks plus a 31 PRAUC-point gain. The graph-ML focus is specialized, so HKR-R fails and the item stays in the 60–71 band.

editor take

GraphBFF reports up to +31 PRAUC on 10 unseen-graph tasks; solid scaling-law signal, but arXiv evidence is not production proof.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→TAPIOCA: Why Task-Aware Pruning Improves OOD Model Capability

TAPIOCA shows that task-aware layer pruning gives no benefit on in-distribution data across controlled polynomial regression tasks and large language models, but consistently improves out-of-distribution accuracy under tested distribution shifts.

#Inference-opt#Reasoning#Benchmarking#TAPIOCA

why featured

HKR-H and HKR-K pass: the counterintuitive pruning/OOD claim is clear. HKR-R is weak because model names, datasets, and gain sizes are not disclosed, keeping it in the 60-71 band.

editor take

TAPIOCA says pruning lifts OOD, not ID; I buy the direction, but model names and gains are undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Support-Aware Offline Policy Selection for Advertising Marketplaces

The paper presents a support-aware offline decision framework for reserve-price policy selection, reducing a 19-policy catalog to a two-policy validation shortlist while certifying non-harm across 44 advertiser, exchange, and region segments.

#Benchmarking#iPinYou#Research release

why featured

HKR-H/K pass: the paper has testable numbers, 19 policies to 2 candidates and 44 no-harm segments. The ads-marketplace scope is narrow, so HKR-R fails and the item stays in all rather than featured.

editor take

This cuts 19 reserve-price policies to 2 and certifies 44 segments; 47.66% replay lift is nice, bidder response is the trap.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SynAE: A Framework for Measuring Synthetic Data Quality in Tool-Calling Agent Evaluations

SynAE evaluates synthetic benchmarks for multi-turn tool-calling agents across four metric categories: task instructions and intermediate responses, tool calls, final outputs, and downstream evaluation.

#Agent#Tools#Benchmarking#SynAE

why featured

HKR-K and HKR-R pass: tool-calling agent eval quality is a real practitioner concern, and the post gives four metric categories. No results, dataset size, or reproducible findings are disclosed, so it stays in the 60–71 band.

editor take

SynAE scores synthetic tool-agent benchmarks across 4 metric groups. Single-score agent evals look brittle once trajectories matter.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Internal narratives parameterise affective states

The paper uses two studies with 1,257 participants to test LLM representations of internal narratives, finding that symptom-specific thought descriptions predict standardized self-reported depression scores.

#Embedding#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv psychometrics paper; the feed gives sample size and task only, not model details, effect sizes, or reproducible setup. Keep it in all, below featured.

editor take

Two studies cover 1,257 people; I buy the signal, not the “affect as computational state” wrapper.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

TRM replaces terminal ranking costs in fixed latent world models and raises LeWM success on the hard TwoRoom benchmark from 7.0% to 97.0%, while improving a PLDM baseline from 32.7% to 84.0% across three seeds.

#Robotics#Reasoning#Benchmarking#LeWorldModel

why featured

HKR-H/K pass: the mechanism and numbers are concrete, and 7.0%→97.0% is eye-catching. The arXiv world-model/planning focus is narrow, so it stays in the lower all band.

editor take

TRM swaps only the terminal ranking head and lifts TwoRoom LeWM from 7% to 97%; latent MSE was the broken interface.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection

FAME uses an LLM once offline to partition log templates into failure domains, then trains an on-premise router and experts; on BGL it reaches F1=98.16 at K=100, cuts annotation effort by 76x, and detects 86.3% of anomalies from unseen EventIDs.

#Agent#Reasoning#Inference-opt#FAME

why featured

HKR-K and HKR-R pass: the paper gives a testable mechanism plus BGL/F1/label data, and the pain is AIOps labeling cost. HKR-H is weak and the domain is narrow, so it stays in the 60–71 band.

editor take

FAME hits 98.16 F1 on BGL at K=100; I buy the offline-LLM design, not another per-log token burner.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Systematic Study of Schwartz Value Detection in Political Texts

The paper compares sentence, window, and full-document inputs with RAG on the ValuesML/Touché ValueEval format; full-document context raises DeBERTa macro-F1 by 3.8–4.8 points over sentence-only input, but does not consistently improve zero-shot LLMs.

#RAG#Benchmarking#arXiv#DeBERTa

why featured

HKR-H/K/R pass: the paper tests context length, model size, and value knowledge together, with DeBERTa +3.8–4.8 macro F1 while zero-shot LLMs do not improve reliably. Single arXiv paper and a narrow political-text task keep it in the 60–71 band.

editor take

DeBERTa gains 3.8–4.8 F1 from full context; early-fusion RAG beats lazy long-context/model-size faith here.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

InnerQ quantizes the KV cache by grouping cache matrices along the inner dimension, and experiments on Llama and Mistral report 1.3x average decode speedup over prior KV-cache quantization methods and 2.7x over a non-quantized baseline.

#Inference-opt#Llama#Mistral#Research release

why featured

HKR-K and HKR-R pass: the 2.7x baseline speedup is a testable inference claim tied to KV cache cost. As a narrow single arXiv quantization paper with no disclosed open-source artifact or production proof, it stays in the 60–71 band.

editor take

InnerQ reports 1.3x faster decode than prior KV-cache quantization on Llama/Mistral; grouping along the inner dimension to match GPU VMM beats another KV quant paper chasing compression.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

The paper proposes DASD for LLM self-distillation, routing supervision by token entropy: high-entropy tokens move away from the privileged teacher, low-entropy tokens move toward it, and DASD reports the best macro Avg@16 across six mathematical reasoning benchmarks.
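The routing rule described above can be sketched in a few lines: compute the entropy of the student's next-token distribution and choose the direction of the distillation signal from it. The fixed threshold and the +1/-1 encoding are assumptions for illustration; the summary does not say how DASD sets the boundary or weights the loss.

```python
import math


def token_entropy(dist):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def distillation_direction(dist, threshold):
    """Return +1 (move toward the teacher) for low-entropy tokens, -1 (move away) otherwise.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return -1 if token_entropy(dist) > threshold else 1


uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy = ln 4, about 1.386
confident = [0.97, 0.01, 0.01, 0.01]   # entropy about 0.168
directions = [distillation_direction(d, threshold=0.5) for d in (uncertain, confident)]
# directions == [-1, 1]
```

In a training loop the direction would multiply a per-token distillation loss term, so high-entropy positions are pushed away from the teacher while low-entropy ones are pulled toward it.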

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the mechanism is specific and backed by six benchmarks. Still, this is a single arXiv distillation paper with no disclosed code, cost data, or production replacement claim.

editor take

DASD reverses teacher pressure at high-entropy tokens; six math sets lead Avg@16, but model scale and gains aren’t disclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Ex-GraphRAG replaces GraphRAG’s GNN encoder with M-GNAN, preserves black-box performance on STaRK-Prime, and audits evidence routing by decomposing encoder outputs across nodes and feature groups, with removal of low-attribution intermediary nodes degrading multi-hop QA by up to 28%.

#RAG#Interpretability#Reasoning#Ex-GraphRAG

why featured

HKR-K/R pass via a concrete M-GNAN mechanism and a 28% degradation result tied to GraphRAG debugging. HKR-H is weak, and this is a single arXiv paper with no code, product release, or cross-source traction, so it stays in 60–71.

editor take

Ex-GraphRAG keeps STaRK-Prime performance and shows 28% QA drops from removing intermediary nodes; GraphRAG interpretability finally has an audit hook.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Kunyang Li and coauthors propose ARL2, a hybrid attention module that replaces cross-frame softmax attention with a fixed-size recurrent state, and report up to 2.26× wall-clock speedup and 54% memory reduction after replacing 75% of layers while maintaining comparable quality and improving temporal consistency.

#Vision#Inference-opt#Memory#Kunyang Li

why featured

HKR-K/R pass: the mechanism and 2.26x/54% metrics are concrete, and inference cost matters for video diffusion. Still, this is a specialized arXiv architecture paper, so it stays in the 60–71 band.

editor take

ARL2 replaces cross-frame attention in 75% of layers and gets up to 2.26× speedup; fixed state beats another KV-cache patch here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

The paper tests RL-finetuned VLMs under misleading captions and incorrect CoT traces, finding robustness and confidence drops in open-source multimodal reasoning models and an accuracy-faithfulness trade-off during finetuning.

#Multimodal#Reasoning#Fine-tuning#Research release

why featured

HKR-K/R pass: the article gives two reproducible intervention types and an accuracy-faithfulness tradeoff. Model names, sample size, and metric drops are not disclosed, so it stays in the 60–71 band.

editor take

The paper probes RL-VLMs with misleading captions and bad CoT; open models lose robustness, and accuracy-only tuning pays in faithfulness.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in LLMs

DualOptim+ proposes an optimizer framework for LLM machine unlearning, using a base state for shared forgetting-retaining representations and delta states for objective-specific residuals; it switches between shared and decoupled states based on gradient direction conflicts, adds an 8-bit variant to reduce memory overhead, and releases code on GitHub.

#Alignment#Safety#Fine-tuning#CityU-MLO

why featured

HKR-K and HKR-R pass: the paper gives a concrete optimizer-state mechanism for LLM unlearning. HKR-H is weak, and no benchmark gains or deployment case are disclosed, so it stays in the 60–71 band.

editor take

DualOptim+ switches optimizer states on gradient conflict; details aren’t disclosed, so I’d check retained capability loss first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

The paper introduces CEDAR, a post-hoc method that uses an invertible transformation and a top-k sparsity bottleneck to disentangle pretrained vision-language embeddings without increasing dimensionality; CLIP-like coordinates map to textual concepts, while BLIP-style generative models decode them into natural-language descriptions.

#Multimodal#Vision#Interpretability#CEDAR

why featured

HKR-H and HKR-K pass via the concept-coordinate hook and CEDAR mechanism, but HKR-R is weak. A single arXiv interpretability paper without production impact, artifact, or benchmark numbers fits the 60–71 band.

editor take

CEDAR disentangles embeddings via invertible transforms plus top-k sparsity; I like the bet, but the abstract omits k and benchmarks.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

The paper introduces StockR1, a time-series-enhanced LLM that links stock forecasting with financial reasoning through verifiable forecast actions, and reports 17.7% and 25.9% reasoning accuracy gains for 4B and 8B models on a 10-year benchmark.

#Reasoning#Tools#Fine-tuning#StockR1

why featured

HKR-K is strong: StockR1, a 10-year benchmark, and two reported model gains. HKR-R is moderate for finance-AI reliability, but HKR-H is weak and this is a single arXiv paper, so it stays below featured.

editor take

StockR1 lifts 4B/8B accuracy 17.7%/25.9% on a 10-year benchmark; finance LLMs need falsifiable forecasts, not prose confidence.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Manifold-Guided Attention Steering

The paper proposes MAGS, an inference-time intervention that monitors attention-head deviation from a learned correctness manifold and applies projection correction after a learned threshold is exceeded. It reports gains over unsteered and static-steering baselines on MATH-500, GSM8K, HumanEval, MBPP, and SMILES.
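The monitor-then-correct loop can be illustrated with a toy version in which the "manifold" is a linear subspace with an orthonormal basis. That simplification, the fixed threshold, and the function names are all assumptions; the summary does not describe how MAGS parameterizes its manifold or learns its threshold.

```python
import math


def project_onto_subspace(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    proj = [0.0] * len(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        for i, bi in enumerate(b):
            proj[i] += coeff * bi
    return proj


def deviation(x, basis):
    """Distance from x to the subspace (norm of the residual)."""
    proj = project_onto_subspace(x, basis)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, proj)))


def steer(x, basis, threshold):
    """Leave x alone while it stays near the subspace; project it back once it drifts too far."""
    if deviation(x, basis) > threshold:
        return project_onto_subspace(x, basis)
    return list(x)


basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # toy 2-D subspace of a 3-D activation space
corrected = steer([1.0, 2.0, 3.0], basis, 1.0)   # deviation 3.0 > 1.0, so it is projected
untouched = steer([1.0, 2.0, 0.5], basis, 1.0)   # deviation 0.5 <= 1.0, so it is unchanged
```

The difference from static steering is the conditional: the correction only fires when the monitored deviation crosses the threshold.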

#Reasoning#Code#Inference-opt#Research release

why featured

HKR-H/K pass: MAGS offers an inference-time attention repair mechanism and names MATH-500, GSM8K, HumanEval, MBPP, and SMILES. No gains, overhead, or reproducible deployment setup are disclosed, so it stays in the normal research band.

editor take

MAGS covers 5 benchmarks; gains are undisclosed. I buy trajectory-aware steering, not “general correctness manifolds.”

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

The paper models keep-or-drop decisions for each observed knowledge-graph triple as a Q-learning problem, and on RoomKG with long-term memory capacity 128, learned transfer policies outperform symbolic baselines plus LSTM and Transformer history baselines.
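As a reminder of what "a Q-learning problem" means here, the sketch below applies the standard tabular Q-learning update to a two-action keep-or-drop decision. Using the raw triple as the state and the reward value shown are assumptions for illustration; the paper's state encoding, reward, and function approximator are not given in the summary.

```python
from collections import defaultdict

KEEP, DROP = 0, 1


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


def greedy_action(q, state):
    """Keep the triple when keeping has at least as much estimated value as dropping."""
    return KEEP if q[state][KEEP] >= q[state][DROP] else DROP


q_table = defaultdict(lambda: [0.0, 0.0])          # one [Q_keep, Q_drop] pair per state
state = ("agent", "at_location", "kitchen")        # hypothetical observed triple
next_state = ("agent", "at_location", "hall")
updated = q_update(q_table, state, KEEP, reward=1.0, next_state=next_state, alpha=0.5)
# updated == 0.5, so the greedy policy now keeps this triple
```

With a long-term capacity limit such as the 128 slots in the summary, each keep decision has an opportunity cost, which is what makes the value estimates worth learning.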

#Agent#Memory#Reasoning#arXiv

why featured

HKR-K/R pass: the paper gives a testable Q-learning memory rule and RoomKG capacity-128 setting, relevant to agent memory. HKR-H is weak; single arXiv paper with no artifact or deployment keeps it in 60–71.

editor take

On RoomKG at capacity 128, learned transfer policies beat LSTM/Transformer baselines; I buy the direction, but one benchmark is too thin for agent memory claims.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

The paper introduces LLR, a layerwise learning-rate scheme for Transformer training, and reports up to 1.5x training speedup on 60M-1B parameter models while raising average zero-shot accuracy from 47.09% to 49.02%.

#Fine-tuning#Inference-opt#Benchmarking#arXiv

why featured

HKR-K has a concrete method and numbers; HKR-R touches training efficiency and cost. The 60M-1B scope makes it an incremental research item, below featured.

editor take

LLR reports 1.5x speedups at 60M-1B; I buy the recipe, but don’t extrapolate it to 7B yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→VRPRM: Process Reward Modeling via Visual Reasoning

VRPRM trains a process reward model with 3.6K CoT-PRM SFT examples and 50K non-CoT PRM RL examples, surpassing a non-thinking PRM trained on 400K total examples and reaching up to 118% relative improvement over the base model in the BoN experiment.

#Reasoning#Vision#Fine-tuning#VRPRM

why featured

HKR-K is clear and HKR-H comes from the 118% BoN gain, but this is still an arXiv methods paper. With no disclosed open-source artifact, benchmark detail, or production claim, it fits 60–71.

editor take

VRPRM beats a 400K-example PRM with 53.6K samples; I buy the data efficiency, not the “new paradigm” label.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→One-Way Policy Optimization for Self-Evolving LLMs

The paper proposes OWPO for RLVR, decoupling verifier-driven update direction from reference-policy update magnitude and using iterative reference updates to create a Ratchet Effect; the abstract says OWPO outperforms DAPO, OPD, and MOPD, but the RSS snippet does not disclose benchmark scores.

#Reasoning#Alignment#Fine-tuning#Research release

why featured

HKR-H/K/R pass, but this is still a single arXiv methods paper. The post names OWPO and the Ratchet Effect, yet gives no concrete scores against DAPO, OPD, or MOPD, so it stays in the 60–71 band.

editor take

OWPO turns RLVR constraints into a one-way ratchet; scores are undisclosed, so don’t buy the self-evolution pitch yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

MapTab evaluates 15 MLLMs with 328 images, 196,800 route-planning queries, and 3,936 QA queries, requiring models to combine map visuals with tabular route attributes under four criteria: time, price, comfort, and reliability.

#Multimodal#Vision#Reasoning#MapTab

why featured

HKR-K is strong via concrete benchmark scale, and HKR-R is present on deployment reliability. No key results or major model impact are disclosed, so this stays in the 60-71 research-release band.

editor take

MapTab tests 15 MLLMs on 196,800 queries; multimodal collaboration losing to unimodal baselines is the sting.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Holder Policy Optimisation

HölderPO unifies token-level probability aggregation with the Hölder mean and schedules p through dynamic annealing, reaching 54.9% average accuracy across mathematical benchmarks, a 7.2% relative gain over standard GRPO, and 93.8% success on ALFWorld.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K/R pass: the summary gives a concrete mechanism and benchmark gain, and it connects to GRPO post-training debates. HKR-H is weak, and this is a single arXiv method paper with no disclosed code or major lab adoption.

editor take

HölderPO hits a 54.9% math average, a 7.2% relative gain over GRPO; I buy p-annealing, but the undisclosed base model and compute cap the claim.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
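
For readers who have not met the Hölder (power) mean that the HölderPO summary above refers to, here is a minimal sketch of the aggregation. The function names, the linear schedule, and its endpoints are illustrative assumptions; the summary says only that p is annealed dynamically and does not give the paper's schedule or loss.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of strictly positive values.

    p = 1 is the arithmetic mean, p -> 0 the geometric mean,
    p = -1 the harmonic mean.
    """
    if not probs or any(x <= 0 for x in probs):
        raise ValueError("probs must be non-empty and strictly positive")
    if abs(p) < 1e-12:
        # Limit case p -> 0: geometric mean.
        return math.exp(sum(math.log(x) for x in probs) / len(probs))
    return (sum(x ** p for x in probs) / len(probs)) ** (1.0 / p)


def annealed_p(step, total_steps, p_start=1.0, p_end=-1.0):
    """Hypothetical linear schedule for p (not taken from the paper)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


token_probs = [0.9, 0.5, 0.1]
for step in (0, 50, 100):
    p = annealed_p(step, 100)
    print(step, round(p, 2), round(holder_mean(token_probs, p), 4))
```

The power mean is non-decreasing in p, so lowering p pulls the sequence-level aggregate toward the least likely token; that is the knob an annealing schedule would turn.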

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Research paper introduces tokenizer construction via convex relaxation

The paper introduces ConvexTok, a tokenizer-construction algorithm that formulates vocabulary selection as a linear program; experiments report better intrinsic tokenization metrics and language-model bits-per-byte, with common vocabulary sizes within 1% of the certified objective optimum.

#Inference-opt#Benchmarking#ConvexTok#Research release

why featured

Single arXiv methods paper. HKR-K is clear: ConvexTok builds tokenizers via linear programming and reports results within 1% of the certified optimum at common vocab sizes. HKR-H/R are weak: no adoption, release artifact, or cost test.

editor take

ConvexTok reports within 1% of optimum at common vocab sizes; the certified bound is the pitch, not another BPE-killer story.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

RGoT uses reinforcement learning to generate Graph of Thoughts operation graphs from a human-defined operation set; the paper reports adaptive graph construction under specified constraints, but the RSS snippet does not disclose benchmarks, datasets, or quantitative gains.

#Reasoning#Agent#Research release

why featured

HKR-H and HKR-K pass because the paper proposes RL-built GoT operation graphs. No benchmark numbers, code artifact, or production replacement claim is disclosed, so it stays in the 60–71 research-release band.

editor take

RGoT uses RL to generate GoT operation graphs; no benchmarks or gains disclosed, so I file it under prompt search.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

SiameseNorm uses a two-stream architecture to couple Pre-Norm-like and Post-Norm-like paths through shared residual blocks, and experiments cover 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers while reporting stable training and performance gains.

#Reasoning#Vision#Inference-opt#Qwen

why featured

HKR-K is solid: SiameseNorm’s mechanism and scale coverage are concrete. HKR-R is limited to training stability and cost; HKR-H is weak, so the niche architecture paper stays in all.

editor take

SiameseNorm spans 400M, 1.3B, and 15B MoE; I buy it—Pre/Post-Norm finally looks engineered, not ritualized.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Can Transformers Learn to Verify During Backtracking Search?

The paper tests SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. SSA emits identical decisions for same-state pairs with different histories, while a causal baseline trained on cumulative traces conditions on trajectory history.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K/R pass: 3-SAT, graph coloring, Blocks World, and backtracking parsing give concrete test conditions tied to reasoning reliability. No major lab release, product impact, or cross-source attention keeps it in the mid research band.

editor take

SSA removes history entanglement across 4 backtracking tasks; I buy the diagnosis—causal trace training contaminates state.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

The paper trains a curiosity-driven agent with online 3D reconstruction and an RGB sequence policy; after curiosity-only training on HM3D, it generalizes zero-shot to Gibson and AI-generated worlds and outperforms RL-based active mapping baselines.

#Agent#Robotics#Vision#HM3D

why featured

HKR-H and HKR-K pass: the paper has a zero-shot transfer hook and a concrete persistent-world mechanism. It remains a single embodied-AI research release, with no production replacement claim, so it stays in 60–71.

editor take

The agent trains curiosity-only on HM3D and zero-shots to Gibson; no metrics in abstract, so hold the hype.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Resting Neurons, Active Insights: Robustifying Activation Sparsity in LLMs via Spontaneity

The paper introduces SPON, a lightweight mechanism that adds a small set of learnable, input-independent activation vectors as anchors for sparse LLM computation; after distribution-matching training, the vectors can be absorbed into bias terms, while the RSS snippet does not disclose exact model counts or benchmark numbers.

#Inference-opt#Research release

why featured

HKR-H/K/R all pass lightly: the mechanism is new and tied to inference cost, but model counts and metrics are not disclosed. This fits the 60–71 research-release band.

editor take

SPON adds input-independent anchors, then folds them into bias; RSS gives no model counts or scores, so don’t buy the high-sparsity claim yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

The paper proposes GCPO, combining geometry-aware measures and reward-based calibration to regulate gradient variance in GRPO-style post-training; the abstract says experiments on multiple benchmarks improved post-training performance, but the RSS snippet does not disclose specific scores.

#Reasoning#Alignment#Fine-tuning#Research release

why featured

HKR-H and HKR-K pass: the title has a contrarian hook and the method is concrete. HKR-R misses because no benchmark numbers, artifact, or practitioner cost impact is disclosed.

editor take

GCPO targets GRPO gradient variance, but RSS gives no scores; I buy the problem framing, not the “consistent gains” yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

The paper evaluates teacher-token reliability for reasoning distillation with a branch-viability diagnostic on Qwen3-4B, where an oriented position score reaches 0.83 AUROC versus at most 0.57 for local uncertainty, and PW-OPSD improves AIME 2024 and 2025 Avg@12 by 1.0 and 1.1 points.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-K passes with a testable mechanism and AIME gains; HKR-H and HKR-R are weak. This is useful reasoning-distillation research, but narrow for the broader AI-practitioner feed, so it fits the 60–71 band.

editor take

On Qwen3-4B, the position score reaches 0.83 AUROC; local uncertainty tops out at 0.57, so token distillation gets less hand-wavy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

MDM-Prime-v2 uses Binary Encoding and Index Shuffling at 1.1B parameters and reports higher average zero-shot accuracy across eight commonsense reasoning benchmarks than GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.

#Reasoning#Benchmarking#MDM-Prime-v2#GPT-Neo

why featured

HKR-H and HKR-K pass: the title has a concrete architecture hook and the summary gives 1.1B plus 8 benchmarks. HKR-R is weak; no reproducible setup or engineering payoff is disclosed, so this stays in the all band.

editor take

MDM-Prime-v2 posts the best average zero-shot accuracy across eight commonsense benchmarks at 1.1B; diffusion LMs are still alive, and Binary Encoding is the sharp bit.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Bug or Feature²: Weight Drift, Activation Sparsity and Spikes

The paper proves that MSE or cross-entropy induces negative weight drift at initialization, and across 79 configurations reports up to 90% activation sparsity in GPT-nano with a sharp accuracy cliff above about 70% sparsity.

#Interpretability#Benchmarking#On-Point-RND#Research release

why featured

HKR-H/K pass: the anomaly hook and testable numbers are clear. HKR-R is weaker because the evidence is GPT-nano-scale training dynamics, with no large-model or production-pipeline impact shown.

editor take

The paper pins the sparsity cliff near 70% across 79 configs; ReLU² needs clipping before it deserves trust.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

The paper proposes Near-boundary Stochastic Rescue, a plug-in change for RLVR that stochastically keeps slightly out-of-bound tokens near the clipping threshold and reports improved training stability against DAPO and GSPO across 7B to 30B dense and MoE models.

#Reasoning#Alignment#Fine-tuning#arXiv

why featured

HKR-K is solid: a testable RLVR clipping fix is evaluated on 7B-30B dense/MoE models against DAPO and GSPO. HKR-R is narrow; no production impact or artifact is disclosed, so it stays in all.

editor take

NSR keeps near-threshold tokens across 7B–30B models; I buy the angle: RLVR stability lives in the clipping details.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Token-Level LLM Collaboration via FusionRoute

FusionRoute uses a lightweight router to select an expert at each decoding step and add a complementary logit to adjust the next-token distribution; the paper evaluates it across Llama-3, Gemma-2, and benchmarks for math reasoning, code generation, and instruction following.

#Reasoning#Code#Inference-opt#Llama

why featured

This is an engineering-leaning arXiv paper with HKR-H/K: concrete routing and logit-correction mechanisms across math, code, and instruction tests. No result numbers, latency/cost data, or code availability are disclosed, so it stays in all.

editor take

FusionRoute routes every token and adds a complementary logit; without latency and cost wins, it just moves MoE tax to inference.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
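
The FusionRoute summary above describes one decoding step: a router picks an expert, then adds a complementary logit before sampling. A minimal sketch of that step follows, assuming argmax expert selection and per-token complementary logits of vocabulary size; both choices, and every name here, are illustrative rather than the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_next_token_probs(expert_logits, router_scores, complementary_logits):
    """One hypothetical decoding step.

    expert_logits: one logit vector per expert (same vocabulary).
    router_scores: one score per expert; the highest-scoring expert is used.
    complementary_logits: router-produced correction added to the chosen logits.
    """
    chosen = max(range(len(router_scores)), key=lambda i: router_scores[i])
    fused = [e + c for e, c in zip(expert_logits[chosen], complementary_logits)]
    return chosen, softmax(fused)


chosen, probs = fused_next_token_probs(
    expert_logits=[[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    router_scores=[0.2, 0.8],
    complementary_logits=[0.0, 0.0, 3.0],
)
print(chosen, [round(p, 3) for p in probs])
```

Note that the router runs on every token, which is where the latency question in the editor take comes from.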

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Hierarchical Variational Policies for Reward-Guided Diffusion

The paper proposes hierarchical variational policies that amortize diffusion test-time control into a stochastic policy; on 4x super-resolution, the method reports better perceptual quality and more than 5x faster inference than the best-performing baseline.

#Inference-opt#Research release

why featured

HKR-H/K pass: the 5x inference speedup and hierarchical variational policy mechanism are concrete. HKR-R is weak; this is a technical arXiv paper without code, model scale, or production evidence, so it stays in 60–71.

editor take

HVP beats the best 4x super-resolution baseline on perceptual quality and runs more than 5x faster; this smells like practical diffusion compute savings.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Predicting Performance of Symbolic and Prompt Programs with Examples

The paper models program performance as a Bernoulli success probability from observed pass/fail examples and a prior, comparing symbolic programs such as Python with LLM prompt programs and proposing RAP to retrieve similar tasks and prompts for an approximate prior.

#Reasoning#Code#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers a testable model for performance prediction and maps to prompt/code generalization pain. HKR-H is weak, and a single arXiv paper without large-scale production impact stays in 60–71.

editor take

RAP estimates priors via similar tasks; corpus size is undisclosed, and few prompt passes still do not buy reliability.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
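
The summary above frames performance as a Bernoulli success probability estimated from pass/fail examples plus a prior. A minimal sketch of that idea with a conjugate Beta prior follows; the Beta form, the parameter names, and the default pseudo-count are assumptions for illustration, and RAP's retrieval step (which would supply the prior) is not modeled.

```python
def posterior_success(passes, fails, prior_mean=0.5, prior_strength=2.0):
    """Posterior mean success rate under a Beta prior.

    prior_mean and prior_strength (a pseudo-count) are hypothetical knobs:
    a retrieval-based prior would set them from similar tasks and prompts.
    """
    if not 0.0 < prior_mean < 1.0 or prior_strength <= 0:
        raise ValueError("prior_mean must be in (0, 1) and prior_strength > 0")
    if passes < 0 or fails < 0:
        raise ValueError("counts must be non-negative")
    alpha = prior_mean * prior_strength + passes
    beta = (1.0 - prior_mean) * prior_strength + fails
    return alpha / (alpha + beta)


# Three passes out of three, under a weak uniform-like prior.
print(posterior_success(3, 0))
# Same evidence, but with a stronger, more optimistic prior.
print(posterior_success(3, 0, prior_mean=0.9, prior_strength=10.0))
```

With the weak prior, three clean passes still leave the estimate well short of certainty, which is the point of the editor take.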

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass maps CoT traces and visual tokens into a shared sparse concept space, uses a query codebook and slot mapper for heatmaps, and matches or exceeds state-of-the-art results on five benchmarks; the abstract does not disclose dataset names or metric values.

#Reasoning#Vision#Interpretability#SegCompass

why featured

HKR-K/R pass: the paper gives a concrete mechanism, 5-benchmark claim, and code release. HKR-H is weak, and SAE-based reasoning segmentation is research-niche with no product impact, so it stays in 60–71.

editor take

SegCompass claims SOTA parity on 5 benchmarks; no datasets or metrics in the snippet, so I buy the SAE hook, not the “white-box” label.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

MoRAM reframes continual learning as incremental accumulation of reusable rank-1 adapter memory units, replaces explicit MoE-LoRA routers with self-activation based on each unit’s intrinsic key, and reports stronger plasticity-stability trade-offs, generalization, and reduced forgetting in experiments on CLIP and LLMs; the abstract does not disclose dataset names or exact scores.

#Fine-tuning#Memory#Benchmarking#MoRAM

why featured

HKR-K/R pass: the mechanism is specific and targets continual-learning forgetting. HKR-H is weak, and the source gives summary-level claims without benchmark numbers, keeping it in the upper normal research-release band.

editor take

MoRAM swaps MoE-LoRA routing for rank-1 memory self-activation; scores and datasets aren’t disclosed, but the anti-forgetting bet is clean.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

UniSD evaluates self-distillation across six benchmarks, six models, and three model families; its UniSDfull pipeline improves over the base model by 5.4 points and over the strongest baseline by 2.8 points without using stronger external teachers.

#Fine-tuning#Alignment#Benchmarking#UniSD

why featured

HKR-K is supported by cross-model benchmarks and concrete gains; HKR-R comes from fine-tuning cost/performance relevance. Still, this is a normal arXiv method paper without product-level impact or broad industry heat.

editor take

UniSDfull gains 5.4 points over the base model across 6 benchmarks; self-distillation looks like an engineering recipe, but cost is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Calibrating LLMs with Semantic-level Reward

Fengfei Yu and coauthors propose Calibration with Semantic Reward, which combines correctness reward with semantic calibration reward; across three model families and HotpotQA, TriviaQA, MSMARCO, and NQ-Open, CSR reduces ECE by up to 40% and improves AUROC by up to 31% over verbalized-confidence baselines.

#Alignment#Fine-tuning#Benchmarking#Fengfei Yu

why featured

HKR-K and HKR-R pass: the paper gives a method, test scope, and ECE/AUROC gains, and it maps to LLM reliability. HKR-H is weak, and a single arXiv paper without product impact stays in all.

editor take

CSR cuts ECE by up to 40% across 3 model families and 4 QA sets; semantic consistency beats confidence theater.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
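
Since the CSR headline number is an ECE reduction, a minimal sketch of how expected calibration error is commonly computed may help. Ten equal-width bins is a common convention assumed here; the summary does not state the paper's binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy data: the model says 0.9 four times and is right three times.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Lower is better: a model that says 0.9 and is right 90% of the time contributes nothing to the sum.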

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

RTPrune prunes visual tokens for DeepSeek-OCR-Large using a two-stage scheme: high-norm token selection, then optimal-transport merging, achieving 99.47% accuracy and 1.23× faster prefill on OmniDocBench with 84.25% token retention.

#Vision#Inference-opt#Benchmarking#DeepSeek

why featured

HKR-K is clear via measured retention, accuracy, and speedup; HKR-R lands on inference cost. HKR-H is weak, and this is a niche OCR token-pruning paper, not a product or framework-level release.

editor take

RTPrune keeps 84.25% of visual tokens for 99.47% accuracy; the 1.23× prefill speedup is modest, but OCR-safe pruning is credible.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

The paper establishes a mathematical correspondence between decision trees and diffusion processes and proposes GTSM; TreeFlow reports a 2x computational speedup for tabular generation, while DSMTree matches teacher performance within 2% on many benchmarks.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K pass: the paper has a novel tree-to-diffusion angle and testable numbers, including 2x speedup and a 2% teacher gap. It stays in all because this is a single arXiv paper with narrow practitioner reach.

editor take

TreeFlow claims 2x faster tabular generation; I buy the correspondence, not the quality claim without benchmark details.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

EmoTrack predicts PHQ-8 scores from counseling transcripts using LLM-extracted clinical signals, frozen turn-level semantic embeddings, and compact cross-session memory; on DAIC-WOZ, it reduces MAE by 13.5% relative to the strongest baseline and remains competitive with the strongest longitudinal baseline on LongCounsel.

#Embedding#Memory#Fine-tuning#EmoTrack

why featured

HKR-K is clear via the mechanism and 13.5% MAE claim; HKR-R comes from mental-health sensitivity. As a single clinical prediction paper without product, open-source, or broad adoption signal, it stays in the 60-71 band.

editor take

EmoTrack cuts DAIC-WOZ MAE by 13.5%; don't ship this clinically yet, since LongCounsel labeling and generalization details are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SepsisAI Orchestrator: A Containerized Platform for Early Sepsis Detection AI Deployment

SepsisAI-Orchestrator is an open-source clinical AI deployment platform; on a 12-thread CPU, scaling from 3 to 12 replicas reduced p95 latency from 3.3 seconds to 1.41 seconds and eliminated request failures.

#Inference-opt#Tools#SepsisAI-Orchestrator#PhysioNet

why featured

HKR-K and HKR-R pass: the paper gives reproducible deployment conditions and latency numbers, and maps to production MLOps reliability. The clinical niche keeps it in the 60–71 band.

editor take

SepsisAI-Orchestrator hit 1.41s p95 on a 12-thread CPU; don’t sell it as clinical progress, it’s deployment plumbing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

AVMP separates KV caches and SSM states into distinct physical pools behind one virtual address space, then migrates capacity only on allocation failure; on an RTX 3060 12GB, it cuts Out-of-Memory events by 7.6% and improves synthetic workload throughput by 1.83x to 13.3x, with 2.36x on ShareGPT trace replay.

#Inference-opt#Jamba#ShareGPT#Research release

why featured

HKR-K passes via the AVMP pool split and the 1.83-13.3x RTX 3060 result; HKR-R passes on inference memory/cost pressure. HKR-H fails because this is a niche systems paper, and technical accessibility limits it to all.

editor take

AVMP posts 1.83–13.3x on RTX 3060 12GB; pure Python without Triton makes this allocator logic, not production proof.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
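
The AVMP summary above describes two physical pools whose capacity moves only when an allocation fails. A toy sketch of that policy follows, counting capacity in abstract blocks. The class, its method names, and the block accounting are hypothetical, and the shared virtual address space is not modeled at all.

```python
class TwoPoolAllocator:
    """Toy model: a KV-cache pool and an SSM-state pool with lazy migration."""

    def __init__(self, kv_blocks, ssm_blocks):
        self.capacity = {"kv": kv_blocks, "ssm": ssm_blocks}
        self.used = {"kv": 0, "ssm": 0}
        self.migrations = 0

    def free(self, pool):
        return self.capacity[pool] - self.used[pool]

    def allocate(self, pool, blocks):
        """Allocate from `pool`; borrow free capacity from the other pool
        only if the request would otherwise fail."""
        if blocks < 0:
            raise ValueError("blocks must be non-negative")
        other = "ssm" if pool == "kv" else "kv"
        shortfall = blocks - self.free(pool)
        if shortfall > 0:
            if self.free(other) < shortfall:
                raise MemoryError(f"out of memory in pool {pool!r}")
            # Migrate just enough capacity to satisfy this request.
            self.capacity[other] -= shortfall
            self.capacity[pool] += shortfall
            self.migrations += 1
        self.used[pool] += blocks

    def release(self, pool, blocks):
        self.used[pool] = max(0, self.used[pool] - blocks)


alloc = TwoPoolAllocator(kv_blocks=4, ssm_blocks=4)
alloc.allocate("kv", 3)   # fits, no migration
alloc.allocate("kv", 3)   # short by 2 blocks, borrows from the SSM pool
print(alloc.capacity, alloc.used, alloc.migrations)
```

Migrating only on failure keeps the common path free of rebalancing work, at the cost of paying for the move at the moment memory is tightest.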

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→ASAP: Attention Sink Anchored Pruning

ASAP models ViT information flow as a Lazy Random Walk, clusters tokens by diffusion distance to the attention sink in the cumulative transition matrix, and reports up to 48% throughput acceleration while maintaining or exceeding baseline accuracy.

#Vision#Inference-opt#Multimodal#ASAP

why featured

HKR-K is solid via the mechanism and 48% throughput claim; HKR-R is limited to ViT deployment teams. Single arXiv paper plus technical narrowness keeps it in the 60-71 band.

editor take

ASAP reports up to 48% ViT throughput gains. Using attention sinks as anchors is clever, but the RSS snippet names no models or input resolutions.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Provable Joint Decontamination for Benchmarking Multiple Large Language Models

The paper proposes Joint Envelope Conformal Selection, using per-model conformal p-values, per-item maximum aggregation, and adaptive Benjamini-Hochberg to select a shared benchmark with provable global contamination rate control under stated assumptions.

#Benchmarking#Research release#Benchmark

why featured

HKR-K/R pass: JECS, conformal p-values, and adaptive BH give a testable mechanism for benchmark contamination. HKR-H misses; the arXiv summary is stats-heavy and lacks model lists or scale numbers.

editor take

JECS controls global contamination via max-p plus adaptive BH; I like that it forces multi-model eval back onto one shared test.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
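
The three ingredients named in the JECS summary above (per-model conformal p-values, per-item max aggregation, Benjamini-Hochberg) can be sketched as follows. This uses the standard, non-adaptive BH procedure, not the paper's adaptive variant, and it assumes each item's null hypothesis is "contaminated for this model", so a small p-value counts as evidence that the item is clean. The function names and score direction are illustrative.

```python
def conformal_p_value(calibration_scores, test_score):
    """Conformal p-value: (1 + #{calibration >= test}) / (n + 1)."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (1 + at_least) / (n + 1)


def benjamini_hochberg(p_values, alpha=0.1):
    """Indices rejected by the standard Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


def select_shared_items(per_model_p, alpha=0.1):
    """per_model_p[m][j] is item j's p-value under model m.

    Taking the max over models means an item is selected only when
    every model's p-value is small.
    """
    n_items = len(per_model_p[0])
    max_p = [max(model[j] for model in per_model_p) for j in range(n_items)]
    return benjamini_hochberg(max_p, alpha)


print(select_shared_items([[0.01, 0.02, 0.9], [0.02, 0.6, 0.01]], alpha=0.1))
```

The max is conservative by design: one model that looks contaminated on an item is enough to keep that item out of the shared benchmark.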

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

The paper proposes LMDMs, a modification to block-wise diffusion music generation that uses block-wise KV caching to reduce inference complexity, applies ARC-Forcing for post-training alignment without RL or reward models, and demonstrates local live use on a consumer gaming laptop.

#Audio#Fine-tuning#Inference-opt#LMDMs

why featured

HKR-K passes via concrete mechanisms, but HKR-H and HKR-R miss: the post gives no metrics, code, or product path, so this stays in the 60–71 research-signal band.

editor take

LMDMs run locally via block-wise KV caching; I buy the latency angle, but ARC-Forcing quality gains need numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→How Many Different Outputs Can a Transformer Generate?

The paper uses a small set of Transformer architecture features to predict how many distinct sequences it can output, giving an upper bound tied to prompt length and empirically tight within a factor below 10. It proves accessible sequence length grows linearly with prompt length, while accessible sequence share decays exponentially beyond a critical threshold.

#Reasoning#Benchmarking#Interpretability#Research release

why featured

HKR-H and HKR-K pass: the setup is clickable and the summary gives bounds, error, and decay mechanics. HKR-R is weak because this is theory-heavy expressivity work with little product or engineering stake.

editor take

The paper bounds output diversity within 10x; unbounded context still fails copying, so the cut lands on architecture capacity.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

The paper evaluates Algebraic Machine Learning on image and tabular classification with 50–2000 training examples; AML beats cross-validated baselines including CNNs on small-to-medium image datasets, while XGBoost remains the overall best method on tabular datasets.

#Benchmarking#Algebraic Machine Learning#XGBoost#LightGBM

why featured

HKR-H and HKR-K pass on the small-data benchmark split, but HKR-R is weak. A single arXiv baseline paper with narrow method impact belongs in the 60–71 band, not featured.

editor take

AML beats cross-validated CNNs at 50–2000 image samples; I buy the niche, but tabular still belongs to XGBoost.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

PEARL estimates percentile-based preference signals with real contrastive interaction samples; in production A/B tests on a livestream platform with billions of users, it increased Watch Duration by 2.10% and Consumption Amount by 0.80%.

#RAG#Embedding#Benchmarking#PEARL

why featured

HKR-K/R pass: the paper gives industrial A/B numbers and a concrete preference-estimation mechanism. HKR-H fails because the angle is specialized recommender-system research, so it stays in the 60–71 band.

editor take

PEARL lifts watch time 2.10% in billion-user livestream A/B; I buy relative preference modeling, but +0.80% spend is no silver bullet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection

HIDBench evaluates LLMs for host-based intrusion detection using three public system-log datasets, DARPA-E3, DARPA-E5, and NodLink; many models exceed 0.8 precision on simpler datasets, but MCC often drops below 0.5 as logs become noisier and more complex.

#Reasoning#Benchmarking#HIDBench#DARPA-E3

why featured

HKR-K and HKR-R pass: the item gives a new benchmark, three public log datasets, and a testable MCC<0.5 result. The host-intrusion niche adds technical-accessibility drag, so it stays in all.

editor take

HIDBench tests HIDS on 3 public log sets; MCC often falls below 0.5 under noise, so LLM agents are not replacing SIEMs yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
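
The HIDBench gap between precision above 0.8 and MCC below 0.5 is easy to reproduce on paper. A minimal sketch with made-up confusion counts (not HIDBench data) for a log set where attacks are rare and most are missed:

```python
import math


def precision(tp, fp):
    """Share of flagged events that are true attacks."""
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative counts only: 200 attacks in 10,000 events, 40 caught.
tp, fp, fn, tn = 40, 8, 160, 9792
print(round(precision(tp, fp), 3), round(mcc(tp, fp, fn, tn), 3))
```

Precision ignores the missed attacks entirely, while MCC uses all four cells of the confusion matrix, so a detector that rarely fires can look precise and still correlate weakly with the truth.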

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SceneAligner: 3D-Grounded Floorplan Localization in the Wild

SceneAligner reconstructs an unconstrained image collection into a gravity-aligned 3D scene, projects it into a 2D density-map floorplan proxy, and aligns it with a raster floorplan using a 2D similarity transform; the paper reports experiments in sparse settings with as little as one input image, while code and data are marked for public release.

#Vision#Fine-tuning#SceneAligner#Research release

why featured

HKR-H/K pass: the one-image floorplan-localization setup is concrete and testable. HKR-R is weak because this is niche 3D vision research with no product or benchmark impact disclosed, and code and data are only marked for public release.

editor take

SceneAligner tests even 1 input image; the raster-floorplan fit is useful, but success rates and building scale are undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

The paper proposes ShaPO, a geometry-aware preference optimization framework that constrains alignment-critical parameter subspaces and applies token-level and reward-level variants to improve safety robustness under noisy preference supervision and distribution shift.

#Alignment#Safety#ShaPO#Research release

why featured

HKR-K and HKR-R pass, but the feed gives abstract-level detail only: no metrics, artifact link, or reproducible setup. The technical framing keeps it in the 60-71 research-release band.

editor take

ShaPO constrains alignment-critical subspaces; model scale is undisclosed. I buy the geometry angle, but replication beats the label.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→ECPO paper introduces evidence-coupled policy optimization for candidate ranking

The paper introduces ECPO for evidence-certified candidate ranking on MAVEN-ERE and RAMS, requiring each Top-K output to include doc_id:span evidence certificates whose cited spans can reconstruct the decision under closed-, predicted-, and hybrid-roster settings.

#RAG#Reasoning#Benchmarking#MAVEN-ERE

why featured

HKR-K and HKR-R pass: the paper gives a concrete evidence-certificate mechanism and MAVEN-ERE/RAMS evaluation setting. HKR-H is weak, and this remains a single arXiv methods paper, so it stays in the interesting band.

editor take

ECPO binds Top-K ranking to doc_id:span certificates; good RAG eval pressure, and a direct hit on post-hoc citation theater.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→BEiTScore: Reference-Free Image Captioning Evaluation with an Efficient Cross-Encoder Model

BEiTScore evaluates image caption quality with a lightweight cross-encoder initialized from a VQA checkpoint, uses adversarial LLM-based data augmentations during supervised training, and introduces one benchmark for detailed caption evaluation across diverse scenarios.

#Vision#Multimodal#Benchmarking#BEiTScore

why featured

HKR-K passes with a concrete method, training mechanism, and benchmark; HKR-H is weak and HKR-R is limited to multimodal-eval specialists. This is useful research signal, not a product or industry-level event, so it sits in the 60-71 band.

editor take

BEiTScore uses a VQA-initialized cross-encoder for caption scoring; no efficiency numbers, so I don't buy the SOTA-plus-cheap claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Vendi Novelty Scores for Out-of-Distribution Detection

The paper introduces Vendi Novelty Score for OOD detection, measuring how much a test sample increases the in-distribution set’s Vendi Score, and reports state-of-the-art results across image benchmarks while retaining performance with only 1% of training data.

#Safety#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with a concrete mechanism and 1% training-data condition; HKR-R links OOD to deployment reliability. HKR-H is weak, and this is a single arXiv method paper, not a product or industry event.

editor take

VNS reports SOTA OOD detection and keeps its performance with 1% of the training data; I like the angle, but the snippet gives no benchmark table.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
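A rough sketch of the Vendi Novelty idea from the item above: score a test sample by how much it raises the in-distribution set's Vendi Score. This is not the paper's implementation. It assumes a cosine kernel over unit-normalized feature vectors and uses the order-2 Vendi Score, which reduces to n² divided by the sum of squared kernel entries and so avoids an eigendecomposition. The function names and the toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vendi_score_q2(vectors):
    """Order-2 Vendi Score with a cosine kernel: n^2 / sum_ij K_ij^2."""
    vs = [unit(v) for v in vectors]
    n = len(vs)
    total = 0.0
    for a in vs:
        for b in vs:
            k = sum(x * y for x, y in zip(a, b))  # cosine similarity
            total += k * k
    return n * n / total


def vendi_novelty(reference, sample):
    """Assumed novelty: score of (reference + sample) minus score of reference."""
    return vendi_score_q2(reference + [sample]) - vendi_score_q2(reference)


# Toy in-distribution set: two features pointing the same way.
reference = [[1.0, 0.0], [2.0, 0.0]]
near = vendi_novelty(reference, [3.0, 0.0])  # same direction, adds no diversity
far = vendi_novelty(reference, [0.0, 1.0])   # orthogonal direction, adds diversity
```

A sample that duplicates existing directions leaves the score unchanged, while an orthogonal one raises it; thresholding that difference is one plausible way to flag OOD inputs.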

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling

The paper introduces PMCTS, a parallel MCTS algorithm using particle-based search for neural network evaluations, and claims it preserves formal policy improvement guarantees while scaling with parallel compute.

#Reasoning#Inference-opt#Research release

why featured

HKR-K and HKR-R pass: the topic fits inference-time scaling and names a mechanism. No benchmarks, code, task setup, or gain numbers are disclosed, and the technical barrier keeps it in all.

editor take

PMCTS claims policy-improvement guarantees; domains and scaling curves are undisclosed, so don’t call it an AlphaZero moment yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Identifiable Token Correspondence for World Models

The paper introduces Identifiable Token Correspondence, a decoding step that frames next-frame prediction as structured assignment with latent token correspondence variables, and reports state-of-the-art results on 4 challenging benchmarks without changing the transformer architecture or training procedure.

#Reasoning#Robotics#Tools#SNU MLLAB

why featured

HKR-K passes with a concrete mechanism and 4-benchmark SOTA claim. HKR-H/R are weak, and the summary lacks code or reproducibility details, so this sits in the 60–71 research-release band.

editor take

ITC hits 72.5% return on Craftax-classic; a decode-only patch that smells like object permanence for token world models.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→F-TIS: Harnessing Diverse Models in Collaborative GRPO

The paper introduces F-TIS for collaborative GRPO with heterogeneous models, using filtered truncated importance sampling to train with off-policy samples; experiments report identical final convergence to purely on-policy training and up to a 12% performance gain on out-of-distribution tasks in some setups.

#Reasoning#Fine-tuning#Inference-opt#Research release

why featured

HKR-K is supported by the F-TIS mechanism and 12% OOD claim; HKR-R lands for reasoning-model fine-tuning costs. HKR-H is weak, and GRPO/off-policy depth keeps it in the lower research band.

editor take

F-TIS claims heterogeneous GRPO matches on-policy convergence and adds up to 12% OOD; I buy the mechanism, not the generalization yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
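For the F-TIS item above, a minimal sketch of what filtered truncated importance sampling could look like for off-policy samples. The summary does not give the filter rule or the truncation constant, so both are assumptions here: ratios below a hypothetical `floor` are dropped, and the rest are clipped at a hypothetical `cap`. The function names are illustrative only.

```python
import math


def filtered_truncated_weights(logp_learner, logp_behavior, cap=2.0, floor=0.1):
    """Per-sample importance weights for samples drawn from another model.

    ratio = pi_learner / pi_behavior. Samples with ratio < floor are filtered
    (weight 0.0, an assumed rule); the rest are truncated at cap.
    """
    weights = []
    for lp_new, lp_old in zip(logp_learner, logp_behavior):
        ratio = math.exp(lp_new - lp_old)
        if ratio < floor:
            weights.append(0.0)
        else:
            weights.append(min(ratio, cap))
    return weights


def weighted_objective(weights, advantages):
    """Mean of weight * advantage over the samples that survived the filter."""
    kept = [(w, a) for w, a in zip(weights, advantages) if w > 0.0]
    if not kept:
        return 0.0
    return sum(w * a for w, a in kept) / len(kept)


# Three samples: on-policy-like, over-weighted (ratio 4), far off-policy (ratio 0.01).
logp_learner = [0.0, 0.0, 0.0]
logp_behavior = [0.0, math.log(0.25), math.log(100.0)]
weights = filtered_truncated_weights(logp_learner, logp_behavior)
objective = weighted_objective(weights, [1.0, 0.5, -3.0])
```

Truncation bounds the variance from over-weighted samples, and the filter keeps samples the learner would almost never produce from steering the update.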

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Long-term Fairness with Selective Labels

The paper studies long-term fairness under selective labels, introduces a framework combining observed data with a label predictor, and reports that its reinforcement learning algorithm reaches comparable fairness and performance to an oracle-label agent in semisynthetic environments.

#Alignment#Benchmarking#Research release#Safety/alignment

why featured

HKR-K/R pass via a concrete fairness mechanism and selective-label deployment relevance, but HKR-H fails. No code, real deployment, or benchmark impact is disclosed, so this stays in the interesting-but-not-featured band.

editor take

The paper plugs selective-label bias with a predictor; semisynthetic results approach oracle, but fairness rests on predictor confidence.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Temporal Contrastive Transformer for Financial Crime Detection

The paper introduces Temporal Contrastive Transformer, which learns transaction-sequence embeddings with a self-supervised contrastive objective; embeddings alone reach AUC 0.8644, while adding them to engineered features does not beat the 0.9245 baseline.

#Embedding#Benchmarking#Temporal Contrastive Transformer#Research release

why featured

HKR-K is strong and HKR-R is moderate: the paper gives a reproducible mechanism and AUCs, including a failed lift over a 0.9245 baseline. HKR-H is weak, and the domain is narrow, so it stays in the 60-71 all band.

editor take

TCT embeddings hit 0.8644 AUC alone, then 0.9205 with features; engineered baselines still win at 0.9245.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

The paper introduces partial fusion for neural networks, aggregating only the weights of the most similar neurons and using partial optimal transport to match them, so models can trade off ensemble computation cost against the lower accuracy of full weight aggregation.

#Inference-opt#Research release#Open source

why featured

HKR-K and HKR-R pass: the paper gives a partial-fusion mechanism and targets ensemble inference cost. HKR-H is weak, and no metrics or deployment proof are disclosed, so it stays in the 60–71 band.

editor take

Partial fusion merges only the closest neurons; no accuracy numbers in the abstract, so ensemble replacement depends on code reproducibility.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token uses a sparse projection matrix W for cross-tokenizer distillation, improving Llama-3.2-1B over GOLD by 3.82 average points with a Qwen3-4B teacher and adding 1.3 points in a two-teacher setup.

#Fine-tuning#Reasoning#Llama#Qwen

why featured

HKR-K and HKR-R pass: the paper gives a testable sparse-projection KD method and deltas. HKR-H is weak, and the impact is still a niche Llama-3.2-1B benchmark, not a broad product change.

editor take

X-Token beats GOLD by 3.82 points on Llama-3.2-1B; cross-tokenizer KD is finally fixing ugly digit-token failures.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Two is Better Than One: A Collapse-Free Multi-Reward RLIF Training Framework

The paper proposes a multi-reward RLIF framework for LLM training, combining cluster-voting answer rewards with token-wise self-certainty completion rewards; the RSS abstract says it improves stability across math reasoning and code-generation benchmarks but does not disclose specific benchmark scores.

#Reasoning#Code#Alignment#Research release

why featured

HKR-K/R pass: the mechanisms are concrete and relevant to post-training stability. HKR-H is weak, and the post discloses no benchmark scores or reproducible conditions, so it stays in the 60–71 research-release band.

editor take

This splits RLIF into dual rewards plus KL-Cov, but gives no scores; don’t buy “close to RLVR” without tables.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
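The two reward signals named in the RLIF item above can be sketched as follows. This is a simplified illustration under stated assumptions: answers are clustered by exact match after normalization, and the token-wise self-certainty reward is replaced by a plain mean of token probabilities, which is a stand-in and not the paper's definition. The `weight` parameter and all function names are hypothetical.

```python
from collections import Counter


def cluster_vote_rewards(answers):
    """Reward 1.0 for samples whose normalized answer is in the largest cluster."""
    keys = [a.strip().lower() for a in answers]
    winner, _ = Counter(keys).most_common(1)[0]
    return [1.0 if k == winner else 0.0 for k in keys]


def mean_token_confidence(token_probs):
    """Stand-in completion reward: average probability of the sampled tokens."""
    return sum(token_probs) / len(token_probs)


def combined_rewards(answers, token_probs_per_sample, weight=0.5):
    """Answer-level vote reward plus a weighted completion-level reward."""
    votes = cluster_vote_rewards(answers)
    return [
        v + weight * mean_token_confidence(p)
        for v, p in zip(votes, token_probs_per_sample)
    ]


answers = ["42", " 42", "41"]
token_probs = [[0.5, 0.5], [1.0, 1.0], [0.5, 1.0]]
votes = cluster_vote_rewards(answers)
rewards = combined_rewards(answers, token_probs)
```

Keeping the two signals separate means a confident but minority answer still loses the vote reward, which is one way a second signal can guard against collapse onto a single self-reinforcing reward.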

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

The paper introduces an MLLM training pipeline for safety-critical driving videos, fusing downsampled frames, synchronized IMU/GPS telematics, and specialized vision-model outputs, then fine-tunes QwenVL-2.5 with DoRA adapters using fewer than 50 million trainable parameters.

#Multimodal#Vision#Fine-tuning#QwenVL-2.5

why featured

HKR-K passes: the summary gives a sensor-fusion training pipeline and <50M trainable params. HKR-H and HKR-R are weak; this is a single applied arXiv paper, below featured threshold.

editor take

QwenVL-2.5 gets DoRA tuning under 50M parameters; I don’t buy “safety-critical” without disclosed crash-event recall.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Evaluation of Pipelines for Data Integration into Knowledge Graphs

The paper proposes KGI-Bench for evaluating knowledge graph data integration pipelines, using coverage, correctness, and consistency metrics to compare 12 pipelines on movie-domain datasets with three input formats.

#RAG#Benchmarking#Research release#Benchmark

why featured

A narrow evaluation paper with HKR-K: KGI-Bench, three metrics, and 12 pipelines give testable facts. HKR-H and HKR-R are weak, no hard exclusion applies, so it stays in the interesting-not-featured band.

editor take

KGI-Bench tests 12 movie-KG integration pipelines; for RAG memory, this plumbing benchmark beats another model leaderboard.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Attacking the Spike: Transferability and Security of SNNs to Adversarial Examples

The paper introduces the MDSE attack across CIFAR-10, CIFAR-100, ImageNet, and 19 classifier models, reporting up to 91.4% higher effectiveness on SNN/ViT ensembles and a 3x boost over Auto-PGD on adversarially trained SNN ensembles.

#Vision#Safety#Benchmarking#arXiv

why featured

HKR-H and HKR-K pass: the paper offers a named attack, model coverage, and concrete gains. HKR-R is weak because SNN adversarial transfer is a narrow research topic with no product deployment impact disclosed.

editor take

MDSE spans 3 datasets and 19 models; SNNs can’t hide behind spike dynamics when mixed gradient estimation breaks them.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→CoFEH: LLM-driven Feature Engineering with Collaborative Bayesian Hyperparameter Optimization

CoFEH interleaves LLM-based feature engineering with Bayesian hyperparameter optimization, using Tree of Thought and a mutual conditioning mechanism to share context between the LLM and BO modules; the abstract says it outperforms traditional and LLM baselines in standalone FE and joint FE+HPO settings, but the post does not disclose dataset counts or metric values.

#Agent#Reasoning#Tools#CoFEH

why featured

HKR-K passes: the alternating FE+Bayesian HPO setup is a testable mechanism for AutoML practitioners. HKR-H and HKR-R are weak, and the body lacks dataset count or gain size, so this stays in the normal research band.

editor take

CoFEH interleaves LLM feature engineering with Bayesian HPO; only the abstract is shown, no dataset count or metrics, so treat it as AutoML orchestration.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

The paper evaluates four IDS model families on UNSW-NB15, proves that multicollinearity inflates SHAP/LIME attribution variance, and proposes an Explainability Fragility Score plus two mitigations, CAA-Filtering and SHARP, using Kendall’s tau across bootstrapped explanations to quantify instability.

#Interpretability#Safety#Benchmarking#arXiv

why featured

HKR-K is solid and HKR-R is narrow; there is no product impact, cross-source cluster, or major model release. The IDS explainability focus keeps it in the lower 60–71 band.

editor take

The paper tests 4 IDS families on UNSW-NB15 but omits effect sizes; tying SHAP/LIME variance to multicollinearity hits a real security-XAI blind spot.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
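The instability measurement in the item above, Kendall's tau across bootstrapped explanations, can be sketched in a few lines. This is a generic illustration, not the paper's Explainability Fragility Score: it computes tau-a between per-feature attribution vectors and averages it over all pairs of bootstrap runs. The attribution numbers are made up, and the function names are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


def explanation_stability(attributions):
    """Mean pairwise tau over bootstrap runs; 1.0 means identical feature rankings."""
    taus = [kendall_tau(a, b) for a, b in combinations(attributions, 2)]
    return sum(taus) / len(taus)


# Per-feature attribution scores from three hypothetical bootstrap runs.
runs = [
    [0.9, 0.5, 0.1],
    [0.8, 0.6, 0.2],
    [0.1, 0.5, 0.9],  # ranking flipped, as can happen with collinear features
]
stability = explanation_stability(runs)
```

A low or negative mean tau signals that the feature ranking depends on the bootstrap draw, which is the kind of fragility the summary ties to multicollinearity.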

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation

SeqLoRA jointly optimizes both LoRA factors with bilevel optimization for continual multi-concept text-to-image personalization; experiments report improved identity preservation and scalability up to 101 concepts while avoiding post-hoc fusion and reducing attribute interference in composed generations.

#Fine-tuning#Multimodal#Vision#SeqLoRA

why featured

HKR-K and HKR-R pass via a concrete LoRA mechanism and the 101-concept claim. HKR-H is weak, and the single arXiv paper lacks product or artifact details, so it stays in the 60-71 band.

editor take

SeqLoRA reaches 101 concepts, but the snippet omits base model, dataset, and runtime; don’t treat theory as deployment proof.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

PointLLM-R fine-tunes PointLLM on PoCoTI, a 55K-sample point-text instruction dataset with explicit reasoning paths, and reports state-of-the-art results on generative 3D classification, captioning, real-world scanned point clouds, and multi-turn dialogue settings.

#Reasoning#Multimodal#Fine-tuning#PointLLM-R

why featured

HKR-K passes on the 55K reasoning dataset and SOTA claims; HKR-H and HKR-R are weak because the angle is niche 3D multimodal research rather than a broad practitioner talking point.

editor take

PointLLM-R fine-tunes PointLLM on 55K PoCoTI samples; I trust the data pipeline more than undisclosed SOTA margins.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Winner-Take-All Bottlenecks Enforce Disentangled Symbolic Representations in Multi-Task Learning

The paper proves that a WTA bottleneck extracts categorical latent factors under defined conditions, and validates on two datasets that the resulting symbolic representations support generalization.

#Reasoning#Interpretability#Benchmarking#arXiv

why featured

HKR-K passes via a specific WTA mechanism and 2-dataset validation; HKR-H/R are weak because the angle is academic and application spillover is limited. No hard exclusion, but it stays below featured.

editor take

WTA bottlenecks force symbolic representations on 2 datasets; I buy the mechanism, not the “symbolic interface” pitch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Towards Explainability of SLMs by Investigating Token-Level Activation

The paper introduces AFN, a model-agnostic framework that ranks token importance by the L2 norm of BERT Layer 8 hidden states, then splits tokens into high- and low-activation buckets using an empirical upper-quartile threshold.

#Interpretability#BERT#Research release

why featured

HKR-K passes because the post gives a concrete AFN mechanism. HKR-H/R are weak: the title is routine arXiv framing and no production-safety or debugging impact is disclosed.

editor take

AFN ranks tokens via BERT Layer 8 L2 norms; I don’t buy “model-agnostic” without cross-model validation.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
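The AFN ranking step described above (L2 norm of hidden states, then an upper-quartile split) is simple enough to sketch without a model. The hidden states below are toy vectors standing in for BERT Layer 8 outputs, and the strict greater-than rule at the threshold is an assumption; the function names are hypothetical.

```python
import math
import statistics


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def split_by_activation(tokens, hidden_states):
    """Rank tokens by hidden-state L2 norm and split at the upper quartile."""
    norms = [l2_norm(h) for h in hidden_states]
    # Third quartile of the observed norms (empirical upper-quartile threshold).
    threshold = statistics.quantiles(norms, n=4, method="inclusive")[2]
    ranked = sorted(zip(tokens, norms), key=lambda p: p[1], reverse=True)
    high = [t for t, n in ranked if n > threshold]
    low = [t for t, n in ranked if n <= threshold]
    return high, low, threshold


tokens = ["the", "wire", "was", "flagged", "as", "a", "possible", "fraud"]
# Toy 2-d "hidden states": seven small-norm tokens and one large-norm token.
hidden_states = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 3 + [[6.0, 8.0]]
high, low, threshold = split_by_activation(tokens, hidden_states)
```

In a real run the vectors would come from the chosen layer of the model under study, so checking the "model-agnostic" claim amounts to repeating this split on other models and layers.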

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

TransitLM releases over 13 million transit route planning records from four Chinese cities, covering 120,845 stations and 13,666 lines, with a continual pre-training corpus, benchmark data, and three evaluation tasks for map-free route generation.

#Benchmarking#TransitLM#Hugging Face#GitHub

why featured

HKR-K passes with concrete dataset scale and tasks. HKR-H/R miss because the transit-routing benchmark is niche and has limited pull for general AI product or agent practitioners.

editor take

TransitLM ships 13M transit-planning records; “map-free” is a bold claim, but cross-city generalization error is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

The paper compares spectral, random-walk, and adjacency graph tokenizations, proving that random-walk tokenization is lossy for any walk length, while spectral tokenization is lossless but ill-conditioned for local tasks.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the paper offers a clear mechanism comparison and a counterintuitive claim. Its graph-tokenization theory is specialist and lacks product, open-source, or adoption signals.

editor take

The paper proves random-walk tokenization is lossy at any length; graph Transformers can’t treat tokenization as preprocessing trivia.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Automatic Contextual Audio Denoising

The paper introduces ACAD, which restricts context to acoustic scene classes and labels events outside a scene distribution as out-of-context noise. On paired clean/noisy data, it reports better standard objective metrics than baselines without context inference, baselines with oracle context, and baselines with separately provided uninformative context.

#Audio#Research release#Benchmark

why featured

HKR-K passes: the article gives ACAD’s context definition, OC-noise mechanism, and baseline comparisons. HKR-H and HKR-R are weak, making this a niche research release rather than featured material.

editor take

ACAD reduces context to acoustic scene class; metrics win, but RSS gives no dataset or margin, so don’t call it general audio understanding.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models

The paper decomposes MDM training variance into 3 sources and proposes 6 variance-reduction methods; P-POTS and MIRROR improve accuracy by 7-8% over standard MDM training on complex reasoning tasks and reduce run-to-run variability near ARM levels.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-K is solid: 3 variance sources, 6 methods, and 7-8% gains. HKR-H/R are weak, and no hard exclusion triggers, so this stays in the low-60s research bucket.

editor take

MDM variance gets split into 3 sources, and P-POTS/MIRROR add 7-8%; this smells like paying down a training-paradigm debt.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

The authors introduce Symphony for Speech-to-Text, a medical speech recognition system for real-time streaming and batch clinical transcription that uses three specialized components for recognition, formatting, and contextual correction. They also release a clinical benchmark dataset and offer a production API for live dictation, conversational transcription, and batch audio processing.

#Audio#Benchmarking#Symphony#Research release

why featured

HKR-K passes because the post gives concrete system components. HKR-H and HKR-R are weak, and the arXiv summary lacks benchmark numbers or adoption evidence.

editor take

Symphony splits ASR into 3 stages; no WER in the snippet, so don’t trust “substantially outperforms” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Prototype-Grounded Concept Models for Verifiable Concept Alignment

The paper introduces Prototype-Grounded Concept Models, which ground CBM concepts in learned visual prototypes for direct semantic inspection. In arXiv:2604.16076v2, the abstract says PGCMs match state-of-the-art CBMs on predictive performance while adding prototype-level human intervention for correcting concept misalignment.

#Vision#Interpretability#Alignment#Research release

why featured

HKR-K passes because PGCM links learned visual prototypes to CBM concept constraints and claims near-SOTA performance plus human intervention. HKR-H/R are weak; this is niche academic interpretability signal.

editor take

PGCM grounds CBM concepts in visual prototypes; the abstract omits datasets and metrics, so “verifiable alignment” stays unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

The paper analyzes multimodal failure in action-chunking behavioral cloning, showing that latent-variable policies depend on posterior-prior regularization strength while action-space generative policies are constrained by Lipschitz smoothness, with evidence from synthetic multimodal tasks and robotic simulation benchmarks.

#Robotics#Multimodal#Benchmarking#Research release

why featured

HKR-K passes: the paper gives concrete mechanisms for multimodal failure in action-chunking behavioral cloning and validates them in synthetic and robot simulation tasks; HKR-H/R are weak, and technical density keeps it in all.

editor take

This paper pins action-chunking BC failures on KL regularization and Lipschitz limits; more useful than another robotics benchmark drop.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Learning Causal Orderings for In-Context Tabular Prediction

The paper introduces TabOrder for in-context tabular prediction, using causal order-constrained attention and an unsupervised likelihood objective to learn topological variable orderings under observational, missing-data, and intervention settings.

#Reasoning#Benchmarking#TabOrder#Research release

why featured

HKR-K passes because the paper offers a concrete mechanism: causal-order-constrained attention and unsupervised topology learning. HKR-H and HKR-R are weak; as a single arXiv methods paper with no product or deployment claim, it stays in low all.

editor take

TabOrder constrains attention by learned causal order; no benchmark numbers disclosed, so I’m skeptical on real tabular drift gains.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

GOEN-NoCenterLoss achieves 0.9483 average OOD AUROC on CIFAR-10 benchmarks, while adding CenterLoss lowers it to 0.9366 despite improving classification accuracy; the pipeline uses multi-scale features, L2 normalization, Mahalanobis distance, and a calibration head trained with real hard OOD examples, with training under 20 minutes on one GPU.

#Safety#Benchmarking#GOEN#CIFAR-10

why featured

HKR-H and HKR-K pass: the title has a counterintuitive hook and the post gives AUROC plus training conditions. Narrow OOD benchmark research lacks product, agent, or major-model pull, so it stays below featured.

editor take

GOEN-NoCenterLoss hits 0.9483 AUROC on CIFAR-10; adding CenterLoss drops it to 0.9366, so stop treating classifier geometry as uncertainty geometry.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→A Mechanistic Explanatory Strategy for XAI

arXiv:2411.01332v5 proposes a mechanistic explanatory strategy for XAI, using decomposition, localization, and recomposition to identify functionally relevant neurons, layers, circuits, or activation patterns in deep learning systems.

#Interpretability#Vision#Reasoning#OpenAI

why featured

HKR-K passes because the paper states a concrete explanatory workflow. HKR-H/R are weak: no experiment numbers, target models, or practical impact are disclosed, so this stays in all.

editor take

arXiv v5 frames XAI as decomposition, localization, recomposition; solid philosophy, but reproducible engineering details are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

The paper introduces Alike Parts, a framework that highlights shared feature subsets between a classified instance and its nearest prototype for local explanations, and tests feature-informed global prototype selection on six benchmark datasets.

#Interpretability#Benchmarking#Research release

why featured

HKR-K passes because the paper names a mechanism and 6 benchmark tests. HKR-H and HKR-R fail: the angle is technical, with no product implication or practitioner nerve, so it stays below the interesting-news band.

editor take

Alike Parts keeps surrogate fidelity on 6 benchmarks; task mix is undisclosed, so interpretability gains need a harder audit.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

The paper benchmarks deep ensembles for message-passing GNNs on seven graph datasets and finds only marginal gains over a single model; the gains mainly come from stabilizing optimization noise in point predictions, not from better uncertainty estimates.

#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass because the paper tests a specific uncertainty claim across 7 graph datasets. The niche GNN focus lacks HKR-R and has no product, open-source, or safety implication, so it stays below featured.

editor take

Seven graph datasets show GNN ensembles mainly stabilize point predictions; I don’t buy importing the CV uncertainty default here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
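The distinction in the ensemble item above, better point predictions versus better uncertainty, maps onto two quantities that any deep ensemble yields. The sketch below is the generic computation for one node's class probabilities, not the paper's benchmark code: the averaged prediction, and the member disagreement measured as entropy of the mean minus mean of the entropies. The probabilities are toy values and the names are hypothetical.

```python
import math


def ensemble_mean(member_probs):
    """Average class probabilities over ensemble members (the point prediction)."""
    m = len(member_probs)
    k = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / m for c in range(k)]


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def disagreement(member_probs):
    """Mutual information: entropy of the mean minus the mean member entropy."""
    mean = ensemble_mean(member_probs)
    avg_member_entropy = sum(entropy(p) for p in member_probs) / len(member_probs)
    return entropy(mean) - avg_member_entropy


# Members that agree: averaging changes nothing and disagreement is zero.
agree = [[0.7, 0.3], [0.7, 0.3], [0.7, 0.3]]
# Members that contradict each other: disagreement reaches ln(2).
split = [[1.0, 0.0], [0.0, 1.0]]
mi_agree = disagreement(agree)
mi_split = disagreement(split)
```

If members trained from different seeds land on nearly the same probabilities, the disagreement term stays close to zero and the ensemble mostly acts as a noise-averaging point predictor, which is consistent with what the summary reports for GNNs.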

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→AMUSE: Anytime Muon with Stable Gradient Evaluation

AMUSE combines Muon orthogonalized momentum with Schedule-Free averaging, using a time-varying interpolation coefficient that shifts gradient evaluation from the fast Muon sequence to the averaged sequence, and reports better performance-iteration Pareto frontiers than AdamW variants and Muon across vision tasks and LLM pretraining.

#Fine-tuning#Inference-opt#Benchmarking#AMUSE

why featured

HKR-K passes on the AMUSE mechanism and LLM pretraining setting. HKR-H/R miss because the title is specialist optimizer language, and the body gives no effect sizes, code, or replication setup.

editor take

AMUSE removes LR schedules, but the snippet omits LLM scale and compute; I don’t buy the anytime claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→What Are the Right Symmetries for Formal Theorem Proving?

The paper introduces rewriting categories for formal theorem proving, defines proof equivariance and success invariance, and tests aggregation over equivalent input rewrites as a test-time method to reduce LLM prover sensitivity to semantically equivalent formulations under fixed inference budgets.

#Reasoning#Benchmarking#Inference-opt#Research release

why featured

HKR-K passes via concrete mechanisms and two named symmetry definitions. HKR-H/R are weak, and the formal-proving/category framing narrows access, so it stays in all rather than featured.

editor take

The paper defines two prover symmetries but gives no experiment scale; I buy the framing, and rewrite aggregation beats blind sampling.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

The paper introduces a causal attribution model that uses do-operators to build interventional scenarios, score LLM causal reasoning components, and guide precision fine-tuning for pairwise causal discovery across multiple domains.

#Reasoning#Fine-tuning#Interpretability#Research release

why featured

HKR-K passes via the causal-intervention and fine-tuning mechanism. HKR-H/R are weak, and the post discloses no metrics, benchmark gains, or released artifact, so it stays in the upper low-value band.

editor take

The paper scores causal components with do-operators; models, datasets, and gains are undisclosed, so I don’t buy the precision-tuning claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

The paper introduces Rule-State Inference, a Bayesian framework that uses formal rule sets as priors and infers latent compliance states on a benchmark of 2,000 synthetic enterprises; the abstract says full numerical validation is forthcoming.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with a clear mechanism and a 2,000-company synthetic benchmark; HKR-H/R are weak, and validation is not complete. This fits all, not featured.

editor take

RSI tests compliance inference on 2,000 synthetic firms; numerical validation is still pending, so the guarantees are not deployment evidence.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→End-to-End Semantic ID Generation for Generative Advertisement Recommendation

Jie Jiang and 10 coauthors propose UniSID, an end-to-end framework that jointly optimizes embeddings and semantic IDs from raw ad data; experiments report up to a 4.62% Hit Rate improvement over the strongest SID-generation baseline in downstream advertising scenarios.

#Embedding#Jie Jiang#Xinxun Zhang#arXiv

why featured

HKR-K passes with UniSID’s mechanism and a 4.62% Hit Rate gain. HKR-H and HKR-R miss: it reads like a standard IR paper and matters mostly to ad-recsys teams, so it stays in the low-value research band.

editor take

UniSID trains ad embeddings and semantic IDs end-to-end, lifting Hit Rate up to 4.62%; smells like a practical SID debt fix.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→Discrete Stochastic Localization Method for Non-autoregressive Generation

The paper introduces DSL, a continuous-state framework using unit-sphere token embeddings for non-autoregressive generation; fine-tuning one pretrained MDLM checkpoint improves MAUVE on OpenWebText across T=128 to T=1024 and supports a hybrid continuous-then-discrete sampler with T=48 total steps.

#Reasoning#Inference-opt#arXiv#OpenWebText

why featured

HKR-K passes: DSL uses unit-sphere token embeddings for non-autoregressive generation and reports OpenWebText MAUVE plus T=48 sampling. HKR-H/R are weak, so this stays a low all-tier arXiv method paper.

editor take

DSL runs one MDLM from T=48 to 1024; the no-distillation sampling flexibility is stronger than the undisclosed MAUVE gain.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG· atomEN04:00 · 05·23

→SceneSelect: Selective Learning for Trajectory Scene Classification and Expert Scheduling

SceneSelect uses unsupervised clustering over geometric and kinematic scene features to route trajectory inputs to expert predictors, and reports a 10.5% average improvement over strong single-model and ensemble baselines on ETH-UCY, SDD, and NBA.

#Robotics#Benchmarking#SceneSelect#Research release

why featured

HKR-K passes on a concrete mechanism and three benchmark results; HKR-H/R fail because this is a narrow trajectory-prediction paper with no product or broad industry impact.

editor take

SceneSelect averages a 10.5% gain across 3 trajectory benchmarks; I buy expert routing, but the snippet omits routing overhead, so hold the hype.
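
The routing step described in the summary above can be pictured as a nearest-centroid dispatcher. This is a minimal sketch, not the paper's implementation: the 2-D scene features, the centroids, and the two `experts` callables are hypothetical, and the clustering that would produce the centroids is assumed to have run already.

```python
import numpy as np

def route_to_expert(features, centroids):
    """Return the index of the nearest centroid (Euclidean distance)."""
    distances = np.linalg.norm(centroids - features, axis=1)
    return int(np.argmin(distances))

# Hypothetical cluster centers over (mean speed, path curvature).
centroids = np.array([[0.0, 0.0], [5.0, 1.0]])

# Hypothetical stand-ins for the per-cluster trajectory predictors.
experts = [
    lambda history: history[-1],                                # cluster 0: stay in place
    lambda history: history[-1] + (history[-1] - history[-2]),  # cluster 1: constant velocity
]

history = np.array([[0.0, 0.0], [1.0, 0.0]])   # last two observed positions
scene_features = np.array([4.5, 0.8])
expert_id = route_to_expert(scene_features, centroids)
prediction = experts[expert_id](history)
```

The overhead the take asks about would sit in feature extraction and in keeping several experts loaded; the dispatch itself is a single distance computation.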

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

The paper analytically solves Mountain Car optimal control and introduces Chebyshev policies, reporting 4.18x lower regret and 277x fewer parameters than neural nets on low-dimensional control tasks.

#Robotics#Reasoning#Benchmarking#Research release

why featured

HKR-K passes with a concrete mechanism and two metrics, but Mountain Car is a toy control benchmark with little product, agent, or competitive spillover. Lower-band research signal.

editor take

Chebyshev policies cut regret 4.18x with 277x fewer parameters; low-dimensional control keeps exposing neural-net overkill.
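
To make the parameter-count claim concrete, here is a rough sketch of a policy written as a small Chebyshev expansion over the two Mountain Car state variables. It assumes the classic state bounds (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) and a tensor-product basis; the paper's actual parametrization, degree, and training procedure are not in the snippet, so every name here is illustrative.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Classic Mountain Car state bounds, used here as an assumption.
POS_RANGE = (-1.2, 0.6)
VEL_RANGE = (-0.07, 0.07)

def scale(value, low, high):
    """Map value from [low, high] to [-1, 1], the Chebyshev domain."""
    return 2.0 * (value - low) / (high - low) - 1.0

def chebyshev_policy(position, velocity, coeffs):
    """Evaluate sum_ij coeffs[i, j] * T_i(x) * T_j(y), clipped to [-1, 1]."""
    x = scale(position, *POS_RANGE)
    y = scale(velocity, *VEL_RANGE)
    return float(np.clip(C.chebval2d(x, y, coeffs), -1.0, 1.0))

coeffs = np.zeros((4, 4))  # degree 3 per dimension: 16 parameters in total
coeffs[0, 1] = 1.0         # action = T_1(y), i.e. push along the scaled velocity
action = chebyshev_policy(-0.5, 0.035, coeffs)
```

A coefficient grid this small is the whole policy, which is the kind of footprint the 277x figure points at.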

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

The paper introduces EMO-STA, a two-stage framework for LLM-guided program discovery that evolves a shared archive before adapting candidates to target tasks; across eight task families, matched-compute tests show gains in most settings, and roughly balanced shared and adaptation budgets are often optimal.

#Agent#Code#Reasoning#Research release

why featured

HKR-K passes with a concrete two-stage framework, 8 task families, and a budget allocation result. HKR-H/R are weak because this is niche program-discovery research, so it stays in all.

editor take

EMO-STA wins across most of eight task families; I buy shared archives here, since single-task evolution overfits noise too easily.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→When to Switch, Not Just What: Transition Quality Prediction in Clash Royale

The study analyzes 926,334 matches from 34,619 Clash Royale players and proposes TQP, a Who-When-What transition recommendation pipeline that reaches a +10.4 percentage-point SwitchGap at a 5.4% recommendation rate.

#Benchmarking#Clash Royale#Research release#Benchmark

why featured

HKR-H/K pass because the paper has a concrete game-switching hook and measurable dataset/result. HKR-R fails: this is narrow game recommender research, not an agent, model, or AI-product shift.

editor take

TQP gets +10.4pp SwitchGap on 926k matches; I like that it gates switching itself, not another strategy leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

The paper proposes Energy-Gated Attention, which gates value aggregation using spectral energy from key token embeddings; on TinyShakespeare it reduces validation loss by 0.103 with 12,480 extra parameters, under 0.26% overhead and no measurable compute cost.

#Reasoning#Inference-opt#Research release

why featured

HKR-K passes via a concrete mechanism and small benchmark number; HKR-H/R fail. The evidence is limited to TinyShakespeare, so this stays a low-value research signal rather than a featured item.

editor take

EGA cuts TinyShakespeare loss by 0.103 with 12,480 params; the spectral-energy story needs WikiText-scale proof.
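
The two overhead figures in the summary bound the size of the model under test. A quick check, assuming "under 0.26% overhead" means extra parameters divided by base parameters (the snippet does not say which denominator is used):

```python
from fractions import Fraction

extra_params = 12_480
max_overhead = Fraction(26, 10_000)   # 0.26%, read as extra / base (assumption)

# extra / base < 0.26%  implies  base > extra / 0.26%
min_base_params = extra_params / max_overhead
print(f"base model has more than {int(min_base_params):,} parameters")
```

Under that reading the baseline is a model of a few million parameters, which fits the call for larger-scale evidence.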

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

The paper proposes a plug-and-play SPCL framework for MERC, using utterance-level and conversation-level difficulty scores to schedule training, and reports weighted F1 gains of about 1.2% to 6.6% on IEMOCAP and up to 10.4% on MELD.

#Multimodal#Audio#Benchmarking#arXiv

why featured

HKR-K passes with a named SPCL mechanism and benchmark gains on IEMOCAP/MELD. HKR-H and HKR-R fail because the angle is narrow academic MERC work with no product, open-source, or adoption signal.

editor take

SPCL adds 1.2%-6.6% weighted F1 on IEMOCAP and up to 10.4% on MELD; MERC’s pain isn’t missing modalities, it’s lopsided training.
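
Self-paced scheduling of the kind the summary describes can be sketched as a difficulty threshold that rises over training. The linear pacing function, the `start` value, and the difficulty scores below are assumptions for illustration; the paper's utterance-level and conversation-level scoring is not given in the snippet.

```python
def pacing_threshold(epoch, total_epochs, start=0.3):
    """Linearly raise the admitted difficulty from `start` toward 1.0."""
    progress = epoch / max(total_epochs - 1, 1)
    return start + (1.0 - start) * progress

def select_samples(difficulties, epoch, total_epochs):
    """Return indices of samples whose difficulty is within the current threshold."""
    limit = pacing_threshold(epoch, total_epochs)
    return [i for i, d in enumerate(difficulties) if d <= limit]

difficulties = [0.1, 0.25, 0.5, 0.8, 0.9]  # hypothetical scores in [0, 1]
early = select_samples(difficulties, epoch=0, total_epochs=5)  # easiest samples only
late = select_samples(difficulties, epoch=4, total_epochs=5)   # everything admitted
```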

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Optimal Recourse Summaries via Bi-Objective Decision Tree Learning

SOGAR formulates recourse summary learning as an optimal decision tree problem and finds the Pareto front between recourse effectiveness and cost; the paper uses shallow axis-parallel trees and sparse leaf actions, but the RSS snippet does not disclose dataset counts or exact benchmark numbers.

#Reasoning#Benchmarking#SOGAR#Research release

why featured

HKR-K passes via the bi-objective decision-tree mechanism and Pareto-frontier framing. HKR-H/R are weak: the title is standard paper phrasing, and dataset count or deployment conditions are not disclosed.

editor take

SOGAR uses shallow trees for the effectiveness-cost Pareto frontier; dataset counts are undisclosed, so treat it as audit-tool refinement.
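
For readers unfamiliar with the effectiveness-cost frontier mentioned above, a generic non-dominated filter looks like this. It is not SOGAR's optimal-tree search, only the final Pareto step applied to hypothetical (effectiveness, cost) pairs.

```python
def pareto_front(candidates):
    """Keep (effectiveness, cost) pairs that no other candidate dominates.

    One candidate dominates another if it is at least as effective and at
    most as costly, and strictly better on one of the two.
    """
    front = []
    for eff, cost in candidates:
        dominated = any(
            (e >= eff and c <= cost) and (e > eff or c < cost)
            for e, c in candidates
        )
        if not dominated:
            front.append((eff, cost))
    return sorted(front)

# Hypothetical summaries: (share of people given valid recourse, mean action cost).
candidates = [(0.60, 1.0), (0.80, 2.5), (0.75, 3.0), (0.95, 4.0), (0.60, 1.5)]
front = pareto_front(candidates)
```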

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→ARC-STAR: Auditable Post-Hoc Correction for PDE Foundation Models

ARC-STAR reduces Poseidon velocity rollout error by at least 36x across 5 flow benchmarks and 10 regime cells, using a frozen-solver pipeline with global correction, blockwise local refinement, and label-free routing to high-risk blocks under a compute budget.

#Inference-opt#Benchmarking#ARC-STAR#Poseidon

why featured

Hard-exclusion-4 applies: this is a PDE/fluid-benchmark correction paper with no agent or product implication, plus low technical accessibility. HKR-K is strong, but HKR-H and HKR-R fail, so it is capped as excluded.

editor take

ARC-STAR cuts Poseidon velocity rollout error by at least 36x across 10 regime cells; frozen correction beats reflexive PDE foundation-model fine-tuning here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Ternary Decision Trees with Locally Adaptive Uncertainty Zones

The paper introduces ternary decision trees, adding a locally computed uncertainty zone with half-width delta around each CART split threshold, and reports that five delta methods outperform standard CART on decided accuracy across 72 OpenML-CC18 datasets with 5-fold cross-validation.

#Benchmarking#OpenML#Research release#Benchmark

why featured

HKR-K is present: the paper gives a concrete mechanism and benchmark setup. HKR-H and HKR-R miss; this is a niche classical-ML method paper, not an agent/model/product event, so it stays in the lower 40–59 band.

editor take

Ternary trees beat CART on decided accuracy across 72 OpenML-CC18 sets; I trust zero-hyperparameter margin more than the +0.71% medical vignette.
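
The uncertainty zone from the summary is easy to picture on a single split. The sketch below uses a fixed `delta` and reads "decided accuracy" as accuracy over samples that fall outside the zone; both are assumptions, since the snippet does not spell out the five delta methods or the exact metric definition.

```python
def ternary_split(x, threshold, delta):
    """Route a feature value left, right, or into the uncertainty zone."""
    if x < threshold - delta:
        return "left"
    if x > threshold + delta:
        return "right"
    return "undecided"

def decided_accuracy(predictions, labels):
    """Accuracy over samples that were not left undecided (None)."""
    decided = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    if not decided:
        return 0.0
    return sum(p == y for p, y in decided) / len(decided)

# One-split stump: left -> class 0, right -> class 1, zone -> abstain.
to_class = {"left": 0, "right": 1, "undecided": None}
xs = [1.0, 2.4, 2.6, 4.0]
labels = [0, 1, 0, 1]
preds = [to_class[ternary_split(x, threshold=2.5, delta=0.2)] for x in xs]
score = decided_accuracy(preds, labels)
```

In this toy case the two points near the threshold are the ones a binary split would get wrong, so abstaining on them lifts the decided score while shrinking coverage. Coverage is worth checking alongside any decided-accuracy number.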

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Multi-Stage Training for Abusive Comment Detection in Indic Languages

The paper proposes an abusive-comment detection pipeline for Indic languages, using language-based preprocessing and an ensemble of several models; the abstract says experiments target lower false-positive rates, but the RSS snippet does not disclose datasets, model names, or scores.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K passes on the stated training mechanism, but the post gives no result numbers or reproducible setup. No hard exclusion applies, so this stays in the low-value research band.

editor take

The paper claims lower false positives for Indic abuse detection, but discloses no datasets, model names, or scores; don't buy safety without baselines.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification

CASE-NET performs multivariate time series classification with masked self-attention, causal convolutions, and adaptive channel recalibration; evaluations across six domains report state-of-the-art results on four tasks and a peak accuracy of 98.6% on the AWR dataset.

#Reasoning#Benchmarking#CASE-NET#Research release

why featured

This is a narrow multivariate time-series classification paper: HKR-K passes via mechanisms and the 98.6% AWR claim. HKR-H and HKR-R are weak because there is no product, agent, or industry-competition hook.

editor take

CASE-NET claims 4/6 SOTA and 98.6% on AWR; I’d check ablations first, since causal attention often hides a plain mask.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Researchers use large language models to infer stellar parameters and chemical abundances

The paper proposes a two-stage large language model framework that infers stellar effective temperature, surface gravity, metallicity, and abundances for about 20 chemical elements from continuous stellar spectra.

#Reasoning#Research release

why featured

Triggers hard-exclusion-4: traditional science plus AI, with no agent, product, or general AI tooling implication. HKR-H and HKR-K pass, but HKR-R fails, so it stays capped below 40.

editor take

A two-stage LLM estimates stellar parameters and ~20 abundances; no error table in the body, so don’t crown “spectra as language” yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection

UNAD+ evaluates unknown network attack detection on CICIDS2017 and NSL-KDD, combining a benign-only unsupervised ensemble, Weighted Majority Voting, supervised refinement on pseudo-labels, and post hoc explainability, with F1 scores above 98% across both benchmark datasets.

#Benchmarking#Interpretability#UNAD+#Research release

why featured

HKR-K passes via concrete mechanisms and F1>98% on CICIDS2017 and NSL-KDD. HKR-H/R are weak because this is specialized network-security ML, not a broad AI product or model-ecosystem story.

editor take

UNAD+ tops 98% F1 on two old benchmarks; I don’t buy zero-day claims without cross-dataset and time-split tests.
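
Weighted Majority Voting, one of the components named in the summary, reduces to a weighted tally. This is a generic sketch with made-up detector weights, not UNAD+'s ensemble; how the paper derives its weights is not stated in the snippet.

```python
def weighted_majority_vote(votes, weights):
    """Return 1 (attack) if attack votes carry more than half the total weight."""
    total = sum(weights)
    attack_weight = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if attack_weight > total / 2 else 0

# Hypothetical weights for three anomaly detectors (higher = more trusted).
weights = [5, 2, 2]
strong_only = weighted_majority_vote([1, 0, 0], weights)  # trusted detector alone flags
weak_pair = weighted_majority_vote([0, 1, 1], weights)    # two weaker detectors flag
```

With these weights the single trusted detector outvotes the other two, which is the behavior that separates weighted voting from a plain head count.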

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

SplAttN replaces hard projection with Differentiable Gaussian Splatting for point cloud completion, evaluates on PCN, ShapeNet-55/34, and KITTI, reports state-of-the-art results, and releases code at the project repository.

#Multimodal#Vision#SplAttN#KITTI

why featured

HKR-K passes for a concrete mechanism, benchmarks, and open code; HKR-H/R miss. The work is narrow 3D-vision research with limited product or agent spillover, so it stays in the lower band.

editor take

SplAttN tests point completion on PCN, ShapeNet-55/34, and KITTI; soft splatting sounds plain, but it targets a real projection failure.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Hybrid Kolmogorov-Arnold Network and XGBoost Framework for Electricity Price Forecasting

The paper proposes a KAN+XGBoost framework for week-ahead electricity price forecasting in Australia’s NEM, evaluates it on real-world data with an expanding-window setup, and reports about 12% lower MAE than XGBoost and over 50% lower MAE than a naive baseline.

#Benchmarking#arXiv#XGBoost#Australia National Electricity Market

why featured

HKR-K passes on the hybrid method and 12% MAE claim. HKR-H/R fail because this is a niche electricity-price forecasting paper with no product, agent, platform, or practitioner-impact hook.

editor take

KAN+XGBoost cuts NEM week-ahead MAE by about 12% versus XGBoost; abstract only, with exact splits, features, and spike-error behavior undisclosed.
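
The expanding-window setup mentioned in the summary can be sketched as a split generator in which the training range only ever grows. Window sizes here are placeholders; the paper's actual initial window and forecast horizon are not given in the snippet.

```python
def expanding_window_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices); the training range only grows."""
    start = initial_train
    while start + horizon <= n_samples:
        yield range(0, start), range(start, start + horizon)
        start += horizon

def mean_absolute_error(actual, predicted):
    """Plain MAE, the metric quoted in the summary."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Placeholder sizes: 10 observations, 4 for the first fit, 2-step test blocks.
splits = [(len(train), list(test)) for train, test in expanding_window_splits(10, 4, 2)]
```

Every test block lies strictly after its training data, so no future prices leak into the fit.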

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light

The arXiv paper uses synthetic RAW low-light samples to evaluate pedestrian detection in dark autonomous-driving scenes, characterizing a state-of-the-art object detector’s performance as a function of scene illumination; metrics on real and synthetic low-light data are similar, and the abstract does not disclose dataset size or model name.

#Vision#Benchmarking#arXiv#Research release

why featured

HKR-K passes because the paper offers a testable synthetic RAW low-light evaluation mechanism. HKR-H and HKR-R are weak: it is a narrow vision benchmark without product, open-source, or major-model stakes.

editor take

Synthetic RAW tests low-light pedestrian detection; model and sample count are undisclosed, so trust the sensor-noise setup before the generalization claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

17d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·23

→RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching

RobustSpeechFlow improves TTS alignment robustness with length-preserving repeat and skip latent augmentations; on Seed-TTS-eval, a 0.06B-parameter setup reduces WER from 1.44 to 1.38 without external aligners or preference data.

#Audio#Fine-tuning#Benchmarking#RobustSpeechFlow

why featured

HKR-K passes with a concrete mechanism and Seed-TTS-eval numbers. HKR-H/R fail: this is a narrow TTS research paper with limited accessibility and little practitioner resonance.

editor take

RobustSpeechFlow cuts Seed-TTS-eval WER from 1.44 to 1.38 at 0.06B params; TTS alignment still pays off in the loss.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
