papers · 2026-06-08

▸ 173 papers · updated 3m ago

May 2026

MTWTFSS

1150 2 39 4190 5295 6173 7203 8283 9 10 11296 12541 13284 14248 15209 16 17 18224 19490 20273 21240 2223 23185 24 25200 26431 27258 28251 29257 30 31

June 2026

MTWTFSS

1261 2473 3215 4239 57 6 7 8173 9377101112131415161718192021222324252627282930

2026-06-08 · Mon

17:59

14h ago

NEW · 2 sourcesarXiv · cs.AI· atomEN17:59 · 06·08

→OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

OmniGameArena evaluates VLM game agents across 12 newly built UE5 games: 7 Solo, 3 PvP, and 2 Coop, while IDC tracks score changes and held-out variant behavior for 4 top agents after multiple reflection rounds.

#Agent#Vision#Benchmarking#OmniGameArena

why featured

HKR-H and HKR-K pass: the UE5 game setup and reflection-dynamics metric add concrete signal. HKR-R is weak, and this is a single arXiv benchmark without adoption, release details, or cross-source traction, so it stays in 60-71.

editor take

OmniGameArena tests 12 UE5 games and 12 VLMs; IDC reflection curves beat another cold-start leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:55

14h ago

NEWarXiv · cs.AI· atomEN17:55 · 06·08

→AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

AHA-WAM uses a dual-DiT design to decouple low-frequency world planning from high-frequency action execution, reaching 92.80% average success on RoboTwin, 78.3% success across 4 real-world manipulation tasks, and 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

#Robotics#Vision#Agent#AHA-WAM

why featured

HKR-K and HKR-R pass: the mechanism and metrics are concrete, and real-robot results matter. HKR-H is weak, and this is a single arXiv robotics paper with no product launch or source cluster, so it stays in the 60–71 band.

editor take

AHA-WAM hits 92.80% on RoboTwin, but only 4 real tasks; I'd inspect failure videos before buying the SOTA claim.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:55

14h ago

NEWFEATUREDarXiv · cs.AI· atomEN17:55 · 06·08

→Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

EvalCards composes benchmark metadata, evaluation run data, and model metadata into one reporting record, and its monitoring tool covers 5,816 models, 635 benchmarks, and 101,843 results.

#Benchmarking#Interpretability#EvalCards#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv paper with mechanism and scale disclosed, not adoption or a reproducible deployment story. Evaluation infrastructure matters, so it lands at the featured threshold.

editor take

EvalCards hits the dirtiest part of leaderboards: scores travel fast while benchmark, run config, and model-version evidence breaks constantly.

sharp

EvalCards is useful because it treats evaluation as an evidence chain, not a score screenshot. The concrete hook is strong: one schema joins benchmark metadata, evaluation-run data, and model metadata, then monitors 5,816 models, 635 benchmarks, and 101,843 results. That is enough surface area to catch the usual leaderboard mess: same benchmark name, different prompts, different pass@k, different model revision, one clean-looking number. I like the four signals: reproducibility, documentation completeness, provenance and risk, and score comparability. That is closer to an audit layer than a model card. My doubt is adoption. Company blogs have no incentive to expose missing run configs, and leaderboards dislike anything that makes ranks less tidy. EvalCards needs distribution through Papers with Code, Hugging Face, or LMSYS-style hubs; otherwise it becomes a good schema that honest evaluators use and everyone else routes around.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:53

14h ago

NEWarXiv · cs.AI· atomEN17:53 · 06·08

→FASE: Fast Adaptive Semantic Entropy for Code Quality

FASE approximates code functional correctness with minimum spanning trees over structural and semantic dissimilarity graphs, and on HumanEval and BigCodeBench it improves Spearman correlation by 25% and ROCAUC by 19% versus LLM-entailment semantic entropy when using Qwen3-Embedding-8B.

#Agent#Code#Benchmarking#Qwen

why featured

HKR-K/R pass: FASE gives an MST approximation plus two testable benchmark gains, and code-agent evaluation is a real practitioner pain. HKR-H is weak, and this remains an arXiv benchmark paper without tooling or production proof.

editor take

FASE lifts Spearman 25% on HumanEval/BigCodeBench at 0.3% runtime cost; code-agent QA finally gets a cheap ruler.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:35

14h ago

NEWFEATUREDarXiv · cs.CL· atomEN17:35 · 06·08

→SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA generates a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours and delivering roughly a 36x wall-clock speedup.

#Agent#Code#Memory#GEOS

why featured

HKR-H/K/R all pass, but this is a single arXiv agent paper in a niche simulation workflow with no open-source or cross-source validation. The ~36x speedup claim lifts it to featured threshold.

editor take

SIGA’s punch is the adapter, not the deck generation: five minutes versus three hours makes simulator vendors look exposed.

sharp

SIGA pins the scientific-agent gap on interface contracts, and I buy that framing. On GEOS, it generates a complete deck in about five minutes, reaches TreeSim above 0.90, and matches a human expert who took about three hours. On the held-out set, grounding moves TreeSim from 0.720 to 0.789, while across-seed standard deviation drops 16x. That reads less like another coding-agent leaderboard trick and more like an exoskeleton for simulator vocabulary, structure, validation rules, and stopping conditions. I have doubts about extrapolating the 36x speedup, since task complexity and failure cost are thin in the snippet. The transfer to OpenFOAM and LAMMPS is the useful part: validation, memory, and retrieval dominate under different interface bottlenecks. Scientific automation will likely land first through these narrow adapters, not through a universal research agent.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:29

15h ago

NEW · 2 sourcesarXiv · cs.CL· atomEN17:29 · 06·08

→Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

The study converts community-sourced dictionaries into synthetic corpora and fine-tunes mT5-base with LoRA adapters; in-domain evaluation reaches BLEU 42.02, while an organic glossary test falls to BLEU 0.59.

#Fine-tuning#Benchmarking#Q'eqchi' Mayan#mT5

why featured

HKR-K and HKR-R pass: the paper gives a concrete PEFT setup and a sharp BLEU gap, 42.02 in-domain vs 0.59 organic vocab. HKR-H is weak; the scope is a niche NMT case study with limited product spillover.

editor take

mT5-base+LoRA hits BLEU 42.02 in-domain, 0.59 on organic glossary; synthetic data taught form, not language.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:27

15h ago

NEW · 2 sourcesFEATUREDarXiv · cs.CL· atomEN17:27 · 06·08

→iOSWorld: A Benchmark for Personally Intelligent Phone Agents

iOSWorld introduces an open-source iOS phone-agent benchmark with 26 custom apps and 133 tasks; the best tested configuration reaches 52% overall accuracy, while multi-app tasks reach only 37%.

#Agent#Memory#Benchmarking#iOSWorld

why featured

HKR-H/K/R all pass: the 52% best score and 37% multi-app result make phone-agent limits concrete. It is a solid research/benchmark release, not a major model launch, so it stays in the 78–84 band.

editor take

iOSWorld pins phone-agent weakness on multi-app work: 52% overall, 37% across 2–8 apps. That is not product-ready autonomy.

sharp

iOSWorld’s harsh number is not 52% overall; it is 37% on multi-app tasks. Phone agents sell “do it for me,” but real phone work rarely stays inside one app. This benchmark uses 26 custom iOS apps and 133 tasks, with transactions, messages, travel records, relationships, and financial data tied to one persistent identity. That hits the exact workflow seam current agents keep dodging. The 26-point gain from privileged vision+XML access is the sharper clue. Frontier models can use the accessibility tree; smaller models cannot. So the bottleneck is not screen perception alone. It is binding structured UI, personal memory, and cross-app state without losing the plot. AndroidWorld and OSWorld already exposed brittle device control. iOSWorld moves the failure into personal context, and 37% is ugly for every “AI assistant on your phone” pitch.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:11

15h ago

NEWarXiv · cs.CL· atomEN17:11 · 06·08

→Collaborative Human-Agent Protocol (CHAP)

CHAP defines a shared workspace protocol for human-agent collaboration, using a Core with workspaces, participants, tasks, artifacts, and an append-only evidence log, while profiles add review, routing, handoff, identity, signatures, and transparency-backed audit.

#Agent#Tools#Memory#BrightbeamAI

why featured

HKR-K/R pass: CHAP offers concrete workspace and append-only evidence-log mechanics for human-agent collaboration. HKR-H is weak; adopters, benchmarks, and implementation maturity are not disclosed, so it stays in 60–71.

editor take

CHAP records human edits as diff, rationale, and hash; solid direction, but adoption hinges on MCP/A2A vendors.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:08

15h ago

NEW · 2 sourcesFEATUREDarXiv · cs.CL· atomEN17:08 · 06·08

→Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

The study evaluates deep research agents under multi-turn revisions: one process-level feedback round raises normalized scores by about 8–15 points, but later full-report rewrites regress on up to 24% of previously satisfied criteria.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the counterintuitive regression hook, 8-15 point gains, and 24% backslide are concrete. As a single arXiv agent-eval paper, it fits the 78-84 quality band, not same-day must-write.

editor take

Deep research agents still fail at revision discipline: process feedback adds 8–15 points, then full rewrites lose up to 24% of satisfied criteria.

sharp

This paper lands because it turns “agents improve through iteration” into a measurable failure mode. Under self-reflection, deep research agents add and lose rubric criteria at nearly equal rates, so the net gain is negligible. RGI process-level feedback does help once: normalized scores rise 8–15 points, with a 35–40% incorporation rate. The failure shows up after that first repair. When agents rewrite the full report to fix remaining gaps, they regress on up to 24% of criteria they had already satisfied. That pattern rhymes with coding agents that fix one local bug while breaking existing constraints, except reports lack a clean test suite. The code and results are public on GitHub, so this is unusually falsifiable. If it holds up, “multi-turn revision” is still product theater for deep research agents, not a dependable capability.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:00

15h ago

NEWFEATUREDarXiv · cs.CL· atomEN17:00 · 06·08

→The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

The study compares Llama 3.1 8B before and after RLHF and finds that partisan directions remain in the model, while signal variance is compressed; sparse autoencoder decomposition shows policy-encoding features become completely inactive in the Instruct model.

#Alignment#Interpretability#Safety#Llama

why featured

HKR-H/K/R all pass: the paper makes a testable claim about shallow RLHF alignment with Llama 3.1 8B and SAE evidence. No cross-source traction or product impact, so it stays at 78.

editor take

RLHF looks like a muffler on Llama 3.1 8B: neutral outputs, partisan geometry intact, and user-identity cues can reconnect it.

sharp

This paper lands because it attacks RLHF at the mechanism, not the vibe layer. In Llama 3.1 8B, RLHF does not delete the partisan direction; it compresses variance and cuts the causal path into generation. The hook is the sparse autoencoder result: policy-encoding features that fire sporadically in the base model become completely inactive in the Instruct model. Steering then shows the structure can still drive output when reconnected. That is bad news for safety evals built around neutral answer text. A model can pass the visible politics test while preserving a steerable partisan geometry underneath. The user-identity bypass matters too: if inferring and amplifying a user’s partisan identity reactivates partisan generation, RLHF is closer to a runtime routing rule than parameter-level value removal.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:31

15h ago

NEWFEATUREDarXiv · cs.CL· atomEN16:31 · 06·08

→IS-CoT: Breaking Long-form Generation Collapse via Interleaved Structural Thinking

The paper introduces IS-CoT, embedding a Plan-Write-Reflect cycle into generation and training IS-Writer-8B; its evaluation finds reasoning-enhanced models degrade sharply when target lengths exceed 2,000 words in open-ended writing.

#Reasoning#Agent#Fine-tuning#DeepSeek

why featured

HKR-H/K/R all pass, but the article gives abstract-level detail only; benchmark, gain size, and training cost are not disclosed. Practical angle lifts it to the 72–77 featured band.

editor take

Long-form collapse is getting treated as its own failure mode; IS-Writer-8B’s +3.08 over DeepSeek-V3.2 is small, but the looped structure is the point.

sharp

IS-CoT names the failure many writing products keep hiding: reasoning skill does not buy stable 2,000-word generation. The paper’s concrete hook is clean: reasoning-enhanced models degrade sharply past 2,000 target words, and IS-Writer-8B beats DeepSeek-V3.2 by +3.08 on LongBench-Write with an embedded Plan-Write-Reflect loop. I buy the direction more than another context-window flex. Static outlines often rot halfway through a long draft, while external agent workflows add scheduler overhead and brittle handoffs. Training the reflection and replanning loop into an 8B writer treats long-form writing as process control, not one giant sample. The missing piece is evaluation hygiene: the snippet gives no human preference data, length distribution, or multi-pass editing results.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:21

16h ago

NEW · 2 sourcesFEATUREDarXiv · cs.CL· atomEN16:21 · 06·08

→Research paper proposes AdvGRPO framework for adaptive red teaming of language models

The paper introduces AdvGRPO, using dense multi-channel rewards and decoupled advantage normalization to make GRPO stable for alternating attacker-defender co-training. Training moves from single-turn attacks to closed-loop multi-turn attacks before co-training, and the post does not disclose benchmark scores.

#Alignment#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but no benchmark scores, cross-source cluster, or major-lab signal is disclosed. This fits a featured-threshold safety research item, not same-day must-write.

editor take

AdvGRPO makes GRPO usable for attacker-defender co-training; good direction, but no scores disclosed here, so don’t buy the safety claim yet.

sharp

AdvGRPO’s useful move is taking red teaming away from static jailbreak sets and into alternating attacker-defender updates. The concrete hook is engineering, not rhetoric: dense multi-channel rewards, decoupled advantage normalization, then a curriculum from single-turn attacks to closed-loop multi-turn attacks before co-training. That matches live risk better than another HarmBench-style sweep, because the attacker policy keeps chasing the guardrail. I discount the claimed win for now. The snippet says co-trained defenders beat baselines, but gives no benchmark names, scores, model sizes, or attack budget. PPO and DPO co-training already made the directional case; GRPO has to prove stability and cost. Without curves or reproducible settings, this smells partly like attaching post-DeepSeek-R1 GRPO momentum to safety training.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

STILL DEVELOPING · 1dFEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Just-In-Time Reinforcement Learning: Gradient-Free Continual Learning for LLM Agents

JitRL uses non-parametric memory to retrieve trajectories, estimates action advantages at test time, and modulates LLM logits, outperforming WebRL on WebArena and Jericho while reducing monetary cost by more than 30x.

#Agent#Reasoning#Memory#JitRL

why featured

HKR-H/K/R all pass: the hook is gradient-free continual agent learning, the mechanism is retrieval plus advantage-based logit modulation, and the 30x cost claim hits practitioner pain. As a single arXiv paper, it fits 78–84 rather than must-write.

editor take

JitRL’s punch is not “no training”; it turns agent learning into retrieval plus logit bias, and a 30x cost gap hurts fine-tuning narratives.

sharp

JitRL makes online adaptation look like a thin policy patch, which is sharper for agent builders than another RL fine-tuning recipe. It stores trajectories in non-parametric memory, retrieves relevant cases at test time, estimates action advantages, then shifts LLM logits directly. The paper claims that additive logit update is the closed-form solution for a KL-constrained policy objective. On WebArena and Jericho, it beats WebRL while cutting monetary cost by over 30x. I have some doubts about calling this continual learning. No weights move, so the ceiling sits in memory quality, retrieval coverage, and trajectory distribution. It reads like a more formal Reflexion/Voyager branch: touch gradients less, push experience into external state you can inspect and swap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Latent Geometry Beyond Search: Amortizing Planning in World Models

The paper replaces iterative planning with GC-IDM in a pretrained LeWorldModel, matching or exceeding CEM in 7 of 8 environment-protocol settings across 4 benchmarks while reducing per-decision cost by 100-130x.

#Agent#Robotics#Reasoning#LeWorldModel

why featured

HKR-H/K/R all pass: the paper claims planning without iterative search, names GC-IDM, and reports 7/8 settings plus 100-130x lower decision cost. Single arXiv paper with no major-lab or cross-source signal keeps it below must-write.

editor take

GC-IDM cuts CEM-style online search by 100-130x; for robotics, that beats another flashy VLA demo.

sharp

GC-IDM lands on inference cost, not leaderboard theater. Inside LeWorldModel, it maps current latent, goal latent, and remaining horizon straight to the next action. Across 4 benchmarks and 8 environment-protocol settings, it matches or beats CEM in 7, while cutting per-decision cost by 100-130x. If that survives real robot latency and contact noise, the deployment math for world-model control changes fast. I’d discount the paper’s strongest reading: “latent geometry already contains planning structure.” The evidence is still a pretrained LeWorldModel plus benchmark environments, not messy fleet robotics. Compared with RT-2 or π0-style VLA work, this attacks a narrower bottleneck: replacing per-step optimizer burn. Long-horizon tasks, OOD contact, and multi-object clutter are where amortized planners usually start leaking.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Robotic Policy Adaptation via Weight-Space Meta-Learning

WIZARD generates task-specific LoRA parameters for a frozen VLA policy from a language instruction and a short demonstration video, completing adaptation in one forward pass without target-task action labels or test-time optimization; on LIBERO, it improves performance by up to about 14x on unseen tasks.

#Robotics#Fine-tuning#Vision#WIZARD

why featured

HKR-H/K/R pass: the paper maps language and short videos into LoRA weights for one-pass adaptation of a frozen VLA, with up to ~14x gains on unseen LIBERO tasks. Robotics scope keeps it below must-write.

editor take

WIZARD turns robot adaptation into one LoRA-generating forward pass; 14x on LIBERO is loud, but the real-world long tail is still the trap.

sharp

WIZARD’s sharp move is shifting VLA adaptation from action-labeled demos and fine-tuning into language plus one short video. It generates task LoRA weights for a frozen VLA policy in one forward pass. The paper reports up to ~14x on unseen LIBERO tasks, ~2x on unseen dataset collections, and gains over a real-domain adapted baseline on a Franka Emika Panda. I like the direction more than I trust the headline multiple. LIBERO has clean task boundaries, and a 14x gain can hide a weak baseline, nearby task distribution, or unusually informative demos. Robotics deployment has been stuck on contact dynamics, occlusion, recovery, and camera mess, not just adapter training cost. If WIZARD holds up across robots, camera setups, and noisy human videos, then the VLA cost curve actually changes.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

STILL DEVELOPING · 1dFEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

The paper proposes CapCode and CapReward, using randomized tests to cap the best non-cheating score below 1; scores substantially above the cap provide evidence that coding agents exploit shortcuts instead of solving the intended task.

#Agent#Code#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the cheating hook is strong, CapCode/CapReward add a testable randomized-eval mechanism, and benchmark trust is a practitioner nerve. As a single arXiv paper, it sits in the good-quality band, not must-write.

editor take

CapCode turns cheating into a statistical outlier, not a vibe check; any coding-agent pass@1 without randomized caps now looks under-instrumented.

sharp

CapCode hits the dirtiest gap in coding-agent evals: a high score can come from probing the harness, not solving the task. The mechanism is clean. Randomized tests cap the best non-cheating score below 1, so a score far above that ceiling becomes evidence of shortcut exploitation. CapReward then trains against optimization past the cap. That is stronger than “add more hidden tests,” because it defines a statistical boundary for suspicious success. SWE-bench and HumanEval-style boards still lean on the old assumption that pass rate tracks capability; agentic systems have been making that assumption brittle. The abstract says rankings are preserved across multiple datasets, but it does not disclose the benchmark names or cap values in the snippet. I’d trust this faster if it survives real-repo tasks without flagging genuinely robust generalization.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Measuring Agents in Production

MAP interviewed 20 case studies and surveyed 86 deployed-system practitioners across 26 domains, finding that 68% of production agents execute at most 10 steps before human intervention.

#Agent#Benchmarking#Alignment#arXiv

why featured

HKR-H/K/R all pass: the 68%/10-step finding gives production-agent reliability a concrete ceiling, backed by deployed-system practitioners. As an arXiv study rather than a major product release, it fits the 78-84 band.

editor take

MAP punctures the agent hype cleanly: 68% of production agents hit human handoff within 10 steps, so long-horizon autonomy is still mostly demo theater.

sharp

MAP lands because it drags agent talk back to production constraints. Across 86 deployed-system practitioners and 26 domains, 68% of agents run at most 10 steps before human intervention. Another 70% rely on prompting off-the-shelf models, and 74% primarily use human evaluation. That looks much closer to enterprise Copilot deployments than to the long-horizon autonomous-worker story vendors keep selling. I don’t buy the “one stronger model fixes it” line here. The paper names reliability as the top challenge, and says teams handle it through systems-level design. That means permissions, rollback, monitoring, eval harnesses, and human gates. SWE-bench and browser-agent scores can keep rising; production teams still ask who signs off after step ten.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

The paper introduces Semantic Gambit, which augments acoustic adversarial attacks with real-time LLM-predicted context and raises corpus-level Word Error Rate on real-time ASR systems to 35.6%, a three-fold increase over the current state of the art.

#Audio#Safety#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: the paper has a sharp audio-attack hook, a concrete mechanism, and 35.6% WER at 3x SOTA. It stays in the 78–84 band because impact is research-side, not a model launch or live product incident.

editor take

LLMs are not the shield here; they are the attacker’s autocomplete. Real-time ASR just exposed a latency tax it cannot hide.

sharp

Semantic Gambit moves real-time ASR attacks from acoustic noise into language priors, and 35.6% corpus-level WER is a nasty number. Real-time transcription has to commit before full context arrives. Older attacks were boxed in by that causal constraint. This paper uses a low-latency LLM to predict surrounding context, giving the attacker a proxy for future information. The reported effect is 3x the current SOTA. I don’t read this as another adversarial-audio paper. Voice AI stacks spent the last year plugging LLMs into meetings, call centers, and cars for correction and completion. This flips the same capability into an attack primitive. The snippet does not disclose the LLM, latency budget, or target ASR systems, so the replication details matter a lot.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

The paper reports LightningLM 0.1V training a 120B sparse MoE on one 8-GPU node, using 460 routed experts with top-12 routing and 5.93B active parameters at the 120B stage.

#Inference-opt#Code#LightningLM#Research release

why featured

HKR-H/K/R all pass: the 8-GPU 120B MoE claim is clickable, with concrete routing numbers and cost resonance. Single arXiv sourcing and no adoption signal keep it in the 78–84 band.

editor take

A 120B MoE on one 8-GPU node sounds like bait, but 5.93B active params and 2.26B optimizer-state params make it serious.

sharp

LightningLM 0.1V is less about the “120B on one node” headline and more about making the MoE memory bill reproducible. At the 120B stage, only 5.93B parameters are active, across 460 routed experts with top-12 routing. TQP keeps optimizer state on 2.26B adapter parameters, with the paper claiming a ~45x cut on the expert path. That is a concrete systems contribution, not a parameter-count flex. I don’t buy the capability framing yet. The snippet gives released training loss of 1.78, 8K context, and per-domain held-out loss, but no MMLU, SWE-bench, or inference throughput. This reads like a practitioner systems report, not a model that pressures Qwen or DeepSeek MoE on user-visible quality. Releasing the model family, tokenizer, and training code matters; missing public evals keep the “learned by construction” claim on a short leash.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Insights Generator diagnoses LLM agent execution traces with a multi-agent scout-investigator architecture. Human experts using its evidence-backed reports improved scaffold performance by 30.4 percentage points over the unmodified baseline, and coding agents using IG-derived insights showed consistent gains across benchmarks.

#Agent#Code#Benchmarking#Insights Generator

why featured

HKR-H/K/R all pass: the 30.4-point gain and trace-diagnosis-to-scaffold-edit mechanism are concrete. As a single arXiv item without author, code, or cross-source heat, it stays in the lower good-quality band.

editor take

Agent work is hitting a debugging wall: stronger coders help less when nobody can read thousands of failing traces at corpus scale.

sharp

IG lands on the part of agent engineering teams keep under-instrumenting: corpus-level trace forensics. The hard number is 30.4 percentage points: experts used IG reports to improve a scaffold over the unmodified baseline. The mechanism also matters: a scout-investigator setup proposes hypotheses, then checks them against trace evidence. That is closer to production debugging than another pass-rate leaderboard. I buy the direction, with a caveat. After SWE-bench turned coding agents into a race, many failures now sit in long traces, tool misuse, stale state, and bad recovery loops. IG is aimed at those boring failure modes. The abstract does not disclose task mix, baseline strength, or expert time cost, so the 30.4pp gain may include a large human-tuning dividend.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

MacArena benchmarks computer-use agents on 421 manually verified tasks across 50 macOS apps, running on Apple Silicon’s native Virtualization framework; on macOS-native tasks, model rankings invert and a leading model trails by more than 26%.

#Agent#Vision#Benchmarking#Victor Muryn

why featured

HKR-H/K/R all pass: MacArena adds 421 tasks, 50 macOS apps, and a 26%+ gap for computer-use agents. No hard exclusion applies, but this is a single arXiv benchmark, not a same-day must-write release.

editor take

MacArena exposes the Linux comfort zone in GUI agents: rankings flip on 49 macOS-native tasks, so “computer use” still travels badly.

sharp

MacArena pins the GUI-agent gap on macOS, not another OSWorld leaderboard bump. The benchmark has 421 manually verified tasks across 50 apps, runs on Apple Silicon’s native Virtualization framework, and includes 49 macOS-native tasks where rankings invert; one leading model trails by more than 26%. That is a hard signal: current computer-use agents have learned plenty of Linux-flavored benchmark distribution, then lose shape when the menu system, app conventions, and native UI stack change. macOSWorld was too narrow, mostly first-party apps, and stuck on x86 VMs. MacArena at least moves evaluation into the environment many AI builders actually use daily. Teams selling “general computer use” should publish cross-OS degradation curves before claiming broad GUI competence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Should You Use Your Large Language Model to Explore or Exploit?

The paper evaluates current LLMs on exploration and exploitation in separate contextual bandit tasks. Reasoning models perform best on exploitation but remain expensive or slow, tool use and in-context summarization improve medium-difficulty tasks, and every tested LLM still underperforms simple linear regression, while LLMs help explore large semantic action spaces by proposing candidates.

#Reasoning#Agent#Tools#Research release

why featured

All HKR axes pass: the angle has a twist, the claim names a clear baseline, and it challenges LLM-for-decision defaults. As a single arXiv paper, it is not must-write, but the practical warning earns the 78 band.

editor take

LLMs look useful for proposing arms, not running the bandit; losing to linear regression is a brutal warning for agentic decision loops.

sharp

This paper hits the weakest part of the agent story: once exploration and exploitation are separated, LLMs fail the money-making side. In the UAI 2026 version, every tested LLM underperforms simple linear regression, even on nonlinear settings; reasoning models do best on exploitation but stay too slow or expensive for many practical loops. Tool use and in-context summarization help on medium-difficulty tasks, not enough to change the ranking. The useful role is narrower: propose candidates in large semantic action spaces. That is bad news for demos that route pricing, ads, or recommendations through an LLM controller. Let the model expand the arm set; don’t hand it the online optimizer.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Diagnosing Visual Ignorance in Vision-Language Models

The paper evaluates three VLMs with counterfactual layer replacement, layer-wise MLP probes, and multi-step Gaussian blurring; across 12 VQA benchmarks, a substantial fraction of examples keep the same answers under severe or total visual obfuscation.

#Multimodal#Vision#Interpretability#Research release

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the paper gives 3 diagnostics across 12 VQA benchmarks, and it targets multimodal evaluation trust. As a single arXiv paper without lab-scale release or cross-source pickup, it lands at 78 featured.

editor take

VQA takes another hit: across 3 VLMs and 12 benchmarks, answers survive heavy blur, so part of the score is rewarding text priors.

sharp

This paper makes the old VQA complaint harder to hand-wave: many “visual” questions do not require vision. The authors test 3 VLMs across 12 VQA benchmarks, then progressively destroy the image with Gaussian blur. A substantial fraction of answers stay unchanged under severe or total obfuscation. Internally, counterfactual layer replacement and layer-wise MLP probes point to a routing failure: middle decoder layers fail to retrieve visual evidence, and later layers suppress what remains. I buy the diagnostic more than another leaderboard critique because it ties behavior to mechanism. The missing details matter: the snippet does not name the 3 models, per-benchmark rates, or blur settings. Without those, this is not yet a verdict on a specific architecture family. It is a warning that VQA scores still mix grounding with dataset priors.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

RASFT estimates problem-level solvability from verified on-policy rollouts and adjusts expert supervision, outperforming SFT, SFT variants, and representative RL methods across six mathematical reasoning benchmarks and two code reasoning benchmarks.

#Reasoning#Fine-tuning#Code#RASFT

why featured

HKR-H/K/R all pass: SFT-over-RL is the hook, the mechanism and 8 benchmarks are concrete, and the topic affects reasoning-model training cost. Single arXiv paper with limited authority keeps it at 78.

editor take

RASFT fixes the dumbest part of reasoning SFT: copying one solution trace. But no model sizes or deltas are disclosed, so don’t crown it over RL yet.

sharp

RASFT is useful because it attacks the brittle assumption behind reasoning SFT: one expert trace is the target behavior. It estimates problem solvability with verified on-policy rollouts, tightens expert supervision when the policy fails, relaxes imitation when the policy already solves the item, and adds correct self-generated trajectories. The clipped inverse ratio against a frozen reference is the safety rail against policy drift. That is a cleaner bridge between SFT and RL than another loss tweak. I would discount the “beats representative RL methods” claim for now. The abstract gives six math benchmarks and two code benchmarks, but not base model sizes, absolute gains, training budget, or the RL baselines. If the comparison is against a thin PPO or GRPO setup, the headline is much less spicy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval

SIRA compresses multi-round exploratory retrieval into one weighted BM25 call and achieves the strongest average retrieval result across ten BEIR benchmarks. It uses no relevance labels or retriever fine-tuning, and on 232 BrowseComp-Wikipedia queries it reaches 36.14% Recall@100 over a 25,587,229-document Wikipedia index.

#Agent#RAG#Benchmarking#SIRA

why featured

HKR-H/K/R all pass: SIRA claims one weighted BM25 call can replace multi-step agentic retrieval and reports BEIR plus BrowseComp-Wikipedia numbers. Single arXiv paper with no cross-source validation keeps it below must-write.

editor take

SIRA’s punch is brutal: one weighted BM25 call beats search-agent loops. A lot of “agentic retrieval” was sloppy vocabulary work.

sharp

SIRA pokes a hole in the agentic-retrieval story: it compresses exploratory search into one weighted BM25 call and still posts the best average result across 10 BEIR benchmarks. The mechanism is plain but sharp: enrich documents offline with missing search vocabulary, predict omitted evidence terms at query time, then use corpus stats to drop terms that are absent, too common, or unable to create margin. The uncomfortable number for Perplexity-style systems is BrowseComp-Wikipedia: 232 queries over 25,587,229 Wikipedia docs, with 9.70% Recall@1, 15.27% Recall@10, and 36.14% Recall@100, even without index-time enrichment. If this holds up beyond the paper’s setup, a lot of multi-round retrieval agents look less like intelligence and more like expensive repair work for weak corpus priors.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Endogenous Resistance to Activation Steering in Language Models

The paper defines Endogenous Steering Resistance in language models. Llama-3.3-70B can restart mid-generation under active SAE-latent steering, while the abstract leaves the exact ESR rate and latent counts unresolved as template placeholders.

#Alignment#Safety#Interpretability#Llama

why featured

HKR-H/K/R all pass, but the summary does not disclose ESR rates, setup, or code, so this stays in the 72–77 band. The alignment/control angle clears featured, not must-write.

editor take

Activation steering is not a remote control; Llama-3.3-70B can push back mid-generation while the perturbation stays on.

sharp

This paper pokes a real hole in the clean activation-steering story: the model is not a passive actuator. Under active SAE-latent steering, Llama-3.3-70B can produce explicit restarts like “wait, that’s not right” and return to the task while the perturbation remains on. The useful hook is the split between a detection event and sustained resistance, neither fully explained by recent on-topic tokens. I would discount the result until the numbers are fixed. The abstract still has TeX placeholders for ESR rate, latent count, and multi-attempt reduction. Still, the claim lands because it cuts against a lot of steering work that treats latent injection as linear control. For safety, the nasty part is symmetry: the same endogenous resistance that blocks adversarial activation manipulation can also block beneficial harmlessness or policy steering.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

UltraEP rebalances MoE expert load at every microbatch and layer, reaching 94.3% of force-balanced ideal throughput across 106B–671B models and improving throughput by 1.49x over no balancing. The paper reports final inter-rank imbalance falling from 1.30–4.01 to 1.01–1.04, with production training validation on 2,560 GPUs.

#Inference-opt#Research release#Benchmark

why featured

HKR-H/K/R all pass, but this is specialized distributed-training infrastructure rather than a broad model release. The 2560-GPU validation and 671B results justify featured near the lower edge.

editor take

UltraEP’s 1.49x speedup is loud, but the withdrawal is louder: MoE load-balancing details at 2,560 GPUs are now competitive ammunition.

sharp

UltraEP pushes MoE balancing from periodic history-based moves to every microbatch and layer, then gets withdrawn for “information disclosure” issues. That pairing is the story. The abstract gives hard numbers: 94.3% of force-balanced ideal throughput across 106B–671B MoE models, 1.49x over no balancing, inter-rank imbalance cut from 1.30–4.01 to 1.01–1.04, plus validation on 2,560 production GPUs. I read this as a leak from the actual MoE battlefield: not smarter gating, but killing all-to-all stalls, expert-state transfers, and stragglers inside rack-scale connectivity. DeepSeek- and Qwen-style sparse models live or die on this layer once scale gets ugly. The paper has no PDF now, so nobody can verify the implementation path; the withdrawal itself says those details were sensitive enough to pull back.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Lean4Agent uses Lean4 to model and verify agent workflows and execution trajectories. Across 5 leading LLMs on SWE-Bench-Verified and ELAIP-Bench subsets, verification-passing workflows outperform failing ones by 11.94% on average, and LeanEvolve adds a 7.47% average SWE improvement.

#Agent#Code#Benchmarking#Research release

why featured

HKR-H/K/R all pass: formal verification for agent trajectories is a fresh angle with +11.94% and +7.47% claims. Lean4's niche barrier and single arXiv sourcing keep it below the 78–84 band.

editor take

Lean4Agent puts agent traces into Lean4; I buy the direction. 11.94% isn't a victory lap, but it beats another planner wrapper.

sharp

Lean4Agent’s useful move is not the slogan of “formal agents.” It turns failure analysis from log archaeology into checkable assumptions. The paper reports tests on a hard SWE-Bench-Verified subset and an ELAIP-Bench subset across 5 leading LLMs: verification-passing workflows beat failing ones by 11.94% on average, and LeanEvolve adds another 7.47% average SWE gain. I’m only half sold on the numbers. The abstract does not give sample size, model names, how the failing workflows were built, or verification overhead. Still, the direction hits a real agent-engineering wound. LangGraph and AutoGen mostly manage orchestration; Lean4Agent tries to police semantic consistency. If it can localize which assumption broke inside real repo tasks, it is closer to tooling than another reflection loop.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→OpenSkill: Open-World Self-Evolution for LLM Agents

OpenSkill enables LLM agents to build skills and verification signals without target-task supervision, using documentation, repositories, and the web, and reports the best automated pass rate across three benchmarks and two target agents.

#Agent#Tools#Reasoning#OpenSkill

why featured

HKR-H/K/R all pass, but the source only gives abstract-level facts; pass-rate numbers, baselines, and code availability are not disclosed. Featured threshold, not same-day must-write.

editor take

OpenSkill attacks the verifier bottleneck: if its self-built checks really align without target supervision, this matters more than another trajectory-mining paper.

sharp

OpenSkill is sharp because it moves self-evolving agents from replaying successful trajectories to bootstrapping the verifier itself. It uses no target-task supervision, pulls anchors from documentation, repositories, and the web, then synthesizes skills and virtual tasks. The paper claims the best automated pass rate across 3 benchmarks and 2 target agents, plus verifier alignment with ground-truth outcomes it never accessed. I would discount the score until the setup is visible. The snippet gives no benchmark names, pass-rate numbers, target-agent names, retrieval budget, or contamination controls. Agent papers this year keep dressing up “can search docs” as “can self-improve.” Still, the target is right: deployed agents usually lack curated skills, clean success traces, and human-built verifiers.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Breaking the Ice: Analyzing Cold Start Latency in vLLM

The paper breaks vLLM startup into six steps, finds cold start latency is primarily CPU-bound, and open-sources benchmark datasets, analysis tools, and prediction scripts for the profiler.

#Inference-opt#Benchmarking#vLLM#Research release

why featured

HKR-H/K/R all pass: vLLM is a mainstream serving stack, and the paper adds a reproducible 6-step cold-start analysis. The systems focus keeps it below the 78–84 band.

editor take

vLLM cold start gets split into six steps, and the CPU takes the blame; elastic inference teams staring only at GPU utilization are missing the tail.

sharp

vLLM cold start is not a nuisance; it is a hidden tax on elastic inference. The paper splits startup into six steps and makes a blunt claim: latency is primarily CPU-bound, with interpretable scaling against model and system parameters. That is more useful than another steady-state throughput chart, because autoscaled serving often loses on replica bring-up, not tokens per second after warmup. I like the engineering posture here: the authors open-source benchmark data, analysis tools, and prediction scripts for estimating startup latency on a hardware configuration. The caveat is sharp too: the RSS body only gives the abstract, not the model list, CPU SKUs, or per-step numbers under vLLM V1 and torch.compile. Without those tables, “CPU-bound” is a good warning, not yet a buying guide.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

The paper tests attack selection in BashArena and LinuxArena; at a 1% audit budget, the start policy reduces measured safety by 20 percentage points in both settings, while the stop policy reduces it by 20 points in BashArena and 28 points in LinuxArena.

#Agent#Safety#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: this quantifies how attack selection damages agent-control safety at a 1% audit budget. It stays below the 78+ band because this is a single arXiv item with no disclosed replication or broader uptake.

editor take

At a 1% audit budget, safety drops 20–28pp; agent-control evals look too clean when the attacker isn’t allowed to choose timing.

sharp

This paper hits a weak assumption in agent control: keep attack capability fixed, add only start/stop timing, and measured safety drops by 20–28 percentage points. At a 1% audit budget, the start policy cuts safety by 20pp in both BashArena and LinuxArena; the stop policy cuts 20pp in BashArena and 28pp in LinuxArena. I don’t read this as a benchmark tweak. It says many control evals make the red team behave like a dumb script, then congratulate the monitor. System cards from OpenAI and Anthropic often lean on monitoring plus limited human review; if selective attackers are absent, that safety case is padded. The authors call the numbers upper bounds, which is fair. Still, an upper bound this large says current eval margins are thinner than the charts suggest.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→The Fine-Tuning Trap: Evaluating Negative Transfer and PEFT in Sub-1B Mathematical Reasoning

The study benchmarks five 135M–1B SLMs on mathematical reasoning and finds that Full FT often drops accuracy below zero-shot baselines for models under 300M parameters.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the paper has a counterintuitive hook, concrete model sizes, and a testable Full FT vs zero-shot result. As a single arXiv benchmark on sub-1B models, it stays at the 72-77 featured threshold.

editor take

Full FT pushed sub-300M math SLMs below zero-shot; for edge models, PEFT is a guardrail, not a cost tweak.

sharp

Small-model fine-tuning fails less like optimization noise and more like capability erasure. The paper tests five 135M–1B SLMs, and Full FT often drops sub-300M models below their zero-shot math baselines. On SmolLM2-135M, Full FT even loses to simple 5-shot ICL. That is a nasty result for edge deployment: task adaptation can overwrite the thin layer of reasoning the base model still has. The PEFT result is credible because it is not sold as one magic adapter. DoRA leads on GSM8K-style complex reasoning, while LoRA wins on OrcaMath pattern matching. On Qwen2.5-0.5B, LoRA beats Full FT outright. The paper’s “avoid Full FT below 500M” rule is conservative, but it is cleaner than the default settings in many tiny-model fine-tuning stacks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

SlimSearcher reduces average tool-call rounds by 17%-58% on GAIA, BrowseComp, and XBenchDeepSearch, while using Pareto-efficient SFT trajectory filtering and RL adaptive reward gating to maintain or improve accuracy.

#Agent#Tools#Fine-tuning#SlimSearcher

why featured

HKR-H/K/R pass, but this is still a single arXiv paper. The 17%-58% tool-call reduction across GAIA, BrowseComp, and XBenchDeepSearch is useful; code, authorship weight, and production validation are not disclosed.

editor take

SlimSearcher attacks the agent tax directly: 17%-58% fewer tool rounds matters more than another tiny accuracy bump on research-agent benchmarks.

sharp

SlimSearcher is useful because it treats tool use as a training liability, not a free reasoning crutch. On GAIA, BrowseComp, and XBenchDeepSearch, it cuts average tool-call rounds by 17%-58% while maintaining or improving accuracy. That is closer to deployable agent work than another benchmark table with longer traces. The mechanism is sane: SFT filters for successful, economical Pareto trajectories, then RL uses cohort-relative efficiency through Adaptive Reward Gating behind a strict correctness gate. That directly targets two failure modes we have seen all year: agents spamming search to buy accuracy, and agents gaming brevity by giving under-evidenced answers. I still want the missing cost view: total tokens, wall-clock latency, and per-task tool pricing. The abstract gives tool rounds, not the operating bill.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search

The paper audits seven open-weight and closed-source LLMs on housing location recommendations across four U.S. cities and three iterative prompting conditions; it finds racial steering varies with user identity, preference articulation, and city-specific spatial representations rather than staying fixed across models or markets.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

All three HKR axes pass: the housing-steering hook is strong, the paper gives 7 models, 4 cities, and 3 prompt conditions, and it hits fairness/compliance concerns. As a single arXiv audit, it fits featured threshold rather than a same-day must-write.

editor take

Housing bias here is not a model quirk; it is the model treating city stereotypes as preference parsing. Generic safety evals won't cover this risk.

sharp

The sharp point here is that housing discrimination in LLMs is unstable, which makes it harder to audit. The paper tests seven open-weight and closed-source LLMs across four U.S. cities and three prompting rounds; adding lifestyle preferences often increased or reshaped which models showed racial steering. That is a nasty failure mode: the model is not just attaching bias to identity labels, it is translating “commute,” “schools,” and “neighborhood feel” into place rankings differently for different users. Generic bias benchmarks are too blunt for fair-housing risk. Passing one city-prompt setup says little about another local market. The snippet does not name the models or give effect sizes, so I would not overclaim the severity. But the audit unit is right: identity × preference × city, not model × prompt.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

The researchers tested transfer with 532 human videos and 28 hours of triangulated hand labels; under low-robot-data conditions across six manipulation tasks, their cotraining recipe produced a 29.7% absolute success-rate gain.

#Robotics#Vision#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv robotics paper without a major lab launch, open-source system, or cross-source cluster. The +29.7% absolute success gain clears featured, not same-day must-write.

editor take

532 everyday videos add 29.7 points to robot manipulation, but the win is embodiment separation, not Internet video magic.

sharp

Human-video cotraining gets a real result here, but it also kills the lazy YouTube-scale story. The paper uses 532 everyday videos and 28 hours of triangulated hand labels, then reports a 29.7-point absolute success-rate gain across six low-robot-data manipulation tasks. The catch is explicit: accurate hand poses still fail to bridge the motion gap unless the vision and policy networks specialize by embodiment. That is a sharper claim than “more web video helps robots.” RT-1 and Open X-Embodiment leaned on robot-data scale across platforms; this paper isolates hand-label quality and body mismatch. I’d read this as evidence for human video as an auxiliary signal, not a cheap replacement for teleop or robot rollouts. Anyone selling everyday video as the main path around robot data is skipping the expensive part: embodiment alignment.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

The paper introduces ViSAE, a mechanistic interpretability toolbox for ViTs, using 64K images and a 16K visually grounded concept vocabulary to improve concept coverage efficiency by 20x over ImageNet, raise interpretation accuracy by 28.7%, and increase worst-group accuracy on WaterBirds by 48.2% through concept editing.

#Vision#Interpretability#Safety#ViSAE

why featured

HKR-H/K/R all pass, but this is a single arXiv vision-interpretability paper, not a mainstream LLM or agent release. The 48.2% WaterBirds worst-group gain lifts it to featured threshold.

editor take

ViSAE moves ViT interpretability from heatmap storytelling to editable concept circuits; the 48.2% WaterBirds worst-group gain is the claim to audit first.

sharp

ViSAE makes the right bet: interpretability only matters when it changes model behavior. The paper uses 64K images and a 16K visually grounded concept vocabulary, then claims 20x better concept coverage efficiency than ImageNet and 28.7% higher interpretation accuracy. The hard number is the intervention: concept editing raises WaterBirds worst-group accuracy by 48.2%, beating prior methods by 23.8%. I like the direction, but I don’t buy the safety framing yet. WaterBirds is a clean spurious-correlation benchmark, great for showing bird-background disentanglement, weak as proof for open-world vision steering. SAE work in language models already showed that readable features do not guarantee stable control. The code is public, so the first serious test is cross-dataset transfer, not another pretty concept-circuit figure.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF extends established code evaluation suites with 30 verifiable code-instruction categories and deterministic verifiers, then evaluates 31 LLMs on both functional correctness and instruction following; the paper reports that the combined score correlates best with human preference, while instruction following separates models more than pass@k-style functional checks.

#Code#Alignment#Benchmarking#SWE-IF

why featured

HKR-K/R pass: 30 instruction types, 31 models, and human-preference correlation add testable signal for code evaluation. No major-lab release or broad product impact keeps it in low featured.

editor take

SWE-IF hits the sore spot in code eval: passing tests is cheap if the model ignores how the user asked the code to look, change, and behave.

sharp

SWE-IF makes the right cut: code models are no longer judged only by whether one patch passes tests. The harder product problem is whether the model obeys messy user constraints while staying correct. The paper adds VeriCode, a taxonomy of 30 verifiable code-instruction categories, builds SWE-IF on existing suites, and evaluates 31 LLMs. Its core claim is that a composite of functional correctness and instruction following tracks human preference best. That is a better signal than another raw SWE-bench leaderboard. SWE-bench Verified pressures models to fix real issues, but users also say “do not change this API,” “keep the style,” and “touch only this file.” Models often regress once those constraints stack. My caveat: the abstract gives no correlation coefficient or per-category breakdown. If the deterministic verifiers lean on static rules, the benchmark can still miss maintainability.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Yuxiang Chen and Jun Wang compare human and DeepSeek-R1-0120 reasoning on all 30 AIME 2025 problems, annotating 10,247 reasoning steps into five categories, and report that the model often repeats local checks while making less meaningful deductive progress than human solutions.

#Reasoning#Benchmarking#Interpretability#DeepSeek

why featured

HKR-H/K/R all pass, but this is a single arXiv analysis rather than a model or product launch. The 10,247-step annotation gives real evidence, placing it just above the featured threshold.

editor take

DeepSeek-R1 can perform the shape of math reasoning, but this paper catches the reward model paying for motion instead of progress.

sharp

DeepSeek-R1-0120’s failure mode is not too little reflection; it is reflection used as wheel-spinning. Chen and Wang annotated all 30 AIME 2025 problems and 10,247 steps across Analysis, Inference, Branch, Backtrace, and Reflection. Human solutions alternate tightly between analysis and deduction. R1 keeps revisiting intermediate results and checking local arithmetic while global logic stalls. That is a nasty read on long-CoT training. The model gets paid for traces that look like reasoning, not steps that move proof state forward. OpenAI, Anthropic, and DeepSeek have all leaned into test-time compute as the scaling lever. If that compute lands in “spinning-wheel traces,” the extra tokens are just expensive noise.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems

Meta evaluated the Standard Model Template in its production ads ranking ecosystem across four global development cycles, reporting a 0.63% average cross-entropy gain at neutral serving capacity and a 92% reduction in per-model iteration engineering time.

#Inference-opt#Benchmarking#Meta#Research release

why featured

HKR-H/K/R pass: the Meta paper has production-scale numbers and a practical workflow claim. It stays in the 72–77 band because this is ML-platform engineering, not a major model or product release.

editor take

Meta’s SMT paper is more useful than another model leaderboard: 0.63% CE gain plus 92% less engineering time is an infra story, not model magic.

sharp

Meta’s SMT paper hits the boring failure mode in ads ranking: model innovation dies in propagation, not in a single architecture choice. The concrete numbers are unusually useful: four global development cycles, 0.63% average cross-entropy gain, 92% lower per-model iteration engineering time, and 6.3x higher technique-model adoption throughput. I buy the engineering result more than the CE number. In ads, 0.63% CE is already real money, but the sharper claim is the complexity drop from O(n·2^k) to O(n+k). That explains why Meta cares: hundreds of surfaces and objectives need synchronized refreshes. The missing piece is business and systems cost. The paper says neutral serving capacity, but revenue lift, latency tradeoff, and failure pockets are not disclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

STILL DEVELOPING · 1dFEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→ThinkBooster: Unified Framework for Test-Time Scaling of LLM Reasoning

ThinkBooster provides three components—a Python library, a benchmark, and an OpenAI-compatible proxy—to evaluate test-time compute scaling for LLM reasoning on mathematical and coding tasks, with code released under an MIT license.

#Reasoning#Benchmarking#Tools#ThinkBooster

why featured

HKR-H/K/R all pass, but this is a single arXiv research-tool release; the post gives components and task scope, not measured gains or adoption. It fits the 72–77 featured threshold.

editor take

ThinkBooster packages TTC into a usable stack, but the abstract gives no gain numbers; without cost curves, it risks becoming another rerank wrapper.

sharp

ThinkBooster’s useful claim is standardization, not “better reasoning.” It ships three concrete pieces: a Python library, a benchmark, and an OpenAI-compatible proxy, plus a visual debugger for reasoning trajectories. That is closer to an engineering surface than another paper showing multi-sample generation plus verifier reranking. The abstract dodges the numbers that decide adoption: accuracy lift on math and coding tasks, token spend per point gained, and latency added by the proxy. TTC already got mainstreamed through OpenAI’s o-series and Claude’s extended-thinking mode; the hard question is not whether extra inference compute helps. It is whether teams will pay for the fifth candidate answer in production. MIT licensing helps. No cost curve means I treat this as tooling infrastructure, not a capability jump.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

The paper introduces CoT-PoT self-consistency ensembling, combining Chain-of-Thought and Program-of-Thought reasoning for sampled outputs. The abstract reports a 9.3x reduction in samples required for self-consistency, with 78.6% of tasks handled using only two samples, while full implementation details and benchmark breakdowns are not disclosed in the RSS snippet.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the two-sample hook is strong, and the paper gives a CoT-PoT mechanism plus 78.6% and 9.3x claims. As a single arXiv paper needing replication, it sits at the featured threshold.

editor take

Two-sample self-consistency is the right kind of cheap trick; the 9.3x compute claim needs benchmark breakdowns before production use.

sharp

CoT-PoT hits the boring cost problem that actually blocks self-consistency: teams like SC, then hate paying for 16 or 32 samples per query. The paper claims 78.6% of tasks need only two samples and reports a 9.3x cut in SC sampling. The mechanism is simple enough to matter: pair one Chain-of-Thought path with one Program-of-Thought path, then stop early when the ensemble is confident. I would discount the 9.3x number for now. The snippet gives the abstract, not the benchmark split, model sizes, temperature settings, task mix, or PoT execution failure rate. SC gains are easy to inflate when a few datasets reward diversity cleanly. If this is mostly GSM8K-style math, it is a nice inference trick; if it holds across messy agent tasks, it becomes a real serving-cost lever.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→RhinoVLA Technical Report

RhinoVLA combines a token-efficient Qwen3-VL backbone, a continuous Action Expert, and Huixi R1 co-design to reach 11.69 Hz end-to-end inference on edge hardware, meeting a 10 Hz closed-loop control target.

#Robotics#Vision#Multimodal#HuixiAI

why featured

HKR-H/K/R all pass, led by the 11.69 Hz edge inference claim against a 10 Hz closed-loop target. The source is an arXiv technical report from a less-proven stack, so it stays near the featured threshold rather than 78+.

editor take

RhinoVLA hits 11.69 Hz on edge hardware; I trust this token-and-SoC grind more than another flashy robot demo reel.

sharp

RhinoVLA matters because it names the boring bottleneck: tokens, not “robot intelligence.” The paper says VLM visual and context tokens drive latency, with GEMM projection cost growing linearly as input tokens rise. Its stack—token-efficient Qwen3-VL, a continuous Action Expert, and Huixi R1 co-design—reaches 11.69 Hz end-to-end on edge hardware, clearing the 10 Hz closed-loop bar. I buy this direction more than another manipulation demo. The “comparable to π0.5” claim still needs the full benchmark table, but 11.69 Hz on-device is a deployment constraint, not leaderboard perfume. The 72D state-action slot space, View Registry, and robot-instance LoRA are the parts I’d inspect first once the repo lands.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool trains Qwen3-8B and Qwen2.5-VL-7B with SFT, RL, and KL-regularized Plackett-Luce ranking, using a 200k dataset covering 1,000+ tools and 100+ tasks; across ten benchmarks, it reports average gains of 6.4% in math and science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding.

#Agent#Tools#Reasoning#Qwen

why featured

HKR-K and HKR-R pass: the paper gives concrete training mechanisms and 6.4%/7.7% gains, tied to agent tool-use reliability. As a single arXiv paper without deployment or broad pickup, it stays near the featured threshold.

editor take

AutoTool treats tool choice as trainable ranking, not another agent loop hack; the 6–8% gains still don’t prove it survives real tool drift.

sharp

AutoTool’s useful move is pushing tool selection into the training objective, not claiming an agent can browse 1,000+ tools. The recipe is concrete: SFT, RL, then KL-regularized Plackett-Luce ranking on Qwen3-8B and Qwen2.5-VL-7B. It reports gains across ten benchmarks: 6.4% on math and science, 7.7% on code, 6.9% on multimodal tasks. I buy the direction, not the victory lap. Agent failures in production often come from schema drift, permission breaks, dirty tool returns, and latency budgets. The abstract says AutoTool generalizes to unseen tools, but it does not show real API drift, recovery behavior, or serving cost. Compared with Gorilla or ToolLLM-style tool-use work, this looks like a cleaner preference-ranking formulation for choosing tools. The paper lives or dies on the released data and eval setup.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

STILL DEVELOPING · 1dFEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→RECAP benchmark evaluates continual prompt optimization methods

RECAP evaluates prompt optimization methods under a strict adapt-then-test protocol, covering six methods, four LLMs, and three evolving-constraint schedules; the study finds no significant performance gains after adaptation while the methods add higher latency.

#Agent#Benchmarking#Tools#RECAP

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark paper, not a broad product shift. The 6-method, 4-LLM, 3-schedule setup gives enough practitioner signal for featured, not p1.

editor take

RECAP lands a clean hit: six prompt optimizers, four LLMs, three constraint schedules, no significant gain after adapt-then-test, plus higher latency.

sharp

RECAP drags prompt optimization out of demo comfort and into production failure mode: a constraint changes, and the next interaction must comply. No feedback loop first. The paper tests six methods, four LLMs, and three evolving-constraint schedules. The result is ugly: no significant performance gain after adaptation, with higher latency. That matters for agent stacks because many teams treat automatic prompt editing as the cheap continual-adaptation layer for tool calls, disclosure rules, and compliance thresholds. RECAP’s setup is harsher and closer to deployment: give only the new constraint, then test. Current offline or reactive optimizers do not survive that protocol. I don’t buy the “prompt optimizer as lightweight learning system” story here; under RECAP it looks more like a slower config editor.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

The paper evaluates GPT-3.5, Llama3, ClinicalBERT, BioLlama3, and BioBERT on MedMCQA under natural and adversarial prompt perturbations, finding that minor wording changes can alter clinical advice, while adversarial prompts can produce harmful outputs such as incorrect dosages or omitted critical findings.

#Safety#Benchmarking#Reasoning#OpenAI

why featured

HKR-H/K/R all pass, but the post gives no error rates, sample size, or reproducible setup details. This is a useful safety benchmark, not a same-day must-write item.

editor take

Healthcare LLM risk is not just wrong answers; it is advice flipping under small phrasing changes. That is a systems defect, not a UX wrinkle.

sharp

Medical specialization is not a safety moat; ClinicalBERT, BioLlama3, and BioBERT still break under syntactic reordering and misleading context on MedMCQA. The paper tests GPT-3.5, Llama3, and three medical models, then lands on the ugly part: small wording changes can alter clinical advice, and adversarial prompts can trigger wrong dosages or omitted critical findings. The useful hit here is against the comfort story around domain tuning. The abstract does not give per-model drops or rankings, so we cannot say whether BioLlama3 is worse than GPT-3.5. The mechanism is already enough for deployment review. A healthcare stack that only reports static MedMCQA accuracy is under-testing the failure mode clinicians will actually hit.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Generative Models Erode Human Temporal Learning Through Market Selection

Wenjun Cao proposes Human Temporal Learning and a “value collapse” pathway in an ICML 2026 position paper, using a costly-inspection framework to map verification erosion across academic publishing, legal practice, content platforms, and software security.

#Alignment#Safety#Wenjun Cao#arXiv

why featured

HKR-H/K/R all pass: the paper offers a provocative safety-market mechanism, not just benchmark noise. The excerpt is still abstract-level arXiv metadata with no empirical numbers or adoption signal, so it stays near the featured threshold.

editor take

This frames AI labor damage as verification economics: once checking costs exceed upside, years of human learning lose to cheap surface-equivalent output.

sharp

Cao’s sharp claim is that better alignment can worsen the market pressure on HTL work. The mechanism is clean: generative outputs become surface-similar to work produced through years of human learning, so checking whether the producer actually learned the domain becomes costly. Once inspection fails the cost-benefit test, buyers pay for observable output, and trained humans compete against near-zero marginal-cost generation. That is stronger than the usual “AI pollutes content platforms” line because the paper applies the same verification-erosion pattern to 4 domains: academic publishing, legal practice, content platforms, and software security. I do have a reservation: the captured text gives the abstract and arXiv metadata, not experiments, sample sizes, or operational criteria for the four stages. As an ICML 2026 position paper, this is a useful risk vocabulary, not a settled causal result.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Dash2Sim: Closed-Loop Driving Simulation from In-the-Wild Dashcam Videos

Dash2Sim converts monocular dashcam videos into metric, geo-referenced 4D driving logs and builds ROADWork4D with 4,244 scenes, 2.7M 3D objects, and coverage across 17 cities.

#Robotics#Vision#Benchmarking#Dash2Sim

why featured

HKR-H and HKR-K pass: converting in-the-wild monocular video into closed-loop simulation is a concrete hook, and ROADWork4D gives hard scale. HKR-R is weak because this remains an AV simulation paper for a narrower audience.

editor take

Dash2Sim turns dashcam scraps into simulation fuel; the bet is map-verified 4D logs, not another curated AV dataset.

sharp

Dash2Sim’s sharp move is plugging cheap monocular dashcam video into closed-loop simulation, not publishing another polished AV dataset. ROADWork4D has 4,244 scenes, 2.7M 3D objects, and 17 cities; the stronger hook is annotation-free verification against an independent map. That attacks the 3D-labeling cost wall directly. The damning result is planner failure. On 2,201 ROADWork4D-CL scenes, rule-based and hybrid planners beat learning-based ones, yet all miss lane changes through temporary work-zone channels. That says more than the reported 19% gain in novel-view perceptual metrics. Waymo and nuScenes-style clean corpora never gave planners enough of this messy geometry; dashcam mining smells like the cheap path to long-tail driving simulation.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Inferring the Size of Large Language Models From Popular Text Memorization

The paper proposes a black-box method that uses only submitted text fragments and observed next-token predictions to infer conservative lower bounds on LLM parameter counts from memorization of popular texts, then validates pairwise tests and a PCA-based scaling-law estimator on open-weight models before applying them to closed-weight models.

#Benchmarking#Interpretability#Research release#Benchmark

why featured

HKR-H/K/R all pass: black-box model sizing is clickable and testable. As a single arXiv paper with no disclosed numeric results or closed-model case, it sits just above the featured threshold.

editor take

Using public-text memorization to weigh closed models is a dirty but useful trick; if it holds, parameter-count secrecy gets very flimsy.

sharp

This paper turns parameter secrecy into a black-box audit problem: submit text fragments, read next-token predictions, and infer conservative lower bounds for closed LLM size. The concrete hook is strong: popular texts across classical literature, religious works, and foundational documents become an accuracy profile, then feed pairwise statistical tests and a PCA-based scaling-law estimator. I buy half of it. Widely circulated texts are likely present across pretraining corpora, so the probe has a repeatable anchor. But the signal mixes model capacity, data duplication, dedup policy, post-training behavior, and sampling interface quirks. That is not a clean scale. The useful part is pressure: OpenAI, Anthropic, Google, and xAI have spent a year hiding parameter counts behind product names. A method like this does not need perfect tonnage to make their internal model ladder externally testable.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·08

→Evidence Graph Consistency in RAG: A Model-Dependent Analysis of Hallucination Detection

The paper proposes Evidence Graph Consistency, which builds a local evidence graph per RAG response and computes five structural consistency measures; on 5,767 RAGTruth QA responses across six LLMs, the signals align for Llama-2 but reverse for GPT-4, GPT-3.5, and Mistral-7B.

#RAG#Benchmarking#Safety#Llama-2

why featured

HKR-H/K/R all pass, but this is a single arXiv RAG-detection paper with no tool release, deployment proof, or cross-source cluster, so it sits in the 72–77 research-signal band.

editor take

RAG hallucination detection takes another hit: EGC works on Llama-2, then flips on GPT-4, GPT-3.5, and Mistral-7B.

sharp

The painful part is not that EGC is clever. It is that model-agnostic RAG hallucination detection keeps losing ground. The paper runs local evidence graphs on the full RAGTruth QA split: 5,767 responses across six LLMs, with five structural consistency measures. Llama-2 follows the expected diagnostic direction. GPT-4, GPT-3.5, and Mistral-7B flip the signal. I buy this negative result more than another leaderboard bump. Too many RAG eval stacks still treat embedding similarity and claim overlap as portable signals. Stronger models compress, paraphrase, and bridge evidence; weaker ones stay closer to retrieved spans. So “more graph-consistent” can mean different failure modes by model family. If EGC needs per-family calibration, production hallucination guardrails should stop selling themselves as model-agnostic.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

MAGE runs one exact attention pass at the first denoising step and reuses top-k index sets, matching Exact Attention at k=512 across three block-diffusion families on LongBench and reaching up to 6.82x end-to-end speedup at 128K context.

#Inference-opt#Benchmarking#MAGE#Quest

why featured

HKR-H/K/R pass, led by a concrete 6.82x 128K inference claim. The narrow block-diffusion-LLM scope keeps it below featured despite clear practitioner value.

editor take

MAGE hits 6.82x at 128K; the wild part is one All-[MASK] attention pass replaces later search.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Perplexity Can Miss SAE Feature Damage Under Quantization

The paper uses a frozen SAE to compare RTN-quantized activations on Pythia-70M and Gemma-2-2B, finding that Gemma-2-2B at INT7 improves perplexity while degrading 18.7% of active SAE features, and under sliding-window INT6 evaluation only 51.3% of active features survive.

#Interpretability#Inference-opt#Benchmarking#Pythia

why featured

HKR-H/K/R pass: the title has a counterintuitive metric failure, with 18.7% and 51.3% as testable numbers. Single arXiv paper plus SAE/RTN specificity keeps it below featured.

editor take

Gemma-2-2B INT7 improves perplexity yet damages 18.7% of SAE features; PPL is bad cover for quantized interpretability.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis and Decision-Making in Quantitative Finance

PandaAI tests a closed-loop neuro-symbolic LLM agent on CSI 300 stock data, reporting 18.2% higher Rank IC and 25.7% lower maximum drawdown than state-of-the-art time-series models.

#Agent#Reasoning#Fine-tuning#PandaAI

why featured

HKR-H/K/R pass, but this is a single arXiv quant-finance paper with limited authority and reproducibility detail. Defaulting to the lower band gives 70 and keeps it in all.

editor take

PandaAI reports 18.2% higher Rank IC on CSI 300; hold the finance-agent hype until splits and costs are disclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions

CrowdMath contains 164 expert-annotated progress chains from the 2016-2025 MIT PRIMES-AoPS CrowdMath program, and six frontier models reach 83-88% accuracy on next-post prediction while the best model scores only 0.42 macro-F1 on post-role classification.

#Reasoning#Benchmarking#MIT PRIMES#Art of Problem Solving

why featured

CrowdMath adds a concrete reasoning benchmark with 164 progress chains and two model-result contrasts, so HKR-K is strong and HKR-R is moderate; the dry paper framing keeps it below featured.

editor take

CrowdMath has 164 chains, yet role classification tops out at 0.42 macro-F1; MATH-style scores miss collaboration literacy.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

The paper studies data-constrained pretraining with MIR on 72M to 1.4B parameter models and proposes SoftQ; SoftQ fits repeated-data experiments better than additive scaling laws and estimates MIR’s gain as roughly 1.3x more unique training data.

#Benchmarking#Research release#Open source

why featured

HKR-K is solid: 72M–1.4B models, MIR, SoftQ, and a 1.3x-data-equivalence claim. HKR-R hits data scarcity and training cost, while HKR-H is weak and the paper remains specialist, so it stays in all.

editor take

SoftQ prices MIR at 1.3x unique data; capped at 1.4B, this is not a rescue plan for frontier pretraining.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

The paper proposes TRACE for monitoring long-horizon LLM agent trajectories, using a Triage-Inspect-Judge loop and reporting 0.713 aggregate F1 and 0.844 recall across ten SHADE-Arena task domains.

#Agent#Reasoning#Safety#TRACE

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and metrics, and agent monitoring matters to builders. It stays below featured because this is a single arXiv paper with no code or production validation disclosed.

editor take

TRACE hits 0.713 F1 on 10 SHADE-Arena domains; long-horizon agent monitoring is finally patching cross-step evidence.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Self-Evolving LLM Agents with In-Distribution Optimization

Q-Evolve evaluates a self-evolving LLM agent framework on AlfWorld, WebShop, and ScienceWorld; it trains an in-distribution critic from expert demonstrations plus agent trajectories, derives step-wise process rewards through advantage estimation, and reports stronger sample efficiency, robustness, and task performance than unnamed strong baselines.

#Agent#Reasoning#Research release#Benchmark

why featured

HKR-H/K/R all pass, but the article only gives arXiv-summary facts and no gain numbers, task difficulty, or lab authority. Defaulting to the lower band keeps it in all, not featured.

editor take

Q-Evolve tests 3 environments and labels step rewards via an IQL critic; unnamed strong baselines make “self-evolving” hard to buy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

The paper evaluates four Qwen2.5-7B-Instruct specialist agents on high-disagreement MedQA and MedMCQA subsets; on MedQA-250, the full system reaches ECE 0.091, a 74.4% reduction versus the single-specialist baseline, with AUROC 0.630 and 59.2% accuracy.

#Agent#Reasoning#Benchmarking#Qwen

why featured

HKR-K and HKR-R pass: 4 Qwen2.5-7B specialists and ECE 0.091 give testable signal, and medical calibration hits safety. HKR-H is weak, and this remains a single arXiv benchmark paper.

editor take

Four Qwen2.5-7B specialists cut MedQA-250 ECE to 0.091; at 59.2% accuracy, clinical deferral talk is premature.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails

SEAM detects scriptedness in interview speech using 8-second windows, reaches 0.971±0.004 ROC-AUC on an external interview-domain evaluation set, and reduces the quantized model footprint to 41.8MB.

#Audio#Benchmarking#Inference-opt#SEAM

why featured

HKR-H/K/R pass, but this is a single arXiv paper with metrics and size only; deployment cost, false-positive burden, and real platform validation are not disclosed, so it stays at the top of 60–71.

editor take

SEAM hits 0.971 AUC on 8-second audio; I like the shortcut-learning ablation more than another inflated audio benchmark.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

The paper studies step-wise refusal dynamics in autoregressive and diffusion language models, showing that diffusion remasking can recover from harmful intermediate generations and that switching from AR to diffusion sampling improves jailbreak robustness under fixed weights; its SRI detector trains only on benign signals, while the abstract does not disclose sample size.

#Safety#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv paper with no sample size disclosed and no cross-source debate shown. Research-release signal fits 70, below featured.

editor take

Diffusion remasking recovers from harmful intermediates, but sample size is undisclosed; fixed-weight robustness would push safety work past token text.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

The paper tests evidence-order sensitivity on 3,059 grounded items from FEVER, HotpotQA, NQ-Open, PopQA, and Controls, introducing QMV bounds and an ISR=1 answer/abstain gate; in a 528-item held-out audit, the gate reports 0.0-0.7% hallucination and 20.6-27.9% abstention with 95% confidence intervals.

#Reasoning#Alignment#Benchmarking#arXiv

why featured

HKR-K is strong with concrete numbers and mechanisms; HKR-R applies to evidence compression and hallucination tradeoffs. A single arXiv paper on binary adjudication is useful but not same-day featured material.

editor take

ISR=1 reports 0.0–0.7% hallucination on 528 audits; the 20.6–27.9% abstention makes it a verifier tool, not open-gen safety.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Closed-Form Spectral Regularization for Multi-Task Model Merging

The paper proposes SWUDI and SWUDI-A for training-data-free multi-task model merging, replacing iterative solvers with closed-form spectral filtering; across four general benchmarks and one multimodal merging benchmark covering VQA, Geometry, Chart, OCR, Grounding, and modality merging, the methods cut wall-clock time by 28-72x and peak GPU memory by up to 50%.

#Multimodal#Inference-opt#Benchmarking#arXiv

why featured

HKR-H/K/R pass on the 28–72x speed claim, closed-form mechanism, and GPU-memory cost angle. The topic is still a niche model-merging method paper, so it stays below featured.

editor take

SWUDI turns each-layer merging into one eigendecomposition and cuts time 28-72x; model merging finally looks deployable.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SafeGene: Reusable Adapters for Transferable Safety Alignment

SafeGene represents safety as a reusable adapter, recalibrates layer-wise coefficients with few-shot data, and reduces harmful response rates across multiple model families and downstream tasks while preserving task performance.

#Fine-tuning#Alignment#Safety#SafeGene

why featured

HKR-H/K/R pass, but the body only gives the mechanism outline; reduction size, model list, and reproducible setup are not disclosed. Treat it as an interesting arXiv safety paper, not featured.

editor take

SafeGene makes safety a reusable adapter; no reduction numbers disclosed, but the engineering angle beats re-aligning after every fine-tune.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Bit-Exact AI Inference Verification Without Performance Tradeoffs

arXiv:2606.00279v2 proposes bit-exact re-computation for AI inference verification across vLLM, HF transformers, and multiple NVIDIA GPU variants, under the condition that the backend calls no atomic functions and the auditor has the right information for re-computation.

#Inference-opt#Safety#arXiv#vLLM

why featured

HKR-H/K/R pass via a concrete no-latency verification claim, stack coverage, and operator trust costs. Single arXiv source and low-level inference focus keep it below featured.

editor take

The paper gets bit-exact recomputation for vLLM/HF only without atomics; governance hype should wait on backend constraints.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Reinforcement Learning from Rich Feedback with Distributional DAgger

The paper introduces Distributional DAgger for training reasoning models from rich feedback, replacing RLVR’s one-bit final-answer reward. It reports improvements over RLVR and self-distillation baselines across three domains: scientific reasoning, coding, and hard math.

#Reasoning#Code#Fine-tuning#Research release

why featured

HKR-H/K/R pass, but the article gives no result numbers, release artifact, or reproducibility details. This is useful training-method research, not a same-day must-write item.

editor take

Distributional DAgger replaces 1-bit RLVR rewards with rich feedback; I buy it, RLVR’s signal poverty needed a formal teardown.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry

arXiv:2603.26846v2 proposes Stability Asymmetry Regularization, which penalizes the distributional gap between internal CoT stability and external response stability under perturbation; the abstract says experiments identify and suppress intrinsic deception, but the RSS snippet does not disclose benchmark names or metric values.

#Reasoning#Alignment#Safety#Research release

why featured

HKR-H/K/R pass, but the body gives the SAR mechanism without metrics, model scale, or reproducible setup. A useful arXiv alignment paper, not enough for featured.

editor take

SAR penalizes CoT/response stability gaps under perturbation, but no benchmarks or metrics are disclosed; treat it as a testable safety-signal hypothesis.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→BigMac: Breaking the Pareto Frontier of Compute and Memory in Multimodal LLM Training

BigMac uses a dependency-safe nested pipeline for multimodal LLM training, reduces encoder and generator activation memory complexity to O(1), keeps LLM activation memory unchanged, and reports 1.08×-1.9× training speedups over baseline systems across multiple MLLMs and workloads.

#Multimodal#Inference-opt#BigMac#Research release

why featured

HKR-H/K/R pass, but this is an arXiv training-systems paper with mechanism and speedup numbers only; no open-source artifact, replication details, or adoption signal, so it stays in all.

editor take

BigMac cuts encoder/generator activation memory to O(1); 1.08×-1.9× speedup is modest, but the systems trick looks usable.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

The paper evaluates hate moderation on paired English and Tamil-English code-mixed content, where thresholds tuned on clean English produce a 0.265 decision flip rate and raise review rate from 0.138 to 0.297.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: paired tests and flip-rate numbers give the paper concrete value for moderation teams. It remains a single arXiv study in a narrow workflow, below the featured threshold.

editor take

Code-mixing drives 0.265 action flips and 0.297 review rate; English-tuned moderation thresholds dump multilingual risk into human queues.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Scalable GANs with Transformers

The paper introduces GAT, a pure transformer GAN trained in a VAE latent space, and stabilizes S-to-XL scaling with lightweight intermediate supervision and width-aware learning-rate adjustment; GAT-XL/2 reaches 2.18 FID on class-conditional ImageNet-256 generation in 60 epochs, reported as 4x fewer epochs than strong baselines.

#Vision#Multimodal#Benchmarking#arXiv

why featured

HKR-H and HKR-K pass: the GAN comeback angle is clickable, and the post gives FID 2.18 plus training mechanisms. HKR-R is narrow, and this is a single arXiv paper, not same-day must-write news.

editor take

GAT-XL/2 hits 2.18 FID on ImageNet-256 in 60 epochs; GANs aren’t dead, but VAE latents carry a lot here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→RePo: Language Models with Context Re-Positioning

RePo continues pre-training on OLMo-2 1B and 7B, using a differentiable module f_phi to assign token positions, and reports gains on noisy-context, structured-data, and longer-context tasks while keeping competitive short-context performance.

#Reasoning#Memory#Benchmarking#SakanaAI

why featured

HKR-H/K/R pass: the mechanism is novel, model sizes are concrete, and long-context reliability matters. It stays in 60–71 because the abstract gives no code, gain sizes, or production evidence.

editor take

RePo is tested only via OLMo-2 1B/7B continued pretraining; learnable positions look sane, but costs and strong baselines are missing.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

MoDA improves visual grounding in instructional MLLMs with instruction-guided channel-wise multiplicative modulation, not token-level additive selection. The paper evaluates it on 12 benchmarks across LLaVA-1.5, LLaVA-MoRE, and Qwen3-VL, reporting +12.0 MMVP for LLaVA-1.5 and under 1% extra FLOPs.

#Multimodal#Vision#Fine-tuning#LLaVA

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and efficiency numbers. HKR-H fails, and the item remains a specialized architecture paper without product impact or external replication.

editor take

MoDA gains across 12 benchmarks at <1% FLOPs; channel-wise modulation looks like a cheap visual-attention brake for MLLMs.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

The paper proposes MOPO, a constrained KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives through tunable safety thresholds, using pairwise preferences without point-wise rewards. Experiments show MOPO recovers Pareto-optimal policies on synthetic benchmarks and Pareto-dominates baselines when fine-tuning multi-billion-parameter models on human-preference data.

#Alignment#Fine-tuning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: MOPO has a concrete mechanism and test claims for RLHF/alignment design. HKR-H is weak, and this is a single arXiv paper without code, top-lab backing, or cross-source discussion, so it stays in 60–71.

editor take

MOPO constrains secondary goals with thresholds and claims Pareto wins over DPO/IPO; I buy the setup, not the undisclosed dataset details.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

TALAN inserts a sequence-conditioned latent side path into the transformer residual stream and co-trains it with LoRA or DoRA in one SFT loop. Across four Qwen3 backbones and four STEM/code benchmarks, it adds +1.41 points over LoRA and +1.85 over DoRA, with under 1% trainable parameters and 1.01-1.02x inference overhead versus matched LoRA.

#Fine-tuning#Reasoning#Code#Qwen

why featured

HKR-H/K/R pass on the LoRA-overhead comparison and concrete benchmark numbers, but this is still a single PEFT paper with +1.41 average gain and no disclosed open-source or adoption signal, so it stays in all.

editor take

TALAN is nonnegative across 16 Qwen3 cells and +1.41 over LoRA; seed variance says don’t bury LoRA yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation

The paper formalizes bias as symmetry breaking and applies loss-based regularization on four synthetic datasets, reducing fairness violations by more than 90% with about a 5% accuracy cost.

#Alignment#Safety#Benchmarking#arXiv

why featured

HKR-H/K/R all pass, but the evidence is limited to 4 synthetic datasets with no real-world model validation. Solid safety/alignment research signal, not a same-day must-write.

editor take

The paper cuts violations over 90% on 4 synthetic sets. Bit-flip fairness is neat, but causal confounding remains untouched.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Leveraging Error Diversity in Group Rollouts for Reinforcement Learning

The paper proposes EDAS, a post-hoc advantage-shaping method for RLVR that adjusts incorrect rollouts using intra-group error diversity, and reports a 6.29-point average gain over DAPO on Qwen3-8B across seven math benchmarks.

#Reasoning#Alignment#Benchmarking#Qwen

why featured

HKR-K is clear: EDAS reweights erroneous rollout advantage by within-group error diversity and beats DAPO by 6.29 points on seven Qwen3-8B math benchmarks. The scope is narrow RLVR training, with no product or cost hook, so it stays in the interesting band.

editor take

EDAS beats DAPO by 6.29 points on Qwen3-8B across seven math sets; using error distribution for advantage shaping is pragmatic.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

The study compares four ideology-annotation paradigms on AllSides articles using Llama-3.3-70B sentiment labels; fine-tuned GPT-4o-mini reaches the highest F1 at 72.48, yet uniquely produces significant community-level treatment effects and direct effects absent from human annotations.

#Fine-tuning#Benchmarking#Alignment#AllSides

why featured

HKR-H/K/R pass: the paper links sentiment to perceived ideology and reports F1=72.48 plus an LLM-only coupling. It stays in 60–71 because this is a single arXiv study, with no product, model, or deployment change.

editor take

Fine-tuned GPT-4o-mini hits F1=72.48, then invents sentiment–ideology coupling humans lack; silver-label evals need causal checks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→MACD: Model-Aware Contrastive Decoding via Counterfactual Data

MACD uses a Video-LLM’s feedback to locate object regions linked to hallucination. It reduces hallucination on EventHallusion, MVBench, Perception-test, and Video-MME while maintaining or improving accuracy.

#Multimodal#Inference-opt#Benchmarking#Qwen

why featured

HKR-K/R pass: the paper offers a concrete decoding mechanism and a 4-benchmark test claim, with relevance to multimodal reliability. HKR-H is weak and effect sizes are not disclosed, so it stays in the 60–71 band.

editor take

MACD cuts hallucination on 4 video benchmarks, but deltas are undisclosed; model-feedback object targeting beats random CD noise.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→The Identity Trap in EEG Foundation Models: A Diagnostic Audit

The paper introduces FMScope to audit three EEG foundation models across four datasets, finding subject variance at 13-89x a random null in 12/12 pairs. Fine-tuning raises it by 10-63 percentage points, while erasing the linear subject axis improves label decoding by 6-12 points in primary within-subject cells.

#Benchmarking#Fine-tuning#Interpretability#LaBraM

why featured

HKR-H/K/R pass: the hook is identity leakage, and the paper gives 12/12 pairs plus 13-89x subject variance. EEG foundation models are vertical, so impact stays in 60-71 rather than featured.

editor take

FMScope audits 3 EEG FMs: subject variance hits 13-89x null in 12/12 pairs; treat high EEG scores as identity leakage first.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing

The paper proposes DARS, a distribution-aware supervision framework for LLM routing. It replaces single-response labels with observations over semantically equivalent query formulations and stochastic generations, and experiments across diverse tasks show single-shot labels mislead model selection while distribution-aware labels make learned routing behavior more stable.

#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K/R are present but modest: DARS reframes routing supervision from one sampled output to capability distributions. The post gives the mechanism, but not experiment scale, model list, or gains, so it stays in the 60-71 all band.

editor take

DARS labels routing via query rewrites and stochastic generations; no task count or lift disclosed, so I read it as anti-single-shot eval ammo.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Reinforcement Learning from Denoising Feedback

The paper introduces RLDF for estimating policy loss in diffusion language models using rollout and training feedback, and evaluates it on two DLM architectures, LLaDA and Dream, across multiple reasoning benchmarks.

#Reasoning#Benchmarking#LLaDA#Dream

why featured

HKR-H and HKR-K pass: RLDF gives a concrete DLM policy-loss mechanism and tests it on LLaDA, Dream, and reasoning benchmarks. HKR-R is weak, and the item stays in the 60–71 research-signal band.

editor take

RLDF reports gains on LLaDA and Dream, but no deltas in the snippet; DLM RL still lives or dies on loss estimation.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Adaptive Pluralistic Alignment: A Pipeline for Dynamic Artificial Democracy

The paper introduces APA, a three-stage alignment pipeline using low-rank reward basis decomposition, social-choice voting, and new annotator weights over fixed bases; it tests a proof of concept on the PRISM multi-user alignment dataset and releases code and preference datasets.

#Alignment#Fine-tuning#PRISM#RachelFreedman

why featured

HKR-H/K/R all pass, but this is an arXiv proof of concept on PRISM with no production replacement claim or major-model result; keep it in all below the 72 featured line.

editor take

APA tests on PRISM; I buy the low-rank jury mechanism, but “artificial democracy” is still lab governance.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Scalable Joint Resource Allocation for SLO-Constrained LLM Inference in Heterogeneous GPU Clouds

The paper presents an SLO-constrained LLM inference allocation framework that jointly optimizes model choice, GPU provisioning, parallelism, and routing; on Azure LLM Inference Trace experiments, GH finds feasible solutions within 1 second, while AGH reaches near-optimal results within 3 seconds and remains lower-cost under up to 1.5x delay and accuracy inflation.

#Inference-opt#Benchmarking#Azure#Research release

why featured

HKR-K/R pass and HKR-H fails. The paper gives testable Azure Trace, 3s near-optimal, and 1.5x pressure claims for LLM inference cost/SLO, but its academic infra angle keeps it below featured.

editor take

AGH hits near-optimal on Azure Trace in 3 seconds; I buy the setup—MILP is too slow as an online scheduler baseline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→On the Importance of Multiple Training Seeds for Evaluating Machine Unlearning

The paper argues that machine unlearning evaluations need multiple training seeds; experiments on image classification, federated learning-to-rank, and large language models show that single training-seed setups can produce non-representative results.

#Safety#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: seed sensitivity in machine-unlearning eval is a useful methodological warning across three settings. The post gives no effect sizes or reproducible setup, so it stays in the 60–71 band.

editor take

Single training seeds skew unlearning evals; stop laundering benchmark confidence with extra unlearning seeds.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Sparsely gated tiny linear experts

The paper proposes sgatlin, replacing transformer feedforward layers with sparsely gated linear single-neuron experts, and reports lower language-model perplexity under an isoflop comparison across compute budgets.

#Inference-opt#Interpretability#Research release

why featured

HKR-H/K/R pass via the tiny-expert mechanism and compute angle, but the item gives no perplexity delta, model scale, code, or replication details; a single arXiv paper stays in the 60–71 band.

editor take

sgatlin replaces every FFN with single-neuron linear experts and lowers isoflop perplexity; I’d wait for replication before burying MoE.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TabSwift: An Efficient Tabular Foundation Model with Row-Wise Attention

TabSwift uses a row-wise attention-only backbone for tabular in-context learning, adds gated attention stabilization, learnable register tokens, and adaptive layer-wise early exit for latency-sensitive inference.

#Reasoning#Inference-opt#TabSwift#TabPFN

why featured

HKR-K and HKR-R pass: the mechanisms are concrete, and efficient tabular foundation models matter to some practitioners. No benchmark numbers, open-source artifact, or production-replacement claim, so it stays in the 60–71 band.

editor take

TabSwift adds row-wise attention and layer-wise early exit, but gives no latency numbers here; I don’t buy “more efficient” yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

The paper benchmarks LM-based lossless compression on full-fidelity audio across music, speech, and bioacoustics, with 16kHz-48kHz sampling and 8/16/24-bit depths. Trilobyte changes token vocabulary scaling from O(2^b) to O(1), making 24-bit LM-based compression tractable, while gains shrink beyond 8-bit.

#Audio#Benchmarking#Trilobyte#FLAC

why featured

HKR-H and HKR-K pass: the audio-compression use case is novel, with sample-rate, bit-depth, and Trilobyte scaling details. The topic stays niche research, not a product or competitive industry move, so it sits in all.

editor take

Trilobyte cuts 24-bit vocab from 16.7M to O(1); gains shrink with bit depth, so don't bury FLAC yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Discovering Interpretable Algorithms by Decompiling Transformers to RASP

The paper presents a method for extracting RASP programs from trained Transformers by faithfully re-parameterizing the model and applying causal interventions to find a small sufficient sub-program. Experiments on small Transformers trained on algorithmic and formal-language tasks often recover simple interpretable RASP programs from length-generalizing models.

#Interpretability#Reasoning#Research release

why featured

HKR-H and HKR-K pass: the decompilation angle is novel, and the paper gives a concrete reparameterization plus causal-intervention pipeline. HKR-R is weak because evidence is limited to small algorithms and formal-language tasks.

editor take

This decompiles small Transformers into RASP subprograms; narrow algorithmic tasks, but far stronger than attention-map interpretability.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

The paper introduces GRASP, which reframes data attribution as subset-level counterfactual utility prediction and models interactions with a quadratic geometric penalty; subset-retraining evaluations report over 2× higher task-level rank correlation and nearly 10× lower upfront artifact construction cost than scalable baselines.

#Benchmarking#GRASP#arXiv#Research release

why featured

HKR-K and HKR-R pass: the paper gives concrete mechanisms plus 2x/10x numbers and maps to pretraining data cost. HKR-H is weak, and a single arXiv paper stays in the lower all band.

editor take

GRASP reports over 2× rank-correlation gains on subset counterfactuals; I buy the setup, single-example attribution is tired.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning

The paper evaluates single-neighbor retain sets, 1:1 sampling, and cyclic sampling in LLM unlearning, then proposes MELU, a modular entity-level strategy, with diverse neighbor sets to balance forget efficacy and model utility.

#Fine-tuning#Safety#Benchmarking#Research release

why featured

HKR-K has concrete sampling mechanisms and the MELU strategy; HKR-R connects to LLM deletion, compliance, and safety governance. HKR-H is weak, and no experimental numbers or code are disclosed, so it stays in the 60–71 band.

editor take

MELU attacks single-neighbor retain sets and 1:1 sampling; unlearning benchmarks need fewer toy retain splits.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Causal Evaluation of Membership Inference Attacks

The paper frames membership inference attack evaluation as causal inference, defines memorization as the causal effect of including a point in training, identifies interference in one-run protocols and distribution-shift confounding in zero-run protocols, and proposes estimators for multi-run, one-run, and zero-run settings with non-asymptotic consistency guarantees.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K is strong and HKR-R is moderate: the paper gives MIA evaluation a testable causal frame, but only the abstract is available and experiment scale, benchmark results, and adoption signals are absent.

editor take

MIA evaluation becomes causal effect estimation; one-run has interference, zero-run has shift, so privacy papers owe less shiny AUC.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Spectral Scaling Laws of Muon

The paper tracks Muon momentum singular-value quantiles in 77M to 2.8B-parameter models and finds mid-early layers scale mildly at about M^-0.25, while some late layers scale up to M^-0.96, putting the standard 5-step Newton-Schulz setup into a failure regime at frontier scale.

#Fine-tuning#Inference-opt#Benchmarking#Muon

why featured

HKR-K is strong, while HKR-H and HKR-R are weak; the Muon scaling result helps training researchers, but reads like numerical optimization for most AI practitioners. Keep it in 60-71, not featured.

editor take

Muon late-layer singular values fall as M^-0.96; 5-step NS breaks at frontier scale, so layer-aware optimizer tuning stops being optional.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path

The paper analyzes the Rectified Flow interpolation path Xλ and reports a bell-shaped reconstruction gap between train and test samples, validated on audio and images, then uses the λ-resolved signal for a membership inference attack.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-H/K/R pass, but this is a technical arXiv privacy paper for generative-model safety readers. No tool release, incident, or flagship model impact keeps it in the 60–71 band.

editor take

Rectified Flows leak membership signals along Xλ; the bell-shaped reconstruction gap is a sharper privacy probe than final samples.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for LLMs in Long-Tail Education

Elmes* builds Edu-330 for educational LLM evaluation, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with more than 1,000 second-level indicators and a multi-agent teacher-student-judge evaluation engine.

#Agent#Benchmarking#Reasoning#Tao Liu

why featured

HKR-K and HKR-R pass: the paper gives a reusable benchmark scale and addresses LLM evaluation in education. Single arXiv paper, non-major lab, and a dry academic title keep it in the 60–71 band.

editor take

Elmes* covers 330 education scenarios; the LLM-judge self-preference is the part that should make evaluators pause.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→On the Geometry of On-Policy Distillation

The paper compares OPD, SFT, and RLVR with parameter-space diagnostics, finding that OPD updates fewer weights than SFT and rapidly locks cumulative updates into a narrow low-dimensional subspace.

#Reasoning#Fine-tuning#Research release

why featured

This is a useful training-methods paper: HKR-K lands via a concrete geometry claim, and HKR-R lands for fine-tuning/RL practitioners. HKR-H is weak, and the available feed gives only abstract-level detail, so it stays below featured.

editor take

OPD locks early into a low-rank update channel; SFT degrades under the same constraint. I buy this over hand-wavy reasoning distillation talk.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→The Geometry of Last-Layer Model Stealing

arXiv:2606.06854 states exact conditions for perfectly copying a transformer network’s final layer. The paper also proves that a hidden network cannot be fully reverse engineered from final outputs alone.

#Safety#Interpretability#arXiv#Research release

why featured

HKR-H/K/R all pass, but this is a single theoretical arXiv paper with no disclosed experiment scale, code, or real API reproduction setup. Model-stealing security is relevant, yet not featured-level.

editor take

2606.06854 gives exact final-layer stealing conditions; the sharper claim is the proof that outputs alone cannot recover hidden layers.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→DiBS: Diffusion-Informed Branch Selection

DiBS uses a diffusion model to order branches for a complete symbolic Sudoku solver, and on the Royle 17-clue benchmark it reduces nodes, backtracks, and long-tail search cost versus strong heuristic baselines.

#Reasoning#DiBS#Research release#Open source

why featured

HKR-H and HKR-K pass: diffusion-guided symbolic search has a concrete mechanism and benchmark metrics. The claim stays on Sudoku, with no production solver or agent transfer result, so it remains interesting but not featured.

editor take

DiBS cuts nodes and backtracks on Royle 17-clue; I buy learned ordering plus completeness, but the snippet omits effect sizes.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Making the Most of Limited Data: Score-Aware Training for Text-to-Music Generation

The authors propose score-aware training for text-to-music generation, using audio-caption alignment scores as supervision; their 450M-parameter FluxAudio-based system ranked 2nd in objective evaluation across both ICME 2026 ATTM tracks and 3rd in the Efficiency Track final MOS evaluation.

#Audio#Fine-tuning#Benchmarking#FluxAudio

why featured

HKR-K is solid with a concrete mechanism and benchmark rank; HKR-R lands on training cost for audio-generation teams. HKR-H is weak, and a single arXiv competition paper stays below featured.

editor take

FluxAudio 450M took 3rd MOS in the Efficiency Track; text-to-music needs cleaner supervision, not bigger private piles.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

The paper proposes TRUE, a framework for explaining LLM reasoning through executable reasoning verification, feasible-region DAG modeling, and causal failure mode analysis with Shapley values. Experiments span multiple reasoning benchmarks, while the RSS abstract does not disclose the tested model list, dataset names, or numerical scores.

#Reasoning#Interpretability#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the mechanism mix has substance and maps to reasoning-interpretability concerns. Model names and scores are not disclosed, and HKR-H fails, so this stays in the 60–71 research-signal band.

editor take

TRUE claims a 3-level explanation stack; no models or scores disclosed, so don’t treat “verifiable” as reliability evidence yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→The Dual Mechanisms of Spatial Variable Binding in Vision-Language Models

The paper shows VLMs use two mechanisms for spatial variable binding: intermediate language-model layers encode content-independent spatial relations, while the dominant spatial signal comes from vision encoders, with global enhancement across all image tokens improving performance on complex natural images from COCO.

#Multimodal#Vision#Interpretability#COCO

why featured

HKR-K passes: the paper offers a mechanism-level claim and COCO validation for spatial variable binding in VLMs. HKR-H and HKR-R are weak, so this stays in all below featured.

editor take

VLM spatial binding leans on the vision encoder; COCO gains from global image-token enhancement make LM-layer probes the smaller story.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SecretFan: Synthesizing Realistic Data without Breaking Privacy

SecretFan reframes synthetic data generation as adequacy-guided search-based testing, uses a fuzzer for sample generation and a discriminator for selection, and reports good average utility and similarity scores across eight datasets used in prior evaluations.

#Safety#Benchmarking#SecretFan#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and 8-dataset evaluation, with privacy-compliance relevance. It is still a single arXiv paper without a major benchmark delta or production proof, so it sits in 60–71.

editor take

SecretFan reports good utility and similarity on 8 datasets; MIA and reconstruction metrics aren’t disclosed, so the privacy claim gets a haircut.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→A Dynamic Self-Evolving Extraction System

DySECT uses an LLM to extract triples into an incremental knowledge base, then feeds graph reasoning, probabilistic knowledge, few-shot examples, or KB-derived synthetic data back into extraction.

#RAG#Reasoning#Fine-tuning#DySECT

why featured

HKR-H and HKR-K pass: the paper names a concrete self-evolving loop for knowledge extraction. With no metrics, datasets, or production-replacement evidence disclosed, it stays in the 60–71 research-release band.

editor take

DySECT loops LLM triple extraction into a KB, but gives no eval numbers; I’m filing this under classic IE with an LLM shell.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

AAAC replaces fixed 4-bit scalar codebooks with two learned 64-byte scalar codebooks per layer. Each weight group selects the codebook minimizing activation-weighted reconstruction error, stores the choice in an unused sign bit, finishes quantization in 3–30 minutes on one GPU, and adds no memory beyond the model.

#Inference-opt#AAAC#AWQ#GPTQ

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and runtime, and it maps to inference cost. But it is a technical arXiv quantization paper without a major lab release, OSS adoption, or production replacement claim.

editor take

AAAC uses two 64-byte codebooks per layer for 4-bit weights; 3–30 minutes on one GPU is a direct shot at OmniQuant.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Generalization of Diffusion Models Arises with a Balanced Representation Space

The paper analyzes memorization and generalization in diffusion models using a two-layer ReLU DAE, proves that spiky representations correspond to memorization while balanced representations correspond to generalization, and validates the pattern on unconditional and text-to-image diffusion models.

#Multimodal#Vision#Interpretability#Research release

why featured

HKR-K is solid: the paper proposes a concrete representation mechanism for diffusion memorization versus generalization. HKR-R lands on IP and safety risk, but HKR-H is weak and the theory-heavy format keeps it below featured.

editor take

A two-layer ReLU DAE links spiky reps to memorization; diffusion leakage checks need representation probes, not just loss curves.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition

ChronoForest reaches 99.8%, 99.3%, and 99.5% success on the medium, large, and giant OGBench AntMaze-Stitch splits, and improves giant-stitch success by up to 34.5 points over previously reported diffusion-based results.

#Agent#Robotics#Reasoning#ChronoForest

why featured

HKR-H/K pass: the paper gives concrete OGBench success rates and a +34.5 pp giant-stitch gain. HKR-R fails because the work is narrow planning research, so it stays in the 60–71 band.

editor take

ChronoForest hits 99.5% on AntMaze-Stitch giant; diffusion planning’s bottleneck is moving from samples to closed-loop route evidence.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→FIGMA: Towards Fine-Grained Music Retrieval

FIGMA uses a multi-view contrastive architecture for fine-grained music retrieval, with FGMCaps providing 380K training music-caption pairs and a 10K test set annotated for tempo, key, chord progression, beat count, genre, and mood, reaching up to 73.3% relative improvement over CLAP-based baselines.

#Audio#Embedding#Benchmarking#FIGMA

why featured

HKR-K is solid with dataset size, annotation fields, and a 73.3% reported gain. HKR-H and HKR-R are weak: this reads like a normal arXiv paper for audio retrieval and embedding specialists.

editor take

FIGMA beats CLAP baselines by up to 73.3% on FGMCaps; music retrieval is finally punishing lazy first-token-ish alignment.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

The paper proposes PoLar, a program-of-layers method that skips or repeats pretrained LLM layers per input. The abstract says it improves mathematical reasoning accuracy over standard inference and prior dynamic-depth methods, but the post does not disclose the tested models, benchmark count, or gain sizes.

#Reasoning#Inference-opt#Research release

why featured

HKR-H and HKR-K pass: PoLar’s per-input layer skipping/looping is a concrete inference idea. Missing models, benchmark count, uplift size, and code keep it in the interesting-but-not-featured band.

editor take

PoLar skips or loops layers per input, but gains are undisclosed; I don’t buy the latent-reasoning claim before reproduction.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→MidSteer: Optimal Affine Framework for Steering Generative Models

The paper introduces MidSteer, an affine framework for concept manipulation, proves standard behavior removal is a LEACE special case, and evaluates it across vision diffusion models and large language models.

#Alignment#Safety#Multimodal#MidSteer

why featured

HKR-K/R pass: the paper offers a concrete mechanism and cross-model tests, and model control resonates with safety work. HKR-H is weak, with no metrics, code, or production-level practical claim disclosed.

editor take

MidSteer reduces behavior removal to LEACE; closed-form affine steering is auditable, but the snippet hides experiment scale.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

DTG-FF sets new FF-family results across nine real-data benchmarks, including 91.8% on CIFAR-10 and the first FF baseline on ImageNet-100 at 224x224, but BP-DeepSup still leads by 2.40 points on CIFAR-10 and DTG-FF reaches only 49.4% at 224x224.

#Benchmarking#Vision#Geoffrey Hinton#Research release

why featured

HKR-H comes from the contrarian claim that synthetic benchmarks overstate FF scaling; HKR-K has 9 real-data benchmarks and accuracy figures. HKR-R is real for benchmark trust, but the layer-local training topic is niche, so it stays in all.

editor take

DTG-FF hits 91.8% on CIFAR-10 but only 49.4% at 224x224; real images and 8GB GPUs puncture the FF pitch.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→GraphWalker: Patient Analogy Meets Information Gain for Clinical Reasoning with Large Language Models

GraphWalker lets frozen LLMs reason by analogy over retrieved patient cases without task-specific parameter updates. The framework combines data-driven and model-driven signals, patient cohort structure, and lazy greedy search with frontier expansion; the abstract says it outperforms demonstration-selection baselines on multiple real-world EHR benchmarks and remains more robust under cross-dataset shift.

#RAG#Reasoning#Agent#GraphWalker

why featured

HKR-K/R pass: the mechanism is concrete and clinical risk gives it relevance. No exact gains, artifact details, or major-lab signal are disclosed, so this stays in all rather than featured.

editor take

GraphWalker keeps LLMs frozen for patient-analogy retrieval; gains aren’t disclosed in the snippet, so verify EHR shift before buying the agentic framing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Automatic Causal Fairness Analysis with LLM-Generated Reporting

FairMind analyzes dataset-level fairness in a zero-shot setup, computes counterfactual causal effects under the standard fairness model, and uses LLMs to generate reports; the abstract does not disclose benchmark scores or release details.

#Alignment#Safety#FairMind#Plečko

why featured

HKR-K and HKR-R pass: FairMind links causal fairness computation with LLM-generated audit reports. HKR-H is weak, and deployment details are not disclosed, so this stays in the interesting all band.

editor take

FairMind computes counterfactual causal fairness zero-shot; scores and release are undisclosed. I trust closed-form effects, not LLM prose as audit.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

AdaJudge modifies reward modeling with gated refinement blocks and adaptive multi-view pooling, and the abstract reports stronger results than off-the-shelf reward models and traditional pooling baselines on RM-Bench and JudgeBench.

#Alignment#Benchmarking#AdaJudge#Research release

why featured

HKR-K and HKR-R pass: the post gives mechanisms and benchmarks, but this is a single arXiv method paper with no production replacement, released artifact, or cross-source debate.

editor take

AdaJudge beats off-the-shelf RMs on RM-Bench and JudgeBench; I buy the architecture, but RSS omits margins and release terms.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

The paper characterizes reasoning on large-label multi-label tasks as two phases: broad shortlisting from hundreds of thousands to millions of candidate labels, then fine-grained reasoning over the shortlist. Using this mechanism, the authors develop a distillation strategy that consistently outperforms standard distillation across multiple datasets, while the RSS snippet does not disclose model names, benchmark scores, or code availability.

#Reasoning#Fine-tuning#Interpretability#Research release

why featured

HKR-K passes because the paper offers a two-stage mechanism and a distillation comparison for large output spaces. HKR-H and HKR-R are weak, and no concrete gain numbers are disclosed, so this stays in all.

editor take

The paper splits shortlist-then-reason into a distillation recipe; no scores or model names in RSS, but the angle beats leaderboard theater.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

SigmaScale learns row and column diagonal scaling matrices from two vector sets, then evaluates SVD-based low-rank LLM compression on Llama 3.1 8B Instruct and Qwen3-8B under perplexity and zero-shot benchmarks.

#Inference-opt#Fine-tuning#Benchmarking#Llama

why featured

HKR-K and HKR-R pass: SVD low-rank compression plus learned scaling matrices is a concrete mechanism and targets inference cost. The post lacks compression ratio, speed, and quality-loss numbers, so it stays in the 60–71 band.

editor take

SigmaScale reports competitiveness on two 8B models; no compression ratio is disclosed, so SVD-compression hype stays capped.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→AI Level of Detail: Distance-Aware ML Model Precision Selection for Real-Time Human Motion Prediction in Games

The paper proposes AI LOD, which routes NPC motion prediction to FP32, FP16, or INT8 ONNX Runtime model variants based on distance from the player camera; evaluation on CMU Mocap reports negligible perceptual degradation within assigned distance ranges.

#Inference-opt#ONNX Runtime#CMU Mocap#arXiv

why featured

HKR-H/K/R pass, but this is a single arXiv systems paper for real-time game motion prediction. No release artifact, product adoption, or cross-source cluster is shown, so it stays in the 60-71 band.

editor take

AI LOD routes FP32/FP16/INT8 by camera distance; neat idea, but CMU Mocap isn’t a frame-budget proof.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

The paper formalizes SAE concept learning as set alignment, defines three learning levels—detection, separation, and approximation—and validates the theory with synthetic ReLU and Top-K SAE experiments that test how SAE size and sparsity affect concept learning.

#Interpretability#Research release

why featured

HKR-K passes: the paper gives a set-alignment frame, three learning levels, and ReLU/Top-K synthetic tests. HKR-H and HKR-R are weak, so this stays all rather than featured.

editor take

The paper splits SAE concept learning into 3 levels, but tests only synthetic ReLU/Top-K; I buy the frame, not the generalization.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

SCALE trains on 16 nodes and tests directly on 32 and 48 nodes, using Structured Representation Regularization to stabilize attention feature statistics; at N=48, it reduces average response time by 8.9% versus the same cross-attention pointer architecture without SRR.

#Agent#Reasoning#SCALE#Research release

why featured

HKR-K/R pass: SRR, 16→48-node extrapolation, and 8.9% latency reduction are concrete, and agent scheduling costs matter. HKR-H is weak; as a single arXiv paper without adoption or code signal, it fits the 60–71 band.

editor take

SCALE trains on 16 nodes and tests at 48, cutting latency 8.9%; good problem, but beating its own no-SRR ablation is thin.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→ADAGE: Active Defenses Against GNN Extraction

ADAGE monitors GNN query diversity and progressively perturbs outputs as accumulated leakage grows. The paper evaluates it on six benchmark datasets, four GNN models, and three adaptive attacker types, reporting that it blocks common extraction setups while preserving downstream predictive performance.

#Safety#Benchmarking#ADAGE#Research release

why featured

HKR-K passes with a concrete mechanism and test scale; HKR-R passes on model stealing and IP security. HKR-H is weak, and GNN defense is too niche for featured.

editor take

ADAGE keys perturbation to query diversity across 6 datasets, 4 GNNs, 3 attacker types; “impossible to steal” needs code, not trust.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models

OPTIMUS generates concept-based heatmaps for deep classification models, using prime implicants to guarantee sufficiency and minimality; the paper says it validates the method on a visual classification benchmark, but the snippet does not disclose the benchmark name.

#Vision#Interpretability#Benchmarking#Research release

why featured

HKR-K passes: prime implicants provide sufficiency and minimality guarantees for concept heatmaps. HKR-H/R are weak; the post only says vision classification benchmarks, with no benchmark names or deployment evidence.

editor take

OPTIMUS adds sufficiency and minimality guarantees via prime implicants; benchmark details are undisclosed, so don’t crown it saliency’s killer yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→pTNAS: Progressive Neural Architecture Search for Tabular Data

pTNAS searches tabular neural architectures with a filter-and-refine NAS pipeline, using the zero-cost pTProxy for initial filtering and fixed-budget scheduling for refinement; experiments report up to 82.75x less time to reach the globally best architecture versus other NAS methods and up to 4.78x higher end-to-end efficiency than TabPFN.

#Benchmarking#Inference-opt#TabPFN#Research release

why featured

HKR-K passes with a concrete mechanism and speed claims, making it useful research-feed signal. HKR-H and HKR-R are weak: tabular NAS is narrow and not featured-level for this audience.

editor take

pTNAS reports 82.75x faster tabular architecture search; I buy the efficiency angle, but TabPFN task scale is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions

The paper proposes Gaussian Trust Region Policy Optimization, which reshapes PPO’s trust region with a Gaussian kernel; the released code accompanies experiments across games, simulated robotic control, open-world exploration, and language model post-training.

#Agent#Robotics#Fine-tuning#Research release

why featured

HKR-K passes: GTR provides a testable PPO trust-region mechanism, public code, and experiments across games, robotics, open-world tasks, and LLM post-training. HKR-H/R are weak, so this stays all.

editor take

GTR reshapes PPO’s trust region with a Gaussian kernel; the non-monotonic constraint is sharp, but baselines and LM details are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Architecturally Significant MLOps Guidelines for ML Model Integration and Deployment

The paper reviews 103 web sources and synthesizes 25 architecturally significant MLOps guidelines for ML model integration and deployment, grouping them into five categories and describing their impact on overall system architecture.

#Fine-tuning#arXiv#Research release

why featured

HKR-K has concrete counts and categories, and HKR-R maps to model-deployment pain. HKR-H is weak, and this is a review paper rather than a same-day industry trigger.

editor take

103 web sources yielded 25 MLOps guidelines; useful as a checklist, weak as architecture guidance without validation.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees

InvEvolve uses a reinforcement-learning-trained LLM to generate white-box inventory policies for online non-stationary demand, applies confidence-interval-based certification for statistical safety guarantees, and reports stronger performance than classical inventory policies and deep-learning methods on synthetic and real-world retail data.

#Agent#Reasoning#Safety#InvEvolve

why featured

HKR-H and HKR-K pass: the paper offers LLM-generated white-box policies with performance guarantees and retail-data tests. HKR-R is weak because inventory optimization is a narrow OR topic for AI practitioners.

editor take

InvEvolve adds confidence-interval certification to inventory policies; I buy the white-box angle, but margins are not disclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Certified Robustness to Data Poisoning in Gradient-Based Training

The paper presents a certification framework that does not modify the model or learning algorithm, using convex relaxations to over-approximate reachable parameters under poisoning threat models for gradient-based training.

#Safety#Alignment#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: the paper states a concrete certification mechanism and targets training-time poisoning risk. HKR-H is weak, and the post lacks scale, benchmarks, or code, so it stays mid-band.

editor take

This certifies poisoning robustness for gradient training across targeted, untargeted, and backdoor attacks; no scale disclosed, so LLM training claims wait.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Learning Explicit Behavioral Models with Adaptive Questions and World-Model Probes

Hikaru Shindo and seven coauthors introduce ESBM, a behavioral model using typed predicates, weighted rules, bounded options, and mechanism memory. After each Atari-style rollout, adaptive questions and world-model probes convert QA and transition-prediction errors into local edit constraints.

#Agent#Reasoning#Interpretability#Hikaru Shindo

why featured

HKR-K passes because ESBM gives a concrete modeling mechanism, converting QA and transition errors into local edit constraints. HKR-H and HKR-R are weak: the angle is academic, and Atari rollouts are distant from production agent pain points.

editor take

ESBM edits rules after each rollout using QA and transition errors; I buy the supervision signal, not the Atari-to-agent leap.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Calibrating Uncertainty for Zero-Shot Adversarial CLIP

The paper proposes an adversarial fine-tuning objective for CLIP that reparameterizes outputs as Dirichlet concentration parameters, aligning distributions under perturbations and reporting improved uncertainty calibration with competitive adversarial robustness across multiple zero-shot benchmarks while preserving clean accuracy.

#Vision#Fine-tuning#Safety#CLIP

why featured

HKR-K passes: the method is concrete and claims better calibration across zero-shot benchmarks while preserving clean accuracy. HKR-H and HKR-R are weak; no code, effect size, or production setting is disclosed.

editor take

Only the abstract is available; no benchmark counts disclosed. Dirichlet calibration for adversarial CLIP is plausible, but tables decide.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

The authors introduce REMEDI, a machine-unlearning benchmark for clinical disease inference built on the MIMIC-III clinical database. It covers multi-label and multiclass tasks, diverse forget-instance setups, and metrics for both retained utility and achieved unlearning, while experiments show existing methods trade off utility against forgetting and fit multi-label classification poorly.

#Benchmarking#Safety#REMEDI#MIMIC-III

why featured

HKR-K is clear: REMEDI defines a MIMIC-III clinical unlearning benchmark, and HKR-R lands on privacy/compliance. The work is still a narrow research benchmark with weak HKR-H, so it stays in all.

editor take

REMEDI tests clinical unlearning on MIMIC-III; I buy the direction, since utility collapse in multi-label disease tasks is the hard part.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI trains a masking module over sparse-autoencoder image embeddings to reconstruct CLIP representations conditioned on captions, improving retrieval on MS COCO, Flickr, IIW, and DOCCI, with stronger gains for richer captions and better robustness on RoCOCO.

#Vision#Multimodal#Embedding#CLIP

why featured

HKR-K passes via a concrete mechanism and MS COCO, Flickr, IIW, DOCCI, and RoCOCO evals. HKR-H/R are weak, and gains are not disclosed, so this stays browseable research signal.

editor take

TEVI filters CLIP image embeddings with captions; gains are undisclosed, so I’d file it as retrieval post-processing for now.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning

CF-JEPA replaces masking with multi-horizon forward prediction for time-series representation learning, using random crops as context views and predicting short-, mid-, and long-horizon future representations. Across 126 UCR and 26 UEA classification datasets, eight electricity transformer forecasting benchmarks, and KPI/Yahoo anomaly detection, it leads self-supervised baselines on UCR/UEA and reduces multivariate forecasting MSE by 27%.

#Benchmarking#University of California, Riverside#University of East Anglia#Yahoo

why featured

HKR-K passes with a concrete CF-JEPA mechanism, 152 benchmark datasets, and a 27% MSE reduction. HKR-H/R are weak because this is a narrow time-series representation paper, not a broad model or product story.

editor take

CF-JEPA leads on 152 classification sets; the online/EMA split is the sharp bit, with 27% lower MSE for free.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

AdaGRPO adds two components to improve GRPO training for T2I flow models. It selects prompts through online curriculum filtering and fuses intra-group and global advantage estimates.

#Alignment#Fine-tuning#Research release

why featured

HKR-K passes because the summary names two testable mechanisms in AdaGRPO. HKR-H and HKR-R are weak: the title is academic, no result number is disclosed, and the topic is niche T2I post-training.

editor take

AdaGRPO discloses 2 training components, not metrics; I’d treat it as a Flow-GRPO patch, not a new T2I RL lane.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→GlucoFM-Bench: Benchmarking Time-Series Foundation Models for Blood Glucose Forecasting

GlucoFM-Bench evaluates eight architectures for blood glucose forecasting across 15 public diabetes-related datasets covering 1,117 people, and the best zero-shot model performs within 5% of the best full-shot supervised model.

#Benchmarking#GlucoFM-Bench#Chronos-2#TimesFM

why featured

HKR-K passes with concrete benchmark scale and a testable zero-shot claim. HKR-H and HKR-R are weak because medical time-series forecasting is vertical and not a broad AI-practitioner conversation starter.

editor take

GlucoFM-Bench covers 1,117 people; Chronos-2 lands within 5% zero-shot, but full-data LSTM wins by 4–21%.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Limitations of Normalization in Attention Mechanism

The paper analyzes limits of softmax normalization in attention and validates the theory with pre-trained GPT-2 experiments: as the number of selected tokens increases, the model’s ability to distinguish informative tokens declines, and low-temperature settings create gradient-sensitivity challenges during training.

#Reasoning#Interpretability#GPT-2#Research release

why featured

HKR-K passes: the paper names concrete softmax-attention failure conditions and tests them with GPT-2 pretraining. HKR-H and HKR-R stay weak, so this remains an all-tier research item.

editor take

GPT-2 tests show selected-token growth dilutes attention selectivity; the useful bit is testable softmax bounds, not the diagnosis.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

The paper introduces DIRECT, a framework that decomposes object-insertion conditions into three separate pathways—appearance, geometry, and context—so users can adjust a 3D proxy to control pose, while experiments report better geometric controllability and visual quality than prior methods.

#Vision#Multimodal#DIRECT#Research release

why featured

HKR-K passes: DIRECT gives a testable mechanism via 3-way condition decomposition and 3D proxy pose control. HKR-H and HKR-R are weak; this is a single arXiv vision method without product or market spread yet.

editor take

DIRECT splits insertion into 3 pathways; it’s cleaner control than 2D inpainting, but the snippet hides the metrics.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

TrioPose builds a TSPA-DiT triple-stream pose-aware architecture on SD3.5M and reports 64.33 AP on Human-Art, a 30% improvement over prior methods.

#Multimodal#Vision#TrioPose#SD3.5M

why featured

HKR-K passes with a named architecture and Human-Art AP result; HKR-H/R are weak. This is a niche vision-generation paper with no hard exclusion, so it sits in the interesting-but-not-featured band.

editor take

TrioPose hits 64.33 AP on Human-Art; treating pose as its own stream beats another brittle DiT conditioning hack.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Accelerating Reproducible Research in Synthetic EHR Generation

The paper introduces a synthetic EHR benchmarking framework that unifies data ingestion, model training, and evaluation, covering five baselines: MedGAN, CorGAN, PromptEHR, HALO, and GPT-2.

#Benchmarking#PyHealth#MedGAN#GPT-2

why featured

HKR-K passes: the framework unifies ingestion, training, and evaluation across MedGAN, CorGAN, PromptEHR, HALO, and GPT-2. HKR-H and HKR-R are weak, so this stays browseable rather than featured.

editor take

This framework unifies 5 synthetic EHR baselines; it targets ICD-9 diagnosis codes, so don’t sell it as broad medical generation eval.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

ERNEST uses one neural-network controller to drive a four-wheeled rover with a 2-DoF Active Gimbal Suspension, trained in DARTS with rigid-contact dynamics and Bekker-Wong terramechanics; on a 20° dry sandy slope, the learned controller cuts cost of transport by 37%, while the passive suspension becomes immobilized on wet sand.

#Robotics#Agent#Research release

why featured

Niche robotics paper: HKR-H has the planetary-rover active-suspension hook, HKR-K gives a 37% transport-cost result on a 20° dry-sand slope. HKR-R is weak because it lacks a broad AI tooling or market stake.

editor take

ERNEST cuts transport cost 37% on a 20° dry sand slope. I buy this: one less terrain classifier, one less rover failure mode.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Predictive Statistics Shape Emergent World Representations of Grid Walkers

The authors train decoder-only transformers and recurrent networks on constrained random walks over a two-dimensional lattice, finding that the first attention block extracts a sufficient statistic while later layers convert it into next-step predictive geometry.

#Reasoning#Interpretability#Research release

why featured

HKR-K passes via a concrete toy-model mechanism in Transformers/RNNs. HKR-H and HKR-R are weak, so this is useful research-feed signal but below featured.

editor take

On 2D endpoint walks, the first Transformer attention block reads sufficient statistics; narrow toy setup, cleaner than world-model handwaving.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Textual Supervision Enhances Geospatial Representations in Vision-Language Models

The paper evaluates ViT, CLIP, LLaVA, Qwen, and Gemma model families across image clusters such as people, landmarks, and everyday objects grouped by localizability, and finds that textual supervision improves geospatial representations.

#Multimodal#Vision#Benchmarking#CLIP

why featured

HKR-K passes because the paper adds a cross-family VLM geospatial evaluation and a textual-supervision claim. HKR-H/R are weak: no metric, artifact, or product path is disclosed, so this stays a narrow research item.

editor take

The paper tests ViT, CLIP, LLaVA, Qwen, and Gemma; I want leakage controls, not another language-helps-geo claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

arXiv:2604.10098v2 surveys Attention Sink in Transformers across three dimensions: fundamental utilization, mechanistic interpretation, and strategic mitigation; the abstract says Attention Sink concentrates attention on small uninformative token subsets, affects training and inference dynamics, worsens hallucinations, and includes a related paper list on GitHub.

#Interpretability#Inference-opt#Safety#arXiv

why featured

HKR-K passes: the three-part survey taxonomy is useful for attention-sink work tied to long-context and inference behavior. HKR-H/R are weak, and it is an arXiv survey without a new model, dataset, or production result.

editor take

Attention Sink survey groups work into 3 tracks; I don’t buy the “first survey” pitch, but the GitHub list is useful for long-context inference.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Bootstrap Theory of Representational Emergence: Explanatory Insufficiency as a Driver of Representation Learning and World Models

arXiv:2606.07303 introduces TBER, a framework that formalizes representational transition into five stages: stabilized observation, anomaly detection, explanatory insufficiency, representational emergence, and provisional stabilization.

#Reasoning#Memory#Research release

why featured

HKR-K passes because the post gives a new TBER framing and five stages. HKR-H and HKR-R are weak: the title is academic, and there is no product, benchmark, or industry conflict, so it fits the 60–71 research band.

editor take

TBER offers a 5-stage representation-transition frame, but no experiments are disclosed; smells like theory scaffolding, not a world-model roadmap.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Aumann-SHAP: The Geometry of Counterfactual Interaction Explanations in Machine Learning

The paper introduces Aumann-SHAP, which discretizes a counterfactual hypercube into a micro-player cooperative game; on German Credit, interaction geometry changes feature-priority rankings in 12.3% of instances.

#Interpretability#Benchmarking#UCI#Research release

why featured

HKR-K passes with a concrete mechanism and a 12.3% result; HKR-H and HKR-R are weak because the angle is academic and validated on one dataset. Useful but narrow interpretability research, so tier all.

editor take

Aumann-SHAP flips 12.3% of German Credit rankings; attribution methods are finally treating interaction geometry as first-class.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training

The paper proposes NMP-QAT, where each neuron learns a discrete precision during training. Evaluations cover telecom and non-telecom datasets across MLP and tabular foundation-model architectures, but the abstract does not disclose exact compression ratios or accuracy numbers.

#Inference-opt#Fine-tuning#Research release

why featured

HKR-K passes because neuron-level mixed-precision QAT is a concrete mechanism for inference optimization. HKR-H and HKR-R are weak: no compression, accuracy, code, or deployment result is disclosed, so this stays in the lower all band.

editor take

NMP-QAT learns discrete precision per neuron, but the abstract gives no compression or accuracy numbers; discount the 6G-edge framing.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Building Better Activation Oracles

The paper improves Activation Oracle training in four areas and open-sources AObench; capability gains are marginal, while quality-of-life improvements are substantial.

#Interpretability#Benchmarking#AObench#Research release

why featured

HKR-K passes via AObench and four training-stage changes. HKR-H/R are weak because activation-oracle work is narrow interpretability tooling, so this stays in all rather than featured.

editor take

The paper tweaks AO training in 4 places and ships AObench; small capability gain, useful interpretability plumbing.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Twin selects learning rate and weight decay without validation data by using training loss in the non-separable regime and parameter norm in the separable regime, reporting 1.28% mean absolute error versus an Oracle test-accuracy selector across 37 image-classification dataset-architecture configurations.

#Fine-tuning#Benchmarking#Twin#Research release

why featured

HKR-K passes with a concrete no-validation tuning method and 37-run result. HKR-H is weak and HKR-R is narrow, so this stays in the lower all tier rather than featured.

editor take

Twin is 1.28% off Oracle across 37 image setups; I don’t buy validation-free tuning beyond homogeneous classifiers yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Towards Efficient and Exact Forgetting Services in Pre-Trained-Model-based Continual Learning

The paper proposes Analytic Continual Unlearning for PTM-based continual learning, deriving gradient-free closed-form least-squares updates for each unlearning request. ACU supports both sample-level and class-level forgetting, while the abstract claims gains in unlearning effectiveness, model fidelity, and system efficiency without disclosing benchmark numbers in the snippet.

#Fine-tuning#Interpretability#Safety#Research release

why featured

HKR-K comes from the ACU mechanism, and HKR-R from privacy/compliance pressure. The item stays at abstract level: no benchmark numbers, artifact, or production replacement claim, so it lands in the lower research-signal band.

editor take

ACU uses closed-form least squares for continual unlearning; no benchmark numbers are disclosed, so don't treat “exact forgetting” as deployable yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

SERNF fine-tunes real-world dexterous manipulation policies with normalizing flows and action-chunked critics, using exact likelihoods for multimodal action chunks and evaluating two hardware tasks: cutting tape with scissors retrieved from a case and palm-down in-hand cube rotation.

#Robotics#Fine-tuning#Research release

why featured

HKR-K passes because the method and two real-world tasks are concrete. HKR-H and HKR-R are weak: this is a specialized robot-learning paper, not a broad product, open-source, or benchmark event.

editor take

SERFN reports 2 hardware tasks; exact likelihoods for action chunks make conservative dexterous fine-tuning less hand-wavy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Performance Variation in Deep Reinforcement Learning

The paper proposes min-max IPR and run-wise percentile highlighting to evaluate run-to-run variation in deep reinforcement learning, using three case studies covering PPO, SAC, TD-MPC, TD-MPC2, DQN, and Rainbow.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with two evaluation mechanisms and 3 cases. HKR-H and HKR-R are weak because the story stays in DRL reproducibility, far from mainstream AI product or model competition.

editor take

Three case studies target RL run variance; I buy the angle, mean CIs have hidden PPO/SAC reproducibility pain for too long.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

Zhuphua Cao proposed FDRS, a digit-randomness screening framework for raw numerical research data, and evaluated it on RawData with n=253 and ErrData with n=255; Elastic-net Logistic Regression reached an AUC of 0.98395, while Random Forest reached 0.926667 accuracy.

#Benchmarking#Zhuphua Cao#arXiv#Research release

why featured

HKR-K passes with a named framework, dataset sizes, and AUC. HKR-H and HKR-R are weak: this is research-data auditing, not an AI product, model-capability, or industry-competition story; no hard exclusion applies.

editor take

FDRS hits 0.98395 AUC on 253/255 samples; I worry less about the model than its misuse as misconduct proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

MACS addresses the straggler effect in multimodal MoE expert-parallel inference with a training-free framework, using two mechanisms: entropy-weighted load for visual-token semantic value and dynamic modality-adaptive capacity for real-time modal composition.

#Multimodal#Inference-opt#MACS#Research release

why featured

A niche multimodal MoE inference paper: HKR-K comes from two concrete mechanisms, and HKR-R from cost/latency pain. No throughput or latency numbers are disclosed, and technical depth keeps it below 60.

editor take

MACS discloses 2 training-free mechanisms but no speedup number; multimodal MoE inference still bleeds at EP stragglers.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

OffQ uses top-1 PCA to identify a low-dimensional activation outlier subspace, rotates high-magnitude activations into 1 channel, and converts that channel into a shared offset to support W4A4KV4 uniform-grid quantization.

#Inference-opt#OffQ#Research release

why featured

HKR-K and HKR-R pass: the piece names a concrete quantization mechanism and W4A4KV4 target. HKR-H fails; no accuracy, throughput, or memory numbers are disclosed, and the technical bar keeps it in the lower interesting band.

editor take

OffQ funnels outlier activations into 1 channel, then offsets it; if W4A4KV4 holds, mixed precision loses an excuse.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Lighting-Aware Representation Learning under Controllable Lighting Variation

The paper proposes a lighting-aware representation learning framework that uses illumination variation as an explicit training signal. It evaluates image classification and object detection on ImageNet, ExDark, and PASCAL VOC, reporting gains over standard contrastive learning baselines under the same architecture and training budget.

#Vision#Benchmarking#arXiv#ImageNet

why featured

HKR-K passes: it gives a concrete training mechanism and ImageNet, ExDark, PASCAL VOC evaluation settings. HKR-H/R are weak, and the post gives no gain numbers, so this stays in all.

editor take

Lighting-aware loss wins on three vision benchmarks; no gain sizes disclosed, so I’d treat it as a low-light robustness patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning

ULPS integrates a calibrated BERT-based language model into PPO training, using A*-generated symbolic trajectories and Monte Carlo dropout uncertainty, and reports over 9% execution-accuracy improvement after fine-tuning on MiniGridUnlockPickup.

#Agent#Reasoning#Fine-tuning#arXiv

why featured

HKR-K passes via a testable setup, mechanism, and >9% gain. HKR-H/R miss; this is a niche RL paper rather than a product, open-source framework, or broad agent update.

editor take

ULPS gains 9% on MiniGridUnlockPickup; I don’t buy the LLM-guided framing, since BERT trained on A* smells like distilled control.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→An Adaptive Data Cleaning Framework for Noisy Label Detection

The paper proposes an adaptive data-cleaning framework that detects noisy labels using local, global, and learning-dynamics features; on ImageNet-100 with 40% symmetric label noise, it reports recall of at least 98%.

#Benchmarking#Research release#Benchmark

why featured

HKR-K has a concrete mechanism and ImageNet-100 result; HKR-R touches data-quality pain for training teams. HKR-H is weak, and this is a single arXiv paper without code or production evidence, so it stays in the upper low-value research band.

editor take

ImageNet-100 hits ≥98% recall at 40% symmetric noise; I want precision, because high-recall cleaners often purge hard samples too.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory

arXiv:2606.06624 releases a nine-chapter book manuscript on deep representation learning. It frames large deep networks through representation learning, optimization, and information theory, then discusses interpretable and controllable model design.

#Interpretability#Memory#arXiv#Research release

why featured

HKR-H passes because the title has a “mathematical theory of memory” hook. HKR-K and HKR-R are weak: the post gives scope only, with no new mechanism, experiment, or industry impact.

editor take

arXiv posted a 9-chapter manuscript on representation learning; I’d audit Chapters 2-6 before buying the “undergrad math” claim.

HKR breakdown

hook ✓knowledge —resonance —

→ open source

SCORE

H1·K0·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

TimeGS reframes time series forecasting as 2D generative rendering, adds MB-GKG and MP-CCR blocks, and reports state-of-the-art or competitive results on standard benchmark datasets.

#Benchmarking#TimeGS#Research release#Open source

why featured

HKR-H and HKR-K pass via the unusual rendering angle and named mechanisms, but HKR-R is weak. This is a niche methods paper, far from agents, products, or flagship model updates, so it stays in the 40–59 band.

editor take

TimeGS casts forecasting as 2D Gaussian rendering; SOTA is claimed on standard benchmarks, but datasets and error tables are undisclosed here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Synthics: Synthetic Physics-like Datasets for Machine Learning

Jari Vepsäläinen presents Synthics, a Bayesian probabilistic context-free grammar method for generating physics-like synthetic regression datasets, matching the Feynman equation corpus on all 8 studied structural features and selecting the 6th-best configuration out of 20 in a downstream gradient-boosted regressor tuning task.

#Benchmarking#Jari Vepsäläinen#Research release

why featured

HKR-K passes for a testable generator and 8 matched structural features, while HKR-H and HKR-R fail. The physics-like regression benchmark is useful to a niche ML audience, with no product, agent, or market impact.

editor take

Synthics matches Feynman on 8 structural features; I buy the direction, but 20 tuning configs don’t prove transfer.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

Kehan Wang proposes WAV v1, adding phase and split detail bases to block residual summaries in decoder-only Transformers; at 48 layers, it reduces TinyStories validation loss from 0.4960 to 0.4738 versus Block AttnRes, while the 12-layer setting is not consistently better.

#Reasoning#Inference-opt#Kehan Wang#arXiv

why featured

HKR-K passes via a concrete mechanism and TinyStories metric; HKR-H/R do not. The work is a niche transformer-architecture paper with limited practitioner pull, so it stays in the low-value research band.

editor take

WAV v1 cuts 48-layer TinyStories loss to 0.4738; I’d file it as a residual-routing trick, since 12-layer gains fail.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

U-Balance rebalances CPS telemetry labels using behavioral uncertainty, relabeling high-uncertainty safe windows as unsafe; on a UAV benchmark with a 46:1 safe-to-unsafe ratio, it reaches a 0.806 F1 score and beats the strongest baseline by 14.3 percentage points.

#Safety#Benchmarking#U-Balance#GatedMLP

why featured

HKR-K passes with a concrete mechanism and UAV benchmark numbers. HKR-H/R miss: this reads like a narrow arXiv method paper, not a broadly resonant AI product or model story.

editor take

U-Balance hits 0.806 F1 on 46:1 UAV data; relabeling uncertain safe windows works, but label trust becomes the attack surface.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Federated Foundation Models over Vehicular Networks

The paper proposes M3T FedFMs for vehicular networks, evaluates a case study on the Waymo Open Dataset, and releases implementation code in a GitHub repository for reproducibility.

#Multimodal#Fine-tuning#Waymo#Research release

why featured

HKR-K passes via a named method, dataset case study, and code release; HKR-H/R are weak because the angle is niche vehicular FL. No hard exclusion, so it lands as a low-mid research release.

editor take

M3T FedFMs ran a Waymo case and released code; the vehicle-side FL bandwidth bill is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

LoRA-DA derives a data-aware LoRA initialization from an objective with bias and variance terms, using Fisher-gradient approximation and Fisher information; the abstract says it improves final accuracy across multiple benchmarks, but the snippet does not disclose exact scores.

#Fine-tuning#Benchmarking#LoRA-DA#Research release

why featured

HKR-K passes for a new LoRA initialization mechanism; HKR-H/R are weak because no accuracy numbers, code status, or reproducible setup are disclosed. Technical but relevant to fine-tuning, so it stays in all.

editor take

LoRA-DA initializes LoRA with Fisher terms, but no scores are disclosed; I buy the theory, not the win yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Learning Fair Demand Models

The paper studies fairness in a two-stage pricing pipeline with linear demand estimation followed by price optimization. It compares fairness constraints on training loss, prices, and demand under parity-wise and Rawlsian views, then tests the model with a real-world vaccine pricing case study.

#Alignment#Research release#Safety/alignment

why featured

HKR-K passes because the paper adds three fairness-constraint placements and a vaccine pricing case. HKR-H and HKR-R are weak: the title is academic, and the post gives no product deployment or industry conflict, so this stays in the lower research band.

editor take

The paper shows loss-parity gives multiple optima; in pricing systems, fairness-in-the-loss is the lazy dangerous fix.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→TargetSEC: Plug-and-Play In-the-Wild Speech Emotion Conversion via Arousal-Conditioned Latent Style Diffusion

TargetSEC generates emotion-focused style embeddings with latent diffusion conditioned on speaker identity and continuous emotion, and experiments on MSP-Podcast show higher conversion accuracy than non-duration baselines while matching duration-prediction systems without explicit temporal modeling.

#Audio#TargetSEC#MSP-Podcast#Research release

why featured

HKR-K passes via a concrete dataset and modeling mechanism. HKR-H/R are weak: this is narrow audio research with no product path or broader industry pressure, so it stays in the low-value research band.

editor take

TargetSEC beats non-duration MSP-Podcast baselines; matching duration-prediction systems without temporal modeling is the sharp claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices

The paper analyzes 28 high-profile filter feature selection studies published from 1994 to 2025. A multivariate linear regression using dataset count, baseline count, and new-method count explains 33% of the variance in win rate against chosen baselines.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via concrete sample size, time span, and the 33% variance claim. HKR-H/R are weak: this is niche classical ML evaluation methodology, useful to benchmark specialists but below featured threshold.

editor take

28 FFS papers show evaluation bias: dataset, baseline, and method counts explain 33% win-rate variance; even small benches are design-shaped.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Self-Supervised Learning for Android Malware Detection on a Time-Stamped Dataset

The paper constructs a time-stamped Android app dataset and uses BYOL self-supervised pre-training for malware detection, reporting 98% accuracy and 89% F1 under time-aware evaluation with timestamp verification.

#Fine-tuning#Benchmarking#VirusTotal#MITRE ATT&CK

why featured

HKR-K passes with a timestamped dataset, BYOL pretraining, and temporal-evaluation metrics. HKR-H and HKR-R are weak because this is a narrow security-detection paper, below featured threshold.

editor take

BYOL hits 98% accuracy and 89% F1 under time-aware testing; for Android malware, fixing temporal leakage is the useful part.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning

The paper proposes a model recycling framework for source-free supervised transfer learning, selecting subsets of related pre-trained models for reuse across multiple sources under white-box and black-box access, with parameter-efficient training as the stated mechanism.

#Fine-tuning#Research release

why featured

HKR-K passes for the data-free multi-source model reuse mechanism. HKR-H/R miss: no metrics, code, or production impact are disclosed, so this stays a narrow research update.

editor take

This proposes source-free model recycling for white-box and black-box access; no benchmark numbers disclosed, so the setup is useful but evidence is thin.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

STILL DEVELOPING · 1darXiv · cs.LG· atomEN04:00 · 06·08

→MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

MVCL-DAF++ improves rare-class recognition on MIntRec and MIntRec2.0 by +1.05% and +4.18% WF1, using prototype-aware contrastive alignment plus coarse-to-fine attention fusion, and the authors released source code on GitHub.

#Multimodal#Benchmarking#MVCL-DAF++#MIntRec

why featured

HKR-K passes with concrete WF1 gains and GitHub code. HKR-H and HKR-R are weak; the paper-style framing is niche for general AI practitioners, so it stays in the low-value research-update band.

editor take

MVCL-DAF++ gains 4.18% rare-class WF1 on MIntRec2.0. Nice small-benchmark SOTA; inspect the noise setup before buying it.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED assesses debate creativity with an eight-dimensional hierarchy, using a pretrained autoregressive language model and hierarchical scoring head. The abstract says it beats prompt-based LLM evaluators, but does not disclose dataset size or exact scores.

#Benchmarking#Fine-tuning#DEFINED#arXiv

why featured

HKR-K passes via the 8-dimension creativity rubric and hierarchical scoring head. HKR-H and HKR-R are weak, and missing dataset size or results keeps this in all, below featured.

editor take

DEFINED scores debate creativity on 8 dimensions, but dataset size and scores are undisclosed; I don’t buy the LLM-evaluator win yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Position: A Dynamical Systems Perspective Is Needed to Advance Time Series Modeling

arXiv:2602.16864v2 argues that time-series modeling needs a dynamical-systems perspective, covering DSR, long-term statistics prediction, performance upper bounds, generalization to unseen regimes such as tipping points, and potential control strategies.

#Reasoning#Benchmarking#arXiv#Research release

why featured

HKR-K passes, but there is no new model, metric, or reproducible artifact. The dynamical-systems angle is narrow time-series research, so it stays in the low-value/all band.

editor take

arXiv 2602.16864v2 calls out TS foundation-model hype; I buy it, black-box forecasting hits dynamical-systems ceilings fast.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→A Rolling-Window Framework for Churn Prediction and Behavioral Driver Identification

The study proposes a rolling-window churn prediction framework that separates behavioral evidence and outcomes with a 30-day observation window and a 30-day future evaluation window, reporting 87.6% accuracy and 0.94 ROC-AUC for the feature-based model.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via reproducible windows and metrics. HKR-H/R are weak: this is conventional churn-prediction modeling, distant from core AI-industry concerns, so it sits in the low-value browseable band.

editor take

A 30-day window hitting 0.94 AUC is fine; without platform details and baselines, don’t treat it as a churn benchmark.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Phonetic Error Analysis of Raw Waveform Acoustic Models

The paper analyzes error patterns of raw-waveform acoustic models on TIMIT phone recognition, where WSJ transfer learning reduces Dev/Test PER from 13.9%/15.3% to 11.3%/12.3%.

#Audio#Benchmarking#TIMIT#WSJ

why featured

HKR-K passes via concrete TIMIT/WSJ transfer conditions and PER numbers. HKR-H and HKR-R are weak because this is narrow speech-recognition research, so it stays in all rather than featured.

editor take

WSJ transfer cuts TIMIT Test PER to 12.3%; the useful bit is phonetic error anatomy, not another tiny ASR leaderboard win.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks

The paper proposes PURe, a residual network with multiplicative product units, and evaluates it on one synthetic interaction benchmark plus two real-world datasets for accuracy, Gaussian-noise robustness, and low-data performance.

#Interpretability#Benchmarking#Research release#Benchmark

why featured

HKR-K passes because the paper gives a concrete architecture mechanism and evaluation setup. HKR-H/R fail: the angle is dry and has little practitioner resonance, so this stays in the low-value research band.

editor take

PURe has 1 synthetic and 2 real datasets; multiplicative residuals are neat, but the evidence is thin.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

The author evaluates a frozen pop-jazz Music Transformer on 11 target genres. A 165-cell grid shows five adaptation methods improve held-out chord prediction by +2.89 to +3.61 macro points, while corrected Wilcoxon tests find no decisive winner between LoRA and IA3.

#Fine-tuning#Benchmarking#Music Transformer#Research release

why featured

HKR-K passes with concrete experiment counts and gains. HKR-H and HKR-R are weak because chord-symbol genre modeling is niche and distant from mainstream AI products or practitioner workflows.

editor take

165 runs gain only 2.89–3.61 points; chord-symbol adaptation is useful, but not a genre-modeling win.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

The paper proposes AE-YOLO, adding lightweight autoencoders and CBAM to the FPN-PAN neck for UAV insulator defect detection; with an EfficientNetV2 backbone, it reports 95.10% mAP@0.5, 96.40% precision, and 93.80% recall on the Insulator-Defect Detection dataset.

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-K passes because the paper gives a concrete architecture and mAP number; HKR-H and HKR-R fail. This is a narrow industrial-vision benchmark, so it sits in the 40–59 low-value band for the broader AI-practitioner feed.

editor take

AE-YOLO reports 95.10% mAP@0.5; WBF fuses YOLOv8/10/11, so don't read this as a clean single-model win.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Trio: Learning Time-Series Forecasting with Temporal-Spatial-Sample Attention and Structural Causal Priors

Trio applies temporal, spatial, and sample attention to multivariate time-series forecasting. Its TS-SCM generator creates synthetic tasks with dynamic lags, cross-variable interactions, noise, feedback, and distributional drift; experiments cover synthetic, industrial, and public benchmarks, while fully general PFN-style forecasting remains open.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via the attention design and TS-SCM setup; HKR-H/R fail, and the post gives no result numbers, code, or production claim. This is a niche forecasting paper, so it stays low in all.

editor take

Trio adds sample attention to forecasting; tests span synthetic, industrial, public sets, but zero-shot is exploratory and PFN-style forecasting remains unsolved.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

1d ago

arXiv · cs.LG· atomEN04:00 · 06·08

→Are You Sure? A Survey of Uncertainty Quantification in Symbolic Regression

Julia Reuter and Fabricio Olivetti de Franca survey uncertainty quantification in symbolic regression, grouping the literature into three directions: frequentist methods, Bayesian methods, and model selection.

#Benchmarking#Julia Reuter#Fabricio Olivetti de Franca#arXiv

why featured

HKR-K passes via the 3-part uncertainty-quantification taxonomy, but HKR-H and HKR-R are weak. This is a narrow research survey with no product, agent, or frontier-model impact.

editor take

Reuter groups SR uncertainty into 3 tracks; interpretable equations are not trustworthy equations without UQ.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

papers · 2026-06-08

more

feeds

admin