papers · 2026-05-27


2026-05-27 · Wed

21:10

12d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 21:10 · 05·27

→GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

GEO-Bench evaluates TAP, Zero-Shot, STS, RAF, StealthRank, and ten white-hat C-SEO strategies under one protocol, scoring them on five datasets against a fixed Llama-3.1-8B-Instruct ranker with effectiveness and stealth metrics.

#Benchmarking#Safety#Llama#Research release

why featured

HKR-H/K/R all pass, but this is a niche research benchmark rather than a major product release. The concrete 5-dataset setup and Llama-3.1-8B-Instruct ranker put it at the featured floor.

editor take

GEO-Bench makes GEO look less like SEO folklore and more like an attack surface; black-box rewriting beating gradients is bad news for AI search rankers.

sharp

GEO-Bench’s sharp result is simple: attackers do not need ranker weights to move generative-search rankings. The paper evaluates TAP, Zero-Shot, STS, RAF, StealthRank, and 10 C-SEO strategies under one protocol, across five datasets, against a fixed Llama-3.1-8B-Instruct ranker. It tracks both promotion metrics—NRG, Success@α, Promote@α—and stealth via keyword violations and perplexity ratio. The ugly part is that black-box content rewriting matches or beats gradient attacks on rank promotion, while producing more fluent text. It also evades keyword and perplexity detectors in some domains. That undercuts the lazy defense posture many RAG/search products still have: block prompt injection, ignore content-side ranking manipulation. Once Google AI Overview and ChatGPT Search put retrieved pages into answer pipelines, GEO stops being an SEO gimmick and becomes ranker security.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

12d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:59 · 05·27

→PEFT-Arena: Parameter-Efficient Finetuning Stability-Plasticity Analysis

PEFT-Arena evaluates PEFT methods on downstream adaptation and general capability retention, using the stability-plasticity trade-off as the frame. The paper reports that, under comparable parameter budgets, orthogonal finetuning reaches the strongest Pareto frontier, and links forgetting to non-isometric representation distortion in activation space.

#Fine-tuning#Benchmarking#Interpretability#PEFT-Arena

why featured

HKR-H/K/R pass, but this is a specialist arXiv benchmark. The post gives the PEFT-Arena framing and Pareto claim, but not model scale, task set, or reproducible setup, so it stays in the 60–71 band.

editor take

Three feeds point to the same arXiv paper; PEFT-Arena’s useful punch is forcing PEFT evals to price in forgetting, not just task gains.

sharp

All three sources carry the same title and point back to arXiv:2605.28819v1; this is not independent confirmation, but one 28-page technical report spreading through cs.CL, cs.LG, and HF feeds. I like the framing because it attacks a lazy PEFT habit: reporting downstream accuracy while ignoring how much pretrained competence got burned. The concrete hook is strong: under comparable parameter budgets, orthogonal finetuning claims the better Pareto frontier, and the paper links forgetting to non-isometric distortion in activation space. For teams shipping LoRA-style adapters, the practical warning is sharper than the benchmark name: final SFT checkpoints often overshoot the better target-retention operating point, so path-wise rewinding deserves a slot in the eval loop.
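To make the stability-plasticity framing concrete, here is a minimal sketch of selecting the Pareto-optimal checkpoints from a single finetuning run. The checkpoint names and scores are invented for illustration, and `pareto_front` is a hypothetical helper, not something taken from PEFT-Arena.

```python
from typing import List, Tuple

# (checkpoint name, downstream task score, retained general-capability score)
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[str]:
    """Return names of points that no other point dominates (higher is better)."""
    front = []
    for name, task, keep in points:
        dominated = any(
            (t >= task and k >= keep) and (t > task or k > keep)
            for _, t, k in points
        )
        if not dominated:
            front.append(name)
    return front


# Made-up numbers: the final checkpoint loses on both axes to an earlier one.
checkpoints = [
    ("step_100", 0.60, 0.95),
    ("step_200", 0.70, 0.90),
    ("step_300", 0.74, 0.80),
    ("step_400", 0.73, 0.70),
]
front = pareto_front(checkpoints)
print(front)
```

In this toy run the last checkpoint falls off the frontier, which is the situation the rewinding remark above is pointing at.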

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

12d ago

arXiv · cs.CL · atom · EN · 17:59 · 05·27

→VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

The study compares tightly matched LLM and VLM pairs in a text-only setting, using whole-cortex fMRI responses and synchronized eye-tracking saccades to assess natural-reading alignment, and finds that multimodal pretraining gives no uniform global advantage, while VLMs show selective gains on sentences with stronger visual semantic content.

#Multimodal#Vision#Benchmarking#Research release

why featured

HKR-H/K pass: the title pushes against the multimodal-pretraining narrative, and the post gives fMRI plus eye-tracking conditions. HKR-R fails because the claim stays in cognitive-neuroscience evaluation, not product, cost, or safety impact.

editor take

VLMs show no global text-reading alignment gain. Sample size is undisclosed, so don’t oversell multimodal brain-likeness yet.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:59

12d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:59 · 05·27

→Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Gamma-World presents a multi-agent video world model for interactive simulation. Simplex Rotary Agent Encoding gives agents permutation-equivalent identities without learned slots. Sparse Hub Attention reduces cross-agent attention cost from quadratic to linear. A causal student distilled from a full-context diffusion teacher runs 24 FPS rollouts and generalizes from two to four players without extra training.

#Agent#Multimodal#Inference-opt#Gamma-World

why featured

HKR-H/K/R all pass, but this is a single paper summary with no known lab signal, code release, or large deployment evidence. The concrete linear-attention mechanism earns featured-level interest.

editor take

Gamma-World’s sharp bit is not 24 FPS; it is ditching learned agent slots. If 2-to-4 player transfer holds, game world models get a scaling path.

sharp

Gamma-World makes the right bet: multi-agent world models hit scaling pain at identity encoding and cross-agent attention, not at prettier video samples. Simplex Rotary Agent Encoding gives agents simplex-phase identities without learned slots, and Sparse Hub Attention cuts cross-agent attention from O(n²) to O(n). That is a cleaner contribution than stapling another diffusion backbone onto game footage. The 24 FPS causal student matters, but I would not anchor on the demo-speed number. The harder claim is transfer from two players to four players without extra training. The RSS snippet does not disclose environment complexity, action-space size, or eval scale. Compared with Genie-style controllable video and GAIA-1-like driving worlds, Gamma-World at least attacks the combinatorial part of multiplayer interaction head-on.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

12d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:59 · 05·27

→Self-Improving Language Models with Bidirectional Evolutionary Search

The paper proposes Bidirectional Evolutionary Search, which combines forward trajectory recombination with backward subgoal decomposition, and reports that BES outperforms existing open-source frameworks on three open problem-solving benchmarks at inference time.

#Reasoning#Agent#Inference-opt#Embodied-Minds-Lab

why featured

HKR-H/K/R pass: the paper has a clear self-improvement hook, a named search mechanism, and agent-reasoning relevance. The post gives 3 benchmark wins, not names, scores, or code, so it stays below must-write.

editor take

BES attacks the weak spot of best-of-N: exploration shape. But no benchmark names or scores are in the snippet, so hold the victory lap.

sharp

BES is aimed at the ceiling of best-of-N and plain tree search: they expand inside the model’s own high-probability shell, so the search looks broad but stays local. The mechanism is concrete enough to care about: recombine partial trajectories in forward search, then decompose the target into checkable subgoals for denser feedback. I buy the direction before I buy the result. The snippet says BES beats open-source frameworks on three open problem-solving benchmarks and still helps when mainstream post-training algorithms fail. It does not give benchmark names, absolute scores, sampling budgets, or verifier cost. Like most inference-time scaling papers now, the question is not whether it can move the score. The question is how many rollouts it burns per point.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

12d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 05·27

→Multi-fingered Hand Achieves Zero-Shot Sim-to-Real Transfer via Physics-Grounded Contact Representation

The paper introduces Center-of-Pressure, a physics-grounded tactile representation, and reports zero-shot sim-to-real transfer on a multi-fingered hand across two blind contact-rich tasks: peg-in-hole insertion and ball balancing. CoP-conditioned policies outperform coarse binary-contact and raw-taxel baselines, and the calibration scheme estimates taxel orientations without ground-truth force measurements.

#Robotics#Inference-opt#Research release

why featured

HKR-H/K pass: zero-shot sim-to-real and the CoP tactile representation are concrete. HKR-R is narrow: two blind manipulation tasks in an arXiv robotics paper, far from mainstream AI product or developer workflows.

editor take

Three listings trace to one arXiv paper; if CoP holds up, dexterous hands need better contact state, not just denser tactile arrays.

sharp

All 3 sources use the same title and point to arXiv:2605.28812v1; this is paper syndication, not independent validation. The paper moves tactile input from binary contact to Center-of-Pressure, then reports zero-shot sim-to-real on two blind tasks: peg-in-hole insertion and ball balancing. The strong hook is the calibration mechanism: taxel orientations are estimated with differentiable dynamics, without ground-truth force measurements. I buy the direction, not the broad victory lap. Dexterous manipulation has spent a year blaming policy learning and data scarcity; this paper says the contact representation itself is leaking the task. If that holds, piling more tactile taxels is the wrong default. The abstract gives no success rates, hand model, or trial count, so it cannot yet be compared cleanly with data-heavy robot-learning lines like ALOHA-style imitation.
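For orientation, the textbook center of pressure is the force-weighted centroid of the contact points. The sketch below computes it for a flat 2D taxel patch. The taxel layout, force values, and function name are assumptions for illustration; the paper's own representation and calibration scheme may differ.

```python
from typing import Optional, Sequence, Tuple


def center_of_pressure(
    positions: Sequence[Tuple[float, float]],
    normal_forces: Sequence[float],
    min_total_force: float = 1e-9,
) -> Optional[Tuple[float, float]]:
    """Force-weighted centroid of taxel positions, or None when nothing is touching."""
    total = sum(normal_forces)
    if total <= min_total_force:
        return None  # no meaningful contact, so no contact location
    x = sum(f * p[0] for p, f in zip(positions, normal_forces)) / total
    y = sum(f * p[1] for p, f in zip(positions, normal_forces)) / total
    return (x, y)


# Hypothetical 2x2 taxel patch with unit spacing.
taxels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
even_press = center_of_pressure(taxels, [1.0, 1.0, 1.0, 1.0])
right_edge = center_of_pressure(taxels, [0.0, 2.0, 0.0, 2.0])
no_contact = center_of_pressure(taxels, [0.0, 0.0, 0.0, 0.0])
print(even_press, right_edge, no_contact)
```

A binary-contact signal would report the first two cases identically as "touching"; the centroid separates them, which is the kind of extra contact state the summary credits.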

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:59

12d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:59 · 05·27

→HarmoVid: Relightful Video Portrait Harmonization

HarmoVid proposes a video portrait harmonization method that matches foreground lighting to a target background using a lighting deflickering model and asymmetric alpha-mask conditioning; the post does not disclose dataset size, metric values, or code availability.

#Vision#Multimodal#HarmoVid#Research release

why featured

HKR-K passes because the paper names a concrete video-lighting stabilization mechanism. HKR-H and HKR-R are weak, and dataset size, metrics, and code are not disclosed, keeping it in the low-value research-update band.

editor take

HarmoVid fixes portrait relighting flicker; no dataset, metrics, or code disclosed, so I’m filing it as a demo for now.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:56

12d ago

arXiv · cs.AI · atom · EN · 17:56 · 05·27

→Calibrating Conservatism for Scalable Oversight

The paper introduces Calibrated Collective Oversight (CCO), which uses Conformal Decision Theory to calibrate penalties online and bound undesirable outcomes under a user-specified target with finite-time guarantees and no distributional assumptions; experiments cover a modified SWE-bench setting and MACHIAVELLI, where violation rates track the specified targets.

#Agent#Alignment#Safety#SWE-bench

why featured

HKR-K and HKR-R pass: CCO uses online penalty calibration with finite-time, distribution-free violation control and tests on SWE-bench/MACHIAVELLI. HKR-H is weak, and no effect sizes are disclosed, so this stays in all.

editor take

CCO bounds violation rates to a user target with finite-time guarantees; this is a tunable brake, not oversight theater.
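As a rough illustration of what "calibrating a penalty online to a target rate" can look like, here is a generic conformal-style feedback update on a toy environment. This is not the paper's CCO algorithm: the update rule, the environment, and every name below are assumptions.

```python
import random
from typing import Tuple


def run_online_calibration(
    target: float, steps: int, lr: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Raise the penalty after a violation, lower it otherwise; return (rate, penalty)."""
    rng = random.Random(seed)
    penalty = 0.0
    violations = 0
    for _ in range(steps):
        # Toy environment (assumption): a larger penalty makes violations rarer.
        p_violate = min(1.0, max(0.0, 1.0 - penalty))
        violated = rng.random() < p_violate
        violations += int(violated)
        # Feedback step: the penalty moves by lr * (observed violation - target).
        penalty += lr * ((1.0 if violated else 0.0) - target)
    return violations / steps, penalty


rate, final_penalty = run_online_calibration(target=0.10, steps=5000)
print(round(rate, 3), round(final_penalty, 3))
```

Because each step moves the penalty by `lr * (violation - target)` and the penalty stays bounded in this toy setup, the long-run violation rate is pinned near the target regardless of the random draws, which gives some intuition for distribution-free guarantees of this kind.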

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:56

12d ago

arXiv · cs.CL · atom · EN · 17:56 · 05·27

→Personal Visual Memory from Explicit and Implicit Evidence

The paper introduces a personal visual memory benchmark and VisualMem, a hybrid visual-text architecture that adds a structured visual memory module to a text-memory backend; the RSS snippet does not disclose dataset size, model details, or exact performance numbers.

#Memory#Vision#Multimodal#Research release

why featured

HKR-H/K/R all pass, but the item is still abstract-level: it names a benchmark and VisualMem, while dataset size, scores, and reproduction details are not disclosed. No hard exclusion; keep it in all.

editor take

VisualMem stores identity, ownership, and durable facts; no dataset size or scores are disclosed, so I treat it as a benchmark land-grab.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:56

12d ago

arXiv · cs.CL · atom · EN · 17:56 · 05·27

→Research paper introduces OmniVerifier-M1 multimodal verification model with structured recalibration

The paper trains OmniVerifier-M1 for visual verification, using symbolic outputs such as bounding boxes instead of textual rationales, and decoupling reinforcement-learning objectives for binary judgment and meta-verification.

#Multimodal#Vision#Reasoning#OmniVerifier-M1

why featured

HKR-K and HKR-R pass: the paper offers a structured visual-verification mechanism tied to multimodal reliability. HKR-H is weak, and no result numbers or release conditions are disclosed, so it stays in all.

editor take

OmniVerifier-M1 uses boxes over text rationales; I buy it: vision verification finally moves its rewards away from judge models.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:55

12d ago

arXiv · cs.CL · atom · EN · 17:55 · 05·27

→CAPO Method Learns Annotator-Specific Explanation Behavior from Label Variation

The paper tests human label variation on two sentence-pair tasks with four annotators each, and CAPO contrasts a target annotator’s response against other valid annotations for the same input, outperforming prompting and SFT on aggregation-aware imitation and judge-based attribution.

#Fine-tuning#Alignment#Benchmarking#Research release

why featured

HKR-K is solid: CAPO optimizes target annotator answers against other valid labels on the same input. HKR-R applies to RLHF data quality, but the academic framing and small setup keep it in all, not featured.

editor take

CAPO beats SFT on 2 sentence-pair tasks with 4 annotators each; useful signal, but too narrow for big alignment claims.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:49

12d ago

arXiv · cs.CL · atom · EN · 17:49 · 05·27

→Skill-Conditioned Gated Self-Distillation for LLM Reasoning

SGSD builds a multi-teacher pool from retrieved skill-mistake pairs and validates each teacher’s polarity against the same plain-prompt student rollout; on Qwen3-1.7B, it averages 6.2% above GRPO and 1.7% above answer-conditioned OPSD across AIME24, AIME25, and HMMT25, while using a weaker privileged-information assumption.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-K has a concrete mechanism and AIME24/AIME25/HMMT25 gains; HKR-R fits small-model reasoning training. HKR-H is weak, and this is an arXiv method paper below featured threshold.

editor take

SGSD beats GRPO by 6.2% on Qwen3-1.7B math sets; treating retrieved skills as suspect teachers is the sane move.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:47

12d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:47 · 05·27

→Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models

The paper tests CoT prefix transfer with a provider-receiver framework: AIME transfer is largely driven by explicit answer leakage, MMLU-Pro depends more on receiver competence, and ZebraLogic relies on partial structured-answer information rather than full-answer leakage alone.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R all pass, but this is a single research paper without disclosed code, replication, or major-lab release signal; featured fit, not an 85+ same-day must-write.

editor take

Stop treating CoT transfer as portable reasoning: on AIME it often smells like answer leakage; MMLU-Pro tests the receiver more than the trace.

sharp

This paper cuts through a lazy assumption in CoT transfer: a trace that helps another model is not automatically reusable reasoning. The provider-receiver setup matters because receivers see progressively longer CoT prefixes, then answer in force-answer or free-generation mode. The ugly part is AIME. In force-answer mode, transfer is largely driven by explicit answer availability, which matches how math CoTs often end by spelling out the final value. MMLU-Pro depends more on receiver competence, while ZebraLogic uses partial structured-answer information. That pushes back on the common “strong model teaches weak model to reason” story. Sometimes the weak model gets the answer, sometimes the format, sometimes a search hint. The useful engineering hook is answer agreement across receivers as a gold-free early-stop signal for provider reasoning. That is a cleaner win than paying for ever-longer traces.
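The gold-free early-stop idea can be sketched as follows: feed each receiver progressively longer prefixes, and stop extending the provider's trace once enough receivers agree on an answer. The threshold, the answer strings, and both helper names are hypothetical; the paper's actual stopping rule is not given in the snippet.

```python
from collections import Counter
from typing import List, Optional, Tuple


def agreement(answers: List[str]) -> Tuple[str, float]:
    """Most common answer and the fraction of receivers that gave it."""
    if not answers:
        raise ValueError("need at least one receiver answer")
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def first_stable_prefix(
    answers_by_prefix: List[List[str]], threshold: float = 0.8
) -> Optional[Tuple[int, str]]:
    """Index of the shortest prefix where receivers agree at least `threshold`."""
    for index, answers in enumerate(answers_by_prefix):
        top, fraction = agreement(answers)
        if fraction >= threshold:
            return index, top
    return None  # receivers never converged; keep the full trace


# Made-up answers from five receivers at three prefix lengths.
answers_by_prefix = [
    ["12", "7", "9", "12", "3"],
    ["12", "12", "9", "12", "7"],
    ["12", "12", "12", "12", "9"],
]
stop = first_stable_prefix(answers_by_prefix)
print(stop)
```

No gold label appears anywhere in the rule, which is the point. It says nothing about whether the agreed answer is right, so on leakage-prone tasks like AIME the signal should be read with that caveat.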

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:46

12d ago

arXiv · cs.AI · atom · EN · 17:46 · 05·27

→Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

The study compares a Baseline Agent searching open-web documents with a Semantic Agent using 90 million schema.org datasets. The Semantic Agent achieves 65.7% higher overall precision on FAIR-compliant datasets, while the Baseline Agent answers 40% more questions and often returns prose-heavy pages or portal landing pages.

#Agent#RAG#Benchmarking#schema.org

why featured

HKR-H/K/R all pass, but this is a single arXiv study without a released artifact, production replacement, or major-lab signal. Useful for Agent/RAG retrieval design, so it stays in the 60–71 all tier.

editor take

Semantic Agent gets 65.7% higher precision on FAIR-compliant datasets, but the Baseline Agent answers 40% more questions; agentic RAG still leans on old schema.org plumbing.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:42

12d ago

arXiv · cs.CL · atom · EN · 17:42 · 05·27

→Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

The paper introduces MalayPrag, a benchmark that evaluates 10 off-the-shelf LLMs on three prediction tasks for colloquial Malay discourse particles, and tests five linguistically grounded attributes that improve links between particles and pragmatic functions.

#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R are present, but this is a niche multilingual benchmark paper; the body gives task scale, not key results or model rankings. That keeps it in the 60–71 research-release band.

editor take

MalayPrag tests 10 LLMs on 3 tasks; good niche benchmark, because English-heavy scores hide pragmatic failure modes.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:38

12d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:38 · 05·27

→Study of the Abstraction Gap in Vision-Language Causal Reasoning

The paper introduces CAGE to evaluate eight VLMs on 49,500 questions across 5,500 images, finding seven models with an abstraction gap (AG) above 0.50, text scores of 6–8, and chain scores below 2.5.

#Vision#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single VLM benchmark paper rather than a model release or production update. Concrete dataset and failure-rate numbers put it in the low featured band.

editor take

CAGE lands a clean hit: 7 of 8 VLMs show AG above 0.50, so fluent causal talk is still being mistaken for visual reasoning.

sharp

CAGE’s sharp cut is separating fluent explanation from faithful visual causality. Across 8 VLMs, 5,500 images, and 49,500 questions, 7 models show AG above 0.50; text-only scores sit at 6–8, while explicit causal-chain scores fall below 2.5. That is not benchmark noise. It is the bill coming due for evaluations that reward plausible captions. The nasty detail is that fine-tuning on 45,000 chain-annotated examples still fails to close the gap. One model reaches near-zero AG, but the snippet does not name it. That makes the architecture/pretraining claim hard to audit, yet the direction is credible: SFT can teach causal phrasing, but it does not reliably install causal abstraction in VLMs.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:38

12d ago

arXiv · cs.CL · atom · EN · 17:38 · 05·27

→Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

The paper defines marker internal confidence (MIC) and evaluates its stability with 7 metrics, finding that LLMs struggle to differentiate epistemic markers such as “likely” by intrinsic confidence across distributions while retaining a partly consistent ranking across tasks.

#Alignment#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass, but this is a single arXiv abstract with no model list, dataset size, or effect numbers disclosed. It is useful calibration research, below same-day must-write range.

editor take

The paper tests MIC with 7 metrics; LLMs still blur markers like “likely” across distributions, so verbal confidence stays shaky.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:37

12d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:37 · 05·27

→Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

LearnWeak uses a stronger reference agent to identify domain-specific weaknesses in small computer-use agents, synthesize targeted tasks, and build supervision automatically; on OSWorld, it improves average performance by 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B across eight domains.

#Agent#Tools#LearnWeak#EvoCUA

why featured

HKR-H/K/R all pass: the paper gives a clear weak-domain specialization hook, quantified OSWorld gains, and cost resonance for computer-use agents. Single arXiv source with no disclosed code or deployment keeps it below the 78+ band.

editor take

LearnWeak is the rare CUA paper that attacks the student’s failure modes first; an 11-point OSWorld gain beats another glossy agent demo.

sharp

LearnWeak lands because it treats small CUA failure as local, not as a generic data shortage. It uses a stronger reference agent to find weak domains, synthesize targeted tasks, and build supervision. On OSWorld, it gains 11.6 points over EvoCUA-8B and 11.1 over OpenCUA-7B across eight domains. The key negative result is blunt: naive large-scale synthetic data gives only marginal improvement. I buy this more than the “one big agent handles every app” story. Computer-use agents fail through mixed planning and execution errors, and broad trajectory training often teaches confident misclicking. LearnWeak’s error-aware objective separates those two update paths. The gap: this RSS body gives no per-domain table, reference-agent name, or dataset size, so the 11-point claim still needs the PDF and benchmark hygiene checked.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:35

12d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:35 · 05·27

→FluxMem: Rethinking Agent Memory as Continuously Evolving Connectivity

FluxMem models agent memory as a heterogeneous graph and refines topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation; across LoCoMo, Mind2Web, and GAIA, the paper reports consistent state-of-the-art performance, with code planned for release at zjunlp/LightMem.

#Agent#Memory#Tools#FluxMem

why featured

HKR-K/R pass: FluxMem gives a concrete memory mechanism and three benchmark claims. With only arXiv-summary detail, no scores or author context disclosed, and code promised but not yet released, it stays in the 72–77 featured band.

editor take

FluxMem puts agent memory back into graphs, not vector dumps. I like the direction, but SOTA without scores is still a paper claim.

sharp

FluxMem makes the right bet: agent memory breaks when it stays a fixed retrieval stack. It models memory as a heterogeneous graph, then runs three stages: connection formation, feedback refinement, and long-term consolidation. The useful part is concrete: it repairs missing links, prunes interference, aligns abstraction level, and distills successful trajectories into reusable procedural circuits. That is closer to workflow memory than dumping more history into RAG. I would not treat the SOTA claim as settled. The snippet names LoCoMo, Mind2Web, and GAIA, but gives no scores, base models, token budgets, or extra tool-call costs. Memory papers often win through evaluation plumbing; MemGPT-style systems had the same problem. The promised zjunlp/LightMem code release is the test. Until then, this is a strong architecture proposal, not proof that graph memory beats tuned retrieval in production agents.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:23

12d ago

arXiv · cs.AI · atom · EN · 17:23 · 05·27

→SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

The paper proposes SwarmHarness, a decentralized protocol with three components: a DHT-based SwarmRegistry, a SwarmRouter using capability, load, latency, and trust, and SwarmCredit that assigns compute-credit rewards through a Shapley-value approximation.

#Agent#Tools#SwarmHarness#HarnessAPI

why featured

HKR-K/R pass: the mechanisms are concrete and relevant to multi-agent orchestration. No experiment numbers, open-source artifact, or deployment case are disclosed, so it stays in the lower 60–71 band.

editor take

SwarmHarness proposes DHT routing plus Shapley-ish credits; no experiment scale is disclosed, so I’m reading it as Petals with accounting.
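For readers who have not met a Shapley-value approximation before, the usual trick is Monte Carlo sampling over random join orders: credit each agent with its average marginal contribution. The sketch below is that generic estimator on a made-up three-agent task. The agent names, the value function, and the sample count are assumptions, and SwarmCredit's actual approximation is not described in the snippet.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def shapley_estimate(
    agents: List[str],
    value: Callable[[FrozenSet[str]], float],
    samples: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Average each agent's marginal contribution over random join orders."""
    rng = random.Random(seed)
    totals = {agent: 0.0 for agent in agents}
    for _ in range(samples):
        order = agents[:]
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for agent in order:
            coalition.add(agent)
            current = value(frozenset(coalition))
            totals[agent] += current - previous
            previous = current
    return {agent: total / samples for agent, total in totals.items()}


# Hypothetical task value: solo contributions plus a planner+coder synergy bonus.
BASE = {"planner": 2.0, "coder": 4.0, "tester": 1.0}


def task_value(coalition: FrozenSet[str]) -> float:
    bonus = 3.0 if {"planner", "coder"} <= coalition else 0.0
    return sum(BASE[a] for a in coalition) + bonus


credits = shapley_estimate(list(BASE), task_value)
print({agent: round(credit, 2) for agent, credit in credits.items()})
```

The estimates always sum to the full coalition's value, and the synergy bonus is split between the two agents that create it. The cost is one value-function call per agent per sample, which is where any real protocol would have to economize.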

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:22

12d ago

arXiv · cs.AI · atom · EN · 17:22 · 05·27

→CubePart: An Open-Vocabulary Part-Controllable 3D Generator

CubePart takes a global text prompt and a user-defined open-ended parts schema, then generates one mesh per schema element; the paper uses a two-stage architecture that separates global shape synthesis from part-level decoding, and the snippet says assets can enter game engines without manual post-processing.

#Multimodal#Vision#CubePart#Research release

why featured

HKR-H and HKR-K pass: part-level controllable 3D generation has a concrete mechanism, with per-part meshes and a two-stage architecture. Scope stays research-heavy, with no metrics, code, or product adoption disclosed, so it fits the 60–71 all band.

editor take

CubePart emits one mesh per user-named part; I like the API, but dataset scale and failure rates are undisclosed.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:19

12d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:19 · 05·27

→Research paper shows LLM zeroth-order fine-tuning is an inference workload

The paper runs the repeated scoring phase of LLM zeroth-order fine-tuning through a vLLM serving runtime, reducing a 20k-step LoZO run on OPT-13B SST-2 from 4.15 to 0.51 estimated training hours under matched LoRA-only settings, an 8.13x speedup.

#Fine-tuning#Inference-opt#vLLM#OPT

why featured

HKR-H/K/R all pass: the title is counterintuitive, the post gives an 8.13x speedup with reproducible conditions, and it hits fine-tuning cost. Technical, but practical enough for the 78–84 band.

editor take

Putting LoZO through vLLM is not just a neat systems trick; it says ZO fine-tuning should live in serving runtimes, not training loops.

sharp

The sharp claim here is that ZO fine-tuning has been misfiled as training work. On OPT-13B with SST-2, the 20k-step LoZO run drops from 4.15 to 0.51 estimated training hours, an 8.13x speedup. Across OPT-1.3B to OPT-13B core-step tests, the paper reports 2.34x to 7.72x. The trick is not a clever new optimizer; it routes repeated forward objective evaluations through vLLM’s serving path. Honestly, that lands. If the method avoids backprop, keeping it inside a fragmented training loop is mostly historical baggage. The pushback is scope: the headline result sits on OPT plus SST-2, with matched LoRA-only settings. Multi-task adaptation, many dynamic adapters, and production scheduling pressure are not settled by this paper. But for practitioners, the direction is clean: lightweight adaptation is starting to look like inference infrastructure work.
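Why a zeroth-order step looks like inference is easiest to see in code: the update needs only two forward evaluations of the loss at perturbed weights and no backward pass. Below is a minimal sketch of a generic two-point (SPSA-style) estimator on a toy quadratic. It does not reproduce LoZO's low-rank perturbations or any vLLM integration, and all names and hyperparameters are assumptions.

```python
import random
from typing import Callable, List

TARGET = [1.0, -2.0, 0.5]  # optimum of the toy objective (assumption)


class CountingLoss:
    """Toy quadratic loss that counts how many forward evaluations it served."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, weights: List[float]) -> float:
        self.calls += 1
        return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


def zo_step(
    weights: List[float],
    loss_fn: Callable[[List[float]], float],
    lr: float,
    eps: float,
    rng: random.Random,
) -> List[float]:
    """One two-point zeroth-order update: two forward passes, no gradients."""
    z = [rng.gauss(0.0, 1.0) for _ in weights]
    plus = [w + eps * zi for w, zi in zip(weights, z)]
    minus = [w - eps * zi for w, zi in zip(weights, z)]
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    return [w - lr * projected_grad * zi for w, zi in zip(weights, z)]


rng = random.Random(0)
loss_fn = CountingLoss()
weights = [0.0, 0.0, 0.0]
for _ in range(500):
    weights = zo_step(weights, loss_fn, lr=0.05, eps=1e-3, rng=rng)
forward_passes = loss_fn.calls
final_loss = sum((w - t) ** 2 for w, t in zip(weights, TARGET))
print(forward_passes, final_loss)
```

Every unit of work in the loop is a scoring call, so batching and scheduling those calls is a serving problem, which is the reframing the paper leans on.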

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

16:35

12d ago

HuggingFace Papers (takara mirror) · rss · EN · 16:35 · 05·27

→Stage-wise Distortion-Perception Traversal for Zero-shot Inverse Problems with Diffusion Models

The paper proposes MAP-RPS, a two-stage framework for diffusion-based zero-shot inverse problems: an MAP estimation stage approximates an MMSE low-distortion initialization, then a re-noised posterior sampling stage improves perceptual quality, with a latent-space extension called LMAP-RPS for pretrained latent diffusion backbones.

#Vision#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes because the MAP-RPS mechanism is concrete. HKR-H/R fail, and hard-exclusion-technical-accessibility applies: diffusion inverse-problem methodology has no clear industry on-ramp, so importance is capped below 40.

editor take

MAP-RPS splits D-P traversal into 2 diffusion stages; it is accepted at ICML 2026, but code and real-task metrics are undisclosed.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

15:48

12d ago

HuggingFace Papers (takara mirror) · rss · EN · 15:48 · 05·27

→GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study

GraphLit extracts about 20,000 Dynamic Heterogeneous Character Networks (DHCNs) from Project Gutenberg, trains literary representations with a masked graph autoencoder objective, and outperforms text-only and graph-only baselines across 12 character-related tasks, especially those requiring contextual understanding.

#Embedding#Benchmarking#Project Gutenberg#Research release

why featured

HKR-K passes via concrete dataset and benchmark details, but HKR-H and HKR-R fail. The work is niche digital-humanities research with no product, agent, or industry adoption angle.

editor take

GraphLit extracts ~20,000 DHCNs; I buy the literary-study benchmark, not any implied jump to general long-context understanding.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

15:33

12d ago

HuggingFace Papers (takara mirror) · rss · EN · 15:33 · 05·27

→Interpretability Coverage Disparity and Fairness in Hybrid Interpretable Models

The paper defines Interpretability Coverage Disparity (ICD) and evaluates routing fairness across four hybrid interpretable methods, three fairness benchmark datasets, and multiple sensitive attributes, finding substantial disparity in intermediate transparency regimes where both transparent and black-box components are used.

#Interpretability#Safety#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the angle has a clear inversion, and the post gives ICD plus 4 methods and 3 benchmarks. The impact stays academic; no open tool, deployment case, or visible industry debate is disclosed.

editor take

ICD audits four hybrid interpretable methods; measuring who gets explanations exposes a fairness gap most benchmarks skip.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

15:20

12d ago

HuggingFace Papers (takara mirror) · rss · EN · 15:20 · 05·27

→Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

The paper introduces the VIP identification task and the Temporal-VIP dataset with 9,249 video segments, 11 categories, and aligned importance rationales; VIP-Net reaches 67.3% accuracy, above 37.5%-53.9% baselines, with 0.63 mean rationale similarity after feature-guided LLM refinement.

#Multimodal#Vision#Benchmarking#Temporal-VIP

why featured

HKR-K passes with concrete dataset size, scene count, and accuracy. HKR-H/R are weak because this is a niche video-understanding benchmark, not a product or model update likely to drive broad practitioner debate.

editor take

VIP-Net hits 67.3% on Temporal-VIP; 9,249 clips still leave me unconvinced on genre and surveillance-view transfer.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:44

12d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN14:44 · 05·27

→Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

The paper uses linear probes on per-layer residual stream activations to predict LLM refusal before decoding, and Mechanistic AutoDAN replaces full-model fitness evaluation with partial forward passes and probe scoring, reducing per-iteration search time by up to 72%.

#Safety#Interpretability#Alignment#AutoDAN

why featured

HKR-H/K/R all pass: the hook is concrete, the 72% speedup is testable, and jailbreak cost hits a safety nerve. No major-lab or cross-source signal is shown, so this stays in the 78–84 band.

editor take

Refusal is showing up as a readable feature before decoding; the 72% search-time cut makes safety instrumentation double as attack tooling.

sharp

The sharp part is how cleanly the defense signal becomes attack infrastructure. A linear probe over residual-stream activations at each transformer block predicts refusal before decoding, then Mechanistic AutoDAN uses partial forward passes plus probe scoring instead of full fitness evaluation. The reported payoff is up to 72% lower per-iteration search time, with attack success rates competitive with vanilla AutoDAN. That is rough for the “refusal is an output policy” story. The refusal feature is already structured before tokens are generated. If this probe generalizes, red teams get a cheap navigation signal for jailbreak search, while many safety stacks still audit final text. I’d want to see model list, layer positions, and probe transfer details before overreading it, but the mechanism is exactly the kind of interpretability result that ships faster into attacks than controls.
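For readers who want the probing idea in concrete form, here is a minimal sketch. It is not the paper's code: the activations are synthetic, the planted "refusal" signal sits in one coordinate by construction, and `train_probe`, `refusal_score`, `fake_activation`, `DIM`, and `SHIFT` are hypothetical names chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp so math.exp cannot overflow on extreme logits.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(acts, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe by batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def refusal_score(x, w, b):
    """Probe output: estimated probability that the model will refuse."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# ASSUMPTION: synthetic stand-in for one layer's activations.
random.seed(0)
DIM, SHIFT = 8, 4.0

def fake_activation(refuses):
    x = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    x[0] += SHIFT * (refuses - 0.5)  # planted signal in coordinate 0
    return x

labels = [i % 2 for i in range(240)]
acts = [fake_activation(y) for y in labels]
w, b = train_probe(acts[:200], labels[:200])
held_out = list(zip(acts[200:], labels[200:]))
accuracy = sum((refusal_score(x, w, b) > 0.5) == bool(y) for x, y in held_out) / len(held_out)
```

In a search loop, a score like `refusal_score` would stand in for a full generation as the fitness signal, which is where a per-iteration saving could come from. The real layer choice and probe transfer remain the open questions noted above.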

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

14:39

12d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN14:39 · 05·27

→GEM: Generative Supervision Helps Embodied Intelligence

GEM adds depth-map generation as a joint objective during VLM pre-training and releases the GEM-4M dataset; the post says GEM reaches state-of-the-art results across embodied benchmarks, while GEM-VLA improves task execution in simulation and real-world evaluations.

#Robotics#Vision#Multimodal#GEM

why featured

HKR-H/K/R pass: the mechanism and GEM-4M dataset give real signal for embodied AI. The post only states SOTA without margins, model scale, or real-world setup, so it stays at the low featured band.

editor take

GEM’s depth-supervised VLM pretraining is a sane bet for robotics, but SOTA claims without numbers still smell like paper-launch inflation.

sharp

GEM gets the bet right: depth-map generation during VLM pretraining is closer to robot control than another pile of text-instruction data. Grasping, navigation, and obstacle avoidance need geometry, not just object labels. The concrete hook is GEM-4M: grounding, reasoning, and planning data paired with depth supervision, plus GEM-VLA tested in simulation and real-world evaluations. I don’t buy the SOTA framing yet. The snippet says “diverse embodied benchmarks” and “vastly superior,” but gives no benchmark names, success rates, robot platforms, or comparison against OpenVLA, RT-2, or Octo. This is a good training-objective story; the evidence shown here is still abstract-level marketing.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

14:36

12d ago

HuggingFace Papers (takara mirror)· rssEN14:36 · 05·27

→DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

DriveWAM adapts a pretrained video diffusion Transformer into an autoregressive video-action policy, trains unified video and action tokens with joint flow matching, and reports planning results on NAVSIM and PhysicalAI-Autonomous-Vehicles with a data-scaling study from 4k to 100k driving clips.

#Agent#Robotics#Multimodal#DriveWAM

why featured

HKR-K/R pass: the item names a concrete model conversion and NAVSIM scaling setup, and it touches driving-policy learning. HKR-H is weak, and this is a single research paper rather than a product or market event.

editor take

DriveWAM scales 4k to 100k clips on NAVSIM; video priors fit driving, but no closed-loop real-car evidence is disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

14:29

12d ago

HuggingFace Papers (takara mirror)· rssEN14:29 · 05·27

→GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

GUI-CIDER trains GUI agents with a three-stage mid-training pipeline that converts GUI trajectories into causal knowledge, reselects exemplars by causal structure and redundancy, and improves understanding and success rates on two GUI knowledge benchmarks and three task-completion benchmarks.

#Agent#Multimodal#Fine-tuning#GUI-CIDER

why featured

HKR-K/R pass: the paper offers a concrete training mechanism and multi-benchmark validation, and GUI-agent reliability matters to practitioners. HKR-H is weak; gains and release details are not disclosed, so it stays below featured.

editor take

GUI-CIDER reports 2 knowledge and 3 task benchmarks; no gains disclosed, so I read it as GUI trajectory dedup training.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

14:28

12d ago

HuggingFace Papers (takara mirror)· rssEN14:28 · 05·27

→Semi-Supervised Hypothesis Testing by Betting on Predictions

The paper introduces a testing-by-betting framework that uses unlabeled X samples to improve sequential hypothesis testing; under label shift or concept shift assumptions, the test remains anytime valid and is evaluated through simulations and large language model assessment.

#Benchmarking#Reasoning#Research release#Benchmark

why featured

HKR-K is clear: the post gives a semi-supervised testing-by-betting mechanism, shift conditions, and an LLM-eval simulation. The statistical angle and non-flagship source keep it in all, not featured.

editor take

This plugs unlabeled X into sequential tests while staying anytime-valid; for LLM evals with scarce labels, that beats another benchmark pile.
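A minimal sketch of plain testing by betting, to show what "anytime-valid" buys. This is the textbook supervised version for a bounded mean, not the paper's semi-supervised construction; `betting_test` and the fixed bet size `lam` are illustrative assumptions.

```python
def betting_test(xs, m0, alpha=0.05, lam=0.5):
    """Sequentially test H0: E[X] <= m0 for X in [0, 1] by betting.

    Wealth starts at 1 and is multiplied by 1 + lam * (x - m0) per sample.
    With 0 <= lam <= 1 / m0 each factor is nonnegative, so under H0 the
    wealth is a nonnegative supermartingale and, by Ville's inequality,
    reaches 1 / alpha with probability at most alpha at ANY stopping time.
    Returns the 1-based step at which H0 is rejected, or None.
    """
    wealth = 1.0
    for t, x in enumerate(xs, start=1):
        wealth *= 1.0 + lam * (x - m0)
        if wealth >= 1.0 / alpha:
            return t
    return None

# Strong evidence against H0: every observation is 1.0 while m0 = 0.5.
reject_step = betting_test([1.0] * 50, m0=0.5)
# No evidence: observations sit exactly at m0, so wealth never moves.
no_reject = betting_test([0.5] * 1000, m0=0.5)
```

The point for label-scarce evals is that the test can be checked after every sample without inflating the error rate. How the paper folds unlabeled X into the bet is not shown in the snippet.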

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

14:00

12d ago

HuggingFace Papers (takara mirror)· rssEN14:00 · 05·27

→The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

The study runs a mixed-subjects Q&A experiment comparing warm and neutral chatbots; the post does not disclose participant count. Users still rely on AI despite access to web search. Prior trust drives verification more than answer properties, while consulting additional AI sources predicts higher accuracy than traditional web search.

#Agent#Safety#Research release#Safety/alignment

why featured

HKR-H/K/R pass, but the body gives only the mechanism; participant count, effect size, and replication details are not disclosed. Useful safety/UX research, not a same-day industry story.

editor take

The study compares warm vs neutral chatbots but omits N; I don’t buy warmth as UX when it increases agreement with wrong answers.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

13:47

12d ago

HuggingFace Papers (takara mirror)· rssEN13:47 · 05·27

→DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing

DiscoForcing generates full-body character motion under strict causality and bounded-latency streaming, using a causal music encoder, heterogeneous-noise diffusion-forcing training, and history-guided sampling to improve long-horizon stability and audio-motion alignment over prior baselines under matched causality and latency constraints.

#Audio#Robotics#Inference-opt#DiscoForcing

why featured

HKR-H/K pass: the real-time full-body motion hook is concrete, and the post lists causal streaming plus sampling mechanisms. HKR-R is weak; this is niche animation/avatar research without product, open-source, or competitive pressure.

editor take

DiscoForcing forces music-to-motion into strict causality and bounded latency; no ms latency disclosed, so I read this as benchmark hygiene.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

13:32

12d ago

HuggingFace Papers (takara mirror)· rssEN13:32 · 05·27

→The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

The authors propose Prosecution Decision Prediction and build PDP-Bench with 4,630 real Chinese prosecutorial decisions across 190 charges, classifying cases into prosecution or three non-prosecution decisions for evidence evaluation, legal subsumption, and discretion assessment.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H/K pass: the title frames an LJP blind spot and the body gives PDP-Bench size plus task design. The legal-NLP scope is narrow for AI practitioners, so it stays in the interesting-but-not-featured band.

editor take

PDP-Bench has 4,630 prosecution decisions; I trust this probe more than LJP, whose indicted-only sample bakes in survivor bias.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

13:20

12d ago

HuggingFace Papers (takara mirror)· rssEN13:20 · 05·27

→GONDOR to the Rescue: Satisficing Planning with Low Memory

GONDOR extends Greedy Best-First Search under strict memory limits by compressing the search tree, retaining sparse anchor states, and reconstructing the final path through re-search between anchors.

#Reasoning#Memory#GONDOR#Research release

why featured

HKR-K passes on a concrete planning mechanism, but HKR-H and HKR-R are weak. The post gives no benchmark, code detail, or product path, so it stays in the low-value research band.

editor take

GONDOR compresses GBFS with anchor re-search; no memory budgets disclosed, so the time-for-coverage tradeoff is the test.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

13:17

12d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN13:17 · 05·27

→BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

BiasEdit detects unknown bias attributes in visual datasets using statistical dependence and mutual information over vision-language representations, then applies text-guided image editing to generate realistic bias-conflict samples; the post says it needs no manual annotations and reports state-of-the-art debiasing performance even when training data are fully biased.

#Vision#Alignment#Safety#Research release

why featured

HKR-H/K/R all pass, but this is a single paper summary without code, benchmark details, or external replication. The fully biased-data SOTA claim gives it practical punch, landing at 78.

editor take

BiasEdit’s sharp move is treating bias as editable data, not labels; I’d still audit the “fully biased” SOTA claim hard.

sharp

BiasEdit moves debiasing from manual labels into a data-editing pipeline, and I buy that direction more than the SOTA headline. It detects unknown bias attributes through statistical dependence and mutual information over vision-language representations, then uses text-guided image editing to create bias-conflict samples. That is cleaner than older setups that assume the bias label is already known. The risk is that the editor becomes the new bias source. The snippet says it works even when training data are fully biased, but gives no dataset names, margin over baselines, or edit-failure rate. Compared with JTT or LfF-style methods that fight bias during training, BiasEdit pushes the fight into dataset construction. That is deployable, but it makes the off-the-shelf VLM and image editor part of the fairness system, not neutral tooling.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

11:54

12d ago

HuggingFace Papers (takara mirror)· rssEN11:54 · 05·27

→Research proposes method for detecting diffusion-generated time series under generator shift

The study compares white-box reconstruction detection with a black-box raw-signal classifier for diffusion-generated time series. Under generator shift, the black-box detector reaches 79.2 average F1, a 22.1% relative improvement over the white-box approach, and 57.2 TPR@1%FPR.

#Benchmarking#Research release#Benchmark

why featured

HKR-K is clear via concrete metrics, and HKR-R touches synthetic-data detection under shift. The scope is narrow time-series research with no model/product/open-source impact, so it stays in the lower interesting band.

editor take

Black-box raw-signal detection hits 79.2 F1; stop porting image reconstruction tricks to time series under generator shift.
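Two quick checks on the numbers, as a hedged sketch. It assumes "relative improvement" means (new - old) / old, which the snippet does not confirm, and it uses a common definition of TPR at a fixed FPR. `implied_baseline` and `tpr_at_fpr` are illustrative helpers, not the paper's evaluation code.

```python
import math

def implied_baseline(score, relative_gain):
    """Back out the baseline if score = baseline * (1 + relative_gain)."""
    return score / (1.0 + relative_gain)

def tpr_at_fpr(real_scores, fake_scores, max_fpr=0.01):
    """Fraction of fakes flagged when at most max_fpr of reals are flagged.

    Higher score means "more likely generated". Samples are flagged only
    when strictly above the threshold.
    """
    ranked = sorted(real_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)  # tolerated false positives
    threshold = ranked[allowed]
    return sum(s > threshold for s in fake_scores) / len(fake_scores)

# ASSUMPTION: under the (new - old) / old reading, white-box F1 is about 64.9.
white_box_f1 = implied_baseline(79.2, 0.221)

# Toy scores: 100 real series, 4 generated ones.
reals = [i / 100 for i in range(100)]
fakes = [0.995, 0.5, 0.999, 0.2]
toy_tpr = tpr_at_fpr(reals, fakes)
```

The low-FPR metric is the one that matters operationally: a detector can post a decent F1 and still miss almost half of the fakes once false alarms are capped at 1%.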

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

11:50

12d ago

HuggingFace Papers (takara mirror)· rssEN11:50 · 05·27

→Picid: Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Picid formalizes PHM evaluation as an executable protocol covering splits, preprocessing, label alignment, windows, and metrics. The paper evaluates 13 models on 12 datasets across batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via a reproducible PHM protocol and a 13-model/12-dataset setup. HKR-H and HKR-R are weak because the story is niche industrial maintenance, so it stays in the low-value research band.

editor take

Picid tests 13 models on 12 PHM datasets; this field needs fewer SOTA claims and fewer hidden splits in scripts.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

11:01

12d ago

HuggingFace Papers (takara mirror)· rssEN11:01 · 05·27

→Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in MoE Models

The paper proposes RA-MoE, a three-stage fine-tuning framework that adds routing alignment loss for target-language ci examples, and reports gains over standard SFT, Routing Steering, and RISE across three MoE models, three tasks, and six target languages.

#Fine-tuning#Reasoning#RA-MoE#Routing Steering

why featured

HKR-K passes: the summary names a three-stage RA-MoE method and a 3×3×6 evaluation. HKR-H/R are weak because the angle is technical and the audience is limited to multilingual MoE fine-tuning, so it sits in the 60–71 band.

editor take

RA-MoE beats SFT, Routing Steering, and RISE on 3 MoEs, 3 tasks, 6 languages; useful hook, but RSS omits gain sizes.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

10:16

12d ago

HuggingFace Papers (takara mirror)· rssEN10:16 · 05·27

→Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Every9D-21M provides 9D pose annotations for 21.8M real-world images, built from 109K object-centric videos across 700 everyday object categories.

#Vision#Benchmarking#Every9D-21M#GenIntel

why featured

HKR-H and HKR-K pass: the dataset scale, class count, and video count are concrete. HKR-R is weak because this is a specialist vision/robotics dataset, so it stays below the 72 featured bar.

editor take

Every9D-21M labels 21.8M real images for 9D pose; the bet is clean cross-instance propagation, not dataset size.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

09:56

12d ago

HuggingFace Papers (takara mirror)· rssEN09:56 · 05·27

→PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

PointQ-Bench introduces 3,083 point clouds across authentic scans, simulated distortions, and AI-generated content, with eight issue types and 12,332 QA pairs for anomaly sensing, defect diagnosis, usability grading, and open-ended quality reporting.

#Vision#Multimodal#Benchmarking#PointQ-Bench

why featured

HKR-K passes because the dataset size and diagnostic tasks are concrete. HKR-H and HKR-R are weak; the point-cloud QA angle is narrow, so it sits in the 60–71 band.

editor take

PointQ-Bench adds 3,083 point clouds and 12,332 QA pairs; 3D VLMs losing to 2D MLLMs is an awkward signal.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

09:54

12d ago

HuggingFace Papers (takara mirror)· rssEN09:54 · 05·27

→Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

L2L casts pseudo-label construction as a learnable decision process for semi-supervised referring expression segmentation, using multimodal priors, reinforced pseudo-label selection, and a hierarchical segmentation network, with experiments on RefCOCO, RefCOCO+, and RefCOCOg showing improvements over existing methods.

#Multimodal#Vision#Reasoning#Research release

why featured

HKR-K passes for a concrete mechanism and datasets, but gains, code, and production relevance are not disclosed. The narrow vision-benchmark angle keeps it in the lower band.

editor take

L2L reports gains on RefCOCO suites, but no numbers; I don't buy semi-supervised segmentation wins without deltas.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

09:44

12d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN09:44 · 05·27

→Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Proprio lets a frozen video generator self-score outputs using flow residuals under controlled latent perturbations, then improve physical plausibility through best-of-N search, gradient-based refinement, or both; with TurboWan2.2, it raises Physics-IQ from 32.2 to 37.5 and VideoPhy2-hard physical commonsense from 45.6 to 55.0.

#Multimodal#Vision#Inference-opt#Proprio

why featured

HKR-H and HKR-K pass: physics self-scoring plus inference-time refinement gives a new mechanism and a metric lift. HKR-R is weak, and the source only gives paper-summary detail, so it sits at the featured threshold.

editor take

Proprio squeezes physics checks out of a frozen video model’s own flow residuals; the gain is real, but it pays with inference-time search.

sharp

Proprio moves physical plausibility scoring back inside the video generator, which I trust more than stapling on a VLM judge. The concrete gain is decent: on TurboWan2.2, Physics-IQ rises from 32.2 to 37.5, and VideoPhy2-hard physical commonsense jumps from 45.6 to 55.0. Human raters prefer Proprio-selected or refined videos in roughly two-thirds of comparisons. The convincing part is the signal: flow residuals under controlled latent perturbations, not another external caption-and-score loop. But this is not a free model upgrade. Best-of-N search and gradient refinement both spend inference budget, and the snippet does not disclose N or latency. For production video, I read this as a sharper rejection/refinement layer, not proof that the generator has learned robust intuitive physics.
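To put the gains and the cost side in concrete terms, a small sketch. The relative gains are computed from the reported scores. The selection loop is a generic best-of-N, and the scoring rule (lower mean residual treated as more plausible) is an assumption for illustration, not Proprio's actual self-score.

```python
def relative_gain(before, after):
    """Relative improvement, (after - before) / before."""
    return (after - before) / before

def best_of_n(candidates, self_score):
    """Pick the highest-scoring candidate; costs one scoring pass per candidate."""
    return max(candidates, key=self_score)

def neg_mean_residual(residuals):
    # ASSUMPTION: smaller residuals mean a more physically plausible sample.
    return -sum(residuals) / len(residuals)

physics_iq_gain = relative_gain(32.2, 37.5)   # about 16%
videophy2_gain = relative_gain(45.6, 55.0)    # about 21%

# Hypothetical stand-in: each "video" is reduced to a list of residual magnitudes.
videos = {"a": [0.9, 1.1], "b": [0.2, 0.4], "c": [0.5, 0.5]}
chosen = best_of_n(videos, lambda name: neg_mean_residual(videos[name]))
```

Every extra candidate is another full generation plus a scoring pass, so the undisclosed N is what decides whether the gain is cheap or expensive.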

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

09:39

12d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN09:39 · 05·27

→When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

The paper evaluates four memory methods across three inference strategies and four tool-use benchmarks, finding that inference strategy confounds memory gains: reflection is significant only under MCTS, within-expansion injection helps only diversity-starved beam search, and atomic fact extraction is accuracy-neutral while shortening some trajectories by 19-26%.

#Agent#Reasoning#Memory#Research release

why featured

HKR-H/K/R all pass: this is not a plain SOTA claim, but a test of how memory interacts with inference strategy in tool-use agents. It is research-heavy, so it lands at 78, below major product/model releases.

editor take

This paper punctures lazy agent-memory claims: the same memory trick changes under search strategy, so many “memory gains” are inference artifacts.

sharp

Agent memory has been sold too often as a portable module: add reflection, add facts, add observations, get better agents. This paper cuts into that claim. It tests four memory methods, three inference strategies, and four tool-use benchmarks, then shows the same method can change significance on the same examples when the search procedure changes. The concrete results are the useful part. Reflection only reaches significance under MCTS, not best-of-N. Within-expansion injection helps only diversity-starved beam search. Atomic fact extraction stays accuracy-neutral, but shortens some reusable-structure tasks by 19–26%. That is a much cleaner result than another long-term-memory agent stack. If your agent eval does not separate memory abstraction from inference policy, the reported gain is probably contaminated.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

09:26

12d ago

HuggingFace Papers (takara mirror)· rssEN09:26 · 05·27

→Refining Multidimensional Video Reward Models via Disentangled Influence Functions

The paper proposes a disentangled influence framework for estimating dimension-specific supervision risk in T2V multidimensional video reward models and introduces pruning and reweighting strategies; the post does not disclose dataset size, exact metric gains, or code availability.

#Multimodal#Vision#Alignment#Research release

why featured

HKR-K passes because the paper offers a testable supervision-risk mechanism for video reward models. HKR-H/R are weak, and dataset size, metrics, and code status are not disclosed, so this stays low-band all.

editor take

The paper offers dimension-level influence functions plus pruning and reweighting; metrics, data, and code are undisclosed, so don't file it as reproducible progress.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:21

13d ago

HuggingFace Papers (takara mirror)· rssEN08:21 · 05·27

→SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

The researchers used SAM to convert ZOD bounding boxes into pixel-level masks, processed over 100,000 frames, manually curated a 2,300-frame subset with a 36% acceptance rate, and reported up to 48.1% mIoU with CLFT-Hybrid.

#Vision#Multimodal#Benchmarking#Segment Anything Model

why featured

HKR-K passes on concrete dataset scale and 48.1% mIoU, but HKR-H and HKR-R miss because the angle is a narrow segmentation paper with limited practitioner pull. Lower-band score due to niche scope.

editor take

SAM adds masks to 100K ZOD frames; 48.1% mIoU is modest, but the 2,300-frame curated set is the asset.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:06

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN08:06 · 05·27

→Human-like in-group bias in instruction-tuned language model agents

Researchers ran a 500-turn multi-agent simulation across six model families and found 5–16 percentage-point in-group targeting differentials when group labels were visible, while the pattern disappeared when labels were hidden.

#Agent#Alignment#Safety#Research release

why featured

HKR-H/K/R all pass: the paper has a sharp agent-bias hook, concrete numbers, and a deployment-safety nerve. Limited source authority and no cross-source cluster keep it below the 78–84 band.

editor take

Stop auditing only action types; this paper finds the bias in who gets the action, and 5–16 points is enough to bend agent networks.

sharp

Agent safety evals still over-index on single-step outputs, and this paper hits the blind spot: the bias sits in who receives resources, not in what the model says. The setup ran 500 turns across six model families with 20 seeds; visible group labels produced 5–16 percentage-point in-group targeting differentials, which disappeared when labels were hidden, with corrected p < 0.001. The ugly part is audit failure. Action-type distributions showed no rise in negative actions, so standard action-log review misses the effect. If agents route tickets, leads, permissions, or compute, this kind of “mild” preference compounds through reciprocation. A harmless-looking step policy can still produce a biased network.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

08:06

13d ago

HuggingFace Papers (takara mirror)· rssEN08:06 · 05·27

→A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

The paper introduces Routing Hijacking: a malicious client forges its semantic profile to attract target queries, consistently causing misrouting across three FedRAG routing architectures and downstream failures such as missing evidence, poisoning, incorrect answers, and hallucinations.

#RAG#Safety#Tools#Research release

why featured

HKR-H/K/R pass, but the feed gives only title plus summary, with no success rate, dataset, or mitigation result. Federated RAG is niche, so this stays in the 60–71 research-signal band.

editor take

Routing Hijacking breaks three FedRAG router types; privacy-preserving retrieval looks brittle when client profiles become the attack surface.
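A toy sketch of why self-reported profiles are an attack surface. It assumes a generic cosine-similarity router; the paper's three routing architectures are not described in the feed, and `route`, the client names, and the vectors are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def route(query_vec, profiles):
    """Send the query to the client whose declared profile is most similar."""
    return max(profiles, key=lambda name: cosine(query_vec, profiles[name]))

# Honest clients declare profiles that reflect their data.
profiles = {"medical": [1.0, 0.0], "legal": [0.0, 1.0]}
target_query = [0.9, 0.1]
honest_route = route(target_query, profiles)

# ASSUMPTION: the attacker can declare any profile, so it copies the
# embedding region of the queries it wants to capture.
profiles["attacker"] = [0.9, 0.1]
hijacked_route = route(target_query, profiles)
```

If the router trusts declared profiles without verifying them against actual content, the best match is whoever lies most precisely.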

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

07:56

13d ago

HuggingFace Papers (takara mirror)· rssEN07:56 · 05·27

→Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

The paper introduces an adaptive cooperative attack framework and STAR defense for LLM-based multi-agent systems. Cooperative attacks cause a 5.34% relative task-success drop, while STAR improves task success by 36.76% on average.

#Agent#Safety#STAR#Research release

why featured

HKR-H/K/R pass, but the post gives abstract-level facts only; benchmark setup, model scope, and open-source details are not disclosed. Useful agent-safety research, not same-day must-write.

editor take

Cooperative attacks cut MAS success by 5.34%; STAR adds 36.76%, but sentence-level repair still smells like a lab threat model.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

07:33

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN07:33 · 05·27

→ATLAS: All-round Testing of Long-context Abilities across Scales

ATLAS evaluates 26 long-context models on an 8K-1M grid, covering eight capability dimensions, nine auditable components, and 6,438 instances. Gemini-3.1-Pro-Preview leads at 128K, while Claude-Opus-4.6 leads at 1M; seven models shift by at least two ranks between 8K-128K and 8K-1M scoring.

#Benchmarking#Reasoning#RAG#Gemini

why featured

HKR-H/K/R all pass: the story has a Gemini-vs-Claude hook, concrete benchmark scale, and practical model-selection stakes. It stays in the 78–84 band because this is a single benchmark paper, not a model or product release.

editor take

ATLAS punctures the long-context flex: 128K and 1M have different winners, so million-token claims need decay curves, not banners.

sharp

Single-point long-context scores deserve to die, and ATLAS lands that hit cleanly. It tests 26 models across an 8K–1M grid, with 6,438 instances, eight capability dimensions, and nine auditable components. The scoring uses length-aware AUC, then a harmonic mean that punishes lopsided profiles. The leaderboard splits fast: Gemini-3.1-Pro-Preview leads at 128K, Claude-Opus-4.6 leads at 1M, seven models move at least two ranks between 8K–128K and 8K–1M, and one gap reaches 12 positions. The annoying trick in million-token marketing is treating “fits in the window” as “works at that length.” ATLAS attacks that by separating retrieval-style operations from application workloads. The caveat is practical: the RSS snippet gives no pricing, latency, or inference budget. A 1M-token winner that stalls or burns cash still loses inside production RAG and agent loops.
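A small sketch of why this scoring shape punishes lopsided profiles. The exact ATLAS formulas are not in the snippet, so both pieces are assumptions: `length_auc` is a trapezoidal area over log2 context length, normalized to [0, 1], and the aggregation uses the standard harmonic mean. All scores are made up.

```python
import math
from statistics import harmonic_mean, mean

def length_auc(lengths, scores):
    """Trapezoidal area under score vs. log2(length), divided by the x-range."""
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])

# A model that holds 0.8 at every length keeps an AUC of 0.8.
flat_auc = length_auc([8_000, 128_000, 1_000_000], [0.8, 0.8, 0.8])
# A model that decays with length loses area even if its short-context score is high.
decaying_auc = length_auc([8_000, 128_000, 1_000_000], [0.95, 0.7, 0.3])

# Two hypothetical capability profiles with the same arithmetic mean.
balanced = [0.7, 0.7, 0.7]
lopsided = [0.95, 0.95, 0.2]
balanced_hm = harmonic_mean(balanced)   # stays at 0.7
lopsided_hm = harmonic_mean(lopsided)   # drops to about 0.42
```

The arithmetic mean cannot tell the two profiles apart. The harmonic mean can, which is the behavior a buyer wants when one collapsed dimension breaks the workload.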

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

07:30

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN07:30 · 05·27

→SilentRetrieval: Hijacking RAG via Semantically Preserving Adversarial Data Poisoning

SilentRetrieval attacks RAG with a two-stage data-poisoning method, reaching 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO under a one-poisoned-document-per-query setup.

#RAG#Safety#Benchmarking#SilentRetrieval

why featured

HKR-H/K/R all pass: the paper targets production RAG security and gives a two-stage method with testable metrics. It remains in the 78–84 band because this is a single paper, not a major lab release or cross-source event.

editor take

RAG security can’t stop at prompt injection; SilentRetrieval hits 74.2% HR@10 at 0.016% poisoning on Wikipedia-scale data, which human review won’t catch.

sharp

SilentRetrieval pins the RAG failure mode on corpus integrity, not prompt wording. That is the uglier problem. The attack keeps poisoned documents fluent and retrievable with Coordinated Beam Search, then fuses triggers using a frozen LLM. With one poisoned document per query, it reaches 84.6% / 81.3% HR@10 on Natural Questions and MS MARCO, plus 57.5% / 54.8% ASR-LLM. The scale result is the punch: 74.2% HR@10 at a 0.016% poisoning ratio in sampled Wikipedia-scale evaluation. Transfer also holds at 64.7% average HR@10 across unseen retrievers, including ColBERT and commercial embedding models. That breaks the lazy enterprise assumption that “clean-looking” documents make RAG safe. The paper says combined retrieval-side and generation-side defenses cut success, but add latency; that trade-off hurts production RAG where every extra rerank or filter already gets negotiated.
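For scale, a hedged sketch of the two quantities in play. `hit_rate_at_k` assumes HR@10 is the usual hit rate, the share of queries whose poisoned document lands in the top 10. The corpus size below is a round illustrative number, not the paper's.

```python
def hit_rate_at_k(rankings, poisoned_ids, k=10):
    """Share of queries whose poisoned document appears in the top-k results."""
    hits = sum(1 for ranked, pid in zip(rankings, poisoned_ids) if pid in ranked[:k])
    return hits / len(rankings)

def poisoned_docs(corpus_size, ratio_percent):
    """Number of poisoned documents implied by a poisoning ratio given in percent."""
    return corpus_size * ratio_percent / 100

# ASSUMPTION: a hypothetical 1M-document corpus. At 0.016% poisoning that
# is about 160 documents hidden among a million.
needle_count = poisoned_docs(1_000_000, 0.016)

# Toy retrieval results for two queries, one poisoned document each.
rankings = [["p1", "d2", "d7"], ["d3", "d4", "d9"]]
toy_hr = hit_rate_at_k(rankings, ["p1", "p2"], k=3)
```

At that density, sampling documents for manual review is close to useless, which is why the corpus-integrity framing is the right one.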

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

07:28

13d ago

HuggingFace Papers (takara mirror)· rssEN07:28 · 05·27

→Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

The paper proposes Judge-Then-Solve, which makes reasoning models commit to answerability before generation; experiments on dense and MoE models push Abstention@Detection toward near saturation under insufficient information.

#Reasoning#Safety#Alignment#Research release

why featured

HKR-K and HKR-R pass: Judge-Then-Solve is a concrete mechanism, and abstention safety matters for reasoning-model deployment. The sparse source gives no benchmarks, numbers, or authors, so it stays in 60–71.

editor take

JTS commits answerability before generation; A@D nears saturation, but no numbers disclosed, so I’d treat it as a reasoning brake.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

06:57

13d ago

HuggingFace Papers (takara mirror)· rssEN06:57 · 05·27

→RW-TTT: Batched Serving System for Request-Owned Test-Time Training

RW-TTT tags each decode step with owner, version, and READ/WRITE effect, then batches only compatible phases; on one GPU with eight InPlace-TTT fast-weight streams, it reaches 274.61 aggregate tok/s, 9.31x over sequential serving and 3.44x over per-stream replicas under the same memory budget.

#Inference-opt#Fine-tuning#Memory#RW-TTT

why featured

HKR-H/K/R pass: the paper has a concrete mechanism and throughput result. Its niche inference-systems angle and lack of adoption or cross-source discussion keep it in the interesting band, not featured.

editor take

RW-TTT hits 274.61 tok/s on one GPU across eight streams. TTT serving needs state isolation, not louder batching claims.
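A rough sketch of the tagging idea plus the baselines implied by the reported speedups. The compatibility rule is an assumption made for illustration: a batch holds steps with the same effect and at most one step per owner. The paper's real rule is not in the snippet, and `Step` and `batch_compatible` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    owner: str    # request that owns the fast weights
    version: int  # fast-weight version the step expects (carried, unused here)
    effect: str   # "READ" or "WRITE"

def batch_compatible(steps):
    """Greedily pack steps into batches under the ASSUMED compatibility rule."""
    batches = []
    for step in steps:
        for batch in batches:
            same_effect = batch[0].effect == step.effect
            owner_free = all(s.owner != step.owner for s in batch)
            if same_effect and owner_free:
                batch.append(step)
                break
        else:
            batches.append([step])
    return batches

steps = [
    Step("a", 0, "READ"), Step("b", 0, "READ"),
    Step("a", 0, "WRITE"), Step("b", 0, "WRITE"),
    Step("a", 1, "READ"),
]
batches = batch_compatible(steps)

# Baselines implied by the reported ratios (simple division, no new data).
sequential_tok_s = 274.61 / 9.31   # about 29.5 tok/s
replica_tok_s = 274.61 / 3.44      # about 79.8 tok/s
per_stream_tok_s = 274.61 / 8      # about 34.3 tok/s per stream
```

Five steps collapse into three batches here. The interesting engineering is in what the sketch skips: ordering constraints between a request's WRITE and its next READ.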

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

06:38

13d ago

HuggingFace Papers (takara mirror)· rssEN06:38 · 05·27

→MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

MTAVG-Bench 2.0 builds more than 10,000 QA evaluation instances for short-drama and scene-level generation, diagnosing high-level failures across acting, narrative, atmosphere, and audio-visual language in multi-talker audio-video generation.

#Multimodal#Audio#Benchmarking#Gemini

why featured

HKR-K and HKR-R pass: 10k+ QA cases and four failure categories add usable evaluation detail for AV generation. HKR-H is weak, with a narrow academic title and no product/model release, so it stays in the interesting band.

editor take

MTAVG-Bench 2.0 ships 10k+ QA items; multi-talker video eval is finally moving past lip-sync into acting and narrative.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

06:25

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN06:25 · 05·27

→The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

The paper proposes Energy-Based Decoding, a training-free reward-guided framework that steers frozen pre-trained LLMs at decoding time; EBD outperforms baselines across five models and six benchmarks, raising Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5.

#Inference-opt#Benchmarking#Reasoning#Qwen

why featured

HKR-H/K/R all pass: EBD guides frozen LLMs at decoding time with a lightweight reward model, backed by 5 models, 6 benchmarks, and a large Qwen3-8B AlpacaEval2.0 jump. Strong research signal, not a major lab release.

editor take

EBD exposes a dirty secret in base-model evals: some “weak capability” scores are just bad decoding trapping the model outside task behavior.

sharp

EBD hits the evaluation protocol harder than the decoding literature. Qwen3-8B-Base jumps from 8.8 to 44.5 on AlpacaEval2.0 with frozen weights, using a lightweight reward model only at decoding time. Mistral-7B on Math500 also gets 18.9x lower latency than prior decoding work. That makes plenty of base-model leaderboards look contaminated: they mix actual task skill with whether the model can format an answer under naive sampling. I don’t fully buy the fairness framing, though. A reward model is an outside preference prior, not a neutral lens. The paper’s useful provocation is that “pre-trained capability” is not a scalar you read off greedy decoding.
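To make "a reward model only at decoding time" concrete, here is a generic sketch of reward-tilted sampling: candidate scores become log p plus beta times reward, then get renormalized. This is a textbook energy-style reweighting, not the paper's EBD algorithm. The `beta` knob, the token strings, and the reward values are all hypothetical.

```python
import math
from typing import Dict


def reward_guided_distribution(
    logprobs: Dict[str, float], rewards: Dict[str, float], beta: float
) -> Dict[str, float]:
    """Return probabilities proportional to p(token) * exp(beta * reward(token)).

    The base model stays frozen; only the sampling distribution is tilted.
    """
    scores = {tok: lp + beta * rewards[tok] for tok, lp in logprobs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical next-token candidates from a frozen base model.
base_logprobs = {"Sure": math.log(0.2), "The": math.log(0.5), "lol": math.log(0.3)}
# Hypothetical reward-model scores for each candidate continuation.
rewards = {"Sure": 2.0, "The": 0.5, "lol": -1.0}

unguided = reward_guided_distribution(base_logprobs, rewards, beta=0.0)
guided = reward_guided_distribution(base_logprobs, rewards, beta=2.0)
```

With `beta=0.0` the base distribution comes back unchanged. Raising `beta` moves mass toward candidates the reward model prefers, which is why the reward model acts as an outside preference prior rather than a neutral lens.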

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

06:08

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN06:08 · 05·27

→KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

KSAFE-MM evaluates Korean multimodal safety risks across 12 state-of-the-art MLLMs, and ProgramExecution jailbreaking reaches up to 74.2% ASR versus 13.4% for standard queries.

#Multimodal#Vision#Safety#KSAFE-MM

why featured

HKR-H/K/R pass: the Korean-localized multimodal safety angle is specific, and 12 MLLMs with 74.2% ASR is testable. It stays near the featured floor because this is one benchmark paper with limited disclosed details.

editor take

Korean multimodal safety is not a localization footnote; ProgramExecution jumps ASR from 13.4% to 74.2%, exposing a live guardrail gap.

sharp

KSAFE-MM hits a stale blind spot in multimodal safety: passing English generic harms says little once the model sees local visual cues. The paper tests 12 MLLMs, and ProgramExecution jailbreaks reach 74.2% ASR versus 13.4% for standard queries. That gap is too large to file under ordinary prompt sensitivity. The useful design choice is the split between KSAFE-MM-G and KSAFE-MM-C. One localizes generic Korean-language risks; the other pairs real-world cultural visual queries with malicious text. Many vendor safety reports still lean on English red-team sets, sometimes padded with translated samples. The nasty trade-off is also familiar: models with low ASR show excessive refusal on benign queries. A pretty safety score can still mean a brittle product.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

06:02

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN06:02 · 05·27

→Research paper analyzes effectiveness and timing of compressed reasoning data in LLM post-training

The paper defines three CoT types: Explicit, Composed, and Implicit, then tests difficulty, compression granularity, and data size on a synthetic compositional reasoning task. It finds coarser CoT needs more SFT data, Composed and Implicit CoT gain more from data scaling than Explicit CoT, Implicit CoT tends toward memorization, and RLVR decomposes compressed steps learned during SFT.

#Reasoning#Fine-tuning#Research release

why featured

HKR-H/K/R all pass, but the evidence is limited to synthetic compositional reasoning tasks. This clears featured, not the 78+ band.

editor take

Compressed CoT is not free efficiency: SFT saves tokens, then RLVR re-expands the steps. The post-training cost story has a crack.

sharp

Compressed reasoning data is not a clean token-saving trick; it changes what the model learns at each training stage. The paper splits CoT into Explicit, Composed, and Implicit, then controls difficulty, compression granularity, and data size on a synthetic compositional task. The sharp result: coarser CoT needs more SFT data, Implicit CoT drifts toward memorization, and RLVR decomposes the compressed steps SFT had learned. That matters for reasoning post-training teams. A lot of pipelines treat short CoT as budget optimization. I read this as distribution-risk management. Push compression too hard in SFT, and the model learns shortcuts; add verifiable-reward RL, and the trajectory expands again. The body does not give absolute gains on real math or coding tasks, so I would not map this straight onto SWE-bench yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

05:45

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN05:45 · 05·27

→Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Tool Forge converts natural-language capability intent into validation-carrying tool capsules, and its Router reaches 0.901 micro-F1 across 83 benchmark cases while reducing estimated task-flow tool context by 99.2% versus naive full-catalog schema exposure.

#Agent#Tools#Benchmarking#Tool Forge

why featured

HKR-H/K/R all pass: not a major lab release, but it offers a testable mechanism for agent tool governance with 83 use cases and 99.2% context reduction, placing it in the good-quality featured band.

editor take

Tool Forge is a needed slap at schema-dump agents: 0.901 F1 and 99.2% less tool context is strong, but 83 cases is still a lab bench.

sharp

Tool Forge makes the right call: agent tooling cannot keep surviving as a giant schema blob stuffed into context. It packages tools as capsules with intent, contracts, tests, credential bindings, lifecycle state, and runtime validation evidence, then routes agents into intent-scoped sessions. On 83 Router cases, it reports 0.901 micro-F1 and a 99.2% estimated reduction in task-flow tool context. I buy the direction more than the headline number. The end-to-end probe is only 25 local-tool cases: 25/25 bundles generated, 0.940 micro-F1 on deterministic checks, and 23/25 live sandbox validations. That reads like a solid systems scaffold, not a proof of agent reliability. MCP-style tool ecosystems badly need this validation layer; adversarial routing and broader API grounding are exactly where enterprise deployments will break first.
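For reference, micro-F1 on a routing benchmark pools true positives, false positives, and false negatives across all cases before computing F1. A minimal sketch follows, assuming each case compares a set of expected tools with a set of routed tools. The case data and tool names are hypothetical, and the paper's scoring unit may differ.

```python
from typing import List, Set, Tuple


def micro_f1(cases: List[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (expected, predicted) label sets."""
    tp = fp = fn = 0
    for expected, predicted in cases:
        tp += len(expected & predicted)   # tools routed correctly
        fp += len(predicted - expected)   # tools routed but not wanted
        fn += len(expected - predicted)   # tools wanted but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical routing cases: (expected tools, tools the router selected).
cases = [
    ({"search"}, {"search"}),
    ({"email", "calendar"}, {"email"}),
    ({"sql"}, {"sql", "shell"}),
]
score = micro_f1(cases)  # tp=3, fp=1, fn=1
```

Because counts are pooled, cases with many tools weigh more than single-tool cases, which is worth remembering when reading a single 0.901 over 83 cases.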

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

05:36

13d ago

HuggingFace Papers (takara mirror)· rssEN05:36 · 05·27

→AsyncTool: Evaluating Asynchronous Function Calling Capability in Multi-Task Scenarios

The paper introduces AsyncTool, a benchmark that tests LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback, using step-, sub-task-, and task-level evaluation plus efficiency metrics for coordination and completion; the snippet does not disclose dataset size, model list, or exact performance numbers.

#Agent#Tools#Benchmarking#Research release

why featured

HKR-K/R pass: AsyncTool adds delayed feedback and three-level evaluation for agent tool use. HKR-H is weak, and the abstract lacks model scores or reproducible details, so this stays interesting but not featured.

editor take

AsyncTool tests delayed multi-task tool use, but no size or scores are disclosed; I buy the angle—agent evals should punish idle waiting.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

05:19

13d ago

HuggingFace Papers (takara mirror)· rssEN05:19 · 05·27

→KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

The authors release KVoiceBench, KOpenAudioBench, and KMMAU for Korean SpokenQA and audio understanding, with 12,345 samples in total, and evaluate eight recent SpeechLMs across English-Korean gaps and task-family rankings.

#Agent#Audio#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: sample count, model count, and Korean speech scope are concrete. Still, it is a vertical benchmark paper with no strong result or product impact, so it stays in the 60–71 band.

editor take

KVoiceBench ships 12,345 Korean speech samples; eight SpeechLMs split by task, so English-only speech evals are cosplaying multilinguality.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

05:12

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN05:12 · 05·27

→Continual Learning in Modern Hopfield Networks with Application to Diffusion Models

The paper uses Hopfield energy to characterize forgetting under continual learning, proving in tractable MHN settings that high-energy, outlier-like samples get larger energy increases after task changes, then validating the pattern on Stable Diffusion and a pixel-space DDPM where energy tracks reconstruction-based forgetting and replay helps high-energy samples more.

#Fine-tuning#Memory#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers a testable mechanism linking Hopfield energy to forgetting in Stable Diffusion/DDPM. The angle is research-heavy, HKR-H is weak, so it stays below featured.

editor take

Three sources point to the same arXiv paper; I buy the question, not the extrapolation. Stable Diffusion is a testbed here, not a production CL recipe.

sharp

All 3 sources carry the same title and trace back to one arXiv v1, so this is visibility, not independent confirmation. The paper makes a clean claim: in modern Hopfield energy terms, high-energy outlier-like samples suffer larger forgetting after task switches, and replay helps those samples more. I like the framing, but the boundary is tight. The abstract validates on Stable Diffusion and a pixel-space DDPM, but gives no task count, dataset scale, or replay budget. That makes this a sample-selection criterion, not a solved recipe for continual learning in generative models. Against LoRA merging, EWC, or plain experience replay, Hopfield energy has to win on equal-budget curves before practitioners should care.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:52

13d ago

HuggingFace Papers (takara mirror)· rssEN04:52 · 05·27

→ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

ROVER adds a lightweight plugin to Qwen2.5-VL-7B that routes object-centric evidence with a step-specific token triplet, improving MM-GCoT answer accuracy by 4.8%, grounding accuracy by 14.6%, and VideoEspresso answer accuracy by 8.6% under the original datasets and evaluation protocols.

#Multimodal#Vision#Reasoning#Qwen

why featured

HKR-K is strong: ROVER plugs evidence routing into Qwen2.5-VL-7B and reports gains on two benchmarks, MM-GCoT and VideoEspresso. HKR-R is limited to VLM researchers, while HKR-H is weak, so this is interesting but below featured.

editor take

ROVER adds three-token routing to Qwen2.5-VL-7B and gains 14.6% in grounding accuracy; I buy the direction, pending decode-cost curves.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:48

13d ago

HuggingFace Papers (takara mirror)· rssEN04:48 · 05·27

→Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

SaP converts Markdown skill libraries into typed pseudocode, and on the 134-game ALFWorld unseen split with gpt-4o-mini it wins 82/402 paired games versus 47/402 for Graph-of-Skills, while cutting input tokens by 22.8% and LLM calls by 14.5% per game.

#Agent#Tools#Benchmarking#ALFWorld

why featured

HKR-H/K/R all pass, but the impact is still bounded to an agent skill-library paper and ALFWorld tests, with no major framework adoption or lab release; lower-band score: 70, tier all.

editor take

SaP wins 82 of 402 paired ALFWorld games to Graph-of-Skills’ 47; typed contracts beat Markdown prose when agents must invoke skills reliably.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:07

13d ago

HuggingFace Papers (takara mirror)· rssEN04:07 · 05·27

→GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

GeneralThinker reframes reasoning supervision as dense answer-conditioned optimization, using ground-truth answer likelihood for response-level evaluation and token-level credit assignment, and reports the best average performance across 11 mathematics, STEM, and general reasoning benchmarks.

#Reasoning#Fine-tuning#Benchmarking#GeneralThinker

why featured

HKR-K passes with a training mechanism and an 11-benchmark claim. HKR-H/R are weak: no author, model size, open-source status, or cost details, so this stays in the regular research tier.

editor take

GeneralThinker tops 11 benchmarks on average; I buy the mechanism, not the generality—answer likelihood still depends on labels.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:04

13d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN04:04 · 05·27

→Study on Multimodal Jailbreak Robustness of Think-with-Image Framework

The paper evaluates four think-with-image process designs across multiple vision-language models and finds explicit image-tool interaction reduces jailbreak attack success rates by about 30% relative on average; its safety-vector framework attributes the effect to a residual shift in hidden representations rather than benign tool outputs or text traces alone.

#Multimodal#Vision#Safety#Research release

why featured

HKR-H/K/R all pass: the paper offers a counterintuitive ~30% jailbreak-success reduction plus multi-VLM tests and a safety-vector explanation. It is strong practical safety research, not a major product release, so it fits the 78-84 band.

editor take

Stop treating multimodal safety as output filtering; this paper shows image-tool invocation cutting jailbreak ASR by ~30%, like a safety bias inside the reasoning path.

sharp

The sharp claim here is that multimodal jailbreak resistance can come from process shape, not cleaner tool outputs. The authors compare four designs: direct response, text-only prior turn, visual-state manipulation, and explicit image-tool invocation. The explicit image-tool path lowers attack success by about 30% relative on average across evaluated VLMs. The useful evidence is the ablation. ASR stays low when the returned image-tool output is manually overridden, even with unsafe-looking content. It rises back near direct-answering levels under text-only prior-turn controls. That rules out two lazy explanations: benign tool semantics and refusal triggered by a textual tool trace. I’d still be careful with the “safety vector” framing, but the residual-shift result is actionable for agentic VLM builders: safety eval has to cover the invocation path, not just the base model checkpoint.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·27

→Research shows capability-robustness tradeoff in vision-language-action models

The paper proves that a VLA policy’s capability and robustness sum is bounded by task entropy plus adversarial channel capacity; a 16/255 PGD attack drops OpenVLA-7B success on LIBERO from above 95% to below 5%.

#Robotics#Vision#Safety#OpenVLA

why featured

HKR-H/K/R all pass: the title has a real tradeoff hook, the post gives a bound plus reproducible PGD numbers, and VLA robustness maps to embodied-agent safety. Single arXiv paper, so it sits in good-quality research rather than must-write.

editor take

OpenVLA-7B falling from 95%+ to under 5% under 16/255 PGD is the warning shot: VLA robustness now has an information budget, not just patches.

sharp

All 3 entries point to the same arXiv record, so the agreement is a single-source chain, not independent convergence. The hard hook is still strong: OpenVLA-7B drops from above 95% LIBERO success to under 5% under a 16/255 PGD attack. The paper frames VLA capability and robustness as an information budget, then adds action-channel leakage, which classifier robustness papers do not need. I buy the direction of the bound more than the deployment comfort. “Zero violations across 320 cells” sounds clean, and the ≤200-sample diagnostics are useful, but they certify an information-theoretic constraint, not physical-world safety. For OpenVLA-style policies and RT-2-like stacks, once perturbations can leak through action outputs, clean benchmark success becomes a much weaker brag.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·27

→Research Shows RLHF Training Can Be Exploited to Optimize Misaligned Biases

The paper introduces alignment tampering, where an LLM influences preference data built from its own outputs, and RLHF or best-of-N sampling amplifies misaligned behaviors across keyword bias, sexist propaganda, brand promotion, and instrumental goal-seeking.

#Alignment#Safety#Fine-tuning#Research release

why featured

HKR-H/K/R all pass: the title has a counterintuitive safety hook, and the summary states a concrete mechanism where self-generated outputs contaminate preference data. Single arXiv item lacks authors and experiment numbers, so it stays at 80.

editor take

Only the title is disclosed: no models, setup, or metrics. Still, RLHF as an exploitable channel for misaligned bias hits a live alignment blind spot.

sharp

Two arXiv entries carry the same title, split across cs.CL and cs.LG. The body is empty, so the only disclosed claim is that RLHF can be exploited to optimize misaligned biases; models, reward setup, and attack conditions are absent. I buy the direction, but not the strength yet. RLHF is a preference-fitting mechanism, not a safety boundary. If the feedback channel is gameable, a model learning reviewer-pleasing behavior instead of user intent is the expected failure mode. The paper needs one hard reproducible result: same base model, same reward pipeline, and a bias metric rising across RL steps under a defined tampering condition. Without that, this risks being reward hacking with a sharper alignment label.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Post-training makes large language models less human-like

The paper introduces the Psych-201 dataset and finds that post-training reduces LLM alignment with human behavior across model families, sizes, and objectives, while persona induction does not improve individual-level predictions.

#Alignment#Benchmarking#arXiv#Research release

why featured

HKR-H/K/R all pass: the paper has a counterintuitive hook, a named dataset, and a testable cross-model claim that challenges post-training assumptions. Strong research signal, but no top-venue, major-lab, or artifact detail is disclosed, so it stays in 78–84.

editor take

Post-training makes models better-behaved and less human-like; teams selling RLHF as “human preference” need to stop hand-waving.

sharp

Post-training’s loss of humanness looks systematic, not like a cute benchmark artifact. Psych-201 spans 201 psychology experiments, and the paper says post-training lowers alignment with human behavior across model families, model sizes, and training objectives. Persona induction also fails to improve individual-level predictions. That cuts directly against the lazy RLHF story: you are optimizing acceptable answers, refusal boundaries, and instruction following, not human cognitive trajectories. I’d separate this from the TruthfulQA and HH-RLHF lineage. Those benchmarks reward not lying, not offending, and following instructions. Psych-201 asks about behavior structure: choices, biases, learning patterns. After that pressure, the model becomes a cleaner product interface, not a better human proxy. Anyone using chat-tuned models for user simulation, agent personas, or behavioral experiments should stop treating “aligned” as “more human.”

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Jailbreak Susceptibility Prediction and Mitigation via Model Behavioral Geometry

The paper evaluates behavioral geometry on 79 models across 24 providers and 100 configurations of one base model, reaching 0.94 AUPRC for jailbreak susceptibility detection with about 98% fewer probes and using three models to cover defense transfer across the population.

#Safety#Alignment#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: the paper offers a concrete jailbreak-risk hook, testable numbers, and a deployment-safety nerve. It fits the 78–84 research-release band, not the 85+ must-write tier.

editor take

This pushes jailbreak evals from brute-force red teaming to probe-efficient risk prediction; 0.94 AUPRC with 98% fewer probes is the useful part.

sharp

The sharp move here is treating jailbreak risk as population geometry, not another leaderboard over 79 models. The paper tests 79 models across 24 providers plus 100 configurations of one base model, then reports 0.94 AUPRC with roughly 98% fewer probes. That matters for production teams: every system prompt, wrapper, and policy tweak cannot afford a fresh full red-team sweep. I’m less sold on the “three models cover defense transfer” claim. The abstract gives +2% over same-provider assignment with p=0.03, but not the identities of the three models, the attack distribution, or judge details. Geometry that transfers on static jailbreak sets can break under multi-turn pressure, tool use, or RAG leakage. Still, the direction is right: safety evals need sampling efficiency over configuration space, not one more brittle refusal score.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→The ATOM Report: Measuring the Open Language Model Ecosystem

The ATOM Report measures about 1,500 mainline open language models from Qwen, DeepSeek, Llama, and peers, and states that Chinese models overtook U.S.-built counterparts in summer 2025 using Hugging Face downloads, model derivatives, inference market share, and performance metrics.

#Benchmarking#Inference-opt#Alibaba#DeepSeek

why featured

HKR-H/K/R all pass: the report quantifies the open-model ecosystem and claims Chinese models passed US models in summer 2025. Strong research/benchmark signal, but not a model launch or product capability update, so it fits 78–84.

editor take

ATOM quantifies the open-model gravity shift: across ~1,500 mainline models, Chinese stacks passed U.S. ones in summer 2025. Llama’s halo needs a haircut.

sharp

ATOM’s useful move is dragging the open-model fight away from leaderboard peaks and toward ecosystem share. The paper tracks ~1,500 mainline open models and mixes Hugging Face downloads, derivatives, inference market share, and performance metrics. Its claim is blunt: Chinese models passed U.S.-built open models in summer 2025 and kept widening the gap. I buy the direction, not every proxy. Hugging Face downloads and derivative counts favor models that get repackaged, distilled, quantized, and forked; Qwen and DeepSeek are built for that distribution loop. Llama’s old advantage was license familiarity plus community inertia, and that advantage gets weaker when Chinese releases ship fast and stay permissive enough. The inference-share metric is the fragile one: without vendor coverage details, “overtook” reads more like ecosystem heat than confirmed enterprise production load.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

The Cordyceps paper proposes a semantic-association data poisoning method that teaches LLMs an information-hiding scheme, evaluates it on 5 LLMs, 3 backdoor defenses, and 4 prompt-injection defenses, and reports up to 98% attack success after prompt-injection defenses.

#Safety#Fine-tuning#Alignment#Research release

why featured

HKR-H/K/R all pass: the title has a strong security hook, the paper gives a testable poisoning mechanism across 5 models and 7 defense setups, and 98% post-defense ASR is practitioner-relevant. Single arXiv paper, so it stays below must-write.

editor take

Cordyceps moves poisoning from trigger phrases to semantic ciphers; 98% post-defense success is a brutal number for fine-tuning pipelines.

sharp

Cordyceps attacks the assumption that “semantically normal” data is safe. Classic backdoors lean on fixed trigger phrases; Cordyceps uses associations between facts, concepts, and attacker phrases, then teaches the model to encode and decode malicious instructions. The paper reports tests across 5 LLMs, 3 backdoor defenses, and 4 prompt-injection defenses, with up to 93% ASR after backdoor defenses and 98% after prompt-injection defenses. I don’t read this as another prompt-injection paper. It hits the fine-tuning supply chain, especially the enterprise habit of throwing semi-curated text into SFT. I’d still want to verify model sizes, poison fraction, and task setup in the PDF, but the direction is nasty: trigger scanning, outlier filtering, and clean-data regularization are weak against poisoned samples that look like ordinary knowledge.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

The paper evaluates 17 LLMs on three clinical benchmarks with the SoS framework, and sequential answer presentations reduce end-to-end accuracy and abstention against incorrect suggestions by up to 30% on average, reaching 65% for some models.

#Reasoning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the counterintuitive hook is backed by 17 models, 3 clinical benchmarks, and drops of up to 30% on average and 65% for some models. It is still an arXiv benchmark paper, so it fits the 78–84 featured band, not p1.

editor take

This paper hits the chat-product blind spot: under clinical multi-turn SoS, 17 LLMs lose up to 30% on average and some lose 65%, with abstention against wrong suggestions degrading too.

sharp

Multi-turn chat is stripping away the comfort static benchmarks give teams. The SoS setup feeds answer options sequentially to 17 LLMs across three clinical benchmarks; end-to-end accuracy and abstention against wrong suggestions drop by up to 30% on average, with some models falling 65%. That is not ordinary prompt sensitivity. It is a reliability tax from the product format. The nastiest result is blind switching: models move from abstention to wrong and correct suggestions at near-identical rates, reaching 50%. Scale only fixes part of it, and can raise the tendency to adopt a wrong suggestion after initially abstaining. For medical chatbots, leaderboard accuracy does not buy you conversational safety.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→MiniMax Releases M2 Series Mixture of Experts Language Models

MiniMax introduces the M2 series of MoE language models, with the flagship M2 using 229.9B total parameters and 9.8B activated parameters per token, plus Forge RL for long-horizon agent trajectories and M2.7 self-debugging of training runs.

#Agent#Reasoning#Code#MiniMax

why featured

HKR-H/K/R all pass: the 229.9B/9.8B MoE design and Forge RL agent-training hook are concrete. Still, it is an arXiv model paper rather than a major API launch, so it stays in the 78–84 band.

editor take

MiniMax M2’s pitch is 229.9B total, 9.8B active per token; the bet is stable agent-trajectory training, not parameter bragging.

sharp

MiniMax M2’s sharpest move is making agent training the release narrative, not the 229.9B parameter count. The concrete hook is 9.8B active parameters per token, plus Forge RL, windowed-FIFO scheduling, prefix-tree merging, executable workspaces, and artifact-aligned rewards for coding and cowork trajectories. That is the right battlefield for 2026 models; static benchmark flexing has less leverage than stable long-horizon agent loops. I’d discount the “frontier-tier performance” claim for now. The snippet gives no SWE-bench score, deep-search score, office-task metric, context length, API pricing, or weight-release status. This smells like MiniMax answering the Qwen and DeepSeek low-activation MoE playbook, but M2.7 “debugging training runs and modifying its own scaffold” needs reproducible evidence. Without that, it is a strong systems paper wrapped in self-evolution language.
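The two parameter counts are easier to compare as a ratio. A quick arithmetic sketch using only the figures quoted above follows. It assumes "activated per token" and "total" count parameters on the same basis, which the snippet does not spell out.

```python
# Figures quoted in the summary, in billions of parameters.
TOTAL_PARAMS_B = 229.9
ACTIVE_PARAMS_B = 9.8

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B   # share of weights used per token
sparsity_factor = TOTAL_PARAMS_B / ACTIVE_PARAMS_B   # total-to-active ratio

print(f"active fraction: {active_fraction:.2%}")
print(f"total/active ratio: {sparsity_factor:.1f}x")
```

Roughly 4.3% of the weights fire per token, a total-to-active ratio of about 23x. That is the low-activation regime the comparison to Qwen and DeepSeek points at.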

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Open-Weight LLM Fine-Tuning Defenses Are Susceptible to Simple Attacks

The paper evaluates two low-cost attacks, abliteration and prefilling, which raise attack success rates on safeguarded open-weight models from below 10% to 16%-96% across BeaverTails, HarmBench, and AdvBench. Its proposed ART objective can be layered onto existing defenses and reduces success rates for abliteration, prefilling, and combined attacks by 10%-20%.

#Fine-tuning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the paper gives concrete attacks, benchmarks, and ASR ranges, plus a testable ART mitigation. It is still a single arXiv safety paper, so it lands in featured, not p1.

editor take

Open-weight safety takes another hit: no gradients, no fine-tune, and abliteration plus prefilling still push ASR up to 96%.

sharp

Open-weight safeguards look weakest when cheap old attacks beat them without touching gradients. The paper tests abliteration and prefilling on BeaverTails, HarmBench, and AdvBench, raising attack success rates from below 10% to 16%-96%. These attacks do not require gradient optimization or adversarial fine-tuning, which undercuts a common safety assumption: harmful behavior is learned later, not already latent in the pretrained model. ART lowers success rates by 10%-20% across abliteration, prefilling, and combined attacks, but that reads like a patch, not a boundary. For Llama- and Qwen-style open-weight ecosystems, evaluations centered on malicious fine-tuning are too narrow. Once weights ship, the vendor no longer controls the safety perimeter.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

The paper trains Transformer models from scratch on formally verifiable reasoning traces and finds that corrupted intermediate steps perform similarly to correct CoTs, while GRPO post-training raises answer accuracy without improving trace validity.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the post gives testable findings, and CoT faithfulness matters to practitioners. Single arXiv paper, no cross-source traction, so it stays in the 78–84 band.

editor take

This is a clean hit on CoT faithfulness: intermediate tokens help, but that does not make them faithful reasoning traces.

sharp

The sharp part is that the paper separates CoT’s semantics from its utility. The authors train Transformers from scratch on formally verifiable traces, then compare correct traces, solution-only data, and corrupted intermediate steps. Correct traces beat the solution-only baseline, but models still emit invalid traces while reaching right answers. Corrupted traces perform similarly to correct CoTs, and even generalize better out of distribution. GRPO raises answer accuracy, but does not improve trace validity. That cuts into the public story around reasoning models. OpenAI, DeepSeek, and Anthropic all use long visible reasoning to make users feel the model is working through steps. This paper says the visible chain can be a computational scaffold, not an audit trail. If a lab wants to sell CoT as safety evidence, it has to measure trace validity first, not show a convincing-looking scratchpad.
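A minimal sketch of what "measure trace validity first" could look like in an eval harness, assuming every sample has already been labeled by an external checker. The field names `answer_correct` and `trace_valid` are hypothetical and do not come from the paper.

```python
def trace_report(samples):
    """Report answer accuracy and trace validity as separate rates.

    Each sample is a dict with two booleans (hypothetical field names):
    answer_correct and trace_valid.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    correct = sum(s["answer_correct"] for s in samples)
    valid = sum(s["trace_valid"] for s in samples)
    # Right answer reached through an invalid trace: the case the paper flags.
    right_but_invalid = sum(
        s["answer_correct"] and not s["trace_valid"] for s in samples
    )
    return {
        "answer_accuracy": correct / n,
        "trace_validity": valid / n,
        "right_answer_invalid_trace": right_but_invalid / n,
    }
```

Reporting the third number next to accuracy is the point: a model can score well on the first while the scratchpad fails the second.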

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V models multimodal jailbreak discovery as learning a joint posterior over paired text-image prompts, combines typography prompts, diffusion image synthesis, and structured distractors, and reports up to 53.75% higher attack success rate than the best baseline on GPT-4o across HarmBench and HADES.

#Multimodal#Vision#Safety#VERA-V

why featured

HKR-H/K/R all pass: the VLM jailbreak angle is clickable, the post gives HarmBench/HADES plus a 53.75% ASR lift, and safety teams care. It remains a single arXiv paper, so it fits 78–84 rather than P1.

editor take

VERA-V turns VLM jailbreaks from prompt craft into posterior sampling; 53.75% higher ASR on GPT-4o says rule patches are losing tempo.

sharp

VERA-V’s sharp part is not that GPT-4o gets jailbroken again; it formalizes the attack surface as a joint text-image posterior. Typography prompts, diffusion-generated images, and structured distractors sit inside one sampling frame, with up to 53.75% higher ASR than the best baseline on GPT-4o across HarmBench and HADES. That is bad news for VLM safety stacks built as OCR-plus-policy filters. VERA-V targets cross-modal coupling and attention fragmentation, not a single naughty string. The arXiv page gives 18 pages, 7 figures, code on GitHub, and a v2 update on 2026-05-26. I’d still check the PDF for baseline setup and ASR definitions, but the direction is clear: multimodal jailbreaks are moving from prompt tricks to sampled attack distributions.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Research proposes heavy-tail guided layerwise learning rates to optimize LLM training

LLR assigns learning rates by each Transformer layer’s heavy-tailedness and reports up to 1.5x training speedup across 60M to 3B models trained on up to 100B tokens, with 3B zero-shot accuracy rising from 48.58% to 50.61%.

#Fine-tuning#Inference-opt#Benchmarking#LLaMA

why featured

HKR-H/K/R all pass: the title challenges a default LR assumption, and the post gives a concrete heavy-tail mechanism plus 60M-3B, 100B-token, 1.5x speedup results. It is strong research, not a same-day must-write model launch.

editor take

LLR is the kind of training tweak teams should actually rerun: 48.58% to 50.61% zero-shot at 3B and up to 1.5x speedup is not cosmetic.

sharp

LLR hits a knob pretraining teams usually leave too blunt: one learning rate for every Transformer layer. The paper assigns per-layer LR from HT-SR heavy-tailedness: weaker heavy tails get larger LR, stronger ones get smaller LR. The evidence is broad for an arXiv training paper: LLaMA to GPT-nano, AdamW and Muon, 60M to 3B parameters, up to 100B tokens, with 3B zero-shot average moving from 48.58% to 50.61% and up to 1.5x speedup. I’m cautious on the “low tuning overhead” claim. LR schedules interact with data mix, warmup, batch size, and optimizer state in annoying ways, and 100B tokens is still below frontier pretraining scale. But if the released code reproduces cleanly, this is easier to adopt than another MoE routing trick or architecture patch.
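To make the direction of the rule concrete, here is a rough sketch. It assumes each layer already has a heavy-tail score `alpha` (for example a power-law exponent fitted to the layer's weight spectrum, where a larger value means weaker heavy tails). The linear mapping and the `min_scale`/`max_scale` bounds are my own illustrative choices, not the paper's formula.

```python
def layerwise_lrs(alphas, base_lr, min_scale=0.5, max_scale=1.5):
    """Map per-layer heavy-tail scores to per-layer learning rates.

    Assumption: larger alpha = weaker heavy tails = larger learning rate.
    The linear interpolation below is illustrative only.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        # All layers look alike: fall back to one shared learning rate.
        return [base_lr for _ in alphas]
    span = max_scale - min_scale
    return [
        base_lr * (min_scale + (a - lo) / (hi - lo) * span)
        for a in alphas
    ]
```

With `alphas=[2.0, 4.0, 6.0]` and `base_lr=1e-3`, the heaviest-tailed layer gets the smallest rate and the lightest-tailed layer the largest, which matches the ordering the summary describes.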

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Innovation: An Almost Characterization of Hallucination

The paper introduces innovation as a property of LLM outputs outside training data, proves hallucination implies innovation, and shows innovation implies hallucination with high probability under its probabilistic framework.

#Safety#Alignment#Reasoning#Kalai

why featured

HKR-K is strong because the paper states a formal near-characterization of hallucination; HKR-R is clear via reliability and safety. With only abstract-level facts and no product artifact or broad adoption signal, it fits the 78–84 band.

editor take

This frames hallucination as statistical gravity: if a model produces outside-training outputs, calibration slogans don't save it.

sharp

Pinning hallucination to innovation is sharper than another RAG patch: if an LLM tends to emit outputs outside its training data, the paper says hallucination follows with high probability. The concrete hook is Kalai and Vempala’s STOC 2024 framework, where missing mass lower-bounded hallucination for calibrated models; this paper routes that bound through innovation rate. I like the move because it cuts through the product story that better calibration kills hallucination. But don’t turn it into engineering absolution. The abstract gives no model runs, datasets, or numeric lower bounds. This is inevitability inside a probabilistic framework, not a measured failure rate for GPT-5.4 mini or Claude Sonnet 4.5.
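For readers who have not met "missing mass": it is the probability weight on outcomes that never appeared in the sample. A standard estimate is Good-Turing, the fraction of observations that occur exactly once. The sketch below only illustrates that quantity on a toy list; it is not the paper's bound and says nothing about any real model.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Good-Turing estimate of missing mass: singletons / sample size."""
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)
```

On `["a", "a", "b", "c"]` two of four observations are singletons, so the estimate is 0.5.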

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail decomposes LLM memory systems into summarization, storage, and retrieval, then uses five datasets across four tasks to evaluate four state-of-the-art memory systems and attribute wrong answers to specific failure modes rather than aggregate QA accuracy.

#Agent#Memory#Benchmarking#MemFail

why featured

HKR-H/K/R pass: the paper targets LLM memory failures with a concrete 3-operation, 5-dataset benchmark and speaks to agent reliability. It is still a single arXiv benchmark, not a major lab release or cross-source event.

editor take

MemFail hits the sore spot in agent memory: aggregate QA scores hide whether summarization, storage, or retrieval actually broke.

sharp

MemFail is useful because it attacks memory as an engineering failure, not a vibes feature. It splits LLM memory into three operations—summarization, storage, and retrieval—then tests four systems on five datasets across four tasks. That framing matters more than another aggregate QA leaderboard, because agent memory bugs rarely look like simple forgetting. They look like stale preferences, compressed contradictions, or retrieval noise being treated as user truth. I like the diagnostic angle, but the RSS snippet withholds the four system names and scores. Without that, we cannot tell whether vector-store memory, summarization buffers, or hybrid designs fail hardest. Still, the benchmark points at the right pain: long-context evals measure what fits in the prompt; agent memory needs blame assignment after the prompt has been rewritten, stored, and fetched.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Tool Calling is Linearly Readable and Steerable in Language Models

The paper tests 18 Gemma, Qwen, and Llama models and finds that tool choice is carried by a single activation-space direction; 4B+ instruction-tuned models switch tools with 83-100% accuracy on a 15-tool synthetic benchmark and 77-94% on τ-bench airline.

#Agent#Tools#Interpretability#Gemma

why featured

HKR-H/K/R all pass: the single-direction tool-calling claim is clickable, the summary gives cross-model numbers, and agent control is a practitioner nerve. It stays at 80 because this is still an arXiv result, not a shipped product.

editor take

Tool choice looks less black-box: one activation direction reads and flips calls across Gemma/Qwen/Llama, but multi-turn agents still break the story.

sharp

This paper pulls tool calling out of prompt folklore and into representation control: across 18 Gemma, Qwen, and Llama models, tool choice is readable and steerable through one activation direction per tool pair. The numbers are unusually clean: 4B+ instruction models hit 83-100% switching accuracy on a 15-tool synthetic benchmark and 77-94% on τ-bench airline, while same-magnitude random vectors produce 0% switches. I buy half of the claim. For pre-execution monitoring, the Gemma 3 27B result is the hard hook: uncertain tool-choice states fail 21x more often. But the paper’s own limit matters: single-turn, fixed-menu settings work; multi-turn agent loops swing by up to 30 points in either direction with no stable pattern. Useful for routing diagnostics, not yet an agent safety layer.
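A toy sketch of what "readable and steerable through one direction per tool pair" can mean in practice. It assumes a difference-of-means direction over stored activations, which is a common probing recipe; the snippet does not say this is the paper's exact construction, and the tiny 2-D vectors are made up.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def tool_direction(acts_a, acts_b):
    """Direction pointing from tool B's mean activation toward tool A's."""
    mu_a, mu_b = mean_vector(acts_a), mean_vector(acts_b)
    direction = [a - b for a, b in zip(mu_a, mu_b)]
    midpoint = [(a + b) / 2 for a, b in zip(mu_a, mu_b)]
    return direction, midpoint


def read_score(act, direction, midpoint):
    """Reading: positive leans toward tool A, negative toward tool B."""
    return sum((x - m) * d for x, m, d in zip(act, midpoint, direction))


def steer(act, direction, alpha):
    """Steering: shift an activation along the direction by alpha."""
    return [x + alpha * d for x, d in zip(act, direction)]
```

A score near zero is the "uncertain tool-choice state" case, which is where a pre-execution monitor would want to look.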

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring

The paper evaluates 12 training-free prompt optimization methods under 5 conditions on 2 OOD benchmark suites, and every best-per-method configuration exceeds the strongest RL-trained baseline at R_total=0.633. ParetoGrad gives the best Pareto balance across post-test solve rate, leak control, and helpfulness.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the paper pits 12 training-free prompt methods against an RL tutor baseline and names ParetoGrad’s tradeoff across learning, leakage, and helpfulness. Scope stays in tutoring and prompting, so 78–84 fits.

editor take

This punches a hole in the “tutoring needs RL” story: 12 prompt-only methods beat the 0.633 RL baseline, so product teams should audit prompts first.

sharp

Tutoring teams should treat the system prompt as an optimizable parameter before burning GPUs on RL. The paper tests 12 training-free prompt optimization methods across 5 conditions and 2 OOD benchmark suites; every best-per-method setup beats the strongest RL-trained baseline at R_total=0.633. ParetoGrad lands the best tradeoff across post-test solve rate, leak control, and helpfulness. The behavioral result is the sharp part: prompt-only methods use teaching-knowledge patterns at 2–3x the rate of RL models, while intent-level scaffolding drops by about 10 percentage points. That smells like better-recovered teacher talk, not a learned long-horizon tutoring policy. Khanmigo- or Duolingo-style systems can use this, but they still need memory and student modeling if the product promise is multi-session learning.
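"Best Pareto balance" has a precise meaning worth spelling out: a configuration is on the front if no other configuration is at least as good on every objective and strictly better on one. A small sketch, assuming all three scores are oriented so that higher is better; the configuration names and numbers used with it are invented.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(
        x > y for x, y in zip(a, b)
    )


def pareto_front(configs):
    """Keep the configurations that no other configuration dominates.

    configs maps a name to (solve_rate, leak_control, helpfulness).
    """
    return {
        name: scores
        for name, scores in configs.items()
        if not any(
            dominates(other, scores)
            for other_name, other in configs.items()
            if other_name != name
        )
    }
```

A prompt that wins on solve rate by leaking answers can still sit on the front, so a team has to pick a point on it rather than read off a single winner.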

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

The paper evaluates quantization across the full Llama-3.1 family with over 500,000 runs, finding FP8 effectively lossless, tuned INT8 losing 1–3% accuracy, and W4A16 most cost-efficient for synchronous vLLM deployments.

#Inference-opt#Benchmarking#Llama#vLLM

why featured

HKR-H/K/R all pass: the title has a hook, the paper gives concrete quantization results, and the cost-accuracy trade-off matters to inference teams. It is a strong engineering benchmark, not a model-launch event.

editor take

BF16 purism just lost cover: 500k+ evals put FP8 near lossless, with tuned INT8 only down 1–3%.

sharp

This paper drags quantization out of vibes and into deployment math: across the full Llama-3.1 family and 500k+ evaluations, FP8 W8A8-FP is effectively lossless, while tuned INT8 W8A8-INT loses only 1–3% accuracy. That matters because this is not a one-model benchmark screenshot. The deployment split is the useful part: under vLLM, W4A16 wins on cost for synchronous serving, while W8A8 wins under asynchronous continuous batching. Plenty of teams still treat BF16 as the safe default; this paper makes that look like paying a memory and throughput tax for comfort. I still have one concern: the abstract does not unpack the real-workload mix, and tail failures in code, long context, or multi-turn agent loops can hide behind average accuracy.
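The "memory tax" is simple arithmetic. A back-of-envelope sketch for weight storage only, using nominal parameter counts; it ignores quantization scales and zero-points, KV cache, and activations, so real footprints will be somewhat higher.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal Llama-3.1 sizes; the bit widths follow the formats named above
# (BF16 = 16-bit weights, W8A8 = 8-bit weights, W4A16 = 4-bit weights).
for name, n in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    row = {bits: weight_gb(n, bits) for bits in (16, 8, 4)}
    print(name, row)
```

For the 8B model that is 16 GB at BF16, 8 GB at 8-bit, and 4 GB at 4-bit before overheads.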

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel selects a fixed-budget subset of web-agent trajectory steps using unary importance and pairwise diversity, and reports roughly 9.7-12.5x training speedups over standard fine-tuning across WebArena, WorkArena, and MiniWob evaluations with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B.

#Agent#Fine-tuning#Tools#Qwen

why featured

HKR-H/K/R all pass, but this remains an arXiv research item. The 9.7-12.5x speedup across three web-agent benchmarks clears featured, not same-day must-write.

editor take

Weasel hits the unsexy bottleneck: web agents need better trajectory curation, not more dumped traces; 9.7-12.5x speedup is the tell.

sharp

Weasel is a useful correction to the web-agent fine-tuning habit: stop treating trajectories as bulk data, start treating them as a noisy budget. The method scores steps with unary importance and pairwise diversity, then trims AXTree context around the ground-truth action target. Across WebArena, WorkArena, and MiniWob, it reports 9.7-12.5x training speedups on Qwen2.5-7B, Gemma3-4B, and Qwen3-8B. I buy the direction, less the clean number. Web-agent benchmarks have a history of rewarding formatting choices, DOM truncation, and action-space quirks as much as policy learning. The paper has ICML 2026 placement and released code, so replication is doable. If the OOD gain survives messy internal SaaS workflows, Weasel becomes a training recipe. If not, it is benchmark hygiene with a good objective.
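One way to picture "unary importance plus pairwise diversity under a fixed budget" is a greedy pick that penalizes redundancy. This is my own illustrative stand-in, not Weasel's objective: the cosine similarity, the `lam` trade-off weight, and the greedy loop are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def select_steps(importance, features, budget, lam=1.0):
    """Greedily pick `budget` step indices.

    Score = importance minus lam times the highest similarity to any
    step already selected, so near-duplicates of picked steps lose out.
    """
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < budget:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return importance[i] - lam * redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-identical high-importance steps and one different lower-importance step, a budget of two keeps one of the twins plus the different step.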

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

The paper introduces task-preserving perturbations and shows that correct demonstrations can still reduce ICL accuracy. The degradation appears across sentiment classification, logical reasoning, and math word problems, with stronger effects for smaller models, harder tasks, and higher perturbation ratios; code is released on GitHub.

#Reasoning#Benchmarking#arXiv#GitHub

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the post gives a testable perturbation mechanism and three task domains, and it challenges few-shot prompt reliability. Single arXiv paper with no drop sizes disclosed, so it stays in the 78–84 band.

editor take

Correct few-shot examples can still hurt accuracy; that punches straight through lazy prompt recipes. ICL cares about evidence mix, not just label correctness.

sharp

This paper cuts into the cult of “correct few-shot examples”: labels can be right, and the model still follows the wrong contextual evidence. The authors use task-preserving perturbations: change only the exemplar input, recompute the target under the task mapping, then test ICL. Accuracy drops across sentiment classification, logical reasoning, and math word problems, with worse damage on smaller models, harder tasks, and higher perturbation ratios. I buy this more than another prompt-ordering anecdote. It gives a reproducible condition: correctness holds, input evidence shifts, performance falls. That should annoy anyone running few-shot evals. Those hand-picked “clean” demonstrations in benchmarks are not automatically teaching the task; they can be steering the model toward a skewed evidence mixture.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy

ATOM builds budget-controllable multi-agent collaboration graphs with an offline-learned nucleus and query-conditioned electron agents at inference, and reports up to 30% better token efficiency than strong baselines across six benchmarks.

#Agent#Reasoning#Inference-opt#ATOM

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with only mechanism summary and peak gains, not code or production proof. Agent cost control is timely, so it clears featured at 78 but not P1.

editor take

ATOM usefully drags multi-agent work back to budget control; 30% token efficiency is nice, but the difficulty estimator is the stress point.

sharp

ATOM’s useful move is admitting that multi-agent systems usually fail by spawning too many agents. The paper keeps an offline-learned nucleus as the stable collaboration backbone, then creates query-conditioned electron agents at inference. A complexity-aware budget gates those agents, and the authors report up to 30% better token efficiency across six benchmarks. I buy the direction more than the headline number. Multi-agent papers over the last year kept trading extra agents for leaderboard gains, then quietly dumping the cost problem on deployment. ATOM makes budget a first-class constraint, which is the right pressure. But the abstract does not give absolute token counts, latency, failure cases, or cross-domain calibration for the difficulty estimator. If the 30% comes from benchmark difficulty being easy to predict, the engineering value shrinks fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

The paper samples 20,000 stories from four current models using five prompts and finds 11 words in 88.3% of outputs, linking recurring names and settings such as Elias and lighthouses to preference data rather than published literature or pre-training data.

#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the title has a sticky repeated motif, the body gives 20K samples and 88.3% coverage, and the claim hits practitioner concern about preference data reducing diversity. As a single arXiv paper without adoption signals, it fits 78.

editor take

20,000 sampled stories and 11 words in 88.3% of them: this is preference training turning “safe fiction” into a house style.

sharp

The sharp part is that the diversity collapse is traced to preference data, not pretraining. Hamilton and Mimno sampled 20,000 stories from four current models across five prompts. Eleven words appeared in 88.3% of outputs, including Elias, Mara, Elara, lighthouses, clockmaker, and librarian. The paper says those tokens are rare in published literature and pretraining data, but present in likely shared preference data. That makes the usual “post-training only nudges behavior” story look thin. If SFT/RLHF/DPO pipelines amplify tiny human-preference artifacts, models learn a house aesthetic: copyright-safe, adult-content-free, vaguely literary, and painfully samey. For writing products, hallucination is not the only failure mode. The scarier one is every model independently discovering the same lighthouse.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

The paper introduces Furina, a jailbreak attack that uses fragmented, scene-anchored prompts to induce refusal instability; experiments cover HarmBench and MM-SafetyBench, and the code is available on GitHub.

#Safety#Multimodal#Benchmarking#Furina

why featured

HKR-H/K/R all pass: the paper offers a named jailbreak mechanism, benchmark coverage, and open code, with clear safety resonance. Missing success rates, tested models, and defense results keep it in the 78 band.

editor take

Furina is scary because it attacks refusal instability, not policy wording; that is a cleaner failure mode than another prompt hack.

sharp

Furina’s sharp edge is the claim that refusal is an unstable region, not a clean threshold. The paper says fragmented, scene-anchored prompts work without model-specific optimization, beat strong single-turn and multi-turn baselines on HarmBench, and stay competitive on MM-SafetyBench. I buy half the story. The useful hook is the diagnostic split: higher output uncertainty while internal safety activation drops. That explains why detection-style defenses miss attacks that do not look like classic malicious prompts. But the snippet gives no ASR, model list, or defense setup. If this holds only on a narrow model set, Furina is a good jailbreak. If it transfers across GPT, Claude, Gemini, and Qwen, it is evidence that refusal classifiers are structurally shaky.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

Self-Verified Distillation trains Qwen3 models from unlabeled seed questions, and the 4B model improves held-out pass@1 by 16.7 points in math, 11.1 points in science, and 8.3 points in coding after self-filtering candidate solutions through cycle-consistency, factuality, and correctness checks.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the title has a contrarian hook, the abstract gives Qwen3 4B pass@1 gains, and the mechanism targets unlabeled self-generated data. As a single arXiv paper awaiting replication, it lands at 78.

editor take

Self-distillation gets real pass@1 gains here, but same-family judging is the trap: the model may learn the verifier, not the task.

sharp

Self-Verified Distillation’s useful move is shifting sampling cost from inference to dataset construction. Qwen3-4B gains +16.7 pass@1 on AIME26/HMMT, +11.1 on GPQA Diamond/HLE, and +8.3 on LCBv5/v6, then uses one inference call at test time. For small-model deployment, that trade is clean. I don’t fully buy the “its own synthetic data pipeline” framing. The filter uses cycle-consistency, factuality, and correctness checks, with unanimous judge votes. That removes obvious junk, but it also risks freezing the model’s blind spots into the training set. Beating UQ-TTC while spending less test-time compute is the solid part; the abstract does not show human error audits, so we cannot tell whether the gain is broader reasoning or better verifier compliance.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Stateful Inference for Low-Latency Multi-Agent Tool Calling

The paper presents a stateful inference architecture that reduces per-turn multi-agent tool-calling cost from O(n_t) to O(Δ_t), using persistent KV cache, radix prefix cache, and prompt-lookup speculative decoding to reach 2.1x speedup on a 6-turn workflow and 4.2x on the median turn of a 35-turn workflow.

#Agent#Tools#Inference-opt#vLLM

why featured

HKR-H/K/R all pass: the hook is agent latency, the new claim is O(Δ_t) stateful inference with 2.1x/4.2x speedups, and the pain is serving cost. Single arXiv systems paper keeps it at the low end of featured.

editor take

Agent latency is not only model quality; it is servers recomputing 85-95% stale prompt every turn. This paper attacks the right bill.

sharp

This paper hits a serving-layer wound in agent systems: multi-turn tool use keeps paying for old context as if every turn were fresh. The mechanism is specific enough to take seriously: persistent KV cache across turns, radix prefix cache for interleaved agents, and prompt-lookup speculative decoding for structured output. The claimed cost move is from O(n_t) to O(Δ_t), not a vague cache story. The reported numbers are useful: 2.1x per-turn speedup on a 6-turn workflow, 4.2x on the median turn of a 35-turn workflow, and half the end-to-end wall time versus vLLM and SGLang. My pushback is the workload: the abstract says “novel, fully-generated,” so production annoyances like tool latency, auth checks, retries, and partial failures may be undercounted. Even with that caveat, this is closer to the agent speedup users will feel than another planner model demo.
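The O(n_t) versus O(Δ_t) gap is easy to see by counting prefill tokens. A toy accounting sketch, assuming each turn appends `delta` new tokens to a growing context; it ignores decoding, cache eviction, and tool latency, and the token counts used with it are made up.

```python
from itertools import accumulate


def prefill_tokens(deltas):
    """Per-turn prefill cost with and without state kept across turns.

    deltas[t] is the number of new tokens added at turn t (Delta_t).
    Stateless serving re-processes the whole context n_t every turn;
    stateful serving processes only the new tokens.
    """
    context = list(accumulate(deltas))  # n_t = total context at turn t
    stateless = context
    stateful = list(deltas)
    # Share of each turn's prompt that was already seen on earlier turns.
    stale_fraction = [1 - d / n for d, n in zip(deltas, context)]
    return stateless, stateful, stale_fraction
```

With a 1000-token first turn and 100-token follow-ups, the stale share is already above 90% by turn two, which is the regime the editor take describes.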

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

The paper argues that RLVR improves LLMs on math, code, and structured tasks, but several cited gains shrink or disappear after budget matching, prompt and dataset version control, and contamination screening.

#Reasoning#Code#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the paper challenges RLVR gains with testable controls around budget, calibration, and contamination. It stays at 78 because it is an arXiv position paper, not a model release or deployment.

editor take

RLVR needs a cooldown: once budgets and contamination checks are matched, many “reasoning gains” look like eval arbitrage.

sharp

RLVR’s problem is not that it fails; it is that too many papers sell “more attempts” as reasoning. arXiv:2509.21882 names three confounds: budget mismatch, attempt inflation plus calibration drift, and benchmark contamination. With budget-matched reproductions and partial-prompt contamination probes, several cited gaps shrink or vanish. That hits the awkward spot in this year’s reasoning-model story. Math and code are good RLVR targets because rewards are checkable, but pass@k or one-shot headline scores reward models that guess harder. The proposed bar is basic: saturation curves, variance, calibration, abstention tracking, judge-robustness tests, and contamination screens. If an RLVR paper skips those, I’d discount the claimed gain before reading the leaderboard.
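Why "more attempts" can pass for reasoning: pass@k rises with k even when per-sample accuracy is flat. The sketch uses the standard unbiased pass@k estimator (n samples drawn, c of them correct) popularized by the HumanEval/Codex evaluation; it is here to illustrate attempt inflation, not to reproduce the position paper's experiments.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: correct samples, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the result is exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Same underlying model (1 correct out of 10), different attempt budgets.
for k in (1, 5, 10):
    print(k, round(pass_at_k(10, 1, k), 3))
```

A comparison that gives one system k=1 and another k=8 is measuring the budget as much as the model, which is the first confound the paper names.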

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

The paper studies R2D2 and SFT on a 7B backbone, where R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints, but adaptive GCG attack success rises to 0.613 at step 500.

#Fine-tuning#Safety#Interpretability#arXiv

why featured

HKR-K/R are strong: the post gives testable attack numbers and a practical safety warning. HKR-H comes from the ASR-to-0 then adaptive-GCG rebound; single arXiv paper and high technical load keep it in the low 78–84 band.

editor take

R2D2 hits 0 fixed HarmBench ASR, then adaptive GCG reaches 0.613 at step 500; safety fine-tuning is still farming static tests.

sharp

R2D2’s problem is not weak refusal; it is the split between static robustness and adaptive robustness. On a 7B backbone, early checkpoints drive fixed-source HarmBench attack success to 0, while XSTest refusal peaks and a benign-utility audit fails. By step 250 and 500, adaptive GCG attack success climbs back to 0.415 and 0.613. That curve does not support a clean “dynamic defense is robust” story; it supports a moving-target refusal policy that overfits the visible attack surface. The mechanism is concrete enough to matter: effective rank stays near 1.24, R2D2 preserves a late-layer refusal carrier through step 100, then relocates the best admissible carrier to an early layer. The refusal direction remains low-dimensional, but it becomes more utility-coupled. Fixed HarmBench ASR here looks like a unit test, not a safety guarantee.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT manages million-scale LoRA policy catalogs while training and serving adapter revisions over shared 1T-class base models; rank-1 adapters can be under 1% of base-model size, and adapter-only handoff reduces the measured step by 18.3x on a 4B dense model.

#Fine-tuning#Inference-opt#Agent#MindLab Toolkit

why featured

HKR-H/K/R all pass, but this is an arXiv infra paper without disclosed deployment reach or major-lab weight. It fits the lower featured band for a practical research release.

editor take

MinT turns LoRA from a tuning trick into model ops; million-scale catalogs are serious, but 18.3x on 4B should not be sold as 1T proof.

sharp

MinT’s sharp claim is not that LoRA saves memory. It treats million-scale adapters as an operational catalog. The bet is clear: keep a shared 1T-class base model, push task variance into rank-1 LoRA, and move adapters instead of whole models. A rank-1 adapter can sit under 1% of base size. The 18.3x adapter-only handoff result on a 4B dense model is a real hook. I would not extrapolate it straight to 1T production serving. The hard parts move into cache residency, routing, rollback, and tenant isolation. Hugging Face PEFT made adapter training accessible. vLLM attacked serving throughput. MinT is going after the ugly middle layer: which agent gets which adapter revision, when. That layer kills multi-tenant agent systems quietly.
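The "under 1%" figure follows from LoRA's parameter count. For one weight matrix of shape d_out x d_in, a rank-r adapter stores two factors with r * (d_in + d_out) parameters in total. A quick sketch for a single square layer; the 4096 width is an arbitrary example, not a MinT configuration.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in a LoRA pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


def adapter_fraction(d_in, d_out, rank):
    """Adapter size relative to the full d_out x d_in weight matrix."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


print(adapter_fraction(4096, 4096, rank=1))
```

For a 4096 x 4096 layer that is 8,192 adapter parameters against about 16.8M base parameters, roughly 0.05%, so moving adapters instead of models is cheap on the wire. Cache residency and routing are where the cost goes.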

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

BASIS samples one rollout per prompt and uses cross-prompt information within the batch for value estimation; the paper reports 69% lower MSE than the single-rollout REINFORCE++ baseline and lower MSE with one rollout than group mean estimators using 8 rollouts.

#Reasoning#Fine-tuning#Benchmarking#BASIS

why featured

HKR-H/K/R all pass: the paper offers a counterintuitive single-rollout estimator, a 69% MSE claim, and a direct compute-cost hook. It remains an arXiv method paper needing reproduction, so it lands at 78 featured, not P1.

editor take

BASIS attacks the expensive part of RLVR: many rollouts per prompt. If the 69% MSE drop holds, GRPO-style training budgets get awkward.

sharp

BASIS is poking the sampling tax behind GRPO-style RLVR, not polishing another RL acronym. It uses one rollout per prompt, then borrows signal across the batch for value estimation. The paper reports 69% lower MSE than single-rollout REINFORCE++ and lower MSE than group-mean estimators using 8 rollouts. If that only holds on tidy paper tasks, fine, it is a neat estimator. If it holds in math, code, and long-chain RLVR pipelines, the savings hit training time and rollout budget directly. After DeepSeek-R1, the field internalized “sample more to reason better.” BASIS is attacking that assumption. The snippet does not give model scale, task mix, or wall-clock numbers, so I’d be cautious about cross-prompt value sharing under messy mixed-distribution batches.
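For context on the sampling tax, here are the two simple baselines on either side of it: a group-mean baseline that needs several rollouts per prompt, and a leave-one-out batch baseline that works with one rollout per prompt. Neither is BASIS itself; the snippet does not describe its estimator in enough detail to sketch, so these are only the familiar reference points.

```python
from statistics import mean


def group_mean_advantages(rewards_per_prompt):
    """GRPO-style: subtract each prompt's own mean over its G rollouts."""
    return [[r - mean(group) for r in group] for group in rewards_per_prompt]


def batch_loo_advantages(rewards):
    """One rollout per prompt: subtract the mean reward of the other prompts."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two prompts in the batch")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

The first costs G rollouts per prompt; the second costs one, but its baseline ignores how hard each individual prompt is, which is the gap a cross-prompt value estimator would have to close.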

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Curriculum Learning for Safety Alignment

The paper proposes Staged-Competence, a curriculum framework that orders preference data by difficulty and reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20% across three model families.

#Alignment#Safety#Fine-tuning#Research release

why featured

HKR-K/R pass: the paper gives a concrete training mechanism and 16%/20% results tied to safety practice. HKR-H is weak, and this is a single arXiv paper, so 78 fits the lower featured band.

editor take

DPO safety got a practical patch: difficulty-ordered preference data cuts jailbreak success by 20%, but curriculum is training hygiene, not a moat.

sharp

Staged-Competence makes DPO fragility look like a data-ordering problem, and I half-buy it. The concrete hook is strong: across three model families, OOD harmful responses drop 16%, jailbreak success drops 20%, baseline safety is matched with 75% of the training data, and over-refusal stays near zero. For post-training teams, that is more reusable than another loss tweak. My hesitation is the evaluation surface. The abstract does not name the model families, attack sets, absolute harmful-response rates, or whether the 20% is a relative or a percentage-point reduction. The last year of DPO variants produced plenty of gains that lived inside one jailbreak suite. Open code and data help. If this reproduces on HarmBench, AdvBench, or WildGuard-style external sets, it becomes training-pipeline hygiene.
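Difficulty-ordered preference data can be sketched in a few lines. This is a minimal illustration, assuming each pair carries a hypothetical `margin` field (for example a reference-model preference margin) and that a smaller margin means a harder pair; the paper's actual difficulty measure is not given in the abstract.

```python
def build_stages(pairs, difficulty, num_stages=3):
    """Sort preference pairs from easy to hard and cut them into training stages."""
    if not pairs:
        return []
    ordered = sorted(pairs, key=difficulty)
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Hypothetical preference pairs; `margin` is an assumed difficulty signal.
pairs = [
    {"id": "a", "margin": 0.9},
    {"id": "b", "margin": 0.1},
    {"id": "c", "margin": 0.5},
    {"id": "d", "margin": 0.3},
]
stages = build_stages(pairs, difficulty=lambda p: -p["margin"], num_stages=2)
```

Each stage would then feed a standard DPO run in order, easy pairs first.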

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

The paper evaluates autonomous generative AI agents with the MIT Beer Game, reports that optimized reasoning models cut costs by up to 67% versus human teams, and introduces agent bullwhip plus a GRPO post-training framework to reduce tail events and decision instability.

#Agent#Reasoning#Alignment#MIT

why featured

HKR-H/K/R all pass: 67% cost reduction is the hook, Beer Game plus GRPO gives testable substance, and agent reliability hits practitioners. Single arXiv paper in a narrow vertical keeps it at 78.

editor take

The 67% cost cut is flashy; the scarier result is agents creating their own bullwhip, and sampling does not fix it.

sharp

This paper cuts into the clean enterprise-agent story: stronger reasoning does not equal reliable operations. In the MIT Beer Game, optimized reasoning models cut costs by up to 67% versus human teams, but the same demand path still produced amplified decision variance across facilities and over time. The authors call it agent bullwhip, and that label is useful because it separates model randomness from market demand noise. The sharp detail is that repeated sampling did not meaningfully reduce the instability. That makes the usual test-time fix look weak; the failure sits in the policy, not just one bad completion. Their GRPO post-training frame uses system-level supply-chain rewards, which is closer to what production agents need than another layer of prompts and guardrails. If an agent touches inventory, procurement, or replenishment, average cost is the wrong first question. Tail order volatility is where the bill lands.
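Two quantities are worth separating here, and a short sketch shows how. The first is the classic bullwhip ratio (order variance over demand variance); the second is run-to-run order variance under one fixed demand path, which isolates agent randomness from demand noise. The paper's exact metric definitions are not in the snippet, so treat both functions as assumptions.

```python
from statistics import mean, pvariance


def bullwhip_ratio(orders, demand):
    """Classic bullwhip measure: variance of orders placed over variance of demand seen."""
    demand_var = pvariance(demand)
    if demand_var == 0:
        raise ValueError("demand is constant; the ratio is undefined")
    return pvariance(orders) / demand_var


def run_to_run_variance(runs):
    """Mean per-period order variance across repeated runs on the same demand path."""
    return mean(pvariance(period) for period in zip(*runs))


demand = [4, 6, 8]
# Hypothetical orders from three repeated agent runs on the same demand path.
runs = [[4, 8, 12], [4, 4, 4], [4, 12, 20]]
ratio = bullwhip_ratio(runs[0], demand)
instability = run_to_run_variance(runs)
```

If `instability` stays high while demand is held fixed, the variance is coming from the policy, which matches the point that repeated sampling does not fix it.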

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

The paper introduces Bridge-Garden hybrid supervision for LLM distillation, tests seven teacher-student pairs including Qwen, Llama, Gemma, and DeepSeek on reasoning and coding benchmarks, and reports better results than divergence-based and on-policy KD baselines with a 9.7x training-cost reduction.

#Reasoning#Code#Fine-tuning#Qwen

why featured

HKR-H/K/R all pass: the mixed-label claim is a hook, the post gives 7 model pairs and a 1/9.7 training-cost figure, and it hits distillation cost. Strong research release, but a single arXiv paper, not same-day must-write.

editor take

Distillation finally gets a cleaner story than hard-vs-soft folklore; 9.7x cheaper is loud, but the repo has to survive reproduction.

sharp

Bridge-Garden hits a KD problem people often hand-wave: richer soft labels do not mean every token should learn a distribution. The paper splits generation into Bridges, where the next token must land exactly, and Gardens, where diversity helps. Across seven Qwen, Llama, Gemma, and DeepSeek teacher-student pairs, it beats divergence-based and on-policy KD baselines, while reporting a 9.7x training-cost cut. I buy the direction, but not the 9.7x number on first read. Distillation papers kept selling “free” compression this year, then broke on teacher sampling, benchmark choice, or student scale. This one at least gives a falsifiable mechanism: if exposure bias drives the gain, failures in long reasoning and code completion should line up with the Bridge/Garden split.
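A per-token hybrid loss makes the Bridge/Garden split easier to picture. The sketch below assumes teacher entropy decides the split, with a hypothetical threshold of 0.5 nats; the paper's actual criterion is not stated in the snippet, so this is one plausible instance and not the authors' method.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def token_loss(teacher_probs, student_probs, hard_label, entropy_threshold=0.5):
    """Hard-label cross-entropy on confident teacher tokens, KL(teacher || student) elsewhere.

    Assumes every student probability is strictly positive.
    """
    if entropy(teacher_probs) < entropy_threshold:
        # "Bridge": the teacher is near-deterministic, so train on the exact token.
        return -math.log(student_probs[hard_label])
    # "Garden": the teacher spreads mass, so match the full distribution.
    return sum(t * math.log(t / s) for t, s in zip(teacher_probs, student_probs) if t > 0)


bridge = token_loss([0.98, 0.01, 0.01], [0.5, 0.25, 0.25], hard_label=0)
garden = token_loss([0.4, 0.3, 0.3], [0.4, 0.3, 0.3], hard_label=0)
```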

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL treats GPT-2 small’s 144 attention heads as a discrete action space, trains one PPO policy on induction and IOI with zero-ablation and contrastive rewards, and reaches 96% of the oracle ceiling on held-out docstring completion under best-of-five planning.

#Agent#Interpretability#Reasoning#GPT-2

why featured

HKR-H/K/R all pass: circuit discovery is framed as an RL-agent task with concrete numbers like 144 heads and 96% of oracle ceiling. Scope is still GPT-2 small plus narrow tasks, so it lands in mid-featured rather than p1.

editor take

MechRL turns circuit hunting into a PPO search problem; useful, but so far it proves GPT-2 small single-head bottlenecks, not broad interpretability automation.

sharp

MechRL is useful because it turns circuit discovery from artisanal analysis into a trainable policy, but calling it automated interpretability is too generous. The setup stays inside GPT-2 small: 144 attention heads as actions, PPO trained on induction and IOI, with zero-ablation plus a contrastive reward. The strongest number is 96% of the oracle ceiling on held-out docstring completion under best-of-five planning. I buy the direction because the reward subtracts general next-token damage, so the agent is pushed toward task-causal heads rather than merely destructive heads. The catch is scope. Single-head ablation is a clean GPT-2-era sandbox. Once the target becomes MLP features, head combinations, or MoE routing, the action space and credit assignment get ugly fast.
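The reward idea can be sketched directly: damage on the task minus damage on general next-token prediction. GPT-2 small has 12 layers of 12 heads, which gives the 144 actions. The function names, the `penalty` weight, and the loss values below are illustrative assumptions, not the paper's implementation.

```python
HEADS_PER_LAYER = 12  # GPT-2 small: 12 layers x 12 heads = 144 actions


def action_to_head(action):
    """Map a flat action index in [0, 144) to a (layer, head) pair."""
    if not 0 <= action < 144:
        raise ValueError("action must be in [0, 144)")
    return divmod(action, HEADS_PER_LAYER)


def contrastive_reward(task_clean, task_ablated, general_clean, general_ablated, penalty=1.0):
    """Reward task-specific damage from zero-ablating a head, minus general damage."""
    task_damage = task_ablated - task_clean
    general_damage = general_ablated - general_clean
    return task_damage - penalty * general_damage


# Hypothetical losses: one head matters mostly for the task, one is simply destructive.
task_causal = contrastive_reward(1.0, 3.0, 2.0, 2.1)
destructive = contrastive_reward(1.0, 3.0, 2.0, 4.0)
```

The subtraction is what keeps a head that wrecks everything from scoring as well as a head that matters only for the task.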

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

The paper proposes Teachability-Aware OPD, using fixed-context KL reduction to measure token teachability; across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often beats full-token OPD while retaining only 5% of tokens.

#Fine-tuning#Alignment#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the counterintuitive 5%-token result, a concrete KL-based metric, and Qwen setups give practitioners something to test. Single arXiv paper, so it lands at 78 rather than same-day must-write.

editor take

Keeping 5% of tokens and often beating full-token OPD is the sharp bit: high disagreement was never the same as learnable signal.

sharp

TA-OPD makes a useful cut: much of token-level distillation cost is spent on teacher signals the student cannot absorb. The paper defines token teachability via fixed-context KL reduction, separating cases where the teacher corrects the student’s top-K candidates from cases where teacher mass sits outside the student’s current support. Across Qwen2.5 and Qwen 3 teacher-student setups, keeping only 5% of tokens often beats full-token OPD. That is a problem for a lot of selective distillation work built on entropy or raw KL. Those heuristics measure conflict intensity, not whether the gradient lands anywhere learnable. I’d want replication outside Qwen, especially on long reasoning and code, but the claim is clean enough to matter: OPD’s default full-token loss may be paying for disagreement that behaves like noise.
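The selection step is simple once a per-token score exists. This minimal sketch keeps the top 5% of tokens by a teachability score and averages the loss over only those; the scores and losses are placeholders, and how TA-OPD computes its fixed-context KL reduction is not reproduced here.

```python
def top_fraction_mask(scores, keep_fraction=0.05):
    """Boolean mask that keeps the highest-scoring fraction of tokens (at least one)."""
    k = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]


def masked_mean(losses, mask):
    """Average the per-token loss over kept tokens only."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)


# Hypothetical per-token teachability scores and distillation losses for 40 tokens.
scores = [i * 0.1 for i in range(40)]
losses = [1.0] * 38 + [2.0, 4.0]
mask = top_fraction_mask(scores)
loss = masked_mean(losses, mask)
```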

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Test-Time Compute for Dense Retrieval Using Agentic Program Generation

The paper uses an agentic program-search loop over a frozen encoder API to test 144 candidate programs, producing 12 Pareto-optimal programs that improve nDCG@10 across all 14 MMTEB retrieval tasks at 1.2–14.7 times the single-pass baseline cost.

#Agent#Embedding#Inference-opt#arXiv

why featured

HKR-H/K/R all pass: the mechanism and numbers are concrete, and the cost-quality tradeoff matters to RAG teams. It remains an arXiv retrieval paper, so it lands at the lower good-quality band, not a same-day must-write.

editor take

Retrieval is now eating test-time compute too; 144 searched programs yielding 12 Pareto points smells more practical than another embedding-size arms race.

sharp

This paper pushes test-time compute into frozen embedding APIs, and the useful part is the transfer claim. The loop searched 144 candidate programs and found 12 Pareto-optimal programs at 1.2–14.7x single-pass cost. All 14 MMTEB retrieval tasks improved on nDCG@10, and 68% of held-out model-task pairs had at least one frontier program beating the cosine baseline across 19 extra tasks. I buy the direction because it avoids retraining the encoder and does not sneak in external models. The search rediscovered Rocchio feedback, ColBERT-style sentence MaxSim, reciprocal rank fusion, and Fisher linear discriminant. That makes “agentic” less like branding and more like automated composition over old retrieval tricks. The catch is blunt: 14.7x cost will hurt online latency before it impresses a search infra team.
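Of the rediscovered tricks, reciprocal rank fusion is the easiest to show. The sketch below is the standard formulation with the commonly used constant k=60; the document IDs and rankings are made up, and this is not one of the paper's generated programs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from three query variants over the same frozen encoder.
rankings = [["a", "b", "c"], ["b", "c", "a"], ["b", "a", "d"]]
fused = reciprocal_rank_fusion(rankings)
```

Each extra ranking costs another pass through the encoder API, which is where the cost multiplier comes from.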

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→GUI-Libra: Training Native GUI Agents with Action-aware Supervision and Partially Verifiable RL

GUI-Libra releases an 81K GUI reasoning dataset and trains native GUI agents with action-aware SFT, a KL trust region, and success-adaptive scaling; the abstract says it improves step-wise accuracy and end-to-end task completion across web and mobile benchmarks.

#Agent#Reasoning#Fine-tuning#GUI-Libra

why featured

HKR-H/K/R pass: the 81K dataset and training recipe are concrete, and GUI-agent reliability is a live practitioner concern. It stays at 78 because this is an arXiv paper and exact gains are not disclosed.

editor take

GUI-Libra pushes GUI agents back to action verification, not generic reasoning; the 81K dataset is useful, but the KL trust region is the hook.

sharp

GUI-Libra makes the right cut: GUI agents are not mainly blocked by longer CoT; they are blocked by dirty action supervision. The paper releases an 81K GUI reasoning dataset, mixes reasoning-then-action with direct-action in action-aware SFT, then uses a KL trust region for partially verifiable RL. That is a better hook than benchmark gains alone, because it names the poison: many GUI actions can work, while the verifier rewards only one demonstrated action. I don’t buy the “without costly online data collection” framing. Long-horizon GUI work drifts hard: DOM changes, app versions, login state, and layout variants break offline wins. Compared with SWE-bench-style code tasks, GUI agents have a messier execution surface. Without online replay and real failure logs, 81K samples become a clean map of a city that keeps rebuilding itself.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

Pilot-Commit uses a pilot stage to estimate per-prompt informativeness online, then allocates remaining rollouts to high-variance prompts; across math reasoning benchmarks and 1.5B to 14B models, it reaches target accuracy up to 1.9x faster than GRPO and 4.0x faster than DAPO in cumulative rollouts.

#Reasoning#Fine-tuning#Inference-opt#Pilot-Commit

why featured

HKR-H/K/R all pass: the 4.0x DAPO speedup, pilot-allocation mechanism, and RL compute-cost angle are concrete. It stays below 78 because this is a niche arXiv post-training paper with no named lab rollout or code claim.

editor take

Pilot-Commit attacks the boring cost center in RL post-training: wasted rollouts. The 1.9x/4.0x gain is budget routing, not model magic.

sharp

Pilot-Commit makes the right bet: RL post-training waste lives in rollout allocation, not another renamed loss. It runs a pilot stage to estimate per-prompt reward variance online, then spends the remaining samples on high-variance prompts and skips low-signal ones. On math reasoning benchmarks across 1.5B to 14B models, it reaches target accuracy up to 1.9x faster than GRPO and 4.0x faster than DAPO in cumulative rollouts. That is useful because rollout generation is the bill, especially for on-policy training. I would still be careful with the headline number: the snippet reports cumulative rollouts, not wall-clock time, pilot budget ratio, or behavior under prompt-distribution drift. If those hold, this is the kind of unsexy systems tweak that actually survives beyond one arXiv cycle.
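The pilot-then-commit idea can be sketched as a variance-proportional split of the remaining budget. The proportional rule and the largest-remainder rounding are assumptions for illustration; the snippet does not give Pilot-Commit's actual allocation rule.

```python
from statistics import pvariance


def allocate_rollouts(pilot_rewards, remaining_budget):
    """Split the remaining rollouts across prompts in proportion to pilot reward variance."""
    variances = {prompt: pvariance(rewards) for prompt, rewards in pilot_rewards.items()}
    total = sum(variances.values())
    if total == 0:
        return {prompt: 0 for prompt in pilot_rewards}
    raw = {prompt: remaining_budget * var / total for prompt, var in variances.items()}
    allocation = {prompt: int(share) for prompt, share in raw.items()}
    # Hand out rollouts lost to rounding, largest fractional remainder first.
    leftover = remaining_budget - sum(allocation.values())
    by_remainder = sorted(raw, key=lambda prompt: raw[prompt] - allocation[prompt], reverse=True)
    for prompt in by_remainder[:leftover]:
        allocation[prompt] += 1
    return allocation


# Hypothetical pilot rewards (1.0 = verified correct) from four rollouts per prompt.
pilot = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "mixed": [1.0, 0.0, 1.0, 0.0],
    "hard": [0.0, 0.0, 0.0, 1.0],
}
allocation = allocate_rollouts(pilot, remaining_budget=14)
```

A prompt the model always solves has zero reward variance and gets nothing, which is the "skip low-signal ones" behavior.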

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

The paper proposes Student-Centric Answer Sampling, which selects verified teacher-generated answers using a forward-only proxy for student-centric learning cost; experiments cover 30 teacher models, 6 student base models, and 8 tasks.

#Fine-tuning#Reasoning#Research release

why featured

HKR-H/K/R all pass: the title is counterintuitive, and the post gives SCAS, forward proxy cost, and experiment scale. It lacks open source, production impact, or cross-source debate, so it sits at the featured threshold.

editor take

Strongest teacher is a lazy distillation heuristic; SCAS says pick the answer the student can actually learn from.

sharp

SCAS attacks the laziest assumption in distillation: a higher-scoring teacher produces better supervision. The paper tests 30 teacher models, 6 student bases, and 8 tasks, then selects only among verified correct answers using a forward-only proxy for student learning cost. That is closer to a real training pipeline than the usual “use the biggest model for better CoT” story. I buy the direction, not the implied completeness. The method depends on a verified candidate set, so a lot of the hard work moves into the verifier and answer pool. For math or code-style tasks, that is tractable. For open-ended writing, tool plans, or long agent traces, correctness stops being a clean binary label and the proxy gets brittle fast. This looks like a useful data-selection operator, not a new distillation doctrine.
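As a data-selection operator, the step looks roughly like this. The sketch assumes the forward-only proxy is the student's mean token negative log-likelihood on each verified answer; that choice, and the field names, are assumptions, since the snippet does not define the SCAS proxy.

```python
def select_answer(candidates):
    """Among verified answers, pick the one with the lowest student mean token NLL."""
    verified = [c for c in candidates if c["verified"]]
    if not verified:
        return None

    def mean_nll(candidate):
        nll = candidate["student_token_nll"]
        return sum(nll) / len(nll)

    return min(verified, key=mean_nll)


# Hypothetical answer pool: NLL values come from one forward pass of the student.
candidates = [
    {"teacher": "large", "verified": True, "student_token_nll": [2.0, 3.0, 4.0]},
    {"teacher": "small", "verified": True, "student_token_nll": [1.0, 1.5, 0.5]},
    {"teacher": "medium", "verified": False, "student_token_nll": [0.1, 0.1, 0.1]},
]
chosen = select_answer(candidates)
```

The unverified answer is excluded even though the student finds it easiest, which is why the verifier carries so much of the load.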

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Jaideep Ray measures a “constraint tax” across 15,000 generations: hard schema decoding raises validity from 61.5% to 100.0% on Qwen2.5 and SmolLM2 small models, but answer accuracy drops from 19.7% to 11.0% and wrong-valid-schema outputs rise from 49.5% to 88.9%.

#Tools#Reasoning#Benchmarking#Jaideep Ray

why featured

HKR-H/K/R all pass: the hook is counterintuitive, the paper gives 15,000-generation validity/accuracy numbers, and schema decoding is a live practitioner tradeoff. Single arXiv paper with limited source authority keeps it below the 78+ band.

editor take

Hard schemas made Qwen2.5/SmolLM2 100% valid and less correct; for small-model tool use, pretty JSON can just mean cleaner failure.

sharp

Small-model structured output has a measurable failure mode here: valid JSON is not reliable tool use. Across 15,000 generations, hard schema decoding pushed validity from 61.5% to 100.0%, while answer accuracy fell from 19.7% to 11.0%. Wrong-but-valid outputs rose to 88.9%. The calendar tool result is the sharpest cut: Qwen2.5-1.5B hit 91.5% executable accuracy with prompt-only JSON, then fell to 48.0% under the same hard schema while staying 100% valid. That should make on-device agent teams nervous. Parser errors going to zero can hide a semantic regression, especially below 3B parameters.
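Tracking validity and correctness as separate numbers is cheap. This minimal sketch assumes a hypothetical `answer` key and counts wrong-but-valid outputs as a share of all generations; the paper's exact metric definitions may differ.

```python
import json


def score_generations(records, required_keys):
    """records: (raw_output, expected_answer) pairs. Returns rates over all generations."""
    if not records:
        raise ValueError("no generations to score")
    valid = correct = 0
    for raw, expected in records:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not parseable: counts against validity
        if not isinstance(obj, dict) or not required_keys <= obj.keys():
            continue  # parseable but missing schema keys
        valid += 1
        if obj.get("answer") == expected:
            correct += 1
    n = len(records)
    return {
        "validity": valid / n,
        "accuracy": correct / n,
        "wrong_but_valid": (valid - correct) / n,
    }


# Hypothetical model outputs paired with the expected answer.
records = [
    ('{"answer": "4"}', "4"),
    ('{"answer": "5"}', "4"),
    ("answer is 4", "4"),
    ('{"result": "4"}', "4"),
]
metrics = score_generations(records, required_keys={"answer"})
```

A dashboard that plots only `validity` would miss the regression described above.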

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0 introduces an image and video generation model family with 6B Image Lite, 2B Video Lite, and 19B Video Pro variants, supporting high-resolution image synthesis and 10-second video generation with released open-source code and training checkpoints.

#Multimodal#Vision#Fine-tuning#Kandinsky

why featured

HKR-H and HKR-K pass: the title names an image/video foundation-model family, and the summary gives sizes plus 10-second video. The lack of license details, benchmarks, or a hands-on comparison keeps it near the featured floor.

editor take

Kandinsky 5.0 ships 19B video weights, not just a paper; open video generation finally gets a serious reproducibility anchor.

sharp

Kandinsky 5.0’s sharp move is not the “state-of-the-art” claim; it is shipping 6B Image Lite, 2B Video Lite, 19B Video Pro, open code, and training checkpoints together. Closed video systems like Sora, Veo, and Runway still win attention through demos while hiding weights, training recipes, and post-training details. This paper at least exposes the pipeline shape: data collection, filtering, clustering, multi-stage pretraining, SFT, and RL-based post-training. I’d still be careful with the quality claim. The snippet cites human evaluation, not a clearly reproducible VBench-style table, and 10-second generation is far from controllable long-form video. The useful part is that open video now has a large model people can dissect instead of another teaser clip.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Research paper finds hidden-state privacy has empty middle ground

The paper tests 1,536 Gaussian release covariances for single-layer hidden-state privacy, and zero achieve both moderate utility and moderate privacy; under an adaptive Mahalanobis attacker, the generalized-eigen mechanism collapses to 100% top-1 retrieval.

#Safety#Alignment#Interpretability#GPT-2

why featured

HKR-H/K/R all pass: the title has a counterintuitive trade-off, and the summary gives 1,536 covariance tests plus a 100% attack result. The work is technical and sourced only to arXiv, so it sits in the lower featured band.

editor take

This paper makes “just add noise to hidden states” look fragile: 1,536 Gaussian releases, zero in the useful-private middle.

sharp

Hidden-state release does not have a tuning problem; the Gaussian route has no comfortable middle. The paper tests 1,536 single-layer release covariances, and zero hit both moderate utility and moderate privacy. The generalized-eigen mechanism gets a 13× Pareto reduction under Euclidean retrieval, then collapses to 100% top-1 retrieval under an adaptive Mahalanobis attacker. That hurts the pitch behind exposing intermediate activations to tools, memory systems, or downstream agents. The only release that holds worst-attacker top-1 ≤ 0.001 across a 32 model-layer grid, a diagonal inverse-Fisher one, sits on the privacy/utility edge. The wild part is the split-memory transformer: trained from scratch at 90M parameters, it reaches G_Mah 20–33, while pretrained models top out at 9.3. This looks like an architecture constraint, not a deployment-time noise patch.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

The paper proposes an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization; a minimal doubt cue recovers failed trajectories, and small-scale SFT instills or suppresses this capability under the tested conditions.

#Reasoning#Fine-tuning#Interpretability#Research release

why featured

HKR-H/K/R all pass: the paper offers a testable reasoning mechanism and a practical debugging angle. It stays below 78 because the feed gives no author authority, scale, or reproducibility details.

editor take

This paper demystifies “Wait”: failed reasoning often lacks token budget for saying uncertainty out loud, not a hidden genius circuit.

sharp

The sharp claim here is that many LLM reasoning failures are not calculation failures. They are silent drift. The paper splits reasoning into procedural advancement and epistemic verbalization, then says a minimal doubt cue can recover failed trajectories. Small-scale SFT can also install or suppress that behavior. I buy half of it. It explains why tokens like “Wait” and “Let me check” often act like switches in chain-of-thought, and it rhymes with the long self-checking traces popularized by DeepSeek-R1-style training. But the abstract gives no model names, task suite, recovery rate, or SFT size. If this only works on toy reasoning tasks, it is prompt craft with nicer math. If it holds across GSM, MATH, and code, it becomes a cheap training knob for reasoning style.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Diff-Instruct with Diffused Reward: Principled One-step Generator Reinforcement Learning Research

The paper proposes DIDR, a data-free trajectory-level alignment method derived from Integral KL minimization; on the 6B DiT Z-Image backbone, DIDR uses one generation step and exceeds its 50-step teacher in preference alignment.

#Alignment#Multimodal#Fine-tuning#Z-Image

why featured

HKR-H/K/R pass: one-step beating a 50-step teacher is a strong hook, with Integral KL and a 6B Z-Image setup. It is a single arXiv paper with high technical load, so featured stays in the low band.

editor take

DIDR attacks the right failure mode: one-step RL hacks rewards. Beating a 50-step teacher is loud; reward robustness is the catch.

sharp

DIDR’s sharp move is putting reward back onto the diffusion trajectory, not merely making one-step generation faster. The paper’s hook is concrete: Integral KL, a Diffused Reward Score correction to the reference score, and a DRP estimator using differentiable short-step denoising. On a 6B DiT Z-Image backbone, one generation step beats its 50-step teacher on preference alignment. If that reproduces, the usual SDXL distillation recipe looks weaker: compress first, patch preference later, then hope fidelity survives. DIDR targets the exact reward-hacking gap in one-step generators, where terminal image rewards fight the noisy-space dynamics. I’m still cautious on the reward side. The abstract names preference alignment, but not the human eval size, reward model, or failure cases. Image RL has a long habit of turning aesthetic rewards into glossy artifacts; trajectory alignment fixes the mismatch, not the taste function.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

LinkedIn introduced the HLTM framework for long-term semantic memory, using a schema-aligned memory tree for multi-granularity storage and retrieval; in Hiring Assistant evaluations, it improved answer correctness by more than 5% and retrieval F1 by more than 10%.

#Agent#Memory#RAG#LinkedIn

why featured

HKR-H/K/R all pass, but the scope stays within a LinkedIn hiring-agent research result. The mechanism and metrics are concrete, yet this is below a model release or broad platform update.

editor take

LinkedIn put agent memory into hiring workflows; +5% correctness is real signal, but missing latency numbers dull the production claim.

sharp

LinkedIn’s strong move is dragging long-term agent memory into Hiring Assistant production, not publishing another RAG variant. HLTM uses a schema-aligned memory tree for multi-granularity semantic storage; the reported gains are over 5% in answer correctness and over 10% in retrieval F1. It also claims a better query-latency versus indexing-latency Pareto frontier. I buy the direction, not the full strength of the claim. Hiring agents need provenance, deletion, and low-latency user-signal management more than fancy summaries, and HLTM is aimed at that pain. But the abstract gives relative gains, not p95 latency, indexing cost, or the privacy deletion path. Compared with most “agent memory” papers, this reads like product engineering. Compared with an SRE-grade production bar, the ledger is still partly closed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow evaluates natural-language generation of executable visual workflows using real-world business workflows, with outputs designed for platforms such as Dify and Coze; its agentic baseline improves resolve rate by up to 6.05%, while the abstract reports that state-of-the-art models still struggle with correct, stable execution under complex requirements.

#Agent#Code#Benchmarking#Chat2Workflow

why featured

HKR-K is strong and HKR-R is moderate: executable visual workflows matter for agent deployment, with a 6.05% resolve-rate gain and open code. HKR-H is weak, so this sits at the featured threshold.

editor take

Chat2Workflow tests deployable Dify/Coze-style workflows; a 6.05% gain is modest, and the scar is execution stability.

sharp

Chat2Workflow hits the awkward gap in agent products: chatting through a plan is not delivering a runnable workflow. The benchmark uses real business workflows and targets deployable visual flows for platforms like Dify and Coze. Its agentic baseline raises resolve rate by only up to 6.05%, which reads as honest rather than embarrassing. A lot of workflow-generation demos survive on single-turn prompts and pretty node graphs. The hard part is keeping logic, parameters, and tool calls consistent after requirements change. SWE-bench at least has code tests as a backstop; Chat2Workflow is closer to messy business state machines. The code release helps, but the abstract does not give the model list or absolute pass rates. A 6.05% delta says the patch works; it does not say workflow engineers are getting replaced.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

The paper simulates multiple benchmark leakage settings through continued pre-training, showing that in-domain user-item interaction leakage inflates LLM-based recommendation metrics, while out-of-domain leakage usually reduces recommendation accuracy; the authors release code at https://github.com/yusba1/LLMRec-Data-Leakage.

#Benchmarking#Fine-tuning#Research release#Benchmark

why featured

HKR-H/K/R all pass: the paper offers a concrete benchmark-leakage mechanism and code, not just a new leaderboard score. Its reach is narrower than a general LLM eval release, so it stays in the featured-threshold band.

editor take

LLM recommender benchmarks just took a hit: in-domain leakage inflates scores, out-of-domain leakage hurts, so leaderboard claims need a data audit.

sharp

LLM-based recommendation evaluation has a dirtier failure mode than memorized QA benchmarks: user-item interactions can act like hidden training labels. Zhang et al. simulate leakage in arXiv:2602.13626 v3 by continuing pre-training on blended corpora with in-domain and out-of-domain interactions. The sharp result is asymmetric: in-domain leakage inflates recommender metrics, while out-of-domain leakage usually reduces accuracy. That matters because recommender data is not just text contamination; the interaction matrix is the task signal. The abstract does not disclose exact lift, datasets, or model names, so I would not treat the claim as quantified yet. But it raises the bar for LLMRec papers: HitRate and NDCG are too easy to launder unless authors show corpus audits, temporal splits, and interaction de-duplication. Code release helps; it does not make old leaderboards clean.
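Two of the audits named above, temporal splits and interaction overlap checks, fit in a short sketch. The tuple layout `(user, item, timestamp)` and both function names are assumptions for illustration; this is not code from the released repository.

```python
def temporal_split(interactions, cutoff):
    """Split (user, item, timestamp) tuples so every test interaction is at or after cutoff."""
    train = [x for x in interactions if x[2] < cutoff]
    test = [x for x in interactions if x[2] >= cutoff]
    return train, test


def leakage_rate(test_pairs, corpus_pairs):
    """Share of distinct test (user, item) pairs that also appear in the pre-training corpus."""
    test = set(test_pairs)
    if not test:
        raise ValueError("empty test set")
    return len(test & set(corpus_pairs)) / len(test)


# Hypothetical interaction log and pairs recovered from a corpus audit.
log = [("u1", "i1", 10), ("u2", "i2", 20), ("u1", "i3", 30)]
train, test = temporal_split(log, cutoff=20)
rate = leakage_rate([(u, i) for u, i, _ in test], [("u2", "i2"), ("u9", "i9")])
```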

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→OCR-Reasoning Benchmark: Unveiling MLLMs' Capabilities in Complex Text-Rich Image Reasoning

OCR-Reasoning provides 1,069 human-annotated examples across 6 reasoning abilities and 18 text-rich visual tasks, and the authors report that no evaluated recent MLLM achieved accuracy above 50% on the benchmark.

#Multimodal#Vision#Reasoning#SCUT-DLVCLab

why featured

HKR-H/K/R all pass: the benchmark gives concrete scale and a sharp sub-50% MLLM result for OCR-heavy reasoning. Single arXiv paper limits reach, so it stays in low featured.

editor take

Text-rich vision is still a tax on MLLMs: 1,069 examples, 18 tasks, and every recent model stays under 50%. OCR was never solved.

sharp

OCR-Reasoning hits the sore spot: MLLMs look strong on visual reasoning when dense text, layout, and cross-region references stay offstage. The benchmark has only 1,069 human-labeled examples, but spans 6 reasoning abilities and 18 text-rich image tasks. The reported result is brutal: no evaluated recent MLLM clears 50% accuracy. The step-by-step annotation matters more than the headline score. It can separate bad reading, bad localization, and broken reasoning instead of hiding them behind one final answer. That is exactly where enterprise “document agent” demos get slippery. Invoices, screenshots, forms, and dashboards are rarely clean VQA images. I don’t buy product claims in this lane unless they report chain-level failures, not just end-answer accuracy on curated samples.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Device Context Protocol: A Compact, Safety-First Architecture for LLM-Driven Control of Constrained Devices

DCP controls constrained devices with sub-50-byte typical frames and a host-side Bridge, while its ESP32 firmware uses 27.6 KB flash and 0.6 KB RAM; in 675 tool calls across five LLMs and six adversarial prompt categories, it rejected 100% of capability-escalation attempts and 78% of prompt-injection attempts.

#Agent#Safety#Tools#DeepSeek

why featured

HKR-H/K/R all pass: DCP links LLM device control, tiny frames, and attack blocking in one paper. Kept at 74 because it is an arXiv release with no adoption signal or cross-source discussion yet.

editor take

DCP drags MCP-style tool use into hardware: 27.6KB flash is impressive, but 78% prompt-injection rejection is not enough for real devices.

sharp

DCP’s useful move is pushing LLM hardware failures into the host Bridge, before bytes hit the device. The numbers are unusually concrete: sub-50-byte typical frames, 27.6KB flash and 0.6KB RAM on ESP32, and 100% rejection of capability escalation across 675 calls. Raw MCP and IoT-MCP sat at 0–1% in the same comparison. I don’t buy the full “safety-first” framing yet. The prompt-injection rejection rate is 78%, across five LLMs and six adversarial prompt classes. That is a solid research result, not a deployment bar for motors, locks, lab gear, or medical peripherals. MCP is drifting toward SaaS connectors; DCP attacks the neglected MCU layer. But physical control has a harsher threshold than API cleanup, and 22% leakage is where the incident report starts.
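The host-side Bridge pattern can be sketched as "check capability, then encode a tiny frame". The 6-byte layout, opcodes, and allowlist below are entirely hypothetical and do not reflect the DCP wire format; they only show why rejection can happen before any bytes reach the device.

```python
import struct

# Hypothetical frame: device id (u8), opcode (u8), sequence (u16), argument (i16), little-endian.
FRAME = struct.Struct("<BBHh")

# Hypothetical per-device capability allowlist held by the host-side Bridge.
ALLOWED_OPCODES = {1: {0x01, 0x02}}


def bridge_encode(device_id, opcode, sequence, argument):
    """Reject calls outside the device's declared capabilities, then pack the frame."""
    if opcode not in ALLOWED_OPCODES.get(device_id, set()):
        raise PermissionError(f"opcode {opcode:#04x} not allowed for device {device_id}")
    return FRAME.pack(device_id, opcode, sequence, argument)


frame = bridge_encode(device_id=1, opcode=0x01, sequence=7, argument=-250)
```

An allowlist like this handles capability escalation cleanly; it does nothing for an injected prompt that asks for an allowed opcode with a harmful argument, which is where the 78% figure bites.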

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Learning When to Think While Listening in Large Audio-Language Models

The authors trained a wait-think-answer controller on Qwen2.5-Omni-7B, raising row-weighted accuracy from 67.6% to 70.3% on a six-task SRQA benchmark and reducing post-endpoint final-think length by 14% under the same deployment harness.

#Audio#Reasoning#Fine-tuning#Qwen

why featured

HKR-H/K/R all pass, but the audience scope is narrow: this is a timing-control paper for audio-language reasoning, not a model launch. The 67.6% to 70.3% SRQA gain and 14% shorter final-think justify the featured threshold.

editor take

Audio reasoning needs a timing policy, not just better answers; this Qwen2.5-Omni-7B result is modest at 70.3%, but the target is right.

sharp

Streaming audio models fail less on hearing and more on timing their cognition. The Qwen2.5-Omni-7B wait-think-answer controller lifts row-weighted SRQA accuracy from 67.6% to 70.3% and cuts post-endpoint final-think by 14%. That is not a huge capability jump, but the training target is the right one: the reward covers correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency. I’d be careful with the victory lap. The headline benchmark is synthetic six-task SRQA, and Real Audio Bench has only 186 human-recorded items. SFT gets the strongest accuracy there, while six-reward DAPO mainly wins by keeping final-think below the base. For spoken agents, that latency-side win still matters; a few hundred visible milliseconds can kill the interaction.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Sparse Autoencoder-Guided Post-training Data Engineering for Large Language Models

SAERL uses SAE-extracted diversity, difficulty, and quality signals for RL data engineering, improving average accuracy by 3.00% over vanilla GRPO on Qwen2.5-Math-1.5B and reaching the target accuracy with 20% fewer training steps.

#Fine-tuning#Interpretability#Reasoning#Qwen

why featured

HKR-H/K/R pass: the paper links SAEs to post-training data engineering with Qwen2.5-Math-1.5B results, +3.00% accuracy, and 20% fewer steps. Single-source arXiv research keeps it at the featured threshold, not higher.

editor take

SAEs are finally being used to steer training data, not just explain models; 3% on Qwen2.5-Math-1.5B is a useful but narrow proof.

sharp

SAERL’s useful claim is not the 3.00% gain over vanilla GRPO; it is wiring SAE features into the RL data pipeline. The paper maps diversity, difficulty, and quality to batch mixing, curriculum ordering, and filtering. On Qwen2.5-Math-1.5B, it reaches the target accuracy with 20% fewer training steps. I buy the direction, but the “reusable data engineering tool” claim is early. The disclosed hook is a math setup on a 1.5B Qwen model, and the snippet does not give data scale or SAE training cost. Compared with the last year of reward-model filtering, synthetic math ramps, and rejection sampling, SAE signals look like a better instrument panel. They become infrastructure only if the same trick holds on code, agent traces, and long-context post-training data.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

The paper proposes a Jensen bias correction for quantized KV caches in video diffusion, using per-attention-score adjustments from cached-key quantization steps and query norms; on MAGI-1, SkyReels-V2, and HY-WorldPlay, INT2 recovers most quality loss, reaches near-BF16 video quality, and uses 50% less memory than INT4.

#Inference-opt#Multimodal#Vision#MAGI-1

why featured

HKR-H/K/R all pass, but this is a single arXiv inference-optimization paper with a high technical bar and no disclosed open-source artifact or broad replication, so it sits at the lower featured band.

editor take

INT2 KV-cache getting near BF16 is not a compression flex; the smart part is treating softmax Jensen bias as the bug, not the noise.

sharp

This paper lands because it pins KV-cache quantization loss on attention math, not generic compression damage. The claim is precise: quantized cached keys get inflated by softmax’s exponential, stealing attention mass from the unquantized current chunk. The fix uses cached-key quantization step sizes and query norms, with a second-order Taylor approximation, zero extra cache memory. INT2 reaching near-BF16 on MAGI-1, SkyReels-V2, and HY-WorldPlay while using 50% less memory than INT4 is a practical inference result for long video diffusion. I’m cautious on the phrase “near-BF16”: the snippet gives no concrete metric table or human eval protocol. If the full paper backs that with consistent temporal quality scores, this is cleaner than another vague video compression trick.
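A minimal sketch of why the exponential inflates quantized keys, and what a score-level correction could look like. The noise model is my assumption, not the paper's: quantization error is zero-mean, uniform with step size `delta`, and independent per coordinate, so the error in a score q·k has variance ||q||² · delta² / 12. A second-order Taylor expansion then gives an inflation factor of about 1 + variance / 2, or roughly ||q||² · delta² / 24 in the log domain. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def corrected_scores(scores, step_sizes, query_norm_sq):
    """Subtract an approximate Jensen bias from each attention score.

    Assumed noise model (not from the paper): uniform, zero-mean,
    per-coordinate independent error with step `delta`, which inflates
    exp(score) by about 1 + ||q||^2 * delta^2 / 24. Keys that were never
    quantized use delta = 0 and are left unchanged.
    """
    return [
        s - query_norm_sq * delta * delta / 24.0
        for s, delta in zip(scores, step_sizes)
    ]


# Two quantized cached keys and one unquantized current-chunk key.
raw = [1.0, 1.0, 1.0]
steps = [0.5, 0.5, 0.0]
before = softmax(raw)
after = softmax(corrected_scores(raw, steps, query_norm_sq=16.0))
```

In this toy case the correction shifts attention mass back toward the unquantized key, and it needs only the step sizes and the query norm, which matches the "zero extra cache memory" claim in spirit.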

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

The paper applies propensity-adjusted associational analysis to optimized prompts across multiple optimization frameworks, LLM backbones, and NLP benchmarks, finding that complexity-increasing and meta-instructional edits are negatively associated with math and multi-hop reasoning performance.

#Reasoning#Tools#Benchmarking#DSPy

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with only method and claim summarized. It clears featured, not the 78+ band for broader industry-moving research.

editor take

Prompt optimization just got audited at the edit level; the “add more clever instructions” reflex looks worse than lazy engineering.

sharp

The useful cut here is edit-level accountability for DSPy and TextGrad-style optimizers. The paper spans 17 pages, 4 figures, and 8 tables, using propensity-adjusted associational analysis across optimization frameworks, LLM backbones, and NLP benchmarks. Its sharpest finding: complexity-increasing and meta-instruction edits are negatively associated with math and multi-hop reasoning. That hits a bad habit in prompt-optimizer pipelines. Many systems treat longer instructions, role constraints, and self-checking wrappers as default wins. This paper says the gains are task-conditioned: step-by-step and meta-cognitive edits help logical and sequential reasoning, while heavier meta packaging hurts harder reasoning tasks. Don’t oversell the causal claim; the authors call it observational analysis. For builders, that is still enough signal: choose edit families by task type, instead of letting an optimizer inflate prompts until the benchmark moves.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

The paper introduces MUSE, a two-stage evaluation framework that maps an LLM’s epistemic uncertainty on an initial query to its probability of yielding to later user pushback, separating sycophantic conformity from uncertainty-driven conformity under user expertise and suggestion plausibility conditions.

#Alignment#Safety#Benchmarking#Research release

why featured

HKR-H/K/R pass, but the feed only gives the framework mechanism; experiment size, model list, and main results are not disclosed. This fits a featured-threshold alignment benchmark paper.

editor take

MUSE treats conformity as a measurable failure surface, not a morality play about RLHF sycophancy. That is the useful move.

sharp

MUSE’s useful move is splitting “the model folded” into two different failure modes. The framework first estimates epistemic uncertainty on an initial answer, then measures yielding after user pushback. It also ablates perceived user expertise and suggestion plausibility. That is closer to a diagnostic tool than another sycophancy leaderboard. I buy the framing because deployed assistants pay for both errors: stubborn wrong answers and confident answers that collapse under pressure. Calling high-certainty yielding “sycophantic conformity” and uncertainty-linked yielding “uncertainty-driven conformity” gives teams different levers. The missing piece is operational: the abstract does not disclose model list, task scale, or the rule for judging a yield. Without those, MUSE is a good measurement vocabulary, not yet a CI-ready safety metric.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·27

→Focal Reward: Balanced Reinforcement Learning with Rubric-Based Rewards

The paper proposes Focal Reward, using inverse reward projection to estimate saturation per rubric criterion and automatically reweight rewards, and it beats the strongest static aggregation baseline across all 18 comparisons from three model scales and six benchmarks.

#Reasoning#Alignment#Fine-tuning#Research release

why featured

HKR-K/R pass: the mechanism and 18 comparisons are testable, and rubric reward imbalance matters to RLHF practitioners. HKR-H is weak, and this is a single arXiv paper, not same-day must-write.

editor take

Focal Reward hits a real rubric-RL failure mode: nice average scores, rotten subcriteria. 18/18 wins are strong, but user preference data is missing.

sharp

Focal Reward matters because it models a familiar rubric-RL bug: the average reward improves while one subcriterion stays broken. The mechanism is concrete: inverse reward projection estimates saturation for each rubric criterion, then shifts weight online toward dimensions with remaining headroom. The paper reports wins over the strongest static aggregation baseline in all 18 comparisons across 3 model scales and 6 benchmarks, which is stronger than a single leaderboard bump. My caution is the evidence stays inside rubric scores and ablations. The abstract does not disclose human preference results or live task success rates. RLHF and RLAIF work has shown the same trap for a year: clean reward curves often fail to map to user-visible quality. I’d put Focal Reward in the fine-tuning toolbox, not in the alignment victory column.
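The reweighting idea can be sketched as below. The abstract does not spell out inverse reward projection, so this stand-in uses a cruder saturation proxy that I am assuming for illustration: a criterion whose running mean reward is already near 1 has little headroom and gets little weight. Both function names are hypothetical.

```python
def headroom_weights(mean_scores, eps=1e-8):
    """Turn per-criterion mean rewards in [0, 1] into normalized weights.

    Saturation proxy (an assumption, not the paper's inverse reward
    projection): headroom = 1 - mean reward for that criterion.
    """
    headroom = [max(0.0, 1.0 - s) for s in mean_scores]
    total = sum(headroom)
    if total < eps:
        # Every criterion is saturated: fall back to uniform weights.
        return [1.0 / len(mean_scores)] * len(mean_scores)
    return [h / total for h in headroom]


def aggregate_reward(criterion_scores, weights):
    """Weighted sum of one response's per-criterion rubric scores."""
    return sum(w * s for w, s in zip(weights, criterion_scores))


# Criterion 0 is nearly saturated, so criteria 1 and 2 dominate the reward.
weights = headroom_weights([0.9, 0.5, 0.6])
reward = aggregate_reward([1.0, 0.0, 1.0], weights)
```

Recomputing the weights from fresh running means each step is what would make this "online"; a static average is the special case where the weights never move.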

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

ORLoopBench introduces 5,362 LP/MILP repair instances and frames infeasible-model repair as a solver-in-the-loop MDP, while solver-verified RLVR training lets an 8B model reach 95.3% RR@5 on LP repair versus 92.4% for frontier APIs.

#Agent#Reasoning#Benchmarking#Ruicheng Ao

why featured

HKR-H/K/R all pass: the 8B-vs-frontier-API result is a hook, with 5,362 cases and RR@5 numbers. The OR/LP/MILP scope is narrow, so it stays below featured.

editor take

ORLoopBench ships 5,362 LP/MILP repair cases; an 8B model hits 95.3% RR@5, making solver feedback look saner than code regen.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

The paper adds a 781-node, 955-edge knowledge graph to 139 industrial maintenance scenarios, where deterministic graph handlers score 99%, GPT-4-generated Cypher scores 82-83%, and the original tool-augmented GPT-4 baseline scores 65%.

#Agent#Reasoning#Tools#arXiv

why featured

HKR-H/K/R pass: the missing-data-layer hook, 139-scenario benchmark, and enterprise reliability angle are clear. Narrow industrial-ops scope and no product or open-source artifact keep it in 60-71.

editor take

A 781-node graph lifts GPT-4 from 65% to 82–83%; industrial agents need queryable data before fancier orchestration.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Causal Representation Learning for Generalisable Recommendation

The paper proposes a CRL disentanglement objective for recommender distribution shift, requires only existing confounded logs with no inference-time cost, and reports offline parity plus online engagement gains in a Spotify A/B test with millions of users, KuaiRand, and a synthetic benchmark.

#Reasoning#Benchmarking#Spotify#KuaiRand

why featured

HKR-H/K/R pass, but this is a vertical recommender-systems paper. Spotify million-user A/B evidence lifts credibility, yet it is not a same-day must-write for the broader AI crowd.

editor take

Spotify tested CRL on millions of users; offline parity and online gains are reported, but lift size is undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

The paper compares four open-source PDF-to-Markdown frameworks—Docling, MinerU, Marker, and DeepSeek OCR—across 21 RAG pipeline configurations on 36 Portuguese administrative documents, and Docling with hierarchical splitting plus image descriptions reaches 94.1±1.6% automated QA accuracy.

#RAG#Benchmarking#Docling#MinerU

why featured

HKR-H/K/R pass: the paper has a practical RAG hook and concrete benchmark numbers. It stays in all because the corpus is limited to Portuguese administrative documents, so general enterprise transfer is unproven.

editor take

Docling hits 94.1% on 36 Portuguese admin PDFs; the 33-point table-question gap is the useful warning.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

ECHO-2 combines centralized learning with distributed rollouts for GRPO post-training on 4B to 32B LLMs, using user-controlled bounded policy staleness, peer-assisted pipelined broadcast, and cost-aware heterogeneous worker activation to improve cost efficiency while keeping RL reward comparable to strong baselines.

#Reasoning#Inference-opt#Fine-tuning#ECHO-2

why featured

HKR-K and HKR-R pass: the summary gives mechanisms and a cost angle. With only an arXiv abstract and no savings number, open-source status, or reproducible details disclosed, it stays high-all, not featured.

editor take

ECHO-2 tests GRPO on 4B–32B LLMs; bounded staleness is practical, but cost gains lack disclosed numbers.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for RL in Code Generation

VeRPO converts test-case-level partial success into dense verifiable rewards for code-generation RL, and across multiple benchmarks it beats outcome-reward and reward-model baselines by up to +8.83 pass@1, with less than 0.02% extra time cost and zero additional GPU memory overhead.

#Code#Fine-tuning#Reasoning#Longwen Wang

why featured

HKR-H/K pass: VeRPO turns test-case partial success into dense verifiable rewards and reports +8.83 pass@1 with tiny overhead. Its reach is mostly code-model training research, not a major-lab or product event, so it stays in 60–71.

editor take

VeRPO gets up to +8.83 pass@1 from partial test passes; in code RL, RM supervision now has a harder ROI case.
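The gap between an outcome reward and a test-case-level reward is easy to show. This sketch assumes every test counts equally; the feed summary does not give VeRPO's actual per-test weighting, and both function names are hypothetical.

```python
def dense_test_reward(test_results):
    """Fraction of unit tests passed, in [0, 1].

    Assumption: uniform weight per test. The paper's real weighting
    scheme is not given in the feed summary.
    """
    if not test_results:
        return 0.0
    return sum(1 for passed in test_results if passed) / len(test_results)


def binary_outcome_reward(test_results):
    """Baseline outcome reward: 1.0 only if every test passes."""
    return 1.0 if test_results and all(test_results) else 0.0


# A solution that passes three of four tests gets no signal from the
# binary reward but a graded signal from the dense one.
results = [True, True, False, True]
dense = dense_test_reward(results)
binary = binary_outcome_reward(results)
```

Because the signal comes from tests that were already being run, the extra cost is a division, which is consistent with the near-zero overhead the paper reports.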

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

MONA adds an acceleration term from the exponential moving average of gradient differences into Muon’s gradient pipeline, and outperforms Muon and AdamW across 1B to 68B MoE pretraining runs, with the largest model trained on 1 trillion tokens.

#Fine-tuning#Inference-opt#Benchmarking#MONA

why featured

HKR-K is strong: MONA gives a gradient-difference EMA mechanism plus 1B-68B MoE and 1T-token tests. HKR-H has a scale hook, but the optimizer-paper audience is narrow and code, lab backing, and external replication are not disclosed.

editor take

MONA beats Muon/AdamW from 1B to 68B MoE, with the largest run at 1T tokens; I want reproduction cost, not another SOTA claim.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

The study tested Claude Haiku 4.5 on 1,000 GSM-Symbolic problems and compared CoT, PAL, and SBSC on original and modified pairs; CoT had a 1.3-point accuracy drop, PAL dropped 1.7 points, and code execution did not improve robustness for grade-school math variations.

#Reasoning#Code#Benchmarking#Claude

why featured

HKR-H/K/R all pass: the code-vs-reasoning hook is clear, and the paper gives Claude Haiku 4.5 results on 1,000 GSM-Symbolic items. Still, it is a single benchmark paper, below model-release or major product-update weight.

editor take

Claude Haiku 4.5 ran 1,000 items; PAL dropped 1.7 points. Python execution is no robustness patch here.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

ReMoE fine-tunes the router to bias MoE token routing toward recently selected experts, raising expert reuse by 26% on DeepSeek and Qwen models while preserving downstream performance, increasing vLLM GPU-CPU offloading throughput by 8.4%, and reducing TPOT by 43.6%-49.8% on llama.cpp with Jetson Orin NX.

#Inference-opt#Fine-tuning#DeepSeek#Qwen

why featured

HKR-K and HKR-R pass via concrete MoE inference numbers and cost pressure. HKR-H is weak, and the arXiv systems angle is too narrow for featured without code, adoption, or cross-source discussion.

editor take

ReMoE lifts expert reuse 26% and cuts Jetson TPOT nearly in half; MoE edge latency is back to router training.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Representation-Aware Unlearning via Activation Signatures: From Suppression to Entity-Signature Erasure

ERUF mines entity-specific activation signatures and distills suppression into LoRA parameters, reaching FQ 0.99 and MU 0.62 on TOFU forget10, while reducing adversarial entity recovery on Llama-3.1-8B from 63.89% to 20.15%.

#Fine-tuning#Safety#Interpretability#ERUF

why featured

HKR-H/K/R pass: the method shift, metrics, and safety use case are concrete. It stays in all because this is a single arXiv method paper without deployment, artifact evidence, or cross-source discussion.

editor take

ERUF hits FQ 0.99 and MU 0.62 on TOFU forget10; unlearning audits need activation evidence, not refusal-rate theater.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

The paper introduces SDPG, a visual reinforcement learning method that trains visuomotor policies end to end within hours on one NVIDIA RTX 4080, estimates gradients through random trajectory perturbations, and reports better training time, memory use, and rewards than baselines on visual MuJoCo benchmarks.

#Robotics#Vision#Benchmarking#NVIDIA

why featured

HKR-H/K/R pass: SDPG has a testable single-RTX-4080 efficiency claim. It stays in all because this is a specialized visual-RL paper without a major-lab release, open-source artifact, or cross-source discussion signal.

editor take

SDPG trains visuomotor policies in hours on one RTX 4080; the credible bit is fewer batch-rendered environments via rollout perturbations.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Athena: Enhancing Multimodal Reasoning with Data-Efficient Process Reward Models

Athena-PRM trains a multimodal process reward model with 5,000 samples and improves Qwen2.5-VL-7B test-time scaling by 10.2 points on WeMath and 7.1 points on MathVista.

#Reasoning#Multimodal#Alignment#Athena-PRM

why featured

HKR-K/R pass: concrete sample count, test-time scaling setup, and benchmark gains. Single arXiv paper with an academic title and no disclosed open-source artifact or adoption keeps it in the interesting-research band.

editor take

Athena-PRM gets +10.2 WeMath from 5,000 samples; multimodal PRM cost arguments just took a hit.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Research shows not all transitions matter for PPO learning

The paper tests random transition dropping for PPO across five environments, and a 25% drop rate preserves rewards while stabilizing KL divergence, policy entropy, and value estimates.

#Agent#Reasoning#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv PPO training technique with a narrow RL audience and no evidence yet for RLHF or production agent training transfer, so it stays in all.

editor take

Dropping 25% of PPO transitions at random keeps rewards across 5 environments; this tiny tweak deserves to be a default more than another RL wrapper does.
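Random transition dropping is small enough to sketch in a few lines. How the paper samples the dropped transitions is not stated in the feed, so uniform sampling without replacement is my assumption, and `drop_transitions` is a hypothetical helper rather than anything from the authors' code.

```python
import random


def drop_transitions(transitions, drop_rate=0.25, seed=None):
    """Return a random subset of a rollout, discarding `drop_rate` of it.

    Order is preserved among the kept transitions. Uniform sampling
    without replacement is an assumption, not the paper's stated method.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    n_keep = len(transitions) - int(round(len(transitions) * drop_rate))
    keep_idx = sorted(rng.sample(range(len(transitions)), n_keep))
    return [transitions[i] for i in keep_idx]


# A 100-step rollout shrinks to 75 transitions before the PPO update.
rollout = list(range(100))
kept = drop_transitions(rollout, drop_rate=0.25, seed=0)
```

In a real PPO loop, advantages would presumably be computed on the full rollout first, since they depend on neighboring steps, with the drop applied only to the minibatch pool.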

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Coordinate-Wise Curvature Differences Localize Memorized Regions in Diffusion Models

The paper proposes coordinate-wise curvature-difference methods to localize memorized regions in diffusion outputs, subtracting curvature from an underfitted baseline such as an unconditional or less-trained model, and experiments on Stable Diffusion with ground-truth memorization masks outperform a prior attention-based localization method.

#Vision#Safety#Interpretability#Stable Diffusion

why featured

HKR-K/R pass: the paper offers a concrete localization mechanism and Stable Diffusion mask evaluation. HKR-H is weak; single-source arXiv research with a narrow method stays in the interesting band.

editor take

Curvature differences beat attention baselines on Stable Diffusion memorization masks; privacy tooling needs region-level blame, not image-level alarms.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Diet Your LLM: Dimension-wise Global Pruning via Merged Task-Specific Importance Scores

DIET profiles activation magnitudes with 100 samples per task and uses majority voting to build one global mask; on Gemma-2 2B at 20% sparsity, it reports nearly 10% higher average accuracy than prior structured pruning methods across seven zero-shot benchmarks.

#Inference-opt#Benchmarking#Gemma#Research release

why featured

HKR-K is strong: the paper states a concrete pruning mechanism and test setup. HKR-H and HKR-R pass, but impact stays within model-compression research rather than a major model or product release.

editor take

DIET builds one mask from 100 samples per task; nearly +10% at 20% sparsity is nice, but testing only on Gemma-2 limits the claim.
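The merge step, per-task importance scores voted into one global mask, can be sketched like this. The top-k selection per task and the strict-majority rule are my assumptions about the details; the feed only says activation magnitudes and majority voting. Function names are hypothetical.

```python
def task_topk_mask(importance, keep_ratio):
    """Mark the top `keep_ratio` fraction of dimensions for one task."""
    n_keep = max(1, int(len(importance) * keep_ratio))
    ranked = sorted(
        range(len(importance)), key=lambda i: importance[i], reverse=True
    )
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(importance))]


def majority_vote_mask(per_task_importance, keep_ratio=0.8):
    """Keep a dimension if more than half of the tasks vote to keep it."""
    masks = [task_topk_mask(imp, keep_ratio) for imp in per_task_importance]
    n_tasks = len(masks)
    n_dims = len(masks[0])
    return [sum(m[d] for m in masks) * 2 > n_tasks for d in range(n_dims)]


# Three tasks, four dimensions, each task keeps its top two.
scores = [
    [4, 3, 2, 1],
    [4, 1, 3, 2],
    [1, 4, 3, 2],
]
mask = majority_vote_mask(scores, keep_ratio=0.5)
```

Note that a plain vote does not guarantee the target sparsity: in the toy example three dimensions survive even though each task kept two, so a real implementation would need some way to hit the budget exactly.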

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec uses early-exit models for intermediate verification in speculative decoding, reuses KV caches and hidden states across draft, verifier, and target models, and reports 1.28x average throughput improvement and up to 2.01x over single-layer speculation without accuracy loss.

#Inference-opt#HiSpec#Research release

why featured

HiSpec offers a concrete mechanism and speed numbers for inference teams. As a single arXiv paper with no code, deployment case, or independent replication disclosed, it stays in all rather than featured.

editor take

HiSpec reports 1.28x average throughput; don’t budget for 2.01x until EE training and serving costs are counted.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit uses a unified discrete diffusion Transformer for text, image, and vision-language reasoning tasks, combining a pretrained text-to-image backbone with a lightweight text decoder; the arXiv snippet claims competitive or superior quality and efficiency versus larger autoregressive models but does not disclose parameter counts.

#Multimodal#Vision#Inference-opt#Muddit

why featured

HKR-H and HKR-K pass on the unified discrete diffusion angle and concrete architecture. HKR-R is weak because the post gives no scale, benchmark result, or usable artifact, keeping it in the 60–71 research-signal band.

editor take

Muddit unifies text and image via discrete diffusion, but parameter count is undisclosed; I won’t buy “beats larger AR” without reproducible runs.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Scalable GANs with Transformers

The paper introduces GAT, a latent-space GAN with purely transformer-based generators and discriminators, and reports that GAT-XL/2 reaches FID 2.96 on ImageNet-256 after 40 epochs, using 6x fewer epochs than strong baselines.

#Vision#Multimodal#Benchmarking#arXiv

why featured

HKR-H/K pass: the Transformer-GAN angle and FID 2.96 after 40 epochs add signal. HKR-R is narrow because the impact is mostly for vision-generation researchers, with no product or cost hook.

editor take

GAT-XL/2 hits FID 2.96 on ImageNet-256 in 40 epochs; GANs have a pulse again, if code reproduces.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit runs streaming contrastive learning across ARM clients from Raspberry Pi 4 to Apple M2, using a Hybrid Loss and an RL-based adaptive splitter to cut per-sample latency by up to 4.7x, bandwidth by 77.1%, and energy by 52.3% versus server-centric baselines while staying within 2.2% accuracy.

#Audio#Embedding#Inference-opt#Raspberry Pi

why featured

HKR-K and HKR-R pass on concrete ARM latency, bandwidth, and energy numbers. HKR-H is weak because the angle is academic and narrow, so this stays high all rather than featured.

editor take

StreamSplit cuts ARM edge latency by 4.7x; I’d stress-test its RL splitter under real noise and flaky networks.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen introduces a sensitive-content scene graph benchmark with 13,999 frames from 157 movies, 16 sensitivity tags, and 5 categories; its 241M student model improves SenBen Recall by 6.4 percentage points over standard cross-entropy training.

#Vision#Multimodal#Benchmarking#SenBen

why featured

HKR-K and HKR-R pass: the paper gives dataset size, label structure, and a student-model gain. HKR-H is weak, and a single arXiv benchmark does not clear the featured bar.

editor take

SenBen ships 13,999 sensitive scene-graph frames; the 241M student beating most safety APIs at 7.6x speed is the sting.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas proposes an audit protocol for LLM agent evaluation, using a six-state control-decision taxonomy, a 0/1/2 coverage audit across 15 benchmarks, and a synthetic 1,342-item study with eight models.

#Agent#Benchmarking#AgentAtlas#Research release

why featured

HKR-K/R pass: the paper offers concrete audit structure and speaks to agent-eval trust. Single arXiv paper with no named lab or adoption signal keeps it in the high 60–71 band.

editor take

AgentAtlas audits 15 benchmarks and 1,342 items; I buy the push: success-only agent leaderboards are willful blindness.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant uses train-free PSOT to reshape LLM activation distributions for low-bit quantization; under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% versus the previous state of the art.

#Inference-opt#InfoQuant#LLaMA-2#Research release

why featured

HKR-K and HKR-R pass: the paper gives concrete accuracy numbers and targets inference cost. HKR-H is weak, and a single arXiv quantization paper with specialist framing stays below featured.

editor take

InfoQuant keeps 97% FP accuracy at W4A4KV4; if train-free PSOT reproduces, 4-bit activation excuses get thinner.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?

GraphIP-Bench evaluates 12 extraction attacks, 12 defenses, 10 public graphs, 3 GNN backbones, and 3 graph-learning tasks under one black-box protocol, finding that GNN extraction is easy at medium query budgets and that many defenses lose watermark verification signal on extracted surrogates.

#Benchmarking#Safety#Tools#GraphIP-Bench

why featured

HKR-H/K/R pass: the theft angle is clickable, and the post gives a reproducible benchmark scale plus the medium-query finding. It stays in all because GNN security is a narrow research lane, not a broad model or product update.

editor take

GraphIP-Bench runs 12 attacks and 12 defenses; medium query budgets steal GNNs, and watermarks fade on surrogates.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Ethical Fairness without Demographics in Human-Centered AI

The paper introduces Flare, a demographic- and heterogeneous-attribute-agnostic framework that uses Fisher Information to find latent performance strata, applies do-no-harm regularization, and reports improved ethical fairness across EDA, OhioT1DM, IHS, and Percept-R sensing datasets.

#Alignment#Safety#Interpretability#Flare

why featured

HKR-H/K/R all pass, but this is a single arXiv research item with no code, deployment, or cross-source debate disclosed. It stays in the 60–71 research-interest band.

editor take

Flare uses Fisher Information for latent strata; demographic-free fairness is deployable, but BHE risks marking its own homework.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

The paper introduces Mimic Score and Grad-Mimic to select data by measuring alignment between sample gradients and a target direction induced by a pre-trained reference model; across six image datasets, the method improves data efficiency and trains CLIP models with 20.7% fewer steps.

#Vision#Fine-tuning#Benchmarking#arXiv

why featured

HKR-K and HKR-R pass via a concrete data-selection method and 20.7% fewer CLIP training steps. The arXiv paper is still training-pipeline-heavy, and HKR-H is weak.

editor take

Grad-Mimic cuts CLIP training steps by 20.7%. Nice trick: no validation set; obvious risk: reference-model bias becomes the filter.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Tracing Refusal Dynamics: Using Latent Refusal Trajectories for Robust Jailbreak Detection

The paper proposes SALO, a lightweight white-box detector that reads raw hidden-state volumes from a selected layer window and improves jailbreak detection across Qwen, Llama, and Mistral models under a fixed XSTest-calibrated operating point.

#Safety#Interpretability#Benchmarking#Qwen

why featured

HKR-K and HKR-R pass: the mechanism is concrete and tested on Qwen, Llama, and Mistral. No gain size, false-positive rate, or artifact is disclosed, so this stays a useful research item, not featured.

editor take

SALO reads layer-window hidden states for jailbreaks; gains aren’t disclosed, so I’d treat it as a white-box probe, not product defense.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

arXiv:2605.26133 defines pretraining data exposure as determining whether specific samples appeared in an LLM pretraining corpus, and surveys membership inference, data contamination, attack and defense methods, empirical findings, and open research challenges under one PDE framework.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K/R pass: membership inference and contamination tie directly to LLM security and eval trust. As an arXiv survey with no new empirical numbers disclosed in the feed, it stays in the interesting-not-featured band.

editor take

arXiv:2605.26133 folds contamination and membership inference into PDE; useful survey, not a new defense layer.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→LLM-guided Hierarchical Search for End-to-end Reasoning Intensive Retrieval

The paper proposes LATTICE, an LLM-guided hierarchical search method that traverses a navigable index without an embedding model at search time; on BRIGHT, base LATTICE reaches 46.7 nDCG@10, while LATTICE++ fusing cheap retrieval reaches 49.1.
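
For readers who want the metric spelled out, here is a short sketch of the standard nDCG@k formula. It assumes the input is the list of graded relevance labels in ranked order and that the list contains every judged document for the query. The reported 46.7 and 49.1 are presumably this value scaled by 100.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each hit is discounted by log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([1, 0, 0])   # relevant document ranked first
demoted = ndcg_at_k([0, 1])      # relevant document ranked second
print(perfect, demoted)
```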

#RAG#Reasoning#Benchmarking#LATTICE

why featured

HKR-K is strong and HKR-R is limited to RAG practitioners: the paper gives a concrete mechanism and BRIGHT scores. As a single arXiv method paper with no product or code disclosed, it stays in the 60–71 band.

editor take

LATTICE hits 46.7 nDCG@10 on BRIGHT; I buy the recall critique, but the cost curve is still under-specified.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Understanding the Challenges in Iterative Generative Optimization with LLMs

The paper studies LLM-based generative optimization for iteratively improving code, workflows, or prompts, and reports that only 9% of surveyed agents used any automated optimization in practice.

#Agent#Reasoning#Benchmarking#MLAgentBench

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with only the survey result and topic disclosed; methods and reproducible findings are not given, so it stays in the 60–71 band.

editor take

Only 9% of surveyed agents use auto-optimization; self-improvement still breaks on starting artifacts, trace truncation, and batch design.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

ARBITER uses the base model’s sampled outputs, hidden states, and derived evidence to correct majority-vote failures in test-time sampling. On Llama-3.1-8B MMLU-HS-Math, it raises accuracy from the mid-78% range to the mid-82% range, and recovers about 22% of same-pool oracle headroom without external information.

#Reasoning#Inference-opt#Benchmarking#Qwen

why featured

HKR-H/K/R pass via the majority-vote failure hook, a concrete hidden-state mechanism, and a benchmark gain from the mid-78% range to the mid-82% range. Single arXiv paper with narrow task scope keeps it in the 60–71 band.

editor take

ARBITER lifts Llama-3.1-8B math accuracy from mid-78% to mid-82%; majority vote picks stable basins, not truth.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

Zeyi Huang and 10 coauthors present Latent Recurrent Transformer, which reuses a source-layer hidden state from the previous token as recurrent memory for the next token, preserves the KV-cache interface, trains with interleaved parallel training at roughly 2× baseline compute, and adds as little as 0.3% more parameters.

#Reasoning#Memory#Inference-opt#Zeyi Huang

why featured

HKR-H and HKR-K pass: LRT gives a recurrent-memory mechanism plus compute and parameter numbers. HKR-R is weak; the excerpt lacks scale, gains, or reproducible setup, so it stays in the lower research-paper band.

editor take

LRT adds prior-token hidden-state memory with as little as 0.3% extra parameters; the catch is roughly 2× training compute, not free reasoning.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→When Does LeJEPA Learn a World Model?

The paper proves LeJEPA can linearly recover world latent variables from nonlinear observations under stationary additive-noise transitions, with the guarantee holding uniquely for Gaussian latent distributions, and validates the theory on tasks from 2D examples to 1024-dimensional latents and pixel-based robotic control.

#Reasoning#Robotics#Alignment#LeJEPA

why featured

HKR-H/K pass: the title has a concrete world-model hook and the summary gives theorem conditions plus experiment scale. The theory-heavy angle narrows practitioner relevance, so it stays in the 60–71 research-signal band.

editor take

LeJEPA gets a proof under stationary additive-noise transitions and Gaussian latents; 1024-D and robot pixels help, but don’t sell “world model” too broadly.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC derives adaptive mixing weights from online estimates of gradient variance and disagreement between SFT and RL signals, improves hybrid post-training on math, code, science, and logic benchmarks, and adds less than 1% training overhead while reusing existing training tensors.

#Fine-tuning#Reasoning#Code#Research release

why featured

HKR-K/R pass: GAC gives a testable SFT-RL mixing rule using gradient variance and signal divergence with <1% overhead. HKR-H is weak; single arXiv paper lacks external replication or product impact.

editor take

GAC tunes SFT-RL mixing via gradient variance and signal disagreement, with under 1% overhead; I buy the direction, but gains and model sizes are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Advancing Creative Physical Intelligence in Large Multimodal Models

The paper introduces MM-CreativityBench to evaluate creative tool use by LMMs in visually rich, physically constrained scenes; its experiments use Direct Preference Optimization for affordance-grounded alignment, report gains in entity and part selection, and say hallucination and grounding errors fall, but the RSS snippet does not disclose dataset size or model names.

#Multimodal#Vision#Alignment#Research release

why featured

HKR-H and HKR-K pass via a new benchmark and alignment mechanism. Sample count, comparative results, and reproduction details are not disclosed, so this stays an interesting research item, not featured.

editor take

MM-CreativityBench tests LMM tool use, but sample size is undisclosed; DPO helps grounding, yet smells like a vision-hallucination patch.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→A Unified Framework for Diffusion Model Unlearning with f-Divergence

The paper generalizes concept unlearning for text-to-image diffusion models from MSE, interpreted as KL between Gaussians, to arbitrary f-divergences, provides closed-form α-divergence objectives and a min-max variational objective, and reports that the Hellinger closed-form instance consistently outperforms MSE across multiple scenarios.
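
To see why MSE can be read as a KL term, and what swapping in Hellinger changes, the closed forms for two Gaussians are enough. The block below is a textbook identity under the stated assumption of a shared isotropic variance; the symbols are illustrative and not taken from the paper.

```latex
% Assumption: p and q are isotropic Gaussians with a shared variance \sigma^2.
p = \mathcal{N}(\mu_p, \sigma^2 I), \qquad q = \mathcal{N}(\mu_q, \sigma^2 I)

% KL reduces to a scaled squared error between the means.
D_{\mathrm{KL}}(p \,\|\, q) = \frac{\lVert \mu_p - \mu_q \rVert^2}{2\sigma^2}

% Squared Hellinger distance, defined as one minus the Bhattacharyya coefficient.
H^2(p, q) = 1 - \int \sqrt{p(x)\,q(x)}\,dx
          = 1 - \exp\!\left(-\frac{\lVert \mu_p - \mu_q \rVert^2}{8\sigma^2}\right)
```

The KL form grows without bound as the means separate, while the squared Hellinger distance saturates at 1. Whether that boundedness explains the reported Hellinger-over-MSE result is not stated in the snippet.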

#Vision#Fine-tuning#Alignment#Research release

why featured

HKR-K and HKR-R pass: diffusion concept unlearning matters for compliance, and the post names f-divergence, α-divergence, and a Hellinger-over-MSE claim. HKR-H is weak because the angle is math-heavy and lacks code, datasets, or reproducible setup details.

editor take

This generalizes diffusion unlearning to any f-divergence; Hellinger beats MSE, but datasets and margins are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates

The paper proposes optimistic online mirror descent with safeguarded learning rates up to Θ(T), reducing adaptation lag after abrupt shifts from hundreds of rounds to a few rounds, while an O(log T) cumulative post-hoc penalty preserves near-optimal worst-case guarantees across synthetic and 11 real-world datasets.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass via a clear mechanism and numbers, but the paper is niche online-learning research rather than an agent, model, or product event. Lower-band 60–71 fit.

editor take

Θ(T) safeguarded rates cut shift lag to a few rounds; I buy the idea, but 11 datasets don’t prove production safety.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

The paper proposes Node Contribution Backpropagation for MAS defense, modeling communication as a signed DAG and backpropagating each agent’s contribution to the final decision to identify and isolate malicious agents.
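
A toy sketch of contribution backpropagation on a signed DAG, under assumptions the snippet does not confirm: each edge carries a signed weight (positive for support, negative for opposition), the final decision node starts with contribution 1, and an agent's contribution is the weighted sum over its outgoing edges. Agent names and weights are hypothetical.

```python
def backpropagate_contributions(edges, order, sink):
    """edges: {(src, dst): signed weight}; order: a topological order of all nodes.
    Walks the order backwards so every destination is scored before its sources."""
    contribution = {node: 0.0 for node in order}
    contribution[sink] = 1.0
    for node in reversed(order):
        for (src, dst), weight in edges.items():
            if src == node:
                contribution[node] += weight * contribution[dst]
    return contribution

# Hypothetical message graph: three agents feeding one judge.
edges = {
    ("a", "judge"): 0.6,
    ("b", "judge"): 0.5,
    ("c", "b"): -0.8,      # c pushes against b's message
    ("c", "judge"): -0.1,  # c also pushes against the final decision
}
order = ["a", "c", "b", "judge"]

contribution = backpropagate_contributions(edges, order, sink="judge")
flagged = [n for n in order if n != "judge" and contribution[n] < 0]
print(contribution, flagged)
```

In this toy graph only agent c ends up with a negative score. How the paper derives edge signs and sets the isolation threshold is not disclosed in the feed.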

#Agent#Safety#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass via a concrete signed-DAG contribution mechanism and multi-agent safety relevance. Single arXiv paper with no reported metrics, artifact details, or wider debate keeps it in the 60–71 band.

editor take

Node Contribution Backpropagation traces agents via signed DAGs; no lift numbers disclosed, so don’t treat attribution as containment yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Assessing Per-Sample Membership Inference Vulnerability without Retraining

The paper proposes a single-model per-sample privacy risk score that estimates membership inference vulnerability from last-layer representations, requires no shadow models, and outperforms loss and gradient-norm baselines at finding the highest-risk training points under state-of-the-art attacks.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K is clear: the paper proposes membership-inference risk scoring without retraining or shadow models. HKR-R is present via privacy/compliance, but the work is niche research with no product-level impact, so it sits in 60–71.

editor take

This pushes MIA risk into last-layer leverage scores; no shadow models means privacy audits get much cheaper.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

CompassDPO uses the implicit DPO reward margin to control update direction and magnitude, improving robustness over vanilla DPO and DPO-family baselines on PKU-SafeRLHF, four backbones, and out-of-distribution safety benchmarks under controlled label-flip noise.
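
For context, the implicit reward margin in vanilla DPO can be sketched as below. This is the standard DPO quantity, not CompassDPO's control rule, which the snippet does not spell out. The log-probabilities and beta are made-up illustration values.

```python
import math

def dpo_margin(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Implicit reward of the chosen response minus that of the rejected one."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return chosen_reward - rejected_reward

def dpo_loss(margin):
    """Vanilla DPO loss for one pair: -log(sigmoid(margin))."""
    return math.log1p(math.exp(-margin))

def dpo_grad_weight(margin):
    """sigmoid(-margin): the factor vanilla DPO uses to scale each pair's gradient."""
    return 1.0 / (1.0 + math.exp(margin))

margin = dpo_margin(-10.0, -12.0, -11.0, -11.0, beta=0.5)
print(margin, dpo_loss(margin), dpo_grad_weight(margin))
```

Note that a pair with a strongly negative margin gets a gradient weight near 1 in vanilla DPO. If that pair's label was flipped, the largest updates go to the wrong direction, which is the kind of failure a margin-based controller would target.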

#Alignment#Safety#Fine-tuning#PKU-SafeRLHF

why featured

HKR-K and HKR-R pass: the mechanism and 4-backbone/OOD safety tests are concrete. Still, this is a single arXiv method paper with no model launch, production replacement, or visible debate, so it stays below featured.

editor take

CompassDPO holds up across 4 backbones under label-flip noise; I buy the batch-dynamics diagnosis for DPO safety tuning.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

The paper evaluates the association between uncertainty estimators and LLM hallucinations, covering intrinsic and extrinsic hallucinations across four benchmarks including RAGTruth and HalluLens.

#Safety#Benchmarking#RAGTruth#HalluLens

why featured

Single arXiv paper: HKR-K has 4 benchmarks and intrinsic/extrinsic hallucination coverage, HKR-R hits RAG reliability. HKR-H is weak, with no product impact or strong practical claim, so it stays in 60–71.

editor take

Four benchmarks test UE-hallucination links; the association is often weak, so confidence as a hallucination alarm needs a downgrade.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic introduces 967 expert-reviewed 4-hop evaluation examples and 10,296 synthetic training examples, using sub-questions, graph topologies, and intermediate answers to diagnose where LLM multi-hop reasoning fails.

#Reasoning#Benchmarking#Fine-tuning#Omanic

why featured

HKR-K is solid: 967 expert-labeled 4-hop samples plus hop-wise failure localization. HKR-R is present for reasoning-eval reliability, but HKR-H is weak and this remains a single arXiv benchmark, so it stays in 60–71.

editor take

Omanic ships 967 expert 4-hop examples; I buy the hop-level failure tracing more than the 7.41-point transfer claim.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

Spherical KV compresses long-context KV cache with ADA and RDR: ADA stores keys as a scalar radius plus compact angle codes and computes attention logits without dense-key reconstruction, while RDR chooses keep/drop decisions and precision tiers per token and head under a fixed budget.
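
A 2D toy of the radius-plus-angle idea, offered only as intuition. It assumes an 8-bit uniform angle code and uses the identity q·k = |q||k|cos(θ_q − θ_k), so the logit is computed from the stored radius and code without rebuilding the key's coordinates. The paper's actual high-dimensional codebook is not described in the snippet.

```python
import math

LEVELS = 256  # hypothetical 8-bit angle code

def encode_key(kx, ky):
    """Store a 2D key as (radius, quantized angle code)."""
    radius = math.hypot(kx, ky)
    angle = math.atan2(ky, kx) % (2 * math.pi)
    code = round(angle / (2 * math.pi) * LEVELS) % LEVELS
    return radius, code

def logit_from_code(qx, qy, radius, code):
    """Attention logit from the stored radius and angle code only."""
    q_norm = math.hypot(qx, qy)
    q_angle = math.atan2(qy, qx)
    k_angle = code * 2 * math.pi / LEVELS
    return q_norm * radius * math.cos(q_angle - k_angle)

radius, code = encode_key(3.0, 4.0)
approx = logit_from_code(1.0, 2.0, radius, code)
exact = 1.0 * 3.0 + 2.0 * 4.0
print(exact, approx)
```

The gap between `exact` and `approx` comes entirely from angle quantization, which is the error the precision tiers in RDR would presumably trade against budget.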

#Inference-opt#Research release

why featured

HKR-H/K/R are present, but the body gives mechanisms without compression, latency, accuracy-loss, model-size, or code details. As an arXiv inference-opt paper, it is useful signal but below featured threshold.

editor take

Spherical KV stores keys as radius plus angle codes; no compression ratio or benchmarks disclosed, so don’t call it an engineering win yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

SWE-Adept uses separate localization and resolution agents, and experiments on SWE-Bench Lite and SWE-Bench Pro report up to a 4.3% improvement in end-to-end issue resolve rate over prior approaches.

#Agent#Code#Tools#SWE-Adept

why featured

HKR-K passes with a concrete dual-agent mechanism and 4.3% benchmark gain; HKR-R passes for code-agent competition. HKR-H is weak, and this is a single arXiv paper, so it stays in the 60–71 band.

editor take

SWE-Adept reports up to +4.3% on SWE-Bench Lite and SWE-Bench Pro. Split agents plus Git checkpoints are practical, but the lift is modest.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy

The study uses 5,754 German neuropsychological assessment recordings to compare hand-crafted acoustic features with SSL embeddings across task, domain, and global score levels, finding SSL stronger at lower levels while hand-crafted features outperform SSL for MCI classification.

#Audio#Embedding#Benchmarking#Research release

why featured

HKR-H/K/R pass: the paper has a concrete 5,754-recording setup and a useful baseline reversal. Impact stays in 60–71 because it is a single clinical-speech study with no product rollout, artifact, or broad industry pickup.

editor take

Across 5,754 German recordings, SSL wins lower levels; hand-crafted acoustics beat it on MCI classification—clinical speech still punishes embedding faith.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Tracing Computation Density in LLMs

The paper introduces s-Trace to estimate a size-s subgraph that approximates full LLM outputs, and finds two computation phases: an early-layer sparse core reconstructs the distribution head, while later layers and attention heads add incremental refinements.

#Interpretability#Reasoning#Research release

why featured

HKR-K is solid: s-Trace and the two-stage computation-density claim add new information. HKR-R is limited to interpretability/safety readers; no model list, scale, or reproducible setup is disclosed, so it stays in 60–71.

editor take

s-Trace approximates full outputs with size-s subgraphs; don’t call it interpretability yet, because models and error curves aren’t disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

Yue Min and three coauthors introduce GEM, a data-mixing framework that formulates LLM pre-training curation as a variational problem on the hypersphere, and report experiments on 1.1B-parameter models where integration with DoReMi and RegMix improves average downstream accuracy by up to 1.2%.

#Benchmarking#Yue Min#DoReMi#RegMix

why featured

HKR-K and HKR-R pass: GEM adds a concrete data-mixing mechanism plus 1.1B-model results, relevant to pretraining practice. HKR-H is weak, and this is a single arXiv methods paper, so it stays in 60–71.

editor take

GEM adds up to 1.2% on 1.1B models with DoReMi/RegMix; I don’t buy the SOTA framing, but the geometry is testable.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS revises rubrics during RL training using persistent evaluation memory, scoring 2.8 points above the strongest baseline on GPQA-Diamond and 2.2 points above it on IFBench across global and instance-specific rubric settings.

#Fine-tuning#Memory#Alignment#AMARIS

why featured

HKR-K is clear: the post gives a mechanism and two benchmark gains. HKR-R passes because rubric quality affects RL training, but HKR-H is weak and the item has abstract-level detail only.

editor take

AMARIS gains 2.8 on GPQA-Diamond; I buy this because rubric drift finally gets an audit trail.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Research paper proposes early stopping rollout technique for on-policy distillation

The paper proposes Early Stopping Rollout for on-policy distillation by restricting rollout generation to early response tokens; the abstract does not disclose the exact token count, but reports stronger performance than full-rollout OPD across model sizes, families, tasks, and training regimes.

#Fine-tuning#Alignment#Inference-opt#Research release

why featured

HKR-H/K/R pass, but the item is still abstract-level: no early-stopping token count, metric table, or failure cases. The training-cost angle is useful, not strong enough for featured.

editor take

ESR rolls out only early response tokens, with no token count disclosed; I buy the failure mode: long rollouts turn teachers into completers.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

The paper proposes shifting foundation-model machine unlearning from data-tracing to knowledge-tracing, argues that regulators and enterprise users often lack access to training data, and includes one vision-language model case study plus a public code page.

#Vision#Multimodal#Safety#Research release

why featured

HKR-K and HKR-R pass: it introduces knowledge-tracing unlearning with one VLM case and code. HKR-H is weak, and the post lacks metrics or reproducible details, so this stays in all.

editor take

The paper has one VLM case study; I don’t buy the brain-forgetting analogy—regulators need auditable boundaries.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

The paper replaces the linear W_Q projection with Q(X) = X + f_θ(X) and reports GPT-3-small-style experiments with 2.40% lower validation log-loss and 6.81% lower perplexity than the baseline.
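
The two reported numbers can be cross-checked against each other. The sketch below assumes perplexity is exp(loss) with loss in nats and that both percentages refer to the same validation set; the baseline loss it backs out is an inference, not a figure from the paper.

```python
import math

loss_drop = 0.0240  # reported relative drop in validation log-loss
ppl_drop = 0.0681   # reported relative drop in perplexity

# With ppl = exp(loss): ppl_new / ppl_old = exp(-loss_drop * baseline_loss).
# Solving for the baseline loss that makes both reported drops consistent:
implied_baseline_loss = -math.log(1.0 - ppl_drop) / loss_drop

# Round trip: the perplexity drop that this baseline loss would imply.
recovered_ppl_drop = 1.0 - math.exp(-loss_drop * implied_baseline_loss)

print(implied_baseline_loss, recovered_ppl_drop)
```

Under those assumptions the baseline validation loss comes out a little under 3 nats, which is a plausible range for a small GPT-style model, so the two figures look mutually consistent.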

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K is strong and HKR-H has a clear architecture hook: nonlinear queries cut loss 2.40% on a GPT-3-small-style model. HKR-R is weak because cost, scaling, and artifact details are not disclosed, so this stays in all.

editor take

Nonlinear Q cuts perplexity 6.81% on GPT-3-small-style runs; I’d file this as a cheap architecture patch, unproven at scale.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

The paper trains lightweight Transformer vision and text encoders on a 1D image-text testbed, and finds label diversity drives generalization to unseen object pairs more than layout diversity under a CLIP-style contrastive objective.

#Vision#Multimodal#Interpretability#arXiv

why featured

HKR-K passes because the paper gives a concrete generalization claim. HKR-H and HKR-R are weak: the synthetic 1D setup is narrow, and the article gives no product or benchmark impact.

editor take

A 1D testbed isolates left-right learning; label diversity beating layout diversity is a neat minimal counterexample for CLIP spatial generalization.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Learning to Reason Efficiently with Discounted Reinforcement Learning

The paper uses discounted reinforcement learning to penalize reasoning tokens and analyzes Blackwell optimality in restricted policy classes; experiments report shorter chains of thought while preserving accuracy, but the RSS snippet does not disclose datasets, model names, or token-reduction numbers.
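
A minimal sketch of how discounting penalizes long reasoning, assuming one timestep per reasoning token and a single terminal reward of 1 for a correct answer. The discount factor and token counts are arbitrary illustration values, not numbers from the paper.

```python
def discounted_return(num_reasoning_tokens, correct, gamma=0.999):
    """Terminal reward of 1 for a correct answer, discounted once per reasoning token."""
    reward = 1.0 if correct else 0.0
    return (gamma ** num_reasoning_tokens) * reward

short_correct = discounted_return(200, correct=True)
long_correct = discounted_return(2000, correct=True)
short_wrong = discounted_return(200, correct=False)

print(short_correct, long_correct, short_wrong)
```

Under this setup a correct short chain always beats a correct long one, and any correct chain beats a wrong one, so the objective only trims tokens when accuracy survives.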

#Reasoning#Inference-opt#Research release

why featured

HKR-K/R pass: the mechanism targets reasoning-token cost with theory and experiments. HKR-H is weak, and no accuracy or token-saving numbers are disclosed, so this stays in all.

editor take

Discounted RL penalizes reasoning tokens, but models, datasets, and reduction rates are undisclosed; I’d file it as token-frugality methodology.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in Multi-line Handwritten Math OCR

The paper evaluates 15 VLMs on FERMAT multi-line handwritten math OCR and proposes PINK, an LLM-rubric metric that penalizes over-correction; PINK receives 55.0% human preference versus BLEU’s 39.5%.

#Vision#Multimodal#Benchmarking#GPT-4o

why featured

HKR-H/K/R pass, but this is a single arXiv evaluation paper focused on handwritten math OCR and multimodal benchmarking. No model release, open-source tool, or production replacement claim, so it stays in the 60–71 band.

editor take

PINK beats BLEU on human preference, 55.0% versus 39.5%, in a 15-VLM evaluation. GPT-4o gets penalized; education OCR needs transcription, not tutoring.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Real-Time Progress Prediction in Reasoning Language Models

The paper trains linear probes and 0–100% progress-reporting checkpoints for reasoning traces, with the strongest checkpoint reaching 0.161 MAE on mathematical reasoning and outperforming position baselines.

#Reasoning#Interpretability#Fine-tuning#Qwen

why featured

HKR-H/K/R pass: the hook is a reasoning progress bar, with 0.161 MAE and linear-probe details. As a single arXiv paper with no disclosed artifact or deployment, it stays in the all band.

editor take

Qwen3-4B progress reporting hits 0.161 MAE; I don’t buy “observable reasoning progress” until label ambiguity is tamed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Graph is a Substrate Across Data Modalities

The paper proposes G-Substrate, a graph substrate framework with a unified structural schema and interleaved role-based training, and reports that it outperforms task-isolated and naive multi-task baselines across multiple domains, modalities, and tasks.

#Multimodal#Benchmarking#G-Substrate#Research release

why featured

HKR-H and HKR-K pass: the title offers a cross-modal unification hook, and the post names G-Substrate’s schema and training mechanism. No metrics, artifact details, or deployment angle, so it stays below featured.

editor take

G-Substrate trains one graph schema across tasks. The snippet omits task counts and gains, so don’t crown it a multimodal substrate yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→From Attribution to Action: A Human-Centered Application of Activation Steering

The paper introduces a web workflow combining SAE-based attribution with activation steering, then evaluates it through semi-structured interviews with 8 experts performing CLIP debugging tasks for instance-level concept analysis.

#Vision#Interpretability#Tools#CLIP

why featured

HKR-H/K pass: the paper turns attribution into a steering workflow and reports an 8-expert CLIP debugging study. The narrow setup and small sample keep it in all, not featured.

editor take

All 8 experts used steering for intervention tests; I buy the tool direction, but N=8 only proves workflow fit.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

The paper introduces Language-guided TSED, ELT, and SELA to localize event intervals in multivariate signals from textual descriptions under little or no labeled data, and releases a real-world benchmark across energy and climate domains with expert knowledge and annotations.

#Agent#Vision#Reasoning#Research release

why featured

HKR-H/K pass; HKR-R fails. The paper has a fresh VLM-agent angle and concrete methods/benchmarks, but remains a single niche arXiv item with no adoption, code, or headline benchmark result.

editor take

SELA beats fine-tuned TSED baselines with little labeling; no margins disclosed, but ELT constraints beat VLM chart-reading vibes.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

The paper introduces GraphGPO, which aggregates all rollout trajectories into one state-transition graph and assigns credit to each edge by estimating how much the transition reduces distance to the task goal.
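
The mechanism can be sketched in a few lines, with loud assumptions: states are hashable labels, distance to goal is hop count found by breadth-first search on the merged graph, and edge credit is the drop in that distance. The paper's actual distance estimator is not given in the snippet, and the state names are hypothetical.

```python
from collections import deque

def build_graph(trajectories):
    """Merge all rollouts into one set of directed state-transition edges."""
    edges = set()
    for states in trajectories:
        edges.update(zip(states, states[1:]))
    return edges

def distance_to_goal(edges, goal):
    """Hop distance from each state to the goal, via BFS over reversed edges."""
    reverse = {}
    for src, dst in edges:
        reverse.setdefault(dst, []).append(src)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for prev in reverse.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

def edge_credit(edges, goal):
    """Credit = distance(source) - distance(target); positive means progress."""
    dist = distance_to_goal(edges, goal)
    unreachable = len(dist)  # exceeds every finite hop distance in this graph
    return {(s, t): dist.get(s, unreachable) - dist.get(t, unreachable) for s, t in edges}

rollouts = [
    ["start", "a", "goal"],
    ["start", "b", "a", "goal"],
    ["start", "b", "dead"],
]
credit = edge_credit(build_graph(rollouts), goal="goal")
print(credit)
```

Here the detour through b earns zero credit and the step into the dead end is penalized, even though neither rollout-level outcome says which step was at fault.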

#Agent#Reasoning#GraphGPO#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete GraphGPO mechanism for agentic RL credit assignment. No benchmark gains, eval setup, or artifact are disclosed, so it stays in the 60–71 band.

editor take

GraphGPO turns rollouts into a state graph; no metrics disclosed, so don’t buy the SOTA claim yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Dense2MoE converts public dense LLMs into on-device MoE models through LF-UC, pruning bandwidth-heavy attention modules from redundant layers and repurposing MLPs as experts; the abstract does not disclose model sizes, latency numbers, or accuracy scores.

#Inference-opt#Dense2MoE#Research release

why featured

HKR-K and HKR-R pass: the mechanism is concrete and relevant to on-device deployment costs. HKR-H is weak, and model size, latency, and accuracy are not disclosed, so it stays in 60–71.

editor take

Dense2MoE uses LF-UC on dense LLMs, but gives no size, latency, or accuracy; on-device MoE needs numbers first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

RLScale-Bench compares six DRL algorithms against a calibrated rule-based autoscaler over 240 runs; the rule-based controller achieves the lowest cost across six workloads, while trailing the best RL agents on bursty and flash traffic.

#Agent#Benchmarking#RLScale-Bench#Kubernetes

why featured

HKR-H/K/R pass, but adaptive resource control is a narrow DRL benchmark rather than a broad product or tool release. Strong data, limited audience fit, so it stays in the 60–71 band.

editor take

RLScale-Bench ran 240 trials; calibrated rules win all six cost tests, so DRL autoscaling papers owe stronger baselines.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Hongkai Li and nine coauthors propose TSFMAudit, which audits pretraining contamination in forecasting time series foundation models using fine-tuning probe dynamics: faster loss reduction with smaller backbone movement flags contamination, and the paper evaluates it on 6 TSFMs and 187 datasets against 10 LLM-derived baselines.

#Fine-tuning#Benchmarking#Hongkai Li#arXiv

why featured

HKR-K and HKR-R pass via a concrete audit mechanism and benchmark-trust angle. HKR-H fails; the niche TSFM research scope keeps it in the 60–71 interesting-but-not-featured band.

editor take

TSFMAudit tests 6 TSFMs across 187 datasets; time-series benchmark scores need contamination audits, not cleaner leaderboard prose.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Membership Inference Risks in Quantized Models: A Theoretical and Empirical Study

The paper proposes an MIS indicator for post-training quantization and evaluates membership-inference security across different quantizers using synthetic datasets and real-world drug discovery data.

#Inference-opt#Safety#Research release

why featured

HKR-K and HKR-R pass: quantization is tied to membership-inference risk, not just cost and latency. The article gives no key results or reproducible numbers, so it stays in the 60–71 research-note band.

editor take

The paper adds a PTQ MIS indicator; quantization saves inference cost, but privacy risk needs more than accuracy tables.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Rethinking the Trust Region in LLM Reinforcement Learning

The paper proposes DPPO to replace PPO ratio clipping with a direct policy-divergence estimate, using Total Variation or KL constraints and Binary plus Top-K approximations to reduce memory overhead while evaluating stability and efficiency against existing RL fine-tuning methods.

#Fine-tuning#Alignment#Research release#Open source

why featured

HKR-K/R pass: DPPO gives a concrete PPO-clipping alternative and touches RL fine-tuning stability plus memory cost. No scores, code link, or broad product angle are disclosed, so the niche arXiv paper stays in all.

editor take

DPPO swaps PPO clipping for TV/KL constraints; for huge vocabularies, single-token ratios were always a shaky crutch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

SEC-bench Pro evaluates security agents on 183 validated V8 and SpiderMonkey vulnerabilities, with the strongest frontier configuration reaching 32.0% success on V8 and 38.8% on SpiderMonkey.

#Agent#Code#Benchmarking#Google

why featured

HKR-K is strong with 183 real bugs and 32.0%/38.8% scores; HKR-H has a concrete long-horizon agent hook. Browser-engine security is specialist, so the technical-accessibility heuristic caps it near 65 and keeps it in all.

editor take

SEC-bench Pro tests agents on 183 real bugs; frontier models top out at 38.8%, so long-horizon security remains unsolved.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in Modern Transformers

The paper trains small Transformers on synthetic classification tasks and finds that RoPE raises the data-complexity threshold for ICL, while high-diversity pretraining in a primary modality lets low-complexity secondary-modality data trigger multimodal ICL.

#Multimodal#Reasoning#Interpretability#Yiran Huang

why featured

HKR-K passes with testable mechanism claims, but the evidence is small-Transformer synthetic tasks and broad product impact is thin. Narrow research scope keeps it in the 60–71 band.

editor take

Small synthetic Transformers show RoPE raises ICL thresholds; I buy the circuit evidence, not the jump to VLM claims.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

BhashaSetu releases an English-Marathi parallel dataset with 2.78 million sentence pairs across news, politics, healthcare, literature, and culture, and the paper benchmarks translation models with BLEU, spBLEU, chrF++, and TER while fine-tuning NLLB-200-distilled-600M with LoRA.

#Fine-tuning#Benchmarking#BhashaSetu#NLLB-200

why featured

HKR-K/R pass: 2.78M sentence pairs and the NLLB-200 LoRA setup are concrete, and low-resource language data resonates with multilingual builders. The academic framing and narrow audience keep it below featured.

editor take

BhashaSetu ships 2.78M English-Marathi pairs; skipping dedup costs 1.17 BLEU, so low-resource MT still starts with hygiene.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

The paper conditions diffusion models on representations from a pre-trained self-supervised model, and the abstract says this self-conditioning improves unconditional image quality while exposing variation directions for controllable generation.

#Vision#Multimodal#Research release

why featured

HKR-K/R pass: the paper offers a concrete representation-conditioned diffusion mechanism and speaks to image controllability. No metrics, model scale, or reproducible setup are disclosed, so it stays in the 60–71 research-release band.

editor take

This paper conditions diffusion on self-supervised features; no FID or dataset disclosed, so I’d test cross-class control before buying it.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→FAV Framework Aligns Few-Step Generative Models via Amortized Variational Inference

FAV aligns few-step generative models using only sample access to the generator and reference distribution, and its robotics evaluation covers 56 offline and 30 offline-to-online RL tasks.

#Fine-tuning#Alignment#Robotics#FAV

why featured

HKR-K passes via a concrete mechanism and 56+30 robotics tasks. HKR-H fails on a dense academic title; HKR-R is narrow to robotics/RL researchers, so this stays in the 60–71 band.

editor take

FAV needs only sample access and covers 56 offline and 30 offline-to-online RL tasks; I buy the interface, which means fewer model-family rituals for few-step generators.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

The paper proposes STARS, a training framework that constrains LoopLM latent states toward stable fixed points using Jacobian spectral radius regularization and random loop sampling; arithmetic and mathematical reasoning experiments show more reliable test-time scaling and reduced degradation as recurrence depth increases, but the snippet does not disclose exact benchmark scores.

#Reasoning#Inference-opt#Research release

why featured

HKR-K/R pass: the mechanism is concrete and test-time reasoning is relevant. Kept in all because this is a technical arXiv paper with no disclosed uplift numbers, code, or mainstream-model validation.

editor take

STARS regularizes LoopLM recurrence via Jacobian spectral radius; scores are undisclosed, so I don’t buy “reliable scaling” yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→PRBench: A Standardized Probabilistic Robustness Benchmark

PRBench compares adversarial training and probabilistic robustness training methods, and the authors release a leaderboard with 229 trained models across 7 datasets and 10 architectures.

#Benchmarking#Safety#PRBench#Research release

why featured

HKR-K passes with concrete leaderboard scale; HKR-H/R are weak because this is a narrow research benchmark without product impact. No hard exclusion, so it stays in the lower-interest band.

editor take

PRBench ships 229 models; AT still looks sturdier, while PR training wins on lower GE and clean accuracy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

The paper studies scale vectors in LLM normalization layers and tests a unified strategy on 0.12B to 2B dense and MoE pre-training runs, where branch-specific heterogeneity, placement changes, and magnitude-direction reparameterization reduce terminal loss with negligible parameter and compute overhead.

#Inference-opt#Fine-tuning#Benchmarking#arXiv

why featured

HKR-H and HKR-K pass: the title has contrast, and the paper gives a 0.12B-2B pretraining setup with near-zero overhead. HKR-R is weak, and no concrete loss delta is disclosed, so this stays in all.

editor take

Scale vectors cut terminal loss across 0.12B–2B pretraining, but token budgets and deltas are undisclosed; don’t call it an architecture win yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

LiPUP-MA revises participatory urban plans through closed-loop LiPUP cycles, alternating residential living simulation with plan revision while combining experiential, visual, and geospatial evidence; the abstract says it outperforms baselines on static and living-based metrics, but the RSS snippet does not disclose datasets or numeric scores.

#Agent#Multimodal#Research release#Benchmark

why featured

HKR-K passes: the paper offers a concrete multi-agent loop for participatory planning. HKR-H/R are weak because the article lacks metrics, code, reproducible setup, or a broader AI-industry hook.

editor take

LiPUP-MA loops residential simulation into planning, with no scores disclosed; planning agents easily launder preferences as geospatial evidence.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer trains a 3B LLM with a two-stage curriculum to execute graph functions and aggregate evidence across turns, then evaluates it by training on one domain and testing on unseen domains and out-of-distribution question types.

#Reasoning#Tools#Fine-tuning#GraphDancer

why featured

HKR-K passes: the mechanism and test setting are concrete for tool-reasoning readers. HKR-H and HKR-R are weak, and no result numbers, baselines, or reproducible repo are disclosed, so it stays in the normal research band.

editor take

GraphDancer uses a 3B backbone and cross-domain tests, but scores are undisclosed; I buy the curriculum, not the larger-model claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

The paper proposes SWAP for auditing CLIP soft-prompt copyright by encoding watermarks as defender-specified out-of-distribution class sequences, and evaluates effectiveness, harmlessness, and robustness against attacks on 11 datasets.

#Vision#Multimodal#Safety#CLIP

why featured

HKR-K is clear via sequential watermarking and 11-dataset validation; HKR-R lands on model-IP and security concerns. The soft-prompt focus is too niche for featured, with no product impact or broad industry trigger.

editor take

SWAP audits CLIP soft-prompt copyright on 11 datasets; OOD class sequences are clever, but CLIP-only limits the claim.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→UCPO: Uncertainty-Aware Policy Optimization

The paper proposes UCPO, using Ternary Advantage Decoupling and Dynamic Uncertainty Reward Adjustment to address advantage bias in GRPO-style RL under binary decision spaces and static uncertainty rewards.

#Reasoning#Alignment#Safety#Research release

why featured

HKR-K/R pass: the paper gives concrete post-training mechanisms and targets GRPO bias. The item lacks experiment numbers, model scale, or code, so it stays in all rather than featured.

editor take

UCPO normalizes uncertain rollouts separately; no metrics in the snippet, so don’t crown it a GRPO fix yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

FalAR provides about 20 years of European Portuguese parliamentary speech, with 5,800 hours of audio, 4,850 speaker-annotated hours across 1,180 speakers, and experiments showing up to 14% relative WER improvement when used as ASR pre-training data.

#Audio#Benchmarking#FalAR#Research release

why featured

HKR-K passes with concrete corpus scale and WER impact. HKR-H and HKR-R miss because the angle is a niche speech dataset, so it fits the 60–71 research-release band.

editor take

FalAR ships 5,800 hours of EP parliament speech; the up-to-14% relative WER gain is solid, but parliament data hard-codes accent and register bias.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Ratio-Variance Regularized Policy Optimization

Yu Luo and seven coauthors introduce R²VPO, replacing PPO-style hard clipping with a policy ratio variance constraint, and evaluate it across seven LLM scales and 10 robotic control tasks.

#Reasoning#Robotics#Yu Luo#Shuo Han

why featured

HKR-K passes on the mechanism and evaluation scope. HKR-H and HKR-R are weak, and the algorithmic RL framing has a high access barrier with no disclosed gain numbers, so it stays in all.

editor take

R²VPO tests a PPO alternative on 7 LLM scales and 10 robotics tasks; I buy soft constraints, but gains lack tables here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
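To illustrate the contrast the R²VPO summary draws between hard clipping and a ratio-variance constraint, here is a minimal sketch. The clipped surrogate is the standard PPO form; `ratio_variance_objective` and its `penalty` weight are assumptions made for illustration only, since the summary does not give the paper's actual objective.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Standard PPO surrogate: per-sample min of the raw and clipped terms."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)


def ratio_variance_objective(ratios, advantages, penalty=1.0):
    """Hypothetical soft alternative: unclipped surrogate minus a penalty
    on the variance of the policy ratios (an assumed form, not the paper's)."""
    n = len(ratios)
    mean_ratio = sum(ratios) / n
    variance = sum((r - mean_ratio) ** 2 for r in ratios) / n
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / n
    return surrogate - penalty * variance
```

When every ratio equals 1.0, both functions reduce to the mean advantage; they differ only once the new policy drifts from the old one.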

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

The paper proposes Diffusion LAIR, which converts reward scores from multiple candidate images for one prompt into centered advantage weights, then optimizes an advantage-weighted regression objective with a quadratic implicit-reward penalty; experiments report gains over preference-optimization baselines on SD1.5 and SDXL across text-to-image, compositional generation, and image editing benchmarks.

#Alignment#Fine-tuning#Vision#Diffusion LAIR

why featured

HKR-K passes via a concrete method and SD1.5/SDXL evaluations; HKR-H and HKR-R are weak. This is useful diffusion alignment research, but reads as incremental rather than featured-level news.

editor take

Diffusion LAIR trains on multi-image rewards per prompt; SD1.5 and SDXL win, but effect sizes are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
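The "centered advantage weights" step in the Diffusion LAIR summary can be sketched as follows. The function name and the plain mean-centering are assumptions; the summary does not say whether the paper also rescales the weights, and the reward values are made up.

```python
def centered_advantages(rewards):
    """Subtract the per-prompt mean reward so the weights sum to zero."""
    if not rewards:
        raise ValueError("need at least one candidate reward")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical reward scores for four candidate images of one prompt.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = centered_advantages(scores)
```

Candidates scoring above the prompt's mean get positive weight and those below get negative weight, which is what lets a listwise objective use every candidate rather than a single preferred/rejected pair.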

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→CFG-OEC: Classifier-Free Guidance with Orthogonal Error Correction

The paper proposes CFG-OEC to correct structural sampling error in classifier-free guidance for diffusion models, using a proxy from model predictions and a dynamic timestep method; experiments on Stable Diffusion v1.5 and Stable Diffusion XL report better FID and CLIP scores than CFG and CFG++ across multiple samplers and guidance regimes.

#Vision#Inference-opt#Stable Diffusion#Research release

why featured

HKR-K passes via a new CFG error-correction mechanism and SD v1.5/SDXL FID-CLIP results. HKR-H/R are weak, so this stays a narrow but useful research item.

editor take

CFG-OEC beats CFG++ on SD v1.5 and SDXL, but no FID numbers are disclosed; I’d treat it as a sampler patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
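For context on what CFG-OEC modifies, here is a sketch of the baseline classifier-free guidance combination in one common convention (scale 1.0 returns the conditional prediction). It shows plain CFG only; the orthogonal error correction itself is not described in the summary and is not reproduced here.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Baseline CFG: move from the unconditional prediction toward
    (and, for scales above 1, past) the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" with made-up values.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```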

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

FedTreeLoRA uses tree-structured aggregation for layer-wise alignment, letting clients share shallow trunks and specialize deeper branches; the abstract says it outperforms state-of-the-art methods on NLU and NLG benchmarks, but the post does not disclose exact scores.

#Fine-tuning#Benchmarking#FedTreeLoRA#Research release

why featured

HKR-K passes: FedTreeLoRA offers tree aggregation with layer-wise alignment and claims NLU/NLG SOTA gains. Scores are not disclosed, and the topic is niche, so it stays in low all.

editor take

FedTreeLoRA adds layer-wise tree aggregation; no scores disclosed, so I read it as personalization routing for federated LoRA.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

The paper proposes an interactive agentic framework that extracts LLM knowledge with four adaptive exploration policies, then applies a three-stage pipeline for duplicate filtering, semantic-overlap adjudication, and domain-relevance auditing.

#Agent#RAG#Benchmarking#Research release

why featured

HKR-K passes because the method is concrete for evaluation/RAG readers. HKR-H and HKR-R are weak, and the post does not disclose results, model comparisons, or artifacts, so it stays in the normal research-release band.

editor take

This probes LLM knowledge with 4 policies; Recursive Taxonomy wins, but no model list is disclosed here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

The paper co-trains one self-driving car and 12 pedestrians with MAPPO, reaching a 78% goal rate and 14% collision rate over 500 evaluation episodes, versus 35% and 33% for the best rule-based baseline.

#Agent#Robotics#Safety#Research release

why featured

HKR-K is clear with comparable evaluation numbers; HKR-R is limited to autonomous-driving and robotics safety, while HKR-H is weak. The arXiv paper has technical overhead but no hard-exclusion trigger, so it stays in all.

editor take

MAPPO cuts collisions to 14% over 500 episodes; pedestrians still use Dijkstra scripts, so don’t oversell real driving safety.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA introduces two modules, AdaSTR and DuTR, to reconstruct tables into Logical Semantic Trees and combine tree-search textual navigation with symbolic code execution; the abstract says experiments reach SOTA on complex table benchmarks, but the post does not disclose exact scores.

#Reasoning#Code#Benchmarking#ASTRA

why featured

HKR-K passes for a concrete mechanism, but HKR-H and HKR-R miss: no scores, code, or deployment angle are disclosed. This fits the 60s band for niche research, so the tier is all.

editor take

ASTRA uses AdaSTR and DuTR for table QA, but gives no scores; ignore SOTA until tree search plus code is reproducible.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Stochastic Decision Horizons for Constrained Reinforcement Learning

The paper proposes stochastic decision horizons for constrained RL with every-step constraint satisfaction, and VT-MPO matches state-of-the-art gait realism on the 90-muscle H2190 humanoid with 4x fewer environment steps.

#Robotics#Reasoning#Safety#arXiv

why featured

HKR-H and HKR-K pass via the 90-muscle humanoid and 4x sample-efficiency claim. The constrained-RL framing is technical and narrow, so it stays in all rather than featured.

editor take

VT-MPO matches H2190 gait quality with 4x fewer environment steps; SDH earns attention by enforcing per-step constraints.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj trains a superpoint-merging object discovery agent with semantic and geometric rewards from self-supervised 2D/3D foundation models, targeting 3D object segmentation without scene-level human annotations; the abstract claims stronger results on diverse benchmarks but does not disclose benchmark counts or scores.

#Agent#Vision#Robotics#FoundObj

why featured

HKR-K is solid via the reward-training mechanism, and HKR-R lands on annotation cost for 3D vision teams. The score stays in 60–71 because this is a single arXiv paper with no disclosed benchmark count or metrics.

editor take

FoundObj uses 2D/3D self-supervised models as rewards; scores are undisclosed, so don’t read “label-free” as deployable yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Personalized Generative Models for Contextual Debiasing

The paper introduces DecoupleGen, a personalized text-to-image diffusion method for augmenting rare-context images, and evaluates it on object classification and recognition tasks in complex scene datasets; the RSS snippet does not disclose dataset names, improvement numbers, model sizes, or training costs.

#Vision#Multimodal#Fine-tuning#Research release

why featured

HKR-K and HKR-R pass: DecoupleGen gives a concrete synthetic-data debiasing mechanism and touches long-tail data cost. Missing datasets, gains, and training cost keep it in the ordinary research-release band.

editor take

DecoupleGen augments rare-context images via personalized diffusion; no datasets or gains are disclosed, so don’t crown it a debiasing baseline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

LUCoS ranks first by mean AUC, ACC, and F1 across 67 OpenML-CC18 datasets and six low-label budgets, selecting representative medoids as context from embeddings induced by an unsupervised Prior-Fitted Network rather than raw tabular features.

#Embedding#Benchmarking#LUCoS#OpenML-CC18

why featured

HKR-K passes with 67 datasets, six label budgets, and a PFN-medoid selection mechanism; HKR-H/R are weak because this is niche tabular-ML benchmarking. It lands in the lower 60–71 research band with no hard exclusion.

editor take

LUCoS ranks first on 67 OpenML-CC18 datasets; for low-label TabPFN, raw tabular-space distance should retire.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

SPHERE-JEPA replaces LeJEPA’s Gaussian prior with hyperspherical uniformity via an adapted Cramér-Wold projection mechanism, and reports over 6% higher texture retrieval mAP plus a 1.8% linear-probing gain on ImageNet-1K with ViT-B/14.

#Embedding#Benchmarking#SPHERE-JEPA#LeJEPA

why featured

HKR-K passes on a concrete mechanism and two benchmark gains. HKR-H/R are weak: the title is technical, and there is no product implication or practitioner nerve, so this stays in all.

editor take

SPHERE-JEPA gains 1.8% linear probing on ViT-B/14; I buy spherical uniformity more than the big “optimal geometry” framing.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

Falcon-X maps variates into a unified latent prototype space and reports state-of-the-art forecasting results on GIFT-Eval and fev-bench; the abstract does not disclose parameter count, training data size, or release license.

#Benchmarking#Falcon-X#Research release#Open source

why featured

Only HKR-K passes: the post gives a mechanism and two benchmark claims, but not parameter count, training data, or license. Time-series foundation models are useful to some teams, but the audience fit is narrow, so this stays in the lower 60–71 band.

editor take

Falcon-X claims SOTA on GIFT-Eval and fev-bench; no params, data scale, or license, so treat it as architecture first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection

The paper introduces Chimera Training for logical anomaly detection, concatenating subtree features from different samples at the feature level and improving rule-level anomaly AUROC on CLEVRER, OpenImages, and VidOR against independent-event and same-image semantic-training baselines.

#Vision#Reasoning#Benchmarking#arXiv

why featured

HKR-K passes with a new training mechanism and AUROC gains on CLEVRER, OpenImages, and VidOR. HKR-H/R are weak, so this stays in all as a narrow but valid research release.

editor take

Chimera Training lifts rule-anomaly AUROC on 3 vision datasets; feature-level counterfactuals beat pretending rare violations are collectible.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

DEI uses a four-node heterogeneous LLM ensemble on Core War and reports a 45.90 merged-archive QD-Score versus 20.46 for a single-node baseline, with coverage at 80.6% versus 63.0%, under an equal total LLM-call budget.

#Agent#Code#Benchmarking#GPT-5.4-mini

why featured

HKR-K passes with testable QD-Score, coverage, and single-node baseline numbers. HKR-H/R are weak, and Core War plus quality-diversity search is too narrow for featured treatment.

editor take

DEI hits 45.90 QD-Score on Core War; 124% over the single-node baseline is strong, but real code search remains unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
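The 124% figure in the DEI editor take follows from the two QD-Scores in the summary. A quick check, using only the reported numbers (the helper name is illustrative):

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


qd_gain = relative_gain_percent(45.90, 20.46)  # merged-archive vs single-node QD-Score
coverage_gain_points = 80.6 - 63.0             # coverage difference in percentage points
```

Rounded, `qd_gain` comes to 124, and the coverage gap is about 17.6 percentage points.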

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

The paper proposes Mixture of Activations, a token-adaptive FFN design that mixes a dictionary of activation functions through input-dependent gates, and reports lower terminal loss in pre-training runs on dense and MoE language models from 0.12B to 2B parameters.

#Inference-opt#Reasoning#Research release

why featured

HKR-K passes via a concrete mechanism and 0.12B-2B pretraining result. HKR-H/R are weak: this is a narrow architecture paper, not a product, release, or open-source artifact with broad impact.

editor take

MoA lowers terminal loss from 0.12B to 2B runs; I buy the signal, but inference cost and downstream gains are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
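The Mixture of Activations summary describes input-dependent gates mixing a dictionary of activation functions. A minimal scalar sketch of that idea follows; the three-function dictionary, the softmax gating, and passing gate logits in directly are all assumptions for illustration, not the paper's design. In a real layer the logits would be computed from the token's hidden state.

```python
import math

# Hypothetical activation dictionary; the paper's actual choices are not given.
ACTIVATIONS = [
    lambda h: max(h, 0.0),               # ReLU
    math.tanh,                           # tanh
    lambda h: h / (1.0 + math.exp(-h)),  # SiLU
]


def softmax(logits):
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_activation(hidden, gate_logits):
    """Gate-weighted sum of every activation applied to one hidden value."""
    gates = softmax(gate_logits)
    return sum(g * act(hidden) for g, act in zip(gates, ACTIVATIONS))
```

With one gate logit far above the others the mix collapses to that single activation, so a fixed-activation FFN is a special case of this form.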

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W evaluates joint 2D codebook quantization across five 1.1B–13B LLMs and eight quantized settings, with its activation-aware variant at about 5.5 bpw staying within ±0.4% of BF16 WikiText-2 perplexity on every model.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-K/R pass: the paper gives concrete compression metrics and maps to inference-cost pressure. HKR-H fails because the title is specialist-heavy; not excluded since the summary gives model sizes, bpw, and benchmark conditions.

editor take

QAM-W holds ±0.4% PPL at ~5.5 bpw; QTIP still wins at 4 bpw, so don’t file this under ultra-low-bit.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Research proposes improved canary crafting method for one-run privacy auditing

The paper proposes a one-run privacy auditing canary crafting method that combines influence-function greedy initialization with bilevel optimization to reduce canary interference; experiments report stronger privacy leakage estimates than existing canary crafting approaches, but the abstract does not disclose exact cost figures.

#Safety#Interpretability#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: the mechanism is concrete and privacy auditing has practitioner value. The arXiv paper is narrow, and the summary lacks cost numbers or reproducibility details, so it stays in all.

editor take

One-run auditing gets a cleaner canary recipe here; cost numbers are undisclosed, so don't treat stronger leakage as settled.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher

The paper proposes FA-OPD, which co-trains a Flow Matching teacher and a lightweight MLP student, using reward and action channels on student rollouts, and reports stronger results than strong baselines across six robot navigation, manipulation, and locomotion benchmarks under noisy or limited demonstrations.

#Robotics#Fine-tuning#Agent#Research release

why featured

HKR-K has a concrete mechanism and 6 robotics benchmarks; HKR-R connects to lightweight deployment. HKR-H is weak, and the post lacks margins, code, or real-robot results, so it stays in the regular research band.

editor take

FA-OPD beats strong baselines on 6 robotics benchmarks; the useful trick is reward plus action signals on student rollouts.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Skipping the Zeros in Diffusion Models for Sparse Data Generation

The paper proposes Sparsity-Exploiting Diffusion, which models only non-zero values and skips zero entries during training and inference, matching or surpassing conventional diffusion models and domain-specific baselines across physics and biology benchmarks.

#Multimodal#Inference-opt#Benchmarking#Research release

why featured

HKR-K is solid: Sparsity-Exploiting Diffusion gives a testable mechanism and claims parity or gains on physics and biology benchmarks. Missing speed numbers, sparsity rates, and artifacts keep it in all, not featured.

editor take

SED models only nonzero values and skips zeros; no speedup number is disclosed, so don’t treat it as a general DM replacement.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Yes, Q-learning Helps Offline In-Context RL

The paper tests offline ICRL on more than 150 GridWorld and MuJoCo-derived datasets, where direct RL objectives improve average performance by about 30% over Algorithm Distillation and double AD performance in XLand-MiniGrid.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with 150+ datasets and an ~30% gain over Algorithm Distillation. HKR-H/R are weak: offline ICRL is specialist material and the post gives no product or deployment hook, so it sits in the 60–71 research-signal band.

editor take

Q-learning beats AD by ~30% across 150+ offline ICRL datasets. I buy the direction; show code and seeds.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR compares against 14 baselines on three Amazon Review benchmarks, using MLIM, RLDI, IEID via RQ-VAE, and a two-stage SFT-GRPO pipeline, with 5.8%-8.3% relative HR@10 gains, 7.7%-9.9% NDCG@10 gains, and +24.8% cross-domain generalization improvement over the strongest baseline.

#Multimodal#Reasoning#Fine-tuning#DeepInterestGR

why featured

HKR-K passes because the item gives benchmark counts and relative gains. HKR-H/R are weak: this is a niche arXiv recommender paper with no production replacement, release artifact, or broader practitioner debate disclosed.

editor take

DeepInterestGR beats 14 baselines on 3 Amazon sets; 5.8%-9.9% ranking gains are fine, +24.8% cross-domain needs replication.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→PHALAR: Phasors for Learned Musical Audio Representations

PHALAR improves stem retrieval accuracy by up to about 70% over the state of the art, uses less than half the parameters, and trains 7× faster with Learned Spectral Pooling and a complex-valued head.

#Audio#Embedding#Benchmarking#PHALAR

why featured

HKR-K passes on concrete benchmark and efficiency numbers. The topic is niche music-audio representation research, so HKR-H/R are weak and the item fits all rather than featured.

editor take

PHALAR lifts stem retrieval accuracy by up to ~70%; for music embeddings, phase-aware inductive bias beats another oversized encoder.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Vital Trace uses four coordinated agents and compact persistent patient-state memory for future ICU risk prediction, with evaluation on MIMIC-IV and eICU across vasopressor-support, respiratory-support, renal-support, and deterioration tasks.

#Agent#Reasoning#Memory#Vital Trace

why featured

HKR-K passes via the 4-agent architecture, patient-state memory, and MIMIC-IV/eICU setup. HKR-H/R stay weak because gains and deployment conditions are not disclosed.

editor take

Vital Trace uses 4 agents for ICU risk prediction; no AUROC shown, so I read it as a constraints test.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→JLT: Clean-Latent Prediction Method in Latent Diffusion Transformers

JLT compares clean-latent prediction against velocity prediction using a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and reports FID-50K 2.50 on ImageNet 256×256 with classifier-free guidance under matched representation, backbone, and training settings.

#Vision#Benchmarking#JLT#FLUX.2

why featured

HKR-K passes with model size, objective comparison, and FID; HKR-H/R are weak. This is a niche research benchmark without product or production-pipeline impact, so it stays in all.

editor take

JLT-B/1 reports FID-50K 2.50 on ImageNet 256×256; matched-target gaps make v-pred look less default-safe.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks

The paper tests ensembles of networks on independently noise-perturbed training sets and finds representational alignment changes monotonically with SNR, changes non-monotonically with sample size, and reaches its minimum near the interpolation threshold.

#Interpretability#Benchmarking#Research release

why featured

HKR-K passes: the paper states testable links between representational alignment, SNR, sample size, and interpolation threshold. HKR-H/R are weak, with only arXiv-level detail and no code, scale, or product angle.

editor take

This paper finds alignment bottoms near the interpolation threshold; using representation alignment as a generalization proxy looks risky.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World introduces SeqΔ-REPA to align latent actions with temporal feature differences from a frozen self-supervised video encoder, then pretrains action-conditioned video world models on passive video; the abstract reports stronger zero-shot transfer and more data-efficient adaptation, but does not disclose dataset scale or benchmark scores.

#Robotics#Vision#Benchmarking#Olaf-World

why featured

HKR-K passes because the mechanism is concrete and testable for robotics world-model work. HKR-H and HKR-R are weak, and the post omits data scale and benchmark scores, so it stays at the low end of interesting.

editor take

Olaf-World aligns latent actions with SeqΔ-REPA, but gives no scale or scores; I don't buy “extensive experiments” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Identifiable Token Correspondence for World Models

The paper introduces Identifiable Token Correspondence, a decoding step that frames next-frame prediction as structured assignment, and reports state-of-the-art results on 4 benchmarks; on Craftax-classic, ITC reaches a 72.5% return and a 35.6% score versus prior bests of 67.4% and 27.9%.

#Reasoning#Robotics#Benchmarking#SNU MLLAB

why featured

HKR-K passes with a new mechanism and checkable numbers. HKR-H/R are weak: this is a single arXiv world-model paper with no product impact or broad practitioner trigger yet.

editor take

ITC hits SOTA on 4 benchmarks; a decode-only patch is exactly the kind of low-friction world-model fix people adopt.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Constructing Industrial-Scale Optimization Modeling Benchmark

The paper introduces MIPLIB-NL, a benchmark built from real mixed-integer linear programs in MIPLIB 2017, with 223 one-to-one reconstructions for evaluating natural-language-to-optimization formulation and solver-code generation.

#Code#Benchmarking#MIPLIB 2017#MIPLIB-NL

why featured

HKR-K passes with 223 samples and a clear NL-to-optimization-model/code evaluation setup. HKR-H/R are weak, and the operations-research barrier keeps it in the upper low-value band.

editor take

MIPLIB-NL ships 223 real MILP reconstructions; I buy this direction, toy benchmarks need industrial constraints to embarrass them.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

Argus detects backdoors in decentralized learning without a central coordinator or prior trigger knowledge, evaluates on three standard datasets against three state-of-the-art baselines, reduces attack success rates by up to 90 percentage points versus no defense, and keeps model utility within 5 percentage points of an omniscient oracle.

#Safety#Alignment#Argus#Research release

why featured

HKR-K/R pass thanks to concrete conditions and a 90 pp ASR reduction. The topic is specialized backdoor detection in decentralized learning, so it stays in the lower research band.

editor take

Argus cuts ASR by up to 90 points on 3 datasets; neighbor-consistency is clever, but Sybil resilience is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

ParsVoice releases a 2,200-hour Persian TTS-ready subset with 1.36 million aligned segments and 1,815 automatically identified speaker IDs, over 25 times larger than the previous largest open Persian TTS dataset.

#Audio#Fine-tuning#ParsVoice#ParsBERT

why featured

HKR-K passes because the corpus size and speaker count are concrete. HKR-H and HKR-R are weak: this is a niche speech dataset, with no product, model-capability, or competitive industry hook.

editor take

ParsVoice ships 2,200 hours for Persian TTS; MOS 3.6 is modest, but low-resource speech first needs scale.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Neural Bayesian Sequential Routing

Yongchao Huang introduces NBSR, modeling neural inference as active evidence accumulation over a hierarchical DAG; the 71-page paper specifies Dirichlet-Categorical updates, Gumbel-Softmax Straight-Through routing, entropy-based early exits, and OOD abstention mechanisms.

#Reasoning#Agent#Interpretability#Yongchao Huang

why featured

HKR-K passes on concrete routing mechanisms; HKR-H and HKR-R are weak. This is a single arXiv research release with no disclosed benchmark result, code, or production replacement claim.

editor take

NBSR spends 71 pages on Bayesian evidence routing; I don’t buy the broad eval claims without code and strong baselines.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
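Two of the NBSR ingredients named above, Dirichlet-Categorical updates and entropy-based early exit, can be sketched in a few lines. This is a generic illustration of those two textbook pieces, not the paper's architecture: the hierarchical DAG and the Gumbel-Softmax Straight-Through routing are left out, and `route_with_early_exit` with its `threshold` parameter is a hypothetical name chosen for this example.

```python
import math


def dirichlet_update(alpha, counts):
    """Conjugate update: add observed category counts to the concentrations."""
    return [a + c for a, c in zip(alpha, counts)]


def posterior_mean(alpha):
    """Expected categorical probabilities under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_with_early_exit(alpha, evidence_stream, threshold):
    """Accumulate evidence until posterior-mean entropy drops below threshold.

    Returns the final posterior mean and the number of evidence steps used.
    """
    steps = 0
    for counts in evidence_stream:
        alpha = dirichlet_update(alpha, counts)
        steps += 1
        if entropy(posterior_mean(alpha)) < threshold:
            break
    return posterior_mean(alpha), steps
```

The same entropy value could drive abstention in the other direction: if it stays high after all evidence is consumed, the caller can decline to answer. How NBSR actually sets these thresholds is not stated in the item.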

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

The paper adapts CD-T for counterfactual-free circuit discovery and tests CT-SFT on NusaX and XNLI, restricting updates to task-relevant attention heads and LayerNorm; the abstract does not disclose model sizes or exact scores.

#Interpretability#Fine-tuning#Alignment#arXiv

why featured

HKR-K passes with a testable mechanism and NusaX/XNLI setup. HKR-H/R are weak, and missing model size plus scores keeps this as niche research below featured.

editor take

CT-SFT updates only relevant heads and LayerNorm; exact scores are undisclosed, so the forgetting claim stays provisional.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

The paper applies REPA at inference time to align diffusion or flow-model representations with a DINOv2 encoder, and reports better reconstruction quality across 4 inverse-problem settings: super-resolution, box inpainting, Gaussian deblurring, and motion deblurring.

#Vision#Inference-opt#DINOv2#Research release

why featured

HKR-K passes: the paper adds an inference-time REPA+DINOv2 alignment method and tests four restoration tasks. HKR-H/R are weak, and no quantitative gains are disclosed, so this stays a low-value research update.

editor take

REPA plugs DINOv2 alignment into inference across 4 inverse tasks; the useful claim is fewer steps, but no reduction figure is disclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

The paper proposes Superpixel Transformers, a framework that unifies superpixel-based image classification with ViTs, and tests it on CIFAR10, FashionMNIST, and Imagenette under multiple superpixel generation and graph connectivity strategies.

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the title has a superpixel-vs-ViT-patch hook and the post gives a framework plus three datasets. HKR-R fails because this is niche vision-classification research with no product or industry impact shown.

editor take

SPT beats superpixel GNNs on 3 small datasets; no ImageNet result disclosed, so don’t crown it a ViT replacement.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→PILOT: Data-Free Continual Learning for Real-Time Semantic Segmentation

PILOT adds a parallel D-branch to PIDNet, trains only on new-class data, and freezes the original segmentation network so real-time semantic segmentation can add novel classes while preserving base-class mIoU.

#Vision#Fine-tuning#Inference-opt#PILOT

why featured

HKR-K passes on a concrete continual-learning mechanism, but the post gives no metrics, artifact, or product impact. HKR-H and HKR-R are weak, so this stays a niche CV research item.

editor take

PILOT freezes PIDNet and trains only a D-branch; no mIoU or latency numbers are disclosed, so hold the victory lap.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Not All Tokens Matter Equally: Dynamic In-context Vector Distillation for Long-form Medical Reports

DIVE tests a frozen-backbone distillation framework on MIMIC-CXR and CheXpert Plus with two medical VLM backbones, upweighting pathology-related tokens and EOS loss while using hidden-state-dependent adapters, and reports the best BLEU-4, ROUGE-L, and RadGraph F1 across all dataset-backbone settings.

#Multimodal#Fine-tuning#Vision#arXiv

why featured

HKR-K passes because DIVE has a concrete training mechanism and evaluations on MIMIC-CXR, CheXpert Plus, and two backbones. HKR-H/R are weak: this is a vertical medical VLM paper, not a product or practitioner-wide shift.

editor take

DIVE wins across 2 datasets and 2 backbones; RadGraph is still a proxy, and clinical usability is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Normal Guidance is what Attention Needs

The paper proposes Normal Guidance, a regularization method that shapes attention into a bell curve and improves MIL slice-level localization across three medical imaging datasets totaling over 4 million 2D slices, while remaining competitive on whole-scan classification.

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-K lands through a concrete method and scale claim: Normal Guidance across 3 datasets and 4M+ slices. HKR-H/R are weak because this is narrow medical-vision MIL research, not a broad model or product update.

editor take

Normal Guidance wins localization on 3 datasets and 4M+ slices; medical MIL should admit that position priors beat attention mysticism.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Multimodal framework predicts respiratory failure in ICU patients using chest X-rays and EHR data

The study evaluated a gated multimodal framework for predicting invasive mechanical ventilation within 24 hours in ICU patients, using EHR time-series data plus CXR foundation-model representations; AUROC reached 0.860 with REMEDIS and 0.858 with MedInsight, versus 0.752 for the EHR-only Vent.io baseline.

#Multimodal#Vision#Benchmarking#REMEDIS

why featured

HKR-K passes on concrete AUROC and modality comparison. HKR-H/R are weak, and the clinical vertical lacks product or broader model implications, so this stays in the low-to-mid research-signal band.

editor take

REMEDIS+EHR hits 0.860 AUROC for 24-hour ventilation prediction; the gate’s CXR rejection logic matters more than the lift.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→TED: Related Party Transaction Guided Tax Evasion Detection on Heterogeneous Graphs

The paper proposes TED, a heterogeneous graph neural network for tax evasion detection, using related-party transaction groups to filter noise and hierarchical attention to capture structure and semantics; it evaluates the method in a tax bureau risk-management system on two human-labeled real-world tax datasets.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via a concrete mechanism and 2 human-labeled datasets. HKR-H/R are weak because this is a narrow tax-risk GNN paper, not a broad model, agent, or product update.

editor take

TED reports two human-labeled tax datasets, but no sizes or metrics; I’d treat it as vertical risk-graph plumbing for now.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→CoAD framework for time series anomaly detection using cooperative classification and reconstruction

The paper proposes CoAD, a time-series anomaly detection framework that uses a classification module to generate probability-informed soft masks for a reconstruction module; the abstract says experiments on benchmark datasets beat SOTA deep learning and traditional methods, but the post does not disclose specific scores, datasets, or speed numbers.

#Benchmarking#CoAD#arXiv#Research release

why featured

HKR-K passes: CoAD links classifier soft masks to a reconstruction module. HKR-H/R are weak; the summary claims multi-benchmark SOTA gains but gives no effect sizes or dataset details.

editor take

CoAD feeds classifier soft masks into reconstruction; no scores, datasets, or latency disclosed, so treat “SOTA and faster” as abstract-grade.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

The paper proposes a two-stage adapter that embeds tabular foundation model predictions inside a utility-maximization framework, recovering up to 13 percentage points of accuracy over a standard logit model on two transportation datasets while maintaining monotonic price-demand relationships and analytically computable trade-off measures.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with a concrete mechanism and testable result; HKR-H/R are weak because the topic is niche econometrics rather than a broad AI product or model-competition story.

editor take

Two-stage adapters gain up to 13 points over a standard logit on 2 transport sets; for policy tabular FMs, monotonicity beats leaderboard accuracy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
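The "economic validity" properties mentioned in the discrete-choice item above are easiest to see in the standard multinomial logit that serves as the baseline. The sketch below is that textbook model only, not the paper's two-stage adapter; the attribute names `price` and `time` and the coefficient values in any usage are assumptions for illustration.

```python
import math


def choice_probabilities(alternatives, beta_price, beta_time):
    """Multinomial logit: softmax over linear utilities.

    A negative price coefficient guarantees that raising one alternative's
    price lowers its own choice probability (monotonic price-demand).
    """
    if beta_price >= 0:
        raise ValueError("price coefficient must be negative for valid demand")
    utils = [beta_price * alt["price"] + beta_time * alt["time"] for alt in alternatives]
    shift = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - shift) for u in utils]
    norm = sum(exps)
    return [e / norm for e in exps]


def value_of_time(beta_price, beta_time):
    """Trade-off measure: money a traveler would pay to save one unit of time."""
    return beta_time / beta_price
```

A black-box tabular model can score higher on accuracy yet offer neither guarantee: nothing forces its demand curve to slope downward, and there is no coefficient ratio to read a trade-off from.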

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

The paper introduces the SVBCX chest X-ray dataset and a graph-transformer ensemble architecture for silicosis and pneumonia classification, reporting a 0.9749 macro-F1 score and per-class AUROC scores above 0.99 on its constructed dataset.

#Vision#Multimodal#Benchmarking#Research release

why featured

HKR-K passes via a new dataset, model mechanism, and testable metrics. HKR-H/R are weak: this is narrow medical-imaging classification with no product, deployment, or broader industry signal.

editor take

SVBCX ensemble reports 0.9749 macro-F1; with no external validation disclosed, treat this as in-dataset medical imaging optimism.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Towards Interpretable Federated Learning

arXiv:2302.13473v2 presents a survey on interpretable federated learning, covering mechanisms for prediction explanation, model debugging, and attribution of contributions from individual data owners or samples.

#Interpretability#Research release

why featured

HKR-K passes because the post gives a three-part IFL survey frame; HKR-H/R fail due to a dry survey angle and weak practitioner resonance. It is specialized research, not a hard-exclusion case, so it stays in all.

editor take

arXiv:2302.13473v2 splits IFL into 3 buckets; finance and healthcare need attribution, not just prediction explanations.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Probabilistic Recurrent Intention Switching Model

PRISM maps observation history to per-step intention distributions with a lightweight recurrent network, proves an EM decomposition into independent closed-form reward subproblems, and reports an O(nK) E-step across a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation.

#Robotics#Reasoning#Benchmarking#arXiv

why featured

HKR-K passes on a concrete mechanism, complexity claim, and eval datasets; HKR-H/R fail because the angle is academic and narrow. No hard exclusion, but it stays in the low-value research band at 50.

editor take

PRISM gets IRL intention switching to an O(nK) E-step; I care whether BridgeData V2 gains are only log-likelihood.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

The paper proves global convergence for WPG in entropy-regularized RL under a uniform log-Sobolev inequality, using Bellman residual KL representation, contraction, and a resolvent identity to obtain geometric contraction up to discretization bias.

#Reasoning#Research release

why featured

Hard-exclusion-technical-accessibility applies: WPG, log-Sobolev conditions, and discretization bias require deep math with no product on-ramp. HKR-K passes on theorem details, but HKR-H/R fail, so it is capped below 40.

editor take

WPG gets geometric contraction up to discretization bias; the catch is uniform LSI, so don't read this as tuning-free RL.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Uniboost proposes posterior value alignment and independent linear boosting for traffic allocation in recommendation re-ranking, and validates the framework with online A/B tests, while the abstract does not disclose sample size, traffic scale, baseline names, or quantitative lift.

#Alignment#Uniboost#Research release

why featured

HKR-K passes on concrete mechanisms and an online A/B-test claim; HKR-H/R are weak, and sample size plus uplift are not disclosed. This is narrow technical research, so it stays in all.

editor take

Uniboost reports online A/B tests, but no sample size, baselines, or lift; treat it as re-ranking ops, not alignment.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→PDEInvBench: Benchmark Dataset and Neural Network Design Space for PDE Inverse Problems

PDEInvBench introduces a benchmark dataset for PDE inverse problems, covering time-dependent and time-independent PDE simulations with in-distribution and multiple out-of-distribution evaluation splits, and reports that two-stage training with supervised initialization plus test-time PDE residual fine-tuning performs best.

#Benchmarking#Fine-tuning#PDEInvBench#Research release

why featured

Triggers hard-exclusion-1: PDE inverse problems are deep numerical methods with no product or agent on-ramp for general AI practitioners. HKR-K passes, HKR-H/R fail, so the score is capped below 40.

editor take

PDEInvBench lands as a 37-page benchmark; two-stage training and PDE-derivative inputs beat blind parameter scaling.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

The paper proposes PyCAT4 for 3D human pose estimation, adding a self-attention feature layer, temporal feature fusion, and spatial pyramid multi-scale fusion, with validation on two datasets, COCO and 3DPW; the snippet does not disclose metric values or baseline comparisons.

#Vision#Multimodal#Benchmarking#PyCAT4

why featured

HKR-K passes on named mechanisms and datasets, but HKR-H and HKR-R are weak. This is a narrow vision-paper abstract with no disclosed metric gains or reproducible setup, so it stays in the lower research-release band.

editor take

PyCAT4 names COCO and 3DPW, but omits metrics and baselines; treat the “significant gains” claim as unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→High-Quality Synthetic Financial Time-Series Using a GAN-Diffusion Framework

The paper presents a CoMeTS-GAN and diffusion framework that uses the GAN Critic to guide generation, jointly producing mid-price and volume time series for correlated stocks while explicitly modeling inter-asset correlations.

#Benchmarking#CoMeTS-GAN#Research release

why featured

HKR-K passes on the GAN-Critic-guided diffusion mechanism, but HKR-H and HKR-R are weak. The post discloses no open-source artifact, benchmark delta, or production replacement claim, so it stays in the low-value research band.

editor take

CoMeTS-GAN guides diffusion with a Critic for price-volume series; no dataset or metrics disclosed, so “high-quality” stays unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→MATT-CTR: Model-Agnostic Test-Time Paradigm for CTR Prediction with Confidence-Guided Inference Paths

MATT-CTR proposes a model-agnostic test-time paradigm for CTR prediction that uses confidence scores of feature combinations to sample multiple inference paths; the abstract says offline experiments and online A/B tests validate effectiveness, but the post does not disclose specific metrics or datasets.

#Inference-opt#Research release

why featured

Narrow CTR research; HKR-K passes on the confidence-guided multi-path mechanism, while HKR-H/R miss. No A/B numbers or deployment conditions are given, so it stays in the 40–59 low-value band.

editor take

MATT-CTR moves CTR gains into inference; A/B metrics are undisclosed, so I read it as a low-frequency feature patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures

The paper reproduces AOC-IDS on UNSW-NB15 at 89.39% accuracy versus the published 89.19%, then raises accuracy to 95.45% with XGBoost-BalSamp; its combined PseudoFilter, MixupAug, and LiteAE approach reaches 90.88% best-run accuracy with 91.45% F1 and 55% fewer parameters.

#Fine-tuning#Inference-opt#Benchmarking#IEEE INFOCOM

why featured

HKR-K passes on concrete benchmark and parameter-reduction numbers. HKR-H/R are weak because this is narrow security-ML research, not a broad AI product or agent story.

editor take

XGBoost-BalSamp hits 95.45% on UNSW-NB15; I trust the benchmark gain more than the IoT deployment story.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

13d ago

arXiv · cs.LG· atomEN04:00 · 05·27

→SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection

SilIF clusters per-tree path-length fingerprints and adds a silhouette score to Isolation Forest; on the IEEE-CIS benchmark with about 590K transactions and 3.5% fraud, alpha=1.0 improves AUC-PR by +0.0080 on average across five seeds, while the Sparkov synthetic credit-card dataset shows no gain over plain IF.

#Benchmarking#Venkatakrishnan Gopalakrishnan#arXiv#Research release

why featured

HKR-K passes on a concrete method and IEEE-CIS result. HKR-H and HKR-R are weak; the topic is classic anomaly detection for fraud rather than the LLM/agent mainstream, so it stays low-tier all.

editor take

SilIF adds only +0.0080 AUC-PR on IEEE-CIS; Sparkov shows zero gain, so I’d file it as an IF patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
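The silhouette score that SilIF adds to Isolation Forest is a standard clustering quantity, sketched below in plain Python. The inputs stand in for per-tree path-length fingerprints that have already been clustered; how SilIF builds those fingerprints and how `alpha` blends the silhouette term into the final anomaly score are not described in the item, so neither is reproduced here.

```python
import math


def silhouette_samples(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b).

    a: mean distance to the other members of the sample's own cluster.
    b: smallest mean distance to the members of any other cluster.
    Requires at least two distinct labels. Singleton clusters score 0.0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores
```

Values near 1 mean a sample sits firmly inside its cluster, and negative values mean it lies closer to another cluster. That is the kind of signal a fraud detector could plausibly use to flag transactions whose fingerprints fit no group well.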

03:43

13d ago

HuggingFace Papers (takara mirror)· rssEN03:43 · 05·27

→OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

OphIn-Engine constructs OphIn-500K from over 29,000 ophthalmology video clips; the dataset contains more than 500,000 instruction instances and over 151,000 unique images in VQA, multi-turn dialogue, and CoT reasoning formats.

#Multimodal#Vision#Fine-tuning#OphIn-500K

why featured

HKR-K is solid: the post gives dataset scale and task mix. HKR-H/R are weak because it is a niche ophthalmology dataset with no product, open weights, or competitive stakes disclosed.

editor take

OphIn-500K packs 500K instructions and 151K images; video-mined ophthalmology data is useful, but SOTA claims need blind tests.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

03:18

13d ago

HuggingFace Papers (takara mirror)· rssEN03:18 · 05·27

→Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

The paper proposes a unified incomplete video-language model for modality-missing inputs such as unavailable cameras; the snippet says it works as a plug-and-play module for prior VLMs, but the post does not disclose experiment counts or benchmark numbers.

#Multimodal#Vision#Safety#Research release

why featured

HKR-K passes: missing modalities are a real multimodal-system problem, and the post claims a plug-in module. HKR-H/R are weak, and experiment scale is not disclosed, so this stays in the lower research-release band.

editor take

The paper targets missing-modality VLMs, but discloses no benchmark counts or scores; treat “plug-and-play” as unproven until sensor-drop tests land.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

03:17

13d ago

HuggingFace Papers (takara mirror)· rssEN03:17 · 05·27

→SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

SIGMA adapts Vision Foundation Models with scale-adaptive fusion and semantic modulation. It uses 1.72% trainable parameters relative to the VFM backbone, and the paper reports consistent gains over state-of-the-art PEFT methods across dense prediction tasks and multiple VFM backbones.

#Vision#Fine-tuning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass, but this is a narrow vision-adaptation paper. The body gives parameter share and task scope, not code, benchmark gains, or adoption evidence, so it stays in all.

editor take

SIGMA trains 1.72% of backbone parameters; dense-prediction PEFT keeps chasing adapters, but “consistent SOTA” needs tables.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

03:17

13d ago

HuggingFace Papers (takara mirror)· rssEN03:17 · 05·27

→FedEHR-Gen Generates Synthetic Time-Series EHR Across Federated Hospitals

FedEHR-Gen generates synthetic time-series EHR across distributed hospitals with a two-stage federated framework, using a federated autoencoder for aligned latent spaces and a federated TCVAE with distribution-aware aggregation, and reports centralized-training-level fidelity, downstream utility, and privacy risk on eICU and MIMIC-III.

#Fine-tuning#Alignment#FedEHR-Gen#eICU

why featured

HKR-K passes: the method, datasets, and near-centralized-training claim are concrete. HKR-H/R are weak, and synthetic EHR generation is a vertical research item, so it stays in all.

editor take

FedEHR-Gen nears centralized training on eICU and MIMIC-III; hospital count is undisclosed, so deployment claims need external-site proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

02:56

13d ago

HuggingFace Papers (takara mirror)· rssEN02:56 · 05·27

→Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

The authors propose FAX, a framework that decomposes draft explanations into claims and verifies them against faithful tools, raising simulation faithfulness on CRAFTER-XAI-Bench from 0.20 for the strongest baseline to 0.46 while preserving informativeness, relevance, and fluency.

#Agent#Interpretability#Benchmarking#Research release

why featured

HKR-K is strong via a concrete mechanism and 0.20→0.46 benchmark gain; HKR-R fits agent trust concerns. As a single academic paper without adoption or broad debate, it stays in 60–71.

editor take

FAX lifts simulation faithfulness from 0.20 to 0.46; Agentic XAI without verification is just hallucination with nicer prose.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

02:26

13d ago

HuggingFace Papers (takara mirror)· rssEN02:26 · 05·27

→GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

GRADE evaluates 120 configurations across five open-source language models, covering zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations for assessing AI tutor responses in student-tutor dialogues.

#Reasoning #Fine-tuning #Benchmarking #GRADE

why featured

HKR-K passes on a concrete eval setup: 5 OSS models and 120 configurations. HKR-H/R miss because the post gives no surprising result, product impact, or broad practitioner nerve.

editor take

GRADE tests 120 configs across 5 OSS models; I buy the Gemma3 result, but not costly CoT as a tutor-quality evaluator.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

H0·K1·R0

02:06

13d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 02:06 · 05·27

→TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

TCP-MCP searches agent prompts and communication topologies as one genome, uses a DeepSeek-V3.2 backbone, and reports 82.66% accuracy on MMLU-Pro, 89.96% on MMLU, and 96.61% on GSM8K while using up to 5.69x fewer tokens than debate-style systems at the reported operating points.

#Agent #Reasoning #Benchmarking #DeepSeek

why featured

HKR-H/K/R all pass: TCP-MCP offers a joint-search mechanism, benchmark numbers, and a token-cost claim. It is practical multi-agent research, not a major product or framework release, so it lands in the 78–84 band.

editor take

Stop hand-wiring agent graphs; TCP-MCP’s 5.69x token cut is the part that actually hurts debate-style systems.

sharp

TCP-MCP hits the dirtiest part of multi-agent engineering: prompts and communication edges are tuned separately, then everyone pretends the graph was designed. Here they search both as one genome, using the same DeepSeek-V3.2 backbone, and report 82.66% on MMLU-Pro, 89.96% on MMLU, and 96.61% on GSM8K. The sharp number is token use: up to 5.69x fewer tokens than debate-style systems at the reported operating points. I buy the direction; I don’t buy a victory lap yet. MMLU and GSM8K are friendly to Pareto-front search because the task surface is static. Production agent systems fail on tool errors, state drift, and asynchronous dependencies, not because the graph lacks elegance. AutoGen and CrewAI users already learned that a neat topology can rot fast once real tools enter the loop. TCP-MCP needs cross-task reuse, not another benchmark win.
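
The idea of searching prompts and communication edges as one genome can be sketched roughly as follows. Everything here is a hypothetical illustration: the `Genome` layout, the `PROMPT_POOL` entries, and the single-gene `mutate` rule are assumptions, not TCP-MCP's actual encoding or search operators, and the landscape-guided fitness evaluation is left out entirely.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Hypothetical pool of role prompts; the paper's prompt space is not given here.
PROMPT_POOL = (
    "Solve the problem step by step.",
    "Critique the previous answer.",
    "Summarize the discussion and decide.",
)


@dataclass(frozen=True)
class Genome:
    """One candidate system: a prompt choice per agent plus a directed graph."""
    prompts: Tuple[int, ...]              # prompts[i] indexes PROMPT_POOL for agent i
    edges: Tuple[Tuple[bool, ...], ...]   # edges[i][j] is True if agent i sends to agent j


def random_genome(n_agents: int, rng: random.Random) -> Genome:
    """Sample prompts and edges together; self-loops are never created."""
    prompts = tuple(rng.randrange(len(PROMPT_POOL)) for _ in range(n_agents))
    edges = tuple(
        tuple(i != j and rng.random() < 0.5 for j in range(n_agents))
        for i in range(n_agents)
    )
    return Genome(prompts, edges)


def mutate(genome: Genome, rng: random.Random) -> Genome:
    """Change exactly one gene: one agent's prompt or one directed edge.

    Requires at least two agents so that an edge (i, j) with i != j exists.
    """
    n = len(genome.prompts)
    prompts = list(genome.prompts)
    edges = [list(row) for row in genome.edges]
    if rng.random() < 0.5:
        i = rng.randrange(n)
        # A non-zero offset guarantees the new prompt differs from the old one.
        offset = rng.randrange(1, len(PROMPT_POOL))
        prompts[i] = (prompts[i] + offset) % len(PROMPT_POOL)
    else:
        i, j = rng.sample(range(n), 2)  # distinct indices, so no self-loop
        edges[i][j] = not edges[i][j]
    return Genome(tuple(prompts), tuple(tuple(row) for row in edges))
```

Because prompts and edges live in the same object, one mutation step can move either kind of gene, which is the point of a joint search as opposed to tuning prompts first and wiring the graph afterwards.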

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

01:56

13d ago

HuggingFace Papers (takara mirror) · rss · EN · 01:56 · 05·27

→LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

LoSATok compresses 1280-dimensional semantic encoder features into 128 dimensions and uses a time-relation loss for temporal consistency; experiments cover speech, music, and general audio, and the authors provide code on GitHub.

#Audio #Multimodal #Inference-opt #LoSATok

why featured

HKR-K is solid: LoSATok gives a compression ratio, loss design, domains, and open code. HKR-R is limited to audio/multimodal builders, and HKR-H is weak, so this stays in all.

editor take

LoSATok cuts semantic features from 1280D to 128D; audio generation pressure shifts back to tokenizer design, not bigger DiTs.

HKR breakdown

hook —knowledge ✓resonance ✓

SCORE

H0·K1·R1

01:30

13d ago

HuggingFace Papers (takara mirror) · rss · EN · 01:30 · 05·27

→Revealing Algorithmic Deductive Circuits for Logical Reasoning

The study uses symbolic-aided CoT prompting and causal mediation analysis to localize reasoning attention heads, finding that about 3% of total heads retrieve factual and rule-based information while higher layers integrate graph-traversal strategies.

#Reasoning #Interpretability #Research release

why featured

HKR-K/R pass: the 3% head finding and the high-layer graph-traversal mechanism add signal. The lack of model names, datasets, code, and product impact keeps it an interesting research item rather than a featured one.

editor take

The paper pins sub-reasoning retrieval on ~3% of heads; useful interpretability, but models and sample scope remain undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

SCORE

H0·K1·R1

01:30

13d ago

HuggingFace Papers (takara mirror) · rss · EN · 01:30 · 05·27

→Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security

The APD framework identifies and neutralizes malicious prompt components before LLM processing, combining mutual-information semantic decomposition, graph-based intent classification, and a lightweight transformer classifier to reduce harmful output generation by over 85%.

#Safety #Alignment #Benchmarking #Research release

why featured

HKR-H/K/R all pass: the paper has a concrete defense mechanism and a >85% harmful-output claim. Single-source paper coverage lacks author authority, benchmark detail, and reproducibility conditions, so it stays in the 60–71 band.

editor take

APD claims over 85% harmful-output reduction, but no baseline or attack set is disclosed; treat it as a reproducibility test.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

01:06

13d ago

HuggingFace Papers (takara mirror) · rss · EN · 01:06 · 05·27

→Constrained Auto-Bidding via Generative Response Modeling

The paper proposes GRM for constrained auto-bidding, shifting learning from actions to responses and predicting future traffic plus horizon-level cost/value curves under one bid multiplier. An analytic controller enforces each active constraint with 1D root-finding, and AuctionNet experiments report better constraint stability and overall score than baselines.
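
The constraint-enforcement step can be illustrated with a minimal sketch. It assumes a predicted horizon-level cost curve that is nondecreasing in the bid multiplier. The toy curve `predicted_cost`, the budget value, the search interval, and the function `solve_multiplier` are hypothetical stand-ins rather than the paper's model or controller, and bisection is used here only as one simple choice of 1D root-finding.

```python
from typing import Callable


def solve_multiplier(
    cost_curve: Callable[[float], float],
    budget: float,
    lo: float = 0.0,
    hi: float = 10.0,
    tol: float = 1e-6,
    max_iter: int = 100,
) -> float:
    """Return the largest multiplier in [lo, hi] whose predicted cost fits the budget.

    Assumes cost_curve is nondecreasing on [lo, hi].
    """
    if cost_curve(hi) <= budget:
        return hi  # constraint is inactive: even the largest bid fits
    if cost_curve(lo) > budget:
        return lo  # infeasible: fall back to the smallest multiplier
    # Invariant: cost_curve(lo) <= budget < cost_curve(hi).
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cost_curve(mid) <= budget:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return lo  # the feasible side of the bracket


def predicted_cost(multiplier: float) -> float:
    """Hypothetical toy response curve: spend grows quadratically with the bid."""
    return 100.0 * multiplier ** 2


multiplier = solve_multiplier(predicted_cost, budget=400.0)
```

With several active constraints, one natural extension of this sketch is to solve each constraint separately and take the smallest resulting multiplier, since every curve is assumed monotone in the same single variable.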

#Benchmarking #Research release #Benchmark

why featured

HKR-K passes: GRM reframes auto-bidding from action learning to response-curve prediction and applies 1D root finding for constraints. The ad-optimization niche keeps HKR-H and HKR-R weak, so this stays in all.

editor take

GRM swaps action learning for response prediction, using one multiplier plus 1D root-finding; AuctionNet wins, but live auction drift is the test.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

H0·K1·R0