papers · 2026-05-21

▸ 240 papers · updated 3m ago

May 2026

MTWTFSS

1150 2 39 4190 5295 6173 7203 8283 9 10 11296 12541 13284 14248 15209 16 17 18224 19490 20273 21240 2223 23185 24 25200 26431 27258 28251 29257 30 31

June 2026

MTWTFSS

1261 2473 3215 4239 57 6 7 8173 9377101112131415161718192021222324252627282930

2026-05-21 · Thu

21:00

18d ago

HuggingFace Papers (takara mirror)· rssEN21:00 · 05·21

→Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

The paper evaluates multilingual sparse autoencoders on LLaMA-3.1-8B and Gemma-2-9B, using an intersection of multilingual alignment and language separability to choose steering layers, then tests machine translation and CrossSumm with SpBLEU, ROUGE-L, COMET, and LaSE; the reported result is more stable language identification accuracy versus generation quality without exhaustive layerwise search.

#Interpretability#Multimodal#Reasoning#LLaMA

why featured

Only HKR-K lands: the post gives a concrete multilingual SAE layer-selection rule, but HKR-H is dry and HKR-R is narrow. No hard exclusion; this fits the lower end of research-release signal.

editor take

LLaMA-3.1-8B and Gemma-2-9B get multilingual SAEs; useful layer-search shortcut, but gains are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:53

18d ago

arXiv · cs.AI· atomEN17:53 · 05·21

→The Matching Principle: Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning

The paper proposes the Matching Principle, which estimates label-preserving deployment nuisance covariance and regularizes the encoder Jacobian along its covered range; 12 of 13 pre-registered experimental blocks pass, including tests up to Qwen2.5-7B, while Office-31 fails under a pre-named eigengap condition.

#Reasoning#Alignment#Benchmarking#Qwen2.5-7B

why featured

hard-exclusion-technical-accessibility applies: the core claim depends on covariance, Jacobians, and geometric loss theory with no generalist on-ramp. Only HKR-K passes, so the item is capped and excluded.

editor take

Rajput folds robustness losses into covariance matching; 12/13 blocks pass, but I’d reproduce TDI before trusting it.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:49

18d ago

arXiv · cs.AI· atomEN17:49 · 05·21

→Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models

The paper proposes a conservative drifting method for one-step generative modeling, replacing displacement velocity with a KDE-gradient velocity, and proves continuous-time finite-particle bounds with a root residual-velocity rate of N^{-1/(d+4)} under an additional h-uniform quadrature regularity condition.

#Reasoning#Research release

why featured

Hard-exclusion-1 applies: this is a KDE-gradient finite-particle convergence proof with no product, model, or reproducible practitioner hook. HKR-K passes only, so it stays excluded.

editor take

The paper proves N^{-1/(d+4)} finite-particle rates for conservative drifting; useful theory, but dimension makes it far from deployable one-step generation.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:48

18d ago

● P1arXiv · cs.AI· atomEN17:48 · 05·21

→MOSS autonomous agent system achieves self-evolution through source-level code rewriting

MOSS raises the four-task mean grader score on OpenClaw from 0.25 to 0.61 in one source-level self-rewriting cycle, with candidate code verified by replaying curated failure batches in ephemeral trial workers before an in-place container swap.

#Agent#Code#Tools#MOSS

why featured

HKR-H/K/R all pass: self-rewriting agents are clickable, the 0.25→0.61 gain is concrete, and runtime self-modification hits agent safety nerves. Single arXiv source keeps it below P1.

editor take

MOSS pushes agent self-evolution into source rewrites, and 0.25→0.61 is eye-catching; four OpenClaw tasks is not proof of production autonomy.

sharp

All 3 entries trace to the same arXiv paper, so the agreement is ingestion overlap, not independent confirmation. MOSS’s sharp move is source-level rewriting: it targets routing, hook order, state invariants, and dispatch, instead of prompts, skill files, memory schemas, or workflow graphs. I buy the problem framing, but not the “production self-evolution” strength yet. The hard number is a four-task OpenClaw mean grader jump from 0.25 to 0.61 in one autonomous cycle, with ephemeral trial workers, replay verification, user-consent promotion, container swap, and rollback probes. That sounds less like an autonomous organism and more like a coding-agent-driven CI/CD loop. The deciding variable is replay-batch coverage, not the headline phrase “rewrites its own source.”

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:44

18d ago

arXiv · cs.AI· atomEN17:44 · 05·21

→Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 separates linear-attention memory editing with channel-wise erase gate b_t and write gate w_t; under a 1.3B-parameter, 100B FineWeb-Edu-token setup, it reports the strongest overall results versus Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants.

#Reasoning#Inference-opt#Memory#NVlabs

why featured

HKR-K is strong and HKR-R is moderate: beating Mamba-2/KDA matters for cheaper long-sequence models. HKR-H is narrow, and the post gives abstract-level facts without code or broad reproduction details.

editor take

Gated DeltaNet-2 trains at 1.3B/100B tokens; splitting erase/write gates makes its RULER gains look like mechanism, not tuning luck.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:42

18d ago

arXiv · cs.AI· atomEN17:42 · 05·21

→LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

LCGuard transforms shared KV caches before transmission in multi-agent LLM systems, treating cache artifacts as latent working memory. The paper defines unsafe sharing through adversarial reconstruction of agent-specific sensitive inputs, and reports lower reconstruction-based leakage and attack success rates across multiple model families and multi-agent benchmarks while keeping competitive task performance versus standard KV-sharing baselines.

#Agent#Safety#Memory#Research release

why featured

HKR-K/R pass: KV-cache leakage and LCGuard’s mitigation are useful for agent safety. The post gives no reduction numbers, model scale, or reproduction details, so it stays in the mid research-release band.

editor take

LCGuard filters shared KV caches; no deltas disclosed, but anchoring multi-agent privacy to adversarial reconstruction is the useful move.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:42

18d ago

FEATUREDarXiv · cs.CL· atomEN17:42 · 05·21

→Evaluating Commercial AI Chatbots as News Intermediaries

The study evaluates six chatbots on 2,100 same-day BBC News questions over 14 days, with the best systems exceeding 90% multiple-choice accuracy and losing 11–13% under free response. Hindi questions score lowest at 79%, and retrieval failures account for over 70% of all errors.

#RAG#Benchmarking#Reasoning#Gemini

why featured

HKR-H/K/R all pass: the paper offers a reproducible 2,100-question, 6-chatbot news benchmark and attributes 70%+ of errors to retrieval failure. Strong RAG reliability signal, not a major model or platform release.

editor take

The 90% MCQ score is the decoy; Hindi at 79% and 70% retrieval-driven errors are where news chatbots break in production.

sharp

Commercial news chatbots already answer fresh news well enough in clean settings; the dangerous part is who their retrieval stack leaves out. The paper tests Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini on 2,100 same-day BBC questions from Feb. 9-22, 2026. The best systems clear 90% on multiple choice, then lose 11-13 points in free response. MCQ accuracy flatters the product. Hindi is the ugly result: 79% accuracy versus 89-91% elsewhere, with citations leaning on English Wikipedia rather than Hindi outlets. Over 70% of errors come from retrieval failure, not reasoning. For builders, that lands harder than another MMLU bump: news reliability is now index coverage, source ranking, and multilingual retrieval quality, not just model IQ.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:36

18d ago

FEATUREDarXiv · cs.AI· atomEN17:36 · 05·21

→DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

DeltaBox uses DeltaFS and DeltaCR for incremental sandbox checkpoint and rollback, reaching 14 ms checkpoint latency and 5 ms rollback latency in SWE-bench and RL micro-benchmarks.

#Agent#Tools#Inference-opt#DeltaBox

why featured

HKR-H/K/R all pass: DeltaBox targets a real agent-systems bottleneck with named mechanisms and 14ms/5ms numbers. The source is only an arXiv abstract with no independent replication or production evidence, so it stays in the 78–84 band.

editor take

DeltaBox gets sandbox rollback to 5 ms; that matters more for agents than another SWE-bench bump.

sharp

DeltaBox pulls the agent bottleneck back into the OS layer, and I buy that framing. The concrete hook is strong: DeltaFS layers file state, DeltaCR uses incremental process dumps, and the paper reports 14 ms checkpoint latency plus 5 ms rollback on SWE-bench and RL micro-benchmarks. Full-state duplication sits at hundreds of milliseconds to seconds, which kills deep fan-out before the model gets interesting. I like that this paper does not sell a smarter agent. It changes sandbox transaction semantics. Claude Code-style and Codex-style coding agents already lose plenty of time on failed attempts, dirty worktrees, and process resets. If 5 ms rollback survives real repos, long-running services, and messy filesystem state, test-time search finally gets an engineering substrate instead of a demo loop.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:33

18d ago

arXiv · cs.AI· atomEN17:33 · 05·21

→MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze

MambaGaze achieves 76.8% and 73.1% accuracy on CLARE and CL-Drive under leave-one-subject-out evaluation, using XMD encoding for blink and tracking-failure missingness, while Jetson edge benchmarks report 43-68 FPS real-time inference below 7.5W power consumption.

#Multimodal#Inference-opt#Benchmarking#NVIDIA

why featured

HKR-K passes with benchmark results, an explicit missing-data mechanism, and edge FPS/power. HKR-H and HKR-R are weak because gaze-based cognitive-load assessment is useful but narrow, so it stays in all.

editor take

MambaGaze hits 76.8%/73.1% LOSO accuracy; I buy the XMD trick, not stable cognitive-load inference yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:32

18d ago

arXiv · cs.CL· atomEN17:32 · 05·21

→Reducing Political Manipulation with Consistency Training

The paper introduces Political Consistency Training, an RL method with two paradigms that reduces covert political bias in LLMs, and defines two metrics: Sentiment Consistency and Helpfulness Consistency.

#Alignment#Safety#Benchmarking#Research release

why featured

HKR-H/K/R pass: the title ties political manipulation to consistency training, and the summary gives two RL paradigms plus two metrics. No result numbers, model list, or artifact details are disclosed, so it stays in the 60–71 band.

editor take

PCT uses 2 RL paradigms to curb political bias; models and effect sizes aren’t disclosed, so I don’t buy the helpfulness claim yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:31

18d ago

FEATUREDarXiv · cs.CL· atomEN17:31 · 05·21

→Understanding Data Temporality Impact on Large Language Models Pre-training

The authors evaluate 6B-parameter models with more than 7,000 temporally grounded questions, pretrain them on temporally ordered Common Crawl snapshots, and find that sequential training improves factual freshness and temporal precision while matching shuffled baselines on general language understanding and common knowledge.

#Benchmarking#Kyutai#Common Crawl#Hugging Face

why featured

HKR-H/K/R all pass, but this is a single arXiv pretraining-recipe paper without a major-lab release or cross-source cluster. The chronological-data result is useful, so it clears featured, not a higher band.

editor take

Temporal pretraining is not a cute data trick; on 6B models and 7K+ questions, shuffling pulls the model toward stale facts.

sharp

This paper drags “stale knowledge” back into pretraining, not retrieval. The team trains 6B-parameter models on temporally ordered Common Crawl snapshots, then tests them on 7,000+ temporally grounded questions. Sequential training improves factual freshness and temporal precision, while matching shuffled baselines on general language understanding and common knowledge. That lands awkwardly for the standard continual-pretraining recipe. Many teams shuffle for distribution stability, then patch recency with RAG. Kairos says the time structure of the corpus gets baked into weights before retrieval enters the room. The caveat is real: this is 6B scale, and the snippet does not disclose tokenizer, token budget, or exact benchmark scores. I’d want to see 30B/70B replication before treating this as a default training recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:09

18d ago

HuggingFace Papers (takara mirror)· rssEN17:09 · 05·21

→Research paper introduces ProxySHAP for approximating higher-order Shapley and Banzhaf interactions

The paper introduces ProxySHAP, which approximates higher-order Shapley and Banzhaf interactions using tree-based proxy models plus residual correction, and reports lower error than ProxySPEX and KernelSHAP-IQ on benchmarks that include large-scale settings with thousands of features.

#Interpretability#Benchmarking#ProxySHAP#ProxySPEX

why featured

HKR-K passes, but HKR-H/R fail. The item is a specialized interpretability-method paper with only an error claim versus ProxySPEX and KernelSHAP-IQ, triggering technical-accessibility fail.

editor take

ProxySHAP uses tree proxies plus residual correction; benchmarks claim wins on thousands of features, but code disclosure is absent here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:04

18d ago

arXiv · cs.CL· atomEN17:04 · 05·21

→ChronoMedKG: A Temporally Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG introduces 460,497 evidence-linked triples across 13,431 diseases, ties associations to onset windows or progression stages, and adds ChronoTQA with 3,341 questions to test temporal clinical reasoning under retrieval conditions.

#RAG#Reasoning#Agent#ChronoMedKG

why featured

HKR-K is clear via dataset scale, and HKR-R is moderate for medical AI evaluation trust. The topic is vertical, and the body gives no model comparisons or deployment mechanism, so it stays in all.

editor take

ChronoMedKG keeps 460,497 evidence-linked triples; a 30-point temporal drop says clinical RAG still mishandles time.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

16:52

18d ago

arXiv · cs.CL· atomEN16:52 · 05·21

→AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo pre-trains a graph encoder with dense body-surface IMU simulation and paired placement views, then improves average HAR Accuracy/F1 by 11.7%/11.6% across 14 unseen downstream datasets.

#Multimodal#Embedding#Benchmarking#AnyMo

why featured

HKR-K passes via a concrete mechanism and 14 unseen-dataset gains. The human-motion/HAR scope is narrow for AI Radar, with weak HKR-H and HKR-R, so it stays in the lower research-signal band.

editor take

AnyMo gains 11.7% Accuracy on 14 unseen HAR sets; IMU generalization finally escapes fixed placement assumptions.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

16:51

18d ago

● P1arXiv · cs.CL· atomEN16:51 · 05·21

→AMEL: Study of Accumulated Message Effects on LLM Judgments

AMEL tests 11 models across 75,898 API calls and finds that prior evaluation polarity shifts later LLM judgments in the same direction; negative histories induce 1.62x more bias than positive histories, while 5 and 50 prior turns produce the same shift.

#Reasoning#Benchmarking#Safety#OpenAI

why featured

HKR-H/K/R all pass: the paper claims conversation history systematically biases LLM judgments, backed by 75,898 API calls across 11 models. It affects eval reliability, safety review, and agent memory design, fitting the 78–84 research band.

editor take

11 models and 75,898 calls show polarity drag; if your LLM judge batches items in one chat, rerun your evals.

sharp

All 3 arXiv entries carry the same title and point to one v2 paper, so this is visibility across categories, not independent corroboration. The paper’s hook is strong: 75,898 API calls across 11 models from OpenAI, Anthropic, Google, and four open-source models show prior judgment polarity pulling later judgments with d=-0.17, rising to d=-0.34 on high-entropy items. I’d treat this as a direct hit on LLM-as-judge batching, not a cute bias artifact. Five prior turns and 50 prior turns produce the same shift, so longer context is not the culprit. Negative histories create 1.62x more bias than positive ones. Scaling trims the damage but leaves it: OpenAI Nano at -0.34, GPT-5.2 at -0.17; Anthropic Haiku at -0.22, Opus at -0.17. Fresh context per item is boring, expensive, and now hard to dodge.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:46

18d ago

arXiv · cs.CL· atomEN16:46 · 05·21

→Tokenization with Split Trees

ToaST optimizes token counts with binary split trees and IP-based vocabulary selection, reducing token counts by over 11% versus BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass, but this is a single arXiv tokenizer method with token-count results only; no open-source artifact, deployment path, or major-model adoption is disclosed, so it stays in the interesting research band.

editor take

ToaST cuts 11%+ tokens at 40,960 vocab; 1.5B runs gain 2.6–7.6%, so tokenizer work still has teeth.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:50

18d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN15:50 · 05·21

→Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Boiling the Frog evaluates incremental attacks on tool-using AI agents in corporate and office settings across nine models; aggregate strict ASR is 44.4%, with Claude Haiku 4.5 at 20.5%, Gemini 3.1 Flash Lite at 92.9%, and Code of Practice loss-of-control chains reaching 93.3% category-level ASR.

#Agent#Safety#Benchmarking#Claude

why featured

HKR-H/K/R all pass: the multi-turn attack hook is fresh, the post gives testable ASR figures, and enterprise-agent safety is a live practitioner worry. As a single benchmark paper, it fits the 78–84 featured band, not P1.

editor take

Agent safety is finally testing state pollution; Gemini 3.1 Flash Lite at 92.9% ASR is ugly for office-agent deployment.

sharp

Boiling the Frog hits the gap most agent benchmarks still dodge: failure comes from state getting corrupted across turns, not from one bad refusal. The setup starts with benign workspace edits, inserts the risky payload at controlled turns, then scores the final artifact state. Across nine models, strict ASR is 44.4%; Gemini 3.1 Flash Lite lands at 92.9%, while Claude Haiku 4.5 sits at 20.5%. I trust this direction more than another text-only jailbreak leaderboard. Office agents fail through documents, calendars, CRM fields, and permission flows, not just chat bubbles. The 93.3% category-level ASR on Code of Practice loss-of-control chains is the sharpest number here. Still, the article does not expose enough tool-environment detail or human-review protocol, so I would not treat the 92.9% as a product-grade verdict before reproduction.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:21

18d ago

HuggingFace Papers (takara mirror)· rssEN15:21 · 05·21

→Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

The paper proposes head-conditioned local LoRA and an out-of-cone penalty to improve gaze reasoning in vision foundation models for gaze following, reports state-of-the-art results on GazeFollow and VAT, highlights stronger gains when gaze targets are not semantically salient, and says the code will be released after paper acceptance.

#Vision#Reasoning#Fine-tuning#Research release

why featured

HKR-K passes with two concrete mechanisms and GazeFollow/VAT evaluation. HKR-H/R are weak, and the post gives no gain numbers or usable code, so this stays in the lower research band.

editor take

The paper claims SOTA on GazeFollow and VAT, but code waits for acceptance; I don’t buy gaze-following gains without repro.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

15:18

18d ago

HuggingFace Papers (takara mirror)· rssEN15:18 · 05·21

→Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

The paper proposes a vision-only UAV video detection framework that aligns adjacent frames with homography-based GMC, extracts short- and long-term motion cues through dual-interval differencing, and adds an MGA module to a Feature Pyramid Network, reporting consistent gains over a YOLOv8 baseline on VisDrone-VID without disclosing exact metrics in the snippet.

#Vision#YOLOv8#VisDrone-VID#Research release

why featured

HKR-K passes via concrete mechanisms and a benchmark setup, but HKR-H and HKR-R are weak. This is a narrow vision-detection paper, not hard-excluded, but below featured threshold.

editor take

The authors modify YOLOv8 on VisDrone-VID, but exact gains are undisclosed; until numbers land, GMC plus dual-interval differencing smells incremental.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

13:28

18d ago

HuggingFace Papers (takara mirror)· rssEN13:28 · 05·21

→MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

MaSC uses externally provided foreground concept masks to separate subject and background evaluation, reaching Krippendorff alpha 0.471 for concept preservation on DreamBench++ and AUC 0.992 for identity preservation on ORIDa.

#Vision#Multimodal#Benchmarking#MaSC

why featured

HKR-K passes with a testable evaluation mechanism and two metrics. HKR-H/R are weak: the title reads like a paper name, and the impact is concentrated in image-generation evaluation, so it fits the 60–71 research-signal band.

editor take

MaSC hits 0.471 alpha on DreamBench++; external foreground masks are the catch, so don't sell it as label-free evaluation.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

11:29

18d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN11:29 · 05·21

→Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Sibyl-AutoResearch introduces Scientific Trial-and-Error Harnesses in SIBYL, and a retrospective audit identifies 8 high-confidence conversion events with a median latency of 1 iteration and a maximum latency of 3 iterations.

#Agent#Memory#Tools#Sibyl-Research-Team

why featured

HKR-H/K/R all pass: the title attacks paper-generator agents, and HKR-K has a concrete harness mechanism plus 8 audited transitions. Unknown team prominence and no impact evidence keep it at 79, below p1.

editor take

Sibyl-AutoResearch moves the goalpost from paper drafting to trial conversion; 8 events is tiny, but the audit framing is saner than most research-agent demos.

sharp

Sibyl-AutoResearch makes the right enemy explicit: paper-generating agents that turn weak trials into confident prose. Its two units, trial-to-behavior and trial-to-harness-behavior conversion, ask whether an experiment actually changes later planning, validation, critique, writing, or the harness itself. The SIBYL audit reports only 8 high-confidence conversion events, with median latency of 1 iteration and max latency of 3. Small sample, clean target. I don’t buy any performance halo around it. The authors explicitly avoid a comparative performance claim; they show the conversion traces are recoverable from realistic workspaces. That restraint matters. Compared with AI Scientist-style end-to-end paper factories, SIBYL is betting on failure memory, gates, artifact traces, and repair loops. Unflashy, but aimed at the chronic failure mode of research agents: they produce artifacts, then fail to learn from the mess they just made.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

11:24

18d ago

HuggingFace Papers (takara mirror)· rssEN11:24 · 05·21

→Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Meta-Soft compresses KV cache with a learnable orthogonal meta-library, a Gumbel-Softmax selector that synthesizes k prompt-specific Soft Tokens, and an attention-flow integration mechanism that moves information from removed tokens into retained tokens; the snippet says experiments on multiple datasets outperform existing eviction methods, but it does not disclose model sizes, compression ratios, latency numbers, or dataset names.

#Inference-opt#Memory#Research release#Benchmark

why featured

HKR-K and HKR-R pass: the item gives concrete compression mechanisms and a multi-dataset claim over eviction baselines. HKR-H is weak, and the post lacks code, throughput/memory numbers, or production evidence, so it stays in the interesting band.

editor take

Meta-Soft synthesizes k soft tokens via Gumbel-Softmax; no compression or latency numbers, so I’d treat it as idea-stage.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

08:20

19d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN08:20 · 05·21

→Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Ratchet lets a frozen Claude Opus 4.7 write, retrieve, curate, and retire natural-language skills. On MBPP+ hard-100, across 100 rounds and 3 seeds, it raises held-out pass@1 from 0.258 to a 0.584 late-window rolling mean; on SWE-bench Verified, the same recipe gives a 0.22 peak lift over 20 rounds.

#Agent#Code#Memory#Claude

why featured

HKR-H/K/R all pass: the self-evolving-agent angle is clickable, and the article gives benchmark deltas. As a single research item without clear artifact or broad cluster, it fits the 78–84 band, not P1.

editor take

Ratchet makes self-evolving agents look less magical: the gain comes from deletion discipline, not skill-writing theater.

sharp

Ratchet’s useful claim is not that Claude Opus 4.7 can write skills; it is that agent memory rots unless it has deletion pressure. On MBPP+ hard-100, 100 rounds and 3 seeds move held-out pass@1 from 0.258±0.047 to a 0.584 late-window mean, with a 0.658±0.042 peak. The no-skill control drifts only +0.002±0.005. The ablation is the part I buy: retirement and the meta-skill authoring prior carry the result, while explicit deduplication gets absorbed by the meta-skill. Voyager made “accumulated experience” sound cleaner than it is. Ratchet looks closer to the engineering truth: long-running agent memory is mostly garbage collection. The SWE-bench Verified +0.22 peak over 20 rounds is attractive, but peak lift is not a stability story.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

05:37

19d ago

HuggingFace Papers (takara mirror)· rssEN05:37 · 05·21

→FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

FRED releases a multimodal autonomous driving dataset for flooded road environments, covering five locations with a 2.3 MP camera, 64-beam 360° LiDAR, IMU, and RTK GNSS data.

#Multimodal#Vision#Robotics#FRED

why featured

HKR-H and HKR-K pass: flooded roads are a concrete autonomy edge case, and the post gives sites plus sensors. HKR-R is weak because there is no benchmark result, license, adoption, or broader practitioner consequence.

editor take

FRED covers five flooded sites; sample count is undisclosed, but water-hazard labels beat another sunny-road dataset.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

05:18

19d ago

HuggingFace Papers (takara mirror)· rssEN05:18 · 05·21

→Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

DiTo changes token reduction for Diffusion Transformers from input-similarity matching to output-similarity-aware matching, reusing prior-step correspondences across reduction timesteps and reporting 1.6-3.9 dB higher PSNR than existing token reduction methods at comparable speedups.

#Vision#Inference-opt#DiTo#Research release

why featured

HKR-K/R pass: the item gives a concrete mechanism and a 1.6-3.9 dB PSNR gain tied to diffusion inference cost. HKR-H is weak, and this is a narrow single-paper summary, so it stays in all.

editor take

DiTo reports 1.6–3.9 dB PSNR gains at matched speedups; I buy the pivot from ViT-style input similarity to output-aware matching.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:08

19d ago

HuggingFace Papers (takara mirror)· rssEN04:08 · 05·21

→Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

The study tests knowledge graph construction on 6 statistical CSV datasets and finds serialization format plus extraction schema has a joint effect up to +1.180, while schema-format mismatch drops fact coverage below the unconstrained baseline on 4 of 6 datasets through entity inflation or extraction refusal.

#RAG#Benchmarking#CSVFidelity-Bench#Research release

why featured

HKR-H/K/R pass, but this is a narrow benchmarking paper, not a model or product release. Useful for table-to-KG/RAG pipelines, with limited industry spread, so it stays in 60–71.

editor take

CSVFidelity-Bench tests 15 CSV sets; schema mismatch undercuts unconstrained extraction on 4/6, so GraphRAG evals need direct graph access.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→On the Limits and Opportunities of AI Reviewers: Reviewing Nature-Family Reviews with 45 Experts

Forty-five scientists spent 469 hours rating 2,960 criticisms from reviews of 82 Nature-family papers, and a GPT-5.2 reviewing agent scored 60.0% on the composite metric, above the top-rated human reviewer per paper at 48.2% with p = 0.009.

#Agent#Reasoning#Benchmarking#GPT-5.2

why featured

HKR-H/K/R all pass with a testable 45-expert Nature-family review setup and 60.0% vs 48.2% result. It stays below 85 because it is a single arXiv study, not a model or platform release.

editor take

GPT-5.2 beats the top human reviewer per paper, but 21% overlap says the danger is review monoculture, not reviewer unemployment.

sharp

The sharp part is not that GPT-5.2 scored 60.0% versus 48.2% for the top human reviewer per paper. It is that AI review quality now has measurable strength and measurable bias. Forty-five scientists spent 469 hours rating 2,960 criticisms across 82 Nature-family papers, and the GPT-5.2 agent won on correctness, significance, and evidence sufficiency with p=0.009. It also surfaced a distinct 26% of issues no human raised. I don’t buy the replacement narrative. AI reviewers overlapped with each other 21% of the time; humans overlapped only 3%. That is the tell. These systems converge on the same failure modes, and the paper names 16 recurring weaknesses: limited subfield knowledge, poor long-context handling across multiple files, and over-criticism of minor issues. That is exactly where serious peer review still lives.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Most Transformer Modifications Still Do Not Transfer at 1–3B

The paper tests 20 post-2021 Transformer modifications at 1.2B and 3B under iso-data, iso-compute, and iso-recipe controls, using CLIMB-12 downstream evaluation; only two pass Bonferroni correction at 1.2B, and one of those fails to train stably at 3B.

#Benchmarking#Reasoning#Research release#Benchmark

why featured

HKR-H/K/R all pass: the paper tests 20 Transformer changes under matched 1.2B/3B conditions and finds only 2 clear Bonferroni correction. Strong for research and training choices, but not a major model release.

editor take

This is a cold shower for Transformer tweaks: 20 changes, only 2 survive correction, and architecture papers without multi-seed downstream eval deserve suspicion.

sharp

Most 1-3B Transformer modifications still look like lab noise, not portable progress. The authors test 20 post-2021 changes under iso-data, iso-compute, and iso-recipe controls; only two pass Bonferroni at 1.2B, and one of those breaks training stability at 3B. The nastier result is the loss/downstream split. Two failed attention-output modifications land within 2-3% of baseline validation loss, yet drop 6-16 CLIMB-12 points. That punches directly at architecture papers selling small perplexity wins as substance. Narang et al. 2021 aged well here: without multi-seed noise floors, downstream eval, and cross-scale stability, a Transformer tweak is mostly a story with charts.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Introspective X Training: Feedback Conditioning Improves Scaling Across All LLM Training Stages

The paper proposes Introspective Training, which uses a thinking reward model to add natural-language critique feedback and prefix-condition training data; experiments on 7.5-12B dense LLMs trained from scratch up to 18 trillion tokens report up to 2.8x compute efficiency and higher math and code performance than baselines.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-H/K/R all pass: the paper gives a mechanism, scale, and a 2.8x efficiency claim tied to training cost. As a single arXiv paper needing replication, it fits the upper 78-84 research-release band.

editor take

IXT pulls post-training preference signals into pretraining; 2.8x compute efficiency is juicy, but biased critique can poison trillions of tokens early.

sharp

IXT is sharp because it moves “which tokens deserve weight” into pretraining, not the SFT/RLHF cleanup phase. The mechanism is concrete: a thinking reward model writes natural-language critiques, then training data is prefix-conditioned on that feedback. The paper reports 7.5B–12B dense transformers trained from scratch up to 18 trillion tokens, with up to 2.8x compute efficiency and better math/code than baselines. I buy the direction before I buy the headline number. The last year of post-training work leaned on data filtering, synthetic CoT, and preference tuning as patches after the base model existed. IXT turns the reward model into an early training teacher. That also moves the failure mode earlier: critique text carries the reward model’s taste, blind spots, and collapse patterns. The abstract does not disclose benchmark tables, RM size, or how 2.8x is measured, so treat it as a strong claim, not a settled recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

The paper proposes Flow Map Reward Guidance, a training-free single-trajectory method that casts guidance as deterministic optimal control; at text-to-image scale, it matches or beats baselines on inverse and reward-guided tasks with as few as 3 NFEs, giving at least a 10x speedup over prior state of the art.

#Alignment#Inference-opt#Research release

why featured

HKR-H/K/R all pass: 3 NFEs and order-of-magnitude speedup are concrete hooks, with a testable FMRG mechanism. As a single arXiv paper needing replication, it stays in the 78–84 band.

editor take

FMRG getting guidance down to 3 NFEs is a cleaner cost story than stapling on another reward model.

sharp

FMRG’s sharp point is moving the guidance bill into sampler efficiency instead of adding another reward-model layer. It casts guidance as deterministic optimal control, then uses the flow map to integrate and steer the trajectory. The headline number is 3 NFEs on text-to-image tasks, matching or beating baselines and claiming at least a 10x speedup over prior SOTA. I buy the inference angle; I don’t buy the casual use of “alignment” yet. This is reward-guided sampling for aesthetic or preference objectives, not RLHF-style behavioral control. Training-free and single-trajectory are the right words for production cost, but the arXiv abstract does not give the exact base model, resolution, reward setup, or failure cases. If 3 NFEs only holds on narrow rewards, this is a strong inference-opt trick, not a general alignment method.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Adaptive Probe-based Steering for Robust LLM Jailbreaking

The paper proposes an adaptive probe-based steering attack that uses model extraction to approximate an ideal steering vector and tunes steering strength from contrastive activation statistics, raising the average harmfulness score from 6% to 70% without extra contrastive prompts or manual tuning.

#Safety#Alignment#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: adaptive steering jailbreaking has a clear hook, the summary gives a 6%→70% harmfulness jump and a strength-control mechanism, and it matters to safety teams. Single arXiv paper, so 78–84 rather than must-write.

editor take

This moves jailbreaking from prompt craft to activation engineering; 6% to 70% harmfulness makes chat-only guardrails look brittle.

sharp

The sharp part is not “another jailbreak”; it pushes the attack into steering vectors. The authors use model extraction to approximate an ideal vector, then tune strength from contrastive activation statistics. Average harmfulness jumps from 6% to 70%, without extra contrastive prompts or manual tuning. That lands badly for safety stacks built around system prompts, refusal templates, and external classifiers. Activation-level attacks bypass the surface where most product guardrails still spend their budget. The paper is ICML 2026, 19 pages, 13 figures, with code released, so this is not a tweet-sized stunt. I’d still discount the headline number until the tested model list is checked; the abstract does not name the targets, and 70% means different things on an open-weight lab model versus a hardened frontier endpoint.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

The paper proposes Proximal State Nudging, a shared autonomy algorithm that nudges users toward learnable states, and reports that in two CARLA driving tasks with 60 participants, PSN produced up to 7x larger unassisted skill gains than standard blended shared autonomy and 50% fewer collisions than unassisted self-practice.

#Robotics#Safety#Alignment#CARLA

why featured

HKR-H/K/R all pass: the deskilling angle is sticky, and the post gives a 60-person CARLA study with 7x gains. As a single arXiv paper, it fits the 78–84 research band, not P1.

editor take

PSN turns assistance into coaching: 7x skill gain is sharp, but 60 CARLA subjects do not settle the safety case.

sharp

PSN hits the old failure mode in shared autonomy: the smoother the assist, the more the operator rots. The paper does not chase stronger takeover. It nudges users into “learnable” states. In two CARLA driving tasks with 60 participants, PSN reports up to 7x larger unassisted skill gains than blended shared autonomy, and 50% fewer collisions than unassisted self-practice. I buy the problem framing before I buy the safety claim. LunarLander plus CARLA shows a mechanism, not real-world robustness under messy handoffs and long-term dependence. The Tesla FSD debate has lived in this gap for years: short-term comfort can hide long-term skill loss. PSN gives the field a measurable algorithmic handle, but a 9-page arXiv v1 is not deployment evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Pix2Fact: Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Pix2Fact introduces 1,000 4K+ images across eight scenarios to evaluate fine-grained VQA with external knowledge search; among ten VLMs, Gemini-3.1-Pro reaches only 51.7% average accuracy even with visual ground truth and search tools.

#Vision#Multimodal#Benchmarking#Gemini

why featured

HKR-K is strong: 1,000 4K+ real-world images, 10 VLMs, and Gemini-3.1-Pro at 51.7%. The low score challenges vision-plus-search agent claims, clearing featured, but it is still an arXiv benchmark rather than a same-day must-write release.

editor take

Pix2Fact is a nasty check on VLM optimism: even Gemini-3.1-Pro hits only 51.7% with visual ground truth and search.

sharp

Pix2Fact exposes an interface failure in VLM agents, not a plain vision-accuracy gap. The benchmark uses 1,000 4K+ real-world images across eight scenarios and tests ten VLMs. Gemini-3.1-Pro still reaches only 51.7% average accuracy with visual ground truth and search tools. That setup is brutal because it separates “can’t see it” from “can’t verify it,” and the models still break on fine-grained grounding, shallow search use, and long-tail local information. I don’t buy the story that a larger vision encoder fixes this. GPT-4V through Gemini have looked strong on salient-object demos; Pix2Fact asks the dirty workflow questions: signs, local text, place facts, messy web evidence. For field-grade VLM agents, the weak link is the perception-retrieval-verification loop.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Why Does Self-Distillation Sometimes Degrade the Reasoning Capability of LLMs?

The paper attributes math reasoning degradation under self-distillation to suppressed epistemic verbalization, and reports performance drops of up to 40% across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct under controlled context-richness and task-coverage experiments.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the paper makes a counterintuitive self-distillation claim, gives a 40% drop and a mechanism, and matters to fine-tuning teams. It is strong research signal, not a same-day model-launch event.

editor take

Self-distillation is deleting the model’s “I’m not sure,” and a 40% math drop is a nasty warning for trace-compression work.

sharp

Self-distillation breaks math reasoning here because it suppresses uncertainty, not because it shortens traces. The paper tests Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct under controlled context-richness and task-coverage settings, with drops up to 40%. The sharp part is the mechanism: a teacher conditioned on rich information sounds more certain, so the student learns cleaner answer paths and loses the hesitation that helps on OOD problems. I don’t buy the common “shorter CoT is cleaner reasoning” story without this check. A lot of distillation pipelines optimize for correct traces and token savings, then wonder why transfer gets brittle. The code is linked, so this is a good regression test for anyone shipping SFT or self-distill loops.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Equilibrium Reasoners scale test-time compute by iterating latent states, raising Sudoku-Extreme accuracy from 2.6% for feedforward models to over 99% when unrolled to an equivalent of 40,000 layers.

#Reasoning#Inference-opt#Research release

why featured

HKR-H/K/R pass: the post has a stark benchmark jump, a named mechanism, and a reasoning-scaling nerve. It stays in the 78–84 band because it is an arXiv paper on Sudoku-Extreme, not a shipped system or broad artifact.

editor take

2.6% to 99% on Sudoku-Extreme is wild, but EqR first proves latent dynamics can spend test-time compute well—not general reasoning solved.

sharp

EqR’s sharp claim is not the 99% score; it is moving test-time compute inside the model’s latent dynamics. The paper says easy Sudoku cases converge in 1 to 5 steps, while hard ones benefit from unrolling to an equivalent of 40,000 layers. On Sudoku-Extreme, accuracy jumps from 2.6% for feedforward models to above 99%, without an external verifier or task-specific priors. I buy the mechanism direction, but not the broad “scalable reasoning” halo yet. Sudoku has clean constraints and stable solutions, so the attractor story gets a friendly arena. Proofs, code, and agentic tasks have messier fixed points. This feels like a serious alternate route beside verifier-heavy reasoning systems: hide deliberation in iterative latent state updates. The burden is showing the convergence survives outside grid puzzles.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Evolutionary Generation of Multi-Agent Systems

EvoMAS reformulates multi-agent system generation as evolutionary search over structured configurations, using execution-trace-guided mutation, crossover, and experience memory; it beats EvoAgent by 10.5 points on BBEH and 7.1 points on WorkBench, and reaches 79.1% on SWE-Bench-Verified with Claude-4.5-Sonnet.

#Agent#Reasoning#Tools#Amazon Science

why featured

HKR-H/K/R all pass: EvoMAS has a clear mechanism and three benchmark numbers, and targets multi-agent design cost. It remains an arXiv research release without adoption or cross-source validation, so it sits in 78–84.

editor take

EvoMAS makes agent design look like search, not craft; 79.1% SWE-Bench-Verified is serious, but don’t confuse leaderboard fit with general agent automation.

sharp

EvoMAS lands because it stops asking the model to invent agent code and searches structured configurations instead. Execution traces drive mutation, crossover, and memory; the paper reports +10.5 over EvoAgent on BBEH, +7.1 on WorkBench, and 79.1% on SWE-Bench-Verified with Claude-4.5-Sonnet. I buy the direction. Agent systems usually fail on executability and runtime brittleness, not because the role prompt lacked elegance. AutoGen- or CrewAI-style hand wiring turns into topology tuning once tasks get messy. The caveat is cost and selection pressure: the abstract gives the headline scores, but not search budget, failure distribution, or how much Claude-4.5-Sonnet is carrying. Treat 79.1% as a strong benchmark result, not proof that enterprise agent design is now automated.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel selects a fixed-budget subset of web-agent trajectory steps using an importance-diversity objective, adds target-centered AXTree pruning and model-generated rationales, and reports roughly 9.7-12.5x training speedups over standard fine-tuning across WebArena, WorkArena, and MiniWob with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B.

#Agent#Fine-tuning#Tools#Qwen

why featured

HKR-H/K/R all pass: this is not just a benchmark paper; it gives a web-agent data-selection mechanism and a 9.7-12.5x speedup claim. As a single arXiv research item, it stays in the 78-84 quality band pending replication.

editor take

Weasel attacks the unglamorous bottleneck: messy traces and bloated AXTrees. The 9.7-12.5x speedup is strong; generalization is still the hard claim.

sharp

Weasel is credible because it does not pretend web-agent generalization comes from a larger backbone. It attacks the boring failure mode: noisy trajectories, repeated states, and huge AXTrees. The method picks fixed-budget trajectory steps with an importance-diversity objective, prunes around the ground-truth action target, and swaps expert traces for model-generated rationales. The reported 9.7-12.5x training speedups span WebArena, WorkArena, MiniWob, Qwen2.5-7B, Gemma3-4B, and Qwen3-8B. I still don’t buy the generalization claim without the table. The abstract does not give success rates, variance, per-benchmark splits, or ablations showing the greedy selector survives new website distributions. Web agents have a long habit of looking clean in offline training and getting wrecked by evaluator quirks or DOM churn. Open code helps; replication will decide whether this is a training recipe or a benchmark-local cleanup trick.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

The paper evaluates four open-weight reasoning models and finds that supervised fine-tuning on data without reasoning traces reduces valid reasoning-trace rates, while answer-only metrics obscure the failure even when performance conditioned on valid reasoning stays high.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the paper names a concrete failure mode, tests 4 open reasoning models, and exposes an eval blind spot for fine-tuning teams. It is research signal, not a major lab release, so it sits in 78–84.

editor take

Stop trusting post-SFT accuracy alone; across four open-weight reasoning models, SFT can train reasoning models into answer-only performers.

sharp

This paper hits the failure mode enterprise fine-tuning keeps hiding: the answer stays right while the reasoning interface rots. The authors test four open-weight reasoning models and split evaluation into final-answer accuracy, valid traces, empty traces, missing traces, and truncated traces. Standard SFT on data without reasoning traces rapidly suppresses valid reasoning traces, while answer-only metrics mask the damage. The useful nuance is that performance conditioned on valid reasoning stays high in several settings. So the model has not simply lost the underlying skill; the fine-tune changes when it emits the reasoning structure. That matters for agents, verifiers, and audit logs, where the trace is part of the contract. Loss masking helps without teacher-generated reasoning traces, which is a much cheaper fix than generating fresh CoT data. The abstract does not disclose model names or exact drop sizes, so treat this as a mechanism warning, not a magnitude claim yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→COBALT: Crowdsourced Robot Learning via Cloud-Based Smartphone Teleoperation

COBALT supports cloud-based smartphone teleoperation for robot demonstration collection, sustaining up to 8 concurrent users per GPU under 100 ms end-to-end latency and dozens at 20 Hz, while the pilot crowdsourcing run collected 7,500+ demonstrations totaling 50+ hours across nine countries in five days.

#Robotics#Agent#Tools#COBALT

why featured

HKR-H/K/R all pass: COBALT turns robot demo collection into smartphone cloud teleoperation, with one GPU, 8 concurrent users, under 100 ms latency, and 7,500+ demos in 5 days. It is an arXiv research release, not an industry product, so it stays in the 78-84 band.

editor take

COBALT’s punch is not robot policy quality; it makes teleop labor look internet-native, with phones and sub-100 ms latency doing the boring work.

sharp

COBALT’s useful claim is operational, not algorithmic: teleoperation becomes a phone-based labor market instead of a lab-hardware ritual. The concrete hook is strong enough: 8 concurrent users per GPU under 100 ms end-to-end latency, 256 simulated clients across 8 GPUs, and 7,500+ demos across nine countries in five days. That smells like infrastructure you can run, not a one-off robotics video. I’m more convinced by this than another “robot foundation model” headline. ALOHA and DROID already showed that manipulation progress is gated by dense, usable demonstrations. COBALT attacks the collection funnel. The weak spot is also obvious: the abstract does not spell out real-robot hardware diversity, task breadth, or how failure-heavy crowdsourced traces get filtered beyond logged metrics.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

The paper proposes an optimization-triggered backdoor framework that reaches 90% average attack success across four open-source LLMs and four tasks, while standard safety evaluations without compilation fail to detect the attacks.

#Inference-opt#Safety#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: the optimization-triggered backdoor angle is fresh, the summary gives a 90% attack rate across 4 LLMs and 4 tasks, and the risk lands on deployment safety. Single arXiv paper, so it stays below must-write.

editor take

This punctures the “trusted weights” story: the backdoor fires after compilation, exactly where production inference stacks live.

sharp

The trusted-weights assumption takes a clean hit here: the same open-source LLM passes uncompiled evaluation, then activates after compilation. The paper reports 90% average attack success across four open-source LLMs and four tasks, with clean accuracy near 100%, and says the attack needs no compiler or hardware modification. That condition lands close to production, because many teams vet raw graphs and then ship TensorRT-LLM, vLLM, or torch.compile-style paths for throughput. I have some doubts about the 90% number because the snippet gives no model names, task list, optimization settings, or defense details. But the direction is ugly and practical: if safety evals do not run the deployed graph, they are testing the demo model, not the system users hit.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Research Proposes RELEX Method for Accelerating LLM Reinforcement Learning via Rank-1 Trajectories

RELEX estimates a rank-1 subspace from a short RLVR observation window and matches or exceeds full RLVR performance across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base while using as few as 15% of training steps.

#Reasoning#Fine-tuning#Inference-opt#Qwen

why featured

HKR-H/K/R all pass, driven by the testable 15% RLVR-step claim. It stays in 78–84 because this is still an arXiv method paper, with external replication and real cost details not disclosed.

editor take

RELEX treats RLVR as a near-linear rank-1 path; if it replicates, a lot of reasoning-tuning compute theater gets exposed.

sharp

RELEX’s sharp claim is that RLVR training can collapse into one direction estimate. On Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, it matches or beats full RLVR with 15% of the steps. The paper also claims a 50-step prefix can extrapolate to 1000 steps, a 20x jump with no extra training. I like the result, but I don’t fully buy the generality yet. A rank-1 delta carrying most gains says these RLVR runs behave more like climbing one reward-aligned slope than learning a rich policy. The catch is scope: all models are Qwen-family, capped at 8B, and the tasks are verifiable reasoning. Code agents, tool use, and long-horizon environments inject messier gradients; one-dimensional denoising may stop looking so magical there.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

FT-Dojo introduces an interactive benchmark for autonomous LLM fine-tuning across 13 tasks in 5 domains, and FT-Agent achieves the best performance on 10 of 13 tasks under held-out evaluation with comparisons against frontier agents and open-source planning backbones.

#Agent#Fine-tuning#Benchmarking#Microsoft

why featured

HKR-H/K/R all pass: the hook is autonomous fine-tuning, and the concrete facts are 5 domains, 13 tasks, and 10 best results. It stays in 78–84 because this is an arXiv benchmark/agent experiment, not proven production replacement.

editor take

FT-Dojo turns fine-tuning into an agent environment, but 13 tasks is tiny; 10/13 wins is a strong start, not a factory.

sharp

FT-Dojo’s useful move is not FT-Agent winning 10 of 13 tasks. It turns fine-tuning into an interactive environment with raw data, sandboxed execution, structured feedback, and held-out evaluation fixed in one loop. Fine-tuning agents have too often been pitched as warmed-over AutoML; this at least makes data edits, training configs, and failure inspection part of the same task. I’m not excited by the 10/13 number yet. Thirteen tasks across five domains is too small to rule out method overfitting, and the abstract does not give task mix or compute budget. This smells like early SWE-bench-style interface gains: useful, but easy to over-read. FT-Dojo earns its keep if it exposes cross-domain failure diagnosis, not if FT-Agent tops a small leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→How Much Online RL Is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

The paper introduces G2D, a three-stage GRPO-to-DPO pipeline, and reports that on Qwen2.5-7B with K=150 it reaches 62.4% on MATH-500, beating GRPO’s 51.6% by 10.8 percentage points while using about 4x less compute.

#Reasoning#Fine-tuning#Alignment#Qwen

why featured

HKR-H/K/R all pass: the paper challenges online RLVR cost with a concrete G2D recipe and MATH-500 numbers. It is a single arXiv result awaiting replication, so it sits in good-quality research, not must-write.

editor take

G2D dents the online-RL dogma: Qwen2.5-7B hits 62.4% on MATH-500 at K=150, beating GRPO while using ~4x less compute.

sharp

G2D’s sharp claim is that RLVR compute is wasted when every step keeps generating fresh rollouts. The paper shows Qwen2.5-7B at K=150 reaching 62.4% on MATH-500, versus GRPO at 51.6%, while using about 4x less compute. Llama-3.1-8B at K=500 reaches 49.4% in the same setup. I buy the direction, not the full victory lap. The useful mechanism is “informative” rollouts: a short GRPO warm-up creates calibrated uncertainty, then DPO extracts contrastive signal offline. Too much warm-up makes the policy overconfident, so preference pairs get dull. That is a cleaner story than “just collect more pairs.” The catch is scope: 7B/8B models and MATH-500 do not settle coding, agent loops, or frontier-scale RLVR. If it transfers, online rollout budgets get cut first.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→D³-Subsidy: Online Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Markets

D³-Subsidy uses a prefix-conditioned diffusion model for city-level driver subsidy control under three production constraints: stochastic shocks, subsidy-rate caps, and low-latency execution; offline evaluations report higher Rides and GMV with better cap compliance, and the real-world A/B test reports significant uplift, but the abstract does not disclose the exact effect size.

#Agent#Fine-tuning#DiDi Chuxing#D³-Subsidy

why featured

HKR-H/K/R all pass, but the story stays in the 60-71 band: it has a production mechanism and A/B test, while lift size, savings, and reproducible artifacts are not disclosed.

editor take

The 3 sources are the same arXiv paper repeated; D³-Subsidy matters because DiDi frames subsidies as a deployable constrained controller, not model theater.

sharp

All 3 entries point to the same arXiv record with identical title and body, so this is a single-source chain, not independent coverage. The paper is 14 pages, 14 figures, and reached v3 on May 21, 2026. My read: D³-Subsidy pushes driver subsidies from growth-ops tuning into online constrained control. The concrete hook is the constraint stack: stochastic shocks, subsidy-rate caps, and low-latency city-scale execution. It uses prefix-conditioned diffusion to sample future trajectories, then a Lagrangian-dual mapping to convert city-level plans into order-driver incentives without per-order iterative optimization. The weak spot is evidence disclosure: the abstract claims significant A/B uplift in Rides and GMV, but gives no uplift number. Still, compared with another offline recommender paper, this is closer to where platform AI burns or saves real money.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Training Language Agents to Learn from Experience

The paper introduces ICT for cross-task self-improvement: a reflector observes actor trajectories and generates system prompts, and trained reflectors outperform an untrained baseline on most held-out ALFWorld and MiniHack task families.

#Agent#Reasoning#Tools#MetaGym

why featured

HKR-H/K/R all pass: the paper has a clear self-improving-agent hook, a concrete reflector→system-prompt mechanism, and agent-builder resonance. As a single arXiv paper, it stays below same-day must-write status.

editor take

Reflection is moving from fixing one run to teaching the next, but “most held-out families” without effect sizes is a yellow flag, not a victory lap.

sharp

ICT pushes reflection from single-episode repair into cross-task transfer, and that is the right axis for agents. The mechanism is concrete: a reflector reads actor trajectories and writes system prompts, trained with RL from experience rather than human examples. On ALFWorld and MiniHack, trained reflectors beat an untrained baseline on most held-out task families, with some claimed transfer to different environments. The missing parts matter. The snippet gives no effect size, model names, rollout budget, or failure distribution. Compared with Reflexion-style “fix this attempt” loops, ICT asks whether experience becomes reusable control policy. I buy the framing more than another planner paper, but without deltas this is a benchmark-plus-recipe result, not evidence that long-horizon agent learning is solved.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Reflector internalizes step-wise reflection with teacher-guided SFT and RL using outcome-driven plus reward-validity supervision, reporting over 90% Defense Success Rate against complex indirect jailbreaks and a 5.85% gain on GSM8K.

#Safety#Alignment#Reasoning#Reflector

why featured

HKR-H/K/R all pass: the paper offers a concrete defense mechanism, testable metrics, and agent-safety resonance. Kept at 78 because it is an arXiv paper with limited deployment evidence.

editor take

Reflector trains reflection into the trajectory, not the wrapper; 90%+ DSR looks strong, but the base model and attack set decide whether it travels.

sharp

Reflector is aiming at the right layer: safety inside the generation trajectory beats a refusal wrapper after the model has already reasoned itself into trouble. The paper uses teacher-guided SFT to create reflection traces, then RL with outcome-driven and reward-validity supervision. The headline numbers are strong: over 90% DSR on complex indirect jailbreaks and a 5.85% GSM8K gain. I have doubts about the “without significant computational overhead” claim. The abstract does not disclose the base model, attack-set size, safety baselines, or whether reflection tokens count toward inference cost. A lot of guardrail papers looked solid on curated red-team sets in the last year, then weakened inside tool-using agent workflows. Reflector deserves replication; the scalable-safety framing should wait for harder evals.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

The paper introduces DG-Hard, a checkpoint-only SVD filtering method over the weight delta between W_base and W_ft, and reports balanced repair across 14 model-task settings and nine cross-domain held-out benchmarks without retraining or data-dependent tuning.

#Fine-tuning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the no-retraining recovery claim is clickable, DG-Hard gives a testable SVD-on-weight-deltas mechanism, and finetuning regression is a real practitioner pain. Single arXiv source keeps it at 78.

editor take

DG-Hard turns fine-tuning forgetting into SVD cleanup on weight deltas; if the 14-setting result holds, post-merge model triage changes.

sharp

DG-Hard’s sharp move is refusing to treat forgetting as a retraining problem. It operates only on W_ft minus W_base, runs Donoho-Gavish hard-threshold SVD on each delta matrix, and needs no data-dependent tuning. The paper claims results across 14 model-task settings and nine held-out cross-domain benchmarks, with metrics split into healing, preservation, non-damage, and target-task retention. That matters because naive recovery scores can reward a dumb slide back toward the base model. I’d pressure-test the edges first: full SVD cost on large checkpoints, MoE and quantized merges, and whether the reported safety recovery is just base prior leaking back in. Still, compared with replay, EWC-style regularization, or adapter tricks, this is clean. If it reproduces, checkpoint-only spectral triage becomes hard to skip before shipping fine-tuned models.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

The paper evaluates local reconstruction after LLM pruning across model families up to 72B parameters, reporting that adapting one parameter subset at a time on a calibration set matches post-pruning retraining while using over an order of magnitude less data and compute; it also finds matrix-level reconstruction underperforms, and pruning-criterion gaps shrink as model scale increases.

#Inference-opt#Fine-tuning#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the title has a contrarian hook, the summary gives testable 72B and >10x cost claims, and it hits deployment-cost concerns. As a single arXiv compression paper, it lands at 78.

editor take

This pruning paper has operator value: up to 72B, local reconstruction cuts the retraining bill by over 10x without chasing fancier criteria.

sharp

This paper pushes pruning back into engineering territory: stop over-investing in exotic pruning criteria, then calibrate the surviving weights properly. The authors test local reconstruction up to 72B parameters, adapting one parameter subset at a time on a calibration set to match dense-model intermediate activations. They report near post-pruning retraining quality with over an order of magnitude less data and compute. The sharp part is the granularity result: once the reconstruction window contains at least one nonlinear submodule, final quality is largely insensitive to window size, while single-matrix reconstruction accumulates activation drift. For teams shipping smaller private or edge deployments, that is a cleaner lever than another saliency metric paper.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

The paper profiles eight 0.5B to 9B LLMs on a flagship Android device using a reproducible pipeline, finding importance-aware quantization cuts memory but yields negligible energy savings, while MoE models keep 7B-class storage capacity with a 1B to 2B energy profile.

#Inference-opt#Benchmarking#Qwen#Research release

why featured

HKR-H/K/R all pass: the energy result is counterintuitive, the setup gives concrete model/device ranges, and the deployment cost angle resonates. Single arXiv benchmark, so it stays below must-write.

editor take

Quantization just took another hit on mobile: battery life is being decided by architecture, memory traffic, and thermals, not weight bits.

sharp

Mobile LLM energy claims look weaker after this paper: eight 0.5B to 9B models ran on a flagship Android device, and importance-aware quantization mainly cut memory, not energy. That is a painful result for on-device teams, because users feel battery drain and heat before they care that a larger checkpoint fits in RAM. The useful hook is MoE. The paper says MoE keeps 7B-class storage capacity while tracking a 1B to 2B energy profile, and it names Qwen2.5-3B as a practical balance point. That lines up with the last year of Gemini Nano and Apple Intelligence positioning: if the pitch is still “lower bits,” I’d ask for device power traces before buying the story.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→TIP: Token Importance Evaluation in On-Policy Distillation

The TIP paper uses student entropy and teacher-student divergence to select OPD tokens; retaining 50% high-entropy tokens matches or exceeds full-token training and reduces peak memory by up to 47%.

#Fine-tuning#Inference-opt#Reasoning#Qwen

why featured

HKR-H/K/R pass: the summary gives 50% token retention, 47% peak-memory reduction, and an entropy plus disagreement mechanism. It is training-engineering heavy, but the cost claim is testable, so it lands in the 78 band.

editor take

TIP is a practical OPD paper: 50% token retention and up to 47% peak-memory reduction beats another tiny distillation benchmark bump.

sharp

TIP makes a clean call: OPD should cut token-level supervision, not just examples. The paper selects tokens using student entropy and teacher-student divergence; keeping 50% high-entropy tokens matches or beats full-token training, with up to 47% lower peak memory. The wild part is the low-entropy, high-divergence bucket: under 10% of tokens nearly matches full-token baselines. That is the student being confident and wrong. This feels more like training-budget routing than plain loss masking. The validation spans Qwen3, Llama, and Qwen2.5 on MATH-500, AIME 2024/2025, and DeepPlanning. My caveat is scale: the abstract names model families, but not the exact parameter sizes here, so the 47% memory win still needs a production-size distillation run before I’d price it into a training stack.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

LlamaWeb adds a WebGPU backend to llama.cpp for browser LLM inference, reducing memory use by 29–33% across tested device, browser, and OS combinations and increasing decode throughput by 45–69% on four GPUs from separate vendors.

#Inference-opt#LlamaWeb#llama.cpp#Research release

why featured

HKR-H/K/R all pass: WebGPU browser inference is a clear hook, and the post gives cross-device memory and throughput numbers. Technical depth keeps it below same-day product-release weight.

editor take

Browser LLMs are leaving demo land; LlamaWeb makes WebGPU-on-llama.cpp look like a product path, not a weekend hack.

sharp

LlamaWeb’s sharp claim is not “Llama runs in a browser.” It turns WebGPU into a llama.cpp backend and reports numbers across real hardware variety. The paper tests 16 devices, 8 vendors, 10 language models, and 4 weight formats. It cuts memory by 29–33% and raises decode throughput by 45–69% on four GPUs from separate vendors. If those numbers reproduce, browser inference moves from demo viability to product budgeting. I’d still be careful with the 45–69% figure. The excerpt names “existing browser-based LLM frameworks,” but does not list the baselines or model sizes here. The llama.cpp integration is the bigger hook. It lowers migration cost versus yet another JS inference stack. WebGPU has kept losing on fragmentation; LlamaWeb is aimed exactly at that failure mode.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

The paper tests off-the-shelf persona steering on two instruction-tuned models, where doubt- or scrutiny-oriented personas reduce sycophancy to about 68% and 98% of CAA's effect while preserving accuracy when the user is correct.

#Alignment#Interpretability#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: the paper offers a surprising steering shortcut, concrete 2-model results, and a safety-quality nerve. Evidence is still narrow, so it stays at 78, not P1.

editor take

Sycophancy looks less like a clean honesty direction and more like persona residue; this paper dents the CAA story hard.

sharp

The sharp part is that anti-sycophancy may not need bespoke labeled contrast pairs. Plain persona steering gets close enough to embarrass the cleaner CAA story. On two instruction-tuned models, doubt- or scrutiny-oriented personas reach about 68% and 98% of CAA’s effect, while preserving accuracy when the user is correct. CAA’s failure mode is obvious: it can turn “don’t flatter the user” into “push back too often.” The asymmetry matters. Agreeable personas do not mirror the effect by increasing sycophancy, and the persona vector is largely independent from the sycophancy direction in activation space. That makes the single-direction safety framing look too tidy. For deployed assistants, role tone is not cosmetic prompt dressing; it is a safety control surface.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Microsoft Introduces GenAI-Driven Threat Detection for Security Copilot

Microsoft introduces DTDA for Microsoft Security Copilot, reporting 80.1% precision in a 120-day online evaluation and novel alerts for about 15% of investigated incidents across tens of thousands of Defender customers.

#Agent#Reasoning#Tools#Microsoft

why featured

HKR-H/K/R all pass: the story has a production Security Copilot hook plus testable 120-day, 80.1%, and 15% figures. Microsoft-authored scope keeps it below major model/product-release weight.

editor take

Security Copilot finally has production-shaped numbers: 120 days, 80.1% precision, and $2.04 per investigation beats another AI-SOC demo.

sharp

Microsoft’s strongest claim is not GPT-5.4; it is running an agent inside Defender customers for 120 days. The paper reports 80.1% precision from customer feedback, novel alerts for about 15% of investigated incidents, 28 minutes median runtime, $2.04 median token cost, and 0.38% job-level failure. That is close to a SOC cost unit, not a slideware assistant. I still distrust the 80.1% until the feedback selection path is clear. Security alerting dies when false positives flood analysts, not when the model fails to write a nice explanation. The credible part is boring engineering: schema validation, grounding requirements, bounded retries, and fail-closed suppression. Those controls matter more than the GPT-5.4 nameplate.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas defines six control-decision states, nine trajectory-failure categories, and a coverage audit across 15 agent benchmarks. In an eight-model synthetic run with 1,342 generated items, removing the explicit label menu reduces trajectory accuracy by 14–40 percentage points, and no model leads on all three measured axes.

#Agent#Benchmarking#Tools#AgentAtlas

why featured

HKR-H/K/R all pass: this reframes agent evaluation around trajectory audits, with a concrete 14–40 pp drop. Single arXiv paper, so it lands in featured, not must-write.

editor take

AgentAtlas hits the agent-eval sore spot: models look composed with label menus, then drop 14–40 points when the menu disappears.

sharp

AgentAtlas lands because it measures the crutch, not another leaderboard crown. In an eight-model, 1,342-item synthetic run, removing the explicit label menu drops trajectory accuracy by 14–40 points and compresses models into a 0.54–0.62 band. That is a nastier result than a single success column: many agents are learning the answer interface, not the control policy. I like that the authors call it a measurement-protocol demonstration, not a benchmark release. WebArena, OSWorld, and SWE-bench have all suffered from overreading one outcome number. AgentAtlas at least forces six control states onto the table: Act, Ask, Refuse, Stop, Confirm, Recover. The weakness is also plain: synthetic items do not prove deployed browser or coding agents behave safely, and the 15-benchmark coverage audit does not replace live trajectory review.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Multi-agent Collaboration with State Management

STORM mediates multi-agent writes to a shared codebase and resolves conflicts at write time; it beats the git-worktree multi-agent baseline by 18.7 points on Commit0-Lite and 1.4 points on PaperBench.

#Agent#Code#Benchmarking#STORM

why featured

HKR-H/K/R pass: STORM gives a concrete mechanism for shared-code multi-agent conflict handling and reports +18.7 on Commit0-Lite and +1.4 on PaperBench. Single arXiv paper, so it stays below must-write.

editor take

STORM attacks the right failure mode: shared-state writes. +18.7 on Commit0-Lite is real signal; +1.4 on PaperBench says don’t crown it yet.

sharp

STORM makes the right bet: multi-agent coding fails less from missing agents than from unmanaged shared state. Instead of giving each agent a separate git worktree, it mediates writes into one codebase, detects conflicts at write time, and keeps views consistent. The +18.7 gain on Commit0-Lite, with a top score of 87.6, is too large to dismiss as benchmark noise. I don’t buy the “plug into any multi-agent system” claim yet. PaperBench improves by only +1.4, topping out at 78.2, which says harder research-style tasks bottleneck on decomposition, verification, and long-horizon planning, not only file conflicts. After a year of AutoGen- and CrewAI-style topology talk, STORM is a useful slap: fancy agent graphs are expensive theatre if the shared workspace has no state discipline.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Research proves conditional equivalence of DPO and RLHF with failure modes and provable alignment

The paper proves DPO and RLHF equivalence depends on one implicit assumption: the RLHF-optimal policy must prefer human-preferred responses; the authors introduce CPO, add constraints for provable alignment, and provide code, while the RSS snippet does not disclose benchmark names or quantitative results.

#Alignment#Fine-tuning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv alignment paper without replication or adoption signal. Code and testable claims put it in the low featured band.

editor take

DPO has been treated as cheap RLHF for too long; this paper pins that bargain on one fragile assumption, and CPO’s SOTA claim needs receipts.

sharp

DPO’s cheap-RLHF story takes a real hit here: the equivalence only holds under one hidden condition, that the RLHF-optimal policy also prefers the human-preferred response. When that breaks, DPO optimizes advantage against the reference policy, not human preference directly; the ugly case is loss going down while the policy favors rejected answers. That lands because DPO became the default low-friction post-training move after teams got tired of PPO plumbing. The authors propose CPO, add constraints for provable alignment, and ship code in a 49-page paper. But the abstract only says “standard benchmarks” and “state-of-the-art”; it gives no benchmark names, model sizes, win rates, or compute cost. I buy the theoretical warning faster than the CPO victory lap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Instant GPU Efficiency Visibility at Fleet Scale

The paper introduces OFU, a GPU efficiency metric derived from Tensor Pipe Activity and SM clock frequency; across 608 production training jobs, it reaches r=0.78 correlation with MFU and detects a 2.5x efficiency regression.

#Benchmarking#Inference-opt#Research release#Benchmark

why featured

HKR-H/K/R all pass: the story has a fleet-scale GPU-efficiency hook, testable OFU details, two on-chip counters, 608 production jobs, and r=0.78. It is narrower than a model release, so it sits at 78.

editor take

OFU is more useful than another training-stack tweak: 608 production jobs and r=0.78 make GPU waste observable before the postmortem.

sharp

OFU matters because it moves utilization tracking out of framework instrumentation and into hardware counters. It uses only Tensor Pipe Activity and SM clock frequency, stays precision-agnostic across FP16, TF32, FP8, and NVFP4, and the paper claims MFU prediction within 2 percentage points after GEMM calibration on H100 and GB200. The 608-job production result is not magic: r=0.78 is good enough for fleet alarms, not for billing-grade truth. Still, it caught two framework FLOPs miscalculations and a 2.5x efficiency regression. That is the expensive failure mode in AI infra: not slow models, but GPU burn nobody sees until the run is already waste.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

VerifySteer selectively intervenes on hidden states at verification paragraph boundaries to control step-wise verifier strictness, and experiments on ProcessBench and Hard2Verify show performance above prompt optimization and activation-steering baselines while matching self-consistency with 4-7x less inference compute.

#Reasoning#Interpretability#Inference-opt#Research release

why featured

HKR-H/K/R all pass: the paper gives a concrete steering mechanism, ProcessBench/Hard2Verify tests, and a 4-7x compute claim. It stays below 85 because source authority and artifact details are not disclosed.

editor take

VerifySteer treats verifier strictness as a latent control knob, not a prompt trick; 4-7x less compute is attractive if the boundary signal transfers.

sharp

VerifySteer’s useful claim is that step-wise verifier calibration lives at a controllable location: hidden states near verification paragraph boundaries. That is a cleaner lever than another round of prompt tuning or self-consistency sampling. The paper’s concrete hook is strong: on ProcessBench and Hard2Verify, it beats prompt optimization and activation-steering baselines, then matches self-consistency with 4-7x less inference compute. I buy the direction, with one caveat. If this boundary signal only holds for a narrow verifier family, it is an inference-cost hack. If it transfers across verifier models and problem formats, it becomes a runtime control layer on top of verifier fine-tuning. The authors say it adds gains over fine-tuned verifiers, and the code is public, so this one should get stress-tested fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

FEAT replaces O(N^2) attention with a dual-axis encoder for structured data, combines AFBM with Conv-GLA for O(N) cross-tuple contextualization, and reports zero-shot gains on 12 real-world database benchmarks with up to 50x lower inference latency.

#Inference-opt#Benchmarking#FEAT#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv paper with mechanism and benchmark claims only; open-source status and independent replication are not disclosed, so it sits low in featured.

editor take

FEAT attacks the O(N²) bottleneck in table models; for enterprise databases, that beats another chat-model release.

sharp

FEAT’s sharp move is not the “foundation model” label. It moves cross-tuple modeling from O(N²) to O(N) while preserving permutation invariance, the constraint most table models quietly break. The paper reports zero-shot wins across 12 real-world database benchmarks and up to 50x lower inference latency, using a dual-axis encoder with AFBM plus Conv-GLA for cross-tuple contextualization. I still discount the branding. The body does not give training scale, parameter count, released weights, or production throughput on enterprise databases. The 50x number also depends on sequence length and hardware. But the direction is right: TabPFN- and CARTE-style structured-data models hit scaling pain before they hit product usefulness. FEAT at least moves the problem back into database-system constraints, not demo-board accuracy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

HandITL blends human corrective intent with autonomous execution for bimanual dexterous manipulation, cutting intervention jitter by 99.8%, reducing grasp failures by 87.5%, lowering mean completion time by 19.1%, and producing policies that average 19% better than standard teleoperation data across three long-horizon tasks.

#Robotics#Agent#Multimodal#HandITL

why featured

HKR-H/K/R all pass, with concrete robotics results, but this is still an arXiv VLA-policy paper. It is featured-level research, not a platform or model-launch story.

editor take

HandITL’s 99.8% jitter drop is the tell: dexterous-hand data quality breaks at takeover time, not in the demo count.

sharp

HandITL hits the ugly interface in dexterous robotics: the instant a human takes over. VLA policies accumulate small contact errors over long horizons, and standard teleoperation injects gesture jumps exactly when the correction should be clean. The reported numbers are unusually direct: 99.8% lower intervention jitter, 87.5% fewer grasp failures, and 19.1% lower completion time. I buy this direction more than another “more teleop data” paper. ALOHA-style work showed cheap bimanual data can move the field, but high-DoF hands are less forgiving; a discontinuous takeover poisons the correction distribution. The 19% average gain across three long-horizon tasks is the right hook. I’d still want the PDF details on sample count and hardware, but the failure mode is real.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

ZEBRA reduces multi-phase LLM-agent budget allocation to a continuous nonlinear knapsack problem; on 150 APPS coding tasks, at a 0.5 unconstrained-spend budget, it recovers 94.4% of unconstrained quality versus 88.1% for LLM-direct.

#Agent#Code#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the half-budget quality hook is concrete, the post gives a mechanism plus 150-task results, and orchestration cost resonates with practitioners. Single arXiv paper with no adoption or artifact disclosed keeps it below the top band.

editor take

ZEBRA turns agent cost control back into an optimization problem; 94.4% vs 88.1% is modest, but beats vibes-based LLM budgeting.

sharp

ZEBRA’s useful move is not the zero-shot label; it pulls multi-stage agent budgeting out of prompt judgment and into continuous nonlinear knapsack plus water-filling. On 150 APPS tasks, it recovers 94.4% of unconstrained quality at 0.5 spend, versus 88.1% for LLM-direct. The 14.3-point gain on a three-phase HotpotQA pipeline matters more than another coding-only result. I still have doubts. The controller estimates utility curves, and production agents face tool failures, queue latency, retries, and cache hits that rarely behave like clean curves. But the direction is right: agent cost control should not mean “think fewer steps” everywhere. Budget should move to the phase with the best marginal return.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Universal Reasoner adds a standalone reasoning module’s logits to frozen LLM outputs at inference time, and the paper reports that this plug-and-play design outperforms existing fine-tuning methods in experiments on mathematical reasoning and machine translation.

#Reasoning#Fine-tuning#Inference-opt#Universal Reasoner

why featured

HKR-H/K/R all pass, but this is a single arXiv item with no benchmark numbers, code link, or independent replication in the provided text. The inference-time reasoning-module claim clears featured, not p1.

editor take

UniR’s sharp move is turning rewards into logits; if cross-backbone transfer holds, a lot of PEFT retraining starts looking wasteful.

sharp

UniR is a cleaner attack on PEFT than another adapter recipe. The paper trains a standalone module with verifiable rewards, then adds UniR logits to a frozen LLM’s logits at inference time. It also claims weak-to-strong transfer inside a model family and composition by summing multiple task modules. If the released code reproduces that across backbones, per-backbone PEFT starts to look like expensive repetition. I still discount the word “universal.” The abstract only says it beats existing fine-tuning methods on math reasoning and machine translation, with broader claims for VLMs and medical reasoning. The exact backbones, gains, and decoding overhead are in the PDF tables, not this arXiv page. Logit addition sounds elegant; in practice it lives or dies on tokenizer alignment and calibration drift.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

The paper models synthetic data markets with SDCE. PMIR converges to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A C4-synthetic benchmark over ten retraining generations estimates a collapse-rate coefficient of 0.181, while calibrated experiments raise generation-ten model quality by 23.1%.

#Alignment#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is still an arXiv research paper without a major-lab release, artifact, or cross-source cluster. Concrete mechanisms and C4 numbers place it just above the featured threshold.

editor take

A 7-page paper claiming SDCE, Cramer-Rao, PMIR, and a 23.1% lift smells less like a theory breakthrough than formula fireworks.

sharp

The loud part is not “synthetic data markets”; it is the amount of closure packed into 7 IEEEtran pages. The paper claims SDCE existence and generic uniqueness, gives s*=KL(q||p)/(2κ), proves PMIR convergence at O(ε^-2 log T), then reports a C4-synthetic ten-generation collapse coefficient of 0.181, nearly matching the structural 0.183. I buy provenance pricing as a useful frame, but I don’t buy this much neatness on first pass. The Shumailov-style collapse literature keeps running into messy generation pipelines and filtering choices. A 23.1% generation-ten quality lift and Wasserstein drift dropping from 0.318 to 0.142 need released scripts and dataset construction details; otherwise the calibrated experiment may just validate its own assumptions.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Reinforcing Human Behavior Simulation via Verbal Feedback

The paper introduces DITTO, a model that uses verbal feedback and GRPO to jointly optimize initial and improved rollouts; its SOUL benchmark covers 10 tasks across six categories, where DITTO improves 36% on average over the base model and exceeds GPT-5.4 on 6 of 10 benchmarks.

#Agent#Reasoning#Benchmarking#DITTO

why featured

HKR-H/K/R all pass, but this is a single arXiv paper without top-lab backing or external replication. The 36% gain and 6/10 GPT-5.4 comparison clear featured, not same-day must-write.

editor take

DITTO makes verbal feedback trainable via GRPO; the 36% lift is sharp, but 6/10 over GPT-5.4 on SOUL is not general social intelligence yet.

sharp

DITTO’s useful move is not “human-like behavior”; it makes subjective verbal feedback trainable. After each rollout, the model receives verbal feedback, produces an improved rollout, then jointly optimizes both with GRPO. At test time, no feedback is required. That is a cleaner fit for role play, patient simulation, learner simulation, and user simulation than scalar preference labels. The numbers are strong: SOUL spans 10 tasks across six categories, DITTO improves 36% over the base model, and beats GPT-5.4 on 6 of 10 SOUL benchmarks. I would still discount the headline. SOUL is both a benchmark and a training data suite, and the snippet does not disclose train/eval separation, prompting, or sampling settings for GPT-5.4. Human-behavior simulation benchmarks turn into acting exams very easily.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Retrospective Sparse Attention for Efficient Long-Context Generation

RetroAttention retrospectively revises past attention outputs with newly arrived KV entries during decoding, maintaining a lightweight output cache and raising effective KV exposure by up to 1.6x and accuracy by up to 21.9% on long-generation benchmarks.

#Inference-opt#Reasoning#Code#RetroAttention

why featured

HKR-H/K/R all pass, but this is a single arXiv inference-optimization paper with no disclosed code, major-model integration, or production validation, so 72–77 is the safer band.

editor take

RetroAttention attacks the ugly part of long decoding: fixing bad sparse attention after the fact, not pretending the first KV selection was right.

sharp

RetroAttention is a useful admission that sparse attention during long decoding makes recoverable mistakes. The mechanism is the hook: keep a lightweight output cache, then use newly generated KV entries to revise earlier attention outputs. The paper claims up to 1.6x effective KV exposure and up to 21.9% accuracy gain on long-generation benchmarks. That is a better target than another input-context pruning trick, because code generation and multi-turn agents accumulate errors after the prompt. I have one hard doubt: the abstract says “minimal latency overhead” but gives no latency curve here. If retrospective correction dents decoding throughput, the 1.6x exposure number becomes a benchmark win that serving teams will treat very carefully.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

The paper introduces ACR to measure GRPO batches with ineffective gradients. AVSPO injects virtual reward samples without extra rollouts, reduces advantage collapse by 58-63% versus GRPO, and improves math reasoning accuracy by 4-6 percentage points across 0.5B to 14B parameter models.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv training-method paper for post-training teams. The concrete ACR/AVSPO mechanism and 58-63% / 4-6 point results put it just above the featured bar.

editor take

GRPO finally gets a failure gauge: ACR turns all-correct/all-wrong groups into a measurable waste signal, and AVSPO’s 4–6 point gain smells like a practical patch.

sharp

GRPO’s failure here is not sparse reward; it is homogeneous group reward killing the gradient. ACR measures those ineffective batches, and AVSPO patches them with virtual reward samples without extra model rollouts. Across 0.5B to 14B math-reasoning models, it cuts collapse by 58–63% versus GRPO and adds 4–6 accuracy points. I buy this more than another vague “RLVR improves reasoning” paper. After DeepSeek-R1, everyone knows verifiable rewards can move math scores; the expensive question is which training batches burn compute without learning. The open code helps. The caveat is narrowness: the abstract names math benchmarks plus one out-of-domain task. If ACR does not stay predictive on coding or tool-use trajectories, AVSPO remains a strong contest-math training patch, not a general post-training recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→LT2: Linear-Time Looped Transformers

The paper introduces LT2, replacing quadratic softmax attention in looped transformers with linear or sparse attention, and reports gains on controlled recall, state-tracking, and language modeling tasks; its converted Ouro-hybrid-1.4B model, trained on about 1B tokens, outperforms industry-level 1B models and competes with industry-level 4B models.

#Inference-opt#Memory#Reasoning#Ouro-hybrid-1.4B

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with mechanism and benchmark claims only, not independent replication or deployment. It clears featured, but stays below 78.

editor take

LT2 attacks the ugly part of looped transformers: attention cost. A 1.4B model trained on ~1B tokens rivaling industry 4B is a bold claim.

sharp

LT2’s sharp idea is that looping gives linear and sparse attention a second life. The paper names two concrete mechanisms: LT2-linear uses repeated passes for iterative memory refinement, while LT2-sparse expands its effective receptive field across loops. That is a cleaner architectural claim than another “replace softmax with linear attention” paper. I’m discounting the Ouro-hybrid-1.4B claim for now. The abstract says ~1B training tokens beat industry 1B models and compete with industry 4B models, but the snippet does not name the baselines, data mix, or token-budget matching. Efficient small-model papers have had this movie before with Mamba, RWKV, and RetNet: strong synthetic or controlled wins, weaker general LM adoption. If LT2 reproduces on long-context reasoning and real serving latency, looped transformers stop looking like a neat academic trick.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

RoPeSLR decomposes DiT attention into an O(L^3/2) sparse spike set and an O(d_h log L) low-rank background, cutting Wan2.1-1.3B FLOPs by up to 10x at 90% sparsity and speeding HunyuanVideo-13B 100K+ token inference by 2.26x with under 1.3% average VBench degradation.

#Multimodal#Vision#Inference-opt#Wan2.1

why featured

HKR-H/K/R pass: the 10x FLOP claim, 90% sparsity, and Wan2.1 test make it concrete. It stays below 78 because it is a specialist attention paper, not a model or product release.

editor take

RoPeSLR gives video DiTs a measurable 2.26x long-context speedup, but the 10x FLOPs claim is not product throughput.

sharp

RoPeSLR hits the right failure mode for video DiTs: vanilla linear attention breaks the 3D RoPE relative-position structure, so the paper splits attention into O(L^3/2) sparse spikes plus an O(d_h log L) low-rank background. That is a stronger claim than another generic sparse-attention recipe. The concrete numbers are useful: up to 10x fewer FLOPs on Wan2.1-1.3B at 90% sparsity, 2.26x end-to-end speedup on HunyuanVideo-13B at 100K+ tokens, with under 1.3% average VBench degradation. I would still discount the headline FLOPs number. Wall-clock only lands at 2.26x, which says kernels, memory traffic, and scheduling eat a lot. Video generation needs attention swaps that survive production inference, not prettier asymptotic math.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→The Illusion of Intervention: Your LLM-Simulated Experiment Is an Observational Study

The paper argues that LLM-simulated user experiments can induce user drift under interventions, where latent user attributes shift across treatment conditions, and proposes negative control outcomes to diagnose distribution shifts plus persona adjustments with setting-relevant confounders to reduce bias in survey-style and multi-turn agent evaluations.

#Agent#Benchmarking#Alignment#Research release

why featured

HKR-H/K/R all pass: the paper has a contrarian hook, concrete causal-diagnostic mechanisms, and clear resonance for synthetic-user evals. As a single arXiv methods paper with no adoption signal disclosed, it sits just above the featured threshold.

editor take

The scary part isn’t fake users; it’s the model silently swapping the population when treatment changes.

sharp

This paper hits the weak spot in LLM user simulation: the intervention can change the simulated population itself. The concrete hook is user drift: once the treatment condition changes, latent user attributes shift too, so treatment and control are no longer comparable. Their proposed diagnostic is also practical: use negative control outcomes, attributes that should stay invariant, to catch distribution shifts. I buy the direction because a lot of agent and survey evaluation work has been using synthetic users as a cheap substitute for human experiments. The abstract does not disclose drift size, model list, or benchmark numbers, so the empirical strength is still unclear. But framing this as selection bias, rather than a prompting flaw, is the useful move.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

The authors audited evaluation disclosure in 12 LLM agent benchmark papers: eight agent papers averaged 0.38 out of 1.0, four static benchmark papers averaged 0.66, and none of the eight agent papers disclosed inference cost in any form.

#Agent#Benchmarking#Inference-opt#Research release

why featured

HKR-H/K/R pass: the audit has an exposé angle and concrete benchmark-disclosure numbers. Its 12-paper pilot scope and single arXiv source keep it at the featured threshold, not a high-priority research event.

editor take

Agent evals don’t just have noisy scores; their papers often hide the run recipe. A 0.38/1.0 disclosure score is ugly.

sharp

Agent benchmarking’s core failure is not leaderboard inflation; it is missing run recipes. This audit covers 12 papers, and the eight agent benchmark papers average only 0.38/1.0 on disclosure, versus 0.66 for four static benchmarks. None of the eight reports inference cost in any form. None fully disclose a content-addressed container image for the evaluation environment. The paper is small: one auditor, one pass, and no claim that disclosure equals correctness. I still buy the diagnosis. Agent scores move with scaffolds, sampling settings, subsets, evaluator versions, and tool harnesses. After the SWE-bench arms race, every serious lab should treat the harness as part of the result. Reporting pass rate without the execution recipe is basically benchmark theater.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Compute Aligned Training: Optimizing for Test-Time Inference

The paper proposes Compute Aligned Training, aligning SFT and RL objectives with aggregated or filtered test-time strategies, and derives new loss functions for common inference strategies; the abstract states empirical gains in test-time scaling but does not disclose benchmark numbers.

#Reasoning#Inference-opt#Fine-tuning#Research release

why featured

HKR-H/K/R all pass, but the body only gives the mechanism summary; gains, model scale, and reproducible setup are not disclosed, so this stays at featured-threshold research rather than 78+.

editor take

Compute Aligned Training trains for the inference policy you actually run; useful direction, but the abstract gives no benchmark numbers, so don't crown it yet.

sharp

Compute Aligned Training hits a real mismatch in post-training: models train on single samples, then ship with aggregation, filtering, or best-of-N at inference. The paper frames test-time strategies as operators over the base policy, then derives SFT and RL losses for those operators. That is the right place to look if your product already spends tokens on search. The hard evidence is thin: arXiv:2604.24957 v2, two authors, SFT/RL coverage, and “common test-time strategies.” The abstract claims substantial test-time scaling gains, but gives no benchmarks, model sizes, N values, or compute budget. I’d file this beside verifier and reranker training, not as a proven new post-training recipe yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Self-Refining Video Sampling

The paper introduces self-refining video sampling, which treats a pre-trained video generator as a denoising autoencoder for iterative inference-time refinement without an external verifier or extra training, and reports over 70% human preference versus the default sampler and guidance-based sampler.

#Multimodal#Vision#Inference-opt#Research release

why featured

HKR-H and HKR-K pass: the method is concrete and reports >70% human preference. HKR-R is weak because model details, benchmark setup, and deployment cost are not disclosed, so this sits at the featured threshold.

editor take

Video models don’t just need better prompt following; they need samplers that stop wrecking physics. Self-refinement is cheap enough to matter.

sharp

Self-refining sampling is a cost story, not a generic “better video” story. The paper treats a pretrained video generator as a denoising autoencoder, then runs iterative inner-loop refinement at inference time. It uses no external verifier and no extra training. The authors report over 70% human preference against the default sampler and a guidance-based sampler. I care less about the preference score than whether this drops into existing video stacks. Sora, Veo, and Runway-style systems keep exposing the same failure mode: motion coherence and physics break before image quality does. Retraining is expensive, and post-processing often smears detail. The uncertainty-aware part is the useful hook: refine only low self-consistency regions. But the body gives no end-to-end latency or added sampling-step count. If that 70% preference costs 2–3x inference, product teams will cool fast.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

PrefixWall monitors cross-user Automatic Prefix Caching reuse and selectively isolates suspicious prefixes in multi-tenant LLM serving systems, reporting up to 70% higher cache reuse and 30% lower inference latency than defenses that isolate users and disable shared caching.

#Inference-opt#Safety#PrefixWall#Research release

why featured

HKR-H/K/R all pass, but this is an arXiv systems-security paper rather than a major model or platform release. The mechanism and latency/cache numbers clear the featured bar, not the 78+ band.

editor take

PrefixWall hits a real serving pain: APC latency gains are too expensive to throw away with blanket per-user cache isolation.

sharp

PrefixWall belongs in production-serving conversations because it prices security in latency, not slogans. APC leakage is concrete: an attacker observes cache hit and miss timing, then incrementally infers another tenant’s prefix. The proposed fix monitors cross-user APC reuse and isolates suspicious prefixes instead of killing shared caching. The hook is the reported gain: up to 70% higher cache reuse and 30% lower inference latency than defenses that isolate users. That is the right axis for vLLM-style and cloud multi-tenant serving, where prefix caching is already part of the margin math. I have one hard caveat: the abstract does not show false-positive rates, attack workload details, or curves under messy prompt distributions. In production, those decide whether PrefixWall is a clean guardrail or another brittle policy layer.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

The paper introduces a 3,034-sample human-curated benchmark for 3D spatial understanding and evaluates six VLMs; models reach 53–97% accuracy on visible-layout rearrangement planning, but drop to 6–45% on occlusion and below 7% on reflection tasks.

#Multimodal#Vision#Benchmarking#Qwen3-VL-8B-Thinking

why featured

HKR-H/K/R all pass: the paper has a sharp framing, a 3,034-sample benchmark across six VLMs, and a practical reliability angle. It stays in the featured-threshold band because it is a single arXiv benchmark, not a major lab release.

editor take

VLMs are not failing at object naming; token compression is erasing 3D signals. Some robot-demo confidence needs a haircut.

sharp

This paper pins VLM spatial failure on an architectural seam, not a vague lack of data. Across 3,034 human-curated samples, six models, and 18,204 human-scored responses, visible-layout rearrangement hits 53–97% accuracy, while occlusion falls to 6–45% and reflection stays below 7%. That gap says the models parse objects, then lose the geometry needed for stable 3D reasoning. The sharp result is the Qwen3-VL-8B-Thinking white-box analysis: spatial information remains recoverable in the vision encoder, then becomes inaccessible after the visual-token merger and only recovers when clean post-merger activations are patched into the decoder. Vendors keep selling higher-resolution multimodal input as perception progress; if the merger discards depth, occlusion, and reflection relations, embodied planning is still gambling with an object inventory.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

APEX uses a DAG-based strategy map, Fork Discovery, and Policy Selection to reduce exploration collapse in self-evolving LLM agents, and the paper reports that it outperforms all baselines on nine Jericho text-adventure games and WebArena.

#Agent#Reasoning#Memory#APEX

why featured

HKR-H/K/R all pass, but this is a single arXiv paper; code, model cost, and production replication details are not disclosed. It clears featured, not the 78+ band.

editor take

APEX nails the failure mode: more memory can make agents more conservative. But “beats all baselines” is thin without scores.

sharp

APEX is useful because it names the agent failure mode correctly: self-evolution does not only forget; it overfits its own memories. The DAG strategy map, milestone dependencies, Fork Discovery, and Policy Selection turn reflection memory into a search controller, not another diary bolted onto the prompt. The evidence is still too abstract. The abstract says APEX beats all baselines on nine Jericho games and WebArena, but gives no scores, baseline names, model backbone, or episode budget. Jericho rewards long-horizon exploration, and WebArena is closer to tool work, but both can move a lot with prompt and budget tuning. My read: this is one of the cleaner attacks on exploration collapse in agents, but it is not yet an engineering default without the full results table.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

The paper introduces Hack-Verifiable TextArena, which embeds detectable vulnerabilities inside environments and uses deterministic checks to measure whether language models exploit reward-hacking opportunities across settings.

#Agent#Alignment#Benchmarking#TextArena

why featured

HKR-H/K/R all pass: the paper offers a concrete reward-hacking eval mechanism, not just benchmark rhetoric. No hard-exclusion rule applies, but arXiv-only sourcing and limited deployment evidence keep it in the low featured band.

editor take

Embedding traps into the environment is cleaner than post-hoc trajectory forensics; without model-level results, this is a benchmark method, not a safety verdict.

sharp

Hack-Verifiable TextArena turns reward hacking from after-the-fact auditing into an instrumented test, and that is the useful move here. The concrete mechanism is simple: put detectable vulnerabilities inside the environment, then use deterministic checks to see whether the model exploits them. That is cleaner than asking humans to inspect trajectories and decide whether an agent “cheated.” It gives agent evals a second column beside task success: how dirty the win was. I buy the setup, but not any implied alignment win yet. The snippet names TextArena and says “diverse environments and settings,” but it does not disclose model lists, exploit types, trigger rates, or how they separate accidental exploration from strategic abuse. OpenAI and Anthropic agent evals keep running into this exact pressure: stronger agents find edge cases faster. This benchmark earns its keep if model releases start reporting hack rate next to success rate.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→Comparing Explanations Is Not Enough: Explain Behavioral Shifts in Large Language Models

arXiv 2602.02304v2 proposes Comparative XAI, or XAI_Delta, to explain behavioral shifts between two large-language-model checkpoints after interventions such as scaling, fine-tuning, RLHF, or in-context learning; the paper defines four requirements—comparability, validity, actionability, and monitoring—and frames the output as a transition report for governance and incident documentation.

#Interpretability#Safety#Fine-tuning#Research release

why featured

HKR-H/K/R all pass, but this is a safety/interpretability standards proposal, not a model launch or proven production replacement; it fits the 72–77 featured band.

editor take

Good framing: vendors explain snapshots, auditors need deltas. A checkpoint update without a transition report is governance theater.

sharp

Comparative XAI hits a real audit gap: model incidents often appear after fine-tuning, RLHF, scaling, or in-context changes, while most XAI explains one checkpoint in isolation. Framing XAI_Delta as a transition report between two model instances is practical. The four requirements—comparability, validity, actionability, monitoring—map better to change control than another saliency map or probe paper. I buy the problem framing; I don’t buy the “new paradigm” label yet. The abstract says “illustrative experiments,” but gives no model names, task suite, drift metric, or causal identification protocol. EU AI Act files and enterprise model cards already invite checkbox compliance. Without a reproducible delta test harness, XAI_Delta becomes a cleaner PDF for the same vendor story.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

RAT+ uses dense pretraining with full-sequence recurrence to switch to dilated attention at inference time, and a 1.5B-parameter model trained on 100B tokens closely matches dense accuracy at D=16.

#Inference-opt#Reasoning#Memory#RAT+

why featured

HKR-H/K/R all pass, but this is still a technical inference-optimization paper whose impact depends on replication and integration. It sits in the 72–77 research-release band.

editor take

RAT+ makes sparse attention an inference-time switch, not a training-time bet; the catch is that 7.6B scale says little about frontier serving pain.

sharp

RAT+ lands because it turns dilated attention into a serving-time configuration, not a separate pretraining choice. The concrete hook is strong: a 1.5B model trained on 100B tokens stays near dense accuracy at D=16, drops 2–3 points at D=64 on commonsense reasoning and LongBench, and the 2.6B / 7.6B runs report a 64x cut in attention FLOPs and KV cache with about one point average loss. I like this more than another KV-cache-only compression paper, because it touches compute and memory with one knob. I’m still skeptical on deployment value. The “short” adaptation is still 1B tokens, and the abstract gives no online throughput, tail latency, or batching setup. Those decide whether this beats paged attention tricks in a real inference stack.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

LAION-C introduces six new distortion types to test OOD robustness in web-scale vision models, and the authors report that contemporary systems including Gemini and GPT-4o still face significant challenges on the benchmark.

#Vision#Multimodal#Benchmarking#LAION

why featured

HKR-H/K/R pass, but this is a single arXiv benchmark without cross-source traction, code impact, or a production-replacement claim; score stays at the featured threshold.

editor take

LAION-C punctures the ImageNet-C victory lap: web-scale vision models often look robust because the corruptions are already in the crawl.

sharp

LAION-C lands because it dodges the stale corruption set, instead of letting models farm blur and JPEG artifacts again. The paper introduces six distortion types designed to stay OOD even for LAION-scale crawls, then says Gemini and GPT-4o still struggle. That is a cleaner stress test than most multimodal leaderboards, because web-scale training has already swallowed much of the ImageNet-C world. I don’t buy the paper’s bigger claim that top models now match or beat the best human observers without seeing the psychophysics details. The abstract names the experiment, but not sample size, task framing, or statistics. Human baselines can be kneecapped by protocol. LAION-C looks useful as a contamination-resistant robustness probe, not as proof that vision robustness is solved.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

The arXiv:2502.12120v3 paper reports that pretraining data determines loss-to-loss scaling trends, while model size, optimization hyperparameters, tokenizer choice, and architectural differences between Llama-style transformers and Mamba-style state-space models have limited impact.

#Benchmarking#Reasoning#Llama#Mamba

why featured

HKR-H/K/R all pass, but the feed gives only abstract-level claims with no experiment scale or error numbers; score stays in the lower 72–77 research-release band.

editor take

This paper kicks architecture anxiety back to the data ledger: even Llama-vs-Mamba differences don’t steer loss-to-loss trends as much as pretraining data.

sharp

The sharp claim here is that loss-to-loss scaling has one dominant steering wheel: pretraining data. In arXiv:2502.12120v3, the authors say the trend is largely set by the dataset, while model size, optimization hyperparameters, tokenizer choice, and even Llama-style Transformer versus Mamba-style state-space architecture have limited effect. I buy the direction, but not the broad extrapolation yet. The snippet does not give dataset mixes, model-size range, training-token counts, or the downstream task suite. Those details decide whether “limited impact” means engineering noise or a cross-architecture rule. For model teams, the practical read is blunt: optimize architecture for throughput and cost; bet on pretraining data if you care about transfer.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→torchtune: PyTorch Native Post-Training Library

The paper introduces torchtune as a PyTorch-native LLM post-training library, covering fine-tuning, experimentation, and deployment-oriented workflows, and evaluates it against Axolotl and Unsloth on performance and memory efficiency across representative settings.

#Fine-tuning#Inference-opt#PyTorch#torchtune

why featured

HKR-K/R pass: torchtune frames fine-tuning, experimentation, and deployment as a PyTorch-native post-training stack against Axolotl and Unsloth. No concrete benchmark or memory numbers are disclosed, so this sits at the low featured band.

editor take

torchtune is PyTorch pulling post-training back into its own stack, not another fine-tuning wrapper; Axolotl and Unsloth lose some oxygen.

sharp

torchtune’s bet is not convenience; it is PyTorch reclaiming the default surface for LLM post-training. The paper is 14 pages and names model builders, training recipes, a distributed stack, and comparisons against Axolotl and Unsloth on speed and memory. The excerpt gives no concrete throughput, peak memory, GPU setup, or winning margin. That still matters for research teams. Axolotl owns recipe density, and Unsloth owns the low-VRAM quick-start lane. torchtune is pitching hackability, direct PyTorch access, and reproducible workflows. Honestly, if PyTorch makes FSDP, recipes, and deployment-oriented training feel native, third-party fine-tuning libraries have to offer more than YAML ergonomics and clever patches.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→GradPower: Powering Gradients for Faster Language Model Pre-Training

GradPower applies an elementwise sign-power transform, sign(g_i)|g_i|^p, before the base optimizer with a one-line code change; AdamPower reports lower terminal loss across LLaMA, Qwen2MoE, 66M to 2B parameters, C4, OpenWebText, and cosine or warmup-stable-decay schedules.

#Reasoning#Inference-opt#GradPower#LLaMA

why featured

HKR-H/K/R all pass, but this is a single arXiv optimizer paper whose impact depends on independent replication. The mechanism and test scope justify featured, not the 78+ band.

editor take

GradPower is the kind of one-line optimizer tweak that makes training teams sweat: if it replicates, many expensive recipes lose to one exponent.

sharp

GradPower’s sharp edge is not the ICML 2026 acceptance. It compresses a pretraining gain into one transform: sign(g_i)|g_i|^p. The authors report lower terminal loss for AdamPower across LLaMA, Qwen2MoE, 66M to 2B parameters, C4, OpenWebText, and cosine or warmup-stable-decay schedules, without changing Adam internals or hyperparameters. I’m always suspicious of “one-line” optimizer papers, because many die at serious scale. This one has two hooks that make it harder to dismiss: the largest gains show up on MoE with warmup-stable-decay, and the transform stacks with Muon. If 7B or 70B runs hold, GradPower belongs in default training-stack trials, not the arXiv tricks folder.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·21

→ZeroUnlearn: Few-Shot Knowledge Unlearning Method for Large Language Models

ZeroUnlearn reformulates machine unlearning as knowledge re-mapping, uses a closed-form multiplicative parameter update to map sensitive inputs to a neutral target state, and releases code on GitHub.

#Fine-tuning#Safety#ZeroUnlearn#XMUDeepLIT

why featured

HKR-H/K/R all pass, but the body gives only abstract-level facts: no metrics, model sizes, or eval setup. Open-source safety research clears featured, but stays in the 72–77 band.

editor take

ZeroUnlearn is a clean model-editing take on unlearning; without model scale, benchmarks, or audit details, I’m not buying it as a safety fix yet.

sharp

ZeroUnlearn’s useful move is compressing unlearning into a closed-form multiplicative parameter update, not another round of destructive fine-tuning. The abstract gives a concrete mechanism: map sensitive inputs to a neutral target state, enforce representational orthogonality, remove the original representations, and add a gradient-based variant for multi-sample unlearning. Code is also released. I’d discount the performance claim until the eval is visible. The snippet says it outperforms baselines while preserving utility, but gives no model sizes, datasets, attack setup, or retain/forget metrics. Unlearning has not failed because demos cannot suppress a target string; it fails on paraphrases, multi-hop recovery, and collateral damage to nearby knowledge. ROME/MEMIT-style editing already showed the same pattern: clean local surgery, messy behavior once triggers move off-distribution.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Matryoshka Concept Bottleneck Models

MCBM uses one nested concept hierarchy for multi-granularity inference without retraining separate models for concept budgets, reducing expected test-time intervention cost from linear order to O(log K) while guaranteeing monotonic performance improvement.

#Interpretability#Inference-opt#Research release

why featured

HKR-H and HKR-K pass via nested CBMs and the O(log K) intervention-cost claim; HKR-R is weak because impact centers on interpretability specialists. Single arXiv sourcing keeps it below featured.

editor take

MCBM claims O(log K) intervention cost. Experiments are undisclosed, so I’d treat it as a CBM deployment-cost paper.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

SAVER uses a Conformal Groundability Gate to decide whether MNER spans or MRE entity pairs should consult visual evidence, then calibrates activation thresholds on a held-out split with Clopper-Pearson upper bounds. Experiments report higher F1 than text-only and always-on multimodal baselines, while reducing FLOPs and P90 latency.

#Multimodal#Vision#Benchmarking#SAVER

why featured

HKR-H/K/R all pass: selective vision is a clean hook, with a concrete calibration mechanism and cost-latency angle. The MNER/MRE scope is niche and exact F1/FLOPs/P90 numbers are not disclosed, so it stays in the 60–71 band.

editor take

SAVER gates vision per span with CGG and reports F1/FLOPs/P90 wins; datasets and margins aren’t disclosed here, so trust the routing idea, not the victory lap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

CoPhy distills VLM knowledge into a BEV encoder and removes the VLM at inference, uses an auto-regressive BEV world model to predict future semantic maps conditioned on candidate actions, and optimizes the driving policy with GRPO using physical rewards from BEV rollouts and cognitive rewards from a language-aligned scorer.

#Robotics#Vision#Reasoning#CoPhy

why featured

HKR-K/R pass: CoPhy gives a VLM-to-BEV distillation path, VLM-free inference, a BEV world model, and dual-reward GRPO. No results, code, or road-test evidence are disclosed, so this stays in the 60–71 band.

editor take

CoPhy claims SOTA on NAVSIM v1/v2, but RSS gives no scores; verify the BEV-distilled, VLM-free inference path first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Bayesian Preference Learning for Test-Time Steerable Reward Models

ICRM models latent preference probabilities with a Bradley-Terry likelihood and a conjugate Beta prior, then steers reward models at test time using in-context preference demonstrations. The paper reports RM-Bench accuracy rising from 60.5 to 70.8 with more demonstrations, lower calibration error than a generative judge on moral dilemmas, broader Pareto frontiers under conflicting preferences, and stronger math reasoning rewards than a conventional reward model.

#Alignment#Reasoning#Benchmarking#Research release

why featured

HKR-K passes with a concrete mechanism and RM-Bench gain; HKR-R passes for alignment/eval relevance. As a single arXiv paper with a narrow technical title, it stays below the featured threshold.

editor take

ICRM lifts RM-Bench from 60.5 to 70.8; I buy test-time preference demos, but RSS omits model size and demo count.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Understanding and Improving Communication Performance in Multi-node LLM Inference

The paper introduces NVRAR, a hierarchical all-reduce algorithm using NVSHMEM, and reports 1.9–3.6x lower latency than NCCL for 128KB–2MB messages, plus up to 1.72x lower end-to-end batch latency for Llama 3.1 405B in multi-node decode-heavy tensor-parallel inference.

#Inference-opt#YALIS#NVRAR#NCCL

why featured

HKR-H/K/R pass: NVRAR vs NCCL and Llama 3.1 405B latency numbers are concrete. The topic is narrow distributed inference plumbing, so it stays below featured.

editor take

NVRAR cuts 128KB–2MB all-reduce latency 1.9–3.6x; for 405B decode, the ugly comms work is the bottleneck.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

DECO matches dense Transformer performance under the same total parameter budget and training tokens, activates 20% of routed experts, and delivers a 2.93x inference speedup over dense inference on Jetson AGX Orin.

#Inference-opt#Tsinghua NLP#DECO#Jetson AGX Orin

why featured

HKR-K/R are strong: 20% expert activation and 2.93x Jetson AGX Orin speedup are concrete. The arXiv architecture angle is narrow for general AI pros, so it stays in the 60-71 band.

editor take

DECO activates 20% experts and runs 2.93x faster on Jetson AGX Orin; edge MoE finally tackles memory traffic head-on.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Lean Refactor uses a retrieval-augmented agentic framework to refactor Lean proofs, achieving over 70% token-level compression on competition benchmarks, over 20% on research repositories, and up to 60% compilation-time reduction while using version-filtered strategy retrieval for Lean/Mathlib compatibility.

#Agent#RAG#Code#Lean Refactor

why featured

HKR-K is strong and HKR-H comes from the concrete agentic proof-compression result; HKR-R is weak because Lean is niche. The practical numbers help, but the technical-accessibility drag keeps it in 60–71.

editor take

Lean Refactor cuts competition proofs by 70%+ tokens; I trust version-filtered retrieval more than the agentic-search wrapper.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

PlexRL multiplexes unified LLM services across RLVR jobs with centralized model placement, state transitions, and function-level scheduling under affinity constraints, reducing user GPU-hour cost by up to 37.58% while preserving algorithmic flexibility and adding minimal per-job overhead.

#Reasoning#Inference-opt#PlexRL#Research release

why featured

HKR-K/R pass: the 37.58% GPU-hour cost cut and cluster orchestration mechanism are concrete and relevant to RLVR compute budgets. HKR-H is weak, and a single arXiv abstract keeps it below featured.

editor take

PlexRL cuts RLVR GPU-hour cost up to 37.58% via cluster scheduling; I buy it, but cluster scale is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Efficient Numeracy in Language Models Through Single-Token Number Embeddings

The paper introduces BitTokens, which encodes any number as one token using its IEEE 754 floating-point representation, and reports that small language models learned basic arithmetic algorithms with near-perfect accuracy in experiments.

#Reasoning#Research release

why featured

HKR-H/K/R all have signal: single-token number embeddings are novel and tied to LLM numeracy pain. The post only gives basic arithmetic results, with no model size, error rate, code, or replication, so it stays in 60–71.

editor take

BitTokens packs any number into one token; near-perfect results cover basic arithmetic, not numeric reasoning broadly.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Runtime-Certified Bounded-Error Quantized Attention

The paper proposes a tiered KV cache architecture that stores INT8 keys and INT4 values in GPU memory while retaining FP16 originals in system RAM, computing per-head, per-step error bounds and fallbacks on LLaMA 3.1-8B with contexts up to 128K.

#Inference-opt#Safety#Benchmarking#LLaMA

why featured

HKR-K/R pass with a concrete KV-cache design, bit widths, and 128K test setup. HKR-H is weak, and this is a single arXiv paper without code or production evidence, so it stays in 60–71.

editor take

INT8/INT4 KV gets per-step error bounds plus FP16 fallback; don’t sell this as speed, it sells recoverability.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

The paper proposes parallel chunk-level LLM reasoning with evidence-anchored consolidation, and experiments across multiple model types and sizes report about 84% lower omission error, up to 130% higher evidence traceability, and up to 91% fewer unsupported claims.

#Reasoning#RAG#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv methods paper with no named lab weight, artifact, or production replacement claim. Research-release signal fits 70 and tier all, below featured.

editor take

Parallel chunking cuts omissions 84%, but datasets and baselines aren’t disclosed here; don’t crown it a long-context fix yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Optimization Hyper-parameter Laws for Large Language Models

Opt-Laws predicts final LLM training loss from the LR schedule, model size, and data size; on held-out configurations, it achieves a 94% Top-2 hit rate for near-optimal schedule candidates and detects training divergence with F1=0.92.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K/R pass: the summary gives a testable mechanism and two metrics, tied to training-run failure cost. It stays all because this is a niche arXiv optimization paper with no code, author signal, or production validation disclosed.

editor take

Opt-Laws hits 94% Top-2 on held-out configs; I’d judge it by avoided full runs, not elegant loss prediction.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Multimodal LLMs under Pairwise Modalities

The paper proposes a two-stage framework for training MLLMs with pairwise modality data, using latent representation alignment and cross-modal recomposition; it evaluates the method by adding 3D point clouds and tactile modalities to pre-trained MLLMs with three modality pairs, while the RSS snippet does not disclose benchmark names or exact scores.

#Multimodal#Embedding#Research release

why featured

HKR-H and HKR-K pass: the paper offers a pairwise-modality training mechanism and 3 modality pairs. Without benchmarks, artifacts, or product impact, it stays in the lower research-release band.

editor take

It adds 3D point clouds and touch via 3 modality pairs; no benchmarks or scores disclosed, so treat it as a data-curation bet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DIVE: Embedding Compression via Self-Limiting Gradient Updates

DIVE compresses embeddings with a self-limiting hinge triplet loss and head-wise NT-Xent contrastive loss, and the 14M-parameter open-source adapter beats Matryoshka-Adaptor, Search-Adaptor, and SMEC across six BEIR datasets at every evaluated compression ratio.

#Embedding#RAG#Fine-tuning#DIVE

why featured

HKR-K has concrete mechanisms and BEIR comparisons; HKR-R hits RAG cost and latency. Still, this is a single arXiv compression method with benchmark wins, below the featured threshold.

editor take

DIVE uses a 14M adapter for embedding compression; it beats three baselines on six BEIR sets, but no absolute scores disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Research paper analyzes MXFP4 quantization error decomposition and recovery methods for LLM reinforcement learning

The paper decomposes MXFP4 quantization error into scale bias, deadzone truncation, and grid noise, then applies macro-block scaling, outlier fallback, and adaptive quantization noise on Qwen2.5-3B and Qwen3-30B-A3B-Base, recovering BF16 accuracy to within 0.7% and 3.0%, respectively.

#Reasoning#Fine-tuning#Inference-opt#Qwen

why featured

HKR-K is strong: the paper gives MXFP4 error mechanisms and Qwen experiment numbers. HKR-H/R are real for quantization and RL-tuning teams, but the low-level training focus keeps it in the 60–71 band.

editor take

MXFP4 lands within 0.7% of BF16 on Qwen2.5-3B; this error decomposition beats another mystery tuning recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

STELLAR trains a 500M-parameter 3D perception model on 50 million driving examples. The model extends Sparse Window Transformer inputs to LiDAR, radar, cameras, and map priors, and reports a new state of the art on the Waymo Open Dataset challenge.

#Multimodal#Vision#Robotics#STELLAR

why featured

HKR-K/R pass on concrete scale, multimodal fusion, and Waymo benchmarking; HKR-H is weak. As a single AV perception paper rather than a product or foundation-model release, it stays in the 60–71 band.

editor take

STELLAR trains 500M parameters on 50M driving examples; autonomy perception is finally doing its scaling-law homework.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

AVIS uses autoregressive video diffusion models for streaming video restoration, reducing initial latency from 114 seconds to 4 seconds and raising throughput from 0.71 to 1.18 FPS versus leading non-autoregressive solvers.

#Vision#Inference-opt#AVIS#AVIS Flash

why featured

HKR-H/K pass on the concrete latency/FPS gains and autoregressive streaming mechanism. HKR-R is weak: this remains a niche arXiv video inverse-problem paper, so it stays below featured.

editor take

AVIS Flash hits 5.91 FPS on one RTX 4090; video inverse solvers are starting to look deployable, not just publishable.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

TelecomTS introduces an observability dataset derived from a 5G telecommunications network, preserving de-anonymized covariates and absolute scale information while covering anomaly detection, root cause analysis, and multi-modal question-answering tasks.

#Multimodal#Reasoning#Benchmarking#TelecomTS

why featured

Single arXiv dataset paper with concrete data shape and task setup, so HKR-K/R pass. The topic is narrow and lacks model or product impact, keeping it in the interesting-but-not-featured band.

editor take

TelecomTS keeps absolute 5G metric scale; normalized time-series benchmarks are a bad proxy for observability agents.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

WestWorld pretrains on 89 simulation and real-world environments, using Sys-MoE and structural embeddings to improve zero-shot and few-shot trajectory prediction across diverse robot morphologies.

#Robotics#Reasoning#WestWorld#Unitree Go1

why featured

HKR-K and HKR-R pass: 89 environments plus Sys-MoE give concrete research signal, and cross-embodiment generalization matters for robotics teams. Single arXiv source and a jargon-heavy title keep it below featured.

editor take

WestWorld pretrains on 89 environments; Sys-MoE plus structural embeddings is practical for cross-morphology robots, but gains aren't disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→MeMo: Memory as a Model

MeMo encodes new knowledge into a dedicated memory model while keeping LLM parameters unchanged, and the paper evaluates it on three benchmarks: BrowseComp-Plus, NarrativeQA, and MuSiQue.

#RAG#Memory#Tools#MeMo

why featured

HKR-H/K/R pass, but the post gives only the mechanism and 3 benchmark names; no metrics, code, or model scale are disclosed. Interesting research signal, below featured threshold.

editor take

MeMo reports 3 benchmarks and corpus-size-independent retrieval cost; I’m waiting on update cost and latency, both absent here.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Informationally Compressive Anonymization for Privacy-Preserving Supervised Machine Learning

The paper introduces ICA and the VEIL architecture, which encode raw inputs inside a trusted Source Environment into low-dimensional, task-aligned latent vectors; the abstract says the method avoids noise budgets, gradient clipping, and encryption at inference time.

#Fine-tuning#Inference-opt#Safety#arXiv

why featured

HKR-K/R pass: the paper offers a concrete privacy mechanism and a non-degradation claim. As a single arXiv item with no disclosed metrics or artifact in the summary, it stays in the lower interesting band.

editor take

ICA compresses raw inputs into latent vectors; no benchmarks disclosed, so treat “zero reconstruction” as a theorem setup.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Towards Autonomous Mechanistic Reasoning in Virtual Cells

The paper introduces VCR-Agent, a multi-agent framework that uses mechanistic action graphs, biologically grounded retrieval, and verifier-based filtering to generate and validate virtual-cell explanations, and releases VC-TRACES from the Tahoe-100M atlas.

#Agent#RAG#Reasoning#VCR-Agent

why featured

HKR-H and HKR-K pass via the virtual-cell agent hook and concrete framework/dataset details. HKR-R is weak because the biology setting is niche and no product or general-agent impact is disclosed.

editor take

VCR-Agent derives VC-TRACES from Tahoe-100M; size is undisclosed, so the verifier’s hallucination filter is the bet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

BudgetMem structures runtime agent memory as modules with Low, Mid, and High budget tiers, then trains a compact reinforcement-learning router to choose tiers per query; across LoCoMo, LongMemEval, and HotpotQA, it beats strong baselines in the high-budget setting and improves accuracy-cost frontiers under tighter budgets.

#Agent#Memory#Reasoning#BudgetMem

why featured

HKR-K/R pass: agent-memory cost control is useful, and the post names the RL routing mechanism plus benchmarks. No accuracy/cost numbers or artifact are disclosed, so it stays in the 60-71 all band.

editor take

BudgetMem tests three memory-budget tiers on 3 benchmarks; I like the setup, but RSS gives no cost numbers.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Statistical Guarantees in the Search for Less Discriminatory Algorithms

The paper formalizes LDA search as an optimal stopping problem and proposes an adaptive stopping algorithm that gives a high-probability upper bound on disparate-impact gains from continued retraining.

#Safety#Benchmarking#arXiv#Black et al.

why featured

HKR-K is clear: optimal stopping plus high-probability bounds. HKR-R lands on fairness-audit cost, but the academic framing and narrow scope keep it below the 72 featured line.

editor take

Black et al. turn LDA search into a stopping rule; dataset sizes aren’t disclosed, but legal audit teams will want this certificate.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Diffusion Models Memorize in Training -- and Generalize in Inference

The paper analyzes diffusion models’ denoising objective and finds a validation-training generalization gap most pronounced at intermediate noise levels, while inference does not reproduce training samples because sampling trajectories move far from the noisy training-sample distribution used during training.

#Multimodal#Benchmarking#Interpretability#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv paper on training dynamics with no product, artifact, or cross-source debate. It fits the 60–71 band for useful but non-featured research.

editor take

Diffusion overfits hardest at intermediate noise; the wild part is model error blocks recall once sampling leaves training-noise support.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DelTA: Discriminative Token Credit Assignment for Verifiable Reward Reinforcement Learning

DelTA reweights a self-normalized RLVR surrogate with discriminative token coefficients, and on seven math benchmarks it improves over the strongest same-scale baselines by 3.26 points on Qwen3-8B-Base and 2.62 points on Qwen3-14B-Base.

#Reasoning#Fine-tuning#Alignment#Qwen

why featured

HKR-K is strong and HKR-R is moderate: concrete RLVR mechanism and Qwen3 math gains, but it is still an arXiv training paper with no product impact or cross-source cluster.

editor take

DelTA adds 3.26 points on Qwen3-8B across 7 math benchmarks; I like that it attacks RLVR’s formatting-token noise directly.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Code Generation by Differential Test Time Scaling

DiffCodeGen selects code-generation candidates with coverage-guided differential analysis, without public tests or extra LLM calls for selection. The paper evaluates it across 4 large language models and reports consistent gains over baselines, with competitive or better performance than state-of-the-art test-time scaling methods while using fewer time and token resources.

#Code#Inference-opt#Agent#DiffCodeGen

why featured

HKR-H/K/R pass, but the body gives only the mechanism and a 4-model evaluation, not gains, datasets, or artifacts. A single arXiv codegen method fits the 60–71 band.

editor take

DiffCodeGen selects candidates across 4 LLMs without extra LLM calls; code TTS needs execution traces, not more sampler spam.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

Cormac Cureton and Narges Armanfard propose TabPFN-MT for multi-target tabular in-context learning, evaluating it on 344 datasets with fewer than 1,000 samples on average and reducing inference for T tasks from O(T) to O(1) forward passes.

#Reasoning#Inference-opt#Cormac Cureton#Narges Armanfard

why featured

HKR-H and HKR-K pass: TabPFN-MT gives a 344-dataset setup and an O(T)-to-O(1) multitask inference claim. The tabular small-sample focus narrows HKR-R, keeping it in the 60–71 research-signal band.

editor take

TabPFN-MT cuts T-task inference to O(1). For small tabular data, PFNs still look cleaner than general LLMs.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Research paper introduces Spectral Souping framework for online preference alignment

The paper introduces Spectral Souping, which learns an offline basis of specialized policies and merges outputs or parameters at inference time, adapting LLMs to individual preferences without costly online retraining against tailored preference rewards.

#Alignment#Fine-tuning#Inference-opt#Research release

why featured

HKR-H/K/R pass, but the post gives only the mechanism summary; authors, benchmark numbers, scale, and code are not disclosed. This is useful alignment research, not a same-day industry story.

editor take

Spectral Souping uses a two-phase offline-basis, inference-merge setup for preference alignment. No gains disclosed; “universal spectral representation” needs proof beyond soup demos.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

The paper evaluates unlearned language models on TOFU multiple-choice QA and finds that models retain low calibration error around ECE 0.04 after unlearning, while forget-split accuracy drops and attribution with Integrated Gradients and Local Mutual Information shows greater reliance on correlation-based tokens.

#Alignment#Interpretability#Benchmarking#arXiv

why featured

HKR-H/K/R pass on a concrete evaluation paradox, ECE number, and safety-eval relevance. Single arXiv paper, narrow scope, no artifact or broad discussion, so it stays in all.

editor take

Unlearned models keep ECE≈0.04 while losing TOFU forget accuracy; calibration as unlearning reliability is a bad proxy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

The paper proposes two training-free Table QA prompting frameworks, TGN and PIP, and evaluates 17 LLMs against 6 baselines on TableBench and FeTaQa; TGN scores 3.8 points above the strongest TableBench baseline, while PIP reports SOTA over ReAct and Chain-of-Thought on FeTaQa.

#Reasoning#Tools#Fine-tuning#arXiv

why featured

HKR-K and HKR-R pass: the paper gives training-free mechanisms, a 17-model evaluation, and a +3.8-point gain. HKR-H fails because the angle is dry, so this stays in the 60–71 research-signal band.

editor take

TGN gains 3.8 on TableBench; training-free is not cheap until token cost and table-size limits are disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

The paper benchmarks four local LLMs for EHR GraphRAG on one 8 GB VRAM consumer GPU; Llama 3.1 builds the richest graph with 1,172 entities, Qwen 2.5 scores highest on answer quality at 3.3/5, and 3.8B Phi-4-mini fails the pipeline because of structured-output errors.

#RAG#Benchmarking#Reasoning#Microsoft

why featured

HKR-K and HKR-R are clear: 8GB VRAM, four local models, and structured-output failure are testable details. The healthcare EHR niche limits reach, so it stays in the 60–71 band.

editor take

Four local models ran 8GB EHR GraphRAG; Qwen 2.5 tops out at 3.3/5. Offline compliance, not cheap reliability.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→roto 2.0: The Robot Tactile Olympiad

roto 2.0 introduces a GPU-parallelized tactile RL benchmark across four robotic morphologies with 16–24 DOF; its blind agents use only proprioception and tactile sensing, without state information or distillation, and achieve 13 Baoding ball rotations in 10 seconds.

#Robotics#Benchmarking#roto#Research release

why featured

This arXiv robotics benchmark clears HKR-H/K with concrete mechanisms and numbers. It lacks HKR-R beyond a narrow robotics RL crowd and has no product or platform impact, so it stays in the 60-71 band.

editor take

roto 2.0 spans four 16–24 DOF hands and hits 13 rotations in 10s; tactile RL finally gets a usable arena.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Winfree Oscillatory Neural Network

The paper proposes WONN, a neural architecture using generalized Winfree dynamics to evolve representations on a torus, and evaluates it on CIFAR, ImageNet, Maze-hard, and Sudoku, with Maze-hard reaching 80.1% accuracy using 1% of prior state-of-the-art parameters.

#Reasoning#Vision#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: 1% parameters and 80.1% on Maze-hard create a real hook, with Winfree torus dynamics and multiple benchmarks disclosed. A single niche arXiv architecture paper stays below featured.

editor take

WONN hits 80.1% on Maze-hard with 1% parameters; ImageNet details aren’t disclosed, so I’d file it under strong inductive bias.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

The paper introduces CSA, a deployment-side wrapper for RLVR-trained local LLMs, and reports pathwise validity plus non-refusing deployment across 480 specialist streams, 160 adversarial shift streams, and 10,300 online LoRA rounds.

#Safety#Fine-tuning#Alignment#Research release

why featured

HKR-K/R pass: CSA plus three concrete test scales, tied to RLVR deployment risk. HKR-H is weak, and the conformal-risk framing is specialist, so this stays in all.

editor take

CSA stayed non-refusing across 480 specialist streams, 160 shift streams, and 10,300 LoRA rounds; regulated local LLMs need wrappers like this.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Frontier simulates modern LLM inference serving with disaggregated execution and stateful workloads, achieving below 4% average throughput error on a 16-H800 GPU testbed and reducing end-to-end latency error from 44.9% to 6.4% under co-location.

#Inference-opt#Agent#Reasoning#Frontier

why featured

HKR-H/K/R all pass, but this is an arXiv inference-simulation paper for infra readers, with no major-lab release or adoption signal, so it stays in 60–71.

editor take

Frontier gets under 4% throughput error on 16 H800s; inference simulation is finally catching up to PDD, AFD, and agent workloads.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Pseudo-Formalization for Automatic Proof Verification

The paper proposes Pseudo-Formalization and Block Verification, decomposing natural-language proofs into modules with premises, conclusions, and proofs, then evaluating PF+BV on 2 olympiad and research-level math benchmarks where it outperforms LLM-as-judge baselines on error-finding precision and recall.

#Reasoning#Benchmarking#ArxivMathGradingBench#Research release

why featured

HKR-K is clear: a new verification mechanism plus 2 benchmark comparisons. HKR-R is present around evaluation reliability, but the arXiv-only summary lacks effect sizes, dataset details, and reproducibility conditions, so it stays in all.

editor take

PF+BV beats LLM-as-judge on 2 math-verification benchmarks; I buy weak formalization before forced Lean translation.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder models temporal distances between frame pairs from robot demonstrations and human videos, supplying step-wise proxy rewards that reached near-perfect success on 9 of 10 Meta-World tasks with 200,000 environment interactions per task.

#Robotics#Vision#Fine-tuning#TimeRewarder

why featured

HKR-H and HKR-K pass: passive-video reward learning is a clear hook, with 10 tasks, 200k interactions, and 9 near-full-success results. As a single robotics paper with limited product immediacy, it stays in the 60–71 band.

editor take

TimeRewarder nearly solved 9/10 Meta-World tasks at 200k interactions each; I don’t buy real-robot generalization from this benchmark yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench abstracts real planning scenarios into more than 30 task types, subtasks, constraint families, and difficulty factors, then uses a constraint-driven synthesis pipeline to generate verifiable data for LLM evaluation and reinforcement-learning training.

#Reasoning#Benchmarking#Fine-tuning#PlanningBench

why featured

HKR-K and HKR-R pass: the paper offers a concrete verifiable planning-data pipeline. It stays in the 60–71 band because it is a single arXiv paper with no disclosed model gains or adoption.

editor take

PlanningBench spans 30+ planning factors; I buy the verifiable synthesis angle, but model roster and gains are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

The paper introduces a PPO fine-tuning framework for code-generating LLMs, using execution-aware rewards for syntax, correctness, style, security, and simulator executability; it reports a 19% absolute pass@1 gain on MBPP and a 51% reduction in execution failures on RoboEval.

#Code#Fine-tuning#Robotics#Research release

why featured

HKR-K has concrete benchmark deltas, and HKR-R maps to code generation and robotics reliability. But this is a single arXiv paper with an academic title and no disclosed artifact or major-lab signal, so it stays in all.

editor take

PPO lifts MBPP pass@1 by 19% and cuts RoboEval failures 51%; I want the post-toy-benchmark survival rate.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis

Praxium detects cloud microservice anomalies with over 0.97 macro-F1 across 75 trials and four synthetic anomaly types, then uses causal impact analysis over recent software installations to infer the root cause under increasingly short package-install intervals.

#Agent#Reasoning#Praxium#PraxiPaaS

why featured

HKR-K is strong on metrics and attribution mechanism; HKR-R hits cloud incident triage. HKR-H is weak, and synthetic anomalies keep it in the 60–71 all band.

editor take

Praxium hits >0.97 macro-F1 across 75 synthetic trials; the SRE sell is causal install attribution under compressed rollout intervals.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

arXiv:2605.20189 proposes SOLAR, an autonomous agent for test-time adaptation using parameter-level meta-learning, multi-level reinforcement learning, and a knowledge base of valid modification strategies; the abstract says experiments cover six reasoning categories—commonsense, math, medical, coding, social, and logical—but does not disclose scores.

#Agent#Reasoning#Memory#SOLAR

why featured

HKR-H and HKR-K pass: the lifelong-agent hook is clear, and the summary gives three mechanisms plus six task categories. No scores, code, or production-replacement evidence are disclosed, so it stays in the 60–71 band.

editor take

SOLAR spans 6 reasoning categories, but scores are undisclosed; treating weights as an RL environment is clever, lifelong learning is unproven.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

The paper tests LLaMA-3.1 8B across 8-, 4-, 3-, and 2-bit quantization on 82 interview transcripts, and proposes multi-pass prompt verification to reduce hallucinations and unstable qualitative-analysis outputs under low-bit settings.

#Inference-opt#Alignment#LLaMA#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete setup and a verification mechanism, and it speaks to low-cost deployment reliability. The use case is narrow and HKR-H fails, so it stays in the 60–71 band.

editor take

LLaMA-3.1 8B ran on 82 transcripts; 8-bit holds up, 4-bit needs verification, 2/3-bit is risky for coding.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

ClaimDiff-RL uses reference-conditioned atomic visual claim differences as the reward unit for caption RL, separating hallucinated claims from omitted salient facts; on a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, it improves the hallucination–missing-fact balance and surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions.

#Vision#Multimodal#Fine-tuning#ClaimDiff-RL

why featured

HKR-K/R pass: the paper offers a concrete reward mechanism and a 160-image diagnostic set for VLM hallucinations. As a single arXiv paper with limited scale, it stays in the 60–71 band.

editor take

ClaimDiff-RL rewards atomic visual claims; 160 diagnostic images is thin, but splitting hallucination from omission beats scalar caption scores.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Chronicle trains a 324M-parameter decoder-only Transformer from scratch for text and time series, uses one shared backbone, and reports evaluation on 19 NLU tasks, 24 UCR/UEA datasets, and Time-MMD multimodal forecasting.

#Multimodal#Benchmarking#Paul Quinlan#Gemma

why featured

HKR-H and HKR-K pass: a 324M decoder-only backbone spans text and time series with concrete benchmark settings. It remains a single arXiv research prototype without product impact or major-lab pull, so it stays in the 60–71 band.

editor take

Chronicle runs text and time series through one 324M backbone; I buy the setup, not the implicit scratch-training victory lap.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

AGPO uses group-level statistics to control clipping and decoding temperature, and Qwen2.5-14B trained with AGPO beats PPO and GRPO on nine English and Chinese math/STEM benchmarks under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH; gains also transfer to Llama-3-8B and Gemma-2-9B.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-K is solid: AGPO gives a testable mechanism and Qwen2.5-14B math results. HKR-R is narrow to reasoning fine-tuners, and this is a single arXiv paper, so it stays in the 60–71 band.

editor take

AGPO beats PPO/GRPO on 9 math/STEM benchmarks; I buy the mechanism, not broad claims from 67.3% GSM8K.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds

CROWDio runs partitioned ONNX inference for a 67M-parameter DistilBERT across five Android handsets, holding peak per-device RSS at 43±2 MB and reducing streaming-concurrency batch latency by 34% versus barrier synchronization.

#Inference-opt#CROWDio#DistilBERT#Android

why featured

HKR-K is strong and HKR-H has a concrete Android-crowd hook, but the item is a narrow systems-optimization arXiv paper with limited practitioner reach, so it stays in all.

editor take

CROWDio runs 67M DistilBERT on five Androids at 43±2MB RSS; neat, but the comms bill is still underexplained.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Spectral Structural Distortion Reveals Redundant Neurons in Neural Networks

The paper proposes a spectral structural importance score that compares neuron-level graphs before and after each layer transformation to identify redundant units; pruning recomputes scores after each structural change, performs no intermediate parameter updates, and applies one recovery fine-tuning stage after reaching the target reduction.

#Inference-opt#Interpretability#Fine-tuning#Research release

why featured

HKR-K and HKR-R pass via a concrete pruning mechanism and cost angle. HKR-H is weak, and the article lacks compression ratios, accuracy loss, or benchmark details, so it stays in all.

editor take

This scores pruning via graph-spectral distortion, but reports no compression ratios or baselines here; for now, it's an interpretable-pruning candidate.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

CAdam reframes 3DGS densification as signal verification and reduces Gaussian counts by 85%-97% across SDS, ISM, and VFDS objectives while preserving comparable perceptual quality in optimization-based generative distillation.

#Vision#Inference-opt#Research release

why featured

HKR-K is strong via the 85%-97% Gaussian reduction and a clear densification mechanism; HKR-H comes from the efficiency contrast. The SDS/ISM/VFDS context is narrow, so it stays in all rather than featured.

editor take

CAdam cuts Gaussian counts 85%-97% under SDS, ISM, VFDS; the SNR gate is the sane part—stop densifying noise.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

HeadQ corrects KV-cache quantization error with a low-rank residual side code in a learned query basis; across six models, K-only WikiText-103 decode experiments with dense values removed about 84%–94% of excess perplexity on the strongest 2-bit rows.

#Inference-opt#Benchmarking#HeadQ#Pythia

why featured

HKR-K is strong and HKR-R is limited: the paper gives a concrete correction mechanism and 84%-94% reductions, but HKR-H is weak and there is no product release or broad sourcing. This stays in the 60-71 all band.

editor take

HeadQ removes 84–94% excess perplexity in six-model 2-bit K-only decode; KV quantization needs logits, not MSE worship.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

Sangwoo Park and eight coauthors propose SELFCI, a complementary self-distillation framework that uses two independent reverse KL objectives over feedback-derived teacher distributions to separate task-relevant information preservation from minimal disclosure; the 28-page paper includes 16 figures, but the abstract does not disclose exact improvement numbers over GRPO or other baselines.

#Alignment#Safety#Agent#Sangwoo Park

why featured

HKR-K/R pass: SELFCI adds a two-teacher reverse-KL self-distillation setup for retention vs disclosure. HKR-H is weak, and the excerpt gives no gains or reproducible result, so it stays in all.

editor take

SELFCI splits privacy and utility with two reverse-KL losses; no gains disclosed, so the GRPO-beating claim stays soft.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

The paper benchmarks 13 Python-accessible JPEG decode paths on five matched 16-vCPU Google Cloud CPUs, using the 50,000-image ImageNet validation split to compare single-thread throughput with PyTorch DataLoader throughput at 0, 2, 4, and 8 workers.

#Benchmarking#Tools#PyTorch#TensorFlow

why featured

HKR-H/K/R pass, but this is an ML data-pipeline benchmark with impact mainly for vision-training engineers. No model release, product capability, or industry-level event, so it stays in the 60–71 band.

editor take

13 JPEG paths across five 16-vCPU CPUs show single-thread decode charts mislead PyTorch DataLoader choices.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Multi-step likelihood-ratio correction for reinforcement learning with verifiable rewards

The paper proposes NFPO, which augments PPO for RLVR with the cumulative likelihood ratio over the next N-1 tokens, and reports consistent gains on reasoning benchmarks while the snippet does not disclose benchmark names or exact scores.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K is clear: NFPO adds a concrete likelihood-ratio correction to PPO. HKR-R applies for RLVR stability, but no gain size, model scale, or reproduction detail is disclosed, so this stays all.

editor take

NFPO adds next-N-1-token likelihood ratios to PPO; scores aren’t disclosed, so RLVR is back to bias-variance bookkeeping.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

DODOCO instruments five MoE checkpoints across five sequence-mixer designs and an EP scan from 4 to 32 H100 ranks. The study finds EP scaling changes each architecture’s per-expert max/mean token ratio by at most 5%, while mock tokens overestimate routing Gini by up to 2.35× and create a batch-size trend that disappears with real text.

#Inference-opt#Benchmarking#DeepSeek#Qwen

why featured

HKR-K/R pass: it gives test scale and Gini-bias numbers, and MoE serving cost matters to infra teams. HKR-H is weak; EP dispatch diagnostics are narrow, so this stays in the 60-71 all band.

editor take

DODOCO tests 5 MoEs on 4–32 H100 EP ranks; mock tokens inflate routing Gini 2.35×, so many AlltoAll papers rest on sand.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

The paper introduces TaperNorm, which tapers RMSNorm or LayerNorm into sample-independent linear or affine maps, and reports up to 1.18× higher throughput after folding in a KV-cached autoregressive decoding benchmark.

#Inference-opt#Research release

why featured

HKR-K is clear: TaperNorm tapers RMSNorm/LayerNorm into a sample-independent mapping and reports 1.18x throughput. HKR-R is cost-relevant, but HKR-H is weak and the feed only gives abstract-level detail, so it stays in 60–71.

editor take

TaperNorm reports 1.18× decoding throughput; I trust foldable inference knives more than another architecture slogan.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→rePIRL: Learn PRM with Inverse RL for LLM Reasoning

rePIRL trains process reward models with a dual learning loop that alternately updates the policy and PRM, and the arXiv abstract says it outperforms existing methods on standardized math and coding reasoning datasets.

#Reasoning#Alignment#Fine-tuning#arXiv

why featured

HKR-K and HKR-R pass: the paper gives a concrete inverse-RL PRM training mechanism tied to reasoning reliability. No gains, model scale, or reproducibility details are disclosed, so it stays in the 60–71 research band.

editor take

rePIRL alternates policy and PRM updates; no scores in the snippet, so treat the generalization claim as unverified.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization

OCTOPUS compresses transformer KV cache with joint quantization of rotated coordinate triplets; across text, video, and audio, it matches or beats prior rotation codecs at every reported bit width and metric, and a fused Triton path reconstructs keys online without materializing uncompressed keys.

#Inference-opt#Multimodal#OCTOPUS#TurboQuant

why featured

HKR-K/R pass: KV-cache quantization is practical for serving cost and memory. HKR-H fails because the angle is a dense arXiv method, and the snippet lacks speedup, memory numbers, code status, or adoption, so this stays in all.

editor take

OCTOPUS beats TurboQuant at every reported bit width; KV-cache compression is now fighting over geometry, not just kernels.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs

TimeSRL uses a two-stage LLM pipeline to convert passive-sensing time series into natural-language abstractions before predicting mental-health outcomes, and under a leave-one-dataset-out protocol it reduces anxiety MAE by 3.1–44.1% versus non-LLM and LLM baselines.

#Reasoning#Fine-tuning#Benchmarking#TimeSRL

why featured

HKR-H and HKR-K pass: the cross-modal framing is fresh, and LOSO plus MAE reductions are concrete. It remains a vertical arXiv paper with no artifact or deployment, so it stays in the 60–71 band.

editor take

TimeSRL cuts anxiety MAE 3.1–44.1% under LOSO; I buy the semantic bottleneck, but mental-health cohorts leak easily.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

JUDO uses normal images as visual domain context to segment defect regions, injects domain knowledge through SFT, and guides reasoning with GRPO rewards; the paper reports higher MMAD benchmark performance than Qwen2.5-VL-7B and GPT-4o, while the RSS abstract does not disclose exact scores.

#Multimodal#Vision#Reasoning#JUDO

why featured

HKR-H/K pass: JUDO uses normal images as visual context plus SFT and GRPO, claiming MMAD gains over Qwen2.5-VL-7B and GPT-4o. Single arXiv paper and niche inspection scope keep it in all.

editor take

JUDO beats GPT-4o on MMAD, exact scores undisclosed; in industrial QA, normal-image context still trumps generic vision muscle.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Mechanisms of Misgeneralization in Physical Sequence Modeling

The paper defines physical misgeneralization: generated trajectories look plausible individually, but their aggregate physical-quantity distribution is wrong, and it uses a data deviation kernel to predict mass shifts across synthetic, maze-navigation, and double-pendulum tasks.

#Robotics#Benchmarking#Research release

why featured

HKR-K passes via a named failure mode and prediction mechanism; HKR-H passes on the plausible-trajectory/wrong-distribution hook. The arXiv paper is niche research, not a product, safety incident, or broad tooling release, so it stays in 60–71.

editor take

Physical misgeneralization names a nasty failure: valid-looking trajectories, shifted energy distributions. For robotics, that beats another planner score.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

NeuroQA introduces 56,953 image-grounded 3D brain MRI QA pairs from 12,977 subjects across 12 datasets, and the best zero-shot vision-language model reaches 47.5% accuracy on closed-format public test items, below the 49.4% text-only majority-template floor.

#Vision#Multimodal#Benchmarking#NeuroQA

why featured

HKR-H/K pass: the dataset scale and 47.5% zero-shot result are concrete. Scope is narrow medical 3D MRI benchmarking, with no product or major-model release, so it stays in the 60–71 research-signal band.

editor take

NeuroQA has 56,953 3D MRI QAs; best zero-shot hits 47.5%, below the 49.4% text-only majority floor.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Compute Only Once: UG-Separation for Efficient Large Recommendation Models

ByteDance presents UG-Sep for TokenMixer-based recommendation models, reusing user-side computation through separated user and item flows, then adding W8A16 weight-only quantization; online A/B tests across Douyin Feed, Hongguo Feed, Chuanshanjia Ads, and Qianchuan Ads report up to 20% lower inference latency without adverse business-metric changes.

#Inference-opt#ByteDance#Douyin#TokenMixer

why featured

HKR-K/R pass via a concrete mechanism and online A/B latency number. HKR-H is weak because UG-Separation for TokenMixer is vertical infra research, with no product or open-source hook for a broader AI audience.

editor take

UG-Sep cuts online A/B latency up to 20%; TokenMixer recommenders finally get reusable user-side compute across ads and feeds.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

CompilerKV compiles KV-retention correction tables offline from a calibration corpus and reaches compressed SOTA on four backbones under a 512-token budget, beating the strongest prefill-only baseline by 1.67 points on average with a 95% CI of [+1.08,+2.37].

#Inference-opt#CompilerKV#LongBench#SnapKV

why featured

HKR-K/R pass: 512-token budget, four backbones, and +1.67 avg over the strongest prefill-only baseline. HKR-H fails on a narrow arXiv title; no deployment or open-source hook, so it stays in 60–71.

editor take

CompilerKV beats the best prefill-only baseline by 1.67 at 512 tokens; 0.4–0.8 cross-model loss makes online SnapKV-style estimation look shaky.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→How Many Human Survey Respondents Is a Large Language Model Worth? An Uncertainty Quantification Perspective

The paper proposes a framework that converts LLM-simulated survey responses into confidence sets for human population parameters and adaptively selects the simulation sample size; the abstract does not disclose specific model names, dataset counts, or coverage numbers.

#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the title has a sharp hook and the paper offers confidence sets plus adaptive sample sizing. Missing models, datasets, coverage rates, and respondent-equivalence numbers keep it in all.

editor take

This frames LLM survey simulation as coverage control; no model names or rates disclosed, so stop treating 10k synthetic answers as sample size.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

The paper proposes ProxyCoT, a training framework that generates chain-of-thought traces on proxy contexts via reinforcement learning or teacher distillation, then grounds them in full long contexts with supervised fine-tuning; the abstract says it outperforms strong baselines across datasets with lower computational overhead.

#Reasoning#Fine-tuning#Research release

why featured

HKR-K is clear via the ProxyCoT training mechanism, and HKR-R hits long-context cost concerns. The post does not disclose scores, dataset names, cost reduction, or code, so it stays in the 60–71 research-release band.

editor take

ProxyCoT trains CoT on proxy contexts, then SFTs full contexts; 10M-token windows still fail at retrieval-conditioned reasoning.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

The paper introduces PolyNeXt, replacing ReLU, GELU, and softmax in MLPs, convolutions, and attention with Hadamard-product polynomial modules, and reports matching or exceeding activation-based MetaFormer models on ImageNet classification, ADE20K segmentation, and out-of-distribution robustness.

#Vision#Benchmarking#MetaFormer#PolyNeXt

why featured

HKR-H/K pass: PolyNeXt has a counterintuitive activation-free vision design and tests on ImageNet, ADE20K, and OOD robustness. HKR-R is weak; no deployment, cost, open-weight, or flagship-model impact is disclosed.

editor take

PolyNeXt swaps ReLU, GELU, and softmax for Hadamard products; I buy the direction, but scores are undisclosed here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

DASH searches hybrid attention architectures on Qwen2.5-3B-Instruct using 12.3 million tokens per run and finishes in about 20 minutes on a single RTX Pro 6000 GPU.

#Inference-opt#Reasoning#Benchmarking#Qwen

why featured

HKR-H and HKR-K pass: the title has a one-GPU minutes-level search hook, and the post gives hardware/token conditions. Still, architecture search is specialist research, below featured threshold.

editor take

DASH searches Qwen2.5-3B with 12.3M tokens in 20 minutes; Jet-Nemotron’s 200B-token search bar just got embarrassing.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Mitigating Label Bias with Interpretable Rubric Embeddings

The paper proposes rubric embeddings to replace black-box embeddings with expert-defined criteria, evaluates them on a new dataset of applications to a large master's program, and reports reduced group disparities plus improved cohort quality measures under biased-label conditions.

#Embedding#Interpretability#Alignment#Research release

why featured

HKR-K and HKR-R pass: the paper offers a concrete mechanism and admissions-data test, with fairness relevance. Single arXiv source and no disclosed effect sizes keep it in the 60–71 band.

editor take

Rubric embeddings reduce disparities on master's admissions data; sample size is undisclosed, so interpretability is no bias waiver.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

DEL trains numerical prediction with digit conditional probability and binary cross-entropy, and reports higher overall accuracy and numerical-distance results across seven mathematical reasoning benchmarks and four LLM families: CodeLlama, Mistral, DeepSeek, and Qwen-2.5.

#Reasoning#Code#Fine-tuning#CodeLlama

why featured

HKR-K/R pass: the mechanism and evaluation setup are concrete, and LLM numeracy is a real practitioner pain. This is still a single arXiv method paper with no major model release, product impact, or cross-source cluster, so it stays in 60–71.

editor take

DEL wins on 7 math benchmarks and 4 model families; I want stress tests on long decimals and unit conversions.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

Bryce Hinkley and Peyman Najafirad introduce Residual Paving, a routed residual editing method that cuts edit-prompt refusal on the Gemma-3-4B-IT held-out split from 88.6% to 4.0%, while harmful keep-side refusal remains below the frozen baseline at 65.3% versus 81.6%.

#Alignment#Safety#Interpretability#Bryce Hinkley

why featured

HKR-K and HKR-R pass: the paper gives testable refusal metrics and a concrete safety trade-off. Single arXiv paper, high jargon, and no product impact keep it in the 60–71 band.

editor take

Residual Paving cuts Gemma edit refusal to 4.0%, but harmful refusal drops to 65.3%; the router fix still bleeds safety.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TRAM: Test-Time Risk Adaptation with Mixture of Agents

TRAM reuses a fixed library of risk-neutral policies at test time, scores each source policy by target reward and occupancy-based deployment risk, and reduces deployment risk without parameter updates in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting.

#Agent#Alignment#Safety#TRAM

why featured

HKR-K/R pass: the mechanism and test settings are concrete, and the safety angle matters for agent deployment. HKR-H is weak; no major-lab or discussion signal, so it stays in the 60–71 research-release band.

editor take

TRAM mixes fixed policies with zero test-time updates; I buy the engineering, but source-hull mismatch is the deployment bill.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

DISC uses a hypernetwork to generate task-specific visuomotor policy parameters from instructions, outperforming entangled baselines on LIBERO-90, Meta-World, and a real-world benchmark with identical visual contexts; the authors say it also surpasses pretrained π0 without external pretraining data, and the code is available on GitHub.

#Robotics#Vision#Fine-tuning#DISC

why featured

HKR-K passes: DISC gives a concrete instruction-to-policy-parameters mechanism, reports wins on LIBERO-90, Meta-World, and a real same-vision benchmark, and releases code. No quantified gains or broad product impact, so it stays in 60–71.

editor take

DISC compiles instructions into full policy weights; wins on LIBERO-90 and Meta-World, but the π0 claim needs replication.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

AVSD trains Qwen3-8B and Qwen3-4B with multi-view privileged self-distillation on AIME24, AIME25, and HMMT25, separating cross-view consensus from view-specific residuals and improving Avg@8 over the strongest baselines by 3.1% and 2.2%, while Qwen3-8B gains 2.4% on Codeforces and LiveCodeBench v6.

#Reasoning#Code#Fine-tuning#Qwen

why featured

HKR-K is clear: multi-view self-distillation reports 3.1%/2.2% gains on AIME24/AIME25/HMMT25. HKR-R is present for small-model training costs, but HKR-H is weak and the story stays in the 60–71 research band.

editor take

AVSD adds 3.1% Avg@8 on Qwen3-8B; gating privileged-view residuals is a cleaner bet than single-teacher distillation.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Effective Model Pruning: Measure the Redundancy of Model Components

The paper proposes Effective Model Pruning, which computes Neff from importance-score distributions via the inverse Simpson index and removes the N-Neff lowest-scoring components; experiments cover MLPs, CNNs, Transformers, LLMs, KAN, and criteria including weight magnitude, attention score, and image pixels.

#Inference-opt#Benchmarking#Research release

why featured

HKR-K is clear: EMP gives a reproducible pruning rule across MLP, CNN, Transformer, LLM, and KAN. HKR-R comes from cost compression; HKR-H is weak, so a single arXiv method paper stays in 60–71.

editor take

EMP sets pruning count via inverse Simpson index; it spans 5 architectures, but LLM size and compression ratios are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

The paper proposes TSRL for deepfake detection, modeling training as an MDP where a PPO Tutor assigns each sample loss a continuous 0-1 weight using visual features, EMA loss, and forgetting counts.

#Vision#Agent#Safety#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete training mechanism and touches deepfake safety. No metrics, code detail, or production-replacement claim are disclosed, so it stays in the 60-71 research band.

editor take

TSRL uses PPO to assign 0–1 loss weights; without cross-dataset metrics, this smells like curriculum overfitting.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→CASCADE Conformal Prediction for Two-Stage Clinical Decision Support

CASCADE propagates epistemic uncertainty from a screening classifier into a downstream dose-change regressor, using Venn-Abers multi-probabilistic uncertainty to scale conformal intervals and producing 38.9% narrower intervals than standard conformal baselines for confident Parkinson's Disease patients.

#Reasoning#Safety#CASCADE#Research release

why featured

HKR-K is strong via the uncertainty-transfer mechanism and 38.9% interval reduction; HKR-R is limited to safety-minded practitioners. The clinical conformal-prediction niche lacks product or platform impact, so this stays in all.

editor take

CASCADE narrows confident PD dose intervals by 38.9%; I buy the mechanism if coverage isn't hidden behind averages.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

The paper argues that language pretraining gives time-series training a reusable manifold; a linear probe decodes realistic trajectories from frozen LLM states without paired supervision, while projected-space retrieval yields competitive forecasts and finetuning behaves as low-dimensional alignment.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the cross-modal transfer angle is novel, and the frozen-state linear-probe claim is testable. Impact stays paper-level, with no product, code, or benchmark traction, so it sits in 60-71.

editor take

The paper claims frozen LLM states linearly decode time series. Models and benchmarks are undisclosed, so treat it as mechanism, not capability.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Dynamic Shapley Computation

The paper introduces D-Shap, which represents Shapley data valuation as a player-by-task matrix, updates new task valuations in milliseconds, and reduces new-player update cost by up to three orders of magnitude while matching full recomputation quality across tested models.

#Fine-tuning#Benchmarking#Research release

why featured

HKR-K is solid: D-Shap has a concrete matrix mechanism plus millisecond updates and up to 1000x cost reduction. HKR-H and HKR-R are weak; no hard-exclusion trigger, so it fits the 60–71 research-signal band.

editor take

D-Shap makes Shapley updates millisecond-level via a player-by-task matrix; the bet lives or dies on locality holding in real data.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Consistently Informative Soft-Label Temperature for Knowledge Distillation

The paper proposes CIST, assigning separate sample-wise adaptive temperatures to teacher and student models, reweighting the distillation objective by teacher confidence and student learning difficulty, and reporting consistent gains over standard KD and strong baselines on vision and language distillation tasks with negligible computational overhead.

#Fine-tuning#Inference-opt#arXiv#Research release

why featured

HKR-K passes on a concrete distillation mechanism, and HKR-R passes on deployment-cost relevance. No results, model scale, or artifact are disclosed, so this stays in the 60–71 arXiv-method band.

editor take

CIST gives teacher and student separate sample-wise temperatures; gains are undisclosed, but fixed-temperature KD deserves this cut.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→ECUAS_n: A family of metrics for evaluating uncertainty-augmented systems

The paper proposes the ECUAS_n metric family for UA systems that output predictions and uncertainty scores, using proper scoring rules and a parameter n that controls the trade-off between incorrect prediction costs and imperfect uncertainty costs under application-specific decision settings.

#Benchmarking#Safety#Research release#Benchmark

why featured

HKR-K passes: ECUAS_n gives a concrete metric mechanism for uncertainty-augmented systems. HKR-H and HKR-R are weak, and the feed only gives abstract-level detail with no deployment or tooling impact.

editor take

ECUAS_n scores predictions and uncertainty with proper scoring rules; I buy the direction, but choosing n is the trap.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

The paper proposes a Learning-to-Defer framework that routes extractive QA queries to specialized experts, provides theoretical guarantees for optimal deferral, and evaluates reliability and computational cost on SQuADv1, SQuADv2, and TriviaQA; the abstract does not disclose exact overhead-reduction percentages or model counts.

#Reasoning#Inference-opt#Research release#Benchmark

why featured

HKR-K passes for a concrete allocation mechanism and benchmark setup; HKR-R passes on cost/reliability for query routing. HKR-H is weak, and the extractive-QA research scope keeps it in the 60-71 band.

editor take

Learning-to-Defer tests 3 QA sets, but gives no overhead cut; I’d worry about tail-query routing outside benchmarks.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→The Economics of AI Inference: Inflation Dynamics, Welfare Costs, and Optimal Monetary Policy under the Inference-Cost Phillips Curve

The paper introduces an Inference-Cost Phillips Curve that adds AI inference marginal costs to a New Keynesian Phillips curve, then estimates an empirical slope of 0.087 using U.S. monthly data from 2022M01 to 2026M04.

#Inference-opt#Research release

why featured

HKR-H and HKR-K pass: it links inference cost to the Phillips Curve and reports a 0.087 slope from 2022M01-2026M04 US data. HKR-R is weak because macro policy modeling sits far from product and engineering practice.

editor take

Inference cost enters a Phillips curve with slope 0.087; the macro leap is bold, but identification has to survive first.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment

The paper proposes CLAIR for federated LoRA fine-tuning, using structured low-rank plus block-sparse decomposition to recover the shared LoRA subspace and detect contaminated clients under heterogeneous client conditions.

#Fine-tuning#Alignment#Research release

why featured

HKR-K and HKR-R pass: the mechanism is concrete and tied to private fine-tuning risk. HKR-H fails, and this is a single arXiv paper without production replacement or large-scale deployment evidence.

editor take

CLAIR detects contaminated clients in federated LoRA; the experiment is only text-copying, far from real instruction tuning.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

The paper introduces PRISM, a data selection method that weights target examples using the current model’s preferences, builds a preference-aware target representation, and scores candidate training samples by alignment; the abstract says experiments across model families and scales improved efficient fine-tuning and safety-oriented SFT repair, but it does not disclose datasets, model names, or exact gains.

#Fine-tuning#Alignment#Safety#PRISM

why featured

HKR-K and HKR-R pass: PRISM offers a testable data-selection mechanism tied to fine-tuning efficiency and preference alignment. HKR-H is weak, and this is a single arXiv method paper without code or production evidence.

editor take

PRISM weights targets by current-model preference; datasets and gains are undisclosed, so I’d treat it as a testable SFT data-selection trick.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→PULSE achieves state-of-the-art results on non-stationary time series forecasting

PULSE uses a Disentangle-Evolve-Simulate framework for non-stationary time series forecasting, combines phase-anchored disentanglement, a Phase Router, and Statistic-Aware Mixup, and reports state-of-the-art or competitive results with a simple MLP backbone across 12 real-world benchmarks.

#Reasoning#Benchmarking#PULSE#Research release

why featured

HKR-K passes with a concrete framework, 12 benchmarks, and open code. HKR-H and HKR-R are weak because this is a specialized forecasting paper without a product or industry-conflict hook, so it fits the 60–71 band.

editor take

PULSE hits near-SOTA on 12 benchmarks; I buy small MLP plus phase bias over another Transformer flex.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→When AI Gets It Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

An arXiv paper evaluates AI-assisted medication decision systems using controlled simulated scenarios covering drug interactions and dosage decisions; the post does not disclose the number of scenarios, model names, or quantitative failure rates.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-H and HKR-R pass on high-stakes medication safety, but HKR-K is weak: no model names, sample size, or result numbers are disclosed. This stays in the interesting research band.

editor take

This paper tests AI medication systems, but scenario count and model names are undisclosed; useful failure taxonomy, weak benchmark.

HKR breakdown

hook ✓knowledge —resonance ✓

→ open source

SCORE

H1·K0·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency

CALIPER tests whether post-drift data is sufficient for retraining using single-pass weighted local regression, and across four domains, three learner families, and two detectors it matches or exceeds the best fixed retraining window with low per-update time and memory.

#Benchmarking#CALIPER#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the retraining trigger is concrete, with a named method and test matrix. HKR-R is weak because this is niche concept-drift research, so it stays in the 60–71 “interesting” band.

editor take

CALIPER gates retraining data with one-pass local regression; across 4 domains and 3 learner families, it beats fixed windows.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

REPVLM quantifies epistemic uncertainty with negative log-density on the hyperspherical manifold of VLM embeddings, and the abstract says it achieves near-perfect correlation with prediction error, but the post does not disclose the correlation coefficient or evaluation setup.

#Vision#Multimodal#Benchmarking#REPVLM

why featured

HKR-K/R pass: the mechanism is clear and relevant to VLM reliability. HKR-H is weak, with abstract-level detail only; correlation coefficient, datasets, and reproduction details are not disclosed.

editor take

REPVLM uses hyperspherical negative log-density for uncertainty; “near-perfect correlation” lacks coefficients, so I don’t buy it yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

The paper shows across 11 conditions and 0.82M to 85M-parameter models that weight decay separates memorization, developmental grokking, and collapse, with a memorization-to-development boundary at λc=0.0158.

#Interpretability#Benchmarking#Research release

why featured

HKR-H/K pass: the paper offers a concrete diagnostic angle and testable numbers. HKR-R is weak, and the training-dynamics focus keeps it in all below featured.

editor take

Across 11 conditions, λc=0.0158 is useful; don’t launder modular-arithmetic grokking into language-model claims.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

The paper evaluates multiple demographic fairness metrics in face recognition and introduces the Fairness Disagreement Index to measure cross-metric inconsistency; the abstract says disagreements remain high across thresholds and model configurations, while the RSS snippet does not disclose dataset names or exact numeric results.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-H/K/R all pass, but this is a single arXiv fairness-evaluation paper. It offers a metric and experiment result, not a production replacement or major model update, so it stays in the 60–71 band.

editor take

The paper adds FDI for fairness-metric disagreement, but gives no datasets or numbers; single-metric fairness claims look weak.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Quant.npu: Enabling Efficient Mobile NPU Inference for On-Device LLMs via Fully Static Quantization

Quant.npu adapts on-device LLM inference to mobile NPU constraints with integer-only fully static quantization, using learnable quantization parameters, rotation matrices, a two-stage quantization pipeline, and sensitivity-guided mixed precision; experiments on real-world mobile NPUs report accuracy comparable to state-of-the-art PTQ methods and up to 15.1% lower inference latency.

#Inference-opt#Quant.npu#Research release

why featured

HKR-K is solid with a concrete mechanism and 15.1% latency figure; HKR-R applies to on-device deployment pain. HKR-H is weak, and NPU quantization is niche, so it stays in 60–71.

editor take

Quant.npu cuts real mobile NPU latency by 15.1%; I care if it survives long context, but the abstract omits that.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Can Conversational XAI Improve User Performance? An Experimental Study

The researchers tested conversational XAI against Q&A-based assistance with 42 participants; both treatment groups significantly outperformed the model, but the preliminary results showed no performance difference between assistance types and only modest engagement.

#Interpretability#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives a 42-person experiment and a testable “no difference between aid types” result for XAI design. HKR-H is weak, and the small sample keeps it in the mid all band.

editor take

With 42 participants, conversational XAI failed to beat Q&A help; don’t sell a chat wrapper as performance gain yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

The authors co-trained one self-driving car and 12 pedestrians with MAPPO; in 500 evaluation episodes, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35% goals and 33% collisions for the best rule-based baseline.

#Agent#Robotics#Safety#Prakash Aryan

why featured

HKR-K/R pass: the paper gives test settings and baseline numbers, and AV safety has practitioner pull. It remains an arXiv research item with no product or code impact disclosed, so it stays in all.

editor take

MAPPO trains 1 car and 12 pedestrians, yet 500 runs still hit 14% collisions; I’d call this a stress-test generator, not safety.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

The paper proposes CAML, which treats each active-learning round as a meta-learning task, uses the current labeled set for adaptation and the newly queried batch for generalization evaluation, and reports minority-group accuracy gains of up to 27.8% on Dominoes, 29.9% on Waterbirds, 14.3% on SpuCo, and 24.0% on CivilComments.

#Fine-tuning#Alignment#Benchmarking#Research release

why featured

HKR-K is strong via a named method and four gain figures; HKR-R lands for robustness practitioners. HKR-H is weak, and this remains an academic arXiv paper, so it sits in the interesting-not-featured band.

editor take

CAML turns active-learning rounds into meta-learning tasks and reports up to 29.9% minority accuracy gain; I buy the mechanism, not the missing cost details.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax implements a fully vectorized Riichi Mahjong environment in JAX and reaches 2 million steps per second under no-red rules and 1 million steps per second under red rules on eight NVIDIA A100 GPUs.

#Agent#Robotics#Benchmarking#Mahjax

why featured

HKR-H comes from the Mahjong+GPU+JAX angle, and HKR-K has concrete 8xA100 throughput numbers. HKR-R is weak because it lacks product impact or broad developer-tool relevance, so it stays in the 60-71 band.

editor take

Mahjax hits 2M steps/sec on 8 A100s; Riichi RL needs tougher self-play evaluation more than another fast env.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

The paper proposes TextReg, a regularization framework for text-space prompt optimization, and reports out-of-distribution accuracy gains up to 11.8% over TextGrad and 16.5% over REVOLVE across multiple reasoning benchmarks.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives a regularized text-space optimization method plus two gain figures, and prompt overfitting is practitioner-relevant. HKR-H is weak, and a single arXiv paper stays in the interesting band.

editor take

TextReg beats TextGrad by 11.8% OOD on reasoning benchmarks. Prompt optimization needed this anti-bloat regularizer badly.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

JoyAI-Image v2 proposes a unified MLLM plus MMDiT architecture for visual understanding, text-to-image generation, and instruction-guided editing, with training signals for long-text rendering, spatial grounding, and general and spatial edits; the abstract says it reaches state-of-the-art or highly competitive results across multiple benchmarks, but does not disclose exact scores.

#Multimodal#Vision#Reasoning#JoyAI-Image

why featured

HKR-H/K pass: the unified multimodal setup and MLLM+MMDiT mechanism add some signal. HKR-R fails because the post gives no scores, artifact, or major-lab context, so this stays in the normal research-release band.

editor take

JoyAI-Image v2 couples MLLM with MMDiT, but scores are undisclosed; treat the SOTA claim as unverified.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Reviving Error Correction in Modern Deep Time-Series Forecasting

The paper proposes UEC-STD, an architecture-agnostic error corrector that plugs into existing time-series forecasters without retraining and tests it across 4 backbones and 10 datasets.

#Inference-opt#arXiv#Research release#Open source

why featured

HKR-K passes via a concrete mechanism and evaluation scale. HKR-H and HKR-R are weak; with no major lab, product tie-in, or cross-source discussion, this sits in the all research stream.

editor take

UEC-STD plugs into 4 backbones and 10 datasets without retraining; I buy the angle—fixing inference drift beats swapping models.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Towards the Anonymization of Language Modeling

The paper proposes anonymization methods for BERT-style MLM and GPT-style CLM specialization, evaluates them on one medical dataset against baselines, and targets memorization of direct and indirect identifiers; the RSS snippet does not disclose concrete privacy or utility metrics.

#Fine-tuning#Safety#Research release

why featured

This is a privacy/safety research item with HKR-K/R: it covers anonymization training for BERT-style MLM and GPT-style CLM on medical-data memorization. HKR-H is weak, and metrics are not disclosed, so it stays in all.

editor take

The paper tests one medical dataset but discloses no metrics; without attack success rates, I don't buy the privacy claim.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Research Paper Explores Military Object Detection in Multi-Spectrum Drone Imagery

The paper builds four KIIT-MiTA-derived datasets—Gray Scale, Thermal Vision, Night Vision, and Obscura Vision—and trains YOLOv11-small to detect military objects in drone imagery under low-visibility, heat-based, and nighttime conditions.

#Vision#KIIT-MiTA#YOLOv11-small#Research release

why featured

HKR-H/K/R all pass via the drone-defense hook and concrete dataset/model setup, but this is a single arXiv vision paper with no disclosed metric leap, artifact, or product impact, so it stays in all.

editor take

The paper trains YOLOv11-small on 4 KIIT-MiTA variants; mAP is undisclosed, so don’t buy the military-detection claim yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Conditioning Gaussian Processes on Almost Anything

The paper recasts Gaussian Processes as a class of linear diffusion models, recovers standard GP conditioning exactly in the linear-Gaussian setting, and supports conditioning statements with point-wise likelihood evaluation, including nonlinear physics and natural language via large language models.

#Reasoning#Research release

why featured

HKR-H/K pass: the title has a curiosity hook and the summary gives the GP↔linear-diffusion mechanism. HKR-R misses; this is a niche cs.LG theory paper, so it stays in 60–71.

editor take

They cast GP conditioning as a diffusion ODE; exact for linear-Gaussian, but LLM-based language likelihoods deserve skepticism.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→FAIR-Pruner: A Flexible Framework for Automatic Layer-Wise Pruning via Tolerance of Difference

FAIR-Pruner uses Tolerance of Difference to assign non-uniform layer-wise pruning depths from two within-layer rankings, and evaluates accuracy–compression trade-offs on CIFAR-10, ImageNet, five vision architectures, and prune-only routed-expert Qwen1.5-MoE-A2.7B-Chat experiments.

#Vision#Inference-opt#Qwen#Research release

why featured

HKR-K and HKR-R pass: it offers a named pruning mechanism and Qwen/MoE experiments tied to inference cost. HKR-H is weak, and a single arXiv compression paper fits the 60–71 band.

editor take

FAIR-Pruner allocates per-layer pruning via ToD; Qwen1.5-MoE-A2.7B is prune-only, so don't infer LLM serving wins yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions

The paper compares multipass and stochastic label sampling on CIFAR-10H, finding hard-label delivery outperforms soft-label training when only a small number of annotations per example is available, while both hard-label methods match soft-label training when full annotator distributions are available.

#Fine-tuning#Benchmarking#CIFAR-10H#SVHN

why featured

HKR-H and HKR-K pass: the paper offers a counterintuitive CIFAR-10H result under sparse annotation. HKR-R is weak because the impact stays within labeling/training methodology, so it fits the 60–71 all band.

editor take

Hard labels beat soft labels with few CIFAR-10H votes; multipass looks practical, but the OOD evidence is only descriptive.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

PREFINE adapts DPO to trajectory-level preferences in continuous control, using a small set of low-cost and high-cost trajectories to fine-tune a reward-optimized RL policy while reducing constraint violations and catastrophic failures by over 60%.

#Fine-tuning#Alignment#Safety#PREFINE

why featured

HKR-K and HKR-R pass: the item has a concrete mechanism and a >60% result, and it touches safety alignment. HKR-H is weak, and the single arXiv RL-control scope keeps it in the 60–71 band.

editor take

PREFINE ports DPO to continuous-control trajectories and cuts violations over 60%; its counterfactual sampling may hide the real safety cost.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

Yang Liu and coauthors propose CP-MoE, a continual learning framework that uses a transient expert, consistency-preserving routing bias, and transient expert-guided regularization to reduce forgetting in LLM/VLM MoE models; the paper reports validation on SuperNI and VQA v2, but the arXiv abstract does not disclose exact scores.

#Fine-tuning#Multimodal#RAG#Yang Liu

why featured

HKR-K passes through a concrete mechanism and benchmarks; HKR-H is weak and HKR-R is limited by missing scores, code, and deployment evidence. This is a normal arXiv methods paper, so it stays in all.

editor take

CP-MoE claims SOTA on SuperNI and VQA v2, but no scores are disclosed; I don’t buy anti-forgetting from abstracts.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Learning-to-Defer with Expert-Conditional Advice

The paper proposes Learning-to-Defer with advice, models expert and advice as a composite action space, proves H-consistency and an excess-risk transfer bound, and reports gains over standard Learning-to-Defer across tabular, language, and multimodal tasks.

#RAG#Tools#Multimodal#Research release

why featured

HKR-K passes via a concrete LTD-with-advice mechanism and tests on 3 task types. HKR-H/R are weak, with no major lab, artifact, or production-replacement claim, so it stays in the lower research-release band.

editor take

Composite expert-advice actions beat standard deferral on 3 task types; the useful bit is proving split routing/advice heads inconsistent.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

The authors introduce SpectralEarth-FM and SpectralEarth-MM, pairing HSI from three spaceborne sensors with Sentinel and Landsat data, then pretraining on about 2 million locations, 25 million georeferenced patches, and over 40 TB of data.

#Multimodal#Vision#Benchmarking#SpectralEarth-FM

why featured

HKR-K passes on concrete scale and multimodal pretraining setup. HKR-H and HKR-R are weak because the story is a niche Earth-observation foundation-model paper, so it stays in all.

editor take

SpectralEarth-MM hits 40TB and 25M patches; I buy HSI fusion, but PANGAEA-only SOTA leaves generalization under-proven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

SMoA partitions each layer into multiple aligned spectral blocks and adds a Hadamard-modulated low-rank branch to every diagonal block, reporting higher average performance than LoRA and LoRA-style baselines under a lower-budget setting across multiple tasks.

#Fine-tuning#SMoA#LoRA#Research release

why featured

HKR-K/R pass: SMoA adds spectral blocks plus Hadamard-modulated low-rank branches for cheaper PEFT. HKR-H fails and the feed gives no parameter or benchmark numbers, so this stays all.

editor take

SMoA claims better average scores than LoRA via spectral blocks plus Hadamard branches; no models, tasks, or parameter counts disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Paper proposes proactive client selection method for fair and efficient federated learning

The paper proposes a proactive federated-learning client selection framework that optimizes fixed-size client sets before training, using mutual information from differentially private contingency tables and simulated annealing over a Potential Federation Loss objective; experiments on four benchmarks report faster convergence, better fairness, and higher accuracy than uniform sampling, including when adaptive aggregation or sampling baselines are used.

#Fine-tuning#Safety#Benchmarking#Research release

why featured

HKR-K passes with a concrete mechanism and 4 benchmarks. HKR-H/R are weak: this is niche federated-learning optimization, far from mainstream model or agent product news, so it stays in all.

editor take

DP contingency tables preselect clients and beat uniform sampling on 4 benchmarks; I worry PFL tuning eats the saved rounds.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

The authors train a vision transformer on LARDv2 for runway keypoint regression, decompose per-patch embeddings with K-SVD sparse dictionary learning, and propose OOMS runtime monitoring to provide representation-level evidence requested by EASA learning-assurance guidance.

#Vision#Interpretability#Safety#EASA

why featured

HKR-K and HKR-R pass: the mechanism and certification target are concrete, but this is a narrow aviation-safety interpretability paper with high reading cost and no broad product or agent impact.

editor take

LARDv2 runway regression gets OOMS monitoring; K-SVD content/style splits are qualitative, still far from aviation-grade evidence.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Complementing Reinforcement Learning with SFT Through Logit Averaging in LLM Post-Training

The paper introduces logit averaging between a frozen SFT reference policy and a trainable policy inside GRPO, without KL regularization or a critic, and evaluates it on MATH, cn-k12, and MMLU against canonical KL-regularized GRPO.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-K passes via a concrete GRPO post-training mechanism and MATH/cn-k12/MMLU comparisons. HKR-H and HKR-R are weak because this is a specialist paper, so it stays in all.

editor take

Logit-averaging frozen SFT with trainable GRPO matches or beats KL-GRPO on 3 benchmarks; small trick, very reproducible-looking.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity

NeighborDiv detects graph anomalies using the variance of inter-neighbor feature similarities, replacing node-to-neighbor consistency with a neighbor-to-neighbor diversity signal, and reports relative gains over the second-best baseline of 10.25% average AUC and 17.78% average AP under SDIT, plus 6.89% AUC and 9.58% AP under UMDT.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with a training-free mechanism and two benchmark gains. HKR-H/R are weak because graph anomaly detection is narrow research, so this fits all rather than featured.

editor take

NeighborDiv reports +10.25% AUC and +17.78% AP under SDIT; I buy the training-free angle, but “zero volatility” needs dataset receipts.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Why Ask One When You Can Ask k? Learning-to-Defer to the Top-k Experts

The paper introduces Top-k Learning-to-Defer, assigning each query to the k most cost-effective experts, and proposes a k-independent consistent surrogate loss that supports one-stage and two-stage settings.

#Reasoning#Benchmarking#Research release

why featured

This is a method-heavy ML paper: HKR-H comes from the top-k expert deferral setup, and HKR-K from the consistent surrogate-loss claim. No experiment numbers, code, or production use case are disclosed, so it stays in the 60–71 band.

editor take

Top-k L2D routes each query to k experts; experiment scale is undisclosed, so the k-independent loss is the claim to test.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

Dynamic TMoE detects distribution shifts with MMD and dynamically adds or prunes heterogeneous experts, while a temporal memory router uses recurrent states and an anomaly repository; experiments on nine benchmarks report 10.4% lower MSE and 7.8% lower MAE without test-time updates.

#Reasoning#Memory#Dynamic TMoE#arXiv

why featured

HKR-K passes via a concrete mechanism and benchmark numbers. HKR-H and HKR-R are weak because this is niche time-series forecasting research without product or agent impact, so it stays in the lower all band.

editor take

Dynamic TMoE cuts MSE 10.4% on 9 benchmarks. I buy drift-aware experts, but latency and expert-growth caps are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

The paper trains an LLM-based survey framework on 1972–2021 General Social Surveys data to predict missing opinions; retrodiction performs strongly under cross-validation, while prediction of entirely unasked opinions remains modest.

#Embedding#Benchmarking#arXiv#General Social Surveys

why featured

HKR-H/K/R all pass, but this is a methods paper, not a product launch, major-lab move, or reusable tool release. It fits the 60–71 band for interesting but not featured research.

editor take

This uses 1972–2021 GSS to fill missing opinions; unasked-opinion prediction stays modest, so don’t sell retrodiction as simulation.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

The paper proposes ASR, a training-free channel-wise post-pruning repair method; on ResNet-50 at 90% sparsity, it recovers 55.6% CIFAR-10 top-1 accuracy, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration.

#Vision#Inference-opt#ASR#ResNet-50

why featured

HKR-K lands with a concrete pruning-repair result, and HKR-R is modest through inference-cost relevance. HKR-H misses because the title is specialist; no product, open-source release, or major-lab signal.

editor take

ASR lifts ResNet-50 at 90% sparsity to 55.6% on CIFAR-10; training-free pruning repair needs less BatchNorm folklore.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

The paper proposes Musical Attention, a Transformer attention mechanism that adds metadata including bar numbers, keys, signatures, and tempos, representing each note with five events plus three metadata elements to model correlations across eight features.

#Audio#Research release

why featured

HKR-K passes because the paper specifies a music-aware attention mechanism with bar, key, meter, tempo, and eight features. HKR-H and HKR-R are weak: no product angle, major lab, or practitioner-level tension.

editor take

Musical Attention uses 8 note features, but no metrics are disclosed; I don’t buy “significantly reduces repetition” without code and listening tests.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs

FedCoE maintains multiple global experts and a shared gating network for federated learning, reaching 78.00% average global accuracy, 89.32% personalized accuracy, and 77.27% cold-start accuracy without local fine-tuning.

#Fine-tuning#Inference-opt#FedCoE#Research release

why featured

HKR-K passes because the paper gives a concrete mechanism and benchmark numbers. HKR-H/R are weak: this is a niche federated-learning method with no product rollout, open-source artifact, or broad industry trigger.

editor take

FedCoE reports 78.00% global and 89.32% personalized accuracy; federated MoE looks sane, but datasets and baselines aren't disclosed here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

NaP-Control uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, replacing gradient-based test-time guidance for physics-based whole-body character control; the arXiv abstract says experiments show higher success rates and faster inference across diverse tasks, but the RSS snippet does not disclose exact metrics or benchmark settings.

#Robotics#Inference-opt#Research release

why featured

HKR-K passes on the latent-noise RL mechanism, but success-rate and speed gains lack numbers. The character-control angle is narrow, so it lands in the low 60s as a standard research release.

editor take

NaP-Control predicts diffusion noise with RL and skips test-time guidance; no success or latency numbers, so I don’t buy “fast” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Xikai Zhang and eight coauthors propose FBOS-RL, a feedback-driven reinforcement learning framework that uses environment feedback for exploration enhancement and combines two objectives, EPA and ECC, to improve training efficiency over GRPO under the same number of rollouts.

#Reasoning#Alignment#Xikai Zhang#Yongzhi Li

why featured

HKR-K passes with a concrete mechanism and GRPO comparison condition. HKR-H is weak and HKR-R lacks disclosed effect size, code, or model impact, so this sits in the lower all band.

editor take

FBOS-RL adds EPA and ECC to GRPO sampling, but exact gains aren’t disclosed; I don’t buy the flywheel claim without same-rollout replication.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Differentially Private Model Merging

The paper proposes two post-processing methods, random selection and linear combination, to generate models for any target differential privacy parameter from existing models trained on the same dataset with different privacy-utility tradeoffs, without additional training.

#Fine-tuning#Safety#Research release

why featured

HKR-K passes: the paper names two post-processing mechanisms for DP model merging without extra training. HKR-H and HKR-R fail because this is a dry single arXiv item with unproven practitioner impact.

editor take

The paper merges existing DP models via random selection or linear combinations; useful trick, but the cost hides in pretraining multiple privacy tiers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Secure, Verifiable, and Scalable Multi-Client Data Sharing via Consensus-Based Privacy-Preserving Data Distribution

The paper proposes the CPPDD framework for secure multi-client data aggregation, using per-client affine masking and sequential consensus locking, and reports linear scaling to N=500 on MNIST-derived vectors with sub-millisecond per-client computation.

#Safety#CPPDD#Research release

why featured

HKR-K and HKR-R pass via concrete protocol details and privacy relevance, but HKR-H fails. A single arXiv paper on privacy-preserving aggregation lacks product pull, so it stays below featured.

editor take

CPPDD reports N=500 MNIST vectors and sub-ms clients; I don’t buy the N-1 collusion claim without disclosed baselines.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→The General Theory of Localization Methods

The paper proposes the localization method, a machine learning framework built on localization kernels and local means, and relates it to self-attention, kernel methods, MeanShift, Hopfield networks, LLE, denoising autoencoders, and Transformer construction via hierarchical local models.

#Reasoning#Research release

why featured

HKR-H and HKR-K pass, but this is a theory-heavy arXiv paper with only a unifying-framework claim disclosed; no experiments, code, or production payoff are given, so it stays in all.

editor take

This unifies 8 model families via localization kernels; no experiment numbers, so I’d file it as theory synthesis, not a new architecture.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

The paper proposes Linear-DPO for text-to-image preference optimization, using a unified reverse-time SDE objective for diffusion and flow-matching models and testing it on SD1.5, SDXL, and SD3-Medium against existing baselines.

#Alignment#Multimodal#Research release

why featured

HKR-K passes: new method, unifying mechanism, and tests on SD1.5, SDXL, and SD3-Medium. HKR-H/R are weak, and the item is an arXiv abstract-level paper, so it stays in all.

editor take

Linear-DPO tests SD1.5, SDXL, and SD3-Medium; the sharp claim is that sigmoid DPO mismatches image regression.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Study Compares Automated ICD Classification for Psychiatric Diagnoses Across NLP Approaches

The study evaluates automated ICD coding on 145,513 Spanish psychiatric descriptions, comparing BoW, TF-IDF, e5_large, BioLORD, and Llama-3-8B; end-to-end fine-tuned e5_large achieves the top micro-F1 score of 0.866 and outperforms classical text representations.

#Embedding#Fine-tuning#Benchmarking#e5_large

why featured

HKR-K passes with dataset size and micro-F1. HKR-H is weak because the angle reads like a routine medical NLP paper; HKR-R is limited without a product, open model, or broad industry deployment hook.

editor take

e5_large hits 0.866 micro-F1 on 145,513 Spanish psychiatry notes; Llama-3-8B losing here is a size-scaling warning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Closed-Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

The paper proposes AutoScale, a closed-loop data engine that uses Graph-RAE, Cluster-GA, and cluster-guided vector retrieval to optimize real-synthetic driving data mixtures, and reports higher NavSim performance than vanilla co-training and cross-domain baselines with fewer synthetic samples under constrained budgets.

#Robotics#Benchmarking#AutoScale#NavSim

why featured

HKR-K passes via the closed-loop data-mixture mechanism and NavSim condition. HKR-H/R are weak, and the post gives no exact lift or sample-saving rate, keeping it a niche research item.

editor take

AutoScale beats baselines on NavSim with fewer synthetic samples. No gains disclosed, so don’t crown a driving-data flywheel yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Sequential Data Augmentation for Generative Recommendation

The paper introduces GenPAS for generative recommendation, modeling data augmentation as stochastic sampling over input-target pairs with 3 bias-controlled steps, and evaluates it against existing strategies on benchmark and industrial datasets.

#Fine-tuning#Benchmarking#Snap Research#Research release

why featured

HKR-K passes via a named mechanism and test settings; HKR-H/R fail because the angle is narrow recommender-system research. No hard exclusion, but general AI-practitioner value stays in the 40–59 band.

editor take

GenPAS frames recsys augmentation as 3-step sampling. The useful part is treating sample construction as a first-order training knob.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Learning Incentive Structures for Cooperative Resilience in Multi-Agent Systems under Social Dilemmas

The paper proposes a multi-agent reinforcement learning framework that ranks trajectories with a resilience metric and infers reward functions, then evaluates three incentive structures in disrupted resource-sharing environments under social dilemmas.

#Agent#Reasoning#Research release

why featured

HKR-K passes via a concrete MARL mechanism and 3 incentive structures. HKR-H/R are weak: this is specialized multi-agent RL research, not a product or practice-shaping release.

editor take

The paper tests 3 incentive schemes; hybrid rewards reduce collapse, but RSS omits environment scale and baseline strength.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→A Mechanistic Study of Tabular Foundation Models

The paper analyzes tabular foundation models across classification and regression tasks, finding that different architectures converge in accuracy while using distinct similarity-based readouts; the authors validate these mechanisms with causal interventions, trace permutation invariances to removable positional parameters, and reproduce predicted failures using engineered perturbations plus hub and rank attacks.

#Interpretability#Benchmarking#arXiv#Research release

why featured

HKR-K passes: the paper reports tabular foundation-model readout mechanisms and reproducible intervention tests. HKR-H/R are weak; this is useful method signal, not a broad industry story.

editor take

Tabular FMs converge on accuracy but split in readout mechanics; causal interventions and hub/rank attacks expose failures leaderboards miss.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Geometry-Lite evaluates prompt-level safety probes across nine instruction-tuned backbones from 1.2B to 70B and seven safety benchmarks, mapping each layer’s final prompt-token representation to signed margins from centroid, local-neighborhood, and supervised linear-boundary readouts; the paper finds persistent boundary-position geometry drives pooled AUROC, while finite-difference drift adds only small recall-oriented corrections under shifted low-FPR thresholds.

#Safety#Interpretability#Benchmarking#Woo Seob Sim

why featured

HKR-K passes via concrete test setup and mechanism claims; HKR-H/R are weak because the angle is niche and highly technical. No hard exclusion applied, but accessibility keeps it in the lower research-signal band.

editor take

Geometry-Lite tests 9 models on 7 safety sets; I buy the punchline: safety signal looks positional, not layer-drift.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

TreeText-CTS converts irregular EHR trajectories into deterministic tree-path evidence units and reports the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces across three clinical prediction tasks, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface.

#Interpretability#Benchmarking#TreeText-CTS#PhysioNet

why featured

HKR-K passes with a concrete mechanism and AUPRC gains, but HKR-H and HKR-R are weak. The topic is niche clinical time-series modeling, so it stays in all rather than featured.

editor take

TreeText-CTS adds 6.0–9.7 AUPRC points on 3 EHR tasks; I trust tree-path evidence over free-form clinical summaries.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Time-Prompt: Integrated Heterogeneous Prompts for LLMs in Time Series Forecasting

The paper introduces Time-Prompt, a framework that combines learnable soft prompts, textual hard prompts, semantic-space embeddings, and cross-modal alignment for LLM-based time-series forecasting, with evaluations on 6 public datasets and 3 carbon-emission datasets.

#Fine-tuning#Multimodal#Embedding#Time-Prompt

why featured

HKR-K passes via concrete prompt components and evaluation on 6 public plus 3 carbon-emission datasets. HKR-H/R are weak: this is a routine arXiv methods paper with no deployment or production-replacement claim.

editor take

Time-Prompt tests 9 datasets; without SOTA deltas in the abstract, I file it as prompt-engineering incrementalism for LLM forecasting.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

The paper introduces Distribution-Aware Reward, an on-policy RL objective that scores multiple decoded samples with CRPS as an empirical predictive distribution, reporting a 6-point Spearman gain on KBSS and competitive MoleculeNet results using only SMILES strings.

#Reasoning#Fine-tuning#Benchmarking#arXiv

why featured

HKR-K passes via a concrete mechanism and KBSS number; HKR-H/R are weak because the angle is academic and narrow. No hard exclusion, but technical accessibility keeps it in the 40–59 low-value research band.

editor take

Distribution-Aware Reward trains multi-sample distributions with CRPS and gains 6 Spearman on KBSS; I like the move, but MoleculeNet splits matter.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

IMPACT uses an influence function to estimate each training sample’s effect, then uses influence scores to create realistic unseen time-series anomalies and repurpose high-influence samples for decontamination; the abstract reports tests across multiple OSAD settings and contamination rates, but does not disclose dataset counts, metric values, or baseline names in the RSS snippet.

#Benchmarking#Research release#Open source#Benchmark

why featured

HKR-K passes on a concrete mechanism: influence-function scores generate unseen anomalies, with OSAD settings, contamination rates, and code. HKR-H/R fail because the title is academic and the audience impact is narrow; no hard-exclusion rule triggered.

editor take

IMPACT generates unseen anomalies via influence scores; RSS omits datasets and metrics, so treat “SOTA” as unverified.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→A Rigorous, Tractable Measure of Model Complexity

The paper introduces a model-complexity measure based on gradient similarities across inputs, applies it to parametric models and kernel-based non-parametric models, and proves it generalizes mechanisms such as polynomial degree, Matérn length scale, kNN neighbor count, decision-tree split count, and random-forest tree count.

#Interpretability#Benchmarking#Research release

why featured

HKR-K passes because the paper gives a testable gradient-similarity complexity measure across kernels, kNN, trees, and forests. HKR-H and HKR-R are weak, so this stays in all below featured.

editor take

Gradient-similarity complexity spans five classic mechanisms; I want the LLM-scale run, not another elegant theorem zoo.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Towards Understanding Self-Pretraining for Sequence Classification

The paper replicates and ablates Amos et al. 2024 on self-pretraining, finding that the bottleneck is label supervision learning useful query-key Attention patterns from random initialization, while masked reconstruction detects Attention-score directions that supervised labels miss.

#Reasoning#Benchmarking#Amos et al.#Research release

why featured

HKR-K passes for a concrete SPT replication/ablation claim, but HKR-H and HKR-R are weak. The topic is narrow training theory with no product or engineering hook, so it stays in the low-value band.

editor take

SPT boosts LRA classification by learning query-key patterns from scratch; labels are blind where masked reconstruction sees signal.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Why Aggregate Accuracy Is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

The paper analyzes facial recognition systems in law enforcement and security, arguing that aggregate accuracy can hide subgroup FPR and FNR disparities; the RSS snippet does not disclose a specific dataset, benchmark, or numerical error rates.

#Vision#Safety#Benchmarking#Research release

why featured

HKR-R passes because law-enforcement face recognition carries safety and compliance stakes. HKR-H/K are weak: no dataset, error rates, or reproducible setup are disclosed, so this stays in the low-value research-summary band.

editor take

The paper flags subgroup FPR/FNR gaps but gives no dataset or error rates; correct claim, thin evidence.

HKR breakdown

hook —knowledge —resonance ✓

→ open source

SCORE

H0·K0·R1

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation

The paper proposes a unified formulation that splits representation learning into a task component and a constraint component, then tests how different tasks interact with causal constraints on CausalVerse.

#Reasoning#Benchmarking#CausalVerse#Research release

why featured

HKR-K passes via the unified formulation and CausalVerse test setup. HKR-H/R fail: no result numbers, artifact, or practical stake, so this stays a low-value research item.

editor take

CausalVerse shows causal constraints are task-dependent. Scores aren't disclosed; without a reproducible task-constraint matrix, this risks taxonomy cosplay.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→CIG: Exploration via Conditional Information Gain

The paper introduces CIG, an intrinsic reward that approximates trajectory-level information gain with a log-determinant objective over an ensemble disagreement kernel, and evaluates it against prior exploration methods on 12 MiniGrid and OGBench tasks under clean and stochastic-distractor settings.

#Reasoning#CIG#MiniGrid#OGBench

why featured

HKR-K passes through a concrete intrinsic-reward mechanism and 12-task evaluation. HKR-H/R miss: the title is a standard RL paper and the audience impact is narrow, so this stays in the 40–59 band.

editor take

CIG tests log-det ensemble disagreement on 12 tasks; I buy the idea, but short-rollout model-based setup limits extrapolation.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

FedKDNAS lets each client select a lightweight architecture under accuracy-resource constraints, and evaluations on six datasets against six FL baselines report up to 15% higher accuracy, about 28% lower client CPU usage, and up to 44x lower communication overhead under non-IID conditions.

#Fine-tuning#Inference-opt#Research release#Benchmark

why featured

HKR-K passes with concrete benchmark scale and resource gains. HKR-H/R are weak because this is niche federated-learning research with no product rollout or broad practitioner controversy.

editor take

FedKDNAS beats 6 FL baselines on 6 datasets; 15% accuracy and 44x comms gains hinge on per-client architectures.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Divide and Contrast: Learning Robust Temporal Features without Augmentation

Di-COT trains time-series representations by randomly partitioning each window into a small number of overlapping sub-blocks per iteration, uses a contrastive loss dependent on batch size and sub-block count rather than sequence length, and reports tests on six real-world datasets plus UCR and UEA benchmarks.

#Embedding#Benchmarking#Di-COT#UCR

why featured

HKR-K passes via a concrete training mechanism and benchmark scope. HKR-H/R are weak: this is a niche time-series representation paper with no product release, deployment claim, or reported performance number.

editor take

Di-COT removes sequence length from loss cost; six real datasets plus UCR/UEA is solid, but training-time gains lack numbers here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

The paper introduces SMFP, a one-step generative policy that maps Gaussian noise to actions via a MeanFlow transform, trains it with off-policy mirror descent and an entropy surrogate, and reports better results than Gaussian and generative baselines across seven MuJoCo benchmarks.

#Agent#Inference-opt#Benchmarking#MuJoCo

why featured

Triggers hard-exclusion-technical-accessibility: MeanFlow, entropic mirror descent, and MuJoCo need RL/optimization context. HKR-K passes on the 7-benchmark claim; HKR-H/R fail, so score is capped.

editor take

SMFP beats baselines on 7 MuJoCo tasks; one-step sampling is the hook, but I’d wait for code and ablations.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

The paper proposes CA-LIG, a framework that computes layer-wise Integrated Gradients inside each Transformer block and fuses them with class-specific attention gradients, with evaluations across BERT, XLM-R, AfroLM, and a Masked Autoencoder vision Transformer.

#Interpretability#Vision#Benchmarking#BERT

why featured

HKR-K passes because the article names a concrete CA-LIG mechanism and model coverage. HKR-H/R are weak, and no metrics or production impact are disclosed, so it stays in the low-value but non-noise band.

editor take

CA-LIG spans 4 Transformer families, but the snippet gives no metrics; “clearer explanations” needs code and faithfulness numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

The paper proposes FF-BPSN for target-oriented proactive dialogue path planning, using two transformer-based decoders for forward and backward planning, then evaluating it on DuRecDial and DuRecDial 2.0.

#Agent#Reasoning#arXiv#DuRecDial

why featured

HKR-K passes on a concrete planning mechanism and datasets, but HKR-H/R fail. This is narrow dialogue-planning research with no product tie-in, major-lab signal, or practitioner-facing experiment, so it sits in the 40-59 band.

editor take

FF-BPSN uses dual decoders for bidirectional planning; DuRecDial-only evals make the SOTA claim stay in dialogue routing, not agents.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

The paper proposes an AI-native 6G vision that uses one foundation model and collaborative multi-agent systems to unify network management; the abstract does not disclose experiments, datasets, or a deployment timeline.

#Agent#Multimodal#Research release

why featured

HKR-K passes on the proposed one-foundation-model plus multi-agent architecture; HKR-H/R are weak, and the body discloses no experiments, dataset, or deployment timeline.

editor take

One foundation model manages 6G networks; no experiments, datasets, or timeline disclosed, so this reads like roadmap staking.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for VLMs and Autonomous Agents

WildRoadBench evaluates VLM grounding and LLM-driven agents on the same professionally annotated UAV road-damage corpus, using per-class AP_50 under two protocols. The abstract says closed-source frontier models lead but leave over half the metric unused; the post does not disclose dataset size, model names, or the fixed interaction-budget value.

#Agent#Vision#Benchmarking#WildRoadBench

why featured

HKR-K passes via the two-track AP_50 setup, but HKR-H/R are weak. The abstract omits scale, model list, and interaction budget, so this stays in the 40–59 low-value band.

editor take

WildRoadBench tests VLMs and agents on identical UAV images; dataset size, model names, and budget stay undisclosed, so agent failures sting most.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

The paper proposes Tunable MAGMAX, a continual-learning model-merging framework that uses a preference vector to control how many elements are selected from each task vector and automatically constructs that vector from small amounts of target-environment data plus training-task datasets.

#Fine-tuning#Inference-opt#Benchmarking#MAGMAX

why featured

HKR-K passes for a concrete mechanism, but the post lacks experiment scale, benchmark gains, or reproducible conditions. The angle is too niche for HKR-H/R, so it stays in all.

editor take

Tunable MAGMAX controls per-task vector element counts with one preference vector. Benchmarks and sample sizes are undisclosed; deployment claims feel early.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

STM3 combines Multiscale Mamba, a Disentangled MoE framework, and an adaptive graph causal network for long-term spatio-temporal prediction, reports state-of-the-art results on 10 real-world benchmarks, and beats the second-best model on PEMSD8 by 7.1% MAE, 8.5% RMSE, and 15.9% MAPE.

#Benchmarking#STM3#Mamba#Research release

why featured

HKR-K passes via concrete mechanisms and PEMSD8 gains; HKR-H/R fail because this is a narrow spatio-temporal forecasting paper with little practitioner resonance.

editor take

STM3 claims SOTA on 10 benchmarks and -7.1% MAE on PEMSD8; long-sequence compute cost is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

The paper releases 2 open-source iris recognition algorithms with IREX-compliant C++ implementations, evaluates 4 methods under IREX X protocols, and reports tests across 8 academic iris benchmarks.

#Vision#Benchmarking#IREX#arXiv

why featured

HKR-K passes on concrete artifacts and benchmark counts; HKR-H/R are weak because iris-recognition evaluation is niche and far from mainstream AI product or model competition.

editor take

The paper opens 2 iris algorithms and an IREX C++ template; CRYPTS hit 1:N latency, so the win is entry friction.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→MoRe: Modular Representations for Continual Learning on Sequential Data

MoRe decomposes knowledge into two module levels, fundamental and specific, and tests the framework on synthetic benchmarks and real-world LLM activations; the abstract reports better plasticity-stability trade-offs but does not disclose metric values.

#Memory#Interpretability#MoRe#Research release

why featured

HKR-K passes via a modular representation mechanism and LLM-activation tests. HKR-H/R are weak, and metrics are not disclosed, so this stays in the low-value research band.

editor take

MoRe splits representations into fundamental/specific modules, but gives no metrics; using LLM activations beats another parameter-tuning CL recipe.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Augmented Analytics and Decision Quality: The Role of Trust among Non-Technical BI Users

The paper surveys 250 business professionals and uses PLS-SEM to analyze how augmented analytics, trust, BI adoption, and decision quality relate among non-technical BI users.

#Research release

why featured

HKR-K passes via the 250-person survey and PLS-SEM method. HKR-H/R are weak: this is academic BI-adoption work with no product mechanism, model capability, or industry shock.

editor take

The paper surveys 250 BI users; self-reports plus PLS-SEM don't prove decision quality, and trust may just mean compliance.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

The paper presents a two-stage low-light image enhancement framework using frozen algorithmic preprocessing and a compact depthwise-separable U-Net, reporting 3rd place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge; the abstract says it includes extended benchmarks and ablations but does not disclose parameter counts in the snippet.

#Vision#Inference-opt#Benchmarking#CVPR

why featured

HKR-K passes via the named method and NTIRE ranking; HKR-H/R fail because the angle is technical and far from model, agent, or product stakes. No hard exclusion, but it sits in the low-value research band.

editor take

This took 3rd at NTIRE 2026; parameter counts aren't disclosed, so the lightweight claim stays unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification

Brown Zaz and four coauthors propose Transductive Sharpening, a loss-level change that minimizes prediction entropy on unlabeled nodes while counterbalancing it on labeled nodes, and the 19-page arXiv paper reports node-classification gains across benchmarks with 4 figures and 17 tables.

#Benchmarking#Brown Zaz#Mar Gonzàlez I Català#Moshe Eliasof

why featured

HKR-K passes for a concrete mechanism and reported experiments. HKR-H/R fail because the story is narrow graph-learning research with no product, open-source tool, or industry impact hook, so it sits in the low-value band.

editor take

Transductive Sharpening changes only the loss, with 17 tables; I buy the angle, pending low-label-rate robustness.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

19d ago

arXiv · cs.LG· atomEN04:00 · 05·21

→Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

The paper evaluates ensemble RL trading strategies combining A2C, PPO, and SAC with SVM, decision trees, and logistic regression, comparing them against base RL models on cumulative returns, Sharpe ratio, Calmar ratio, and maximum drawdown; the RSS snippet does not disclose the dataset, backtest period, or exact return figures.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-K passes on method detail, but the post lacks dataset, return numbers, and reproducible conditions. The quant-finance angle sits far from core AI product or model-industry concerns, so it stays in the low-value band.

editor take

A2C/PPO/SAC get three classifiers; no dataset or returns disclosed, so don’t buy “consistently outperform” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

03:44

19d ago

HuggingFace Papers (takara mirror)· rssEN03:44 · 05·21

→Bounding-Box Trajectories Matter for Video Anomaly Detection

TrajVAD models multi-class bounding-box trajectories with normalizing flows; TrajVAD-T reaches 87.7% AP on ShanghaiTech without pose estimation, while TrajVAD-P adds pose features and reports 88.6% AUROC and 90.9% AP on ShanghaiTech.

#Vision#Benchmarking#TrajVAD#ShanghaiTech

why featured

HKR-K passes on a concrete method and benchmark numbers. HKR-H/R are weak because this is niche video-anomaly research with no product rollout, open-source artifact, or broad practitioner debate hook.

editor take

TrajVAD-P reports 90.9% AP on ShanghaiTech; box trajectories beating pose-heavy baselines is a useful slap at feature bloat.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

02:44

19d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN02:44 · 05·21

→MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

MAVEN converts raw videos into multi-task training data with CoT traces and labels over 5,300 traffic videos; after fine-tuning Cosmos-Reason2-8B, the private CCTV evaluation shows a 38.8-point MCQ accuracy gain over zero-shot and surpasses Gemini 2.5 Pro and 3.1 Flash.

#Agent#Vision#Reasoning#MAVEN

why featured

HKR-H/K/R all pass: the mechanism is a multi-stage agentic annotation pipeline, with 5,300+ videos and a +38.8-point result. This is useful video-reasoning research, not a major model launch, so it fits the 78-84 band.

editor take

MAVEN’s win is the annotation factory: 5,300 traffic videos, +38.8 MCQ points. The model story is secondary.

sharp

MAVEN’s sharp claim is not that Cosmos-Reason2-8B beat Gemini 2.5 Pro. It is that video reasoning data can be produced as a debuggable pipeline. MSTED fuses three caption levels, then feeds multi-task Q&A generation. Errors are traced through a taxonomy back to prompts or pipeline stages. That is more credible than another vague “video CoT” claim. I’m cautious on the +38.8 MCQ headline. The main result is on a private CCTV set, and the snippet gives no size, sampling method, or task mix. AccidentBench is the useful pressure test: CCTV-only training adds +10.7 MCQ points, then dashcam annotations and RL close the gap. This looks like structured synthetic-data engineering for narrow video domains, not a general VLM leap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

01:17

19d ago

HuggingFace Papers (takara mirror)· rssEN01:17 · 05·21

→Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

DEX trains multi-modality medical vision foundation models with expert pools, image-wise activation, and a group EMA director; its Medical Vision Universe benchmark contains over 4 million images across 10 modalities, and evaluations cover 26 downstream tasks.

#Multimodal#Vision#Benchmarking#DEX

why featured

HKR-K passes: the paper gives DEX mechanics, 4M images, 10 modalities, and 26 evaluated tasks. HKR-H/R are weak, so this is an informative but narrow research release with no hard exclusion.

editor take

DEX trains on 4M medical images across 10 modalities; I buy expert pools, but 26-task gains lack numbers here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

papers · 2026-05-21

more

feeds

admin