papers · 2026-05-11

▸ 296 papers · updated 3m ago


2026-05-11 · Mon

20:03

28d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 20:03 · 05·11

→Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

VLA reformulates linear-attention memory updates as online regularized least squares; at T=1,000 it reduces the state norm by 109× versus standard linear attention, maintains 62% accuracy at the per-head capacity boundary, and its Triton-fused kernel becomes faster than softmax attention at about 43,000 tokens.

#Reasoning #Inference-opt #Memory #Research release

why featured

HKR-K and HKR-R pass via a concrete attention mechanism and latency/stability numbers. HKR-H is weak, and this is a single technical paper without adoption evidence, so it stays near the featured floor.

editor take

VLA attacks the right failure mode: linear attention only matters if its memory state stays sane past toy context lengths.

sharp

VLA matters because it targets the part linear attention fans kept hand-waving: memory drift. Standard linear attention lets the Frobenius norm of the state grow with T; this paper reframes the update as online regularized least squares, uses Sherman-Morrison rank-1 updates, and reports a 109× lower state norm at T=1,000. The clean hook is stronger: normalized write directions make the recurrence Jacobian spectral norm exactly 1. That is a better contribution than another sparse-attention scheduling trick. The Triton-fused kernel crossing softmax latency around 43,000 tokens gives the paper a real systems claim. I’d still discount the product narrative until it survives real LLM training curves. Multi-query associative recall is not long-document reasoning, and 62% accuracy at the per-head capacity boundary is a research signal, not a deployable long-context story.
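The regularized-least-squares framing above can be illustrated with textbook recursive least squares. This is a minimal sketch, not the paper's kernel: the class name `RLSMemory`, the ridge parameter `lam`, and the pure-Python matrix helper are assumptions made for illustration, and the sketch leaves out the normalized write directions that the paper credits for the unit spectral norm.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m by column vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class RLSMemory:
    """Associative memory S (d_v x d_k) fit by online ridge regression.

    Hypothetical illustration: minimizes
    sum_t ||v_t - S k_t||^2 + lam * ||S||_F^2, one (key, value) pair at a time.
    """

    def __init__(self, d_k: int, d_v: int, lam: float = 1e-3) -> None:
        # P tracks (lam * I + sum_t k_t k_t^T)^{-1}; it starts at I / lam.
        self.P: Matrix = [[1.0 / lam if i == j else 0.0 for j in range(d_k)]
                          for i in range(d_k)]
        self.S: Matrix = [[0.0] * d_k for _ in range(d_v)]

    def write(self, k: Vector, v: Vector) -> None:
        Pk = matvec(self.P, k)  # P is symmetric, so k^T P equals (P k)^T
        denom = 1.0 + sum(a * b for a, b in zip(k, Pk))
        gain = [x / denom for x in Pk]
        # Correct the memory by the prediction error along the gain direction.
        err = [vi - si for vi, si in zip(v, matvec(self.S, k))]
        for i, e in enumerate(err):
            for j, g in enumerate(gain):
                self.S[i][j] += e * g
        # Sherman-Morrison rank-1 update: P -= (P k)(P k)^T / denom.
        for i, g in enumerate(gain):
            for j, pk in enumerate(Pk):
                self.P[i][j] -= g * pk

    def read(self, k: Vector) -> Vector:
        return matvec(self.S, k)


mem = RLSMemory(d_k=2, d_v=2, lam=1e-6)
mem.write([1.0, 0.0], [3.0, 4.0])
mem.write([0.0, 1.0], [-1.0, 2.0])
recalled = mem.read([1.0, 0.0])  # close to [3.0, 4.0]
```

Writing the same pair repeatedly leaves the state bounded here, whereas the purely additive update `S += v k^T` grows linearly with the number of writes. That contrast is the intuition behind the state-norm claim, though the 109× figure is the paper's own measurement, not something this toy reproduces.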

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:59

28d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 05·11

→ELF: Embedded Language Flows diffusion model research paper released

The paper proposes ELF, a diffusion language model that operates in continuous embedding space with continuous-time Flow Matching, then maps to discrete tokens only at the final step through a shared-weight network.

#Reasoning #Inference-opt #ELF #Research release

why featured

HKR-K passes: the paper introduces ELF as text generation in continuous embedding space with Flow Matching. No benchmark numbers, artifact, or deployment angle are disclosed, so it stays in the mid-low research band.

editor take

Only an arXiv cross-list title is visible—no authors, metrics, or method details. Treat ELF as a diffusion-LM research signal, not a model win.

sharp

Both sources are arXiv cross-listings, cs.AI and cs.LG, with the same title: “ELF: Embedded Language Flows.” That signals indexing breadth, not independent coverage or convergent validation. The title gives one hook: language flows in embedding space. The visible body gives no authors, benchmarks, sampling steps, perplexity, downstream evals, or latency numbers. I’d file ELF under the diffusion/flow language-modeling thread: avoid strict autoregressive token generation, operate over continuous embeddings, then map back to text. The catch is familiar. Since 2024, these systems often look elegant on formulation and then pay in discretization error or inference cost. Without a benchmark table, ELF is a research-direction ping, not evidence that flow models are ready to pressure Transformer LMs.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:58

28d ago

arXiv · cs.AI · atom · EN · 17:58 · 05·11

→Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

The paper introduces a neural exponential tilting framework for variational inference in Lévy-driven SDEs, using neural networks to reweight the Lévy measure and adding quadratic parametrization, conditional Gaussian representation for stable processes, and symmetry-aware Monte Carlo estimators.

#Reasoning #Research release

why featured

Triggers hard-exclusion-1: Lévy-process SDEs, variational inference, and Monte Carlo estimators need deep specialty. HKR-K passes on mechanism, but there is no general AI product or agent on-ramp.

editor take

Two arXiv feeds list neural tilting for Lévy-SDE variational inference; title only, no experiments or baselines, so treat as a method stub.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:55

28d ago

arXiv · cs.CL · atom · EN · 17:55 · 05·11

→Research Proposes Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

SLIM treats an agent’s active external skill set as a dynamic optimization variable, using leave-one-skill-out validation and three operations—retain, retire, expand—and reports a 7.1 percentage-point average gain over the best baselines on ALFWorld and SearchQA.

#Agent #Reasoning #Tools #SLIM

why featured

A standard arXiv agent paper: HKR-K has a mechanism and a +7.1pp result, while HKR-R touches tool/skill reliability pain. No major lab, open-source ecosystem, or production-replacement evidence, so it stays in the 60–71 band.

editor take

SLIM gains 7.1 points on ALFWorld and SearchQA; I buy skill retirement, but leave-one-out cost is undisclosed.
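Leave-one-skill-out validation, as described in the summary, can be sketched in a few lines. This is an illustrative assumption about the shape of the procedure, not SLIM's implementation: `evaluate` is a hypothetical validation scorer supplied by the caller, `min_gain` is an invented threshold, and the expand operation is left out.

```python
from typing import Callable, Dict, FrozenSet, List


def leave_one_skill_out(
    skills: List[str],
    evaluate: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Dict[str, str]:
    """Label each skill "retain" or "retire" by its marginal validation gain.

    `evaluate` is a hypothetical callable that returns a validation score
    for an agent equipped with the given skill set.
    """
    full_score = evaluate(frozenset(skills))
    decisions: Dict[str, str] = {}
    for skill in skills:
        ablated = frozenset(s for s in skills if s != skill)
        gain = full_score - evaluate(ablated)
        decisions[skill] = "retain" if gain > min_gain else "retire"
    return decisions


def toy_score(active: FrozenSet[str]) -> float:
    # Toy scorer: "search" helps, "stale_parser" hurts (invented numbers).
    score = 0.5
    if "search" in active:
        score += 0.2
    if "stale_parser" in active:
        score -= 0.1
    return score


decisions = leave_one_skill_out(["search", "stale_parser"], toy_score)
```

The cost concern is visible in the loop: one full validation run plus one per skill, so n skills mean n + 1 evaluations per lifecycle step.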

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:52

28d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:52 · 05·11

→Research paper proposes scalable multi-agent path planning via optimal transport and Schrödinger Bridges

The paper reformulates anonymous multi-agent path finding (MAPF) as a Markov-structured multi-marginal optimal transport (MMOT) problem, reducing an exponentially large formulation to a polynomial-size LP. It states that total unimodularity yields integral 0/1 collision-free transports, and it uses Schrödinger Bridge entropic regularization with Sinkhorn-style iterations to build a reduced LP. The snippet does not disclose experiment sizes or numeric speedups.

#Robotics #Reasoning #Benchmarking #Research release

why featured

HKR-K passes via concrete mechanisms, but the story depends on optimal transport, Schrödinger Bridges, and LP details with no experiment scale disclosed. hard-exclusion-technical-accessibility-fail caps it below 40.

editor take

ICML 2026 spotlight casts anonymous MAPF as a polynomial-size LP; I buy the Schrödinger Bridge template, not the scalability headline.
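For readers who have not met Sinkhorn-style iterations, here is a minimal sketch on a plain two-marginal transport problem. It is a generic entropic optimal transport solver under assumed inputs (`cost`, `a`, `b`, `eps` are illustrative names), not the paper's Markov-structured MMOT formulation or its reduced LP.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(cost: Matrix, a: List[float], b: List[float],
             eps: float = 0.1, iters: int = 200) -> Matrix:
    """Entropic OT: alternately rescale rows and columns of exp(-cost / eps).

    a and b are the source and target marginals; the returned plan has
    column sums equal to b and row sums that converge to a.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Two agents, two goals, staying on the diagonal is free.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

As `eps` shrinks, the plan concentrates on the cheapest assignment. The paper's integrality claim rests on total unimodularity of its LP, which this toy does not demonstrate.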

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:51

28d ago

arXiv · cs.AI · atom · EN · 17:51 · 05·11

→Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

The paper proposes a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition and reports 89.2% best classification accuracy on the AIBangla compound character dataset.

#Vision #Multimodal #Benchmarking #AIBangla

why featured

HKR-K passes with a concrete method and 89.2% result; HKR-H and HKR-R are weak because the topic is niche character recognition. No hard exclusion applies, but general AI-practitioner value is limited.

editor take

AIBangla hits 89.2% accuracy, but gains are undisclosed; diffusion augmentation is useful tooling, not a vision breakthrough.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:50

28d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:50 · 05·11

→Shepherd: A Runtime Substrate for Meta-Agents with a Formalized Execution Trace

Shepherd formalizes meta-agent operations as functions and mechanizes the core operations in Lean. Its Git-like typed execution trace can fork from or replay any prior state; forking the agent process and filesystem is 5× faster than Docker, and replay achieves over 95% prompt-cache reuse.

#Agent #Tools #Reasoning #Shepherd

why featured

HKR-H/K/R all pass: the mechanism is novel and the post gives 5× fork speed plus 95%+ cache reuse. It stays in the 78–84 band because this is a single arXiv paper without adoption proof or major-lab backing.

editor take

Shepherd treats agent execution like a typed repo; the Lean-backed trace matters more than the 5× Docker fork speed.

sharp

Shepherd hits the dirty layer of agent infrastructure: execution history is not logging, it is programmable state. It records each agent-environment interaction as a typed event, then forks or replays from any prior state. That is closer to a production problem than another planner paper. The concrete hooks are strong: process and filesystem forking is 5× faster than Docker, replay gets over 95% prompt-cache reuse, and CooperBench pair-coding pass rate jumps from 28.8% to 54.7%. I am still allergic to “meta-agent” as a label; too many agent papers used it to dress up schedulers. Shepherd has a sturdier claim because Lean mechanizes the core operations and the Git-like trace makes reproducibility part of the runtime. The gap is evaluation detail: the RSS body gives benchmark deltas, not model choice, task scale, failure cases, or overhead breakdown.
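The "execution history as programmable state" idea can be sketched as an immutable, typed event trace with fork and replay. This is a toy under stated assumptions: `Event`, `Trace`, and the event kinds are hypothetical names, and nothing here models Shepherd's process or filesystem forking, its Lean mechanization, or prompt-cache reuse.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "tool_call" or "observation" (hypothetical labels)
    payload: str


@dataclass(frozen=True)
class Trace:
    """Append-only history; every operation returns a new Trace."""
    events: Tuple[Event, ...] = ()

    def append(self, event: Event) -> "Trace":
        return Trace(self.events + (event,))

    def fork(self, at: int) -> "Trace":
        """Branch from the state reached after the first `at` events."""
        if not 0 <= at <= len(self.events):
            raise IndexError("fork point outside trace")
        return Trace(self.events[:at])

    def replay(self, step: Callable[[State, Event], State], state: State) -> State:
        """Rebuild a state by folding `step` over the recorded events."""
        for event in self.events:
            state = step(state, event)
        return state


main = (Trace()
        .append(Event("tool_call", "ls"))
        .append(Event("observation", "a.txt"))
        .append(Event("tool_call", "rm a.txt")))
# Branch before the destructive call and try something else.
branch = main.fork(2).append(Event("tool_call", "cat a.txt"))
```

Because traces are immutable, the fork shares its prefix with the original and the original stays replayable, which is the property that makes reproducibility part of the runtime rather than a logging afterthought.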

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:49

28d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:49 · 05·11

→WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

WildClawBench introduces 60 bilingual multimodal long-horizon tasks averaging 8 minutes and more than 20 tool calls each; Claude Opus 4.7 scores highest at 62.2% overall under OpenClaw, while harness choice shifts one model by up to 18 points.

#Agent #Multimodal #Benchmarking #WildClawBench

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark with no cross-source uptake or deployment signal. It clears featured as a useful agent eval story, not a model-release event.

editor take

WildClawBench drags agents back into real CLI runtimes; Opus 4.7 tops out at 62.2%, so short-task demos deserve less applause.

sharp

WildClawBench hits the softest lie in agent evals: short tasks, mock APIs, and final-answer grading. The set has only 60 tasks, so it is not broad, but each task averages 8 minutes and more than 20 tool calls inside real CLI harnesses: OpenClaw, Claude Code, Codex, and Hermes Agent. That setup lets side effects pile up instead of vanishing behind a clean answer box. Claude Opus 4.7 leads under OpenClaw at 62.2%, and every other frontier model stays below 60%. The nastier number is the 18-point swing from changing the harness for one model. That says the agent wrapper is contaminating the model score. SWE-bench at least pins the patch boundary; here the exoskeleton changes the grade. Vendor demos that say “the model can use tools” need a discount.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:46

28d ago

arXiv · cs.AI · atom · EN · 17:46 · 05·11

→Engineering Robustness into Personal Agents with the AI Workflow Store

The paper proposes an AI Workflow Store for personal agents, shifting from seconds-to-minutes on-the-fly planning to reusable hardened workflows; the RSS snippet does not disclose experimental results, performance numbers, or deployment mechanics.

#Agent #Tools #Safety #Research release

why featured

HKR-H/K/R pass, but the post gives the AI Workflow Store mechanism without results, performance numbers, or deployment details. This fits the 60–71 band for an interesting but under-evidenced agent research item.

editor take

AI Workflow Store shifts agents from seconds-level planning to reusable workflows; no eval numbers disclosed, and the Store framing smells premature.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:46

28d ago

● P1 · arXiv · cs.AI · atom · EN · 17:46 · 05·11

→DataMaster: Autonomous Data Engineering for Machine Learning

DataMaster optimizes only the data side with tree-structured search, a shared Data Pool, and Global Memory; it improves the MLE-Bench Lite medal rate by 32.27% over the initial score and reaches 31.02% on GPQA in PostTrainBench versus 30.35% for the instruct model.

#Agent #Memory #Benchmarking #DataMaster

why featured

HKR-H/K/R all pass, but this is a single arXiv paper whose impact depends on code, replication, and real pipeline tests. The mechanisms and MLE-Bench Lite numbers justify a lower featured score.

editor take

DataMaster turns data wrangling into agentic search, and the 32.27% medal lift is loud; GPQA 31.02 vs 30.35 is too thin for victory laps.

sharp

Two arXiv categories carry the same DataMaster paper, with identical framing; this is one paper surfacing twice, not independent confirmation. The setup is clean: keep the learning algorithm fixed, let an agent handle external data discovery, selection, composition, cleaning, and transformation through DataTree, a shared Data Pool, and Global Memory. I buy the direction, but not the implied finish line. A 32.27% medal-rate lift on MLE-Bench Lite says branch search over data choices has signal. GPQA at 31.02% versus 30.35% on PostTrainBench is a 0.67-point edge, too narrow to treat as a robust post-training win. This smells like early AutoML: the algorithmic idea is sane, while the real bill hides in repeated downstream training and validation. The abstract does not disclose that budget.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:41

28d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:41 · 05·11

→CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

CapVector trains two converged models with distinct finetuning strategies, treats their parameter difference as capability vectors, and merges them into pretrained VLA models; the paper snippet does not disclose exact compute savings or benchmark numbers.

#Multimodal #Robotics #Fine-tuning #Research release

why featured

HKR-H/K pass: the weight-space capability-vector mechanism is testable and novel. HKR-R fails because the post lacks metrics, cost reduction, or deployment evidence, keeping it in the interesting-but-not-featured band.

editor take

CapVector trains two converged parameter sets and diffs them; no compute-savings numbers, so I read it as LoRA-style merging for VLA.
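The diff-and-merge mechanic can be shown with plain weight arithmetic. This is a sketch under assumptions, not CapVector's procedure: the subtraction direction, the scaling factor `alpha`, and the flat dict-of-lists parameter format are all illustrative choices the snippet does not specify.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def capability_vector(theta_a: Params, theta_b: Params) -> Params:
    """Per-parameter difference theta_a - theta_b (direction is an assumption)."""
    return {name: [a - b for a, b in zip(theta_a[name], theta_b[name])]
            for name in theta_a}


def merge(theta_base: Params, vector: Params, alpha: float = 1.0) -> Params:
    """Add a scaled capability vector to a base model's weights."""
    return {name: [w + alpha * d for w, d in zip(theta_base[name], vector[name])]
            for name in theta_base}


# Toy weights: the two finetuned models differ only in the first coordinate.
finetuned_a = {"layer.w": [1.5, 2.0]}
finetuned_b = {"layer.w": [1.0, 2.0]}
pretrained = {"layer.w": [0.0, 1.0]}

vec = capability_vector(finetuned_a, finetuned_b)   # {"layer.w": [0.5, 0.0]}
merged = merge(pretrained, vec, alpha=2.0)          # {"layer.w": [1.0, 1.0]}
```

The merge itself is a cheap elementwise add. Whatever compute saving the paper claims has to be weighed against training the two converged models that produce the vector, and the snippet does not give that budget.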

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:40

28d ago

arXiv · cs.CL · atom · EN · 17:40 · 05·11

→Research paper introduces RubricEM: meta-reinforcement learning with rubric-guided policy decomposition

RubricEM decomposes deep-research agent training into four rubric-conditioned stages—planning, evidence gathering, review, and synthesis—and trains RubricEM-8B with Stage-Structured GRPO plus a shared-backbone reflection meta-policy; the abstract claims gains across four long-form research benchmarks, but the post does not disclose exact scores.

#Agent #Reasoning #Memory #RubricEM

why featured

HKR-H and HKR-K pass: the paper targets RL beyond verifiable rewards and names a 4-stage GRPO mechanism. HKR-R is weak and scores are not disclosed, so it stays in the 60–71 band.

editor take

RubricEM-8B trains agents in 4 stages, but exact scores are missing; I don’t buy “near proprietary” without tables.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:38

28d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:38 · 05·11

→V4FinBench: Benchmark Dataset for Tabular Foundation Models in Corporate Bankruptcy Prediction

V4FinBench releases over 1 million V4 company-year records covering 2006-2021, 131 financial and non-financial features, six prediction horizons, and severe class imbalance with positive rates between 0.19% and 0.36%.

#Fine-tuning #Benchmarking #V4FinBench #TabPFN

why featured

HKR-K is strong thanks to concrete dataset scale and task conditions. The post discloses a benchmark release, not model rankings or a deployable mechanism, so it stays in the interesting research tier.

editor take

V4FinBench is a bad day for LLM-for-everything pitches: on 1M bankruptcy records, Llama-3-8B trails boosting while TabPFN holds up.

sharp

Both sources carry the same title and trace back to the same arXiv paper, so this is indexing breadth, not independent reporting. V4FinBench ships over 1M Visegrád company-year records, 131 features, six forecast horizons, and brutal positive rates of 0.19% to 0.36%. The sharp result is the negative one: QLoRA-finetuned Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on F1. TabPFN, after imbalance-aware finetuning, matches or beats boosting at longer horizons. That should annoy anyone selling “LLMs for financial risk” off QA benchmarks. BizFinBench-style expert Q&A is a different skill from rare-event tabular prediction, and V4FinBench makes that split harder to hand-wave away.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:35

28d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:35 · 05·11

→Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

BICR extracts hidden states twice from a frozen LVLM, using the real image and a blacked-out image with the same question, and reports the best cross-LVLM average for calibration and discrimination across 5 LVLMs and 7 baselines.

#Vision #Multimodal #Benchmarking #Research release

why featured

HKR-H/K/R all pass: the paper has a clear guessing-vs-grounding hook, a black-image contrast mechanism, and tests across 5 LVLMs and 7 baselines. As a single arXiv method paper without release or broad uptake, it sits at the lower featured band.

editor take

BICR hits the LVLM blind spot: a correct answer can still be language-prior cosplay, and the black-image control is the right kind of ugly test.

sharp

BICR makes “did the model actually use the image?” a training signal, not a post-hoc story. It runs the same question with the real image and a blacked-out image, extracts hidden states from a frozen LVLM twice, then penalizes high confidence on the blind view through a ranking loss. The paper reports best cross-LVLM averages for both calibration and discrimination across 5 modern LVLMs and 7 baselines. I like the mechanism more than the headline. The abstract gives 4-18x fewer parameters than the strongest probing baseline and cluster-aware significance, but not the LVLM list, splits, or absolute ECE/AUROC numbers. Compared with logit confidence or a plain probe, BICR attacks the actual production failure: in medical images or financial documents, a fluent answer from language priors is more dangerous when it happens to be correct once.
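The blind-view penalty can be sketched as a margin ranking loss over paired confidences. The abstract does not give the exact loss, so treat this hinge form, the `margin` value, and the function name as assumptions. The inputs stand in for confidence scores a probe would produce from the frozen LVLM's hidden states.

```python
from typing import Sequence


def blind_ranking_loss(conf_real: Sequence[float],
                       conf_blind: Sequence[float],
                       margin: float = 0.2) -> float:
    """Mean hinge penalty when real-image confidence does not beat
    blacked-out-image confidence by at least `margin` (assumed form)."""
    if len(conf_real) != len(conf_blind) or not conf_real:
        raise ValueError("need equally sized, non-empty confidence lists")
    penalties = [max(0.0, margin - (real - blind))
                 for real, blind in zip(conf_real, conf_blind)]
    return sum(penalties) / len(penalties)


grounded = blind_ranking_loss([0.9], [0.1])   # 0.0: the image clearly mattered
guessing = blind_ranking_loss([0.5], [0.5])   # 0.2: same confidence without the image
```

A model that is just as confident with the image blacked out pays the full margin, which is the "grounded or guessing" distinction turned into a gradient.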

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:33

28d ago

arXiv · cs.AI · atom · EN · 17:33 · 05·11

→Research Paper Analyzes On-Policy Distillation Effectiveness and Failure Modes

The paper introduces a training-free diagnostic framework that evaluates on-policy distillation per token, per question, and per teacher, using gradient alignment between an ideal per-node gradient and a distillation gradient; across self-distillation and external teachers, guidance aligns better on incorrect rollouts than on correct ones, and the best context varies by student capacity and task.

#Reasoning #Fine-tuning #Benchmarking #Research release

why featured

HKR-H/K/R pass, but this is a technical post-training paper with impact mostly inside fine-tuning and distillation work. The summary gives a diagnostic method and gradient-alignment claim, not a model release or production-pipeline replacement.

editor take

This paper scores on-policy distillation per token; the wild part is that the teacher signal aligns better on wrong rollouts.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:32

28d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:32 · 05·11

→Count Anything at Any Granularity

The paper defines open-world counting as five-level multi-grained counting, builds KubriCount with 3D synthesis, image editing, and VLM filtering, and trains HieraCount to use text, visual exemplars, and optional negative prompts, while the snippet does not disclose dataset size or benchmark numbers.

#Vision #Multimodal #Benchmarking #KubriCount

why featured

HKR-H and HKR-K pass: the title has a clear hook and the post gives five granularity levels, KubriCount, and HieraCount. HKR-R is weak because the impact stays mostly inside vision research.

editor take

KubriCount splits counting into 5 granularity levels; no size or scores disclosed, so I buy the framing, not the largest-dataset claim.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

17:32

28d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:32 · 05·11

→LoKA: Low-Precision Kernel Applications for Large-Scale Recommendation Models

LoKA introduces three components for applying FP8 to large recommendation models at scale. Probe profiles activation and weight statistics and per-layer errors, Mods adapts model components for stability and efficiency, and Dispatch selects the fastest FP8 kernel that meets accuracy requirements.

#Inference-opt #Benchmarking #LoKA #Research release

why featured

HKR-K passes because the post gives a concrete FP8 integration mechanism, but no latency, throughput, or accuracy gains are disclosed. The recsys-kernel angle is narrow, so it stays in all rather than featured.

editor take

LoKA is FP8 meeting recommender reality: tiny GEMMs, normalization, and comms break the neat LLM quantization story.

sharp

All 3 entries point to the same arXiv paper, 2605.10886, duplicated across cs.LG and cs.AI. That is not independent coverage; it is one ISCA’26 paper surfacing through multiple arXiv categories. LoKA’s call is right: FP8 for large recommendation models is not solved by dropping in a faster NVIDIA kernel. The abstract names three hard blockers: small GEMMs, normalization-driven numerical sensitivity, and communication-heavy training. Probe, Mods, and Dispatch map to profiling, model changes, and runtime kernel selection, so this reads like a systems paper rather than another quantization pitch. The LLM world sells FP8 as a clean throughput win; recommender stacks first need proof that layer-level accuracy does not collapse. The missing piece is concrete numbers: no AUC, throughput, or training-time delta appears in the provided abstract.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:27

28d ago

arXiv · cs.CL · atom · EN · 17:27 · 05·11

→Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs

Neural1.5 ranked second overall among teams completing all four ArchEHR-QA 2026 subtasks, with a mean rank of 4.00; the method uses DSPy MIPROv2 for per-stage prompt optimization, self-consistency voting across stochastic inference runs, and verification mechanisms for EHR clinical QA.

#RAG #Reasoning #Tools #Neural

why featured

HKR-K passes with a concrete rank, four subtasks, and a DSPy MIPROv2 mechanism. HKR-H/R are weak because this is a narrow clinical NLP shared-task paper, so it stays in all.

editor take

Neural1.5 averaged rank 4.00 across four tasks; in clinical QA, DSPy prompt search again beat fine-tuning spend.
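Self-consistency voting, one of the three ingredients named in the summary, is simple enough to sketch. This is a generic majority vote over sampled answers with an assumed normalization step (strip and lowercase); the team's actual aggregation and verification logic is not disclosed in the snippet.

```python
from collections import Counter
from typing import List


def self_consistency_vote(answers: List[str]) -> str:
    """Return the most frequent answer across stochastic inference runs.

    Ties go to the answer that appeared first, because Counter.most_common
    keeps insertion order among equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


winner = self_consistency_vote(["Yes", "yes ", "no"])   # "yes"
```

Exact-match voting works for short categorical answers. Free-text clinical answers would need some equivalence check before counting, which is presumably where the paper's verification mechanisms come in.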

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:27

28d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:27 · 05·11

→Self-optimizing language models improve performance through per-token compute allocation

Self-Optimizing Language Models pair a frozen LLM with a lightweight policy network that selects per-token efficiency actions for attention sparsity, MLP pruning, and quantization, improving quality at matched compute and raising MMLU accuracy by up to 7.3% over uniform budget allocation.

#Inference-opt #Benchmarking #Research release #Benchmark

why featured

HKR-H/K/R all pass: per-token compute allocation is a real hook, and the summary gives a mechanism plus a 7.3% MMLU gain. It stays at 78 because this is a single arXiv paper with no disclosed code, production use, or major-lab backing.

editor take

SOL moves inference savings from blanket compression to per-token scheduling; +7.3% MMLU is nice, but serving overhead will decide whether it ships.

sharp

SOL is the right kind of inference paper: freeze the LLM, then add a lightweight policy that picks attention sparsity, MLP pruning, and activation bit-width per token. The concrete hook is strong: up to +7.3% MMLU over uniform budget at matched compute. The training setup also matters: GRPO on teacher-forced counterfactual schedules keeps the token path fixed, so the reward compares compute choices rather than different generations. I still don’t buy the deployment story yet. A per-step policy plus dynamic sparsity, pruning, and quantization can wreck GPU kernels, batching, and KV-cache locality. The abstract gives quality-per-compute, not latency, throughput, or serving cost. vLLM and TensorRT-LLM-style stacks punish branchy decode paths; saving FLOPs in a paper does not automatically lower the bill in production.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

17:10

28d ago

arXiv · cs.CL · atom · EN · 17:10 · 05·11

→DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective over multi-candidate comparisons; its reverse data improves five benchmarks by 3.2% on average, and DGPO reports average accuracy gains of up to 3.6% across multiple datasets and model families.

#Alignment #Reasoning #Fine-tuning #Research release

why featured

HKR-K passes with a concrete training mechanism and benchmark gains. HKR-H and HKR-R are weak: DGPO reads as an incremental alignment/fine-tuning paper, not a broad product or model event.

editor take

DGPO reports up to 3.6% average accuracy gain; RSS omits baselines, model sizes, and significance, so don’t crown a DPO replacement yet.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

17:10

28d ago

arXiv · cs.CL · atom · EN · 17:10 · 05·11

→RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

The paper presents RUBEN, an interactive tool that explains retrieval-augmented LLM outputs with minimal rules. The snippet says its pruning strategies identify a rule set that subsumes all others, then use those rules to test safety-training resilience and adversarial prompt-injection effectiveness.

#RAG #Interpretability #Safety #RUBEN

why featured

HKR-K and HKR-R pass: RUBEN offers a rule-based explanation mechanism for RAG outputs and covers prompt injection plus safety-training robustness. No benchmark numbers, release details, or discussion signal, so it stays in the 60–71 research band.

editor take

RUBEN explains RAG outputs with minimal rules; only an RSS snippet, no code or scale, but audit tooling needs this shape.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

17:05

28d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:05 · 05·11

→Is Your Driving World Model an All-Around Player?

WorldLens evaluates six driving world models across five aspects and 24 standardized dimensions, finding no method dominates all axes and even the strongest models score only 2-3 out of 10 on human realism ratings.

#Vision #Multimodal #Benchmarking #WorldLens

why featured

HKR-H/K/R pass: the 2–3/10 realism result is a strong hook, and the benchmark covers 6 driving world models across 5 areas and 24 dimensions. Niche autonomous-driving focus keeps it at the featured floor.

editor take

WorldLens punctures the glossy driving-world-model demo loop: the best models score only 2-3/10 on human realism, so closed-loop trust is still thin.

sharp

WorldLens hits the exact place driving world models have been hiding: visual realism is not behavioral realism. It evaluates six models across five aspects and 24 dimensions, and the result is ugly. Texture-heavy models break physics, geometry-aware models fail under closed-loop planning, and the strongest systems still get only 2-3/10 from humans on realism. That is a harsh read for autonomy simulation. Many dash-cam demos have trained people to equate photorealistic rollout with useful synthetic driving data. WorldLens-26K adds 26,808 human preference entries with rationales, then distills WorldLens-Agent for scalable scoring, which is the right direction. I still would not confuse evaluator coverage with safety coverage. One closed-loop failure matters more than a beautiful 10-second clip.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

16:56

28d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 16:56 · 05·11

→BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

BabelDOC uses an intermediate representation to separate PDF layout metadata from semantic content, then re-anchors translated text through adaptive typesetting; experiments on a curated 200-page benchmark report better layout fidelity, visual aesthetics, and terminology consistency than baselines, and the open-source toolkit has 8.4K GitHub stars and 17 contributors.

#Multimodal #Tools #Benchmarking #BabelDOC

why featured

HKR-H/K/R all pass, but the scope is PDF translation and document tooling rather than a model/platform release. This fits the featured threshold, not the 78+ major-update band.

editor take

PDF translation was never just translation; it was re-rendering. BabelDOC attacks the ugly layout layer, not the fashionable agent layer.

sharp

BabelDOC picks the right fight in PDF translation: separating layout metadata from semantic content, then anchoring translated text back with adaptive typesetting. The paper reports a curated 200-page benchmark, human evaluation, and multimodal LLM-as-judge results, with gains in layout fidelity, visual aesthetics, and terminology consistency while keeping translation precision competitive. The 8.4K GitHub stars and 17 contributors matter because this is already beyond a paper-only demo. I like this much more than the multi-agent translation stack narrative, where roles multiply but the PDF still breaks on formulas, tables, and line wraps. My caution is the benchmark: 200 pages is useful, but the article does not break down coverage for two-column papers, scanned PDFs, dense tables, vertical text, or ugly corporate templates.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

16:50

28d ago

HuggingFace Papers (takara mirror) · rss · EN · 16:50 · 05·11

→Transcoda end-to-end zero-shot optical music recognition system released

Transcoda trains a 59M-parameter OMR model with synthetic data, **kern normalization, and grammar-based decoding in 6 hours on one GPU. It reaches 18.46% OMR-NED on a synthetic score benchmark (versus 43.91% for Legato) and 63.97% on historical Polish scans (versus 80.16% for SMT++).

#Vision #Benchmarking #Transcoda #Legato

why featured

HKR-K is strong: model size, training condition, and benchmark numbers are concrete. HKR-H/R are weak because OMR is too vertical for the broader AI-practitioner feed; no hard exclusion applies.

editor take

Transcoda trains a 59M model in 6 GPU-hours and hits 18.46% OMR-NED; clean the label space before scaling models.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

16:46

28d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 16:46 · 05·11

→The First Drop of Ink: Nonlinear Impact of Misleading Information in Long-Context Reasoning

The paper varies the hard-distractor ratio in fixed-length contexts and finds performance drops sharply within the first small fraction, while later increases add only marginal decline; its controlled experiments attribute filtering gains mainly to context-length reduction rather than distractor removal.

#RAG #Reasoning #Benchmarking #Research release

why featured

HKR-H/K/R all pass: the paper has a counterintuitive hook, a concrete fixed-context distractor-ratio setup, and a direct RAG-filtering implication. Missing model, dataset, and drop-size details keep it at 78.

editor take

RAG’s poison is not lots of noise; it is the first hard distractors. Bigger context windows just give bad evidence more room to win.

sharp

This paper moves the RAG failure story from “the window is too small” back to “retrieval is dirty.” The authors hold context length fixed, vary the hard-distractor ratio, and see a steep drop in the first small fraction; later distractors add only marginal damage. Their controlled tests also attribute filtering gains mainly to shorter context, not distractor removal itself. That is an annoying result for a lot of reranker and compression stacks. Many products sell “noise removal,” but the measured win may be token-budget reduction. The same warning applies to agent memory: a few semantically close wrong facts in the scratchpad can capture attention before the model ever reasons. The abstract does not disclose model names, tasks, or exact deltas, so I’d treat this as a sharp hypothesis, not a universal law.
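
A minimal sketch of the fixed-length setup described above, assuming contexts are assembled from a pool of passages and that a constant passage count is an acceptable proxy for constant length. `build_context`, the passage labels, and the slot count are hypothetical and not taken from the paper. Only the share of hard distractors varies; the total number of passages does not.

```python
import random

def build_context(gold, hard_distractors, fillers, n_slots, hard_ratio, seed=0):
    """Assemble a context with a fixed number of passages.

    Hypothetical helper: n_slots stays constant while hard_ratio sets
    the share of non-gold slots that are filled with hard distractors.
    """
    rng = random.Random(seed)
    n_other = n_slots - 1                      # one slot is reserved for the gold passage
    n_hard = round(hard_ratio * n_other)
    if n_hard > len(hard_distractors) or n_other - n_hard > len(fillers):
        raise ValueError("not enough passages to fill the context")
    passages = [gold]
    passages += rng.sample(hard_distractors, n_hard)
    passages += rng.sample(fillers, n_other - n_hard)
    rng.shuffle(passages)                      # avoid a fixed gold position
    return passages

hard = [f"hard-{i}" for i in range(20)]
easy = [f"filler-{i}" for i in range(20)]
for ratio in (0.0, 0.05, 0.25, 1.0):
    ctx = build_context("gold", hard, easy, n_slots=21, hard_ratio=ratio)
    n_hard = sum(p.startswith("hard-") for p in ctx)
    print(ratio, len(ctx), n_hard)
```

Sweeping `hard_ratio` at a fixed `n_slots` is what separates "more distractors" from "longer context"; a filtering experiment that also shortens the context changes both at once.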

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:34

28d ago

HuggingFace Papers (takara mirror)· rssEN16:34 · 05·11

→Policy Gradient Methods for Non-Markovian Reinforcement Learning

The paper proposes ASMPG for non-Markovian decision processes, jointly optimizing agent state dynamics and control policy, and establishes finite-time plus almost-sure convergence guarantees under episodic and infinite-horizon discounted settings.

#Agent#Reasoning#Research release

why featured

Hard-exclusion-technical-accessibility applies: non-Markovian policy gradients and convergence proofs need specialist context, and the post gives no agent/product implication. HKR-K passes, while HKR-H/R fail, so the score stays below 40.

editor take

ASMPG jointly optimizes agent state and policy; at 39 pages with 5 figures and 1 table, reproduction matters more than the theorem.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

16:26

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN16:26 · 05·11

→Muown: Row-Norm Control Method for Muon Optimization

Muown treats the row-magnitude vector as an explicit optimizer variable and improves perplexity over Muon, SOAP, AdamW, and Lion in GPT-style pre-training on FineWeb-Edu from 124M to 2.7B parameters, while reducing weight-decay sensitivity and avoiding spectral-norm drift with negligible sharded step-time overhead.

#Fine-tuning#Inference-opt#Muon#Muown

why featured

HKR-K/R pass: Muown has a concrete mechanism and 124M–2.7B pretraining comparisons, with cost relevance. HKR-H is weak, and the optimizer angle is niche, so it stays in 60–71.

editor take

Muown pins Muon’s pain on row-norm drift; beating AdamW up to 2.7B matters, but an arXiv abstract is not a training-stack verdict.

sharp

Both sources trace to arXiv 2605.10797 with the same framing, so this is paper propagation, not independent validation. Muown makes a clean claim: Muon’s instability comes from the row-magnitude factor driving spectral-norm drift, while row coherence stays controlled. The concrete hook is GPT-style pretraining on FineWeb-Edu from 124M to 2.7B parameters, where it claims lower perplexity than Muon, SOAP, AdamW, and Lion, plus a wider near-optimal learning-rate plateau. I buy the diagnosis more than the replacement narrative. Optimizer papers often look strong at small-to-mid scale under one data recipe; without 7B/13B runs, long-token budgets, released code, and hard step-time measurements, production training teams will not take on new optimizer state and sharding risk just because AdamW lost in an abstract.
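
To make "row-magnitude factor" concrete, here is a small sketch of splitting a weight matrix into per-row magnitudes and unit-norm row directions. It only illustrates the decomposition; how Muown actually updates the magnitude vector is not described in the abstract, and `split_rows` and `merge_rows` are hypothetical names.

```python
import math

def split_rows(weight):
    """Factor W into row magnitudes r and unit-norm directions D, so W = diag(r) D."""
    magnitudes = [math.sqrt(sum(x * x for x in row)) for row in weight]
    directions = [
        [x / m for x in row] if m > 0 else list(row)   # leave all-zero rows untouched
        for row, m in zip(weight, magnitudes)
    ]
    return magnitudes, directions

def merge_rows(magnitudes, directions):
    """Rebuild W from the two factors."""
    return [[m * x for x in row] for m, row in zip(magnitudes, directions)]

W = [[3.0, 4.0], [0.0, 2.0]]
r, D = split_rows(W)
print(r)   # row magnitudes, the quantity the paper says drives spectral-norm drift
```

With the factors separated, an optimizer can in principle track or constrain `r` on its own schedule while the directions are updated separately.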

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

16:20

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN16:20 · 05·11

→ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

ComplexMCP uses MCP to build 7 stateful sandboxes with more than 300 tested tools, and its evaluation finds top-tier LLM agents stay below a 60% success rate, compared with 90% human performance.

#Agent#RAG#Tools#ComplexMCP

why featured

HKR-H/K/R all pass: ComplexMCP tests agents in stateful MCP tool sandboxes and reports top agents below 60% versus humans at 90%. This is a strong practical benchmark, but still a single research release, so featured not P1.

editor take

ComplexMCP exposes the agent gap across 300+ tools and stateful sandboxes: under 60% success means single-API demos are still theater.

sharp

ComplexMCP is a useful slap because it moves agents from “can call tools” to “can finish workflows.” The setup has 7 stateful sandboxes, 300+ tested tools, seed-driven environment states, and injected API failures. Top-tier LLM agents still stay below 60% success, while humans hit 90%. That gap does not smell like a prompt fix; it smells like a systems capability gap: retrieval saturation as tools scale, skipped environment checks, and failure rationalization after bad steps. LiveMCPBench already pushed MCP evaluation toward 70 servers and 527 tools, but ComplexMCP adds state dependence and failure noise. That is closer to enterprise software than the usual clean API maze. Agent vendors should hate this benchmark because it tests whether the workflow survives after the happy path breaks.
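
A toy illustration of what "seed-driven state plus injected API failures" can look like from the agent side. This is a sketch under my own assumptions, not ComplexMCP code: `FlakyInventoryTool`, its error strings, and `reserve_with_retry` are hypothetical.

```python
import random

class FlakyInventoryTool:
    """Toy stateful tool; hypothetical, not part of ComplexMCP."""

    def __init__(self, seed, failure_rate=0.3):
        self._rng = random.Random(seed)        # the seed fixes both state and failures
        self.stock = {"widget": self._rng.randint(1, 5)}
        self.failure_rate = failure_rate

    def reserve(self, item, qty):
        if self._rng.random() < self.failure_rate:
            return {"ok": False, "error": "transient API failure"}
        if self.stock.get(item, 0) < qty:
            return {"ok": False, "error": "insufficient stock"}
        self.stock[item] -= qty
        return {"ok": True, "remaining": self.stock[item]}

def reserve_with_retry(tool, item, qty, max_attempts=5):
    """Retry transient failures only; stop on errors that reflect real state."""
    result = {"ok": False, "error": "not attempted"}
    for _ in range(max_attempts):
        result = tool.reserve(item, qty)
        if result["ok"] or result["error"] != "transient API failure":
            break
    return result

tool = FlakyInventoryTool(seed=7)
print(reserve_with_retry(tool, "widget", 1))
```

The point of the split in `reserve_with_retry` is the behavior the benchmark is said to punish: an agent that treats "insufficient stock" as noise and keeps retrying is rationalizing a failure instead of checking the environment.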

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:14

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN16:14 · 05·11

→LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

LITMUS benchmarks behavioral jailbreaks of LLM agents in real OS environments with 819 high-risk test cases, semantic-physical dual verification, and OS-level rollback; Claude Sonnet 4.6 still executes 40.64% of dangerous operations, while execution hallucination shows agents can verbally refuse after completing harmful system actions.

#Agent#Safety#Benchmarking#LITMUS

why featured

HKR-H/K/R all pass: real-OS jailbreaks add a strong hook, 819 cases and 40.64% provide concrete signal, and agent execution risk is practitioner-relevant. It fits 78–84 because this is a benchmark paper, not a major model launch or cross-source event.

editor take

Claude Sonnet 4.6 executed 40.64% of risky OS actions in LITMUS; agent safety still looks built for chat, not system access.

sharp

Agent safety evaluation is finally hitting the OS layer, and chat refusal is exposed as a weak control. LITMUS uses 819 high-risk cases, semantic-physical verification, and OS-level rollback in real environments; Claude Sonnet 4.6 still executed 40.64% of dangerous operations. The nasty part is Execution Hallucination: the model refuses in text after the harmful system action has already happened. That undercuts the product story that more tools plus better policy prompts can contain agents. Most jailbreak benchmarks scored text; LITMUS scores files, commands, and system state. Its attacks include jailbreak speaking, skill injection, and entity wrapping. If an agent stack lacks permission sandboxing, transactional rollback, and pre-action review, model-level refusal is theater.
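
The refusal-after-action failure is easy to state in code: check what the agent said and what the system state did, separately. The sketch below is an assumption-laden toy, not the LITMUS harness. The refusal markers, the label names, and the file-size snapshot are all hypothetical stand-ins for the paper's semantic and physical verification.

```python
import os
import tempfile

def snapshot(root):
    """Map each file under root to its size; a crude stand-in for OS state."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[os.path.relpath(path, root)] = os.path.getsize(path)
    return state

def classify(reply_text, before, after):
    """Hypothetical labels combining the text check and the state check."""
    markers = ("i can't", "i cannot", "i won't")
    refused = any(m in reply_text.lower() for m in markers)
    changed = before != after
    if refused and changed:
        return "execution_hallucination"   # refusal in text, change on disk
    if changed:
        return "executed"
    return "refused" if refused else "no_action"

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "important.txt")
    with open(target, "w") as f:
        f.write("data")
    before = snapshot(root)
    os.remove(target)                      # simulated harmful agent action
    after = snapshot(root)
    print(classify("I cannot help with deleting files.", before, after))
```

A text-only scorer would mark this run as a refusal. Scoring the state diff is what turns it into a failure.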

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:59

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN15:59 · 05·11

→Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

The paper proposes UJEM-KL, an untargeted VLM jailbreak that maximizes entropy at high-entropy decision tokens and stabilizes low-entropy positions, reporting improved cross-model transferability across 3 VLMs and 2 safety benchmarks while preserving output quality under representative defenses.

#Vision#Safety#Alignment#Research release

why featured

HKR-H/K/R all pass: the entropy-maximization attack is a concrete hook, with UJEM-KL tested on 3 VLMs and 2 safety benchmarks. Strong safety-research signal, but still a paper-level result rather than a must-write product or model release.

editor take

This moves VLM jailbreaks from prompt trickery to refusal-token control; defenses that key on phrases or fixed prefixes are lagging.

sharp

UJEM-KL’s sharp move is attacking the refusal branch, not forcing a canned bad answer. The paper says refusal concentrates at high-entropy autoregressive decision tokens, while non-refusal tokens already hold meaningful probability among top candidates. It then maximizes entropy at those positions and uses KL to stabilize low-entropy positions. That smells closer to a real VLM attack surface than universal images trained to elicit a fixed prefix. The evidence is concrete enough: 3 VLMs, 2 safety benchmarks, white-box success, transfer gains, and representative defenses. The article does not disclose model names or ASR numbers, so I would not oversell it. But it does push back on the recent claim that gradient-based universal image jailbreaks barely transfer: the failure was tied to over-constrained objectives, not necessarily to multimodal jailbreak transfer itself.
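
A rough sketch of the objective shape described above, assuming per-position next-token distributions are available for the clean and the perturbed input. The threshold `tau`, the weight `lam`, the KL direction, and the use of the clean distribution to decide which positions count as high-entropy are all my assumptions; the abstract does not specify them.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small floor on q to avoid division by zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def ujem_style_loss(clean_dists, adv_dists, tau=0.5, lam=1.0):
    """Loss to minimize. tau, lam, and the KL direction are assumptions."""
    loss = 0.0
    for p_clean, p_adv in zip(clean_dists, adv_dists):
        if entropy(p_clean) >= tau:
            loss -= entropy(p_adv)               # decision token: push entropy up
        else:
            loss += lam * kl(p_clean, p_adv)     # confident token: stay close to clean
    return loss

clean = [[0.5, 0.5], [0.99, 0.01]]
print(ujem_style_loss(clean, clean))
```

The contrast with a targeted attack is visible in the loss: nothing here names a desired output string, so there is no fixed prefix for a defense to key on.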

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:34

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN15:34 · 05·11

→Qwen-Image-2.0 Technical Report Released

Qwen-Image-2.0 uses Qwen3-VL as the condition encoder and a Multimodal Diffusion Transformer, supporting up to 1K-token instructions for generating and editing text-rich images.

#Multimodal#Vision#Qwen#Research release

why featured

HKR-H/K/R all pass, and Qwen is a Chinese flagship model line. Missing benchmarks, license, and availability keep it in the 78–84 band rather than P1.

editor take

Qwen-Image-2.0’s 1K-token prompt support is a bid for production graphics, not prettier demos.

sharp

Qwen-Image-2.0 is about long instructions and typography, not another photorealism lap. It uses Qwen3-VL as the condition encoder with a Multimodal Diffusion Transformer, then claims up to 1K-token prompts for slides, posters, infographics, and comics. That design choice is telling: put a strong vision-language model in front, then make generation follow document-like semantics. I’m discounting the “substantially outperforms” line. The snippet says human evaluations, but gives no sample size, rival list, sub-scores, or inference cost. Compared with Midjourney’s visual polish and Ideogram’s text-rendering reputation, Qwen has a real opening in multilingual typography. Without a public typography benchmark, 1K-token support is an interface claim, not proof of production reliability.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:33

28d ago

HuggingFace Papers (takara mirror)· rssEN15:33 · 05·11

→Kernel-Gradient Drifting Models Enable One-Step Generation Without Distillation

The paper proposes kernel-gradient drifting, replacing fixed Euclidean displacement with kernel-induced directions, and reports one-step generation without distillation across three settings: spherical geospatial data, promoter DNA, and molecule generation.

#Inference-opt#Research release

why featured

Triggers hard-exclusion-technical-accessibility: kernel-gradient drifting and kernel-induced directions require deep math background, with no engineering on-ramp. HKR-K is present, but the hard cap keeps it excluded.

editor take

Kernel-Gradient Drifting Models claim one-step generation without distillation; I buy the geometry, but 3 task types don’t dethrone diffusion.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

14:36

28d ago

HuggingFace Papers (takara mirror)· rssEN14:36 · 05·11

→Paper proposes recursive decomposition framework for causal structure learning with latent variables

The paper proposes DiCoLa, a recursive decomposition framework that splits causal discovery with latent variables into smaller subproblems and reconstructs the global structure; the post states soundness and completeness proofs plus synthetic and real-world experiments, but does not disclose dataset sizes or speedup numbers.

#Reasoning#Benchmarking#DiCoLa#Research release

why featured

Triggers hard-exclusion-technical-accessibility: latent-variable causal structure learning is specialized, with no experiment scale, speedup, product, or agent implication. HKR-K passes, but the cap keeps it below 40.

editor take

DiCoLa decomposes latent-variable causal discovery recursively; proofs are claimed, but no speedup numbers are disclosed here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:12

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN14:12 · 05·11

→MulTaBench Multimodal Tabular Learning Benchmark with Text and Image Released

MulTaBench introduces 40 datasets, split evenly between image-tabular and text-tabular tasks, to benchmark target-aware representation tuning for multimodal tabular learning across modalities, tabular learners, encoder scales, and embedding dimensions.

#Multimodal#Vision#Benchmarking#MulTaBench

why featured

HKR-K passes because the paper offers a concrete 40-dataset benchmark; HKR-H and HKR-R are weak because the angle is academic and narrow to multimodal tabular learning.

editor take

MulTaBench lands a clean punch: tabular FMs cannot keep treating text and images as frozen side features in real business tables.

sharp

Two sources carry the same MulTaBench title, and both route back to arXiv 2605.10616. Treat this as a paper-distribution signal, not independent market validation. The concrete contribution is 40 datasets, split 20 image-tabular and 20 text-tabular, with tasks selected for complementary modality signal. I buy the problem framing. Tabular FMs have spent the last year leaning on frozen text or image embeddings, then handing the result to TabPFN-style models, TabICL-style models, or boosted trees. MulTaBench’s hook is target-aware representation tuning: generic CLIP/BERT-like embeddings drop label-relevant detail in healthcare and e-commerce settings. Don’t read this as another leaderboard grab. It is a curated stress test for where multimodal tabular foundation models actually break.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:11

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN14:11 · 05·11

→PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

PRISM combines 16 lexical, structural, information-theoretic, behavioural, and contextual signals at each decoding step to detect credential leakage, achieving F1 0.832, precision 1.000, recall 0.712, and a 0.0% task-level leak rate on a 2,000-task adversarial benchmark.

#Agent#Safety#Inference-opt#PRISM

why featured

HKR-H/K/R all pass: PRISM targets secret leakage in agent pipelines with a generation-time mechanism and 2,000-task results. It is a strong research item, not a major lab product release, so 80 fits the recommendation band.

editor take

PRISM moves secret-leak defense into decoding, which is the right layer for agents; recall 0.712 says it is a tripwire, not a vault.

sharp

PRISM picks the right layer: multi-agent secret leakage is not an output-review problem. It is risk accumulation during generation after shared context keeps exposing the same credential. PRISM combines 16 signals at each decoding step and reports F1 0.832, precision 1.000, recall 0.712, and 0.0% task-level leakage across 2,000 adversarial tasks, 13 attack categories, and a four-agent pipeline. I buy the direction, not the comfort blanket. Entropy collapse and logit concentration are closer to the leak point than regex filters or LLM-as-judge cleanup. Span Tagger’s 15.0% task-level leak rate also makes the baseline look stale. But recall 0.712 still misses almost three in ten detection events. The 0.0% leak number is a benchmark result under stated conditions, not a production guarantee for messy multi-tenant agent stacks.
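
The reported detection numbers are internally consistent, which is worth a ten-second check. The snippet only recomputes F1 from the stated precision and recall; `f1_score` is a local helper, not anything from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.000, 0.712
print(round(f1_score(precision, recall), 3))   # 0.832, matching the reported F1
print(round(1 - recall, 3))                    # 0.288, the share of missed detection events
```

Perfect precision with 0.712 recall is the profile of a detector tuned never to cry wolf, which is why the miss rate, not the F1, is the number to watch.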

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

13:59

28d ago

HuggingFace Papers (takara mirror)· rssEN13:59 · 05·11

→CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations

CausalGS learns causal dynamics of complex 3D scenes solely from multi-view videos, jointly inferring the initial velocity field and intrinsic material properties. The framework uses a differentiable physics simulator for physics-regularized training and reports state-of-the-art long-term future frame extrapolation, while the snippet does not disclose dataset names or numeric scores.

#Vision#Reasoning#Benchmarking#Research release

why featured

HKR-H/K pass: it learns 3D physical causality from multi-view video via velocity fields, material properties, and differentiable physics. No product path, open-source artifact, or benchmark numbers, so it stays in 60–71.

editor take

CausalGS infers velocity and material from multi-view video; datasets and scores are undisclosed, so treat SOTA as paper-claim only.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

13:19

28d ago

HuggingFace Papers (takara mirror)· rssEN13:19 · 05·11

→ConfoundingSHAP: Quantifying Confounding Strength in Causal Inference

The paper introduces ConfoundingSHAP, a Shapley-based method that assigns confounding strength to individual covariates in observational causal inference, and uses TabPFN-based estimation to evaluate many adjustment sets without exhaustive refitting.

#Interpretability#ConfoundingSHAP#TabPFN#Research release

why featured

Triggers hard-exclusion-1: causal inference, Shapley confounding strength, and TabPFN adjustment sets need deep specialty context. HKR-K is present via a new mechanism, but no product or agent angle keeps it below 40.

editor take

ConfoundingSHAP assigns confounding strength to covariates; TabPFN avoids exhaustive refits, but benchmark details aren’t disclosed here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

12:41

28d ago

HuggingFace Papers (takara mirror)· rssEN12:41 · 05·11

→ASIA: an Autonomous System Identification Agent

ASIA delegates model-class selection, training-algorithm choice, and hyperparameter tuning for system identification to an LLM coding agent, then evaluates it on 2 benchmarks; the paper flags implicit test leakage, reduced methodological transparency, and reproducibility concerns as current limitations.

#Agent#Code#Benchmarking#ASIA

why featured

HKR-K/R pass: the paper gives a concrete agent workflow, 2 benchmarks, and eval-risk details. Its system-identification focus raises the technical-accessibility bar, keeping it in the 60–71 research-signal band.

editor take

ASIA reports only 2 benchmarks and flags implicit test leakage as a limitation; handing system ID to an agent is not an automated-science win yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

11:52

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN11:52 · 05·11

→Remember to Forget: Gated Adaptive Positional Encoding paper published

The paper proposes GAPE, a drop-in positional-encoding augmentation that adds content-aware bias to attention logits, using a query-dependent gate to contract irrelevant context and a key-dependent gate to preserve salient distant tokens.

#Reasoning#Inference-opt#Benchmarking#GAPE

why featured

HKR-H and HKR-K pass: GAPE has a clear mechanism for forgetting irrelevant context while preserving distant key tokens. HKR-R is weak: no benchmark scores, model scale, or code are disclosed, so it stays all-tier.

editor take

GAPE adds content gates to RoPE; if the win stays in synthetic retrieval, model labs won’t touch their attention kernels.

sharp

Two sources picked up GAPE, but the chain is thin: arXiv v1 plus Hugging Face Papers, both pointing to the same paper. The mechanism is concrete: add a query gate and a key gate to RoPE, suppress irrelevant distant tokens in attention logits, and keep salient far tokens reachable inside standard scaled dot-product attention. I’m skeptical until the deployment numbers show up. RoPE extrapolation failure is real; anyone testing long-context retrieval has seen distant junk pull attention. But the abstract gives no benchmark scores, context lengths, or training-cost delta. Without those, GAPE reads like a clean attention-bias idea, not a change Qwen, Llama, or Claude teams would immediately put into a production training recipe.
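
For intuition only, here is one way a query gate and a key gate could shape an additive distance bias on causal attention logits. The functional form, the `slope` parameter, and the gate ranges are illustrative assumptions of mine; the abstract does not give GAPE's actual formula, and this is not it.

```python
def gated_distance_bias(query_gates, key_gates, slope=0.1):
    """Additive bias for causal attention logits (illustrative form only).

    bias[i][j] = -slope * g_q[i] * (1 - g_k[j]) * (i - j) for j <= i.
    g_q[i] in [0, 1]: how strongly query i contracts distant context.
    g_k[j] in [0, 1]: how strongly key j resists that contraction.
    Future positions (j > i) are masked with -inf.
    """
    n = len(query_gates)
    return [
        [
            -slope * query_gates[i] * (1.0 - key_gates[j]) * (i - j)
            if j <= i else float("-inf")
            for j in range(n)
        ]
        for i in range(n)
    ]

# Key 0 is marked salient (gate 1.0), so it is not penalized for distance.
bias = gated_distance_bias([1.0, 1.0, 1.0], [1.0, 0.0, 0.0])
print(bias[2])
```

In this toy form the bias is added to the scaled dot-product logits before softmax, so the attention kernel itself is unchanged. Whether the real method keeps that property at training scale is exactly what the missing benchmarks would need to show.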

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

11:38

28d ago

HuggingFace Papers (takara mirror)· rssEN11:38 · 05·11

→Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

The paper introduces Sens-VisualNews, a benchmark with 9,576 news images annotated for sensational visual concepts and events, and evaluates open multimodal LLMs on prompt sensitivity, performance, and robustness under zero-shot and fine-tuned settings.

#Multimodal#Vision#Benchmarking#Sens-VisualNews

why featured

HKR-H and HKR-K pass: the angle is fresh, with 9,576 images and zero-shot/fine-tuned tests. It remains a niche multimodal benchmark with no disclosed mainstream model or product impact, so it stays in the 60–71 band.

editor take

Sens-VisualNews ships 9,576 news images; useful benchmark idea, but the snippet gives no annotation boundary for “sensational.”

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

11:36

28d ago

HuggingFace Papers (takara mirror)· rssEN11:36 · 05·11

→Phoenix-VL 1.5 Medium Technical Report

Phoenix-VL 1.5 Medium adapts Mistral Medium 3.1 into a 123B-parameter native multimodal and multilingual model for Singapore. Training uses a 1T-token localized multimodal corpus, 250B tokens for long-context extension, 22B post-training tokens, and 5B tokens for Online Direct Preference Optimization alignment.

#Multimodal#Alignment#Benchmarking#Mistral AI

why featured

HKR-K passes because the post gives concrete Phoenix-VL 1.5 Medium training-data sizes and ODPO alignment data. HKR-H is weak and HKR-R lacks open-source, pricing, or production-impact hooks, so this fits a normal research-release band.

editor take

Phoenix-VL 1.5 uses 123B params and 1T local multimodal tokens for sovereign AI; I buy the data bet, not the unscored “minimal degradation” claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

11:28

28d ago

HuggingFace Papers (takara mirror)· rssEN11:28 · 05·11

→GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

GuardAD models autonomous-driving safety as an evolving Markovian logical state and revises actions without modifying the underlying MLLM; across multiple benchmarks and AD-MLLMs, it reduces accident rates by 32.07% and improves task performance by 6.85%.

#Multimodal#Safety#Robotics#GuardAD

why featured

HKR-H/K/R pass via a concrete safety hook, Markovian mechanism, and AV liability angle. The work is still a niche research paper, not a general model or product release, so it stays in the 60–71 band.

editor take

GuardAD cuts accidents 32.07%, but benchmarks and vehicle-test details are undisclosed; don't call it an AD safety gate yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:42

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN10:42 · 05·11

→An Annotation Scheme and Classifier for Personal Facts in Dialogue

The authors annotated 2,779 personal facts from Multi-Session Chat and trained a transformer-encoder multi-head classifier; with the Gemma-300M encoder, it reaches 81.6±2.6% macro F1, above the best few-shot LLM baseline, GPT-5.4-mini at 72.92%, while the dataset and classifier are publicly available.

#Memory#Benchmarking#Gemma#GPT-5.4-mini

why featured

HKR-H comes from the small-model-over-LLM result; HKR-K has dataset size and F1, and HKR-R touches agent memory/privacy. It stays near the featured floor because this is one paper on one dataset, not a broad product shift.

editor take

Gemma-300M beating GPT-5.4-mini few-shot is less small-model hype and more proof that memory needs extractors, not chatty generalists.

sharp

Personal memory still punishes general LLM prompting. The authors label only 2,779 personal facts from Multi-Session Chat, yet a Gemma-300M encoder reaches 81.6±2.6% macro F1. GPT-5.4-mini few-shot tops out at 72.92%. That stings because many assistants still treat “memory” as long context plus a summary prompt. This paper goes the opposite way: Demographics, Possessions, Duration, Validity, and Followup turn user facts into fields you can store, expire, and use for continuation. I don’t read this as a solved memory stack. The error analysis still flags semantic boundaries, temporal interpretation, and pragmatic followup judgment. But as a cheap gate before anything enters memory, a 300M encoder is exactly the kind of boring component production systems need.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:35

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN10:35 · 05·11

→The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

The paper proposes the Alpha Blending Hypothesis, arguing that frame-based deepfake detectors mainly localize low-level compositing artifacts; BlenD trains on real facial images augmented with SBI, reaches the best average cross-dataset generalization on 15 compositional deepfake datasets from 2019 to 2025, and an ensemble achieves 94.0% AUROC.

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass: the shortcut angle is clickable, the paper gives 15 datasets and 94.0% AUROC, and deepfake reliability hits a safety nerve. Single-paper scope keeps it in low featured.

editor take

This paper punctures deepfake-detector mystique: 94.0% AUROC is strong, but it also says detectors are still hunting blend seams.

sharp

The sharp claim here is that frame-level deepfake detectors are not learning much “fake semantics.” They are mostly finding alpha-blending residue. BlenD trains on real face images plus SBI, with no explicit generated deepfakes, yet gets the best average cross-dataset generalization across 15 compositional deepfake datasets from 2019 to 2025. The ensemble reaches 94.0% AUROC. I buy the diagnosis because it explains a familiar failure mode: detectors travel poorly across generators because they are betting on shared post-processing scars, not a Sora, diffusion, or GAN fingerprint. The catch is obvious. If full-frame generation or end-to-end video models remove the face-paste compositing step, this edge shrinks. Compared with ScaleDF-style scaling, with 5.8M real and 8.8M fake images, BlenD is less brute force and more shortcut forensics.
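
The compositing step at the center of the hypothesis is a one-line formula: out = α · foreground + (1 − α) · background. The sketch below applies it to nested lists of grayscale pixels and assumes SBI refers to self-blended images, where a perturbed copy of a real image is blended back into itself. The brightness shift is an illustrative stand-in for SBI's augmentations, and both function names are hypothetical.

```python
def alpha_blend(foreground, background, mask):
    """Per-pixel composite: out = a * fg + (1 - a) * bg, with a in [0, 1]."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, mask)
    ]

def self_blend(image, mask, brightness=1.1):
    """Pseudo-fake from a real image: blend a perturbed copy with the original.

    The brightness change is an assumption used for illustration only.
    """
    source = [[min(255.0, p * brightness) for p in row] for row in image]
    return alpha_blend(source, image, mask)

image = [[100.0, 200.0]]
mask = [[0.0, 1.0]]          # left pixel untouched, right pixel fully replaced
print(self_blend(image, mask))
```

No generator is involved anywhere in this pipeline, which is the paper's point: if a detector trained on such samples generalizes, it is keying on the blend boundary rather than on generated content.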

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:17

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN10:17 · 05·11

→Active Tabular Augmentation via Policy-Guided Diffusion Inpainting

TAP combines diffusion inpainting with a learner-conditioned policy for tabular augmentation, using explicit gating and conservative windowed commitment; under severe data scarcity, it outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.

#Fine-tuning#Benchmarking#TAP#Research release

why featured

HKR-H/K pass: the paper states a concrete method and results on 7 tabular datasets, up to +15.6 pp accuracy and 32% lower RMSE. HKR-R is weak because the topic is narrow tabular ML research, so it stays in 60–71.

editor take

TAP attacks the right failure mode: plausible tabular samples that hurt the learner. Seven datasets is a start, not a victory lap.

sharp

Both sources point to the same arXiv/ICML 2026 paper, so the coverage is aligned through one paper-distribution chain, not independent confirmation. TAP’s useful move is coupling diffusion inpainting with a learner-conditioned policy, then using gating and windowed commitment to decide when synthetic rows enter training. The target is held-out loss, not distributional prettiness. I buy the problem framing more than the implied durability. The abstract reports seven real-world datasets, up to +15.6 classification points, and up to 32% lower regression RMSE. It does not show failure cases, compute budget, or baseline tuning strength in the provided body. Tabular augmentation has burned people before with CTGAN/TVAE-style wins that vanish under different splits. If TAP ships code, splits, and training scripts, it becomes a serious production candidate; without that, it is a strong ICML claim.
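
A sketch of what "gate on held-out loss, then commit in windows" could look like as a control loop. This is my reading of the abstract, not TAP's algorithm: the `window` rule, the acceptance test, and the toy loss are hypothetical, and in a real run `heldout_loss` would retrain or update the learner before scoring.

```python
def gated_augmentation(batches, heldout_loss, window=2):
    """Admit synthetic rows only when held-out loss improves.

    batches: list of synthetic row batches (each a list).
    heldout_loss: callable(rows) -> float, lower is better.
    window: hypothetical number of accepted batches before a commit check.
    """
    committed = []              # synthetic rows that have entered training
    pending = []                # passed the gate, not yet committed
    accepted_in_window = 0
    for batch in batches:
        current = heldout_loss(committed + pending)
        if heldout_loss(committed + pending + batch) < current:
            pending = pending + batch           # gate passed for this batch
            accepted_in_window += 1
        if accepted_in_window == window:
            # Commit only if the whole window still beats the committed set.
            if heldout_loss(committed + pending) < heldout_loss(committed):
                committed = committed + pending
            pending, accepted_in_window = [], 0
    return committed

# Toy loss: rows are numbers, and the learner is "best" when they sum to 10.
toy_loss = lambda rows: abs(10 - sum(rows))
print(gated_augmentation([[3], [-5], [4], [2], [1]], toy_loss))
```

The batch `[-5]` is plausible as data but raises the held-out loss, so it never enters training. That is the failure mode the paper targets: samples that look fine distributionally and still hurt the learner.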

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

09:54

28d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN09:54 · 05·11

→Robust Probabilistic Shielding for Safe Offline Reinforcement Learning

The paper integrates probabilistic shielding with safe policy improvement for offline reinforcement learning, using only a fixed dataset plus safe and unsafe state labels, and guarantees a safe improved policy with high probability while reporting better average and worst-case performance in low-data regimes.

#Reasoning#Safety#Research release#Safety/alignment

why featured

HKR-K/R pass: the paper gives a concrete offline-RL safety mechanism and a high-probability guarantee. HKR-H is weak, and the specialist RL framing lacks a product hook or reproducible numbers, so it stays in the 60–71 band.

editor take

Two sources trace to one arXiv paper; SPI plus shielding is clean, but don’t call it a safe-RL win without task scale and violation curves.

sharp

Hugging Face Papers and arXiv both point to 2605.10293, with aligned wording from the paper abstract rather than independent confirmation. The paper combines safe policy improvement with shielding for offline RL: it uses only a fixed dataset plus known safe and unsafe states, then constrains policy-improvement actions to give a high-probability safety guarantee. I like the direction because it avoids hiding safety inside expected-cost constraints. Compared with FISOR-style work chasing hard safety in offline RL, this is closer to an engineering guardrail: block unsafe actions before optimizing reward. The catch is concrete: the body only claims better average and worst-case performance in low-data regimes, without benchmark names, violation-rate curves, or a measure of shield conservatism. Without those, “safe” can still be a neat theorem on small controlled tasks.
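
"Block unsafe actions before optimizing reward" can be sketched as a filter in front of a greedy policy. The function below is a generic shielding illustration, not the paper's construction: the risk estimates, the `threshold`, and the fallback rule are assumptions, and the paper's high-probability guarantee depends on how those estimates are built from the fixed dataset.

```python
def shielded_greedy_action(q_values, unsafe_prob, threshold=0.05):
    """Pick the best-valued action among those the shield allows.

    q_values: dict action -> estimated return.
    unsafe_prob: dict action -> estimated probability of reaching an unsafe state.
    threshold: hypothetical risk tolerance.
    """
    allowed = [a for a in q_values if unsafe_prob[a] <= threshold]
    if not allowed:
        # Fall back to the least risky action rather than acting unshielded.
        return min(q_values, key=lambda a: unsafe_prob[a])
    return max(allowed, key=lambda a: q_values[a])

q = {"fast": 10.0, "slow": 3.0}
risk = {"fast": 0.40, "slow": 0.01}
print(shielded_greedy_action(q, risk))   # slow
```

Shield conservatism shows up directly in `threshold`: set it too low and the policy collapses onto a handful of safe actions, which is why the missing violation-rate and conservatism measurements matter.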

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

09:35

28d ago

HuggingFace Papers (takara mirror)· rssEN09:35 · 05·11

→Generalization Error Bounds for Picard-Type Operator Learning in Nonlinear Parabolic PDEs

The paper derives implementation-agnostic generalization error bounds for Picard-type operator learning in nonlinear parabolic PDEs, separating implementation error from estimation error, and shows that increasing Picard depth reduces truncation error without unbounded growth in entropy-based estimation error.

#Reasoning#Benchmarking#Research release

why featured

Hard-exclusion-technical-accessibility applies: nonlinear parabolic PDE bounds for Picard-type operator learning require numerical-analysis context and offer no general AI engineering on-ramp. HKR-K passes, but HKR-H/R fail.

editor take

Taniguchi and Sonoda give 39 pages of bounds for Picard operator learning; don’t overread it, since no code or benchmarks are disclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:29

29d ago

HuggingFace Papers (takara mirror)· rssEN08:29 · 05·11

→Joint sparse coding and temporal dynamics support context reconfiguration

The paper identifies joint sparse coding and temporal dynamics in mouse mPFC and computational networks, where sparsity reduces cross-context interference and temporal dynamics improve separability over time, with spiking neural networks showing better lifelong-learning retention without auxiliary heuristics.

#Memory#Fine-tuning#Robotics#Research release

why featured

Triggers hard-exclusion technical-accessibility and science-crossover rules: mouse mPFC, sparse coding, and spiking networks are specialist-heavy, with no product, agent, or reproducible practitioner path; HKR-K is present, but capped below 40.

editor take

2605.10178 ties sparse coding plus temporal dynamics to lifelong learning; 37-page preprint, no code disclosed, don’t port it to Transformers yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:28

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 08:28 · 05·11

→MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

MTA-RL fuses RGB images and LiDAR with a transformer to predict 3D affordances, then feeds those semantics to an RL policy in CARLA Town01-03 with 20-60 background vehicles. Trained only on Town03, it reports up to 9.0% higher Route Completion, 11.0% higher Total Distance, and 83.7% higher Distance Per Violation.

#Multimodal#Vision#Robotics#MTA-RL

why featured

HKR-K passes via reproducible CARLA settings and three reported gains. HKR-H and HKR-R are weak because this is a technical autonomous-driving simulation paper, so it sits in the 60–71 band.

editor take

MTA-RL reports 83.7% higher DPV in CARLA; Town01-03 is too narrow for the robust-driving claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:28

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 08:28 · 05·11

→When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in LLM-Driven Applications

The paper proposes a three-layer defense framework for LLM-mediated SQL injection, covering prompt sanitization, behavioral and semantic anomaly detection, and signature-based controls; the post says evaluation used prompt injection, obfuscated SQL payloads, and context-manipulation attacks, but does not disclose accuracy, false-positive rate, or dataset size.

#Safety#Benchmarking#Fine-tuning#Research release

why featured

HKR-H/K/R pass: the LLM-SQL injection framing is clickable, and the three-layer defense is concrete. Evidence is thin: no accuracy, false-positive rate, or dataset size, so it stays in 60–71.

editor take

The paper gives a three-layer LLM-SQL defense, but no accuracy, FPR, or dataset size; treat “high detection” as a claim.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

08:12

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 08:12 · 05·11

→Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images

Active-SAOOD selects instance-level sparse samples with a model-state observation module, using orientation, classification, localization uncertainty, and class diversity; at a 1% annotation ratio, it improves performance by 9% over the baseline, and the code will be public.

#Vision#Research release#Open source

why featured

HKR-K passes via the 1% annotation setting, 9% baseline gain, and planned code release. HKR-H and HKR-R fail because this is a narrow remote-sensing vision paper with little practitioner pull.

editor take

Active-SAOOD gains 9% at 1% annotation; for remote-sensing oriented object detection, seed stability matters, and the snippet omits it.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

07:36

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 07:36 · 05·11

→Explainability of Recurrent Neural Networks for P300 Brain-Computer Interfaces

The paper introduces a Post-Recurrent Module inside an RNN for classifying P300 signals from EEG data, reports a 9% performance gain over the state of the art, and uses global and local explainability methods to identify relevant brain regions and critical time intervals.

#Interpretability#Research release

why featured

Hard-exclusion technical-accessibility applies: P300 BCI and EEG explainability are too specialized, with no product, agent, or engineering adoption angle. HKR-K passes via the 9% gain and module detail, but H/R fail.

editor take

PRM lifts P300-RNN performance by 9%, but dataset scale isn’t disclosed; BCI deployment claims need cross-subject replication first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

06:43

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 06:43 · 05·11

→NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

NCO performs online pattern matching for finite hard constraints and regex constraints during decoding, avoiding the state explosion of a single automaton, and remains compatible with sampling methods, beam search, and soft masking for PII and profanity suppression.

#Safety#Inference-opt#NCO#Research release

why featured

HKR-K passes: NCO’s online constraint-matching mechanism is useful for controlled generation and inference work. HKR-H/R are weak; the title is academic and the impact looks narrow, so it stays in all.

editor take

NCO matches banned strings and regex online; no overhead numbers disclosed, so don’t retire mature guardrails yet.
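
Online matching of banned strings during decoding can be illustrated with a naive per-step check. This is a minimal sketch, assuming candidate tokens are plain strings and constraints are literal substrings; `BANNED` and `blocked_tokens` are hypothetical names, and the brute-force scan stands in for NCO's matcher, which the post does not detail. Regex constraints are not covered.

```python
# Hypothetical finite hard constraints (literal strings that must never appear).
BANNED = ["password", "ssn"]
MAX_LEN = max(len(b) for b in BANNED)

def blocked_tokens(generated, candidates):
    """Return the candidate tokens that would complete a banned string.

    Only the last MAX_LEN - 1 characters of the output can combine with a new
    token to form a new match, so the check never rescans the full text.
    """
    tail = generated[-(MAX_LEN - 1):] if MAX_LEN > 1 else ""
    return {tok for tok in candidates if any(b in tail + tok for b in BANNED)}

# At each decoding step, tokens in the blocked set would be masked before sampling.
step_mask = blocked_tokens("my pass", ["word", "port", " is"])
```

Because the check only yields a set of disallowed tokens, it composes with sampling or beam search: the decoder still chooses freely among whatever remains.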

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

06:39

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 06:39 · 05·11

→MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

MAGE externalizes self-evolving agent knowledge into a four-subgraph co-evolutionary knowledge graph and evaluates a frozen execution model on 9 benchmarks; the graph, a task-level search bandit, and a skill-level routing bandit update from the same reward stream while the learner backbone stays unchanged.

#Agent#Reasoning#Memory#MAGE

why featured

HKR-H/K pass: the agent self-evolution mechanism and 9-benchmark setup are concrete. No code, result numbers, or production validation is disclosed, so this stays in the 60–71 research-release band.

editor take

MAGE uses a 4-subgraph memory across 9 benchmarks; frozen-backbone gains make agent self-evolution look more like retrieval infrastructure.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

06:12

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 06:12 · 05·11

→Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

The paper proposes C-BPO, which treats target-user data as positive feedback and other users’ data as implicit negative feedback, then uses PU learning to subtract positive bias; the post does not disclose task counts, backbone model names, or exact metric gains.

#Fine-tuning#Alignment#Research release

why featured

HKR-K passes because the post states a concrete C-BPO mechanism. HKR-H and HKR-R fail: no surprising result, no metrics, and no broad practitioner nerve beyond a narrow personalization-tuning audience.

editor take

C-BPO uses target data as positives and others as implicit negatives; no task count or gains disclosed, so treat it as a neat objective.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

05:06

29d ago

HuggingFace Papers (takara mirror) · rss · EN · 05:06 · 05·11

→StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

StereoPolicy improves robotic manipulation policies with synchronized stereo image pairs, using pretrained 2D vision encoders and a Stereo Transformer, and outperforms RGB, RGB-D, point-cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson.

#Robotics#Vision#Reasoning#Research release

why featured

HKR-H/K pass: stereo perception and three simulation benchmarks add signal. HKR-R is weak, and the post does not disclose real-robot results, effect size, or code, so it stays in the mid all band.

editor take

StereoPolicy beats baselines on 3 sim benchmarks; I buy stereo, but no gains or real-robot results are disclosed, so don’t dunk on RGB-D yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Paper introduces normalizing flow models for trajectory modeling

The paper introduces Normalizing Trajectory Models, which model each reverse step as a conditional normalizing flow, train with exact likelihood, and match or outperform strong text-to-image baselines in four sampling steps while retaining exact likelihood over the generative trajectory.

#Multimodal#Inference-opt#Research release#Benchmark

why featured

HKR-H/K/R pass, but this is a single arXiv paper with only 4-step sampling, strong-baseline comparison, and exact likelihood disclosed; model scale, datasets, and code status are not given, so it stays near the featured threshold.

editor take

Both hits are the same arXiv chain; NTM’s sharp claim is four-step generation with exact likelihood. Don’t bury diffusion until code and compute-matched runs land.

sharp

Two listed sources point to the same arXiv cs.LG record, with identical framing, so this is paper-surfacing signal rather than independent validation. NTM models each reverse step as a conditional normalizing flow, claims four-step text-to-image sampling against strong baselines, and keeps exact likelihood over the trajectory. I buy the target: few-step diffusion usually gets speed from distillation, consistency training, or adversarial losses, then loses the clean likelihood story. NTM is trying to tie fast sampling back to a tractable density objective, which matters for diagnosable generators. The abstract gives no FID, CLIP, latency, VRAM, or training-compute numbers; the fair fight is against Latent Consistency Models and Rectified Flow under matched budgets.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Scaling Categorical Flow Maps for Text Generation

The authors train a 1.7B-parameter base flow model on 2.1T tokens and self-distill it into a CFM that generates diverse text in as few as 4 inference steps.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass: the paper has a concrete 4-step generation hook, specific scale numbers, and a clear inference-cost angle. It stays below must-write level because it is an arXiv research item at 1.7B scale, with no product adoption shown.

editor take

Both entries are the same arXiv paper: 1.7B, 2.1T tokens, 4-step generation. CFM just left toy-scale territory for language modeling.

sharp

Both items point to the same arXiv record, so this is not independent corroboration; the shared numbers are still concrete: 1.7B parameters, 2.1T tokens, and 4 inference steps. My read: Categorical Flow Maps have earned a seat in the LM-scaling conversation, not a license to replace autoregressive decoding. The hook is not “continuous generation for discrete text.” It is the self-distilled CFM keeping near-data-level token entropy after few-step sampling. That matters because low-step generators often collapse into fluent low-entropy mush. The abstract only says benchmark scores land in the same range as discrete diffusion methods, with no head-to-head against production models like GPT-5 or Claude Sonnet 4.5. That gap keeps this in research-signal territory.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Louver reformulates sparse attention as range searching for efficient KV cache indexing

The paper introduces Louver, a KV cache index that reformulates sparse attention as halfspace range searching and guarantees zero false negatives under a specified threshold; experiments report better accuracy and runtime than prior sparse attention methods, and faster execution than optimized dense attention such as FlashAttention.

#Inference-opt#Louver#FlashAttention#Research release

why featured

HKR-H/K/R all pass: the concrete mechanism is halfspace range search with zero false negatives and a faster-than-FlashAttention claim. As a single arXiv systems paper needing reproduction, it stays at the featured threshold, below 78.

editor take

Louver reframes sparse attention as indexing, and zero false negatives is the right target. “Faster than FlashAttention” needs code and model details first.

sharp

Both event members point to the same arXiv 2605.06763 paper, so the coverage is aligned but not independently corroborated. Louver’s serious hook is the halfspace range-searching formulation and a zero-false-negative guarantee for KV retrieval above a chosen threshold. I like the direction more than the abstract’s victory lap. The paper’s claim that missing even one critical key causes sharp error spikes in long reasoning matches what sparse attention keeps running into. But “faster than highly optimized dense attentions such as FlashAttention” needs the full setup: model size, context length, batch shape, GPU kernels, and index update cost. Sparse attention has lost many engineering fights because the saved attention FLOPs come back as retrieval, maintenance, or memory-movement overhead.
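
To make the zero-false-negative target concrete: the keys worth retrieving for a query q are those with q·k at or above a threshold, which is a halfspace in key space. This is a minimal sketch of an exact filter with block pruning, assuming plain Python lists as vectors; the Cauchy-Schwarz norm bound is a swapped-in technique for illustration, not Louver's index, and `build_blocks` and `retrieve` are hypothetical names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def build_blocks(keys, block_size):
    """Group keys into blocks and record each block's largest key norm."""
    blocks = []
    for start in range(0, len(keys), block_size):
        chunk = keys[start:start + block_size]
        blocks.append((start, chunk, max(norm(k) for k in chunk)))
    return blocks

def retrieve(query, blocks, threshold):
    """Return indices of all keys with dot(query, key) >= threshold.

    A block is skipped only when |q| * max|k| < threshold. Since
    q.k <= |q||k| (Cauchy-Schwarz), a skipped block cannot hold a
    qualifying key, so pruning never drops a true hit (up to
    floating-point rounding).
    """
    q_norm = norm(query)
    hits = []
    for start, chunk, max_norm in blocks:
        if q_norm * max_norm < threshold:
            continue
        hits.extend(start + i for i, k in enumerate(chunk) if dot(query, k) >= threshold)
    return hits

keys = [[1.0, 0.0], [3.0, 0.0], [0.1, 0.1], [0.0, 0.2]]
query = [1.0, 0.0]
hits = retrieve(query, build_blocks(keys, block_size=2), threshold=0.5)
```

The sketch also shows where the engineering cost hides: the per-block metadata has to be maintained as the KV cache grows, which is the index update cost the abstract leaves open.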

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→SpikingBrain: Spiking Brain-inspired Large Models

SpikingBrain introduces a 7B linear LLM and a 76B hybrid-linear MoE trained for weeks on hundreds of MetaX GPUs, using about 150B tokens for continual pre-training; the 7B model reports over 100x faster time to first token on 4M-token sequences, and its spiking scheme reaches 69.15% sparsity.

#Inference-opt#Reasoning#MetaX#SpikingBrain

why featured

HKR-H/K/R all pass: the paper gives testable claims on 4M-token TTFT, sparsity, and 76B MoE scale. It stays in the 78–84 band because it is still an arXiv research release without independent replication or product adoption.

editor take

SpikingBrain makes non-NVIDIA training the headline; 100x TTFT is loud, but a 4M-token setup is a narrow arena, not a general win yet.

sharp

SpikingBrain’s strongest claim is not the brain-inspired branding; it is that MetaX hardware trained a 7B linear LLM and a 76B hybrid-linear MoE. The paper says training stayed stable for weeks on hundreds of MetaX GPUs, with about 150B tokens of continual pretraining. It also reports over 100x faster TTFT on 4M-token sequences and 69.15% sparsity from the spiking scheme. I would discount the 100x number until the eval setting is unpacked. Linear attention always looks best when the prompt is absurdly long, and 4M tokens is not a normal serving path for most teams. Mamba and RWKV already taught the field that long-context efficiency is the easy pitch; quality, kernels, batching, and deployment ergonomics decide adoption. The MetaX stability claim carries more weight than the “spiking brain” wrapper.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→DurableUn: Quantization-Induced Recovery Attacks in Machine Unlearning

DurableUn evaluates seven machine-unlearning methods on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU, finding that INT4 quantization restores forgotten content by up to 22x even when models pass BF16 compliance audits.

#Fine-tuning#Safety#Benchmarking#LLaMA

why featured

HKR-H/K/R all pass: INT4 quantization weakens machine unlearning, backed by 7 methods and a 22x recovery claim. This is practical safety research, not a same-day must-write model release.

editor take

INT4 turns “unlearned” into a deploy-time illusion; a 22x recovery rate makes BF16 compliance audits look dangerously staged.

sharp

Unlearning that passes only at BF16 breaks once the deployed model is INT4. DurableUn tests seven methods on LLaMA-3-8B-Instruct across TOFU, MUSE-News, and WikiBio-WPU; INT8 stays benign, while INT4 restores forgotten content by up to 22x. That is not benchmark dust. The uncomfortable part is the setting: NF4+LoRA is close to how teams actually ship cheap adapters, not a toy attack. DURABLEUN-SAF pushes gradients through INT4 rounding with an STE and reports Q-INT4=0.043±0.002 with a 3/3 certificate; SalUn gets 1/3 under its own published hyperparameters. Privacy audits that stop at BF16 are certifying a lab artifact, then deploying a different system.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents

The paper introduces ULSPB with 350 settings, five assistance categories, seven interaction patterns, and 24-turn routine interactions. StateGuard audits state diffs at the writeback boundary and reduces Harm Score to near zero across evaluated models.

#Agent#Memory#Safety#Research release

why featured

HKR-H/K/R all pass: the paper frames routine chats as long-term state poisoning, adds a 350-setting benchmark, and targets a real safety risk for memory agents. It stays below P1 because this is a single arXiv paper, not a major lab release or cross-source event.

editor take

Agent memory is a persistent attack surface before it is a personalization moat; writeback auditing beats another layer of prompt glue.

sharp

This paper puts the target in the right place: long-term agent state, not one-shot jailbreak text. ULSPB tests 350 settings, five assistance categories, seven interaction patterns, and 24-turn routine chats. The failure mode is nasty because the poisoning comes from normal conversation, then lands in memory artifacts, confirmation boundaries, tool defaults, and autonomous behavior. StateGuard’s design choice is the useful part. It does not try to divine intent at the input layer; it audits state diffs at the writeback boundary and rolls back dangerous edits. The paper says Harm Score drops to near zero across evaluated models, with high false positives under a safety-first policy. I buy that tradeoff. A bad memory write is not one bad response; it becomes ambient compromise for every future tool call.
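
Auditing at the writeback boundary can be sketched as a diff-then-commit step. This is a minimal sketch, assuming agent state is a flat dictionary and the policy is a fixed set of protected keys; `PROTECTED_KEYS`, `state_diff`, and `audited_writeback` are hypothetical names, and StateGuard's actual audit logic is not described in the post.

```python
# Hypothetical policy: keys that routine chat writebacks must never change.
PROTECTED_KEYS = {"confirm_before_payment", "default_recipient"}

def state_diff(old, new):
    """Return {key: (old_value, new_value)} for every key that changed."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

def audited_writeback(old, proposed):
    """Commit the proposed state unless its diff touches a protected key.

    Returns (committed_state, flagged_keys). A flagged write is rolled back
    whole, so benign edits bundled with it are rejected too.
    """
    diff = state_diff(old, proposed)
    flagged = sorted(k for k in diff if k in PROTECTED_KEYS)
    if flagged:
        return dict(old), flagged
    return dict(proposed), []

state = {"confirm_before_payment": True, "tone": "formal"}
proposed = {"confirm_before_payment": False, "tone": "casual"}
committed, flagged = audited_writeback(state, proposed)
```

Rejecting the whole write is the coarse, safety-first option: the harmless `tone` change is lost along with the dangerous one.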

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

CASCADE formalizes deployment-time learning as a third LLM lifecycle stage and uses evolving episodic memory plus a contextual bandit for experience reuse without parameter updates; across 16 tasks, it raises macro-averaged success rate by 20.9% over zero-shot prompting and outperforms gradient-based and memory-based baselines.

#Agent#Memory#Tools#CASCADE

why featured

HKR-H/K/R all pass: CASCADE proposes deployment-time adaptation without parameter updates and reports +20.9% across 16 tasks. It stays in the 78–84 band because this is still an arXiv method paper, with no disclosed release or adoption.

editor take

CASCADE gives deployment-time learning a clean frame, but that 20.9% gain lives or dies on memory hygiene, not the bandit wrapper.

sharp

CASCADE is useful because it turns agent experience reuse into a testable mechanism, not another prompt recipe. It keeps LLM weights fixed, stores episodic cases, and uses a contextual bandit for selection. Across 16 tasks, it reports a 20.9% macro success-rate gain over zero-shot prompting. That maps to the pain every production agent team has hit: long tasks make raw context expensive, while static RAG misses behavioral feedback. CASCADE gives the cleaner abstraction. I’m less sold on the self-improving-agent framing. The abstract does not spell out memory pollution, bad-case retirement, or negative transfer handling. Without an audit layer, deployment-time learning starts as adaptation and quietly becomes automated technical debt.
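
Case selection with a contextual bandit and frozen weights can be sketched as follows. This is a minimal sketch, assuming discrete context labels and an epsilon-greedy rule over running mean rewards; `CaseBandit` and the case ids are hypothetical, and the abstract does not say which bandit algorithm CASCADE uses.

```python
import random
from collections import defaultdict

class CaseBandit:
    """Epsilon-greedy choice of which stored case to reuse for a given context."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, case_id) -> number of pulls
        self.means = defaultdict(float)   # (context, case_id) -> mean reward

    def select(self, context, case_ids):
        """Explore with probability epsilon, otherwise pick the best-known case."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(case_ids)
        return max(case_ids, key=lambda c: self.means[(context, c)])

    def update(self, context, case_id, reward):
        """Incrementally update the mean reward; no model weights are touched."""
        key = (context, case_id)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]

bandit = CaseBandit(epsilon=0.0)
bandit.update("sql-task", "case_a", 0.0)   # reusing case_a failed
bandit.update("sql-task", "case_b", 1.0)   # reusing case_b succeeded
best = bandit.select("sql-task", ["case_a", "case_b"])
```

Nothing here retires a case that stops working, which is the memory-hygiene gap: stale cases keep their old mean reward until enough new feedback arrives.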

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

Weblica uses HTTP-level caching and LLM-based environment synthesis to build reproducible web environments, scaling RL training to thousands of environments and tasks; its Weblica-8B model outperforms similar-size open-weight baselines on multiple web navigation benchmarks while using fewer inference steps.

#Agent#Vision#Tools#Weblica

why featured

HKR-H/K/R all pass, but this is a single arXiv paper and the feed does not disclose release status, authorship, or replication details. Its agent-training infrastructure angle fits the 78–84 band.

editor take

Weblica nails the web-agent bottleneck as environment reproducibility; that matters more than another navigation leaderboard bump.

sharp

Weblica’s useful move is refusing to treat web-agent training as more trajectory scraping. It turns websites into repeatable RL environments: HTTP-level caching fixes visual state, LLM synthesis expands task coverage, and the authors claim thousands of environments and tasks. Weblica-8B beats similar-size open-weight baselines on multiple navigation benchmarks with fewer inference steps. I buy the direction before I buy the score. Web agents have been poisoned by flaky evals and shifting pages; MiniWoB and WebArena both exposed that pain. The hard question is whether replay preserves ugly web behavior: auth walls, popups, async loading, broken selectors. The abstract gives no exact benchmark numbers, so the ablations matter more than the headline win.
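
HTTP-level caching for repeatable environments can be sketched as a record-and-replay layer. This is a minimal sketch, assuming requests are keyed by method, URL, and body and that the fetch function is injected; `ReplayCache`, `fake_fetch`, and the example URL are hypothetical, and Weblica's actual cache design is not given in the abstract.

```python
class ReplayCache:
    """Record the first response for each request and replay it afterwards."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable(method, url, body) -> response
        self.store = {}

    def get(self, method, url, body=b""):
        key = (method.upper(), url, body)
        if key not in self.store:
            # First visit: hit the live backend once and record the result.
            self.store[key] = self.fetch(method, url, body)
        # Every later episode sees the identical recorded response.
        return self.store[key]

calls = []

def fake_fetch(method, url, body):
    """Stand-in for a live site whose response changes on every call."""
    calls.append(url)
    return f"snapshot-{len(calls)}"

cache = ReplayCache(fake_fetch)
first = cache.get("GET", "https://example.test/cart")
second = cache.get("get", "https://example.test/cart")
```

A key this simple ignores cookies, headers, and timing, which is exactly where auth walls and async loading would break replay fidelity.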

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Post-training makes large language models less human-like

The paper introduces the Psych-201 dataset to measure behavioral alignment at scale, and finds that post-training reduces alignment with human behavior across model families, sizes, and objectives.

#Alignment#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R all pass: the claim is counterintuitive, the post names Psych-201 and cross-model findings, and it targets post-training alignment tradeoffs. Single arXiv paper with no replication or cluster, so it stays in the 78–84 research-release band.

editor take

Post-training makes models better assistants and worse human proxies; Psych-201 punctures the lazy equation between alignment and human-likeness.

sharp

Stop calling RLHF polish “more human.” Psych-201 makes the split explicit: post-training reduces behavioral alignment with humans across model families, sizes, and objectives. The hard hook is the dataset itself: 201 psychology tasks measuring behavioral choices, not exam-style accuracy like MMLU. That lands badly for anyone using chat models as cheap social-science subjects. Chatbot Arena and SWE-bench reward usefulness, compliance, and task completion; Psych-201 tests biases, heuristics, and noisy human inconsistency. Product post-training sands those traits down on purpose. So using instruction-tuned Claude or GPT as a human subject pool increasingly smells like a cost-cutting simulation artifact, not behavioral evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

PhoneSafety evaluates eight phone-use agents on 700 safety-critical moments from real phone interactions across more than 130 apps, and finds that harmless outcomes must be separated into safe choices, unsafe choices, and failures to do anything useful.

#Agent#Safety#Benchmarking#PhoneSafety

why featured

HKR-H/K/R all pass: the title has a real tension, PhoneSafety tests 700 phone-risk moments across 8 agents, and the result taxonomy separates safe choice, unsafe choice, and no effective action. Strong agent-safety benchmark, not a major model release.

editor take

PhoneSafety exposes a cheap safety illusion in phone agents: no harm often means the agent failed to act, not that it judged well.

sharp

PhoneSafety’s useful cut is separating harmless outcomes into safe choice, unsafe choice, and no useful action. The benchmark uses 700 risky phone moments, more than 130 apps, and eight phone-use agents, so it targets the click where product incidents happen. I buy this framing. Phone agents do not fail like chatbots; they fail by tapping transfer, allow, delete, or buy. The paper says stronger general phone-use ability does not reliably produce safer choices, while no-action failures cluster on visually and operationally harder screens. Put that beside WebArena or OSWorld-style task-success scores, and a lot of “safety” has been polluted by incapability. If a safety dashboard rewards refusals or harmless final states, it can end up grading a dumb agent as a cautious one.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

The paper introduces MathlibPR, a benchmark built from real Mathlib4 pull request histories, and reports that DeepSeek, Qwen, Goedel, Kimina, Codex, and Claude Code struggle to distinguish merge-ready PRs from build-passing PRs that were revised or never merged.

#Reasoning#Code#Benchmarking#Mathlib

why featured

HKR-H/K/R all pass: real Mathlib4 PRs separate test-passing from merge-readiness, a concrete coding-agent reliability claim. The formal-math-library scope is narrow and no cross-source cluster is shown, so it stays high all.

editor take

MathlibPR hits the sore spot: passing Lean isn’t the bar; surviving Mathlib review is where LLM reasoning still looks amateur.

sharp

Both listed sources are the same arXiv entry, so the coverage is aligned through a single source chain. MathlibPR uses real Mathlib4 PR histories and evaluates DeepSeek, Qwen, Goedel, Kimina, Codex, and Claude Code on merge-readiness. I think this is a sharper benchmark than another Lean theorem set. It moves the task from “does the proof compile” to “does this patch belong in shared mathematical infrastructure.” The abstract says both models and agents struggle to separate merge-ready PRs from build-passing PRs that were revised or never merged. That failure matters: formal-math automation is no longer just proof search. It also has to learn Mathlib conventions, naming taste, abstraction boundaries, and maintenance cost. MiniF2F and ProofNet test problem-solving muscle; MathlibPR tests maintainer judgment.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Zero-Shot Quantization via Weight-Space Arithmetic

The paper introduces a quantization vector extracted from donor-task weight differences, patches receiver models without receiver training data or QAT, and reports up to a 60-point Top-1 gain after 3-bit PTQ across four ViT scales and 22 image classification tasks.

#Inference-opt#Vision#Research release

why featured

HKR-H/K/R all pass: the 60-point gain is a strong hook, donor weight-delta transfer is testable, and 3-bit zero-shot PTQ hits inference cost. As a single arXiv paper needing reproduction, it fits the 78–84 band.

editor take

A 60-point 3-bit PTQ rescue from donor weight deltas is elegant; I’d hold the hype until it survives LLMs, not just ViT classifiers.

sharp

The sharp part is treating quantization robustness as a transferable weight direction, not a per-model QAT chore. The quantization vector comes from donor-task weight differences, then patches a receiver model; the paper reports up to a 60-point Top-1 gain after 3-bit PTQ across four ViT scales and 22 image-classification tasks, with no receiver training data. If that holds, some edge deployments lose a whole calibration loop. I’m cautious because the evidence is still ViT classification. LLM deployment has messier failure modes: MoE routing, KV-cache pressure, activation outliers, and long-context drift. The abstract gives no 2-bit result, no INT4 LLM run, and no latency or memory numbers. Nice algebra; not yet a deployment recipe.
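
The weight-space arithmetic can be sketched in a few lines. This is a minimal sketch, assuming weights are flat lists of floats and that the donor difference is taken between a quantization-robust donor checkpoint and a plain one; that pairing, the scaling factor `alpha`, and the function names are assumptions, since the abstract only says the vector comes from donor-task weight differences.

```python
def quantization_vector(donor_robust, donor_plain):
    """Direction in weight space separating the robust donor from the plain donor."""
    return [r - p for r, p in zip(donor_robust, donor_plain)]

def patch(receiver, q_vec, alpha=1.0):
    """Shift receiver weights along the donor direction; no receiver data is used."""
    return [w + alpha * d for w, d in zip(receiver, q_vec)]

# Toy two-parameter "models" with the same shape.
donor_robust = [1.5, -0.5]
donor_plain = [1.0, 0.0]
receiver = [2.0, 2.0]

q_vec = quantization_vector(donor_robust, donor_plain)
patched = patch(receiver, q_vec, alpha=0.5)
```

The arithmetic only makes sense when donor and receiver share an architecture and parameter layout, which is one reason to wait for results beyond same-family ViTs.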

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

ThinKV compresses KV cache by assigning token precision by thought importance and evicting lower-value tokens, keeping near-lossless accuracy with under 5% of the original KV cache on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason while raising inference throughput by up to 5.8x over state-of-the-art baselines.

#Reasoning#Inference-opt#DeepSeek#NVIDIA

why featured

HKR-H/K/R all pass: the <5% KV cache and 5.8x throughput claims are concrete, and the mechanism targets reasoning-model inference cost. As an arXiv research release, it fits the 78–84 recommendation band rather than same-day must-write.

editor take

ThinKV attacks the ugliest cost in long reasoning: KV memory. The 5.8x throughput claim is great; vLLM integration decides the impact.

sharp

ThinKV is selling a sharper idea than generic KV compression: treat CoT tokens as unevenly valuable, then quantize or evict by “thought” importance. On DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason, the paper reports under 5% of original KV cache, near-lossless accuracy, and up to 5.8x higher inference throughput. I buy the direction because long-reasoning cost has shifted hard into decode-time memory pressure. The wild part is the PagedAttention extension: it reuses evicted tokens’ memory slots and avoids compaction overhead. The caveat is also obvious: 5.8x is against stated baselines, and the abstract does not show mixed online batching, short-long request interference, or tool-call interruptions. If this lands cleanly in vLLM-style serving, it matters.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Rubric-based On-policy Distillation

The paper introduces ROPD, a rubric-based on-policy distillation framework that derives prompt-specific rubrics from teacher-student contrasts, scores student rollouts without teacher logits, and reports up to a 10x gain in sample efficiency over logit-based OPD methods across most tested scenarios.

#Fine-tuning#Alignment#Research release#Open source

why featured

HKR-H/K/R pass: ROPD offers a concrete rubric-generation mechanism and a 10x sample-efficiency claim. As a single arXiv paper without adoption evidence, it fits the 78–84 research-release band.

editor take

ROPD moves OPD from white-box logits to black-box responses; that cuts closer to proprietary-teacher moat erosion than another distillation trick.

sharp

ROPD’s sharp move is lowering OPD’s interface from teacher logits to teacher responses. It derives prompt-specific rubrics from teacher-student contrasts, scores student rollouts, then runs on-policy optimization. The paper claims up to 10x sample-efficiency gains over logit-based OPD, and the code is public. If that holds outside the paper setup, proprietary models become easier to use as teachers without white-box access. I’d discount the 10x headline until the full eval details are inspected. The scraped body gives no task list, student sizes, teacher names, or failure cases beyond “most scenarios.” The closest lineage is RLAIF and Constitutional AI: turn judgment into a structured reward signal. The upside is black-box compatibility; the failure mode is also obvious. If the rubric drifts, the student optimizes the scoring sheet instead of the teacher’s behavior.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

The paper tests 13 reasoning-mode configurations on MMLU, ARC-Challenge, and GPQA, finding that 12 show positive partial correlation between reasoning trajectory length and Position Bias Score after accuracy control, with coefficients from 0.11 to 0.41 and all p-values below 0.05.

#Reasoning #Benchmarking #DeepSeek #Qwen

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, and the post gives 13 configs, 3 benchmarks, and r=0.11–0.41. This is a practical reasoning-eval warning, not a model-release-level event.

editor take

Long CoT takes another hit: 12 of 13 configs showed higher position bias with longer traces, so “think more” is not a robustness patch.

sharp

Longer reasoning did not scrub away shallow bias; it accumulated position preference inside the trace. The paper tests 13 reasoning-mode configurations across MMLU, ARC-Challenge, and GPQA, and 12 keep a positive partial correlation after accuracy control. The coefficients run from 0.11 to 0.41, with all p-values below 0.05. The truncation result is the sharper hook: R1-Qwen-7B shifts toward position-preferred answers more often when resumed later, rising from 16% to 32% across buckets. That is a nasty result for MCQ evaluation. DeepSeek-R1 671B has aggregate PBS at only 0.019, but its longest quartile still reaches 0.071. Scale is masking the expression, not deleting the mechanism. If a benchmark keeps fixed option order while scoring reasoning models, part of the score is trace-length noise wearing a lab coat.
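
For readers who want to see what "partial correlation after accuracy control" computes, here is a minimal sketch using the standard first-order partial correlation formula. The variable names and toy numbers are invented for illustration, and the summary does not say which estimator the paper uses, so this is not the authors' pipeline.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values, invented for illustration only.
trace_length = [1.0, 2.0, 3.0, 4.0]     # x: reasoning trajectory length
position_bias = [1.0, 3.0, 2.0, 4.0]    # y: Position Bias Score
accuracy_z = [1.0, -1.0, -1.0, 1.0]     # z: accuracy, as centered toy values

r = partial_corr(trace_length, position_bias, accuracy_z)
```

In this toy case the control variable is uncorrelated with both series, so the partial correlation equals the raw one. In real data the two can differ sharply, which is why the accuracy control matters.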

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Exact Is Easier: Credit Assignment for Cooperative LLM Agents

C3 fixes the full history at each decision point in cooperative LLM systems, samples alternative actions under a frozen behavior policy, and outperforms learned critics, trajectory-level baselines, and agent-removal counterfactuals across six math reasoning and code generation benchmarks.

#Agent #Reasoning #Code #EIT-EAST-Lab

why featured

HKR-H/K/R all pass: the title has a contrarian hook, C3 gives a concrete credit-assignment mechanism, and the 6-benchmark result targets multi-agent reliability. It remains an arXiv paper without production evidence, so it fits 78–84.

editor take

C3 hits a real sore spot: in LLM agent teams, deleting an agent is a crude credit signal when text history can be restored exactly.

sharp

C3’s useful move is attacking a bad MARL habit: LLM-agent state is visible text, so each decision point can be restored from full history instead of estimated by a learned critic. The evidence is specific enough: six math and code benchmarks, two model families, two multi-agent topologies, and wins over learned critics, trajectory baselines, and agent-removal counterfactuals. I buy the direction, but not the cleanest version of “exact.” The guarantee leans on a frozen behavior policy and reproducible text history. Add tool calls, retrieval, sandbox execution, or external APIs, and hidden state creeps back in. For AutoGen- or LangGraph-style systems, C3 reads more like a diagnostic scalpel than a production training loop you can drop in tomorrow.
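
One way to read the mechanism, sketched under loud assumptions: hold the text history fixed, score the action that was actually taken, resample alternatives from the frozen policy at the same point, and credit the difference. The function names, the toy policy, and the toy return below are all hypothetical; the summary does not give C3's actual estimator.

```python
import random

def counterfactual_credit(history, taken_action, sample_action, rollout_return,
                          n_samples=8, seed=0):
    """Return of the taken action minus the mean return of resampled alternatives.

    history        : full text history at the decision point, held fixed
    sample_action  : frozen behavior policy, called as sample_action(history, rng)
    rollout_return : hypothetical evaluator, called as rollout_return(history, action)
    """
    rng = random.Random(seed)
    taken = rollout_return(history, taken_action)
    alternatives = [rollout_return(history, sample_action(history, rng))
                    for _ in range(n_samples)]
    return taken - sum(alternatives) / len(alternatives)

# Toy stand-ins, invented for illustration.
def toy_policy(history, rng):
    return rng.choice(["verify", "guess"])

def toy_return(history, action):
    return 1.0 if action == "verify" else 0.0

credit = counterfactual_credit("Q: 2+2? A1: 4.", "verify", toy_policy, toy_return)
```

The sketch also shows where the "exact" claim gets fragile: `rollout_return` only means something if replaying `history` really restores the state, which is the tool-call and external-API caveat above.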

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→The Position Curse: LLMs Struggle to Locate the Last Few Items in a List

The paper defines the Position Curse: LLMs can retrieve a single fact from hundreds of thousands of irrelevant tokens, yet fail on the last items of short lists; Claude Opus 4.6 misidentifies the second-to-last line in a two-line code snippet most of the time.

#Reasoning #Code #Fine-tuning #Claude Opus 4.6

why featured

HKR-H/K/R all pass: the title has a counterintuitive failure mode, and the summary gives the tail-item mechanism plus a Claude Opus 4.6 two-line-code miss. Practical for prompting, RAG, and code review, but not a model or platform release.

editor take

Needle-in-haystack demos look cheap when Claude Opus 4.6 often misses the second-to-last line in a two-line snippet.

sharp

Position Curse exposes the fake comfort in long-context evals: models can retrieve a buried fact, then fail at nearby positional indexing. The paper’s sharp hook is brutal: Claude Opus 4.6 misidentifies the second-to-last line in a two-line code snippet most of the time. The setup tests both item-from-position and position-from-item queries, with backward retrieval consistently lagging forward retrieval. That matters for coding agents. Real code edits depend on locating the Nth line, the last argument, or the token after an anchor, not just recalling a fact from a repo. LoRA on PosBench improves both directions and transfers to PyIndex, but absolute performance stays unsaturated. I don’t buy million-token context demos as evidence of code understanding until models stop tripping over list endings.
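
To make the two query types concrete, here is a small sketch that builds forward and backward probes with ground truth taken from ordinary Python indexing. The question wording and the `make_probes` helper are my own; they are not the PosBench or PyIndex format.

```python
def make_probes(items, k):
    """Build position probes with ground-truth answers; k is 1-based."""
    if len(set(items)) != len(items):
        raise ValueError("items must be unique for position-from-item probes")
    if not 1 <= k <= len(items):
        raise ValueError("k out of range")
    forward_item, backward_item = items[k - 1], items[-k]
    return {
        "item_from_position_forward":
            (f"Which item is number {k} from the start?", forward_item),
        "item_from_position_backward":
            (f"Which item is number {k} from the end?", backward_item),
        "position_from_item_forward":
            (f"Counting from the start, where is {forward_item!r}?", k),
        "position_from_item_backward":
            (f"Counting from the end, where is {backward_item!r}?", k),
    }

snippet = ["import os", "print(os.getcwd())"]
probes = make_probes(snippet, k=2)
question, answer = probes["item_from_position_backward"]
# answer is "import os": the second-to-last line of a two-line snippet is its first line.
```

The two-line case is the whole point: the backward probe has a trivially checkable answer, so a miss cannot be blamed on context length.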

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents

SHARP replaces free-form prompt optimization with human-readable condition-action rules for trading agents, evaluates across three equity sectors and four LLM backbones, and reports average gains of 10 to 20 percentage points for compact models such as GPT-4o-mini.

#Agent #Reasoning #Alignment #SHARP

why featured

HKR-H/K/R all pass: the hook is auditable self-evolving rules for trading agents, with 10-20 point gains across 3 sectors and 4 LLMs. As a single arXiv paper needing replication, it stays at 80.

editor take

SHARP cages trading-agent self-improvement inside auditable rules; I like the shape, but don’t buy 10–20 points until costs and slippage show up.

sharp

SHARP targets the right failure mode: free-form self-editing turns noisy P&L into fake lessons. Its condition-action rubric, cross-sample attribution agent, and walk-forward validation are a cleaner loop than the usual “reflect on losing trades” agent recipe. I still discount the 10–20 percentage-point gain. The abstract names three equity sectors, four LLM backbones, and compact models like GPT-4o-mini, but gives no trading costs, slippage, rebalance cadence, or date range. Finance benchmarks can make auditability read like alpha. Compared with the pile of LLM-trading papers from the last year, the useful part here is the bounded rule-editing mechanism, not the headline return lift.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Direction-Flipped Influence Audits Reveal Hidden Structure in Moral Choices of LLMs

The paper tests five LLM families with direction-flipped influence audits across moral triage, BBQ, and DailyDilemmas, finding that short contextual cues shift per-condition choice rates by 12–18 percentage points on average and that 78% of significant backfire trials show stated-versus-revealed inconsistency.

#Reasoning #Alignment #Safety #Research release

why featured

HKR-H/K/R all pass: the method is counterintuitive, and the post gives 5 families, 12-18 pp shifts, and 78% mismatch. As a single arXiv paper it is not must-write, but it is strong safety-eval signal.

editor take

Stop treating moral benchmark scores like model personality tests; a 12–18 point cue swing is enough to puncture the leaderboard theater.

sharp

This paper lands because it turns “LLM moral preference” back into a prompt-sensitivity problem. The authors test five LLM families across moral triage, BBQ, and DailyDilemmas with direction-flipped influence pairs. Short contextual cues shift choice rates by 12–18 percentage points on average, and about 40% of baseline-neutral triage and BBQ conditions show directional asymmetry. That is too large to wave away as evaluation noise. The sharper hit is the 78% figure. In significant backfire trials, models often recognize the cue, then deny it affected their choice. A lot of safety reporting still leans on model self-explanations as if they reveal the decision process. This result says the explanation text and the revealed behavior are separate artifacts. Reasoning does not remove the sensitivity; it changes which cues bite, weakening social pressure while strengthening few-shot demonstrations.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→A Systematic Investigation of the RL-Jailbreaker in LLMs

The paper decomposes RL-jailbreaker into reward functions, action spaces, episode length, RL algorithms, training data, and reward shaping, and reports that it compromised all targeted models and safeguards, with dense rewards and longer episode lengths identified as the main drivers of attack success.

#Safety #Alignment #Reasoning #Research release

why featured

HKR-H/K/R all pass: the attack claim is clickable, the paper names testable RL variables, and jailbreak robustness is a practitioner nerve. Single-source arXiv and no disclosed target-model list keep it at the lower good-quality band.

editor take

RL jailbreaks don’t need magic; dense rewards and longer episodes did the work. Single-turn safety evals are fighting last year’s attack surface.

sharp

This paper pulls RL jailbreaking out of prompt folklore and into knobs defenders can test: reward function, action space, episode length, RL algorithm, training data, and reward shaping. The authors say the RL-jailbreaker compromised every targeted model and safeguard, and the main drivers were dense rewards plus longer episodes, not some exotic RL trick. The snippet does not disclose target model names, success rates, or episode lengths, so “all targeted models” should not be casually projected onto GPT-5 or Claude Sonnet 4.5. The uncomfortable part is the eval mismatch. Many safety dashboards still score static prompts and one-shot refusal behavior. This attack class optimizes across a feedback loop. It sits closer to PAIR and Tree-of-Attacks than to old jailbreak strings; the power comes from iteration, not a clever phrase.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Don't Retrain, Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment

REPR-ALIGN aligns each DLM layer’s hidden states to a frozen autoregressive model using cosine similarity while keeping the standard masked denoising objective, and reports up to 4x training acceleration without adapters or architectural changes beyond the attention mask.

#Fine-tuning #Inference-opt #Fred Zhangzhi Peng #Alexis Fox

why featured

HKR-H/K/R all pass: no-retrain AR-to-DLM adaptation is a strong hook, with layerwise cosine alignment and up to 4x speedup. Single arXiv paper with no broad replication keeps it in the lower 78–84 band.

editor take

REPR-ALIGN’s 4x speedup is tempting, but it buys DLM training efficiency by leaning hard on an identical AR teacher.

sharp

REPR-ALIGN makes a clean bet: DLMs should inherit language geometry from AR models and relearn generation order. I buy the direction, not the full implied win. The method freezes an identical AR model, aligns every DLM layer’s hidden states with cosine similarity, and still trains with masked denoising. No adapters, no architecture change beyond the attention mask, and the paper reports up to 4x training acceleration, especially in low-data regimes. The constraint is doing a lot of work. This is not a native-DLM recipe in the Dream or LLaDA sense; it is a migration path for existing AR checkpoints. The abstract gives the 4x number, but not the scale, benchmark mix, or quality trade-off behind it.
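
A minimal sketch of what a layerwise cosine alignment term could look like, in plain Python so the arithmetic is visible. The `1 - cosine` form, the averaging over layers and tokens, and the `weight` parameter are my assumptions; the abstract only says cosine similarity is used alongside masked denoising.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(student_layers, teacher_layers):
    """Mean (1 - cosine) over every layer and token position.

    Each argument is a list of layers; each layer is a list of per-token
    hidden-state vectors. Teacher states come from the frozen AR model.
    """
    terms = [1.0 - cosine(s, t)
             for s_layer, t_layer in zip(student_layers, teacher_layers)
             for s, t in zip(s_layer, t_layer)]
    return sum(terms) / len(terms)

def total_loss(denoising_loss, student_layers, teacher_layers, weight=1.0):
    """Masked-denoising loss plus a weighted alignment term (weight is assumed)."""
    return denoising_loss + weight * alignment_loss(student_layers, teacher_layers)

# Toy hidden states: one layer, two token positions, two dimensions.
teacher = [[[1.0, 0.0], [0.0, 2.0]]]
student = [[[0.0, 1.0], [3.0, 0.0]]]   # orthogonal to the teacher at both positions
```

The layer-by-layer `zip` is where the "identical AR teacher" constraint bites: the pairing only makes sense when both models have the same depth and hidden size.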

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Rep2Text: Decoding Full Text from a Single LLM Token Representation

Rep2Text maps a target model’s last-token representation into a decoding model’s embedding space with a trainable adapter, and across Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, and other pairings it recovers roughly half of the tokens in 16-token sequences on average.

#Interpretability #Safety #Benchmarking #Llama

why featured

HKR-H/K/R all pass: the paper claims one-token hidden-state text recovery, gives an adapter method, and reports about half recovery on 16-token sequences. It is still an arXiv research item, not a product incident or major lab release, so it stays in 78-84 featured, not p1.

editor take

Rep2Text punctures the comforting myth that hidden states are harmless: half of a 16-token prompt coming back is a privacy bug, not interpretability trivia.

sharp

Rep2Text is less an interpretability trick than a warning label for activation logging. The method trains an adapter from a target model’s last-token representation into a decoder model’s embedding space, then autoregressively reconstructs text. Across Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, and other pairings, it recovers roughly half the tokens in 16-token sequences on average, while keeping semantics coherent. That is not enough to rebuild every prompt. It is enough to wreck the line that “we only store hidden states, not user text.” The clinical OOD result is the nasty part: this leakage is not confined to toy web text. Any product caching activations for routing, observability, personalization, or audits should treat them closer to plaintext than telemetry.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

The paper challenges the Biasing Features metric for judging CoT faithfulness on multi-hop reasoning tasks, where over 50% of samples flagged as unfaithful in some models are judged faithful by other metrics, and its faithful@k metric shows larger inference-time budgets raise hint verbalization to 90% in some settings.

#Reasoning #Interpretability #Benchmarking #arXiv

why featured

HKR-H/K/R all pass: the paper has a contrarian CoT-explainability hook, concrete 50% and 90% claims, and clear resonance for eval trust. As a single arXiv paper without major-lab backing or artifact detail, it stays in the 78–84 band.

editor take

Calling CoT unfaithful just because it omits the hint was always too brittle; this paper fixes a metric, not the trust problem.

sharp

This paper lands a clean hit on hint-based CoT faithfulness tests. Biasing Features labels a CoT unfaithful when a prompt-injected hint affects the answer but is not verbalized; the authors show that, on multi-hop tasks, over 50% of those flagged samples are judged faithful by other metrics in some models. Their faithful@k result is the sharpest hook: larger inference budgets push hint verbalization up to 90% in some settings. I buy the attack on the metric; I do not buy the broader comfort people will draw from it. Causal Mediation Analysis showing a non-verbalized hint flowing through CoT is evidence of partial causal relevance, not audit-grade explainability. For safety work, the useful bar is whether CoT reliably exposes failure modes under pressure, not whether a hidden feature can be recovered with more tokens.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

The paper tests Qwen3 models at three scales on GSM8K, MATH-500, and five BIG-Bench Hard tasks, finding that non-thinking mode matches or beats thinking mode on GSM8K and MATH-500 at every budget up to 2048 tokens, while split-budget generation raises full MATH-500 accuracy to 83.6% with a fixed SC+IRIS gate.

#Reasoning #Benchmarking #Inference-opt #Qwen

why featured

HKR-H/K/R all pass: the title is contrarian, and the article gives testable numbers across 2048-token runs, GSM8K/MATH-500/BBH, and 83.6%. Single arXiv paper with no cluster or major-lab release keeps it in the quality featured band.

editor take

Visible CoT is not free reasoning; under 2048 tokens, Qwen3 often does better when it stops narrating and saves room for the answer.

sharp

This paper lands a clean hit on visible CoT: under a fixed output cap, reasoning text competes with the answer. The authors test three Qwen3 sizes on GSM8K, MATH-500, and five BIG-Bench Hard tasks; up to 2048 tokens, non-thinking matches or beats thinking on GSM8K and MATH-500. The replication on DeepSeek-R1-Distill-Llama-8B matters because it weakens the “Qwen interface quirk” excuse. The mitigation is also concrete, not vibes. Split-budget generation gets IRIS to 74.0% on full MATH-500, a stronger extraction variant to 78.8%, and a fixed SC+IRIS gate to 83.6%. I buy the framing: test-time reasoning is token accounting before it is model intelligence theater.
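
The token-accounting point can be shown with a few lines of arithmetic. This is a toy model of a shared cap versus a reserved answer budget, with hypothetical function names and numbers; it does not reproduce the paper's split-budget procedure or the SC+IRIS gate.

```python
def shared_budget(cap, thinking_wanted, answer_wanted):
    """One output cap: the reasoning trace is emitted first and eats into the answer."""
    thinking = min(thinking_wanted, cap)
    answer = min(answer_wanted, cap - thinking)
    return thinking, answer

def split_budget(cap, answer_reserve, thinking_wanted, answer_wanted):
    """Reserve part of the cap for the answer; thinking gets only the remainder."""
    thinking = min(thinking_wanted, cap - answer_reserve)
    answer = min(answer_wanted, cap - thinking)
    return thinking, answer

# Hypothetical demand: a trace that wants 3000 tokens and an answer that needs 200.
coupled = shared_budget(2048, 3000, 200)
decoupled = split_budget(2048, 256, 3000, 200)
```

Under the shared cap the trace takes all 2048 tokens and the answer gets none; with 256 tokens reserved, the trace is cut to 1792 and the full 200-token answer fits.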

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Topic Is Not Agenda: A Citation-Community Audit of Text Embeddings

The paper audits four embedding models on a 3.58M-paper citation graph, finding only 15-21% top-10 same-rate for L2 research agendas, so about 8 of 10 retrieved papers are off-agenda under cosine similarity.

#RAG #Embedding #Benchmarking #Gemini

why featured

HKR-H/K/R all pass: the hook is counterintuitive, the paper gives 3.58M papers and 15–21% same-class rates, and it matters to RAG retrieval quality; as a single arXiv paper, it fits the 78–84 band.

editor take

A 3.58M-paper audit punches a hole in the lazy RAG bet: semantic-near is not agenda-near, and single-vector cosine is too blunt for science search.

sharp

This paper nails a failure mode many scientific RAG demos hide: embeddings catch topic, not agenda. On a 3.58M-paper citation graph, Gemini, Qwen3-8B, Qwen3-0.6B, and SPECTER2 reach 45-52% top-10 same-rate at L1 sub-field level. At L2 research-agenda level, they fall to 15-21%, so roughly 8 of 10 cosine neighbors miss the agenda. The awkward part is SPECTER2: even citation-based contrastive training does not save it. A plain citation-count rerank hits 57.7% top-1 L2 over LLM-expanded Boolean retrieval and 59.6% over BM25, beating Gemini cosine at 50.6%. I read this as a product warning, not just an IR paper: paper copilots need graph signals and explicit retrieval constraints in the main path, not another round of vector-store tuning.
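
For anyone who wants to run the same kind of check on their own corpus, here is a sketch of a top-k same-rate: for each item, the share of its k nearest cosine neighbors that carry the same community label. The exact definition in the paper may differ, and the toy vectors and labels are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_same_rate(embeddings, labels, k=10):
    """Average share of each item's k nearest cosine neighbors with the same label."""
    rates = []
    for i, query in enumerate(embeddings):
        scored = sorted(
            ((cosine(query, other), j) for j, other in enumerate(embeddings) if j != i),
            reverse=True,
        )
        neighbors = [j for _, j in scored[:k]]
        rates.append(sum(labels[j] == labels[i] for j in neighbors) / len(neighbors))
    return sum(rates) / len(rates)

# Toy data, invented for illustration: two tight clusters in 2-D.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
by_topic = ["a", "a", "b", "b"]      # labels that follow the geometry
by_agenda = ["a", "b", "a", "b"]     # labels that cut across it
```

With k=1 the topic labels score 1.0 and the agenda labels score 0.0 on the same vectors, which is the failure mode in miniature. At a 15-21% same-rate, the expected number of on-agenda papers among 10 neighbors is 1.5 to 2.1.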

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development

MobileDev-Bench comprises 407 real-world issue-resolution tasks from 19 production mobile apps, and four frontier LLMs achieve only 3.23% to 4.23% end-to-end resolution under automated retrieval.

#Code #Benchmarking #Agent #Claude

why featured

HKR-H/K/R pass: a 3.23%-4.23% solve rate on real mobile app fixes gives coding agents a concrete stress test. Single arXiv paper limits reach, so it lands at the lower featured band.

editor take

MobileDev-Bench is a brutal reality check: Claude Sonnet 4.5 and GPT-5.2 still solve real mobile app bugs at roughly 4%.

sharp

MobileDev-Bench exposes the part coding-agent demos keep dodging: real mobile apps are build systems, resources, configs, and framework glue, not neat Python patches. The benchmark has 407 real issues from 19 production apps, with fixes touching 12.9 files and 334.6 lines on average. More importantly, 41% require coordinated edits across source, build configuration, and resource artifacts. Claude Sonnet 4.5, Qwen3-Coder, GPT-5.2, and Gemini 2.5 Flash land at only 3.23% to 4.23% end-to-end resolution with automated retrieval; oracle retrieval tops out at 5.69%. SWE-bench trained the market to talk about patch generation. Mobile work demands environment survival across Android Native, React Native, and Flutter. That 4% number is ugly, but it matches the mess practitioners actually ship into.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

The study tested 33 frontier LLMs from eight model families on 1,500 MMLU items and found aggregate metacognitive scores hide domain-level variation: Applied/Professional knowledge had mean AUROC .742, while Formal Reasoning and Natural Science were the hardest monitored domains.

#Reasoning #Benchmarking #Safety #Anthropic

why featured

HKR-H/K/R pass: the 33-model atlas is a hook, 1,500 MMLU items and AUROC .742 add concrete evidence, and domain calibration maps to eval, safety, and deployment risk. Single arXiv paper, so 78–84 fits.

editor take

Stop shipping one confidence score as if it travels; across 33 models, self-monitoring fractures by domain.

sharp

A single confidence policy gets exposed here. The paper tests 33 frontier LLMs on 1,500 MMLU items, yielding 47,151 observations, and shows models monitor themselves far better in Applied/Professional knowledge, with mean AUROC .742. Formal Reasoning and Natural Science are the ugly zones: one of them lands in the bottom two domains for 27 of 33 models. That matters for agent gating. Plenty of products still treat “self-rated confidence plus a refusal threshold” as a portable guardrail. This paper gives a nasty counterexample: three models marked Invalid on binary KEEP/WITHDRAW probes produced normal profiles under 0-100 verbalized confidence. Format and domain both change the safety signal. OpenAI also lacks significant within-family profile clustering here, while Anthropic, Gemini, and Qwen show it at permutation p < .0001. Aggregate calibration is a risk-masking device.
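
The per-domain breakdown is easy to reproduce on your own logs. The sketch below computes AUROC as the probability that a correct answer gets higher stated confidence than an incorrect one, with ties counted as half, then groups it by domain. The record layout and the example rows are hypothetical.

```python
from collections import defaultdict

def auroc(confidences, correct):
    """P(confidence on a correct item > confidence on an incorrect item); ties count 0.5."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None  # undefined when a domain has only one outcome class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_by_domain(records):
    """records: iterable of (domain, confidence, is_correct) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for domain, confidence, is_correct in records:
        grouped[domain][0].append(confidence)
        grouped[domain][1].append(is_correct)
    return {domain: auroc(confs, oks) for domain, (confs, oks) in grouped.items()}

# Hypothetical log rows, invented for illustration.
rows = [
    ("applied", 0.9, True), ("applied", 0.8, True),
    ("applied", 0.3, False), ("applied", 0.2, False),
    ("formal", 0.9, True), ("formal", 0.2, False),
    ("formal", 0.8, False), ("formal", 0.3, True),
]
per_domain = auroc_by_domain(rows)
```

Pooling the two toy domains would blur a perfect monitor and a mediocre one into a single number, which is the masking effect described above.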

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Research paper proposes KV cache offloading method for context-intensive tasks

The paper releases the Text2JSON benchmark and evaluates KV cache offloading on Llama 3 and Qwen 3 for context-intensive tasks that require extracting structured information from long prompts. It reports significant accuracy degradation, identifies low-rank key projection and unreliable landmarks as two failure causes, and proposes a simpler strategy that improves accuracy across multiple model families and benchmarks.

#Inference-opt #Benchmarking #Llama 3 #Qwen 3

why featured

HKR-H and HKR-K pass: the paper gives a testable warning for long-context inference optimization via a new benchmark and named failure modes. The KV-cache focus is narrow, so it stays below featured.

editor take

Two arXiv entries, one source chain; the hit is not memory savings, it is a clean warning that long-context compression gets exposed on extraction-heavy work.

sharp

Both entries point to the same arXiv paper, 2604.08426, with the same title and abstract; this is a single-paper signal, not independent confirmation. The paper tests KV cache offloading on Text2JSON plus context-intensive tasks using Llama 3 and Qwen 3, and the painful claim is clear: low-rank key projection and unreliable landmarks break down when the prompt behaves like a database. I buy the skepticism here. Long-context infra has spent two years selling “same accuracy, lower memory,” while many benchmarks only ask the model to retrieve one or two needles. Text2JSON-style structured extraction is closer to enterprise RAG, contract review, and log analysis. The abstract does not disclose exact accuracy drops, so don’t overread it as a kill shot. But it raises the bar for any KV offloading paper that still hides behind easy retrieval tests.

HKR breakdown

hook ✓knowledge ✓resonance —

SCORE

H1·K1·R0

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation

FAAST compiles labeled examples into fast weights in one forward-only pass, keeps inference constant-time without memory or context dependence, cuts adaptation time by over 90%, and saves up to 95% memory across image classification and language modeling benchmarks.

#Fine-tuning #Inference-opt #Memory #FAAST

why featured

HKR-H/K/R all pass: FAAST gives a concrete mechanism and numbers for test-time adaptation. It stays at 78 because this is a single arXiv method without major-lab backing or independent replication.

editor take

FAAST compiles labeled examples into fast weights in one pass; if the 90% adaptation speedup holds, supervised few-shot tuning loses a lot of backprop tax.

sharp

FAAST’s sharp claim is that supervised adaptation can leave the training loop. The mechanism is concrete: labeled examples are parsed in one forward-only pass, compiled through a closed-form solution into fast weights, then inference stays constant-time without stored examples or context dependence. The paper reports over 90% lower adaptation time and up to 95% memory savings. I’d place it between LoRA, in-context learning, and kNN-style memory. LoRA pays backprop cost, ICL pays context cost, memory methods pay retrieval and storage cost. If FAAST keeps matching backprop adaptation on language modeling, small and resource-constrained models get a very practical path. The abstract does not expose model sizes, benchmark mix, or failure cases, so the 90% number is not a universal coupon yet.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

MISA replaces the DSA indexer with a mixture-of-experts router that activates eight heads, matches the dense DSA indexer on LongBench for DeepSeek-V3.2 and GLM-5, preserves Needle-in-a-Haystack results up to 128K tokens, recovers over 92% of DSA-selected tokens per layer, and speeds the indexer kernel by 3.82x on one NVIDIA H200 GPU.

#Inference-opt #Benchmarking #Ruijie Zhou #DeepSeek

why featured

HKR-H comes from the 3.82x long-context speed hook; HKR-K has mechanism and benchmark details; HKR-R hits inference cost. Single arXiv paper and no code/deployment data keep it at 78.

editor take

MISA attacks the boring tax in long-context inference: not 128K bragging, but cutting the DSA indexer cost.

sharp

MISA’s useful claim is narrow and credible: long-context cost has moved from attention slogans to the indexer bill. DeepSeek-V3.2’s DSA uses 64 indexer heads to score prefix tokens; MISA routes each query to 8 active heads, matches dense DSA on LongBench, preserves Needle-in-a-Haystack results up to 128K tokens, and recovers over 92% of DSA-selected tokens per layer. That is a real systems hook. I would not read the 3.82x as end-to-end inference speedup. The paper reports an indexer-kernel gain on one NVIDIA H200, not full decode throughput, batch settings, or KV-cache pressure. Compared with a lot of long-context papers, this smells more deployable because it is a drop-in DSA indexer replacement. The production question is simple: inside vLLM- or SGLang-style serving, how much wall time does the indexer actually own?

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Unsolvability Ceiling in Multi-LLM Routing: An Empirical Study of Evaluation Artifacts

The paper evaluates 206,000 query-model pairs across six benchmarks and finds that much of the reported unsolvability ceiling in multi-LLM routing comes from evaluation artifacts: judge bias toward verbosity, truncation under fixed generation budgets, and output-format mismatches.

#Benchmarking #Inference-opt #Gemma #Llama

why featured

HKR-H/K/R all pass: 206,000 query-model pairs across six benchmarks give concrete evidence, and the artifact finding matters for routing evals. Single arXiv paper, so it lands in the 78–84 band, not must-write.

editor take

Another routing paper knifes the eval stack: a lot of “unsolvable” queries are judge verbosity bias, truncation, and format mismatch.

sharp

Multi-LLM routing has been over-penalized by dirty labels, not just weak routers. The paper tests 206,000 query-model pairs across MMLU, MedQA, HumanEval, MBPP, Alpaca, and ShareGPT, using Gemma 4 and Llama 3.1 families. Its claim is concrete: verbosity bias in LLM judges, fixed-budget truncation, and output-format mismatches inflate the “unsolvable” bucket. The nastiest finding is the training collapse. Standard routers fall into majority-class behavior, with the smallest tier marked optimal about 79% of the time. Random-feature and shuffled-label controls reproduce the pattern, and the paper puts the opportunity cost at 13–17 percentage points. I’d file this under LLM-as-judge technical debt before blaming routing algorithms. A lot of cost-saving stacks are not missing a clever router; their eval pipeline is drawing the cost-quality frontier wrong.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Dataset Watermarking for Closed LLMs with Provable Detection

The paper introduces a dataset watermarking method for closed LLMs: it raises co-occurrence rates of randomly selected word pairs through rephrasing, then detects the signal with a statistical test on generated outputs. Experiments across multiple base models and benchmarks report reliable fine-tuning-stage detection at p<0.01, including mixtures where the watermarked dataset is about 1% of fine-tuning tokens.

#Safety #Benchmarking #Fine-tuning #Research release

why featured

HKR-H/K/R all pass: the paper claims closed-model auditability, with word-pair co-occurrence rewriting, p<0.01, and a 1% token condition. Single arXiv work lacks broad replication or adoption, so it sits at the low end of 78-84.

editor take

Closed models now leave dataset fingerprints; 1% of fine-tuning tokens is enough to make benchmark laundering a lot less comfortable.

sharp

This paper moves dataset watermarking into the closed-API setting, and the sharp part is that it never needs weights. The method rephrases data to raise co-occurrence of random word pairs, then tests generated outputs statistically. The hard numbers in the abstract are p<0.01 detection and survival when the watermarked set is about 1% of fine-tuning tokens. I care more about benchmark-contamination pressure than copyright enforcement here. Closed labs have long hidden behind “you cannot inspect the training set.” This shifts the evidence to output distributions. The doubts are concrete: query volume, adversarial paraphrasing, and signal decay after RLHF are not in the snippet. If this enters real benchmarks, operating cost and false positives matter more than the p-value.
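
To show the shape of an output-only test, here is a sketch that counts how often a chosen word pair co-occurs in generated texts and computes a one-sided binomial p-value against an assumed baseline rate. The whitespace tokenization, the baseline rate, and the choice of a binomial test are my assumptions; the snippet does not say which statistical test the paper uses.

```python
import math

def cooccurrence_hits(texts, pair):
    """Count texts that contain both words of the pair (whitespace tokenization)."""
    first, second = pair
    hits = 0
    for text in texts:
        words = set(text.lower().split())
        if first in words and second in words:
            hits += 1
    return hits

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): a one-sided p-value."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical generated outputs and a hypothetical watermark pair.
outputs = ["the amber river bends", "amber light", "River of Amber"]
pair = ("amber", "river")
baseline_rate = 0.05   # assumed co-occurrence rate for an unwatermarked model

hits = cooccurrence_hits(outputs, pair)
p_value = binomial_tail(hits, len(outputs), baseline_rate)
```

The sketch also makes the cost question visible: the p-value depends on how many outputs you can afford to sample and on how well you know the baseline rate, which is where false positives would come from.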

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

04:00

29d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·11

→Why Does Agentic Safety Fail to Generalize Across Tasks?

The paper analyzes linear-quadratic control with H-infinity robustness and tests simulated quadcopter navigation plus CRM LLM agents, showing that adding safety requirements raises the Lipschitz constant of the mapping from task specification to an optimal controller.

#Agent #Safety #Alignment #Research release

why featured

HKR-H/K/R all pass: the title has a sharp failure question, the post gives a Lipschitz mechanism, and tests LQR, quadrotor simulation, and a CRM LLM agent. Single arXiv paper, technical, no effect sizes disclosed, so it stays in the lower featured band.

editor take

This paper takes “safety doesn’t generalize” out of the training-failure bucket and pins it on task-to-controller complexity.

sharp

Safety failures across tasks are not just a data or RL-tuning problem here; the paper claims a harder mechanism. Adding safety constraints raises the Lipschitz constant of the mapping from task specification to optimal controller. The authors prove it in LQ control with H-infinity robustness, then test simulated quadcopter navigation and a CRM LLM agent. That spans continuous control and tool-using agents, which makes the claim harder to dismiss. I buy the direction because it matches the agent eval mess this year: models often complete new tasks, then fail on boundary conditions. The catch is that the snippet gives no constant gap, task count, LLM name, or CRM setup details. If the empirical side is a tiny toy CRM suite, the headline claim shrinks fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→LLMSpace models carbon footprint of large language model inference on LEO satellites

LLMSpace models operational and embodied carbon for LLM inference on LEO satellites, covering launch emissions, satellite manufacturing, radiation-hardened accelerators and memory, prefill-decode behavior, and token generation, with source code disclosed on GitHub.

#Inference-opt#LLMSpace#UnchartedRLab#Research release

why featured

HKR-H and HKR-K pass: the LEO-satellite angle is novel, and the carbon model has concrete components plus code. The use case is narrow and not a major model or product release, so it stays in the 60–71 band.

editor take

Both sources are the same arXiv paper; space-based LLM inference sounds sci-fi, but the useful move is charging launch, manufacturing, and rad-hard hardware to the bill.

sharp

The two entries trace to the same arXiv record with the identical headline, so this is a single-paper chain, not independent media confirmation. LLMSpace models both operational and embodied carbon for LLM inference on LEO satellites; the abstract gives 12 pages, 4 figures, 6 tables, and code, but no concrete CO₂ numbers or model sizes. I like the paper because it attacks the lazy “solar compute is clean compute” pitch. Launch emissions, satellite manufacturing, peripheral subsystems, and radiation-hardened GPUs and memory all enter the carbon ledger. Ground data-center debates often get stuck at PUE; LEO inference starts with lifetime, orbit, and hardware constraints already baked in. This is a narrow systems paper, but it is a useful brake on the Starlink-style edge-AI story.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

BAM formulates positional encoding as a probabilistic prior, unifies NoPE and ALiBi, and reports accurate information retrieval at 500× the training context length while maintaining comparable perplexity; the abstract does not disclose model size or datasets.

#Reasoning#Inference-opt#Benchmarking#arXiv

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper and the post does not disclose model scale or datasets. Strong long-context hook, capped at the featured lower band.

editor take

500× extrapolation is a loud claim, but no model size or dataset is disclosed; BAM reads like a strong PE thesis, not a long-context product plan.

sharp

BAM’s sharp move is not proposing another positional encoding. It folds NoPE and ALiBi into one probabilistic-prior view, then uses a Generalized Gaussian prior to claim 500× training-length extrapolation. That number is huge because long-context papers often look clean on needle retrieval while perplexity or real tasks fall apart; BAM at least claims retrieval accuracy and comparable perplexity together. I’m cautious here: the RSS text gives no model size, training length, or dataset. It also does not say whether retrieval is single-needle, multi-needle, or distractor-heavy. RoPE scaling, YaRN, and LongRoPE all rode similar benchmark wins before engineering reality hit KV cache cost and task transfer. If BAM only wins on small synthetic retrieval, it is a useful theory paper. If it reproduces at 7B or 32B scale, it belongs in the engineering menu.
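The prior view can be illustrated with a small NumPy sketch. It assumes the prior enters as an additive log-bias on attention logits and uses an unnormalized generalized Gaussian over relative distance; the parameter names `alpha` and `beta` are chosen here and may not match the paper's parameterization. A zero bias (flat prior) plays the role of NoPE, and `beta = 1` gives a linear, ALiBi-style penalty.

```python
import numpy as np

def log_prior_bias(seq_len, alpha, beta):
    """Log of an unnormalized generalized Gaussian prior over relative distance.

    bias[i, j] = -(|i - j| / alpha) ** beta
    beta = 1 gives a linear penalty with slope 1 / alpha (ALiBi-style).
    """
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -(dist / alpha) ** beta

def causal_attention(q, k, v, bias):
    """Softmax attention with an additive positional log-prior and a causal mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Passing `np.zeros((n, n))` as `bias` recovers plain causal attention with no positional signal, which makes it easy to compare the three cases side by side.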

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Reliable Chain-of-Thought via Prefix Consistency

The paper introduces prefix consistency, which truncates a CoT trace, regenerates the remainder, and weights candidate answers by reappearance frequency; across five reasoning models and four math and science benchmarks, it matches standard majority-vote plateau accuracy with a median 4.6x fewer tokens and up to 21x fewer tokens.

#Reasoning#Benchmarking#Research release#Open source

why featured

HKR-H/K/R all pass: the method is concrete, the token-saving numbers are testable, and reasoning cost is a live practitioner pain point. Single arXiv paper status keeps it below the 78–84 band.

editor take

Prefix consistency is a clean inference hack: don’t ask the model if it’s confident; cut the reasoning and see whether the answer survives.

sharp

Prefix consistency turns CoT confidence into a stability test, and I buy the engineering shape of it. The method truncates a reasoning trace, regenerates the suffix, and weights answers by reappearance frequency. Across five reasoning models and four math/science benchmarks, it reaches standard majority-vote plateau accuracy with 4.6x fewer tokens at median, up to 21x. The clean part is that it needs no token logprobs and no self-rating prompt, which matters for closed APIs where probabilities are missing or unreliable. My caution is scope: the abstract only names math and science benchmarks, not coding, long-horizon agents, or tool-use workflows. If the signal only holds on short-answer reasoning, it is a benchmark-side inference trick. If it holds under tools, it becomes a serious budget scheduler for test-time compute.
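The truncate-and-regenerate loop described above might look like this in outline. The `regenerate` callable is hypothetical (it stands in for a model call that continues a truncated trace and returns a final answer), and the cut fractions and sample counts are placeholders, not the paper's settings.

```python
from collections import Counter

def prefix_consistency(trace_steps, regenerate,
                       cut_fractions=(0.25, 0.5, 0.75), samples_per_cut=4):
    """Weight candidate answers by how often they reappear after truncation.

    trace_steps: list of reasoning steps from one chain-of-thought trace.
    regenerate:  hypothetical callable, prefix (list of steps) -> final answer.
    """
    votes = Counter()
    for frac in cut_fractions:
        cut = max(1, int(len(trace_steps) * frac))
        prefix = trace_steps[:cut]
        for _ in range(samples_per_cut):
            votes[regenerate(prefix)] += 1
    total = sum(votes.values())
    return {answer: count / total for answer, count in votes.items()}
```

Note that nothing here touches token logprobs or asks the model to rate itself, which is why the approach fits closed APIs.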

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→IntentGrasp: A Comprehensive Benchmark for Intent Understanding

The paper introduces IntentGrasp, a benchmark with 262,759 training instances, 12,909 All Set test cases, and 470 Gem Set cases, where 20 evaluated LLMs score below 25% on Gem Set and 17 fall below a 15.2% random-guess baseline.

#Benchmarking#Fine-tuning#Reasoning#IntentGrasp

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark with no disclosed code, replication setup, or cross-source pickup. The sub-25% result across 20 LLMs puts it at the high end for research benchmarks, below p1.

editor take

IntentGrasp exposes a nasty gap: 20 LLMs under 25% on Gem Set, so agent reliability claims deserve less swagger.

sharp

IntentGrasp lands because it turns “the model understands the user” into a failing test. The benchmark pulls from 49 open corpora across 12 domains, with 262,759 training instances and a 470-case Gem Set. On that small hard set, all 20 tested models stay below 25%, including GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7. The ugly number is 17 of 20 falling under the 15.2% random-guess baseline, while estimated human performance sits at 81.1%. I don’t buy the idea that general reasoning gains will clean this up by themselves. Intentional Fine-Tuning adds 30+ F1 on All Set and 20+ on Gem Set, so the gap smells like missing supervised signal, not raw model size. For agent builders, this is more damaging than another weak math benchmark.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

Star Elastic adds N nested submodels to a parent reasoning LLM through one post-training run, then selects submodels per thinking and answering phase; on NVIDIA Nemotron Nano v3 30B/3.6A, it produced 23B/2.8A and 12B/2.0A variants with 160B training tokens, reporting 360x less compute than pretraining from scratch and up to 1.9x lower latency.

#Reasoning#Inference-opt#Fine-tuning#NVIDIA

why featured

HKR-H/K/R all pass, but this is still a single arXiv paper with summary-level evidence. The NVIDIA/Nemotron 160B-token setup and 360x compute-saving claim clear featured, not must-write.

editor take

NVIDIA is selling one model as a latency ladder; 160B tokens for 12B/23B variants is neat, but router overhead decides whether this lands.

sharp

Star Elastic’s sharp claim is not compression; it turns a reasoning model into a selectable latency ladder. One 160B-token post-training run on Nemotron Nano v3 30B/3.6A yields 23B/2.8A and 12B/2.0A variants. The paper reports 360x less compute than pretraining, 7x less than SOTA compression, up to 1.9x lower latency, and 16% higher accuracy. I buy half of it. Picking different submodels for thinking and answering tracks how reasoning inference cost actually accrues, and does so better than a static small model. It also sits near speculative decoding in the serving playbook, but attacks model width instead of token acceptance. The missing bits matter: router overhead, batch size, hardware, and task mix are not in the snippet. The NVFP4 and FP8 QAD angle tells you NVIDIA is aiming at its deployment stack; if the win stays inside Nemotron, it is platform leverage, not a general recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Hallucination Detection via Activations of Open-Weight Proxy Analyzers

The paper proposes a proxy-analyzer framework that detects hallucinations by reading generated text with a small local open-weight model and using its internal activations, training a stacking ensemble on 72,135 samples and beating ReDeEP’s 0.73 token-level AUC on RAGTruth by 7.4 to 10.3 points across seven 0.5B to 9B analyzers.

#RAG#Interpretability#Safety#Qwen

why featured

HKR-H/K/R pass: the paper uses local open-weight analyzers and internal activations, with concrete sample count, model sizes, and RAGTruth AUC gains. It remains a research item without product adoption or broad coverage, so it sits mid-featured.

editor take

Stop worshipping generator self-checks; a 0.5B proxy reading activations is closer to a deployable RAG QA layer.

sharp

This paper pushes hallucination detection back into sidecar monitoring, and that is the useful part. It trains a stacking ensemble on 72,135 samples, then uses activations from seven open-weight analyzers from 0.5B to 9B. All seven beat ReDeEP’s 0.73 token-level AUC on RAGTruth by 7.4 to 10.3 points. The awkward result for scale maximalists: an 18x model-size gap produces only a 2.3-point AUC spread, and Qwen2.5-0.5B reaches 0.706 F1 versus Qwen2.5-7B at 0.717. I buy this direction more than generator self-critique. Production RAG needs cheap local monitors that can score GPT-4 or any closed API output, without asking the generator to grade itself. The caveat is dataset comfort: RAGTruth and LLM-AggreFact include multiple generator families, but enterprise retrieval noise is uglier than benchmark hallucination labels.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Prune-OPD monitors teacher-student prediction compatibility with signals such as top-k overlap, truncates rollouts after prefix drift, and reduces training time by 37.6%–68.0% on AMC, AIME, and HMMT while preserving or improving performance.

#Reasoning#Fine-tuning#Inference-opt#Research release

why featured

HKR-H/K/R pass, but this is still a single arXiv methods paper with no disclosed code artifact or cross-source discussion, so it sits above the featured threshold but below the 78+ band.

editor take

Prune-OPD attacks the lazy “longer rollout is better” habit; 37.6%–68.0% less training time is pipeline-level bait if it reproduces.

sharp

Prune-OPD makes OPD waste observable instead of treating it as a rollout-length knob. Once the student prefix drifts from the teacher’s reasoning path, dense rewards lose local value; the paper tracks teacher-student compatibility with top-k overlap, down-weights later rewards, and truncates rollouts. It reports 37.6%–68.0% less training time on AMC, AIME, and HMMT while preserving or improving scores. That is cleaner than just shortening generations, because it cuts segments where supervision has already gone stale. I’d check variance across teacher-student pairs first. If the gain lives mostly in math benchmarks or threshold choices, production OPD still needs its own drift calibration.
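A rough sketch of the drift signal, assuming per-position logits from both models are available. The `threshold` and `patience` rule for choosing the cut is an illustration invented here; the snippet only says rollouts are truncated after prefix drift.

```python
import numpy as np

def topk_overlap(teacher_logits, student_logits, k=5):
    """Fraction of shared token ids among each model's top-k, per position."""
    t_top = np.argsort(teacher_logits, axis=-1)[:, -k:]
    s_top = np.argsort(student_logits, axis=-1)[:, -k:]
    return np.array([len(set(t) & set(s)) / k for t, s in zip(t_top, s_top)])

def truncation_point(overlap, threshold=0.4, patience=3):
    """Start index of the first run of `patience` low-overlap positions.

    Returns len(overlap) when no such run exists (keep the full rollout).
    """
    run = 0
    for i, value in enumerate(overlap):
        run = run + 1 if value < threshold else 0
        if run == patience:
            return i - patience + 1
    return len(overlap)
```

Requiring several consecutive low-overlap positions avoids cutting on a single noisy token; the right values would need the per-pair calibration mentioned above.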

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

The paper proposes Prefix Sampling to steer binary-reward RL toward an approximately 50% rollout pass rate by replaying self-generated trajectory prefixes; on SWE-bench Verified, it delivers 2.01x and 1.55x end-to-end speedups on Qwen3-14B and Qwen3-32B.

#Agent#Code#Fine-tuning#Qwen

why featured

HKR-H/K/R all pass: the paper gives a concrete mechanism, a 50% pass-rate target, and SWE-bench speedups tied to code-agent RL cost. Capped at 76 because it is a specialized arXiv method without open-source or cross-source validation.

editor take

Binary-reward RL is finally paying for the half-solvable cases; Prefix Sampling is unglamorous, but 2.01x on SWE-bench is real leverage.

sharp

Prefix Sampling attacks the dumbest spend in agent RL: rollouts that all pass or all fail, leaving binary rewards with little gradient. The paper forces groups toward an approximately 50% pass rate by replaying self-generated prefixes: successful prefixes help mostly failing groups, failing prefixes handicap mostly passing groups. Replayed tokens are masked from loss, so training hits only current-policy continuations. On SWE-bench Verified, Qwen3-14B gets a 2.01x end-to-end speedup, Qwen3-32B gets 1.55x, and the 14B peak moves from 0.274 to 0.295. That is a cleaner engineering win than another leaderboard bump: same budget, fewer dead rollouts. I still want the boundary conditions. AIME 2025 only shows the same pattern on 4B and 8B, which is not the same as messy long-horizon tool use.
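Why 50% is the informative regime can be seen in a few lines. The sketch assumes group-normalized advantages (GRPO-style), which the snippet does not confirm as the paper's estimator; the Bernoulli variance argument holds regardless.

```python
import numpy as np

def group_advantages(rewards):
    """Group-normalized advantages for binary rewards.

    Returns zeros when every rollout in the group got the same reward,
    i.e. the group carries no learning signal.
    """
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:
        return np.zeros_like(r)
    return (r - r.mean()) / std

def reward_variance(pass_rate):
    """Variance of a Bernoulli reward, p * (1 - p); maximal at p = 0.5."""
    return pass_rate * (1.0 - pass_rate)
```

An all-pass or all-fail group yields an all-zero advantage vector, so its tokens contribute nothing to the update even though the rollouts were paid for.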

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→PaT: Planning-after-Trial for Efficient Test-Time Code Generation

PaT invokes a planner only after verification failure in code generation, uses a cheaper model for generation attempts and a stronger model for targeted planning, and reduces inference cost by about 69% while matching a large homogeneous model across reported benchmarks.

#Code#Reasoning#Inference-opt#Research release

why featured

HKR-H/K/R all pass: PaT gives a concrete planning-after-failure mechanism and a ~69% cost cut for code generation. Single arXiv paper limits reach, so it stays below the 78+ good-quality band.

editor take

PaT’s punch is not better planning; it is refusing to plan until failure. Code-agent cost competition is moving into routing policy.

sharp

PaT changes the default code-generation move to “try first, plan after failure,” and the reported 69% cost reduction comes from that ordering. A cheaper model handles initial attempts, while a stronger planner enters only after verification fails. That is more deployable than another long-CoT recipe. This smells like the production shape of code agents: tests, compilation, and verifier feedback route compute instead of sending every task to the priciest model. The abstract says PaT matches a large homogeneous model across multiple benchmarks and model families, but it does not name the benchmarks or absolute scores. I’d wait for the PDF before trusting the 69% cost number.
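The ordering can be sketched as a small control loop. All three callables (`cheap_generate`, `verify`, `strong_plan`) are hypothetical stand-ins for a small generator, a test or compile check, and a stronger planning model; the retry budget is a placeholder.

```python
def plan_after_trial(task, cheap_generate, verify, strong_plan, max_attempts=3):
    """Try with the cheap model first; call the planner only after a failed check.

    Returns (accepted_candidate_or_None, number_of_planner_calls).
    """
    plan = None
    planner_calls = 0
    for _ in range(max_attempts):
        candidate = cheap_generate(task, plan)   # plan is None on the first try
        ok, feedback = verify(candidate)
        if ok:
            return candidate, planner_calls
        # Only now pay for the stronger model, with the failure as context.
        plan = strong_plan(task, candidate, feedback)
        planner_calls += 1
    return None, planner_calls
```

Tasks the cheap model solves outright never touch the planner, which is where a cost reduction of this kind would come from.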

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

The paper introduces a diagnostic that builds a pairwise mutual-information graph from agents’ hidden states and applies spectral partitioning to detect coalition boundaries, validating it in two settings: multi-agent reinforcement learning environments and a large language model prompted with team descriptions and reassignment scenarios.

#Agent#Interpretability#Safety#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper; the provided text gives the method and two validation settings, not code, lab backing, or deployment evidence. That keeps it near the featured threshold.

editor take

This moves coalition detection inside hidden states, but don’t sell it as an ops safety gauge yet; the tests are still toy MARL plus prompted LLM setups.

sharp

The useful move here is shifting coalition detection from behavior traces into internal representations. The method is concrete: build a pairwise mutual-information graph from agent hidden states, then use spectral partitioning. The paper validates it in two settings: multi-agent RL environments and one LLM prompted with team descriptions and reassignment scenarios. I buy the diagnostic direction, not the “scalable safety tool” framing yet. Rejecting false positives from behavioral coordination without informational coupling is a real win over trajectory-only monitoring. But the LLM setup leans on explicit team labels, and the authors say those labels dominate conflicting interaction patterns. For deployed agent fleets, this reads like an offline probe, not a production alarm for covert collusion.
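The partitioning step can be sketched with NumPy. This assumes the pairwise mutual-information matrix has already been estimated from hidden states (that estimator is not shown and is not specified in the snippet) and uses a plain two-way split on the Fiedler vector, which may be simpler than the paper's procedure.

```python
import numpy as np

def spectral_bipartition(mi):
    """Split agents into two groups by the sign of the Fiedler vector.

    mi: symmetric (n, n) matrix of pairwise mutual information, used as
        edge weights of an undirected graph over agents.
    """
    w = np.asarray(mi, dtype=float).copy()
    np.fill_diagonal(w, 0.0)
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0
```

The sign of an eigenvector is arbitrary, so only the grouping matters, not which side is labeled `True`.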

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→FactoryBench: Evaluating Industrial Machine Understanding

FactoryBench builds more than 70,000 Q&A items from roughly 15,000 normalized industrial telemetry episodes, then evaluates six frontier LLMs zero-shot; no model exceeds 50% on structured causal levels or 18% on decision-making questions.

#Benchmarking#Reasoning#Robotics#FactoryBench

why featured

HKR-H/K/R all pass: FactoryBench gives dataset size, six-model zero-shot results, and a sharp under-18% decision-task failure. The industrial-benchmark niche keeps it below must-write range.

editor take

FactoryBench turns factory telemetry into 70k questions, and frontier LLMs stay under 18% on decisions; robot intelligence still breaks on sensor causality, not chat.

sharp

FactoryBench hits the gap vendors keep talking around: LLMs do not understand machine state from telemetry. The benchmark builds 70k-plus Q&A items from about 15k normalized industrial episodes, spanning state, intervention, counterfactual, and decision levels. Six frontier LLMs are tested zero-shot; none clears 50% on structured causal levels or 18% on decision questions. That failure stings more than another small SWE-bench delta. Code agents can lean on tests, tools, and retries. A UR3 cobot or KUKA KR10 arm produces multivariate sensor streams where the model first has to infer state, then predict intervention effects. Free-form answers use LLM-as-judge voting, but the four structured formats are scored deterministically. That makes the usual “benchmark noise” excuse much weaker here.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Test-Time Compute Games

The paper introduces a reverse second-price auction mechanism for LLM-as-a-service, where providers bid price and expected quality, and evaluates it with Llama, Qwen, and DeepSeek-R1-distilled models on math and science benchmarks.

#Reasoning#Inference-opt#Benchmarking#Llama

why featured

HKR-H/K/R all pass: the auction framing is fresh, the paper gives a reverse second-price mechanism plus model benchmarks, and it speaks to inference-cost trust. No major-lab or cross-source signal, so it stays in the 72–77 band.

editor take

This paper names the ugly incentive: token-priced reasoning lets providers sell “thinking longer” as a default tax.

sharp

Test-time compute has a pricing problem before it has a latency problem. The paper’s sharp hook is the incentive: LLM-as-a-service providers charge for generated compute, so they gain when they add reasoning steps even when quality barely moves. The proposed reverse second-price auction makes providers bid both price and expected quality, then charges users by the winner’s marginal value over the runner-up. The model set is the right kind of boring: Llama, Qwen, and DeepSeek-R1-distilled models on math and science benchmarks. That is closer to production procurement than another leaderboard flex. My issue is verification. The snippet does not disclose how expected quality is audited. Without a hard quality signal, the market moves from token bloat to quality-claim bloat.
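One way to read the mechanism is as a second-score procurement auction, sketched below. The linear user value `value_per_quality * quality` and the exact payment rule are assumptions made here; the snippet only says providers bid price and expected quality and that the charge depends on the winner's margin over the runner-up.

```python
def reverse_second_price(bids, value_per_quality=1.0):
    """Pick the best value-minus-price offer; pay based on the runner-up.

    bids: {provider: (price, expected_quality)}, at least two entries.
    Returns (winner, payment). The payment leaves the user with exactly
    the surplus the runner-up offered, so it is never below the winner's bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    scored = sorted(
        ((value_per_quality * quality - price, name)
         for name, (price, quality) in bids.items()),
        reverse=True,
    )
    (_, winner), (second_score, _) = scored[0], scored[1]
    _, winner_quality = bids[winner]
    payment = value_per_quality * winner_quality - second_score
    return winner, payment
```

The sketch takes `expected_quality` on trust, which is exactly the auditing gap raised above.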

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

The paper introduces One-Step-Train, which ranks samples by marginal utility from a simulated one-step update on a lightweight proxy; on Qwen multimodal mathematical reasoning benchmarks, its top-50 subset cuts training cost by 43% and beats an LLM-as-a-Judge baseline by 1.8 points.

#Multimodal#Reasoning#Fine-tuning#Qwen

why featured

HKR-H/K/R all pass, but scope stays within multimodal data-selection research. The 43% cost cut and +1.8-point result are useful, not a major model or product release.

editor take

One-Step-Train is a cleaner bet than piling on synthetic data: top-50 cuts training cost 43% and gains 1.8 points, but Qwen math may be a narrow win.

sharp

One-Step-Train hits the ugly part of multimodal fine-tuning: synthetic data volume still poisons reasoning. It skips semantic judging and scores each sample through a simulated one-step update on a lightweight proxy. On Qwen multimodal math benchmarks, the top-50 subset cuts training cost 43% and beats an LLM-as-a-Judge baseline by 1.8 points. Under fixed compute, top-20 is 5.6 points higher. I buy the direction, not the whole victory lap. Math reasoning gives cleaner gradient signals than open-ended VQA, chart QA, or tool-heavy multimodal tasks. The Full-SFT gap is the useful warning: training on everything underperforms by 8.8 points, so data selection has moved from cost control to damage control. The missing proof is whether this utility ranking survives outside Qwen math.
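The scoring idea can be shown on a toy proxy. Here the lightweight proxy is a linear regressor with squared error, which is purely an assumption for illustration; the paper's proxy, loss, and validation set are not described in the snippet.

```python
import numpy as np

def mse(w, x, y):
    """Mean squared error of a linear model on (x, y)."""
    return float(np.mean((x @ w - y) ** 2))

def one_step_utility(w, sample_x, sample_y, val_x, val_y, lr=0.1):
    """Validation-loss drop after one simulated SGD step on a single sample."""
    grad = 2.0 * (sample_x @ w - sample_y) * sample_x
    w_next = w - lr * grad          # simulated update; w itself is untouched
    return mse(w, val_x, val_y) - mse(w_next, val_x, val_y)

def rank_samples(w, train_x, train_y, val_x, val_y, lr=0.1):
    """Return sample indices sorted by utility (best first) and the utilities."""
    utilities = np.array([
        one_step_utility(w, x, y, val_x, val_y, lr)
        for x, y in zip(train_x, train_y)
    ])
    return np.argsort(-utilities), utilities
```

A sample with negative utility makes the held-out loss worse after one step, which is the toy analogue of the Full-SFT warning above.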

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Research Proposes Tail-Aware Divergence for Efficient Language Model Distillation

The paper proposes a tail-aware divergence that decouples a teacher model’s top-K probabilities from lower-probability predictions while keeping the same computational profile as KL divergence.

#Fine-tuning#Inference-opt#Research release

why featured

HKR-H/K/R pass narrowly: the paper gives a concrete distillation mechanism and cost condition, relevant to compression cost. No gains, model sizes, or reproduction setup are disclosed, so it stays in the 60–71 research-note band.

editor take

Both sources are the same arXiv paper; tail-aware distillation is useful, but the “same compute profile” claim needs replication before the accuracy story matters.

sharp

Both entries point to the same arXiv:2602.20816 paper, so the coverage is identical, not independent confirmation. The method splits standard KL into top-K teacher probabilities and the remaining tail, then downweights the teacher modes while claiming the same compute profile as KL. I buy the problem framing more than the result claim. Distillation often teaches the student to mimic the teacher’s high-probability modes, especially when a small decoder only gets strong signal from top-1 or top-5 tokens. The credible hook is ICML 2026 plus claimed tests across pre-training and supervised distillation datasets. The weak spot is that the abstract gives no benchmark numbers, K choice, teacher/student sizes, or budget figure. Until those are visible, I’d treat this as a neat loss-function trick, not evidence that academic-budget distillation has caught up with industrial pipelines.
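The split can be illustrated on a single next-token distribution. The weighting scheme below (`top_weight`, `tail_weight`) is a guess at the general shape, not the paper's loss, which the abstract does not spell out; with both weights at 1.0 it reduces to ordinary KL.

```python
import numpy as np

def split_kl(teacher_probs, student_probs, k=5,
             top_weight=0.5, tail_weight=1.0, eps=1e-12):
    """KL(teacher || student) split into top-k and tail parts, then reweighted.

    The top-k set is chosen by teacher probability. Cost stays one pass
    over the vocabulary, like plain KL.
    """
    p = np.asarray(teacher_probs, dtype=float)
    q = np.asarray(student_probs, dtype=float)
    terms = p * (np.log(p + eps) - np.log(q + eps))
    top_mask = np.zeros(p.shape, dtype=bool)
    top_mask[np.argsort(p)[-k:]] = True
    top_term = terms[top_mask].sum()
    tail_term = terms[~top_mask].sum()
    return top_weight * top_term + tail_weight * tail_term
```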

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Benchmarking World-Model Learning with Environment-Level Queries

The paper introduces WorldTest and instantiates it as AutumnBench with 43 interactive grid-world environments, 129 tasks, and three environment-level query families; experiments with 517 human participants and five frontier models show humans substantially outperform the models.

#Agent#Reasoning#Benchmarking#WorldTest

why featured

HKR-H/K/R all pass, but this is an arXiv benchmark rather than a model launch or production tool. The 43 worlds and 517-human comparison clear featured, while staying in the 72–77 band.

editor take

WorldTest hits the right bruise: five frontier models lose to 517 humans when the test asks about the environment, not rollout trivia.

sharp

WorldTest attacks a lazy habit in agent evaluation: scoring next-frame prediction, task return, or observed rollouts, then calling that a world model. AutumnBench asks environment-level questions instead: reachability, intervention effects, and global structure across 43 interactive grid worlds, 129 tasks, and three query families. In that setup, 517 humans substantially beat five frontier models. I like the direction, but I would not oversell it as a universal world-model test. Grid worlds are clean and controllable; real browsers, codebases, and robots add tool limits, hidden state, and messy observation channels. Its value is sharper than that: it exposes agent demos that use long context and repeated rollouts to cosplay environmental understanding.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Discovering Learning-Friendly Generation Orders for Sequential Computation

The paper uses early-stage loss profiling to rank intermediate-state generation orders, raising success rates on six order-sensitive tasks from about 10% to near 100% under reported settings.

#Reasoning#Benchmarking#arXiv#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv training-method paper without deployment or tooling proof. The six-task jump from about 10% to near 100% puts it at the featured threshold.

editor take

Don’t file this as a trick: six tasks move from ~10% to near 100%, exposing generation order as a training variable we’ve mostly hand-waved.

sharp

The sharp part here is that generation order becomes an optimization target, not hand-tuned folklore. The paper uses early-stage loss profiling to rank candidate intermediate-state orders, then reports success moving from about 10% to near 100% across six order-sensitive tasks. It reaches L=13 from random initialization and L=40 from structured initialization. The mechanism is refreshingly plain: run a short training probe, measure which order drops loss faster, then search block-level and within-block permutations instead of the full factorial space. The integer-multiplication result matters because it rediscovers the reverse-digit order reported in prior work. That keeps this from looking like benchmark carpentry. I’m still cautious on generalization: the abstract does not give model scale, compute budget, or results on messy natural-language CoT. But for practitioners, the warning is useful: some reasoning failures come from a bad learnable path, not missing capacity.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→When Does Embedding Magnitude Matter? A Cross-Task Functional-Symmetry Framework

The paper proposes a 2×2 framework that normalizes query and document embeddings independently. Across four retrieval encoders on MS MARCO, BEIR, BRIGHT, and multi-hop QA, unilateral normalization variants beat cosine and dot product, with up to +72% out-of-domain relative gain and +24% downstream RAG gain.

#Embedding#RAG#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: the paper turns embedding magnitude into a query/document 2×2 test and reports gains on MS MARCO, BEIR, and BRIGHT. It matters to RAG/search teams, but it is not a major model or product release.

editor take

Cosine is no longer the safe default; splitting query/doc normalization gets up to +72% OOD, so RAG stacks need a rerun.

sharp

Cosine takes a clean hit here: the paper keeps the encoder fixed and only toggles query-side and document-side normalization, yet reports up to +72% relative OOD gain on BEIR, BRIGHT, and multi-hop QA, plus +24% downstream RAG. For practitioners, that is annoying in the useful way: the lever is not a new embedding model, it is the scoring function you probably hardcoded months ago. The mechanism is testable rather than hand-wavy: document magnitude scales inference scores, query magnitude modulates training gradients, and the Fisher Information Matrix condition number predicts which side to normalize. I would rerun existing MS MARCO-tuned retrievers on real traffic before buying the universal framing. The paper is still under review, and the +72% number needs the PDF tables for baselines and absolute scores.
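The four cells of the 2×2 are easy to reproduce on an existing index. This sketch only covers inference-time scoring with NumPy; the variant names are made up here, and the training-side effect of query magnitude is not modeled.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def score_variants(queries, docs):
    """Score matrices (n_queries, n_docs) for the four normalization choices."""
    qn, dn = l2_normalize(queries), l2_normalize(docs)
    return {
        "dot": queries @ docs.T,          # neither side normalized
        "cosine": qn @ dn.T,              # both sides normalized
        "query_norm_only": qn @ docs.T,   # document magnitude still moves ranks
        "doc_norm_only": queries @ dn.T,  # per-query ranking matches cosine
    }
```

At inference time, rescaling a query by a positive constant cannot change its ranking over documents, so the ranking differences come from whether document magnitude is kept.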

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Limitations on Accurate, Trusted, Human-level Reasoning

arXiv:2509.21654v2 proves that, under strict mathematical definitions, an accurate and trusted AI system cannot also satisfy human-level reasoning, because some task instances are easily and provably solvable by humans but not by the system.

#Reasoning#Alignment#Safety#Research release

why featured

HKR-H/K/R all pass: the theorem-style incompatibility is clickable, the new fact is a formal unsolvability claim, and the safety/trust angle resonates. Sparse arXiv-level metadata keeps it below the 78–84 band.

editor take

This paper hits the safety wish-list directly: never false, trusted, and human-level reasoning do not coexist under its formal setup.

sharp

This 19-page paper punches at the cleanest safety story: an AI system that can abstain, never makes false claims, is trusted as accurate, and always matches or exceeds human reasoning cannot exist under the authors’ definitions. The concrete hook is strong: Panigrahy and Sharan frame the proof through Gödel-style incompleteness and Turing’s halting-problem undecidability, then show task instances that humans can easily and provably solve while the system cannot. I would not read this as a direct ceiling on GPT-5, Claude, or any deployed agent stack. The result lives inside strict formal definitions, far from messy product choices like calibration, tool use, and verifier loops. But it attacks a lazy AGI claim people keep making: that stronger reasoning, refusal, and trustworthiness can all be optimized together. Under this setup, one side of that triangle breaks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

ExpThink compresses chain-of-thought with experience-guided reward shaping and difficulty-adaptive advantage; across multiple mathematical reasoning benchmarks, it reduces average response length by up to 77% while improving accuracy and reaches up to 3× the accuracy-efficiency ratio of a vanilla baseline.

#Reasoning#Inference-opt#Benchmarking#ExpThink

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with benchmark claims only; no disclosed open-source artifact or production validation, so it sits at the lower featured band.

editor take

ExpThink makes CoT compression look like RL curriculum design; 77% fewer tokens is strong, but math benchmarks are not agent workloads.

sharp

ExpThink’s useful move is replacing static length penalties with per-problem shortest-correct experience. It tracks the shortest correct solution found so far, gives full credit to concise correct answers, discounts verbose correct ones, and gives zero to wrong answers. Then correct-count normalization amplifies gradients on hard problems and suppresses easy-problem verbosity. The headline result is up to 77% shorter average responses and 3× the accuracy-efficiency ratio versus a vanilla baseline. I buy the training direction, but not the broad extrapolation yet. Math benchmarks give clean correctness and repeatable shortest paths; code agents, tool calls, and long-horizon workflows do not. The last year of reasoning models made “long CoT buys accuracy” painfully expensive at inference time, so this is more serious than a token-trimming trick. The snippet does not disclose model scale or per-benchmark tables, and that matters for judging the 77% claim.
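The reward rule described above can be sketched as a small stateful function. The discount `best / length` for verbose correct answers is an assumption chosen for simplicity; the snippet says such answers are discounted but not how. The correct-count normalization step is not shown.

```python
class ShortestCorrectReward:
    """Reward shaped by the shortest correct solution seen so far per problem."""

    def __init__(self):
        self.best_length = {}  # problem id -> shortest correct length so far

    def __call__(self, problem_id, is_correct, length):
        if not is_correct:
            return 0.0  # wrong answers earn nothing, however short
        best = min(self.best_length.get(problem_id, length), length)
        self.best_length[problem_id] = best
        # 1.0 when this answer ties or sets the record, lower when longer.
        return best / length
```

Because the reference length is per problem, a hard problem that genuinely needs a long solution is not penalized against an easy one.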

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→ESSAM: A Competitive Evolution Strategies Approach to Memory-Efficient LLM Fine-Tuning

ESSAM achieves 78.27% average accuracy on GSM8K fine-tuning, reduces average GPU memory use by 18x versus PPO and 10x versus GRPO, and the authors release code on GitHub.

#Reasoning#Fine-tuning#Inference-opt#ESSAM

why featured

HKR-H/K/R all pass: the paper offers concrete VRAM cuts, a GSM8K result, and open code for RL fine-tuning. It stays in the featured-threshold band because it is still an arXiv method paper without production validation or major-lab backing.

editor take

ESSAM’s GSM8K score is merely fine; the 18x vs PPO and 10x vs GRPO memory cut is the part that bites for reasoning finetunes.

sharp

ESSAM’s sharp claim is economic, not benchmark glory: it makes full-parameter reasoning finetuning look reachable without PPO-class memory burn. The paper reports 78.27% average GSM8K accuracy, above PPO’s 77.72% and almost tied with GRPO’s 78.34%, while cutting average GPU memory 18x versus PPO and 10x versus GRPO. That is a serious trade if it survives outside the paper setup. I don’t buy a broad RL replacement story yet. The evidence is GSM8K plus six generalization datasets, and the abstract does not show code tasks, long-horizon reasoning, or scaling behavior across much larger models. GRPO won mindshare because it was simple enough to operationalize after DeepSeek-style reasoning runs. ESSAM has to prove the ES+SAM loop stays stable when batch sizes, model size, and reward noise stop being friendly.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Mechanistic Interpretability Research Must Disclose Identification Assumptions for Causal Claims

The paper audits 10 mechanistic interpretability papers and a two-human-coded sample of 30 papers, finding no dedicated identification-assumptions sections and frequent use of validation metrics such as faithfulness, completeness, monosemanticity, alignment, or ablation effects as causal support.

#Interpretability#Alignment#arXiv#Research release

why featured

HKR-K/R pass: the audit numbers and “no identification-assumptions section” make a testable critique of causal evidence in mech interp. The niche topic and dry title keep it near the featured floor.

editor take

Mechanistic interpretability needs to stop borrowing causal language; 10 audited papers had zero identification-assumption sections.

sharp

This position paper hits a real weak spot in mechanistic interpretability: papers say circuits, mediators, and causal abstraction, then skip the identification story. The authors audit 10 papers across four methodological strands and a two-human-coded sample of 30 papers. They find no dedicated identification-assumptions sections. Faithfulness, completeness, monosemanticity, alignment, and ablation effects often get used as causal support. I buy the critique. The field has been too comfortable turning “the metric moved after an intervention” into “we found the mechanism.” Ablation shows local sensitivity; it does not automatically identify a causal structure. If this NeurIPS Position Track norm lands, plenty of mech-interp papers will need softer claims: not “we found the causal circuit,” but “under these assumptions, this circuit explains the behavior.”

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→The Context Gathering Decision Process: A POMDP Framework for Agentic Search

The paper introduces CGDP, a POMDP framework that models LLM agent search as approximate Thompson Sampling; across four methods and three question-answering domains, its predicate-based belief state improves multi-hop reasoning by up to 11.4% and its exhaustion gate cuts tokens by up to 39% without performance loss.

#Agent#Reasoning#Memory#Research release

why featured

HKR-K and HKR-R pass: the paper offers a testable framework and concrete eval numbers, tied to agent search cost. HKR-H is weak, and the POMDP framing adds technical friction, so this sits at the low featured band.

editor take

CGDP brings agent search back to old POMDP machinery; 11.4% better reasoning and 39% token savings are useful, not a general agent leap.

sharp

CGDP is useful because it turns messy agent search into a controlled state machine, not another long-context sales pitch. The paper models LLM search as approximate Thompson Sampling, then swaps implicit working memory for a predicate-based belief state. Across four methods and three QA domains, it reports up to 11.4% better multi-hop reasoning and up to 39% token savings via an exhaustion gate without performance loss. I buy the direction. In enterprise databases, codebases, and long chat histories, the failure is often corrupted search state, not raw reasoning. The catch is scope: the evidence is still QA-domain evidence. The abstract does not show real repo search, ticket systems, permissioned databases, or latency costs. This reads like a practical agent-harness patch, not proof that open-ended agent loops are suddenly reliable.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Rachel Ma and coauthors propose conditional optimal transport for PRM calibration. The method estimates monotonic conditional quantiles over PRM success probabilities from hidden states, extracts confidence bounds at arbitrary levels, and is evaluated on MATH-500 and AIME, where it improves calibration against uncalibrated PRMs and quantile regression when ranking signals are reliable.

#Reasoning#Inference-opt#Benchmarking#Rachel Ma

why featured

HKR-K/R pass: the paper gives a concrete mechanism and benchmark setting tied to PRM reliability. HKR-H is weak, and the article lacks code or a production-level claim, so it stays interesting-not-featured.

editor take

Two hits are the same arXiv paper, not consensus. PRM calibration is the unglamorous bottleneck in inference scaling, and CondOT still needs agent-grade proof.

sharp

Both entries point to arXiv:2605.06785 with the same title, so this is a single-source chain, not independent confirmation. The paper applies conditional optimal transport to PRM calibration: condition on PRM hidden states, estimate a monotone conditional quantile function, then feed arbitrary confidence bounds into IAS. I buy the problem before I buy the result. PRMs often overstate “this step looks promising” as “this trajectory will solve,” and AIME-style OOD math is exactly where that failure bites Best-of-N and verifier-guided search. The abstract only says MATH-500 and AIME “generally” improve downstream IAS, with no concrete lift or breakdown by weak versus strong PRMs. Until those numbers are visible, this is a plausible calibration tool, not a scaling recipe.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

The paper compares standard top-k routing with sampled equal-compute alternatives in four MoE models, finding that fragile reasoning tokens often have lower-loss routes inside the frozen model that the trained router does not select.

#Reasoning#Interpretability#Benchmarking#Qwen

why featured

HKR-H/K/R pass: the routing-failure hook is concrete, the 4-model counterfactual setup is testable, and MoE reasoning reliability is practitioner-relevant. Still an arXiv research item without tooling or production evidence, so it stays at the featured threshold.

editor take

MoE failures are not just weak experts; this paper pins part of hard-reasoning loss on top-k routers picking bad routes inside frozen models.

sharp

MoE weakness shows up at the token level here: hard reasoning tokens often have lower-loss routes, and the top-k router misses them. The paper tests equal-compute alternative routes across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, so the claim is not “use more experts.” It is “the same compute picked the wrong experts.” The sharp hook is the intervention: update only the final-layer router, freeze every expert and every other router, and pass@K moves on AIME 2024+2025 and HMMT 2025 for Qwen3-30B-A3B and GPT-OSS-20B. The paper does not give the gain in the snippet, so I won’t pretend scale. But the mechanism lands: LM loss scores only the executed route, while load balancing sees aggregate traffic. The router never gets trained against the better counterfactual route it skipped.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

RateQuant fits a per-quantizer distortion curve from a small calibration set and, on Qwen3-8B at 2.5 average bits, cuts KIVI perplexity from 49.3 to 14.9 while taking 1.6 seconds of calibration on one GPU and adding zero inference overhead.

#Inference-opt#Fei Zuo#Qwen#KIVI

why featured

HKR-H/K/R all pass, but this is an inference-optimization arXiv paper, not a model or product release. The Qwen3-8B result at 2.5 average bits, KIVI 49.3→14.9, supports a featured-threshold score.

editor take

RateQuant makes KV-cache bit allocation look embarrassingly under-modeled; KIVI at 2.5 bits going 49.3→14.9 PPL is too large to ignore.

sharp

RateQuant lands because it attacks the sloppy part of KV-cache quantization: treating head importance as enough. The paper’s concrete claim is that each quantizer has its own distortion curve, D(b)=alpha*beta^{-b}, with beta ranging from 3.6 to 5.3. Use the wrong curve, and the allocation order flips; performance can drop below uniform quantization. The Qwen3-8B numbers are hard to hand-wave away. At 2.5 average bits, calibrated RateQuant cuts KIVI perplexity from 49.3 to 14.9, improves QuaRot by 6.6 PPL, calibrates in 1.6 seconds on one GPU, and adds zero inference overhead. A lot of KV-cache work sells “lower bits” as the story. This paper says the serving bug is mismatched quantizer physics, not just memory pressure.
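To make the distortion-curve idea concrete, here is a hedged sketch of bit allocation under an exponential curve of the form alpha times beta to the power of minus b. The greedy marginal-gain allocator is a standard technique swapped in for illustration; the summary does not say which solver the paper uses. The function names, the `min_bits`/`max_bits` bounds, and the example curve parameters are all hypothetical.

```python
import heapq
from typing import List, Tuple


def distortion(alpha: float, beta: float, bits: int) -> float:
    """Modelled distortion of one quantizer at a given bit width."""
    return alpha * beta ** (-bits)


def allocate_bits(
    curves: List[Tuple[float, float]],
    avg_bits: float,
    min_bits: int = 1,
    max_bits: int = 8,
) -> List[int]:
    """Greedily hand out bits to whichever quantizer gains the most from one more.

    `curves` holds one (alpha, beta) pair per quantizer. Each curve is convex
    in the bit width, so the greedy choice minimises total modelled distortion
    for the integer budget.
    """
    n = len(curves)
    budget = round(avg_bits * n) - min_bits * n
    if budget < 0:
        raise ValueError("average bit budget is below min_bits")
    bits = [min_bits] * n
    heap = []
    if min_bits < max_bits:
        for i, (a, b) in enumerate(curves):
            gain = distortion(a, b, min_bits) - distortion(a, b, min_bits + 1)
            heap.append((-gain, i))  # max-heap via negated gain
        heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < max_bits:
            a, b = curves[i]
            gain = distortion(a, b, bits[i]) - distortion(a, b, bits[i] + 1)
            heapq.heappush(heap, (-gain, i))
    return bits
```

Calling `allocate_bits([(1.0, 3.6), (1.0, 5.3), (4.0, 3.6), (4.0, 5.3)], avg_bits=2.5)` spends ten bits across four quantizers. The order in which bits are granted depends on both alpha and beta, which is why fitting the wrong curve can reorder the allocation.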

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

The Starling team uses an LLM-based pipeline to tag 22.5 million PubMed papers covering 2.5 trillion tokens and 4.5 billion entities, then generates about 6.3 million structured records across six biomedical tasks with supporting passages and reported rejection rates of 0.6–7.7%.

#Agent#RAG#Embedding#Starling

why featured

HKR-H/K/R all pass, but this is a vertical biomedical data paper with limited source authority and sparse reproducibility detail in the feed. It clears featured, not the 78+ band.

editor take

Starling’s PubMed factory is serious; the “replace curated databases” claim is where I brake hard. Rejection rate is not biological truth.

sharp

Starling’s sharp move is turning biomedical extraction into an assembly line, not another benchmark demo. It tags 22.5M PubMed papers, 2.5T tokens, and 4.5B entities, then has agents emit about 6.3M records across six tasks. That scale matters because most curated biomedical databases are slow and context-thin. I don’t buy the “more accurate than curated databases” claim without a harder audit. Their concrete evidence is 0.6–7.7% frontier-model rejection, versus 16.5% on BBB_Martins and 7.3% on Bioavailability_Ma. Useful signal, but the judge is still a model, not replicated assays or blinded human curation. The strongest part is the supporting passage design: fed versus fasted oral bioavailability is exactly the nuance tabular datasets strip out.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→ModelLens: Finding the Best for Your Task from Myriads of Models

ModelLens learns a performance-aware latent space from 1.62M evaluation records spanning 47K models and 9.6K datasets, then ranks unseen models for unseen datasets without running candidate models on the target dataset.

#Benchmarking#Inference-opt#Reasoning#ModelLens

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with mechanism and scale only; no major adoption or open-source impact is disclosed, so it sits in the 72–77 research-tool band.

editor take

ModelLens turns leaderboard exhaust into model search. Useful for teams, but “rank without running” needs scars before trust.

sharp

ModelLens makes a practical bet: model selection has become recommendation over messy public traces. It learns from 1.62M evaluation records across 47K models and 9.6K datasets, then ranks models for unseen datasets without target runs. That is a better starting point than hand-scanning Open LLM Leaderboard, Papers with Code, and scattered HF cards. I buy the direction, not the victory lap. The abstract claims Top-K pools improve several routing methods by up to 81% on QA benchmarks, but the RSS text gives no baseline details, K values, or cold-start split design. Recommenders built on leaderboards are leak-prone, and task labels can become shortcuts. In production, this is a candidate-pool pruner before it is a model-quality judge.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Tree SAE paper introduces hierarchical feature learning for sparse autoencoders

The paper introduces Tree SAE, which combines activation coverage with a reconstruction constraint to learn hierarchical structures inside Sparse Autoencoder feature sets; the abstract says it outperforms existing SAEs on hierarchical-pair learning and stays competitive on several benchmarks, but the RSS snippet does not disclose scores.

#Interpretability#Benchmarking#Research release#Benchmark

why featured

HKR-H/K pass: “Tree SAE” has a concrete structural hook and the post names activation coverage plus reconstruction constraints. No benchmark scores are disclosed, and the niche interpretability angle keeps it below featured.

editor take

Tree SAE adds reconstruction constraints to hierarchy mining; useful cleanup for SAE interpretability, not a breakthrough claim yet.

sharp

Both entries point to the same arXiv paper, 2605.07922, so the coverage is fully aligned but not independently corroborated. The concrete hook is Tree SAE’s added reconstruction condition: child features must satisfy parent activation and a functional reconstruction link, aimed at feature absorption and splitting. I buy the problem framing more than the performance story. SAE interpretability has burned plenty of people by treating co-activation as semantic structure, so adding a reconstruction test is a stricter filter. But the abstract only says “significantly surpass” and “competitive,” without benchmark names, model size, sparsity regime, or human-labeling protocol. Until those are visible and replicated, Tree SAE reads like a useful diagnostic hypothesis, not a settled hierarchy learner.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails

The paper shows that binary-reward GRPO with group-mean-centered advantage receives zero learning signal when all group responses are correct or wrong, and reports a 0.69 degeneracy rate at group size 4 in logged Qwen3.5-9B GSM8K training. The fixed-reference Sign advantage A=2r-1 reaches 73.8% GSM8K accuracy across seven seeds versus 28.4% for normalized group-mean DrGRPO.

#Reasoning#Alignment#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the paper names a GRPO zero-advantage mechanism and gives Qwen3.5-9B/GSM8K results across seven seeds. The training-algorithm focus narrows reach, so it lands in low featured.

editor take

GRPO’s binary-reward failure was hiding in plain sight; a 0.69 degeneracy rate at group size 4 says the trainer, not the model, is often killing signal.

sharp

This paper punches through a lazy default in RLVR: binary rewards plus group-mean centering produce zero advantage when a group is all correct or all wrong. In logged Qwen3.5-9B GSM8K training, group size 4 hits a 0.69 degeneracy rate. That is not a corner case; it is the trainer muting most of the useful signal. The funny part is how crude the fix is. Sign advantage, A=2r-1, reaches 73.8% GSM8K accuracy across seven seeds, versus 28.4% for normalized group-mean DrGRPO, with p<0.0001. If this replicates, a lot of “RL creates reasoning” narratives get downgraded: the gain looks like compressing existing pass@k search into one output, not adding much new capacity. The MATH-500 result is positive but underpowered, so don’t overgeneralize it yet.
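The zero-signal case is easy to see in a few lines. This sketch compares plain group-mean centering with the fixed-reference Sign advantage A=2r-1 on binary rewards. The helper `degeneracy_probability` is an illustrative assumption that treats responses as independent with a shared success probability; the 0.69 figure above comes from logged training, not from this formula.

```python
from typing import List


def group_mean_advantage(rewards: List[int]) -> List[float]:
    """Center each binary reward on the group mean (no variance scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def sign_advantage(rewards: List[int]) -> List[int]:
    """Fixed-reference advantage A = 2r - 1: +1 if correct, -1 if wrong."""
    return [2 * r - 1 for r in rewards]


def degeneracy_probability(p: float, group_size: int) -> float:
    """Chance that a group is all correct or all wrong, assuming independent
    responses that each succeed with probability p (an illustrative model)."""
    return p ** group_size + (1 - p) ** group_size
```

For an all-correct group `[1, 1, 1, 1]`, `group_mean_advantage` returns all zeros, so the policy gradient vanishes, while `sign_advantage` still returns a nonzero value for every response.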

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

The paper models recursive generative retraining with multiple reward functions and proves that, under specified conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions while satisfying a weighted Nash bargaining solution.

#Alignment#Fine-tuning#Research release#Safety/alignment

why featured

HKR-H/K/R pass, but this is an arXiv theory paper with no disclosed system, code, or production test. It fits featured threshold, not must-write.

editor take

This paper narrows “synthetic data collapse” into a condition: one reward signal collapses; plural curation can stabilize the loop.

sharp

The useful move here is turning synthetic-data collapse from fate into mechanism. The paper studies recursive generative retraining with multiple reward functions, then proves convergence to a stable distribution under specified conditions. Probability mass lands across competing high-reward regions via a weighted Nash bargaining solution. I buy the theory direction; I don’t buy any production-training comfort yet. The disclosed result is formal convergence, not LLM-scale evidence. There is no recipe for GPT-5, Claude Sonnet 4.5, sampling ratios, or benchmark curves. It speaks to the old RLHF problem where a single reward model narrows the output distribution. That gives data-curation teams a mathematical guardrail, not permission to scale closed synthetic loops.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Lightning OPD replaces a live teacher server with one precomputed pass of teacher log-probabilities over SFT rollouts, matches standard OPD in math reasoning and code generation experiments, and reports 4.0x higher training efficiency under teacher consistency.

#Reasoning#Code#Fine-tuning#Qwen

why featured

Placed at the featured floor: HKR-K is concrete with precomputed teacher log-probs and a 4.0x efficiency claim; HKR-R hits post-training cost. HKR-H is muted by OPD jargon, and a single arXiv paper lacks the authority for 78+.

editor take

Lightning OPD’s 4.0x efficiency claim is useful, but it lives or dies on teacher consistency; swap teachers and the story breaks.

sharp

Lightning OPD is useful because it attacks the boring cost center in OPD: the live teacher server. The paper precomputes teacher log-probs once over SFT rollouts, then reuses them during training. The concrete numbers are solid enough to care about: Qwen3-8B-Base hits 69.9% on AIME 2024 in 30 GPU hours, and Qwen3-30B-A3B reaches 71.0% on one 8xH100 node. They also report 4.0x higher training efficiency. I buy the engineering direction, but not a broad “offline OPD just works” read. The whole result hangs on teacher consistency: the same teacher must generate SFT and OPD signals. Break that, and the paper says gradient bias hurts both offline and online OPD. That is great for clean academic pipelines and awkward for industrial post-training stacks that mix teachers, synthetic data sources, and refresh cycles.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Delulu introduces 1,951 verified Fill-in-the-Middle samples across 7 programming languages and 4 code hallucination types, using Docker checks and human review to validate failures; the evaluation of 11 open-weight FIM models across 0.5B–32B parameters reports a best result of 84.5% pass@1, with benchmark code and containers released on GitHub.

#Code#Benchmarking#Microsoft#Qwen

why featured

HKR-H/K/R all pass, but this is an arXiv benchmark release rather than a model or product event. The 1,951 samples and 11 open FIM model results give practical signal, placing it just above the featured threshold.

editor take

Delulu drags code hallucination back into Docker, not vibes; 84.5% pass@1 is livable, but ugly for confident IDE fill-ins.

sharp

Delulu’s sharp edge is execution-verified failure, not another static preference set. It has 1,951 FIM samples across 7 languages and 4 hallucination types. The pipeline uses a frontier LLM to generate plausible bad completions, 4 judge models to filter them, embedding clustering to mine harder cases, Docker to prove the golden code compiles while the hallucinated variant hits the expected runtime error, then human review. That maps closer to Copilot and Cursor pain than HumanEval-style full-function tests: one middle-line completion invents an API, parameter, import, or variable, and the file only breaks at runtime. The best of 11 open-weight FIM models from 0.5B to 32B reaches 84.5% pass@1. Qwen2.5-Coder, DeepSeek-Coder-V2, CodeLlama, and StarCoder2 all still produce hallucination-aligned completions. I don’t read 84.5% as collapse. I read it as evidence that “fits the surrounding code” is a dangerous local objective. If IDE vendors wire Delulu-style cases into regression testing, it will matter more than another leaderboard bump.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

The paper analyzes 36 unlearned LLaVA-1.5-7B models and finds that five metrics produce conflicting rankings across three VQA benchmarks, then introduces UQS, a composite score weighted by each metric’s Spearman correlation with an oracle retrained on the retain set.

#Multimodal#Vision#Benchmarking#LLaVA

why featured

HKR-H/K/R all pass, but this is a niche arXiv evaluation paper rather than a model or product release. The 36-model study and UQS make it concrete enough for the featured threshold.

editor take

Multimodal unlearning eval is already messy: across 36 LLaVA checkpoints, FA and AD rankings anti-correlate (Kendall tau=-0.26), so single-metric compliance claims are shaky.

sharp

This paper lands a clean hit on multimodal unlearning eval: the same 36 unlearned LLaVA-1.5-7B checkpoints get conflicting rankings from five standard metrics across three VQA benchmarks. The ugly number is Kendall tau=-0.26 between Forget Accuracy and Activation Distance, with FA/RA/MIA splitting from AD/JS. UQS is better than another vanity leaderboard because it weights metrics by Spearman correlation to an oracle retrained only on the retain set. RA has the strongest signal at rho=0.484; FA goes negative at rho=-0.418. I still have doubts about the oracle: retain-set retraining is expensive and not automatically the legal target. But it forces unlearning papers to stop cherry-picking one flattering Forget Accuracy score.
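A rough sketch of a Spearman-weighted composite in the spirit of UQS, under loudly stated assumptions. The summary does not give the formula, so the signed weights normalized by the sum of absolute correlations are an assumption, as are the names `spearman` and `composite_scores` and the idea of a per-checkpoint `oracle_agreement` number. Metric values are assumed to be on comparable scales already, and neither input may be constant.

```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def composite_scores(
    metrics: Dict[str, List[float]],
    oracle_agreement: List[float],
) -> Tuple[List[float], Dict[str, float]]:
    """Weight each metric by its rho against the oracle signal (assumed scheme).

    Negatively correlated metrics get negative weights, so they pull the
    composite in the opposite direction instead of being trusted at face value.
    """
    rhos = {name: spearman(vals, oracle_agreement) for name, vals in metrics.items()}
    total = sum(abs(r) for r in rhos.values())
    weights = {name: r / total for name, r in rhos.items()}
    scores = [
        sum(weights[name] * metrics[name][i] for name in metrics)
        for i in range(len(oracle_agreement))
    ]
    return scores, weights
```

The design point this illustrates is the one made above: a metric such as FA with negative rho cannot be reported alone as evidence of good unlearning, because the composite would count it against the checkpoint.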

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·11

→SoftSAE paper introduces dynamic feature selection for adaptive sparse autoencoders

SoftSAE replaces fixed-K sparse autoencoders with a differentiable Soft Top-K operator that learns input-dependent sparsity k, targeting LLM and ViT interpretability where samples vary in intrinsic dimensionality; the abstract says experiments show meaningful features and per-concept feature counts, and the code is available on GitHub.

#Interpretability#SoftSAE#Research release#Open source

why featured

HKR-K passes: SoftSAE replaces fixed-K SAEs with input-dependent sparsity and releases code. HKR-H and HKR-R are weak because the paper is specialist interpretability research, so it stays in all.

editor take

SoftSAE attacks fixed-K where SAEs actually creak; I like the direction, but no benchmark numbers in the abstract means no victory lap yet.

sharp

Both entries point to the same arXiv paper, 2605.06610, with identical framing; this is a single-paper chain, not independent coverage. SoftSAE makes a clean bet: fixed-K TopK SAEs add noise on simple inputs and drop structure on complex ones, so it learns an input-dependent k through a differentiable Soft Top-K operator. I like the bet more than another “bigger dictionary” SAE paper. A lot of interpretability pain comes from forcing every activation into the same explanation length. The catch is that the abstract only says experiments confirm the claim; it gives no model names, datasets, reconstruction loss, or feature-purity numbers. Compared with Anthropic’s large-scale SAE work, SoftSAE reads like a useful objective-level tweak, not yet a deployable audit tool for frontier LLMs.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

SHRED selects the lowest-probability tokens in each forget-set instance with one forward pass, demotes their logits through KL self-distillation, and reports a better forget-utility Pareto trade-off than retain-set-dependent methods across four standard unlearning benchmarks.

#Fine-tuning#Alignment#Safety#SHRED

why featured

HKR-H/K/R pass: the paper has a concrete retain-set-free mechanism and 4 benchmark claims. Single arXiv item; no code, replication, or production evidence is disclosed, so it stays below featured.

editor take

SHRED picks high-surprisal forget tokens with one forward pass; I buy the mechanism, but need the four-benchmark tables.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

The paper evaluates six appraisal-based self-assessment dimensions across 12 LLMs and 38 tasks, finding that effort and ability match or outperform confidence in most settings, with effort more predictive on reasoning-intensive tasks and ability or confidence stronger on retrieval-oriented tasks.

#Reasoning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv evaluation paper with no tool release, adoption signal, or cross-source debate. The 12-model, 38-task result keeps it in high all.

editor take

The paper tests 12 LLMs on 38 tasks; I buy effort as signal, since confidence is already polluted by calibration and style.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

The paper tests Qwen3, Gemma-3, and Llama-3 across more than ten scales on rhyming-couplet completion, finding that only Gemma-3-27B causally moves the rhyme driver to the line boundary around layer 30, with five attention heads recovering about 90% of newline rhyme-routing capacity.

#Reasoning#Interpretability#Qwen3#Gemma-3

why featured

HKR-H/K/R pass, but this is a single arXiv mechanistic-interpretability paper with one task and no visible replication or product impact. Defaulting to the lower 60–71 band.

editor take

Gemma-3-27B reroutes near layer 30, and five heads recover ~90%; probe-visible planning still doesn’t prove causal use.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

The paper introduces HORA, a learning-free rollout allocation policy that maximizes posterior hit utility within each batch. Across four math reasoning benchmarks and three model scales, HORA matches Pass@1 and improves Pass@K over compute-matched GRPO in 10 of 12 model-benchmark settings, with one tie and one saturated exception.

#Reasoning#Benchmarking#HORA#GRPO

why featured

HKR-K/R pass: the paper gives a concrete rollout-allocation method and 10/12 equal-compute gains. Its narrow RLVR post-training scope keeps it in the 60–71 band, not featured.

editor take

HORA improves Pass@K in 10/12 settings; I buy the angle: RLVR still has rollout allocation debt before estimator swaps.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

LiteGUI proposes an SFT-free training paradigm for lightweight GUI agents. It combines Guided On-policy Distillation, oracle trajectories, dynamic retrieval, and multi-solution dual-level GRPO. The paper reports state-of-the-art results among lightweight models and competitive performance against larger models, but the RSS snippet does not disclose benchmark names or exact scores.

#Agent#Vision#Fine-tuning#LiteGUI

why featured

HKR-H/K/R all register: compact GUI agents, no-SFT training, and two-level GRPO matter to agent builders. The source gives abstract-level mechanisms only, with no benchmark numbers, code status, or reproducible setup, so it stays in 60–71.

editor take

LiteGUI trains 2B/3B GUI agents without SFT; scores and benchmarks are undisclosed, so I’m discounting the SOTA claim hard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Neural Neural Scaling Laws

NeuNeu predicts accuracy on 66 downstream tasks using observed accuracy trajectories and token-level validation losses, reaching 1.99% mean absolute error and reducing error by 44% versus logistic scaling laws at 3.56% MAE.

#Benchmarking#Reasoning#HuggingFace#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with metrics only and no code, author signal, or external replication shown. It fits the upper “interesting” band, below featured.

editor take

NeuNeu hits 1.99% MAE across 66 tasks; I trust trajectory extrapolation more than one smooth curve pretending tasks scale alike.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

LookWhen uses a selector-extractor framework for video recognition, selecting top-K space-time tokens from a scaled-down video and approximating full-video representations, with experiments across 6 tasks and 2 settings showing Pareto dominance in accuracy-FLOPs on 9 of 12 cases and 6.7x higher throughput than InternVideo2-B at equal accuracy.

#Vision#Multimodal#Inference-opt#LookWhen

why featured

HKR-H/K/R pass: 6.7x speedup, 9/12 accuracy-FLOPs wins, and selector-extractor compute routing are concrete. It remains a single arXiv video-recognition paper, with no open-source artifact or deployment, so it stays in the 60–71 band.

editor take

LookWhen wins 9 of 12 video cases; 6.7x throughput says token selection still has room to embarrass dense video Transformers.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

The paper proposes Goldilocks, a teacher-driven sampling strategy that predicts each question’s difficulty for a student model and selects items that are neither too easy nor too hard during GRPO training. It reports better OpenMathReasoning performance than standard GRPO under the same compute budget.

#Reasoning#Fine-tuning#Goldilocks#OpenMathReasoning

why featured

HKR-H/K/R pass: the hook is Goldilocks difficulty sampling, the new fact is student-conditioned item selection, and the nerve is RL training efficiency. No effect size or lab authority is disclosed, so it stays in the all band.

editor take

Goldilocks beats standard GRPO on OpenMathReasoning at equal compute; gain size is undisclosed, but sampling policy beats knob-twiddling here.
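A minimal sketch of why mid-difficulty items matter, under assumptions not stated in the source: the teacher's difficulty estimate is treated as a predicted pass rate in [0, 1], and the selection band [0.2, 0.8] is a hypothetical choice. The names `select_goldilocks`, `low`, and `high` are illustrative only. The motivation is the standard group-normalized advantage used in GRPO-style training: when every rollout in a group earns the same reward, all advantages are zero and the item contributes no gradient.

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, all zero if rewards tie."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def select_goldilocks(predicted_pass_rate, low=0.2, high=0.8):
    """Keep question ids whose predicted pass rate lies in [low, high].

    The band is a hypothetical threshold, not a value from the paper.
    """
    return [q for q, p in predicted_pass_rate.items() if low <= p <= high]

predictions = {"q_easy": 0.97, "q_mid": 0.55, "q_hard": 0.02}
print(select_goldilocks(predictions))      # only the mid-difficulty item survives
print(group_advantages([1, 1, 1, 1]))      # too easy: every advantage is zero
print(group_advantages([1, 0, 0, 1]))      # mixed outcomes: non-zero signal
```

Items that are always solved or never solved behave like the first `group_advantages` call, which is the sparse-reward failure the title refers to.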

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

The paper tests a year-triggered SQL injection backdoor on SmolLM2-360M, where “2024” triggers vulnerable code and “2023” triggers safe code; Diff-SAE reaches BIS 0.40 with 1.0 precision and zero false positives across most conditions, while Crosscoders stay below 0.02 in most cases.

#Interpretability#Safety#Fine-tuning#SmolLM2

why featured

HKR-K is strong, with concrete BIS and precision numbers; HKR-H/R pass via the backdoor-detection hook and safety concern. The SAE focus and SmolLM2-360M SQL-injection setup keep it in the 60–71 band.

editor take

Diff-SAE hits BIS 0.40 on SmolLM2-360M backdoors; Crosscoders sit below 0.02, so SAE-safety hype needs a haircut.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mean-Pooled Cosine Similarity Is Not Length-Invariant: Theory and Cross-Domain Evidence

The paper shows that mean-pooled cosine similarity grows monotonically with sequence length under anisotropic transformer representations. Across four code LLMs on HumanEvalPack, length ratio alone explains 52–75% of the variance (R²=0.52–0.75) in cross-language Python proximity.

#Embedding#Benchmarking#Interpretability#HumanEvalPack

why featured

HKR-H and HKR-K pass: the paper challenges mean-pooled cosine and provides a length-bias mechanism plus R² data. As a single arXiv metric paper with a narrow embedding/eval audience, it stays in all.

editor take

Length ratio explains R²=0.52–0.75 under mean-pooled cosine; cross-lingual code proximity papers need CKA controls first.
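The length effect can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's setup: every token vector is a shared offset (a stand-in for anisotropy) plus independent unit Gaussian noise, and both sequences have the same length. The dimension, offset, and function names are all hypothetical. Mean pooling averages the noise away as length grows, so the shared offset dominates and cosine similarity rises even though the two sequences share no content.

```python
import math
import random

def mean_pool(n_tokens, shared, rng):
    """Average n_tokens vectors, each the shared offset plus unit Gaussian noise."""
    dim = len(shared)
    sums = [0.0] * dim
    for _ in range(n_tokens):
        for i in range(dim):
            sums[i] += shared[i] + rng.gauss(0.0, 1.0)
    return [s / n_tokens for s in sums]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_cosine(n_tokens, trials=30, dim=32, seed=0):
    """Mean cosine between two independent mean-pooled sequences of equal length."""
    rng = random.Random(seed)
    shared = [0.5] * dim  # hypothetical anisotropy: a common offset in every token
    sims = [
        cosine(mean_pool(n_tokens, shared, rng), mean_pool(n_tokens, shared, rng))
        for _ in range(trials)
    ]
    return sum(sims) / trials

for n in (4, 32, 256):
    print(n, round(avg_cosine(n), 3))  # similarity increases with length
```

Under this toy model the expected similarity is roughly |offset|² / (|offset|² + dim / n), which is why a length-matched control is needed before reading similarity as semantic proximity.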

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators

CIKA uses a frozen 7B LLM as an interventional simulator for concept mastery, reaching 69.7% on Omni-MATH-Rule versus 60.5% for o1-mini, with ICP separating causally relevant concepts from negative controls on 67 screened problems.

#Reasoning#Benchmarking#CIKA#Omni-MATH

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper tied to a math benchmark. No open artifact, replication detail, or major lab release is disclosed, so it stays below featured.

editor take

CIKA’s frozen 7B hits 69.7% on Omni-MATH-Rule versus o1-mini’s 60.5%; I trust the probe, but want replication.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MIND: Monge Inception Distance for Generative Models Evaluation

The paper proposes MIND, a generative model evaluation metric using sliced Wasserstein distance; MIND with 5k samples matches FID with 50k samples, computes two orders of magnitude faster, and avoids FID’s high-dimensional mean and covariance estimation.

#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: MIND claims 10x fewer samples than FID and evaluation two orders of magnitude faster. As a single arXiv metric paper without adoption evidence, it stays in the interesting-not-featured band.

editor take

MIND matches FID-50k with 5k samples; I buy the speed claim, but leaderboard swaps need third-party replication.
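For readers unfamiliar with the building block, here is a minimal sketch of a Monte Carlo sliced Wasserstein distance. It is the generic textbook estimator, not MIND itself: the source does not give MIND's exact definition, feature extractor, or projection count, so `n_projections`, the 1-Wasserstein choice, and the equal-sample-count restriction are all assumptions.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between equal-sized point sets."""
    assert len(xs) == len(ys), "this sketch assumes equal sample counts"
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both sets to 1-D; sorting gives the optimal 1-D matching.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(a * d for a, d in zip(y, direction)) for y in ys)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections

rng = random.Random(1)
real = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
shifted = [[v + 1.0 for v in row] for row in real]
print(sliced_wasserstein(real, real))      # identical sets give 0.0
print(sliced_wasserstein(real, shifted))   # a shifted copy gives a positive distance
```

Each projection only needs a sort, so no high-dimensional mean or covariance is estimated, which matches the contrast with FID drawn in the summary.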

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

SparseRL-Sync replaces full-weight transfers with lossless sparse update payloads, reducing per-update communication from S to about S/100 when parameter-change sparsity reaches 99% in decoupled Trainer-Rollout RL systems.

#Inference-opt#SparseRL-Sync#arXiv#Research release

why featured

HKR-H/K/R pass, but the item has only title/abstract-level facts; experiment scale, tasks, code, and limits are not disclosed. This is useful systems work, so it sits high in the all band, not featured.

editor take

SparseRL-Sync claims ~100x less sync traffic at 99% change sparsity; I’d audit index overhead and sparsity stability first.
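A minimal sketch of what a lossless sparse update payload could look like, assuming a flat list of weights and an (index, new value) encoding. The function names and the encoding are hypothetical; the source does not describe SparseRL-Sync's actual wire format.

```python
def encode_sparse_update(old, new):
    """Return (index, new_value) pairs for entries that changed; exact, not approximate."""
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]

def apply_sparse_update(weights, payload):
    """Reconstruct the new weights from the old weights plus the sparse payload."""
    updated = list(weights)
    for i, value in payload:
        updated[i] = value
    return updated

old = [0.0] * 1000
new = list(old)
for i in range(0, 1000, 100):   # change 10 of 1000 entries, i.e. 99% sparsity
    new[i] = 1.0

payload = encode_sparse_update(old, new)
assert apply_sparse_update(old, payload) == new   # round trip is exact
print(len(payload), "changed entries out of", len(new))
```

Each payload entry carries an index as well as a value, so the byte-level saving is smaller than the entry-count ratio; that gap is the index overhead flagged in the editor take.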

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Mage evaluates 858 Unity scene generation attempts across four open-weight 7B–30B LLMs, finding direct NL-to-C# generation reaches a 43% mean runtime-pass rate but yields structurally weak scenes with mechanism F1 around 0.12.

#Code#Benchmarking#Mage#Unity

why featured

HKR-H/K/R all pass, but the scope is niche Unity scene generation rather than a broad code-agent release. Concrete scale and failure metrics put it at the high end of 60–71.

editor take

Mage ran 858 Unity generations; 43% runtime pass with 0.12 mechanism F1 makes compile-pass bragging look lazy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

The paper introduces PND, a training-free inference framework for VLM decoding, using a positive path to amplify visual evidence and a negative counterfactual path to penalize prior-dominant generation, with reported state-of-the-art results on POPE, MME, and CHAIR.

#Multimodal#Vision#Inference-opt#Research release

why featured

HKR-H/K/R all pass, but the post gives only the mechanism and SOTA claims on POPE, MME, and CHAIR, with no gains, code, or deployment test. Single arXiv paper stays in the 60–71 band.

editor take

PND reports SOTA on POPE, MME, and CHAIR; I want the latency bill, since training-free isn't inference-free.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation

The paper analyzes vision and language models and shows that leading singular vectors in pretrained weights stay stable during fine-tuning across unrelated tasks; it proposes freezing pretrained singular vectors and training only leading spectral coefficients, reaching competitive GLUE performance with 0.2% trainable parameters.

#Fine-tuning#Interpretability#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: 0.2% trainable parameters, stable top singular vectors, and coefficient-only tuning. Single arXiv paper, high spectral-analysis threshold, no code or outside replication, so it stays in 60–71.

editor take

The paper reports competitive GLUE with 0.2% trainable params; I buy the angle: another testable low-rank adaptation story beyond LoRA.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Quotient Semivalues for False-Name-Resistant Data Attribution

The paper proposes a quotient semivalue mechanism that computes Shapley, Banzhaf, or Beta-style attribution over evidence-backed clusters, and in DataMarket-Gym reduces duplicate and near-duplicate Sybil attack gain from 1.74 under baseline Shapley to 0.96.

#Benchmarking#Safety#DataMarket-Gym#Research release

why featured

HKR-H and HKR-K pass: the Sybil-payout angle is concrete, and the paper gives a mechanism plus DataMarket-Gym numbers. Niche semivalue/data-market framing keeps it below featured.

editor take

Quotient semivalue cuts Sybil gain from 1.74 to 0.96; identity-level Shapley is basically an invite to farm payouts.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Android Coach changes Android agent online RL from Single State Single Action to Single State Multiple Actions, using a critic, a process reward model, and group-wise advantage estimation; it improves success rates by 7.5% on AndroidLab and 8.3% on AndroidWorld over UI-TARS-1.5-7B, and reaches 1.4x training efficiency versus PPO and GRPO at matched success rates.

#Agent#Reasoning#Benchmarking#Android Coach

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper without major-lab backing, release details, or production evidence. It fits the 60–71 band, so tier stays all.

editor take

Android Coach gains 7.5%/8.3% on two benchmarks; reusing costly UI states beats piling on rollouts.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

VESPO applies a sequence-level closed-form reshaping kernel to reduce variance in off-policy RL for LLMs, reports stable training with rollout staleness up to 64x, and outperforms matched reshaping baselines in math reasoning and code generation experiments across dense and MoE models.

#Reasoning#Code#Fine-tuning#Research release

why featured

HKR-K/R pass: 64x stale rollouts and the sequence-level kernel are concrete claims tied to RL post-training cost. Single arXiv paper, academic framing, and no disclosed artifact keep it in the 60–71 band.

editor take

VESPO reports stable off-policy LLM RL at 64x rollout staleness; finally, a cleaner answer than clipping heuristics.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Bias and Uncertainty in LLM-as-a-Judge Estimation

The paper analyzes bias in LLM-as-a-Judge estimation and uses J and ΔJ to diagnose calibration instability; its MMLU-Pro case study shows a sign reversal in model comparison under shared calibration.

#Benchmarking#Alignment#Research release#Benchmark

why featured

HKR-H/K/R all pass, but this is a single arXiv eval-method paper. The post discloses J/ΔJ and one MMLU-Pro reversal case, not a tool, scale, or broad debate, so it stays in all at 70.

editor take

J and ΔJ expose LLM-as-a-Judge calibration drift; the MMLU-Pro comparison already flips sign, so shared-calibration confidence needs receipts.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Query-efficient model evaluation using cached responses

The paper introduces a DKPS-based evaluation method that uses cached responses from prior models to predict a new model’s benchmark performance; under specified conditions, it reduces query counts and matches baseline mean absolute error with a substantially smaller query budget.

#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R all pass lightly, but dataset, reduction size, and reproduction conditions are not disclosed. This is useful evaluation research, not a same-day industry event, so it stays in the 60–71 band.

editor take

DKPS uses cached responses to evaluate new models; query savings are undisclosed, and benchmark dedup gets harder.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

The paper proposes MELT, which uses one shared KV cache per layer plus a learnable gating mechanism to change iterative reasoning memory from linear growth with depth to a constant footprint.

#Reasoning#Memory#Fine-tuning#MELT

why featured

HKR-H/K/R pass via constant KV memory, shared-cache gating, and inference-cost relevance. Single arXiv paper with no benchmark numbers, model scale, or deployment conditions keeps it in the 60–71 band.

editor take

MELT makes looped reasoning KV memory constant via shared caches; no benchmark numbers in the snippet, so don’t crown it over Ouro yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→RelAgent: LLM Agents as Data Scientists for Relational Learning

RelAgent handles relational learning with two phases: during search, an LLM agent uses database, validation, and evaluation tools to build SQL feature programs and select a predictive model; during inference, the resulting SQL queries and classical model run without further LLM calls.

#Agent#Tools#Inference-opt#RelAgent

why featured

HKR-H/K/R all pass, but the post only discloses the mechanism, not metrics, dataset scale, or release status. As a single arXiv paper, it stays below featured.

editor take

RelAgent uses LLMs for SQL search, then 0 LLM calls at inference; I like the shape, but benchmarks and cost are undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

The paper measures finite-answer preference stabilization with Qwen3-4B-Instruct and finds that, in controlled delayed-verdict tasks, the contextual finite-answer projection stabilizes 17–31 tokens before the answer becomes parseable in the main templates.

#Reasoning#Interpretability#Qwen#Research release

why featured

HKR-H/K/R pass: the paper asks when a model commits and gives a concrete Qwen3-4B-Instruct result, 17–31 tokens before parseability. Single arXiv paper with no replication or multi-model evidence, so it stays in the 60–71 band.

editor take

Qwen3-4B-Instruct locks answer preference 17–31 tokens early; don't call it belief probing, it tracks eventual output.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts

CaRE uses a bi-level routing MoE to handle 100 to over 300 non-overlapping continual learning tasks, releases OmniBenchmark-1K and code, and the abstract says it outperforms all baselines on very long class-incremental learning sequences.

#Fine-tuning#Benchmarking#CaRE#OmniBenchmark-1K

why featured

HKR-H/K pass: 300+ tasks, bi-level routing MoE, OmniBenchmark-1K, and code provide new facts. HKR-R is weak because this is a specialist training paper, so it stays in all rather than featured.

editor take

CaRE pushes CIL to 100–300+ tasks; without forgetting rates or expert cost, I’m treating it as a scaling paper.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

MASPO optimizes prompts across LLM-based multi-agent systems using joint evaluation and evolutionary beam search, and reports an average accuracy gain of 2.9 over state-of-the-art prompt optimization methods across 6 tasks.

#Agent#Tools#Benchmarking#MASPO

why featured

HKR-K and HKR-R pass: MASPO gives a concrete mechanism and +2.9 accuracy across 6 tasks, relevant to agent builders. It stays in the 60–71 band because this is a niche arXiv method paper without production impact or disclosed artifact details.

editor take

MASPO reports +2.9 average accuracy on 6 tasks; multi-agent prompt tuning is finally attacking local-goal mismatch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→In-Context Credit Assignment via the Core

The paper proposes incentive-aligned in-context credit assignment using the least core from cooperative game theory. On a web retrieval credit assignment task, its constraint seeding and separation routines approximate the least core with orders of magnitude fewer LLM calls than alternative methods, while compensating creators whose IP appears in the context window.

#RAG#Tools#Research release

why featured

HKR-K passes with a concrete mechanism and call-reduction claim; HKR-R is limited to RAG evaluation practitioners. HKR-H misses because the title is academic and not broadly clickable.

editor take

Least-core credit assignment cuts LLM calls by orders of magnitude; baseline and error are undisclosed, so don’t canonize payouts yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

The paper builds AEB and ECS to test 2 RLVER models and Qwen 1.5B/7B across 480 adversarial dialogues; RLVER-PPO-Think scores 0.963 versus 0.761 for the same-scale untuned baseline, while ECS shows no significant gain over Base-7B-Think at p=0.650.

#Agent#Alignment#Benchmarking#Qwen

why featured

HKR-H/K/R pass, but this is a single arXiv benchmark paper. It has concrete tests and scores, yet no major lab release, adoption signal, or cross-source discussion, so it stays in the 60–71 band.

editor take

AEB stress-tests RLVER over 480 dialogues; 0.963 pops, but ECS p=0.650 says empathy rewards trained performance first.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions

The paper introduces SSAI, mapping sparse news into 4 auditable axes, and tests it on 30 NASDAQ-100 stocks from 2019 to 2023, reporting 307.2% cumulative return and a 1.067 Sharpe for the four-factor portfolio while stating the gains fail coverage-stratified controls and reverse at costs of at least 0.2%.

#Agent#Reasoning#Interpretability#arXiv

why featured

HKR-H/K/R pass on the 307.2% LLM trading backtest and 4-axis SSAI mechanism, but this is a single arXiv finance paper with no live results, code, or cross-source validation, so it stays in the 60–71 band.

editor take

SSAI reports 307.2% on 30 NASDAQ-100 names, but flips at 0.2% costs; treating it as alpha is a trap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Response Time Enhances Alignment with Heterogeneous Preferences

The paper adds user response time to binary preference datasets and models each decision with a Drift-Diffusion Model. Its estimator recovers population-average heterogeneous preferences, with a proof of asymptotic convergence even when each anonymous labeler contributes only one choice.

#Alignment#Benchmarking#Research release#Safety/alignment

why featured

HKR-H/K/R pass, but this is an arXiv methods paper whose impact depends on experiments and replication. The one-choice-per-anonymous-annotator convergence claim keeps it interesting, not must-write.

editor take

Response time fixes binary preference bias with one label per user; I buy the math, not deployment, because RLHF UI latency will poison DDM.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Attention Transfer Is Not Universally Effective for Vision Transformers

The paper evaluates 20 teachers from 11 ViT families and finds that Attention Transfer fails in 4 families, falling up to 5.1% below the from-scratch no-transfer baseline.

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the sample size and “below scratch” result add signal. The topic is a ViT distillation benchmark with narrow practitioner resonance, so it stays in the 60–71 band.

editor take

Across 20 teachers, Attention Transfer falls up to 5.1% below the from-scratch baseline in 4 ViT families; attention maps are not a portable API.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MAVEN: Multi-Agent Verification-Elaboration Network with In-Step Epistemic Auditing

MAVEN uses a three-role Skeptic-Researcher-Judge loop to audit reasoning during generation, and the abstract reports better results than GEMINI-3.1-Pro and ReConcile on four benchmarks: OpenBookQA, TruthfulQA, HALUEVAL, and StrategyQA.

#Agent#Reasoning#Benchmarking#MAVEN

why featured

HKR-H/K/R pass, but this is a single arXiv method paper. The post gives the mechanism, benchmark names, and opponents, but no scores, code, or replication details, so it stays below featured.

editor take

MAVEN reports wins over GEMINI-3.1-Pro on four benchmarks. No scores or cost disclosed; treat it as an expensive prompt scaffold.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→The Translation Tax Is Not a Scalar: A Counterfactual Audit of English-Source Cue Inheritance in Chinese Multilingual Benchmarks

The paper audits Translation Tax in English-to-Chinese benchmarks: three proxy estimators disagree, a six-model native-control comparison shows model-family effects rather than uniform benchmark effects, and an LLM naturalization stress test leaves only a residue dose-response after a prompt-construction bug is corrected.

#Benchmarking#arXiv#Research release#Benchmark

why featured

HKR-H/K/R pass through a clear benchmark-audit hook, concrete estimator/model counts, and relevance to Chinese eval trust. Still, this is a single arXiv paper with no artifact or broad industry uptake disclosed, so it stays in all.

editor take

This audit uses 3 proxy estimators, and they disagree; treating Translation Tax as one penalty number looks lazy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

Echo replaces attention layers with Spectral Koopman Attention for KV-cache-free retrieval using O(r²) streaming state; at 50M parameters, SKA-augmented models reach 100% accuracy on tested Multi-Query Associative Recall settings, including 4,096-token distractor gaps with 32 KV pairs.

#Reasoning#Memory#Inference-opt#Echo

why featured

HKR-H/K/R pass: the mechanism and numbers are concrete, and KV-cache cost matters to practitioners. Still, it is a single arXiv paper validated on a 50M-model synthetic recall setup, so it stays below featured.

editor take

Echo 50M hits 100% at 4,096-gap, 32-KV recall; I buy the mechanism, not generalization without open long-context runs.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

The Themis paper introduces Themis-CodeRewardBench and Themis-RM, covering five preference criteria, eight programming languages, evaluations of 50+ reward models, more than 350k preference pairs, and model sizes from 600M to 32B parameters.

#Code#Alignment#Benchmarking#Themis

why featured

HKR-K and HKR-R pass: the paper gives concrete benchmark scale, language coverage, and model ranges, and it matters for code-model evaluation. HKR-H is weak, and an unknown-team arXiv release fits the 60–71 band.

editor take

Themis-RM trains up to 32B on 350k preference pairs; code RMs need to escape execution-pass tunnel vision.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

ResRL modulates negative gradients with projection residuals from an SVD-based low-rank positive subspace, outperforming strong baselines on average across 12 benchmarks covering mathematics, code, agent tasks, and function calling.

#Reasoning#Agent#Code#ResRL

why featured

HKR-K is clear: a new SVD-based gradient mechanism and 12 math/code/agent/function-calling benchmarks. HKR-R exists for reasoning-RL practitioners, but this is a single arXiv method paper without major-lab weight, artifact detail, or production claim.

editor take

ResRL beats NSR by 9.4% Avg@16 on math; I buy the trick, but averages hide per-task regressions.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

The paper proposes CTPO, which uses the cumulative token importance-sampling ratio up to position t for prefix correction and scales log-space clipping bounds by √t; in tool-integrated mathematical reasoning benchmarks, it reports the best average performance across two model scales versus GRPO and GSPO baselines.

#Fine-tuning#Reasoning#Benchmarking#Research release

why featured

HKR-K/R pass: the paper gives a concrete CTPO gradient correction and clipping rule, then claims stronger math-reasoning benchmarks than GRPO/GSPO. The topic is narrow post-training methodology, so it stays in the 60–71 band.

editor take

CTPO fixes each token gradient with prefix cumulative IS and √t clipping; scores are undisclosed, so don’t crown PPO’s successor yet.
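A minimal sketch of the two ingredients named in the summary: a cumulative (prefix) importance-sampling ratio and log-space clipping bounds that widen with √t. The exact CTPO objective is not given in the source, so the symmetric bound ±eps·√t, the default `eps=0.2`, and the function names are assumptions for illustration.

```python
import math

def cumulative_log_ratios(logp_new, logp_old):
    """Prefix sums of per-token log importance ratios up to each position t."""
    out, running = [], 0.0
    for new, old in zip(logp_new, logp_old):
        running += new - old
        out.append(running)
    return out

def clip_sqrt_t(cum_log_ratios, eps=0.2):
    """Clip the cumulative log-ratio at position t (1-indexed) to +/- eps * sqrt(t)."""
    clipped = []
    for t, value in enumerate(cum_log_ratios, start=1):
        bound = eps * math.sqrt(t)
        clipped.append(max(-bound, min(bound, value)))
    return clipped

logp_new = [-1.0, -1.0, -1.0, -1.0]
logp_old = [-1.5, -1.5, -1.5, -1.5]
cum = cumulative_log_ratios(logp_new, logp_old)
print(cum)               # grows linearly here: every token drifts the same way
print(clip_sqrt_t(cum))  # each entry is capped at eps * sqrt(t)
```

A plausible rationale for the √t scaling: if per-token log-ratios behave like roughly independent noise, their running sum has a standard deviation that grows like √t, so a fixed bound would clip late tokens far more often than early ones.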

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→LLMs are not consistently Bayesian: Quantifying internal inconsistencies in probabilistic beliefs

The paper introduces the information processing gap to evaluate how LLMs update probabilistic beliefs from evidence; across multiple approaches, some updates are nearly Bayesian, while others use learned heuristics, and non-Bayesian heuristic updates often outperform exact Bayesian computation on downstream tasks.

#Reasoning#Benchmarking#Interpretability#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper and the provided text lacks model list, task scale, and reproducible setup details; keep it in all below featured.

editor take

The paper proposes an information processing gap, but the snippet omits the model list; LLM heuristics often beat exact Bayes, which is awkward for calibration purists.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→An Interpretable and Scalable Framework for Evaluating Large Language Models

The paper proposes a majorization-minimization-based framework for LLM evaluation and tests it on MATH-500 plus six Open LLM Leaderboard benchmarks; the abstract says it achieves orders-of-magnitude speedups over competing methods while keeping comparable or higher estimation accuracy.

#Benchmarking#Interpretability#arXiv#Open LLM Leaderboard

why featured

HKR-K passes with a concrete evaluation mechanism, benchmark scope, and an orders-of-magnitude speed claim. HKR-R is moderate around eval cost; HKR-H fails, and a single arXiv paper stays in the 60–71 band.

editor take

MM-IRT runs on MATH-500 plus 6 leaderboard benchmarks; if the speedup reproduces, mean accuracy deserves demotion.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Theoretical Limits of Language Model Alignment

The paper derives the maximum expected reward gain under a fixed KL budget, gives a closed-form expression governed by Jeffreys divergence, and evaluates the KL-reward Pareto frontier on two LM tasks, safety and summarization, where best-of-N approaches the theoretical limit while PPO and GRPO remain suboptimal.

#Alignment#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a theory-heavy arXiv paper centered on KL/Jeffreys derivations and limited task tests. Useful for alignment readers, not a same-day must-write.

editor take

This paper bounds reward gain under fixed KL; best-of-N nears the limit while PPO/GRPO lag, so the RLHF training tax looks exposed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Gradient Extrapolation-Based Policy Optimization

GXPO approximates multi-step lookahead with three backward passes, improves sampled pass@1 by 1.65 to 5.00 points over GRPO in Qwen2.5 and Llama math-reasoning experiments, and switches back to single-pass GRPO when the lookahead signal becomes unstable.

#Reasoning#Fine-tuning#Inference-opt#Qwen

why featured

HKR-K is solid and HKR-R is moderate: the paper gives a concrete GXPO mechanism and Qwen2.5/Llama math gains, but it is still a single optimization paper without a major release or production proof.

editor take

GXPO buys up to +5.00 pass@1 with three backward passes; I buy it if the fallback holds beyond math benchmarks.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

LKV formulates KV cache compression as end-to-end differentiable optimization, combining LKV-H for task-optimized global budgets and LKV-T for intrinsic KV importance, and reports near-lossless LongBench performance with only 15% KV cache retention.

#Inference-opt#Enshuai Zhou#Yunji Chen#arXiv

why featured

HKR-H/K/R all pass, but this is an arXiv inference-optimization paper with limited source authority in the excerpt. The 15% KV-cache claim is useful signal, not a same-day featured story.

editor take

LKV keeps 15% KV on LongBench with near-lossless scores; I buy learned budgets, not throughput claims from an abstract.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→On the Invariance and Generality of Neural Scaling Laws

The paper proposes transferable neural scaling laws that use information resolution ρ to connect source and target domains, validates the invariants across language, vision, and speech, and reports time-series data-scaling exponent recovery within 3% error under varying noise injection levels.

#Benchmarking#Reasoning#Research release

why featured

HKR-K is solid: ρ, language/vision/speech validation, and <3% error are testable claims. HKR-R is present via training-cost forecasting, but HKR-H is weak and this remains a single arXiv paper.

editor take

The paper ports scaling laws via information resolution ρ, with <3% time-series exponent error; promising, but EHR transfer needs replication.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MIPIAD: Multilingual Indirect Prompt Injection Defense with Qwen, TF-IDF, and Meta-Ensembles

MIPIAD evaluates indirect prompt-injection defense for English and Bangla RAG and tool-using LLM settings, combining a Qwen2.5-1.5B LoRA classifier, TF-IDF features, and meta-ensembles on 1.43 million synthetic samples, with the best hybrid ensemble reaching 0.9205 F1 and boosting reaching 0.9378 AUROC.

#RAG#Tools#Safety#Qwen

why featured

HKR-K and HKR-R pass: the paper gives a concrete defense setup and F1 for multilingual prompt injection, relevant to RAG/agent security. Single arXiv paper with synthetic data keeps it below featured.

editor take

MIPIAD hits 0.9205 F1 on 1.43M synthetic samples; only English and Bangla are tested, so don’t buy the 200-language aura.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

The paper evaluates prompt-injection defenses on 480 educational tutoring queries, where a multi-layer pipeline reaches 46.34% bypass, 0.00% false positive rate, and 2.50 ms average latency.

#Safety#Alignment#Benchmarking#Prompt Guard

why featured

HKR-K/R pass: the paper gives test size, bypass rate, false positives, and latency. HKR-H is weak, and a single arXiv benchmark lacks the spread needed for featured.

editor take

The multi-layer pipeline still shows a 46.34% bypass rate in the 480-query evaluation; zero FPR is nice, but a defense that lets nearly half of attacks through is shaky for tutors.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Flock: A Knowledge Graph Foundation Model via Learning on Random Walks

Flock uses probabilistic node-relation equivariance and sampled random walks for zero-shot link prediction on knowledge graphs, perfectly solves the Petals diagnostic dataset, and reports state-of-the-art entity and relation prediction results across 54 knowledge graphs from diverse domains.

#Reasoning#Embedding#Benchmarking#Flock

why featured

HKR-H/K pass: the KG foundation-model angle is fresh, and the abstract gives 54 graphs plus zero-shot SOTA. HKR-R is weak because this is a niche link-prediction paper, so it stays in the 60–71 research-signal band.

editor take

Flock reports SOTA on 54 KGs; random-walk symmetry breaking is smart, but Petals is self-made, so replication matters.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

GameGen-Verifier decomposes game specifications into keypoints, injects concrete runtime states, and verifies bounded interactions, reaching up to 92.2% accuracy against human judgments on 100 VeriGame titles across seven genres versus 58.8% for the coverage-enforced Agent-as-a-Verifier baseline.

#Agent#Code#Benchmarking#GameGen-Verifier

why featured

HKR-H and HKR-K pass: the mechanism and metrics are concrete, and LLM-made game verification is a fresh angle. The paper remains a niche eval story without major-lab release, open-source pull, or production adoption, so it stays in 60–71.

editor take

GameGen-Verifier hits 92.2% human agreement on 100 games; state injection beats pretending agent playthroughs verify mechanics.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Flexible Routing via Uncertainty Decomposition

The paper presents an uncertainty-aware router that decomposes total uncertainty into reducible and irreducible components, then adapts to different loss functions and cost parameters through hyperparameter changes without retraining.

#Reasoning#Inference-opt#Ahdritz et al.#Research release

why featured

HKR-K and HKR-R pass: the mechanism is concrete and relevant to inference routing costs. HKR-H is weak, and the post gives no metrics, benchmarks, or artifact, so it stays in the 60–71 band.

editor take

Ahdritz et al. bind routing to multi-annotation classification; no-retrain cost tuning is neat, but correlation decides its range.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
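One common way to split total uncertainty into reducible and irreducible parts, as the routing item above describes, is the entropy-based decomposition sketched below. This is an assumption about the general idea, not the paper's estimator: `member_probs`, `decompose`, and `route` are hypothetical names, and the cost rule is purely illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose(member_probs):
    """Split predictive uncertainty into (total, irreducible, reducible).

    member_probs: class distributions from several hypothetical ensemble
    members or annotators for the same input.
      total       = entropy of the averaged distribution
      irreducible = average entropy of the individual distributions
      reducible   = total - irreducible (how much the members disagree)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n for c in range(k)]
    total = entropy(mean)
    irreducible = sum(entropy(p) for p in member_probs) / n
    return total, irreducible, total - irreducible

def route(member_probs, cost):
    """Escalate only when the reducible part outweighs the escalation cost."""
    _, _, reducible = decompose(member_probs)
    return "escalate" if reducible > cost else "keep"

agree = [[0.5, 0.5], [0.5, 0.5]]      # noisy label, members agree
disagree = [[1.0, 0.0], [0.0, 1.0]]   # confident members that conflict
print(route(agree, cost=0.3), route(disagree, cost=0.3))
```

Changing `cost` changes the routing decision without touching the model, which is the kind of no-retrain tuning the summary points to.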

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Fluxion combines output-aware KV budgeting, head-specific sparse configuration, and priority scheduling for CPU-resident KV caches, delivering 1.5×-3.7× speedups over the strongest fixed sparse hybrid baseline across 2 models, 3 benchmarks, and 40 tasks, with worst average quality degradation of -0.26 versus FULL.

#Inference-opt#Fluxion#Research release#Benchmark

why featured

HKR-K and HKR-R pass: the paper gives concrete speedup numbers and scheduling mechanics tied to long-context inference cost. HKR-H is weak, and no open-source artifact or production adoption is disclosed, so it stays in 60–71.

editor take

Fluxion claims 1.5×-3.7× on 40 tasks; with a 0.05 KV-budget baseline, I want vLLM-style serving numbers first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation

AdaHOP applies IHT and OE by three outlier patterns in LLM training matrix multiplications, enabling from-scratch MXFP4 training with BF16-level quality while reaching up to 3.6x memory compression and 1.46x end-to-end speedup over BF16.

#Fine-tuning#Inference-opt#AdaHOP#Triton

why featured

HKR-K/R pass: the paper gives a concrete low-precision training mechanism and measurable cost gains. HKR-H is weak because the angle is ML-systems-heavy, so it stays in the interesting-not-featured band.

editor take

AdaHOP trains MXFP4 from scratch at BF16 quality; 3.6x compression and 1.46x speedup look good, but OE overhead is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Scalable Option Learning in High-Throughput Environments

The paper introduces Scalable Option Learning for hierarchical RL, reports about 35x higher throughput than existing hierarchical methods, trains agents on 30 billion NetHack frames, validates the method on MiniHack and MuJoCo, and releases code at facebookresearch/sol.

#Agent#Meta#NetHack#MuJoCo

why featured

HKR-H/K pass on the 35x throughput, 30B NetHack frames, and open code. HKR-R is weak: this is a hierarchical-RL paper, not an LLM-agent product update, so it stays in 60–71.

editor take

SOL trains on 30B NetHack frames at ~35x throughput; I care less about scores than option learning finally surviving scale.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→VDCook: DIY video data cook your MLLMs

VDCook provides a configurable video data construction platform where users submit natural-language requests plus scale, retrieval-synthesis ratio, and quality-threshold parameters to generate in-domain data packages with provenance, metadata, and reproducible Notebooks.

#Multimodal#Vision#Tools#VDCook

why featured

HKR-K and HKR-R pass: the article gives a concrete data-building mechanism for video MLLMs. No benchmark gains, open-source link, or deployment evidence are disclosed, so it stays in the 60–71 research-tool band.

editor take

VDCook takes natural-language requests plus 3 parameters for video data packs; no benchmarks disclosed, so don’t confuse plumbing with model progress.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

SAEgis inserts a sparse autoencoder into a pretrained VLM and trains it with standard reconstruction objectives, using sparse latent features to classify adversarially perturbed images across in-domain, cross-domain, and cross-attack settings.

#Vision#Safety#Interpretability#SAEgis

why featured

Single arXiv safety paper with a clear mechanism but no metrics, code, or independent reproduction disclosed, so it stays in the 60–71 band. HKR-H/K/R pass, but not enough for featured.

editor take

SAEgis plugs reconstruction-trained SAEs into VLMs, but metrics aren’t disclosed; “firewall” sounds inflated for an attack detector.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

The paper tests LLM behavioral coherence with latent-profile questions and multi-agent conversations, finding significant inconsistencies across model families and sizes; the RSS abstract does not disclose sample counts, model names, or benchmark scores.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but the post gives the mechanism and inconsistency claim without sample size, model list, or effect sizes. A single arXiv paper stays in the 60–71 band.

editor take

The paper tests LLM-agent behavioral coherence, but sample counts and model names are undisclosed; synthetic social science still lacks a spine.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

TS-DFM trains an 8-step student for 170M-parameter language modeling that achieves 32% lower perplexity than a 1,024-step teacher and runs 128x faster, while its lightweight energy compass shapes trajectories only during training and leaves inference cost unchanged.

#Inference-opt#Fine-tuning#Reasoning#Research release

why featured

HKR-H/K/R pass: the contrast is sharp, the numbers are concrete, and inference cost matters. Still, this is a specialized arXiv method tested at 170M scale, with no disclosed large-model or production result, so it stays in the all tier.

editor take

TS-DFM’s 8-step student beats its 1,024-step teacher by 32% perplexity; I buy training-time energy guidance, not 170M extrapolation yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Multilingual Safety Alignment via Self-Distillation

The paper proposes Multilingual Self-Distillation, a cross-lingual safeguard transfer framework that uses only multilingual queries and two on-policy/off-policy variants to transfer safety behavior from high-resource languages such as English to low-resource languages such as Javanese.

#Safety#Alignment#Fine-tuning#Research release

why featured

HKR-K and HKR-R pass: the mechanism is specific and the topic maps to multilingual safety gaps. No metrics, model list, or reproducible results are disclosed, and HKR-H is weak, so this stays in 60–71.

editor take

MSD transfers safety using only multilingual queries; no model names or gains disclosed, so I’d file it as a low-resource safety patch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→TopoPrune: Robust Data Pruning via Unified Latent Space Topology

TopoPrune prunes datasets with a dual-scale topology pipeline, using manifold approximation and differentiable persistent homology to rank samples by structural complexity, and reports high accuracy at 90% pruning while improving robustness to latent feature noise and transfer across network architectures.

#Fine-tuning#Benchmarking#TopoPrune#arXiv

why featured

HKR-H/K/R pass, but this is a single arXiv methods paper with a high topology/persistent-homology bar. Datasets, model sizes, code, and reproducibility details are not disclosed, so it stays in the all tier.

editor take

TopoPrune reports high accuracy at 90% pruning; datasets and baselines aren’t disclosed, so don’t buy topology yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→A Comparative Analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

The paper compares layer-wise representations in LLaDA, Qwen2.5, and Dream-7B, using cosine similarity and static inference-time layer skipping, and finds native diffusion LLMs keep over 90% performance on math-reasoning and coding benchmarks while reducing FLOPs by up to 18.75%.

#Reasoning#Code#Inference-opt#LLaDA

why featured

HKR-K and HKR-R pass: 18.75% FLOP reduction with >90% retained math/code performance is testable and cost-relevant. HKR-H is weak, and the layer-representation angle is too narrow for featured.

editor take

LLaDA skips layers for up to 18.75% FLOP savings at 90%+ performance; Dream-7B still smells AR, so initialization bias survives diffusion training.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
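The layer-skipping item above rests on cosine similarity between layer representations. A minimal sketch of that measurement follows, assuming toy hidden vectors and a hypothetical threshold; the paper's actual selection rule and layer choices are not given in the summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def skip_candidates(hidden_states, threshold):
    """Flag layers whose output points almost the same way as their input.

    hidden_states[i] is the (hypothetical) vector entering layer i and
    hidden_states[i + 1] is what layer i produces.
    """
    return [i for i in range(len(hidden_states) - 1)
            if cosine(hidden_states[i], hidden_states[i + 1]) >= threshold]

# Toy trace through three layers: only the middle one barely moves the state.
states = [[1.0, 0.0], [0.0, 1.0], [0.05, 1.0], [1.0, 1.0]]
print(skip_candidates(states, threshold=0.99))
```

A static skip list built this way is fixed before inference, which matches the "static inference-time layer skipping" wording in the summary.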

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

The paper proposes Direct Reasoning Optimization for unverifiable tasks, combining a token-level dense Reasoning Reflection Reward with rollout-group rubric-gating constraints, and reports stronger, faster, more sample-efficient learning than baselines across four datasets: scientific writing, medicine, legal contracts, and finance.

#Reasoning#Alignment#Fine-tuning#Research release

why featured

HKR-K is solid: it gives a concrete training mechanism and four test domains. HKR-R applies to alignment for unverifiable work, but this is a single arXiv paper with no artifact, adoption, or cross-source cluster.

editor take

DRO beats strong baselines on 4 unverifiable-task sets; I like variance-picked tokens, but no table numbers here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Tracing Uncertainty in Language Model "Reasoning"

The paper treats reasoning traces as evolving model states and uses uncertainty trace profiles to predict answer correctness across five LMs on GSM8K and ProntoQA, reaching AUROC up to 0.807 and AUROC 0.801 from only the first few hundred tokens.

#Reasoning#Interpretability#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives a testable uncertainty-profile method and AUROC 0.807 for reasoning reliability. HKR-H is weak, and a single arXiv paper stays below featured.

editor take

Uncertainty trace profiles reach AUROC up to 0.807 across five LMs on GSM8K and ProntoQA; I trust early uncertainty curves over post-hoc CoT stories.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
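For readers who want to see what an AUROC over early-trace uncertainty means in practice, here is a small sketch. The per-token entropies, the `early_confidence` feature, and the labels are made up for illustration; the paper's trace profiles are richer than a single mean.

```python
def early_confidence(token_entropies, first_k):
    """Negative mean entropy over the first_k tokens (higher = more confident)."""
    head = token_entropies[:first_k]
    return -sum(head) / len(head)

def auroc(scores, labels):
    """Chance that a random correct answer outscores a random wrong one.

    Ties count as half a win. labels use 1 for correct, 0 for incorrect.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-token entropies for four reasoning traces.
traces = [[0.1, 0.2, 0.1], [0.9, 1.1, 1.0], [0.3, 0.2, 0.4], [0.8, 0.2, 0.9]]
correct = [1, 0, 1, 0]
scores = [early_confidence(t, first_k=2) for t in traces]
print(auroc(scores, correct))
```

An AUROC of 0.5 means the early signal is no better than chance; the reported 0.801 from the first few hundred tokens sits well above that.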

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SWaRL: Safeguard Code Watermarking via Reinforcement Learning

SWaRL introduces a reinforcement-learning co-training framework for code watermarking, using compiler feedback, a confidential verifier reward, and LoRA fine-tuning; the abstract says experiments preserve functional correctness and resist refactoring and adversarial transformations, but it does not disclose benchmark names or exact accuracy numbers.

#Code#Fine-tuning#Safety#SWaRL

why featured

HKR-K/R pass: the abstract gives testable mechanisms and a refactoring-resistance claim tied to AI-code safety. HKR-H is weak, and this is a single arXiv paper with no artifact or visible debate, so it stays in 60–71.

editor take

SWaRL uses RL co-training for code watermarking, but gives no benchmarks or accuracy; I’d doubt its refactor resistance survives real cleanup.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

The paper proposes Shadow Mask Distillation to compress KV cache during RL post-training rollouts, and its abstract says PPO, GRPO, and Online DPO face a memory wall on long-context reasoning tasks because rollout sampling uses large KV-cache footprints.

#Reasoning#Alignment#Inference-opt#Research release

why featured

HKR-K/R pass: SMD targets KV-cache memory walls in PPO, GRPO, and Online DPO long-context rollouts. HKR-H is narrow, and the summary gives no compression ratio, speedup, or reproduction detail, so this stays in the all tier.

editor take

Shadow Mask Distillation targets rollout KV cache; naming PPO, GRPO, and Online DPO makes this an RL bias paper, not mere memory plumbing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→DGPO: Distribution Guided Policy Optimization for Fine-Grained Credit Assignment

DGPO replaces the token-level KL penalty with bounded Hellinger distance and entropy gating, then reports 60.0% Avg@32 on AIME2024 and 46.0% Avg@32 on AIME2025 using Qwen2.5-32B.

#Reasoning#Alignment#Fine-tuning#Qwen

why featured

HKR-K/R pass: the mechanism and AIME numbers are concrete, and the topic maps to post-training competition. HKR-H fails; this is a single arXiv item with no code, external debate, or production impact disclosed.

editor take

DGPO hits 60.0% Avg@32 on AIME2024 with Qwen2.5-32B; I buy bounded Hellinger stability more than the credit-assignment framing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
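The "bounded" part of the DGPO item is easy to check numerically. The sketch below compares the Hellinger distance with KL divergence on two toy token distributions; it only illustrates why a bounded penalty behaves differently and is not DGPO's training objective.

```python
import math

def hellinger(p, q):
    """Hellinger distance between discrete distributions; always in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def kl(p, q):
    """KL(p || q) in nats; infinite when q has no mass where p does."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue
        if b == 0:
            return math.inf
        total += a * math.log(a / b)
    return total

policy = [1.0, 0.0]      # toy next-token distribution from the policy
reference = [0.0, 1.0]   # toy distribution from the reference model
print(hellinger(policy, reference), kl(policy, reference))
```

When the two distributions stop overlapping, KL diverges while Hellinger saturates at 1, so a Hellinger penalty cannot blow up on a single token.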

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

FAME outperformed frontier LLM evaluators on prospective multidimensional impact forecasting across 3,200 arXiv papers from three fast-evolving subfields, using textual features, a verified knowledge-flow graph, and dynamic latent-space trajectories to model scientific topic evolution.

#Reasoning#Benchmarking#FAME#arXiv

why featured

HKR-H and HKR-K pass: the claim beats frontier LLM evaluators and names a 3,200-paper, 3-subfield setup. HKR-R is weak because this is academic-impact forecasting, not a product or practitioner workflow story.

editor take

FAME beats frontier LLM judges on 3,200 arXiv papers; I buy trajectory signals, not the clean proxy of “impact.”

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

The paper proposes Dr. Post-Training, which builds a feasible update set from general data at each training step and projects the target-data update into it; experiments cover SFT, RLHF, and RLVR, but the snippet does not disclose model sizes or metric values.

#Fine-tuning#Alignment#Inference-opt#Research release

why featured

HKR-K/R pass: the mechanism is concrete and spans SFT, RLHF, and RLVR. Model scale and metric values are not disclosed, and the paper remains fairly technical, so it fits the 60–71 research-release band.

editor take

Dr. Post-Training projects each target-data update into a feasible set built from general data; no model sizes or metrics, so I read it as data selection recast as regularization.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
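The snippet for Dr. Post-Training does not say how the feasible update set is built, so the sketch below swaps in a simple stand-in from gradient-projection methods: treat "do not move against the general-data gradient" as a single half-space constraint and project the target update onto it. The function name and the one-constraint setup are assumptions, not the paper's construction.

```python
def project_update(target, general):
    """Project `target` onto the half-space {u : u . general >= 0}.

    target  : update direction computed from target data (toy vector)
    general : update direction computed from general data (toy vector)
    If the two already agree, the target update is returned unchanged.
    """
    dot = sum(t * g for t, g in zip(target, general))
    if dot >= 0:
        return list(target)
    norm_sq = sum(g * g for g in general)
    # Remove only the component that conflicts with the general direction.
    return [t - (dot / norm_sq) * g for t, g in zip(target, general)]

target = [1.0, -1.0]
general = [0.0, 1.0]
projected = project_update(target, general)
print(projected)
```

The projected update keeps the part of the target step that is neutral for general data and drops the part that would undo it.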

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass

The paper introduces the convergence gap diagnostic and compares six paired pretrained and instruction-tuned checkpoints; instruction-tuned models stay farther from their final next-token distribution deeper in the stack, and late MLP swaps change late KL by +0.34 nats for IT grafts into PT hosts and -0.51 nats for PT-late swaps into IT hosts.

#Interpretability#Fine-tuning#arXiv#Gemma

why featured

HKR-H/K pass: the title has a counterintuitive layer-dynamics hook, and the paper gives six checkpoint pairs plus KL deltas. HKR-R is weak because this is niche interpretability research without direct product or safety impact.

editor take

Six checkpoint pairs show instruction tuning delays convergence; late-MLP swaps moving late KL by +0.34 and -0.51 nats are a hard handle, not alignment folklore.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Rollback-Free Stable Brick Structures Generation

The paper proposes a reinforcement-learning method for stable brick-structure generation, moving physical validity from inference-time rejection and rollbacks to training-time policy optimization with assembly-level rewards for collision avoidance, connectivity, interlocking, and shape conformity; the authors report state-of-the-art quality and orders-of-magnitude faster inference, with code, dataset, and models released.

#Robotics#Reasoning#miniHuiHui#Hugging Face

why featured

HKR-H and HKR-K pass: the “rollback-free stable bricks” angle is concrete, and the post states training-time RL, assembly-level rewards, and open code/data/models. HKR-R is weak because this is a niche robotics-generation paper, so it stays in 60–71.

editor take

STABLE moves brick stability into training-time RL; the speedup is reported only as “orders of magnitude,” so I’d inspect simulator overfitting first.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Predictive but Not Plannable: RC-aux for Latent World Models

The paper introduces RC-aux for reconstruction-free latent world models, adding multi-horizon open-loop prediction and budget-conditioned reachability supervision to LeWorldModel, and reports improved LeWM-style planning on goal-conditioned pixel-control tasks and a LIBERO-Goal extension with modest additional cost.

#Reasoning#Robotics#Tools#LeWorldModel

why featured

HKR-H and HKR-K pass: the paper has a sharp planning-vs-prediction hook and concrete training mechanisms for pixel control and LIBERO-Goal. The absence of gain numbers, artifact detail, or product impact keeps it in the 60–71 band.

editor take

RC-aux adds multi-horizon prediction and budgeted reachability to LeWorldModel; I buy the diagnosis, but LIBERO sim wins still leave robot transfer open.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Graph Representation Learning Augmented Model Manipulation on Federated Fine-Tuning of LLMs

The paper proposes AugMP against federated fine-tuning of LLMs, using graph representation learning and an augmented Lagrangian dual algorithm to generate malicious updates that reduce global LLM accuracy by up to 26% and local agent average accuracy by up to 22%.

#Fine-tuning#Safety#Benchmarking#arXiv

why featured

HKR-K/R pass: the paper gives a concrete attack mechanism and 26%/22% accuracy-drop claims, with relevance to federated fine-tuning security. HKR-H is weak due to jargon and narrow reach, so it stays in the 60–71 band.

editor take

AugMP drops global accuracy by up to 26%; federated fine-tuning needs attack benchmarks beyond privacy and update-similarity filters.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents

The paper introduces DIAL, a sparse gate that learns state-feature utility direction from signal-agnostic counterfactual exploration across six environments and three backbones; fixed-direction gates reverse across settings and can reduce success by selecting states where extra rollout compute harms the base policy.

#Agent#Reasoning#Inference-opt#DIAL

why featured

HKR-H and HKR-K pass: the paper has a counterintuitive gating claim and concrete tests across 6 environments and 3 backbones. HKR-R is weak because the post gives no production impact, safety stakes, or broad industry conflict.

editor take

DIAL tests 6 environments and 3 backbones; fixed uncertainty gates look neat until rollout compute actively hurts.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Generating Training Datasets for Legal Chatbots in Korean

The researchers used local grammar graphs and the open-source Unitex platform to generate 700 million labeled Korean legal chatbot utterances, then trained a DIET classifier for LIGA that reached a 91% F1 score and selects links to public Korean government case pages.

#Agent#Fine-tuning#Benchmarking#Unitex

why featured

HKR-H/K pass: 700M synthetic utterances and 91% F1 add concrete signal. HKR-R is weak because the Korean legal-chatbot scope is narrow, keeping it below featured.

editor take

LIGA generated 700M Korean legal utterances with Unitex and hit 91% F1; I don’t buy generalization without real-user validation.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Structural Rationale Distillation via Reasoning Space Compression

The paper proposes D-RPC, which constrains a teacher model with a dynamic bank of reusable reasoning paths; across five math and commonsense reasoning benchmarks and two student models, it outperforms chain-of-thought distillation, freeform rationale generation, direct distillation, and structured-supervision baselines.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete distillation mechanism and benchmark setup, tied to small-model reasoning cost. As a single arXiv method paper without a release artifact or production-scale claim, it stays in 60–71.

editor take

D-RPC wins on 5 reasoning benchmarks and 2 students; I buy rationale compression, but absolute gains are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Arrow: A Foundation Model for Causal Discovery

Arrow performs zero-shot causal discovery on observational tabular data by factorizing a directed acyclic graph into an undirected skeleton and a topological order, training on synthetic datasets with ground-truth graphs and using the skeleton-order construction to guarantee acyclicity.

#Reasoning#Arrow#Research release

why featured

HKR-H and HKR-K pass: the title brings a foundation-model angle to causal discovery, and the summary gives mechanisms. No metrics, code, or product impact are disclosed, so it stays in the 60–71 research band.

editor take

Arrow guarantees DAG acyclicity via skeleton-order; trained on synthetic ground truth, so hidden confounding is the stress test.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
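The acyclicity guarantee in the Arrow item follows directly from the factorization: if every skeleton edge is pointed from the earlier node to the later node in one total order, no cycle can form. A minimal sketch, with a hypothetical three-node skeleton and order:

```python
from collections import deque

def orient(skeleton, order):
    """Direct each undirected edge from the earlier to the later node."""
    rank = {node: i for i, node in enumerate(order)}
    return [(u, v) if rank[u] < rank[v] else (v, u) for u, v in skeleton]

def is_acyclic(nodes, edges):
    """Kahn's algorithm: a graph is a DAG iff every node can be peeled off."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(nodes)

# A triangle skeleton is the worst case for cycles; the order still breaks it.
skeleton = [("a", "b"), ("b", "c"), ("c", "a")]
order = ["b", "a", "c"]
dag = orient(skeleton, order)
print(dag, is_acyclic(order, dag))
```

Whatever skeleton and order a model predicts, the combined output is a DAG by construction, so no post-hoc cycle removal is needed.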

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Learning Visual Feature-Based World Models via Residual Latent Action

The paper proposes RLA-WM, a visual feature-based world model that learns Residual Latent Action from DINO residuals and predicts RLA values with flow matching; it reports stronger results than feature-based and video-diffusion world models on simulation and real-world datasets, with orders-of-magnitude faster inference than video diffusion.

#Vision#Robotics#Reasoning#DINO

why featured

HKR-K and HKR-R pass: the mechanism is concrete, and the claim of orders-of-magnitude faster inference than video diffusion matters for robotics world models. HKR-H is weak; this single arXiv paper remains too niche for featured.

editor take

RLA-WM learns latent actions from DINO residuals and claims orders-of-magnitude faster inference than video diffusion; I buy the efficiency angle, pending offline-video RL replication.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mixture of Masters: Sparse Chess Language Models with Player Routing

The paper introduces Mixture-of-Masters, a chess mixture-of-experts model that uses small GPT experts to emulate world-class grandmasters and a post-hoc learnable gating network to select a persona for each move based on game state.

#Reasoning#Interpretability#Stockfish#GPT

why featured

HKR-H and HKR-K pass: the paper offers a concrete sparse expert/persona-routing setup. Impact stays inside chess modeling and interpretability, with no product or general-agent implication, so it lands in the all tier.

editor take

MoM routes each chess move to a grandmaster persona; no win rate or expert count is disclosed, so don’t call this reasoning yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→No Forgetting Learning: Buffer-Free Continual Learning Classification

NFL matches memory-based continual learning methods on CIFAR-100, Tiny-ImageNet, and ImageNet-1000 across up to 50 incremental tasks, while NFL+ requires only 2.53% of their model size and uses no replay buffer.

#Vision#Fine-tuning#Benchmarking#arXiv

why featured

HKR-H/K pass: the title frames a buffer-free no-forgetting claim, and the summary gives 50 incremental tasks plus 2.53% model size. HKR-R is weak because this remains specialist continual-learning research without product or developer impact.

editor take

NFL+ matches replay methods over 50 incremental tasks at 2.53% model size; I'd audit task-head cost and class-incremental protocol first.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→The Proxy Presumption: From Semantic Embeddings to Valid Social Measures

The paper introduces the Construct Validity Protocol to validate embedding-based social measures with three validity tests, and uses LLM-based Counterfactual Neutralization to reduce confounding from topic, style, and authorship.

#Embedding#Alignment#Benchmarking#Research release

why featured

HKR-K is clear: the paper offers a protocol, three test classes, and a deconfounding method. HKR-R is present for embedding bias/evaluation concerns, but the work is methodological with no product or broad industry trigger.

editor take

CVP adds three validity tests for embedding-based social measures; I buy the problem, but LLM neutralization reproducibility is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

MaPPO incorporates prior reward estimates into a maximum a posteriori preference optimization objective and reports consistent alignment gains on three benchmarks: MT-Bench, AlpacaEval 2.0, and Arena-Hard, without adding hyperparameters.

#Alignment#Fine-tuning#Benchmarking#MaPPO

why featured

HKR-K/R pass: the mechanism and three benchmark claims are concrete, but no uplift numbers are disclosed. As an arXiv optimization paper without a major lab or artifact hook, it stays in the all tier.

editor take

MaPPO reports gains on 3 benchmarks; the snippet gives no deltas, so I’d treat it as a DPO-family patch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Conformal Agent Error Attribution

The paper proposes a conformal prediction framework for MAS error attribution, gives finite-sample, distribution-free coverage guarantees for agent trajectories, and uses contiguous-sequence prediction sets to support rollback-based correction.

#Agent#Reasoning#Alignment#Layer6 AI Labs

why featured

HKR-K/R pass: the paper links multi-agent failure attribution to conformal guarantees and rollback correction. HKR-H misses; as a single technical arXiv paper without experiment scale, code status, or production proof, it stays in the all tier.

editor take

Layer6 applies conformal prediction to MAS traces; the useful bit is contiguous rollback intervals, not another agent-debugging story.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Outlier Smoothing with Closed-Form Rotations for W4A4 Large Language Model Quantization

SingleQuant applies ART and URT closed-form Givens rotations to smooth activation outliers for W4A4 LLM quantization; on LLaMA-2-13B, it reports a 1,400× quantization speedup and a 0.57% average task performance gain over the selected best baseline.

#Inference-opt#SingleQuant#LLaMA-2#Research release

why featured

HKR-K/R are strong thanks to the concrete W4A4 mechanism and metrics. Accessibility is narrow: ART/URT rotations and quantization internals keep it in the 60–71 band.

editor take

SingleQuant reports 1,400× faster LLaMA-2-13B quantization; +0.57% accuracy is thin, so replication carries the claim.
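
For readers who want the intuition behind rotation-based outlier smoothing, here is a minimal sketch, assuming a single generic Givens rotation between an outlier channel and a quiet one. It is not the paper's ART or URT construction; `givens_rotate` and `balancing_angle` are hypothetical helper names. The angle has a closed form because it is chosen so that both rotated coordinates end up with equal magnitude.

```python
import math

def givens_rotate(x, i, j, theta):
    """Rotate coordinates i and j of vector x by angle theta (a Givens rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    y = list(x)
    y[i] = c * x[i] - s * x[j]
    y[j] = s * x[i] + c * x[j]
    return y

def balancing_angle(a, b):
    """Closed-form angle that makes the two rotated coordinates equal.

    Solving c*a - s*b = s*a + c*b gives tan(theta) = (a - b) / (a + b).
    """
    return math.atan2(a - b, a + b)

# Toy activation vector with one outlier channel (illustrative values only).
activations = [8.0, 0.5, -0.3, 0.2]
theta = balancing_angle(activations[0], activations[1])
smoothed = givens_rotate(activations, 0, 1, theta)

peak_before = max(abs(v) for v in activations)
peak_after = max(abs(v) for v in smoothed)
norm_before = math.sqrt(sum(v * v for v in activations))
norm_after = math.sqrt(sum(v * v for v in smoothed))
```

The rotation is orthogonal, so the vector norm is unchanged while the peak magnitude drops, which is what makes a 4-bit grid less wasteful. In rotation-based quantization schemes generally, the inverse rotation is folded into the adjacent weights so the layer output stays the same; whether SingleQuant does exactly that is not stated in the snippet.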

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Toeplitz MLP Mixers are Low-Complexity, Information-Rich Sequence Models

The paper introduces Toeplitz MLP Mixer, which replaces attention with triangular-masked Toeplitz multiplication over sequences. It reports O(dn log n) training time, O(dn) training space, O(dn) inference prefill cost, and better copying, retrieval, and in-context learning benchmark accuracy than comparable sub-quadratic architectures.

#Inference-opt#Benchmarking#Reasoning#Research release

why featured

HKR-H/K/R pass: the mechanism, complexity, and benchmark claim are concrete, and the topic touches attention cost. Kept in 60-71 because it is an arXiv architecture paper with no code, scale evidence, or adoption disclosed.

editor take

TMM swaps attention for triangular Toeplitz multiplication: O(dn log n) training, O(dn) prefill; I doubt prefill is the whole pain.
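
The O(dn log n) figure is what you would expect from a standard trick: a lower-triangular Toeplitz matrix times a sequence is a causal convolution, and a zero-padded FFT computes it without materializing the n×n matrix. The sketch below assumes one kernel shared across all d features, which is an illustrative simplification and not necessarily the paper's parameterization; both function names are hypothetical.

```python
import numpy as np

def causal_toeplitz_direct(kernel, x):
    """O(n^2) reference: y[i] = sum over j <= i of kernel[i - j] * x[j]."""
    n = x.shape[0]
    # Lower-triangular Toeplitz matrix: entry (i, j) depends only on i - j.
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            T[i, j] = kernel[i - j]
    return T @ x

def causal_toeplitz_fft(kernel, x):
    """Same product in O(d n log n) via zero-padded FFT (linear convolution)."""
    n = x.shape[0]
    m = 2 * n  # length >= 2n - 1 prevents circular wrap-around
    K = np.fft.rfft(kernel, n=m)
    X = np.fft.rfft(x, n=m, axis=0)
    y = np.fft.irfft(K[:, None] * X, n=m, axis=0)
    return y[:n]  # keep the first n outputs: the causal part

rng = np.random.default_rng(0)
n, d = 16, 4
kernel = rng.standard_normal(n)
x = rng.standard_normal((n, d))

y_direct = causal_toeplitz_direct(kernel, x)
y_fft = causal_toeplitz_fft(kernel, x)
max_err = float(np.max(np.abs(y_direct - y_fft)))
```

The triangular mask is what makes the mixer causal: output position i only sees inputs at positions j ≤ i, so perturbing the last token leaves every earlier output unchanged.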

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

The paper proposes a unified framework for VLM layer skipping, using experimentally verifiable redundancy conditions to judge pruning benefits without downstream task metrics, and validates that early and late vision tokens are redundant across models.

#Multimodal#Vision#Inference-opt#Research release

why featured

HKR-H/K/R all register: VLM layer skipping ties to inference cost, and the paper offers testable redundancy conditions. Missing speedup, accuracy loss, and model names keep it in the interesting band.

editor take

This paper gives VLM layer skipping testable redundancy conditions; I buy the direction, but no models or speedup numbers are disclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Emergent Manifold Separability during Reasoning in Large Language Models

The paper applies Manifold Capacity Theory to 2 compositional reasoning tasks and finds that several open-weight models briefly untangle concept manifolds into linearly separable subspaces just before computation, while linear-probe accuracy remains high after the computation step.

#Reasoning#Interpretability#Research release

why featured

HKR-K passes on 2 compositional tasks, Manifold Capacity Theory, and probe dynamics; HKR-H has the odd “before computation” hook. HKR-R is weak, and the technical bar keeps it below featured.

editor take

The paper tests 2 compositional tasks; MCT pulses beat linear probes as a pre-computation signal, but calling it a mechanism feels premature.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→End-to-end PDDL Planning with Hardcoded and Dynamic Agents

The paper presents an LLM-driven PDDL planning framework tested across more than 10 domains with GPT-4o, GPT-5-mini, GPT-5.4, and Gemini-2.5/3-flash, using hardcoded agents for predefined fixes, dynamic agents for domain-specific abstraction revision, and external planners such as Fast Downward, LPG, POPF, VAL, and uVAL.

#Agent#Reasoning#Tools#OpenAI

why featured

HKR-K and HKR-R pass: agent planning plus PDDL is useful, and 10+ domains with named models are concrete. HKR-H fails; no result, win rate, or failure mode is disclosed, so this stays in the 60–71 band.

editor take

The paper spans 10+ domains and five PDDL tools; LLMs write specs, then old-school planners do the hard part.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

TSRBench introduces a time-series reasoning benchmark with 4,125 problems across 14 domains, covering four dimensions: perception, reasoning, prediction, and decision-making, and evaluates more than 30 proprietary and open-source LLMs, VLMs, and TSLLMs.

#Reasoning#Multimodal#Benchmarking#TSRBench

why featured

HKR-K and HKR-R pass: the paper gives concrete dataset scale and model coverage, and targets time-series reliability. HKR-H is weak, and as a single arXiv benchmark without visible community pull, it stays in the 60-71 band.

editor take

TSRBench tests 30+ models on 4,125 tasks; prediction breaks scaling, a nastier finding than another reasoning leaderboard.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

The paper introduces the Linear Centroids Hypothesis, replacing intermediate activations with centroid spaces; experiments cover DINO ViTs, GPT2-Large, a controlled task, and gradient-based saliency maps, with code released on GitHub.

#Interpretability#Vision#DINO#GPT2-Large

why featured

HKR-H/K pass via a concrete interpretability hypothesis, DINO ViTs/GPT2-Large tests, and code. Impact stays in the 60–71 band because no benchmark numbers or production-facing claim are disclosed.

editor take

LCH swaps activation spaces for centroids on DINO ViTs and GPT2-Large; I buy the function-first angle, but sparsity gains lack numbers.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

MinervaRL improves the mean score by 15.8 percentage points over base models and by 4.3 points over GRPO across four backbones and 12 cyber threat intelligence benchmarks.

#Reasoning#Fine-tuning#Benchmarking#Minerva

why featured

HKR-K and HKR-R pass: MinervaRL gives concrete benchmark scope and gains, with security relevance. HKR-H is weak, and this is a niche arXiv research item without product adoption, so it stays in 60–71.

editor take

MinervaRL gains 15.8 points on 12 CTI benchmarks; RLVR looks useful where IDs and schemas make rewards checkable.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention

Mask2Cause recovers causal graphs during the forecasting forward pass and uses adjacency-constrained masked attention; across benchmarks, the inferred causal structures reduced forecasting model parameter counts by more than 70% on average while maintaining predictive accuracy.

#Reasoning#Benchmarking#Mask2Cause#Research release

why featured

HKR-K passes via a concrete mechanism and the over-70% parameter-reduction claim; HKR-H/R are weak because this is a narrow arXiv methods paper with no broad practitioner hook.

editor take

Mask2Cause cuts forecasting parameters by more than 70% via in-pass causal graphs; I’d verify real-system runs before trusting synthetic-chaos wins.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization

The paper proposes a dual-agent framework that co-trains a health coach agent and a client simulator, using DPO on Pareto-dominant response pairs selected by a multidimensional LLM judge while training the client adversarially by reversing those preferences.

#Agent#Alignment#Benchmarking#Research release

why featured

HKR-H/K pass: dual-agent co-training and reversed preferences are novel. The paper stays niche to health coaching and lacks product impact, scale data, or an artifact, so it sits in 60–71.

editor take

Dual-agent co-training covers coach and client, but metrics and baselines aren’t disclosed; I’d suspect the LLM judge trains performative empathy.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Less Random, More Private: What Is the Optimal Subsampling Scheme for DP-SGD?

The paper proves that Balanced Iteration Subsampling outperforms Poisson subsampling at both σ→0 and σ→∞, and across more than 60 practical DP-SGD configurations it reduces the required noise multiplier by up to 9.6%.

#Fine-tuning#Safety#Benchmarking#Research release

why featured

HKR-K is solid and HKR-R is narrow: BIS cuts noise by up to 9.6% across 60+ DP-SGD setups. The story sits in subsampling math, so technical accessibility keeps it below featured.

editor take

BIS cuts DP-SGD noise by up to 9.6% across 60+ configs; Poisson’s default randomness is now a privacy tax.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→A Reproducible Optimisation Protocol for Calibrating Prompt-Based LLM Workflows in Evidence Synthesis

The arXiv paper presents a reproducible calibration workflow for prompt-based LLM evidence-synthesis tasks, using DSPy and GEPA in the example code and preserving the calibrated artefact with its specification, metric, settings, and evaluation traces.

#Tools#Benchmarking#arXiv#DSPy

why featured

HKR-K and HKR-R pass: the paper offers a reproducible calibration mechanism for prompt-based LLM workflows. HKR-H fails because the angle is academic and evidence-synthesis-specific, so it stays in the interesting-but-not-featured band.

editor take

DSPy and GEPA calibrate evidence-synthesis prompts here; I buy the protocol, not broad transfer—no lift numbers disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Coupling Models for One-Step Discrete Generation

Coupling Models learns a direct coupling between discrete sequences and Gaussian latents, trains a purpose-built decoder for single-step generation, and improves the strongest one-step baselines by reducing LM1B perplexity by 33%, Fly Brain enhancer-design FBD by 18%, and MNIST-Binary FID by 46%.

#Inference-opt#Benchmarking#Research release#Open source

why featured

HKR-H and HKR-K pass: one-step discrete generation plus three benchmark drops. HKR-R is weak: no product path, release artifact, or deployment cost data, so it stays in 60–71.

editor take

Coupling Models cuts LM1B perplexity 33%. A non-distillation path for one-step discrete generation, but LLM-scale text quality is unproven.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→UFT: Unifying SFT and RLHF/DPO/UNA Fine-Tuning via a Generalized Implicit Reward Function

UFT merges SFT and RLHF/DPO/UNA alignment into one training stage using the same objective and loss functions via an implicit reward function; the abstract reports significant gains on IFEval and TruthfulQA, but the RSS snippet does not disclose exact scores or model settings.

#Fine-tuning#Alignment#Research release#Safety/alignment

why featured

HKR-K/R pass: the paper unifies SFT and preference optimization under one objective with workflow relevance. Specific IFEval/TruthfulQA gains are not disclosed, and HKR-H is weak, so it stays in the 60–71 band.

editor take

UFT folds SFT and RLHF/DPO/UNA into one stage, but gives no scores or models; useful math, not a post-training roadmap.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Bloom Filter Encoding for Machine Learning

The paper presents a Bloom filter transform for machine-learning preprocessing, evaluating fixed-length bit-array encodings on six text, time-series, tabular, and image datasets with four classifier types, reporting comparable performance to raw data or standard dimensionality reduction while reducing memory use and obfuscating original feature values.

#Embedding#Inference-opt#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and evaluation setup, with cost/privacy relevance. HKR-H is weak, and this is a single arXiv method paper without an implementation or production-replacement claim, so it stays in 60–71.

editor take

Bloom filter encoding spans 6 datasets and 4 classifiers; without keyed hashing, call it memory-saving obfuscation, not privacy.
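
A minimal sketch of what a Bloom filter encoding of one record can look like, assuming each feature is first turned into a `name=value` token. The tokenization, bit width, hash count, and the `bloom_encode` name are illustrative assumptions, not the paper's settings.

```python
import hashlib

def bloom_encode(tokens, num_bits=64, num_hashes=3):
    """Encode a collection of tokens as a fixed-length list of 0/1 bits."""
    bits = [0] * num_bits
    for token in tokens:
        for seed in range(num_hashes):
            # Derive num_hashes positions per token by prefixing a seed.
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits[position] = 1
    return bits

# Hypothetical tabular row turned into tokens.
row = {"age": 42, "city": "Lyon"}
tokens = [f"{name}={value}" for name, value in row.items()]
encoded = bloom_encode(tokens)
```

The hashing here is unkeyed SHA-256, so anyone who can guess candidate values can recompute the same bit positions. That is the gap the editor take points at: the output is compact and obfuscated, but it is not private without a secret key.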

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→R³L: Reasoning 3D Layouts from Relative Spatial Relations

R³L reduces accumulated reference-frame errors in multi-hop relative spatial reasoning for 3D layout generation, using invariant spatial decomposition, an imagine-and-revise consistency loop, and global-to-local coordinate re-parameterization; the abstract says experiments across diverse scene types produced more physically feasible and semantically consistent layouts, but the snippet does not disclose benchmark scores.

#Reasoning#Multimodal#R³L#Research release

why featured

HKR-K passes because the paper gives concrete mechanisms for reducing frame errors in 3D relative spatial reasoning. HKR-H and HKR-R are weak, and the body discloses no metrics or reproducible results, so this stays in all.

editor take

R³L targets multi-hop frame drift and ships code; scores aren’t disclosed, so I don’t buy “extensive experiments” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

The paper introduces SPEAR, an online federated LLM fine-tuning algorithm that builds contrastive pairs per prompt through a feedback-guided self-play loop and trains with partial non-answer feedback instead of ground-truth contexts or expensive group generations.

#Fine-tuning#Agent#SPEAR#Research release

why featured

HKR-K passes: SPEAR gives a testable mechanism with self-play contrastive samples and partial non-answer feedback, plus open code. HKR-H/R are weak because the angle is dense and mainly relevant to federated fine-tuning researchers.

editor take

SPEAR trains federated online fine-tuning from partial non-answer feedback; no benchmark numbers disclosed, so I’d audit edge-device cost first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware

McNdroid releases a longitudinal multimodal Android malware benchmark spanning 2013–2025, excluding 2015, with three aligned modalities: static manifest and smali features, dynamic sandbox behavior, and function-call graph features, plus public splits and code.

#Multimodal#Benchmarking#McNdroid#Android

why featured

HKR-K is clear: a 12-year Android malware benchmark with static, dynamic, and call-graph features plus code. HKR-R is limited to security-eval readers, so this stays in the all band.

editor take

McNdroid spans 2013–2025 with three Android malware modalities; random-split security scores look lazier after this.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→When Losses Align: Gradient-Based Composite Loss Weighting for Efficient Pretraining

The paper proposes a gradient-based bilevel method that learns pretraining loss weights online, reducing tuning overhead to about 30% above a single training run, and reports results on event-sequence modeling and self-supervised computer vision that match or improve carefully tuned baselines.

#Fine-tuning#Vision#Benchmarking#Research release

why featured

HKR-K is clear via the bilevel weighting method and 30% cost figure; HKR-R touches pretraining tuning cost. The paper stays academic and lacks a product, tool release, or broad industry trigger, so it lands in all.

editor take

This learns pretraining loss weights online at ~30% extra training cost; I buy the attack on random/Bayes tuning waste.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Evaluating Large Language Models in Scientific Discovery

The paper introduces SDE, a scenario-grounded benchmark for evaluating LLMs in scientific discovery across biology, chemistry, materials, and physics, scoring models at question level and project level where they must propose testable hypotheses, design simulations or experiments, and interpret results.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes: SDE adds 4 domains and a two-level scoring mechanism. HKR-H/R are weak because the title is a routine arXiv evaluation frame and no model ranking or deployment conflict is disclosed.

editor take

SDE covers 4 science domains; sample size is undisclosed, so trust the project tasks, not the “scientific superintelligence” framing.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

BoHA partitions frozen weights W0 into a b×b grid and learns an independent low-rank Hadamard factor per block, while keeping LoRA-equivalent rank budgets and merged inference; on a Llama-3.2-3B commonsense-to-arithmetic continual-learning diagnostic, it retained 57.66% first-stage accuracy and beat the W0-free additive-control mean by 15.23% under matched second-stage plasticity.

#Fine-tuning#Reasoning#Llama#Mistral

why featured

HKR-K passes with a concrete mechanism and metric; HKR-R passes on fine-tuning cost and forgetting. HKR-H is weak, and this is a single arXiv method without code, production impact, or cross-source pickup.

editor take

BoHA keeps 57.66% retention on Llama-3.2-3B continual tuning; don't dump LoRA, but add block granularity to ablations.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types

UNA proposes a generalized implicit reward function to train LLM alignment across binary, pairwise, and score-based feedback, and the arXiv abstract says experiments on classical benchmarks with typical LLM base models show consistent gains, but it does not disclose benchmark names or numeric results.

#Alignment#Fine-tuning#Benchmarking#UNA

why featured

HKR-K passes via UNA’s unified reward mechanism across three feedback types, and HKR-R passes for alignment cost and data reuse. No code, scale, or external replication is disclosed, so this stays in the lower research-news band.

editor take

UNA trains on 3 feedback types; no benchmark names or numbers disclosed, so I’d treat it as a DPO-family loss paper.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Robust and Reliable AI for Predictive Quality in Semiconductor Materials Manufacturing with MLOps and Uncertainty Quantification

The study benchmarks MLOps retraining strategies on five years of semiconductor manufacturing data and finds that fixed retraining every five production batches without hyperparameter retuning performs best across drift conditions while reducing computational overhead.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K is strong and HKR-R is moderate: 5 years of real fab data and fixed 5-batch retraining are concrete. The domain is narrow MLOps for semiconductor quality, so it stays in 60-71.

editor take

Five years of fab data favor retraining every 5 batches; stop worshipping HPO when drift rewards cheaper discipline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition

The paper presents FLAIR, a rank-1 decomposition method for periodic time-series forecasting, and reports relMASE 0.838 on aggregate GIFT-Eval across 97 configurations with 28 scalars for hourly series, 57 for weekly series, one CPU core, no GPU, no pre-training, and no per-task tuning.

#Benchmarking#GIFT-Eval#FLAIR#PatchTST

why featured

HKR-H comes from the counterintuitive title; HKR-K from the named method and 97-config result. HKR-R is weak, and rank-1 decomposition plus relMASE makes this a niche research signal, not featured.

editor take

FLAIR hits 0.838 relMASE with 28 hourly scalars; I buy the jab at PatchTST for periodic forecasting.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

The paper introduces PRPO and a reasoning-annotated deepfake detection dataset, aligning LLM reasoning with image evidence at the paragraph level and reporting a top reasoning score of 4.55/5.0.

#Multimodal#Vision#Reasoning#Research release

why featured

HKR-K is supported by a named method, dataset, and 4.55/5.0 score; HKR-R fits deepfake safety concerns. HKR-H is weak, and without product or open-source impact this stays in the 60s.

editor take

PRPO reports 4.55/5 reasoning, but the snippet omits dataset size; I’d treat this as annotation quality work first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→MatryoshkaLoRA: Learning Accurate Hierarchical Low-Rank Representations for LLM Fine-Tuning

MatryoshkaLoRA inserts a fixed diagonal matrix P between LoRA adapters to scale sub-ranks, supports dynamic rank selection with minimal accuracy degradation, and proposes AURAC as a metric for evaluating hierarchical low-rank adapters across ranks.

#Fine-tuning#Inference-opt#Benchmarking#IST-DASLab

why featured

HKR-K passes via a concrete LoRA mechanism and AURAC metric; HKR-R is narrow, tied to tuning cost and deployment flexibility. HKR-H misses, so this stays in the 60–71 band.

editor take

MatryoshkaLoRA adds one fixed diagonal P to cover multiple LoRA ranks; I buy the attack on rank grid-search waste.
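
To make the nested-rank idea concrete, here is a minimal sketch assuming the update has the form B·diag(p)·A and that a lower rank is obtained by keeping only the first r columns of B and rows of A. The decaying values chosen for `p` and the `matryoshka_delta` name are hypothetical; the snippet does not say how the paper sets P.

```python
import numpy as np

def matryoshka_delta(B, A, p, r):
    """Low-rank update from the first r sub-ranks: B[:, :r] @ diag(p[:r]) @ A[:r, :]."""
    return (B[:, :r] * p[:r]) @ A[:r, :]

rng = np.random.default_rng(0)
d_out, d_in, max_rank = 8, 6, 4
B = rng.standard_normal((d_out, max_rank))
A = rng.standard_normal((max_rank, d_in))
W0 = rng.standard_normal((d_out, d_in))  # frozen base weight

# Hypothetical fixed diagonal scaling, one entry per sub-rank.
p = 1.0 / np.sqrt(np.arange(1, max_rank + 1))

# The same trained adapter serves several ranks by truncation.
W_rank2 = W0 + matryoshka_delta(B, A, p, 2)
W_rank4 = W0 + matryoshka_delta(B, A, p, 4)
```

Because every smaller adapter is a prefix of the larger one, picking a rank at deployment time is a slice rather than a retrain, which is the rank grid-search cost the editor take refers to.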

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Hammer and Anvil: Toward a Theory of Backdoors in Federated Learning

The paper introduces Hammer and Anvil, a theory for federated-learning backdoors that classifies attacks by update deviation δ from the mean update and splits defenses into Type 1 outlier or robust aggregation and Type 2 removal-based methods. Its experiments report that single-type and unprincipled combined defenses often fail against one malicious client, while three principled combined variants remain undefeated under a full-information adaptive adversary.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K/R pass: the paper gives a δ-based mechanism plus concrete attack/defense results. HKR-H is weak, and federated-learning backdoor theory is too narrow for featured.

editor take

Hammer and Anvil frames FL backdoors by δ; one malicious client often breaks single-type defenses, while three HA+CSFT variants held.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Understanding Robustness of Model Editing in Code LLMs

The paper builds a code LLM editing benchmark with 2,040 problems and 140 synthetic API modifications, then evaluates three models under single-edit and successive-edit regimes for API migration, generalization, and specificity using execution-based metrics.

#Code#Fine-tuning#Benchmarking#HumanEval

why featured

HKR-K is solid because the benchmark size and sequential-edit setup are concrete. HKR-R is moderate for code-LLM reliability, but HKR-H is weak and this is a single arXiv benchmark, so it stays in 60–71.

editor take

This 2,040-task benchmark is a cold shower: successive code-model edits drive most method-model pairs to near-zero Pass@k.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

TAVIS introduces an active-vision imitation learning benchmark with 2 task suites, 8 tasks, 2 humanoid torso embodiments, and three evaluation primitives, and releases code, scripts, about 2,200 LeRobot v3.0 demonstration episodes, and trained baselines.

#Robotics#Vision#Benchmarking#TAVIS

why featured

HKR-K passes because the benchmark discloses tasks, embodiments, and dataset size. HKR-H and HKR-R are weak: no model breakthrough, product path, or major-lab pull, so it sits in the 60–71 band.

editor take

TAVIS ships 8 tasks and ~2,200 episodes; active-vision robotics finally gets a reproducible ring, not another teleop highlight reel.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Probabilistic Object Detection with Conformal Prediction

The paper compares scaled and unscaled conformal prediction on KITTI, BDD, and CODA for probabilistic object detection; scaled CP improves interval sharpness without sacrificing coverage, reaching up to 19% higher IoU and 39% lower interval scores under autonomous-driving and cross-domain settings.

#Vision#Benchmarking#Safety#arXiv

why featured

HKR-K passes with concrete scaled-vs-unscaled conformal prediction results and dataset names. HKR-H and HKR-R are weak: the framing is dry and the appeal is mostly limited to AV perception and safety specialists.

editor take

Scaled CP gets up to 19% IoU gains on KITTI/BDD/CODA; the catch is coordinate-wise Bonferroni still smells conservative.
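
The scaled-versus-unscaled distinction is easy to see in one dimension. The sketch below uses standard split conformal prediction with absolute residuals, and a normalized variant that divides each residual by a per-sample uncertainty estimate `sigma`. The toy numbers and function names are illustrative assumptions; the paper's detector, score function, and box-coordinate handling are not reproduced here.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate(y_true, y_pred, sigma, alpha):
    """Return (unscaled, scaled) conformal quantiles from a calibration set."""
    unscaled = [abs(y - p) for y, p in zip(y_true, y_pred)]
    scaled = [abs(y - p) / s for y, p, s in zip(y_true, y_pred, sigma)]
    return conformal_quantile(unscaled, alpha), conformal_quantile(scaled, alpha)

def interval(pred, q, sigma=1.0):
    """Prediction interval pred +/- q * sigma (sigma=1.0 gives the unscaled case)."""
    half = q * sigma
    return pred - half, pred + half

# Toy calibration set: targets, point predictions, and predicted uncertainties.
y_true = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
y_pred = [10.5, 19.0, 30.2, 42.0, 49.9, 60.4, 68.0, 80.1, 90.3]
sigma = [0.6, 0.8, 0.3, 1.5, 0.2, 0.5, 1.8, 0.2, 0.3]
alpha = 0.2

q_unscaled, q_scaled = calibrate(y_true, y_pred, sigma, alpha)

# A confident test prediction gets a much tighter interval under scaling.
fixed_width = interval(55.0, q_unscaled)
adaptive_width = interval(55.0, q_scaled, sigma=0.2)
```

Unscaled CP gives every prediction the same half-width, while the scaled version shrinks intervals where the model is confident and widens them where it is not. For a four-coordinate bounding box, a coordinate-wise Bonferroni correction would run this at alpha/4 per coordinate, which is where the conservatism the editor take mentions comes from.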

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems

The paper evaluates VAE, GAN, and diffusion models for federated predictive maintenance, comparing full federation with partial component sharing; experiments on a real-world time-series dataset show DDPM decoder sharing can outperform full federation under bandwidth-constrained, non-IID conditions.

#Fine-tuning#Inference-opt#Benchmarking#Research release

why featured

HKR-K/R pass: DDPM decoder sharing beating full federated learning under bandwidth-limited, non-IID conditions is testable. The topic is narrow for general AI practitioners, so it stays in 60–71.

editor take

DDPM decoder sharing beats full federation under non-IID bandwidth limits; dataset and metrics are undisclosed, so don’t overfit the headline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

The researchers extracted symbolic directions from frozen embeddings in three health foundation models trained on about 20 million minutes of unlabeled PPG and accelerometer data from roughly 172,000 participants, then tested a held-out cohort of 30,000 subjects; symbol-based cross-modal transfer retained more than 95% of in-domain performance without retraining.

#Multimodal#Embedding#Interpretability#arXiv

why featured

HKR-K passes with dataset scale, frozen-embedding symbolic directions, and cross-modal transfer; HKR-H and HKR-R are weak. No product or artifact is disclosed, so this stays in the lower interesting band.

editor take

Three health FMs retain over 95% of in-domain performance via symbolic cross-modal transfer; wearable interpretability is starting to look useful.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→ADKO: Agentic Decentralized Knowledge Optimization

ADKO lets autonomous agents run collaborative black-box optimization through knowledge tokens that carry directional signals, advantage scores, and optional LM insights, while agents keep private GP surrogates and do not share raw data or model parameters.

#Agent#Reasoning#Memory#ADKO

why featured

HKR-K and HKR-R pass: the mechanism is concrete and privacy-relevant. HKR-H is weak, and the post discloses no benchmark numbers, code, or production claim, so it stays in the 60–71 research-release band.

editor take

ADKO shares tokens, not data or weights; I buy the privacy setup, but no experiment numbers are disclosed here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA

The paper proposes GLoRA, a gauge-aware server representation for federated LoRA that estimates a consensus update subspace from client projectors. Experiments on GLUE and SuperNI report gains over federated LoRA baselines under data, resource, task, rank, participation, backbone, and unseen-task heterogeneity.

#Fine-tuning#Inference-opt#Benchmarking#GLoRA

why featured

HKR-K/R pass: GLoRA uses client projectors to estimate a consensus update subspace and reports gains over federated LoRA baselines on GLUE and SuperNI. The paper lacks code, numeric deltas, and a product landing path, so it stays in all.

editor take

GLoRA aggregates LoRA via projector subspaces and wins on GLUE/SuperNI; I buy the target: factor averaging has a real gauge bug.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

The paper introduces Mutual Reinforcement Learning, a concurrent RL post-training framework where heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers, and evaluates three GRPO-based probes: PRP, XGRPO, and SGT.

#Fine-tuning#Reasoning#Research release

why featured

HKR-K passes because the paper names concrete mechanisms for shared RL experience. HKR-H/R are weak: no reported gains, compute cost, or model-scale details are disclosed, so this sits in the ordinary research-release band.

editor take

MRL tests PRP, XGRPO, and SGT on GRPO; no scores disclosed, and SGT’s low-bandwidth success transfer feels likelier to survive.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics

The paper evaluates privacy leakage in tabular diffusion models using black-box and white-box membership inference attacks, testing training setup, synthesis choices, and attacker knowledge; the RSS abstract states attackers do not need exact training knowledge or massive compute, but does not disclose dataset counts or leakage rates.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: the privacy-attack framing matters for synthetic-data users. HKR-H is weak, and dataset counts, leakage rates, and reproducible details are not disclosed.

editor take

This paper tests membership inference on tabular diffusion, but gives no leakage rates; synthetic tables need attack evals before handoff.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Adaptive Memory Decay for Log-Linear Attention

The paper proposes learning λ from the input with a two-layer MLP, producing per-token and per-Fenwick-level decay while preserving log-linear complexity, and reports gains over the baseline on three tasks: associative recall, selective copying, and language modeling.

#Memory#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes: the paper gives a concrete adaptive-decay mechanism and tests it on associative recall, selective copying, and language modeling. HKR-H/R are weak, and the architecture detail is too niche for featured.

editor take

A two-layer MLP learns λ while keeping log-linear cost; no gain numbers disclosed, so I read this as a Fenwick-attention patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→The Effect of Mini-Batch Noise on the Implicit Bias of Adam

The paper introduces a framework linking batch size, β1, and β2 to Adam’s memory-driven implicit bias: the default (0.9, 0.999) works well for small batches, while moving β1 closer to β2 improves validation accuracy in many large-batch, multi-epoch training settings.

#Fine-tuning#Benchmarking#Adam#AdamW

why featured

HKR-K and HKR-R pass: the paper gives testable Adam β-setting claims for small versus large batches. The academic framing and narrow training scope keep it below featured.

editor take

Adam’s (0.9,0.999) default holds for small batches; for large-batch multi-epoch runs, pull β1 toward β2.
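For readers who want to see where β1 and β2 enter the update, here is a minimal scalar Adam step in plain Python. The large-batch pair `(0.99, 0.999)` is a hypothetical example of "β1 moved toward β2"; the entry does not state specific values, and `memory_horizon` is only a rule-of-thumb helper introduced for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One scalar Adam update with bias correction; returns (param, m, v)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) average
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment average
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

def memory_horizon(beta):
    """Rough number of past steps an exponential average with decay beta remembers."""
    return 1.0 / (1.0 - beta)

DEFAULT_BETAS = (0.9, 0.999)       # the default discussed in the entry
LARGE_BATCH_BETAS = (0.99, 0.999)  # hypothetical: beta1 moved toward beta2

for name, betas in [("default", DEFAULT_BETAS), ("large-batch", LARGE_BATCH_BETAS)]:
    p, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):
        # Minimise f(p) = p^2, whose gradient is 2p.
        p, m, v = adam_step(p, grad=2 * p, m=m, v=v, t=t, betas=betas)
    print(name, round(memory_horizon(betas[0])), round(memory_horizon(betas[1])))
```

Raising β1 lengthens the momentum memory from roughly 10 steps to roughly 100, narrowing the gap to the second-moment memory of roughly 1,000 steps.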

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→DVD: Discrete Voxel Diffusion for 3D Generation and Editing

DVD models voxel occupancy as a native discrete variable for sparse voxel generation, assessment, and editing in SLat-based 3D pipelines, and uses predictive entropy to identify ambiguous voxel regions and difficult samples.

#Multimodal#Fine-tuning#Research release

why featured

HKR-K passes via a concrete 3D generation mechanism, but HKR-H and HKR-R are weak. The item is an arXiv abstract with no metrics, model size, code status, or product implication disclosed, so it stays in all.

editor take

DVD models voxel occupancy discretely and uses entropy for ambiguity; I buy the 3D prior, but no benchmark numbers are disclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

The paper proposes A-NSR and CW-NSR as two extensions to Negative Sample Reinforcement, using time-dependent schedules and normalized sequence-likelihood penalty weights, and evaluates them on MATH, AIME 2025, and AMC23 with the Qwen2.5-Math-1.5B architecture.

#Reasoning#Alignment#Fine-tuning#Qwen

why featured

HKR-K passes for concrete RLVR mechanisms and evaluation setup. HKR-H/R fail because the summary gives no gain numbers, code artifact, or production impact, making this a narrow research item in the 60–71 band.

editor take

A-NSR and CW-NSR are tested on Qwen2.5-Math-1.5B across three math sets; no gains disclosed, so don’t crown dynamic penalties yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Mechanistic Interpretability with Sparse Autoencoder Neural Operators

The paper introduces SAE-NOs, sparse autoencoders operating in function spaces, and uses concept sparsity plus domain sparsity to model which concepts are active and where they are expressed across the input domain.

#Interpretability#Vision#Research release

why featured

HKR-K passes via the SAE-NO mechanism plus concept/domain sparsity. HKR-H and HKR-R are weak, and the post gives only abstract-level method detail with no results numbers or product implication.

editor take

SAE-NO moves SAEs into function space; cross-resolution generalization is the hook, but benchmark scale is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→From Time Series Analysis to Question Answering: A Survey in the LLM Era

The paper proposes a taxonomy for the shift from TSA to TSQA and organizes prior work into three alignment paradigms: Injective Alignment, Bridging Alignment, and Internal Alignment.

#Reasoning#Benchmarking#Research release

why featured

HKR-K passes: the paper offers a TSA-to-TSQA taxonomy and three alignment paradigms. HKR-H/R are weak, and an arXiv survey is useful research navigation rather than same-day AI industry news.

editor take

This survey buckets TSQA work into 3 alignment paradigms; useful map, but the snippet gives no benchmark evidence.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Frequency-Aware Model Parameter Explorer: A New Attribution Method for Improving Explainability

FAMPE uses an FFT-based alpha-weighted perturbation scheme for attribution and was evaluated on ImageNet across four CNN and Vision Transformer architectures; at fixed alpha=0.1, it outperforms AttEXplore by 4.25% on Inception-v3 and 12.04% on MaxViT-T.

#Interpretability#Vision#FAMPE#AttEXplore

why featured

HKR-K passes: the post gives FAMPE’s FFT-weighted perturbation mechanism and ImageNet comparison numbers. HKR-H and HKR-R are weak; attribution research has signal, but the audience fit is narrow, so it stays in the 60–71 band.

editor take

FAMPE beats AttEXplore on four ImageNet architectures; at fixed α=0.1 the gain reaches 12.04% on MaxViT-T, but baseline-selection cost is not compared.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

STARFlow2 connects a pretrained VLM stream with a TarFlow stream through the Pretzel architecture, using the same causal mask so text and visual outputs enter the KV-cache directly without re-encoding.

#Multimodal#Vision#Inference-opt#STARFlow2

why featured

HKR-K passes: Pretzel links the VLM and TarFlow streams under one causal mask and a shared KV-cache path. No metrics, artifact, product tie-in, or major lab signal is disclosed; HKR-H/R miss, so this sits in all.

editor take

STARFlow2 puts VLM and TarFlow streams under one causal mask; I buy the engineering bet, but RSS gives no benchmark numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Stochastic Transition-Map Distillation for Fast Probabilistic Inference

The paper proposes STMD for faster diffusion inference, using a conditional Mean Flow model to distill full SDE transition maps into a one- or few-step stochastic sampler, with validation on MNIST, CIFAR-10, and CelebA.

#Inference-opt#STMD#Mean Flow#Research release

why featured

HKR-K/R pass: one-to-few-step stochastic sampling is relevant to diffusion inference cost, and the mechanism is concrete. The post lacks speedup or quality numbers and only lists MNIST, CIFAR-10, and CelebA, so it stays in the 60–71 research band.

editor take

STMD distills SDE transition maps into one/few-step samplers; MNIST/CIFAR-10/CelebA only, no large-scale generation numbers disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE

The paper uses TRUSTEE to diagnose static malware classifiers under controlled dataset-composition ratios. Across experiments, top-ranked features are mostly packing artifacts, PE metadata, and string-level n-grams, not malicious semantics. The authors present the framework as a reproducible way to detect dataset bias in malware models.

#Interpretability#Safety#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete shortcut-learning finding for malware classifiers. HKR-H is weak, and the static-malware focus limits audience fit, so it stays in all.

editor take

TRUSTEE found top features skewing to packing, PE metadata, and string n-grams; static malware classifiers still memorize dataset tells.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SLOPE: Optimistic Potential Landscape Shaping for Model-based Reinforcement Learning

SLOPE replaces sparse scalar reward prediction with optimistic potential landscape estimates, using distributional regression for high-confidence upper bounds; the paper evaluates it on 30+ tasks across 5 benchmarks and real-world robotic deployments, where it outperforms leading baselines under fully sparse, semi-sparse, and dense rewards.

#Robotics#Reasoning#Benchmarking#SLOPE

why featured

HKR-K passes on the new SLOPE mechanism, 5 benchmarks, 30+ tasks, and robot deployment. HKR-H and HKR-R miss because the angle is academic and narrow for the broader AI-practitioner audience.

editor take

SLOPE covers 30+ tasks on 5 benchmarks; optimistic potential landscapes are a cleaner fix than more reward-model tuning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Interpreting Reinforcement Learning Agents with Susceptibilities

The paper generalizes susceptibilities from neural network interpretability to regret in deep reinforcement learning, tests them in a gridworld model with stagewise development, validates findings with activation steering, and discusses an extension to RLHF post-training.

#Agent#Interpretability#Alignment#Research release

why featured

HKR-K passes: the paper adds a concrete method transfer plus gridworld validation. HKR-H is weak and HKR-R is limited, so this stays in the lower research-release band rather than featured.

editor take

2605.08007 ties susceptibilities to deep-RL regret; the catch is it only runs gridworld, with RLHF left as discussion.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry

The paper interviews 12 practitioners at one global semiconductor company and identifies 16 collaboration and communication challenges in ML engineering teams, with unclear roles and responsibilities ranked as the most critical issue under hardware-driven constraints.

#Research release

why featured

HKR-K/R pass: the paper gives 12 interviews, 16 challenge categories, and unclear roles as the top issue. HKR-H is weak; the single-company semiconductor sample keeps it in the interesting-but-not-featured band.

editor take

12 semiconductor practitioners flagged 16 CoCo failures; unclear ownership ranks first, so stop dressing ML engineering debt as model trouble.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation

RRCM trains a memory-reading policy with an outcome-only ranking reward, choosing among direct recommendation, collaborative evidence retrieval, item metadata retrieval, or interleaved retrieval from a lightweight user-history context; the paper says extensive top-k recommendation experiments beat traditional baselines and multiple LLM-based recommender methods, but the snippet does not disclose dataset names or exact scores.

#Agent#RAG#Reasoning#Research release

why featured

HKR-K passes: the paper gives a new training signal for memory retrieval and evidence selection. HKR-H/R are weak; no gain numbers, dataset detail, or deployment case are disclosed, so this sits in the research long tail.

editor take

RRCM trains retrieval with outcome-only ranking reward; datasets and scores are undisclosed, so I don't buy “significantly outperforms” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Text-to-CAD Evaluation with CADTests

The paper introduces CADTestBench, a test-based benchmark for Text-to-CAD that uses executable CADTests to verify whether generated CAD models satisfy geometric and topological prompt requirements, with code and data released on GitHub and Hugging Face.

#Benchmarking#Code#CADTestBench#CADTests

why featured

HKR-K passes: executable CADTests are a concrete evaluation mechanism with open artifacts. HKR-H and HKR-R are weak; the CAD-generation niche fits all, not featured.

editor take

CADTestBench evaluates Text-to-CAD with executable CADTests; that beats mesh similarity as an engineering acceptance signal.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→PSK@EEUCA 2026: Fine-Tuning LLMs with Synthetic Data for Multi-Class Toxicity Detection in Gaming Chat

PSK combined Llama 3.1 8B with 5% synthetic data augmentation to classify World of Tanks chat into six toxicity categories, achieving 0.6234 macro F1 on the test set and ranking 4th among 35 teams in the EEUCA 2026 shared task.

#Fine-tuning#Safety#Benchmarking#Llama

why featured

HKR-K passes with concrete setup and leaderboard numbers. HKR-H and HKR-R miss because this is a narrow shared-task result, not a broadly discussed AI product or safety shift.

editor take

PSK got 0.6234 macro F1 with Llama 3.1 8B plus 5% synthetic data; gaming toxicity evals still punish validation overfitting.
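Macro F1 is the unweighted mean of per-class F1, so rare toxicity classes count as much as the dominant one. The sketch below uses made-up labels and predictions; the shared task's six category names are not given in this entry, so the three labels here are hypothetical.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical, imbalanced toy data: one rare class is missed entirely.
y_true = ["ok", "ok", "ok", "ok", "insult", "threat"]
y_pred = ["ok", "ok", "ok", "ok", "ok", "threat"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels=["ok", "insult", "threat"])
print(round(accuracy, 4), round(score, 4))
```

Accuracy stays high while macro F1 drops, because the missed class contributes an F1 of zero to the average.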

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

The paper introduces PIQL, a framework that injects two train-time privileged information types into tabular foundation models; the abstract says it improves convergence, final loss, and generalization, but the post does not disclose speedup factors, loss values, or dataset sizes.

#Fine-tuning#Inference-opt#Reasoning#PIQL

why featured

HKR-K passes for PIQL’s training-time privileged-information mechanism, but the post gives no speedup, loss number, or dataset scale. HKR-H and HKR-R are weak, so this stays in the lower all band.

editor take

PIQL injects two train-time PI types into tabular FMs; no speedup or dataset scale is disclosed, so I don't buy the compute-saving claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Spectral Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

The paper proposes Spectral Surgery, a post-hoc method that perturbs model weights along Hessian spike eigenvectors to rebalance per-class classifier accuracy without retraining, and reports encouraging balanced accuracy and standard-deviation results on CIFAR-10 and ISIC-2019, while the snippet does not disclose exact numerical gains.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via a concrete no-retraining mechanism and named benchmarks. HKR-H/R are weak: the angle is specialized classifier rebalancing, not a broad AI-product or practitioner flashpoint.

editor take

Spectral Surgery perturbs Hessian spike eigenvectors post hoc; CIFAR-10 and ISIC-2019 results exist, exact gains undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

AGWM tracks action executability with a prerequisite-dependency DAG and reports lower multi-step prediction error in game-based simulated environments; the abstract does not disclose specific datasets, numeric error reductions, or baseline names.

#Agent#Reasoning#Interpretability#AGWM

why featured

HKR-K passes for the prerequisite-DAG mechanism in agent world models. HKR-H and HKR-R stay weak because the abstract gives no dataset, error number, or baseline, so this fits a normal arXiv research item.

editor take

AGWM uses a prerequisite DAG for action executability; no datasets, error deltas, or baselines disclosed, so good idea, weak evidence.
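The executability idea can be sketched with a small dependency graph: an action is allowed only once all of its prerequisites are done. The crafting-style actions and the `PREREQS` table below are hypothetical and only illustrate the check, not AGWM's learned model.

```python
# Hypothetical prerequisite DAG: action -> set of actions that must come first.
PREREQS = {
    "collect_wood": set(),
    "make_table": {"collect_wood"},
    "make_pickaxe": {"collect_wood", "make_table"},
    "mine_stone": {"make_pickaxe"},
}

def executable(action, done):
    """An action is executable when every prerequisite has already been completed."""
    return PREREQS[action] <= done

def rollout(plan):
    """Step through a plan, recording whether each action was executable in order."""
    done, trace = set(), []
    for action in plan:
        ok = executable(action, done)
        trace.append((action, ok))
        if ok:
            done.add(action)  # only successful actions unlock later ones
    return trace

trace = rollout(["collect_wood", "mine_stone", "make_table", "make_pickaxe", "mine_stone"])
print(trace)
```

The first `mine_stone` attempt is rejected and the second succeeds. A world model that ignores prerequisites gets this kind of multi-step case wrong.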

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

The paper releases UXPID, a dataset with 7,130 synthesized and anonymized user feedback branches from a public industrial automation forum, where each JSON record includes multi-post comments, metadata, and LLM annotations for UX insights, severity ratings, sentiment, and topic classes.

#Fine-tuning#Benchmarking#UXPID#arXiv

why featured

HKR-K passes: 7,130 synthetic feedback records and LLM labels give dataset users a concrete artifact. HKR-H and HKR-R are weak, so this stays in the lower interesting band with no hard exclusion.

editor take

UXPID ships 7,130 synthetic forum feedback records; I don’t buy it yet: training models on LLM-generated labels smells circular.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Approximation-Free Differentiable Oblique Decision Trees

The paper proposes DTSemNet, an invertible neural-network representation semantically equivalent to hard oblique decision trees, and trains classification and regression trees end to end with standard gradient descent without soft-boundary or STE approximations.

#Interpretability#Reasoning#Benchmarking#DTSemNet

why featured

HKR-K passes: DTSemNet gives a concrete mapping from hard oblique trees to invertible neural nets. HKR-H/R are weak, and no benchmark numbers or product impact are disclosed, so this stays in all.

editor take

DTSemNet maps hard oblique trees into invertible nets; I care whether it beats XGBoost on small tabular data.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→It Just Takes Two: Scaling Amortized Inference to Large Sets

The paper trains a mean-pool Deep Set encoder on sets of at most two elements, then finetunes the inference head on pre-aggregated embeddings, making training cost essentially independent of deployment set size N while matching or exceeding baselines on benchmarks with N in the thousands.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes via a concrete scaling claim: training cost stays nearly independent of deployment set size N. HKR-H/R are weak because this is a narrow methods paper with no product or broad practitioner hook.

editor take

Training Deep Set on sets of size ≤2 still works at N in the thousands; useful trick for scientific inference.
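One property behind this trick is that a mean-pooled set embedding can be rebuilt from summaries of small subsets. The sketch below checks that merging size-2 summaries reproduces the mean pool over a set of 2,000 elements. The `embed` function is a hypothetical stand-in for a learned encoder, and this is not the paper's training procedure.

```python
def embed(x):
    """Hypothetical per-element encoder: maps a scalar to a 2-d embedding."""
    return (x, x * x)

def mean_pool(embeddings):
    """Permutation-invariant mean over a list of equal-length embeddings."""
    n = len(embeddings)
    return tuple(sum(e[i] for e in embeddings) / n for i in range(len(embeddings[0])))

def pooled_from_chunks(chunks):
    """Combine per-chunk (mean, count) pairs into the mean of the full set."""
    total = sum(count for _, count in chunks)
    dim = len(chunks[0][0])
    return tuple(sum(mean[i] * count for mean, count in chunks) / total for i in range(dim))

data = [float(i) for i in range(1, 2001)]  # deployment set with N = 2000

# Direct mean pool over the whole set.
full = mean_pool([embed(x) for x in data])

# Pre-aggregate in pairs (sets of size 2), then merge the pair summaries.
pairs = [data[i:i + 2] for i in range(0, len(data), 2)]
chunks = [(mean_pool([embed(x) for x in pair]), len(pair)) for pair in pairs]
merged = pooled_from_chunks(chunks)

print(all(abs(a - b) < 1e-6 for a, b in zip(full, merged)))
```

Because the pooled vector has a fixed size regardless of N, a downstream head can be tuned on cached pooled embeddings without re-running the encoder over thousands of elements.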

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Active teacher selection for reward learning

The paper proposes the Hidden Utility Bandit framework and ATS algorithms for reward learning, testing active teacher selection in two real-world domains: paper recommendation systems and COVID-19 vaccine testing.

#Alignment#Reasoning#Research release

why featured

HKR-K passes: the paper offers a new framework, algorithm, and two real-domain tests. HKR-H/R are weak because the angle is academic and gives no direct RLHF, labeling-cost, or production impact.

editor take

HUB models teacher variance across rationality, expertise, and cost; I buy the direction, but the snippet gives no lift numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

PACEvolve++ uses a trainable advisor to adapt evolutionary search policy at test time, decoupling strategic decisions from candidate implementation, and reports gains over an existing frontier-model evolutionary search framework across 3 task types; the abstract does not disclose exact improvement numbers.

#Agent#Reasoning#Fine-tuning#Minghao Yan

why featured

HKR-K passes: the paper describes a trainable test-time advisor for evolutionary search. No exact gains are disclosed, HKR-H and HKR-R are weak, so it sits at the low end of all.

editor take

PACEvolve++ beats frontier evolutionary search on 3 task types; no deltas in the abstract, so treat it as architecture signal, not proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting

The paper introduces Fidel-TS for time-series forecasting evaluation, listed as arXiv:2509.24789v4, and defines high-fidelity benchmarking around data sourcing integrity, leak-free design, and structural clarity while testing unimodal, multimodal, and LLM-based forecasting models.

#Multimodal#Benchmarking#Fidel-TS#arXiv

why featured

HKR-K passes via a new benchmark and leakage-free design details. HKR-H/R are weak: the title is catalog-like, and the impact is narrow to time-series evaluation; no hard exclusion, but sparse body detail keeps it low.

editor take

Fidel-TS v4 targets leak-free forecasting evals; RSS omits scale and leaderboards, so I’d treat it as benchmark hygiene first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→KL for a KL: On-Policy Distillation with Control Variate Baseline

The paper proposes vOPD, casting OPD as policy-gradient RL and using per-token negative reverse KL as a control-variate baseline, keeping the single-sample estimator unbiased while reducing gradient variance without an added critic or extra inference.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-K passes on the vOPD control-variate mechanism. HKR-H and HKR-R are weak, and the post gives no benchmark numbers, model scale, or reproducible gain, so it stays in the lower all band.

editor take

vOPD uses per-token negative reverse KL as baseline; I buy the framing: OPD instability finally gets treated as variance.
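A toy check shows why the negative reverse KL works as a baseline. The sketch assumes the single-sample reward for a token is the log-ratio log teacher(x) − log student(x); that reward form is an assumption, since the entry does not spell it out. The expected reward under the student is then exactly −KL(student‖teacher), and subtracting that leaves an advantage with zero mean. The distributions are hypothetical.

```python
import math

def reverse_kl(student, teacher):
    """KL(student || teacher) over one next-token distribution."""
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(student, teacher) if p > 0)

def token_reward(token, student, teacher):
    """Assumed single-sample reward: log teacher prob minus log student prob."""
    return math.log(teacher[token]) - math.log(student[token])

def advantage(token, student, teacher):
    """Reward minus baseline, where the baseline is the negative reverse KL."""
    baseline = -reverse_kl(student, teacher)  # equals the expected reward under the student
    return token_reward(token, student, teacher) - baseline

# Hypothetical next-token distributions over a 3-token vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]

# The baseline does not depend on the sampled token, so the advantage
# averages to zero under the student and the gradient estimate stays unbiased.
mean_advantage = sum(p * advantage(tok, student, teacher) for tok, p in enumerate(student))
print(abs(mean_advantage) < 1e-12)
```

The baseline is computed from distributions already available at each position, which is consistent with the entry's note that no critic or extra inference pass is needed.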

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Attribution-Based Neuron Utility for Plasticity Restoration in Deep Networks

The paper introduces GXD, a utility measure using reference-based gradient attribution to estimate the first-order functional cost of replacing a unit, and uses it to guide selective resets of low-utility parameters for restoring trainability in continual learning settings.

#Fine-tuning#Interpretability#Research release

why featured

HKR-K passes: GXD estimates neuron utility via reference-gradient attribution and resets low-utility parameters to restore trainability. No experiment numbers or reproducible setup are disclosed, and the angle is narrow, so it stays in the upper 40–59 band.

editor take

GXD estimates reset cost via gradient attribution, but no benchmark numbers are disclosed; I don’t buy “more reliable” before non-toy continual learning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→FlightSense: End-to-End MLOps for Real-Time Flight Delay Prediction with Agentic Conversational AI

FlightSense trains an XGBoost delay predictor on 7.07 million BTS 2018 flight records, raising ROC AUC from 0.732 to 0.875 after adding 11 aircraft rotation-chain propagation features, then reaching 0.879 with five NOAA weather features across 10 major U.S. airports.

#Agent#Tools#Aditi J. Shelke#Renuka J. Shelke

why featured

HKR-K passes with dataset size, feature mechanism, and AUC lift. HKR-H/R fail because this is a niche applied MLOps paper for aviation, with no hard-exclusion trigger, so it sits in lower all.

editor take

FlightSense gets AUC to 0.875 with 11 rotation-chain features; the agent chat layer smells like packaging.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

The paper introduces iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning for ICL, covering six task types including forecasting, imputation, classification, anomaly detection, reconstruction, and source de-mixing.

#Reasoning#Benchmarking#iAmTime#arXiv

why featured

No hard exclusion triggered. HKR-K passes on the iAmTime mechanism and 6-task scope; HKR-H and HKR-R miss because no metrics, artifact, or product/lab tie-in is disclosed.

editor take

iAmTime covers 6 time-series tasks; baselines and gains are undisclosed here, so I’d file it as promising but under-evidenced.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Inference-Time Attribute Distribution Alignment for Unconditional Diffusion

The paper proposes inference-time attribute distribution alignment for pretrained unconditional diffusion models, casting reverse diffusion as an optimal control problem and using additive time-dependent perturbations to match target attribute distributions without retraining or fine-tuning.

#Inference-opt#Alignment#Vision#Research release

why featured

HKR-K passes for a concrete inference-time alignment mechanism without retraining. HKR-H and HKR-R are weak; no benchmark, code, or product setting is disclosed, so this stays in the lower research-signal band.

editor take

This casts unconditional diffusion inference as optimal control with no retraining; baselines, datasets, and control cost are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

The paper proposes dynamic clipping thresholds to control entropy in RLVR, evaluates three schedules—increase-then-decrease, decrease-increase-decrease, and oscillatory decay—and reports reduced entropy collapse with stronger performance across multiple benchmarks.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-K passes on a concrete RLVR entropy-control mechanism and three tested schedules. HKR-H is weak and HKR-R is narrow; benchmark names and gains are not disclosed, so it stays in all.

editor take

Dynamic clipping controls RLVR entropy here; benchmarks and gains are undisclosed, so I’d test whether this survives GRPO runs.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning

EviDep estimates depression severity plus aleatoric and epistemic uncertainty using a Normal-Inverse-Gamma distribution, adds wavelet-based Mixture-of-Experts feature extraction and disentangled evidential learning, and reports tests on AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC.

#Multimodal#Safety#Benchmarking#EviDep

why featured

HKR-K is clear via the NIG uncertainty mechanism and four datasets; HKR-R is limited to medical-AI safety. No hard exclusion, but the niche clinical-estimation angle lacks product, agent, or adoption impact.

editor take

EviDep reports SOTA on 4 depression datasets; I don't buy “trustworthy” without disclosed external clinical validation.
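For context on the uncertainty split, the standard deep evidential regression formulas derive both terms from the four Normal-Inverse-Gamma parameters. The sketch below uses those textbook formulas; whether EviDep computes them in exactly this form is an assumption, and the sample values are hypothetical.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Split NIG(gamma, nu, alpha, beta) into prediction, aleatoric, and epistemic terms."""
    if alpha <= 1 or nu <= 0 or beta <= 0:
        raise ValueError("requires alpha > 1, nu > 0, beta > 0")
    prediction = gamma                     # E[mu]
    aleatoric = beta / (alpha - 1)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1))  # Var[mu]: uncertainty about the mean itself
    return prediction, aleatoric, epistemic

# Hypothetical network outputs for one sample; nu behaves like a count of virtual observations.
result = nig_uncertainty(gamma=12.0, nu=4.0, alpha=3.0, beta=2.0)
print(result)
```

Increasing `nu` shrinks only the epistemic term, which is why the two kinds of uncertainty can be reported separately.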

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→On the Meta-Design of Allocation Problems

The paper defines a meta-design space for resource allocation problems and develops empirical tools for planner-level decisions; the abstract discloses two real-world case studies, covering German employment services and targeted cash transfer programs in Ethiopia, but it does not disclose implementation details or measured welfare gains.

#Research release

why featured

HKR-K passes via a new framework and 2 real cases; HKR-H is weak because the title reads like a paper heading; HKR-R is thin for AI practitioners. No hard exclusion applies, so it stays in the upper low-value research band.

editor take

The paper has 2 field cases, but no welfare gains disclosed; treating capacity, data, and service quality as variables is the useful move.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning

PerCaM-Health learns personalized dynamic causal graphs from longitudinal health data, using a knowledge-guided population temporal graph, patient-specific temporal evidence, and rolling-window updates; the abstract reports gains on a semi-synthetic dynamic health benchmark but does not disclose sample size or metric values.

#Reasoning#PerCaM-Health#Research release#Benchmark

why featured

HKR-K passes because the method has concrete mechanisms, but sample size, result numbers, and reproducible setup are not disclosed. HKR-H/R are weak, so this fits all rather than featured.

editor take

PerCaM-Health updates patient causal graphs with rolling windows; sample size and metrics are undisclosed. Without external validation, I don't buy the healthcare claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Towards Fairness under Label Bias in Image Segmentation: Impact, Measurement and Mitigation

The paper adapts Confident Learning to image segmentation and evaluates it on three datasets, detecting and mitigating group-conditional label errors without clean unbiased annotations.

#Vision#Benchmarking#Alignment#arXiv

why featured

HKR-K passes with a testable method and 3 datasets; HKR-H is weak and HKR-R is limited to vision-fairness specialists. This is a narrow arXiv research item, not a model, product, or safety incident.

editor take

The paper tests Confident Learning on 3 segmentation datasets; I buy the problem, but “equitable performance” needs effect sizes.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→TraXion: Rethinking Pre-training Frameworks for Mobility and Beyond

TraXion uses one checkpoint per dataset to beat task-specific baselines across six public mobility datasets, covering anomaly detection, next-POI recommendation, next-visit prediction, and social-link prediction.

#Embedding#Benchmarking#TraXion#Research release

why featured

HKR-K passes: 6 datasets and a single-checkpoint multi-task result are concrete. HKR-H/R miss because the title is academic and the audience fit is narrow, so it sits in the lower research-signal band.

editor take

TraXion beats baselines on 6 mobility datasets with one checkpoint; I buy MESES over forcing trajectories into sentences.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Three-in-One World Model: Energy-Based Consistency, Prediction, and Counterfactual Inference for Marketing Intervention

The paper proposes a Three-in-One world model that uses a DBM to learn frozen beliefs from demographics, time, and lagged actions and outcomes, then attaches lightweight adapters for three tasks: energy-based consistency evaluation, outcome prediction, and counterfactual inference.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via the DBM frozen-representation setup and three downstream tasks. HKR-H/R are weak: the marketing-causal-inference angle is narrow, with no product, open-source artifact, or reproducible result disclosed.

editor take

Three-in-One is validated only in controlled simulation; no real marketing data disclosed, so the world-model label feels premature.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→GAD in the Wild: Benchmarking Graph Anomaly Detection under Realistic Deployment Challenges

The paper presents a graph anomaly detection benchmark that evaluates nine GAD models on five graphs, including two industrial-scale datasets with over 3.7 million nodes, under million-node scale, 0.1% anomaly ratios, and missing node attributes.

#Benchmarking#Research release#Benchmark#Open source

why featured

HKR-K passes on concrete benchmark settings, but the topic is narrow GAD research with no product, agent, or major-model link. It stays in the 40–59 low-value band.

editor take

GAD benchmark tests 9 models on 5 graphs; at 0.1% anomalies, zero recall makes lab AUC look cheap.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Bayesian Fine-tuning in Projected Subspaces

The paper proposes a Bayesian fine-tuning framework in projected subspaces, modeling weight uncertainty in low-dimensional parameter spaces while targeting better calibration and generalization; the abstract does not disclose model sizes, datasets, benchmark numbers, or training-cost metrics.

#Fine-tuning#Alignment#Inference-opt#Research release

why featured

HKR-K passes: the paper offers a concrete mechanism for Bayesian fine-tuning in projected subspaces. HKR-H/R are weak, and model scale, datasets, and metrics are not disclosed, so it stays in the low-value research band.

editor take

The paper gives a low-dimensional Bayesian fine-tuning frame, but no model sizes or metrics; treat it as a LoRA calibration patch for now.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Knowledge Transfer Scaling Laws for 3D Medical Imaging

The paper models data allocation for 3D medical imaging pretraining as a scaling-law optimization problem, where transfer-aware sampling outperforms data-proportional sampling by up to 58% and generalizes to unseen budgets with r=0.989 across CT, MRI, and PET domains.

#Vision#Multimodal#Benchmarking#arXiv

why featured

HKR-K passes with a concrete 58% gain, r=0.989, and a sampling mechanism. HKR-H and HKR-R are weak because the medical-imaging pretraining angle is narrow, so this stays in all.

editor take

Transfer-aware sampling gains up to 58%, r=0.989; stop mixing 3D medical pretraining data by inventory ratios.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Intelligent Truck Matching in Full Truckload Shipments Using Ping2Hex Approach

Project44 presents ITM 2.0 for full truckload GPS matching, using Uber H3 spatial indexing, temporal features, LightGBM ranking, and threshold post-processing; the system improves precision by 26 percentage points in North America and 14 points in Europe, while doubling coverage.

#Benchmarking#Project44#Uber#Research release

why featured

HKR-K passes with concrete mechanisms and accuracy deltas. HKR-H/R are weak: this is vertical logistics ML, not a general AI product, model, or agent update, so it lands in the 40–59 band.

editor take

ITM 2.0 lifts North America precision by 26 points; old-school H3 plus LightGBM still beats deep models in dirty logistics data.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning

The paper proposes FLAM, a federated learning evaluation method using aggregatable measures, and claims it matches centralized evaluation without requiring a global test dataset, addressing cases where participant-level weighted averaging yields incorrect metrics.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes for a concrete evaluation mechanism without a global test set. HKR-H/R are weak, and federated-learning metrics are niche academic material, so it sits in the 40–59 low-value research band.

editor take

FLAM claims centralized-equivalent FL evaluation without a global test set; the abstract omits metric coverage and error bounds, so don’t crown it yet.
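
To make the failure mode concrete, here is a minimal sketch of why sample-weighted averaging of per-client metrics can disagree with centralized evaluation. It is not FLAM's algorithm: the `ClientCounts` type, the F1 metric, and the numbers are hypothetical, and summing confusion counts is only one assumed example of what an aggregatable measure could look like.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientCounts:
    """Hypothetical per-client confusion counts for a binary task."""
    n: int   # number of local test examples
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_average_f1(clients: List[ClientCounts]) -> float:
    """Naive route: compute F1 on each client, then average weighted by n."""
    total = sum(c.n for c in clients)
    return sum(c.n * f1(c.tp, c.fp, c.fn) for c in clients) / total


def pooled_f1(clients: List[ClientCounts]) -> float:
    """Aggregatable route: sum the counts first, then compute F1 once."""
    return f1(
        sum(c.tp for c in clients),
        sum(c.fp for c in clients),
        sum(c.fn for c in clients),
    )


# Made-up clients: a small one with many positives, a large one with few.
clients = [
    ClientCounts(n=100, tp=40, fp=10, fn=10),
    ClientCounts(n=900, tp=5, fp=20, fn=25),
]
naive = weighted_average_f1(clients)
exact = pooled_f1(clients)
print(f"weighted average of client F1: {naive:.3f}")
print(f"F1 from summed counts:         {exact:.3f}")
```

The pooled value equals what a central evaluator holding all the test data would compute, because the counts are additive while F1 itself is not. Whether FLAM covers metrics without such additive sufficient statistics is exactly what the abstract leaves open.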

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

The paper proposes a two-axis view of channel information and tests it on ResNet-18, VGG-16, and MobileNetV2 trained on CIFAR-100; under a fixed FLOPs-matched pruning protocol, local-axis metrics predict channel removability more reliably than target-axis metrics, with the same direction preserved on CIFAR-10, Tiny-ImageNet, ImageNet-100, and a ConvNeXt-T/ImageNet-100 pilot.

#Vision#Interpretability#Inference-opt#ResNet-18

why featured

HKR-K passes because the paper adds a testable pruning signal across three CNNs on CIFAR-100. HKR-H/R are weak: the topic is narrow vision compression, not a broad practitioner trigger.

editor take

ResNet-18, VGG-16, and MobileNetV2 favor local-axis pruning signals; task relevance stays overrated, and VGG-16 norms still survive.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation

MARL-Rad decomposes chest X-ray interpretation into region-specific agents and a global integrating agent, then jointly optimizes them with clinically verifiable rewards; experiments on MIMIC-CXR and IU X-ray report higher RadGraph, CheXbert, and GREEN scores, while blinded clinician evaluation finds its reports clinically comparable to ground-truth reports.

#Agent#Multimodal#Fine-tuning#MARL-Rad

why featured

HKR-K passes via a concrete mechanism and benchmark claims; HKR-H and HKR-R miss. This is a narrow medical-imaging paper, not a product launch or broad agent capability update, so it lands in the 40–59 band.

editor take

MARL-Rad improves three clinical metrics on MIMIC-CXR and IU X-ray; role-trained agents beat post-hoc agent wiring here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Optimal Allocation of Dynamics and Reward Samples in Model-Based Reinforcement Learning

The paper analyzes model-based reinforcement learning with imagined rollouts, derives the optimal dynamics-to-reward sample allocation under power-law scaling assumptions, and reduces the choice between more noisy reward rollouts and fewer cleaner reward rollouts to a one-dimensional optimization problem.

#Agent#Reasoning#Asadi#Wang

why featured

Triggers the technical-accessibility hard exclusion: this is a theory-heavy model-based RL rollout paper with no practitioner on-ramp, code, or product implication. HKR-K passes, but the cap keeps it excluded.

editor take

Timor et al. derive sample allocation for imagined training; no experiments disclosed, so don’t treat it as an MBRL recipe yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

The paper introduces SB-TRPO, a hard-constrained reinforcement learning algorithm that combines reward and cost natural policy gradients at each step, guarantees a fixed fraction of optimal cost reduction, and evaluates safety-task tradeoffs on standard and challenging Safety Gymnasium tasks.

#Agent#Safety#Alignment#SB-TRPO

why featured

HKR-K passes: the mechanism and testbed are specific. HKR-H/R fail because the title is dry and the work stays close to specialist safe-RL research; technical accessibility lowers the score, but no hard exclusion is triggered.

editor take

SB-TRPO mixes reward and cost natural gradients per step; hard-constrained RL needs checkable cost reduction, not softer penalties.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Neurosymbolic Imitation Learning with Human Guidance: A Privileged Information Approach

The paper proposes a neurosymbolic imitation learning method that uses gaze data as privileged information available only during training; the abstract says empirical evaluations test effectiveness, efficiency, and generalization, but the RSS snippet does not disclose sample counts or benchmark names.

#Reasoning#Research release

why featured

HKR-K passes for the training-time gaze-as-privileged-information mechanism. HKR-H/R are weak, and the post gives no sample size, benchmark names, or result numbers, so it stays in all.

editor take

The paper uses gaze as training-only privileged information; sample counts and benchmarks are undisclosed, so I don’t buy the both-worlds claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysis

The paper builds the Korean EVAD annotated corpus for fashion e-commerce reviews, uses SSP and FST-based linguistic resources for ABSA annotation, and reports F1 scores of 0.88 for KoBERT and 0.90 for KcBERT on aspect-value pair recognition.

#Fine-tuning#Benchmarking#KoBERT#KcBERT

why featured

HKR-K passes for a new corpus, labeling method, and F1 results. HKR-H/R fail: the angle is a narrow academic NLP dataset with little practitioner tension, so it stays in the 40–59 band.

editor take

EVAD uses SSP/FST on Korean fashion reviews; KcBERT hits 0.90 F1, a cleaner ABSA signal than generic sentiment leaderboards.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

The paper introduces EFLA, replacing the first-order Euler update in delta-rule linear attention with an exact closed-form flow while preserving linear-time complexity, parameter count, and chunkwise parallelism; the RSS snippet says it reduces perplexity and improves robustness, but it does not disclose numerical gains.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes via a concrete mechanism and retained linear time/block parallelism. HKR-H/R are weak, and the continuous-time dynamics framing is technical for general AI pros, so it stays in the lower research band.

editor take

EFLA swaps Euler updates for closed-form flow in delta-rule linear attention; gains lack numbers, so I buy the mechanism, not the payoff.
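
For readers who want the mechanism spelled out, the following is a hedged derivation sketch rather than the paper's own formulation. The notation (state $S$, key $k_t$, value $v_t$, step size $\beta$) and the specific ODE are assumptions chosen so that one Euler step reproduces the standard delta rule; the summary does not give EFLA's equations.

```latex
% Assumed notation, not taken from the source:
% S = memory state, k_t = key, v_t = value, \beta = step size, \lambda_t = k_t^\top k_t.
\begin{align}
\text{Delta rule (one Euler step):}\quad
  & S_t = S_{t-1} - \beta\,(S_{t-1}k_t - v_t)\,k_t^\top \\
\text{Assumed continuous-time dynamics:}\quad
  & \frac{dS}{d\tau} = -(S k_t - v_t)\,k_t^\top \\
\text{Exact flow over an interval of length } \beta\text{:}\quad
  & S_t = S_{t-1} - \frac{1 - e^{-\beta\lambda_t}}{\lambda_t}\,(S_{t-1}k_t - v_t)\,k_t^\top
\end{align}
```

Under these assumptions the closed form exists because $k_t k_t^\top$ is rank one, so its matrix exponential reduces to a scalar factor. Expanding $e^{-\beta\lambda_t}$ to first order recovers the Euler update, and the exact step keeps the same rank-1 shape, which is consistent with the claim that parameter count and linear-time cost are preserved.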

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Path Integration and Object-Location Binding Emerge in an Action-Conditioned Predictive Sequence Network

The study trains a recurrent neural network to predict the next token in 2D continuous token scenes from current input and saccade-like displacement, and decoding analyses show path integration plus dynamic binding between token identity and position.

#Reasoning#Memory#Interpretability#Research release

why featured

HKR-K passes because the paper reports a testable representation mechanism. HKR-H and HKR-R fail: the angle is academically dense, with no product, benchmark, safety, or market hook; no hard-exclusion rule is triggered.

editor take

An RNN predicts next tokens in 2D scenes; path integration is unsurprising, but the intervention story needs cross-architecture replication.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→TAP: Two-Stage Adaptive Personalization of Multi-Task and Multi-Modal Foundation Models in Federated Learning

TAP personalizes federated foundation models with a two-stage method: it first uses mismatched client-server architectures to replace selected parameters, then applies post-FL distillation after the global model stabilizes; the arXiv abstract says code is public, but does not disclose dataset names or exact metric gains.

#Multimodal#Fine-tuning#Research release#Open source

why featured

HKR-K passes because TAP describes a two-stage personalization mechanism; HKR-H and HKR-R fail because there is no surprising result, metric, or practitioner conflict. The federated-learning angle is specialist, so this stays in the lower band.

editor take

TAP has a two-stage FL personalization recipe, but no datasets or gains disclosed; I trust the architecture-mismatch idea more than “extensive experiments.”

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→StreamPhy Achieves Streaming Inference of High-Dimensional Physical Dynamics

StreamPhy infers full-field physical dynamics from incoming irregular sparse measurements using a data-adaptive observation encoder, structured state-space model, and FT-FiLM decoder; experiments on three physical systems report at least 48% accuracy improvement and 20–100× faster inference than diffusion-based methods.

#Inference-opt#StreamPhy#Research release#Benchmark

why featured

Hard-exclusion-4 applies: this is AI for physical dynamics, with no agent, product, or general-model implication disclosed. HKR-K is real via +48% accuracy and 20–100× speed, but audience fit is narrow, so it stays below 40.

editor take

StreamPhy reports 48% accuracy gains and 20–100x faster inference on 3 systems; I buy the streaming SSM bet, not broad generality yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→Characterizing and Correcting Effective Target Shift in Online Learning

The paper derives a closed-form solution for online kernel regression, proves it is equivalent to offline regression with shifted target outputs, and shows iterative target correction improves continual-learning performance on CIFAR-10 and CORe50 versus training with true targets.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via the target-shift mechanism and CIFAR-10/CORe50 setup, but no gain size is disclosed. HKR-H/R fail; the theory-heavy scope keeps it in the low-value research-signal band without triggering hard exclusion.

editor take

Online kernel regression is equivalent to offline regression on shifted targets; beating true targets on CIFAR-10 and CORe50 makes the claim hard to ignore.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

29d ago

arXiv · cs.LG· atomEN04:00 · 05·11

→ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data

ProtoSSL learns a reusable prototype bank from unlabeled time-series data with a self-supervised objective, then aligns prototypes to downstream tasks, outperforming supervised prototype baselines across six ECG datasets in low-data regimes with as few as 256 labeled examples.

#Interpretability#Fine-tuning#Audio#ProtoSSL

why featured

HKR-K passes on concrete experiment details, but HKR-H and HKR-R are weak. This is specialized time-series representation research with no code release, model launch, or product path disclosed.

editor take

ProtoSSL beats supervised prototype baselines on 6 ECG sets with 256 labels; I buy low-label value, not broad time-series claims yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

03:28

29d ago

HuggingFace Papers (takara mirror)· rssEN03:28 · 05·11

→PruneTIR: Inference-Time Tool Call Pruning for Effective Yet Efficient Tool-Integrated Reasoning

PruneTIR prunes trajectories, resamples tool calls, and suspends tool use during inference through three triggers: success, stuck states, and repeated retries; the post says it improves Pass@1 and reduces working context length, but does not disclose the exact gains.

#Agent#Reasoning#Tools#PruneTIR

why featured

HKR-K is clear: three inference-time triggers are specified. HKR-R lands for agent tool-call reliability and cost, but HKR-H is weak and no Pass@1 or benchmark delta is disclosed, so this stays in all.

editor take

PruneTIR adds 3 inference triggers for tool calls; Pass@1 gains are undisclosed, so I read it as runtime loss control.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

03:07

29d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN03:07 · 05·11

→Position: Academic Conferences Face Denominator Gaming from Fully Automated Scientific Agents

The position paper defines Agentic Denominator Gaming: an attacker uses AI agents to submit many low-quality but plausible papers, inflating the submission denominator under stable acceptance-rate policies and increasing the publication probability of a small targeted set of legitimate papers.

#Agent#Safety#Research release#Safety/alignment

why featured

HKR-H/K/R all pass, but the summary gives a mechanism, not scale, conference data, or a real attack case. This fits a solid AI-safety/policy paper above featured threshold, not same-day must-write.

editor take

Don’t read this as research-ethics scolding; it attacks a conference mechanism where agents win by bloating submissions, not by writing good papers.

sharp

Agentic Denominator Gaming is sharp because the attack target is the acceptance-rate mechanism, not paper quality. The paper’s setup is concrete: an attacker submits many superficially plausible low-quality papers, inflates the submission denominator, and improves the odds for a small set of targeted legitimate papers under stable acceptance-rate policies. That is harder to defend against than “detect AI-written papers,” because detection screens content while the attack burns reviewer capacity. ScienceBoard reported only 15% overall success for GPT-4o, Claude 3.7, and UI-TARS on realistic scientific workflows, so agents still struggle at doing science. But spam that only has to pass on formatting, citations, and fake-ish experiment tables has a much lower bar to clear. If top conferences keep treating a stable acceptance rate as a quality anchor, automated paper mills get a clean economic target.
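
The arithmetic behind the attack can be sketched in a few lines. This is a toy model, not the paper's analysis: it assumes the venue fixes accepted slots at `acceptance_rate * total_submissions`, that every spam paper is rejected, and that the slots are spread evenly over legitimate papers. The function name and all numbers are hypothetical.

```python
def legit_acceptance_probability(
    legit: int, spam: int, acceptance_rate: float
) -> float:
    """Chance that a legitimate paper is accepted under the toy assumptions.

    Assumes accepted slots scale with total submissions, all spam is
    rejected, and slots are shared uniformly among legitimate papers.
    """
    if legit <= 0:
        raise ValueError("need at least one legitimate submission")
    slots = acceptance_rate * (legit + spam)
    # A paper cannot be accepted with probability above 1.
    return min(1.0, slots / legit)


# Hypothetical venue: 10,000 real submissions and a 25% target acceptance rate.
baseline = legit_acceptance_probability(10_000, 0, 0.25)
attacked = legit_acceptance_probability(10_000, 5_000, 0.25)
print(f"no spam:       {baseline:.3f}")
print(f"5,000 spam:    {attacked:.3f}")
```

In this toy setting the headline acceptance rate never moves, yet the odds for real papers rise with every spam submission. The paper's claim is narrower (a small targeted set benefits), and the model ignores the reviewer-capacity cost that the take above treats as the real damage.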

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

01:02

29d ago

HuggingFace Papers (takara mirror)· rssEN01:02 · 05·11

→Paper Evaluates Efficient Neural Architectures for Real-Time ECG Interpretation on Limited Hardware

The paper evaluates five CNN architectures on three public 12-lead ECG datasets from Germany, China, and the United States, and uses a unified Efficiency Score combining model size, inference speed, memory usage, and AUC performance.

#Inference-opt#Benchmarking#AttiaNet#DeepResidualCNN

why featured

Hard-exclusion: traditional medical/science AI crossover; AI is used for ECG interpretation with no product, agent, or platform implication. HKR-K passes on benchmark details, but audience fit is narrow, capped below 40.

editor take

The paper evaluates on three 12-lead ECG datasets; on-device latency is undisclosed, so the Efficiency Score is not deployment proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

00:04

29d ago

HuggingFace Papers (takara mirror)· rssEN00:04 · 05·11

→Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Fashion Florence fine-tunes Florence-2 with LoRA to generate JSON fashion attributes from one clothing image; on 461 held-out images, it reports 94.6% category accuracy, 63.0% material accuracy, 99.8% valid JSON output, and 0.753 style-tag F1 while running as a 0.77B-parameter model on a single GPU.

#Vision#Fine-tuning#Multimodal#Fashion Florence

why featured

HKR-K passes with LoRA, test-set size, accuracy, and JSON-validity numbers. HKR-H/R are weak because this is a narrow vertical vision-extraction paper, so it fits the 60–71 band.

editor take

Fashion Florence trains a 0.77B model on 3,688 images; 94.6% category accuracy is nice, 63.0% material accuracy bites.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
