papers · 2026-06-01

▸ 261 papers · updated 3m ago

May 2026

MTWTFSS

1150 2 39 4190 5295 6173 7203 8283 9 10 11296 12541 13284 14248 15209 16 17 18224 19490 20273 21240 2223 23185 24 25200 26431 27258 28251 29257 30 31

June 2026

MTWTFSS

1261 2473 3215 4239 57 6 7 8173 9377101112131415161718192021222324252627282930

2026-06-01 · Mon

17:59

7d ago

arXiv · cs.AI· atomEN17:59 · 06·01

→Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

The paper defines Perceptual Judgment Bias in multimodal LLM-as-a-Judge systems and trains judges with a perceptually perturbed dataset, a structured GRPO-based reward, and a batch-ranking objective; the RSS snippet does not disclose dataset size, benchmark names, or exact improvement numbers.

#Multimodal#Vision#Alignment#Research release

why featured

HKR-K/R pass: the mechanism is concrete and the topic matters for multimodal eval reliability. No sample size, gains, or reproducible setup are disclosed in the feed, so this stays in the interesting band.

editor take

The paper trains MLLM judges with perturbations and GRPO, but RSS gives no dataset size or gains; I buy the failure mode, not the victory lap.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:59

7d ago

HuggingFace Papers (takara mirror)· rssEN17:59 · 06·01

→RoboDream: Compositional World Models for Scalable Robot Data Synthesis

RoboDream anchors generation to rendered robot motion and synthesizes photorealistic robot demonstrations with novel objects, scenes, and viewpoints; the snippet reports improved downstream policy performance and lower real-world data needs, but the post does not disclose task counts, dataset scale, or reduction percentages.

#Robotics#Multimodal#Vision#Research release

why featured

HKR-H/K/R pass, but the post lacks task counts, success rates, or data-cost deltas, so it stays in the 60–71 research-interest band rather than featured.

editor take

RoboDream constrains video generation with rendered robot motion; no task count, dataset scale, or reduction percent disclosed, so I don’t buy the “significantly reduces real data” claim yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:56

7d ago

FEATUREDarXiv · cs.CL· atomEN17:56 · 06·01

→AdaCodec: A Predictive Visual Code for Video MLLMs

AdaCodec encodes video with reference frames and compact P-tokens for motion and residual changes, outperforming the Qwen3-VL-8B per-frame RGB baseline across 11 benchmarks at a matched visual-token budget and reducing time-to-first-token from 9.26 seconds to 1.62 seconds on five general-video benchmarks.

#Multimodal#Vision#Inference-opt#Qwen

why featured

HKR-H/K/R all pass: AdaCodec has a concrete mechanism, testable numbers, and targets video-MLLM inference latency. As a single arXiv paper, it fits the 78–84 band, not same-day must-write.

editor take

AdaCodec hits the waste in video tokens: 32k beats a 224k baseline, which is a cleaner path than stuffing more frames.

sharp

AdaCodec’s sharp claim is that video MLLMs waste compute by rereading near-identical frames. The mechanism is concrete: keep full reference frames when predictive cost is high, then encode motion and residual changes as P-tokens. At the same visual-token budget, it beats the Qwen3-VL-8B per-frame RGB baseline on 11 benchmarks. On long-video tasks, 32k tokens beat the 224k baseline, a 7x budget gap. I buy this direction more than brute-forcing longer video context. Per-frame RGB is an ugly interface for temporal data, and time-to-first-token dropping from 9.26s to 1.62s matters in actual products. The missing part is whether this survives hard cuts, fast camera motion, and dense scene changes. The snippet gives benchmark wins, not failure modes or training cost.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:56

7d ago

FEATUREDarXiv · cs.CL· atomEN17:56 · 06·01

→ClinEnv: An Interactive Multi-Stage Long-Horizon EHR Environment for Agents

ClinEnv converts real inpatient admissions into a multi-stage interactive benchmark where models query four specialized agents before decisions; across seven models, the strongest reaches only 0.31 decision F1, with management actions at 0.17 F1 versus 0.51 for discharge diagnoses.

#Agent#Benchmarking#ClinEnv#Research release

why featured

HKR-H/K/R all pass: ClinEnv gives a concrete clinical-agent failure case with F1 numbers, not a generic benchmark claim. It is research-heavy, so it fits 78–84 rather than same-day must-write territory.

editor take

ClinEnv exposes the medical-agent gap: models can guess discharge diagnoses, but management action F1 at 0.17 is not clinical reasoning.

sharp

ClinEnv’s punch is that it separates answer quality from clinical control. Across seven models, the best reaches only 0.31 decision F1; discharge diagnoses hit 0.51 F1, while management actions collapse to 0.17. That hits the weak spot in a year of medical LLM demos: retrospective chart summarization looks competent, sequential care does not. The setup is harsher than MedQA-style exams because each stage forces the model to query four specialist agents before choosing medications, procedures, and diagnoses. I would not overread it as a hospital simulator: automatic case construction, ontology matching, and the quality of the four query agents all shape the score. Still, it gives practitioners a cleaner failure mode than vibes. A medical agent that can name the disease but cannot manage the patient is a liability with a benchmark number now attached.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:52

7d ago

arXiv · cs.CL· atomEN17:52 · 06·01

→From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

SubFit compresses LLMs at the Attention and FeedForward submodule level using non-contiguous selection and fitted residual bypasses; across 10 LLMs, five sparsity levels from 12.5% to 37.5%, and four replacement baselines, it retains 84.6% dense downstream accuracy at 25% sparsity versus 81.6% for the strongest baseline.

#Inference-opt#Benchmarking#SubFit#Research release

why featured

HKR-K is solid and HKR-R is moderate: SubFit shifts replacement to Attention and FeedForward submodules, with 10-model tests and 84.6% accuracy retention. The angle is niche compression research, so HKR-H misses and it stays below featured.

editor take

SubFit keeps 84.6% accuracy at 25% sparsity across 10 LLMs; layer-level compression looks lazy after this.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:51

7d ago

arXiv · cs.CL· atomEN17:51 · 06·01

→HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

HERO'S JOURNEY introduces 8 goal-directed text-game tasks where LLM agents infer hidden rules from demonstrations and execute them across multiple steps, with results showing limited, uneven rule induction and no reliable procedural-task gains from induction-specific steering methods.

#Agent#Reasoning#Benchmarking#HERO'S JOURNEY

why featured

HKR-H and HKR-K pass: 8 text-game tasks make rule induction and multi-step execution testable. No model scores, release details, or deployment stake are disclosed, keeping it in the normal research-benchmark band.

editor take

HERO'S JOURNEY tests 8 text games; LLMs still choke on procedural induction, and steering prompts don't fix it.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:50

7d ago

arXiv · cs.AI· atomEN17:50 · 06·01

→Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

MDA predicts multiple depth hypotheses and probabilities per pixel, then decodes depth from one hypothesis at object boundaries, reducing flying-point artifacts caused by single-depth training targets that place predictions between foreground and background surfaces.

#Vision#MDA#Research release

why featured

HKR-K passes for a concrete mechanism, but the item has only an arXiv title/brief summary with no metrics, code, or deployment angle. Depth-estimation research is narrow for this audience.

editor take

MDA predicts per-pixel depth mixtures; flying points get treated as target ambiguity, not cleanup noise.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:49

7d ago

arXiv · cs.CL· atomEN17:49 · 06·01

→SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

The paper proposes SN-WER, a training-free ASR evaluation metric that transliterates references and hypotheses into a language-specific canonical script before WER, then evaluates it on 5 Indic languages, 2 datasets, and 3 ASR models.

#Audio#Benchmarking#arXiv#Research release

why featured

HKR-K passes because SN-WER gives a concrete metric mechanism and test setup. HKR-H and HKR-R are weak: multi-script Indic ASR evaluation is narrow, so it stays in the 40–59 research-signal band.

editor take

SN-WER cuts inflated gaps by 12% across 5 Indic languages; I buy the metric, but Common Voice still exposes weak ASR.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:46

7d ago

arXiv · cs.CL· atomEN17:46 · 06·01

→SimSD: Simple Speculative Decoding in Diffusion Language Models

SimSD adds valid token-level contexts to diffusion language models through a plug-and-play masking strategy, and experiments on SDAR-family dLLMs across four benchmarks report up to 7.46x higher decoding throughput while maintaining or improving average generation quality.

#Inference-opt#SimSD#SDAR#Research release

why featured

HKR-H/K/R pass via the 7.46x throughput hook, concrete masking mechanism, and inference-cost angle. The niche diffusion-LM scope keeps it below featured.

editor take

SimSD reports up to 7.46x throughput on four SDAR benchmarks; training-free is nice, but one model family is thin evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:45

7d ago

FEATUREDarXiv · cs.CL· atomEN17:45 · 06·01

→SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

SkillHarm introduces 879 attack samples across 71 agent skills, testing Fixed-Payload Poisoning and Self-Mutating Poisoning with attack success rates up to 86.3% for FPP and 69.3% for SMP.

#Agent#Safety#Benchmarking#SkillHarm

why featured

All HKR axes pass: automated attacks create HKR-H, the sample counts and success rates satisfy HKR-K, and agent skill safety gives HKR-R. Unknown author pull and limited method detail keep it at 78, featured not p1.

editor take

SkillHarm frames agent skills as supply-chain risk: 86.3% FPP success says default trust in tool/plugin ecosystems is still reckless.

sharp

SkillHarm’s sharp point is persistence, not sample count. It moves skill attacks out of one-off prompt injection and into the agent supply chain. The benchmark has 879 attacks across 71 skills and 12 risk types, with success rates up to 86.3% for Fixed-Payload Poisoning and 69.3% for Self-Mutating Poisoning. SMP is the nastier setup: a benign first run mutates persistent skill content, then the harm lands on reuse. The uncomfortable line is that many “failed” attacks failed because the agent never touched the poisoned file, not because defenses held. That makes current mitigation scores look inflated. Compared with web prompt injection, this smells closer to npm or VS Code extension risk, except the caller is an agent that treats third-party skills as instructions.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:40

7d ago

arXiv · cs.AI· atomEN17:40 · 06·01

→Research Proposes Text Embedding Direction Method for Measuring Adaptive Agent Behavior Traits

The authors define agent traits as directions in text-embedding space and score skill-file edits by projection; on 68 labeled skill-diff pairs for propensity to seek sensitive data, the method reaches 91.2% sign classification accuracy and Spearman ρ=0.82 under leave-one-out cross-validation.

#Agent#Embedding#Safety#Research release

why featured

HKR-K/R pass with a concrete mechanism and metrics tied to agent-safety evaluation. HKR-H is weak, and this is a single arXiv paper with no disclosed tool, code, or production path, so it stays in the interesting-not-featured band.

editor take

Embedding-direction trait tracking hits 91.2% on 68 diffs; tiny sample, but skill files as auditable behavior surfaces is right.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:38

7d ago

FEATUREDarXiv · cs.CL· atomEN17:38 · 06·01

→SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer restricts the reverse KL penalty to selected safety tokens during on-policy distillation and uses 100 harmful samples, reporting stronger safety-capability trade-offs across seven safety benchmarks and five general capability benchmarks without general-purpose data.

#Alignment#Safety#Fine-tuning#SafeSteer

why featured

HKR-H/K/R all pass, but this is a single arXiv alignment-method paper with no major-lab release, artifact, or cross-source cluster; the 100-sample safety claim lifts it into lower featured.

editor take

SafeSteer treats safety alignment like a local patch, not a new RLHF diet; the 100-sample claim is sharp, but I’d stress-test it hard.

sharp

SafeSteer’s sharp claim is that the alignment tax is often bad localization, not an unavoidable capability trade-off. It applies reverse KL only on selected safety tokens, builds the teacher with activation steering, and trains from 100 harmful samples. The paper reports gains across seven safety benchmarks and five general capability benchmarks, with no general-purpose data. That is a cleaner mechanism than yet another weighted objective over all tokens. I would pressure-test two points first: whether the safety-token selector transfers across model families, and whether the seven safety benchmarks include multi-turn jailbreaks or tool-use failure modes. PPO, DPO, and Constitutional AI are expensive, but they have taken more distribution-shift abuse. If SafeSteer wins mainly on static refusal tests, the 100-sample number is a paper-friendly artifact, not an alignment recipe.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:32

7d ago

arXiv · cs.CL· atomEN17:32 · 06·01

→FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

FigSIM introduces a public dataset of 1,049 suicide memes annotated for severity levels, figurative phenomena, and suicide-related content, and benchmarks 16 unimodal and multimodal models across figurative language, severity, and content detection tasks.

#Multimodal#Vision#Benchmarking#FigSIM

why featured

HKR-H/K/R all pass, but this is a niche safety benchmark, not a model or product release. The 1,049-sample dataset and 16-model test add signal, while audience reach stays limited.

editor take

FigSIM ships 1,049 annotated suicide memes; 16 models underpredict severe figurative cases, exactly where moderation breaks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:01

7d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN17:01 · 06·01

→Monitoring Agentic Systems Before They're Reliable

The paper proposes an agentic-system monitoring method across three evaluation dimensions and three scopes, and tests it on 220 synthetic runs over 120 document bundles; deterministic triage routes 97% of findings to automated tracking and leaves the 2% reflecting variable behavior for human investigation.

#Agent#Benchmarking#Safety#Research release

why featured

HKR-H/K/R all pass, but this is still a synthetic-test paper with no production deployment or cross-source traction shown. It clears featured, not the 78+ research-discussion band.

editor take

This pulls agent monitoring back from bad answers to structural defects; 220 synthetic runs is thin, but the framing is more honest than task-score theater.

sharp

Agent monitoring should catch structural defects before it pretends to diagnose wrong answers. The paper’s hook is concrete: 220 synthetic runs across 120 document bundles, with within-run monitors finding deterministic stage defects at CV=0.02, cross-run monitors surfacing stochastic integration effects at CV=1.25, and a structural monitor finding an integration gap at CV=0.00. The sharp result is that injected task-level errors were indistinguishable from clean baselines, which matches what many production agent teams see: pipeline noise drowns the task signal. The 97% automated-tracking / 2% human-investigation split is tidy, maybe too tidy, because the testbed is synthetic. Real regulated workflows add ugly document variance, permissions, retries, and human edits. Still, this is a better eval posture than celebrating an agent score while the assembly underneath is leaking.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:37

7d ago

HuggingFace Papers (takara mirror)· rssEN16:37 · 06·01

→Learning When to Translate for Multilingual Reasoning

Luar trains reasoning language models to choose between direct reasoning on the original input and reasoning over an English translation, outperforming GRPO and other training baselines on multilingual reasoning benchmarks, while the post does not disclose exact scores.

#Reasoning#Alignment#Luar#GRPO

why featured

HKR-H and HKR-K pass: the routing mechanism is concrete and the GRPO benchmark claim is testable. Specific scores, model scale, and release details are not disclosed, so this stays interesting but not featured.

editor take

Luar makes RLMs translate on demand; no scores disclosed, so I buy the low-resource trigger idea, not the GRPO win claim.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

16:30

7d ago

HuggingFace Papers (takara mirror)· rssEN16:30 · 06·01

→Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

The paper proposes a dynamic cognitive map and Spatial Assertion Codes for agentic VLM spatial reasoning, reaching 80.5% overall accuracy on MindCube and outperforming the prior best method by 29.5 accuracy points on the Rotation subset.

#Agent#Vision#Reasoning#Research release

why featured

HKR-H/K/R all pass, but this is a single research item with impact limited to MindCube and the Rotation subset; no broad replication or product path is disclosed, so it stays in the high 60–71 band.

editor take

The paper hits 80.5% on MindCube. SAC’s dense checks matter; the pigeon framing is just garnish.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

15:28

7d ago

HuggingFace Papers (takara mirror)· rssEN15:28 · 06·01

→Honey, I Shrunk the Arc de Triomphe!

The authors introduce MetricScenes, a metrically grounded in-the-wild dataset using Internet photo collections, stereo imagery, geotagged metadata, and stereo baselines to recover absolute scale, then fine-tune MoGe-2 to reduce scale collapse in distant landmarks and open-domain scenes; the post does not disclose dataset size or benchmark numbers.

#Vision#Fine-tuning#Benchmarking#MetricScenes

why featured

HKR-H and HKR-K pass: the title gives a vivid failure case, and the post names MetricScenes plus the MoGe-2 fine-tuning path. Sample size is not disclosed, and HKR-R is narrow to CV researchers.

editor take

MetricScenes adds geotags and stereo baselines for absolute scale; size and metrics are undisclosed. The data-bottleneck blame sounds right.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

15:00

7d ago

HuggingFace Papers (takara mirror)· rssEN15:00 · 06·01

→TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

TROPHIES jointly estimates dynamic humans, static scenes, and camera poses from multi-view videos in one global coordinate frame, using scale consistency, contact priors, and cross-view temporal coherence for global alignment and reporting stronger global fidelity and human-scene consistency on EgoHuman and EgoExo4D.

#Vision#Multimodal#Reasoning#TROPHIES

why featured

HKR-K passes because the post gives a concrete joint reconstruction mechanism and EgoHuman/EgoExo4D setting. HKR-H and HKR-R are weak, and the 3D vision paper is niche for this feed, so it stays in the lower all tier.

editor take

TROPHIES tests 4D joint reconstruction on EgoHuman and EgoExo4D; metrics are undisclosed, so treat “physically plausible” as unproven.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:23

7d ago

HuggingFace Papers (takara mirror)· rssEN14:23 · 06·01

→Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization

The paper proposes PHF, a three-level user modeling framework with practices, habitus, and fields, and evaluates a frozen-LLM PHF-Compass implementation on the LaMP benchmark for LLM personalization tasks.

#Memory#Interpretability#Benchmarking#Pierre Bourdieu

why featured

HKR-H/K/R pass at modest strength: PHF gives a testable three-layer personalization mechanism on LaMP. The post discloses no gain size, code, or production validation, so it stays below featured.

editor take

PHF tests a 3-layer user model on LaMP, but gains are undisclosed; nice sociology wrapper, prove it beats long-context memory.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

11:49

7d ago

HuggingFace Papers (takara mirror)· rssEN11:49 · 06·01

→ProbRes: Volatility Learning for Probabilistic Time-Series Forecasting

ProbRes models conditional mean and conditional volatility with two architecture-agnostic modules, then generates predictive distributions at inference by resampling normalized residuals for univariate and multivariate heteroskedastic time series.

#Benchmarking#ProbRes#Research release

why featured

HKR-K passes via a concrete forecasting mechanism for heteroscedastic series. HKR-H/R are weak, and the post does not disclose benchmark gains, code, or production evidence, so it stays in the low research-signal band.

editor take

ProbRes uses two modules for mean and volatility; I like the calibration angle, but baselines and datasets are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

10:30

7d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN10:30 · 06·01

→SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

SentGuard moderates streamed LLM output at sentence boundaries in parallel with generation, using a waiting buffer and StreamSafe annotations across 8 harm categories; on 5 safety benchmarks, it detects 90.5% of unsafe cases within two sentences while keeping the streaming false-positive rate at 7.41%.

#Safety#Benchmarking#SentGuard#StreamSafe

why featured

HKR-H/K/R all pass, but the source only gives abstract-level facts and no authors, code, or production validation. This fits a useful safety research item, not a must-write release.

editor take

SentGuard’s sentence boundary bet is pragmatic: 90.5% unsafe detection within two sentences, with 7.41% streaming false positives.

sharp

SentGuard’s useful move is timing, not a magic safety classifier. Token-level moderation fires on half-formed semantics; full-response moderation lets bad content reach the user first. Sentence-level buffering is a sane middle layer: hold a chunk, verify it, and let the target model keep decoding behind the offset. The concrete numbers are strong enough to test: 5 safety benchmarks, StreamSafe annotations across 8 harm categories, 90.5% unsafe-case detection within two sentences, and a 7.41% streaming false-positive rate. I would not treat it as production-ready from the snippet. Latency cost, throughput hit, multilingual sentence boundaries, and jailbreaks that hide intent across long setup are not disclosed. For chat products, this is closer to deployable guardrail plumbing than another offline safety leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:20

7d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN10:20 · 06·01

→OpenWebRL Framework for Online Multi-turn Reinforcement Learning in Visual Web Agents

The OpenWebRL team introduced an online multi-turn RL framework for visual web agents and trained OpenWebRL-4B with 0.4K initialization trajectories and 2.2K open-ended RL tasks, reaching 67.0% success on Online-Mind2Web and 64.0% on DeepShop while staying competitive with OpenAI CUA and Gemini CUA.

#Agent#Reasoning#Multimodal#OpenWebRL

why featured

HKR-H/K/R all pass: OpenWebRL-4B applies online multi-turn RL to visual web agents with training scale and two benchmark success rates. It fits the 78–84 open-source agent research band, below major lab releases.

editor take

OpenWebRL-4B hitting 60%+ live-web success from 2.6K-scale data is a cleaner agent-training signal than another pile of curated traces.

sharp

OpenWebRL’s sharp point is moving web agents away from “watch curated browser traces” and back into live trial-and-error. OpenWebRL-4B uses only 0.4K initialization trajectories and 2.2K RL tasks, then reports 67.0% success on Online-Mind2Web and 64.0% on DeepShop. That data scale is tiny, so the gain likely comes from online multi-turn feedback, trajectory-level judging, and live-browser infrastructure, not another supervised-data grab. I’d discount the “competitive with OpenAI CUA and Gemini CUA” line until the paper shows same-environment numbers, retry rules, and site distributions. Web-agent benchmarks are notoriously sensitive to browser state and task templates. Still, for open agents, this smells like a better scaling path than the old WebArena-style static SFT loop.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:16

7d ago

HuggingFace Papers (takara mirror)· rssEN10:16 · 06·01

→World-Task Factorization Framework for Robot Learning

The paper proposes a world-task factorization framework for robot learning, pairs AICON with a compact learned policy, and reports tests on three robotics problems where it outperforms end-to-end baselines and analytical heuristics, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

#Robotics#Agent#Reasoning#AICON

why featured

HKR-K is clear: a named framework, 3 robotics problems, and zero-shot OOD results. HKR-R is limited to robotics-learning practitioners; no hard exclusion, but it lacks major-lab/product impact, so it stays in the 60–71 band.

editor take

AICON beats end-to-end baselines on 3 robot tasks; sample counts aren’t disclosed, but world/task factorization beats pure scaling here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

09:50

7d ago

HuggingFace Papers (takara mirror)· rssEN09:50 · 06·01

→CARTE: A Benchmark for Mapping Language Model Knowledge Across France

CARTE evaluates 27 LLMs from 1B to 12B parameters with 2,431 multiple-choice questions across France’s 13 metropolitan regions and 14 domains, including culture, language, demographics, economy, environment, and mobility.

#Reasoning#Benchmarking#CARTE#Research release

why featured

HKR-K is concrete and HKR-R matters for localization/eval teams, but this is a narrow benchmark paper without a major lab, broad artifact impact, or industry-level result, so it fits the 60–71 all band.

editor take

CARTE tests 27 small LLMs on 2,431 France questions; useful regional probe, but few-shot MCQ stays far from real retrieval.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

09:46

7d ago

HuggingFace Papers (takara mirror)· rssEN09:46 · 06·01

→MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow applies flow-matching reinforcement learning to multi-turn image editing, combining multi-reward signals with GRPO and NFT-based methods; on FLUX.1-Kontext-dev, it raises turn-3 overall performance by 6.85 points and surpasses open-source models such as Qwen-Image-Edit, while the post does not disclose dataset size or training cost.

#Vision#Multimodal#Fine-tuning#FLUX.1-Kontext-dev

why featured

HKR-H and HKR-K pass: the paper has a clear multi-turn editing mechanism and a +6.85-point result. HKR-R is weak, and this is a normal research update, so it stays in the 60–71 band.

editor take

MT-EditFlow lifts FLUX.1-Kontext-dev turn-3 by 6.85 points; dataset size and training cost are undisclosed, so reproducibility is still thin.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

09:14

7d ago

HuggingFace Papers (takara mirror)· rssEN09:14 · 06·01

→WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM shifts video-action learning to event-grounded VLA pretraining with event captions, cluster-balanced sampling, and two inference modes; the post says it reaches state-of-the-art performance in large-scale real-world generalization evaluation, but does not disclose scores or benchmark names.

#Robotics#Vision#Multimodal#WALL-WM

why featured

HKR-K passes on concrete mechanisms, but the post does not disclose real-generalization scores and stays within robotics/VLA research. HKR-H and HKR-R miss, so this lands as useful but narrow signal.

editor take

WALL-WM uses event-level VLA pretraining, but scores and benchmarks are undisclosed; I don’t buy the SOTA claim without open evals.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

09:08

7d ago

HuggingFace Papers (takara mirror)· rssEN09:08 · 06·01

→Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

The paper proposes LoRSP, which combines low-rank prompt factorization with an SNN integrate-and-fire mechanism to generate instance-specific sparse visual prompts. Experiments cover five heterogeneous vision backbones and multiple benchmarks, while the snippet does not disclose exact accuracy, parameter counts, datasets, or energy metrics.

#Vision#Fine-tuning#Inference-opt#Research release

why featured

HKR-K passes via a concrete mechanism and 5-backbone evaluation. HKR-H/R are weak: the angle is narrow and the body does not disclose gain numbers, code, or deployment context, so this stays in low-value research territory.

editor take

LoRSP tests 5 vision backbones, but accuracy and energy numbers are undisclosed; I want the parameter table before buying SNN prompting.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

07:42

8d ago

HuggingFace Papers (takara mirror)· rssEN07:42 · 06·01

→Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

DySCo selects a small set of communication edges in each reasoning round using agent reliability, answer divergence, and task relevance under budget constraints. The paper evaluates the mechanism on mathematical reasoning, logical reasoning, and factual question answering, but the RSS snippet does not disclose concrete token-cost, latency, or accuracy numbers.

#Agent#Reasoning#DySCo#Research release

why featured

HKR-K/R pass: DySCo adds trust-, disagreement-, and relevance-based sparse communication for LLM agents. No cost-reduction ratio or standout benchmark result is disclosed, so it stays in the 60–71 research-signal band.

editor take

DySCo picks edges by reliability, divergence, and relevance; no cost numbers disclosed, so sparse communication has not won yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

07:34

8d ago

HuggingFace Papers (takara mirror)· rssEN07:34 · 06·01

→TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech

TalkTag uses a fine-tuned LLM to automate CHAT-style morphosyntactic error annotation in spoken-language transcripts, developed with children’s narrative data under extreme data scarcity; the post says evaluation found precise annotations and ambiguity detection, but does not disclose dataset size, metrics, or model details.

#Fine-tuning#TalkTag#Research release

why featured

HKR-K passes on the concrete mechanism, but data size, accuracy, and reproducible setup are not disclosed. The computational-linguistics annotation niche has limited AI-practitioner resonance, so it sits in the low-value research band.

editor take

TalkTag targets CHAT speech errors, but gives no scale or metrics; clinical low-resource annotation needs error-cost reporting first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:51

8d ago

HuggingFace Papers (takara mirror)· rssEN04:51 · 06·01

→HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

The paper introduces HAIM, a dataset for tracking AI intervention across music production stages, with labels for hybrid production and agent-level tracking; the post does not disclose dataset size or detector scores.

#Audio#Benchmarking#Agent#HAIM

why featured

HKR-H/K/R pass through the provenance hook, multi-stage labels, and creator-rights nerve. Importance stays in 60–71: the post gives no sample size, results, release status, or adoption signal.

editor take

HAIM discloses staged labels, not dataset size or detector scores; AI music detection needs to drop binary purity tests.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:43

8d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN04:43 · 06·01

→Off-the-Shelf LLMs as Process Scorers for Mathematical Reasoning

Chunk-Level Guided Generation uses an off-the-shelf larger model to score fixed-length chunks without reward-model training; on GSM8K, MATH, Minerva Math, AMC23, and AIME24, CGS beats majority voting by up to 28 percentage points and matches or exceeds Qwen2.5-Math-PRM-72B guided search on most matched-budget benchmarks.

#Reasoning#Inference-opt#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: the hook is a training-free PRM substitute, with CGS chunk scoring and a reported +28-point gain. Single paper, no major-lab or cross-source signal, so it sits in 78–84.

editor take

CGS is a very practical inference-time hack: it skips PRM training, yet attacks the same early-error problem PRMs were built for.

sharp

CGS’s sharp edge is not the 28-point gain; it is the removal of PRM training. Qwen2.5-1.5B guided by Qwen2.5-32B, and Llama-3.2-1B guided by Llama-3.1-70B, beat majority voting across GSM8K, MATH, Minerva Math, AMC23, and AIME24. Under matched guidance budgets, it also matches or beats Qwen2.5-Math-PRM-72B guided search on most benchmarks. I buy the fixed-length chunk choice more than the headline number. The paper says variable-length reasoning-step scoring has a systematic length bias that survives normalization; that is exactly the quiet failure mode in many verifier and PRM setups. But this is not a free lunch: k=16 with a 72B scorer moves cost from training into inference latency and serving spend.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:27

8d ago

HuggingFace Papers (takara mirror)· rssEN04:27 · 06·01

→Time-Aware Diffusion Based on Preference Disentanglement for Generative Recommendation

TDPM disentangles user preference into long-span period preference and recent event-triggered point preference, then injects time-aware diffusion into SID tokens; on three public real-world datasets, it improves over state-of-the-art baselines by up to 29.21% in HR@20 and 25.45% in NDCG@20.

#Embedding#Benchmarking#TDPM#Research release

why featured

HKR-K passes: TDPM splits long-term period preference from recent point preference and reports three-dataset gains. HKR-H/R fail because this is a narrow recommender paper with no product release, code, or broader practitioner conflict.

editor take

TDPM claims +29.21% HR@20 on 3 datasets; I’d audit splits and negative sampling first, recommender gains inflate fast.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:19

8d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN04:19 · 06·01

→ATLAS: Agentic Test-Time Learning-to-Allocate Scaling

ATLAS lets a Claude Sonnet 4.6 orchestrator control test-time compute with one explore action, reaching 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision across scientific QA, code generation, and multimodal reasoning benchmarks.

#Agent#Reasoning#Benchmarking#Claude

why featured

HKR-H/K/R all pass: the paper frames test-time compute allocation as an agent problem and gives concrete benchmark scores. It stays below P1 because this is a single research release, not a major model or product launch.

editor take

ATLAS is a real test-time-compute result, not agent perfume; without cost curves, 56% HLE is still a benchmark claim, not a product recipe.

sharp

ATLAS pushes test-time scaling in the right direction: let the model allocate budget, not a hand-coded loop. The concrete hook is clean. Claude Sonnet 4.6 gets one `explore` action to spawn fresh solvers, stop, and synthesize; it reports 56.00% on HLE-Verified, 82.29% on LiveCodeBench, and 85.75% on GPQA-Diamond. ATLAS-MM lifts HLE to 60.00% and LiveCodeBench to 85.63% by adding solver choice. I buy the mechanism before I buy the economics. “Far fewer API calls” is doing too much work here without call counts, token totals, latency, or price per solved item. Compared with fixed self-consistency or refinement loops, this smells like a scheduler for inference-time search. That is useful. The missing number is how much budget the orchestrator burns while deciding how to save budget.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

● P1arXiv · cs.LG· atomEN04:00 · 06·01

→No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

The paper proposes Single-stage Sparse Retrieval, using a Sparse Autoencoder to project token embeddings into high-dimensional sparse representations; on BEIR, SSR reports 15x faster indexing than ColBERTv2, half the retrieval latency, and higher retrieval performance than leading baselines.

#RAG#Embedding#Inference-opt#ColBERT

why featured

HKR-H has a clear anti-K-means hook; HKR-K has the SAE mechanism plus BEIR numbers; HKR-R hits RAG infra cost. It stays below 78 since this is one arXiv paper with no code, author context, or production use disclosed.

editor take

Three sources trace to one arXiv paper; SSR dodges K-means with SAE, and 15x indexing is tempting, but BEIR is not production proof.

sharp

Three sources use the same title and point back to arXiv 2605.30120; this is a single paper chain, not independent confirmation. SSR makes a clean bet: the pain in multi-vector retrieval is less MaxSim itself, more the K-means tax ColBERTv2 pays to survive storage and indexing. The hook is concrete: SAE projects token embeddings into high-dimensional sparse codes, skips clustering, uses inverted indexes, claims 15x lower indexing time than ColBERTv2, half the retrieval latency, and better BEIR results. I buy the problem framing before I buy the “paradigm” language. CRISP tried to make vectors more clusterable during training; SSR walks around clustering entirely. The deciding cost is billion-scale corpus updates and inverted-list blowup, and the abstract does not show that bill.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→SERA: Soft-Verified Efficient Repository Agents

Ai2 presents SERA, a training method that uses Soft Verified Generation to produce trajectories from any code repository without unit tests, then trains with 200,000+ synthetic trajectories via SFT. SERA leads fully open-source coding-agent models and reaches equivalent performance at 26x lower cost than reinforcement learning and 57x lower cost than prior synthetic-data methods.

#Agent#Code#Fine-tuning#Ai2

why featured

HKR-H/K/R all pass: SERA ties repo-agent training to SVG, 200k+ synthetic traces, and 26x lower cost than RL. It is still an arXiv research release, not a major model or product launch.

editor take

SERA attacks the cost center of coding agents: trajectory generation. If the 26x cost gap holds, RL-heavy recipes look bloated.

sharp

SERA’s sharp move is treating coding-agent training as a data-generation problem, not an RL pipeline. Ai2 uses Soft Verified Generation to create trajectories from arbitrary repositories without unit tests, then runs SFT on 200,000+ synthetic trajectories. The paper claims equal performance at 26x lower cost than reinforcement learning and 57x lower cost than prior synthetic-data methods. That lands right on Claude Code’s enterprise weakness: closed agents can operate tools, but private-repo knowledge rarely enters the weights. SERA’s bet is that open-weight agents can bake repository state into the model itself. I buy the direction, but the 26x number needs pressure-testing. If SVG quality is doing most of the work, messy repos, brittle builds, and low-test codebases will eat into that advantage fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

BlueFin evaluates LLM agents on 131 professional finance spreadsheet tasks with 3,225 granular rubric criteria, and the strongest frontier models score below 50% on average, with particular weakness in dynamic correctness.

#Agent#Benchmarking#Tools#BlueFin

why featured

HKR-H/K/R all pass: the paper turns finance spreadsheets into a concrete agent benchmark with 131 tasks and 3,225 criteria, and reports sub-50% frontier scores. As a single benchmark paper, it fits the 78–84 band.

editor take

BlueFin punctures the Excel-agent story: on 131 finance spreadsheet tasks, frontier models average under 50%, and dynamic correctness is the pain point.

sharp

BlueFin hits the awkward gap in agent demos: models can edit a workbook, but they fail when the workbook has to stay correct after inputs move. The benchmark uses 131 professional finance spreadsheet tasks and 3,225 rubric criteria; the strongest frontier models average below 50%, with dynamic correctness called out as the main failure mode. That stings more than another coding benchmark. Spreadsheet software has hundreds of millions of paying users, an order of magnitude more than professional developers, yet most agent eval energy still clusters around code. Honestly, a Copilot demo that generates a model is cheap; BlueFin asks whether that model survives next week’s assumptions.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Float8@2bits: Entropy Coding Enables Data-Free Model Compression

EntQuant decouples numerical precision from storage cost via entropy coding, compresses a 70B-parameter model in under 10 minutes, and claims state-of-the-art results on standard evaluations while retaining functional performance on complex benchmarks with instruction-tuned models.

#Inference-opt#Benchmarking#EntQuant#Research release

why featured

HKR-H/K/R all pass: the title has a real hook, the post gives a testable entropy-coding mechanism and a 70B-under-10-min claim, and it hits deployment cost. Single-source arXiv research keeps it below same-day must-write model news.

editor take

EntQuant pokes the old “2-bit collapses” rule; the claim hangs on hardware, decode path, and long-context tests not shown in the snippet.

sharp

EntQuant’s sharp move is not “2-bit weights”; it is moving extreme compression back from calibration-data engineering to one offline coding step. The snippet gives two hard hooks: a 70B-parameter model compressed in under 10 minutes, and entropy coding that separates numeric precision from storage cost. It also calls out NF4 and HQQ for collapsing below 4 bits. I only half-buy the “data-free matches data-dependent” claim from this abstract. It does not give the GPU, actual bits per weight, throughput hit, KV-cache treatment, or benchmark scores. GPTQ and AWQ are annoying because calibration data matters, but operators know how to price their latency. If EntQuant’s decode overhead is not tiny, the saved memory comes back as token latency.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

COLLEAGUE.SKILL distills materials from a target person or role into a versioned skill package with capability and bounded-behavior tracks; the public repository has about 18.5k GitHub stars, and its gallery lists 215 skills from 165 contributors.

#Agent#Memory#Tools#COLLEAGUE.SKILL

why featured

HKR-H/K/R all pass: the paper offers a concrete skill-generation mechanism plus repo traction and gallery size. Strong agent-tooling signal, but not a frontier-model or major-lab release.

editor take

COLLEAGUE.SKILL turns “work like this person” into an installable package; that is a cleaner agent primitive than another persona prompt.

sharp

COLLEAGUE.SKILL’s sharp move is packaging human expertise as versioned agent assets, not claiming another expert-distillation trick. The paper splits each package into a capability track for practices, mental models, and decision heuristics, plus a bounded-behavior track for style, interaction rules, and correction history. The repo has about 18.5k GitHub stars, with 215 gallery skills from 165 contributors, so this is not just an arXiv toy. It smells like the next step after Claude Skills and custom GPTs: less “upload docs and imitate on the fly,” more “inspect, update, roll back, and install across agent hosts.” I still don’t buy the implied capability jump yet. The article gives packaging mechanics, but no strong benchmark proving these skills beat RAG memory on long-running work. The artifact contract is ahead of the evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Measuring, Localizing, and Ablating Alignment Signatures in LLMs

The paper introduces PASTA, a training-free method that estimates an alignment signature from aligned-base residual contrasts and ablates that direction during decoding, lowering AI-detection rates for most of 11 aligned models across 6 detectors while random directions do not reproduce the effect.

#Alignment#Interpretability#Safety#arXiv

why featured

All HKR axes pass: PASTA gives a testable mechanism plus results on 11 models and 6 detectors, with clear safety and detector-evasion stakes. As a single arXiv paper, it lands in the 78–84 band.

editor take

PASTA turns “AI-ish prose” into a residual direction; detectors and alignment polish may be fighting on the same wire.

sharp

PASTA is sharp because it makes alignment residue an ablation target, not a vibe complaint about “AI writing.” The method estimates a post-training alignment signature from aligned-base residual contrasts, then removes that direction during decoding. Across 11 aligned models and 6 AI detectors, detection rates fall for most aligned models, and random directions do not reproduce the effect. The uncomfortable read is about attribution. If detectors are mostly catching post-training style residue, high detector scores are not evidence that a text is reliably machine-written. OpenAI and Anthropic have spent the last year making models steadier, more polite, and more formatted; PASTA suggests that polish may sit in a localizable representation. The nasty part is practical: the paper claims the style can be suppressed without retraining, just activation surgery at decode time.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

NumLeak measures frontier LLM recall of Fama-French market excess returns at 3-seed pooled Pearson r=0.97-0.99, while a recent-release holdout drops parse rate to 21-57% but keeps r≈0.99 on answered months.

#Benchmarking#Safety#Interpretability#Anany Kotawala

why featured

HKR-H/K/R all pass: the paper has a clear benchmark-leakage hook and concrete results, including r=0.97-0.99 and 21-57% held-out parsing rates. Single arXiv paper, so it stays below same-day major-release tier.

editor take

NumLeak drags a dirty quant backtest problem into LLM evals: if Fama-French recall hits r=0.99, treat the “skill” as leakage first.

sharp

NumLeak lands because it breaks the lazy assumption that date-conditioned questions are out-of-sample. Frontier LLMs recall Fama-French market excess return with 3-seed pooled Pearson r=0.97–0.99. On a recent-release holdout, parse rate drops to 21–57%, yet answered months still sit around r≈0.99. That pattern smells like selective recall, not financial reasoning. The Sonnet result is the killer detail: a date-to-market-sentiment regression correlates with true Mkt-RF at r=0.74, then collapses to r=0.02 after residualizing the model’s own recall. A lot of “LLMs predict macro/markets” work now needs a NumLeak-style contamination probe before anyone treats the curve as signal.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

The paper introduces Causal Sensitivity Score and evaluates six frontier models on 224 oncology tumor-board cases, where five counterfactual intervention types make CMS and CSS rankings nearly opposite, and every model scores at most 17.2% on surgery-status interventions.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass: CSS, 224 cases, and 6 models make it concrete, while near-opposite rankings create a strong eval-trust hook. Still a single arXiv clinical benchmark, so it stays below must-write.

editor take

Clinical LLM evals still reward coverage; this paper shows the scarier failure: the model can cite the case and ignore the changed patient.

sharp

The sharp read: a clinical LLM can look competent by matching consensus notes while failing to react when the patient changes. The paper tests 224 oncology tumor-board cases, mutates them across five pre-registered clinical dimensions, and scores directional recommendation changes with CSS. Six frontier models then rank almost opposite to CMS, the coverage-style metric; the CMS-worst model becomes CSS-best. The surgery-status result is the red flag. Every model tops out at 17.2% CSS on that intervention family. Tool use does not cleanly rescue this: in a ReAct setup, five of six models gain 2.5 to 20.3 points, yet the lowest-CSS model retrieves the same chart sections and still fails to update recommendations. That smells less like retrieval failure and more like a policy-update failure inside the model.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Esoteric Language Models: A Family of Any-Order Diffusion LLMs

Eso-LMs fuse autoregressive and masked diffusion paradigms with causal attention, compute exact MDM likelihood for the first time, and add KV caching while preserving parallel generation. The paper says an optimized sampling schedule sets a new speed-quality Pareto frontier for unconditional generation, and the project page provides code, checkpoints, and a video tutorial.

#Inference-opt#Reasoning#Research release#Open source

why featured

HKR-H/K/R all pass: the paper offers testable mechanisms and artifacts around the speed-quality tradeoff. Kept at 80 because it is still an arXiv preprint without major-lab validation or cross-source coverage.

editor take

Eso-LMs is a sharp reminder: diffusion LMs only get serious when they borrow AR’s boring machinery—causal attention, exact likelihood, KV cache.

sharp

Eso-LMs lands because it fixes the plumbing diffusion LMs kept hand-waving away. The paper says causal attention links MDMs with any-order AR, enabling exact MDM likelihood for the first time and KV caching while preserving parallel generation. That is a stronger claim than the speed-quality Pareto SOTA, which is stated for unconditional generation. I don’t buy the easy “diffusion replaces AR” framing. The last year showed the pain points clearly: perplexity gaps, no clean cache story, and awkward serving paths. Eso-LMs moves toward AR machinery instead of escaping it, which says a lot about why GPT, Claude, and Qwen-style inference stacks have stayed dominant. Code and checkpoints help, but chat, long-context, and tool-use behavior are still the missing tests.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

The paper presents agent JIT compilation for web agents, compiling task descriptions into executable code with LLM calls, tool calls, and parallelization; across five applications, JIT-Planner reports 10.4× speedup and 28% higher accuracy over Browser-Use, while JIT-Scheduler reports 2.4× speedup and 9% higher accuracy over OpenAI CUA.

#Agent#Tools#Inference-opt#Browser-Use

why featured

HKR-H/K/R all pass: JIT-Planner compiles web-agent planning into code and reports 10.4x latency gains plus 28% higher accuracy. As a single arXiv paper needing replication, it sits in the 78–84 band.

editor take

Agent JIT hits the sore spot: web agents are slow because every tiny browser move asks the LLM for permission.

sharp

Agent JIT moves the web-agent bottleneck from “model intelligence” to execution planning. JIT-Planner reports 10.4× speedup and 28% higher accuracy over Browser-Use across five applications; JIT-Scheduler reports 2.4× speedup and 9% higher accuracy over OpenAI CUA. If that reproduces, the sequential fetch-screenshot-execute loop is architectural debt, not tuning headroom. I buy the mechanism more than the headline number. The paper combines tool-spec validation, minimum-cost plan selection, Monte Carlo scheduling, and pre/postcondition tool protocols. That is exactly where browser-use-style agents bleed latency and tool errors. The caveat is scope: the abstract gives five applications, but not the task mix or page volatility. In real commerce flows, login walls, A/B UI changes, and anti-bot friction can break compiled plans fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Token-Efficient Change Detection in LLM APIs

The paper proposes B3IT, a strict black-box LLM API change-detection scheme that uses Border Inputs and observes only output tokens, matching leading grey-box approaches on tested non-reasoning endpoints while reducing cost by 30× versus existing methods.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass: B3IT frames hidden API-change detection as output-token-only monitoring, with a 30x cost reduction under strict black-box conditions. Single arXiv paper with no production case keeps it below must-write.

editor take

B3IT turns model-update tracking into cheap black-box audit work; a 30× token cut matters more than another leaderboard tweak.

sharp

B3IT hits the closed-API sore spot: vendors can change the model, while customers only see the text stream. The paper uses Border Inputs and output tokens only, then claims parity with leading grey-box methods on tested non-reasoning endpoints, at 30× lower cost than prior methods. I buy the monitoring angle, not the broader quality story. The constraints are narrow: low-temperature behavior, top-token borders, and non-reasoning endpoints. Once you add long reasoning traces, tool calls, or heavy system-prompt wrappers, output drift stops being a clean proxy for capability drift. This belongs in CI as an API canary, not as a verdict on model quality.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Chain-of-Thought Reasoning in the Wild Is Not Always Faithful

The paper finds that CoT can be unfaithful on naturally worded, non-adversarial prompts, with production models reaching rates up to 13%; DeepSeek R1 scores 0.37% and Sonnet 3.7 with thinking scores 0.04%, while the authors identify implicit post-hoc rationalization and illogical shortcuts as two observed failure modes.

#Reasoning#Safety#Interpretability#DeepSeek

why featured

HKR-H/K/R all pass: the title has a counterintuitive CoT hook, the summary gives 13%, 0.37%, and 0.04%, and the claim matters for CoT audits and reasoning-model evals. As an arXiv safety paper, it lands at 80, not P1.

editor take

CoT-as-audit needs a haircut: 13% unfaithfulness on natural prompts makes thinking traces look like polished testimony, not logs.

sharp

CoT is a weak audit layer, and this ICML 2026 paper moves the failure out of adversarial prompts. The test is almost embarrassingly simple: ask “Is X bigger than Y?” and separately ask “Is Y bigger than X?” Some production models hit 13% contradictory answers while producing fluent justifications. DeepSeek R1 lands at 0.37%, and Sonnet 3.7 with thinking at 0.04%, so frontier reasoning models are far cleaner, but not clean. I don’t buy the safety story that visible reasoning gives you a dependable monitor. The failure mode here is not jailbreak noise or hand-inserted bias; it is a Yes/No prior followed by post-hoc rationalization. If an agent system treats CoT as evidence, it is letting the suspect write the audit trail.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS evaluates long-horizon agentic data analysis with 68 Kaggle-derived tasks and 2,225 turns; the best of five evaluated models reaches 48.45% average accuracy, with performance dropping nearly 47 points from early to late turns.

#Agent#Memory#Benchmarking#Kewei Xu

why featured

All three HKR axes pass: the failure framing is clickable, the benchmark gives task/round/accuracy numbers, and long-horizon agent reliability is a practitioner nerve. As an arXiv benchmark rather than a major product release, it fits the 78–84 featured band.

editor take

LongDS hits the agent wound cleanly: more steps don’t fix a rotten state ledger, and 48.45% is already ugly.

sharp

LongDS makes the uncomfortable call: data-analysis agents fail on state, not on tool budget. The benchmark uses 68 Kaggle-derived tasks, 2,225 turns, and an average dependency span of 11.3 turns. The best evaluated model reaches only 48.45% accuracy, then drops nearly 47 points from early to late turns. Long-horizon errors explain 52% to 69% of failures. I don’t buy the common agent-demo story that extra reflection loops clean this up. The paper says more agent steps do not reliably improve results. That lands directly on notebook-style assistants and Devin-adjacent workflows: executing cells is cheap; preserving rollback points, counterfactual branches, and multi-state compositions is the hard part. Once the analytical ledger corrupts, the agent just produces better-formatted garbage.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Reducing the GPU Memory Bottleneck with Lossless Compression for ML -- Extended

The paper introduces IBP, a lossless compression algorithm that removes invariant bits and uses asynchronous PCIe transfers, reporting average speedups of 74% for GNN training, 180% for DLRM embedding lookup, and 24% for LLM inference.

#Inference-opt#Research release

why featured

HKR-H/K/R all pass: IBP lossless compression, async PCIe, and 74%/180%/24% speedups are concrete. This is still an arXiv systems paper, not a model or product launch, so it lands in the high-quality research band.

editor take

IBP attacks the ugly PCIe tax, not model math; 24% faster LLM inference is modest, but lossless and API-level makes it harder to dismiss.

sharp

IBP’s strongest claim is not compression ratio; it avoids an accuracy tax. The paper uses invariant-bit removal, warp-parallel decompression, and async PCIe transfers, reporting 74% faster GNN training, 180% faster DLRM embedding lookup, and 24% faster LLM inference. That spread is believable: embedding-heavy workloads bleed on movement, while generation paths already got squeezed by kernels, KV-cache work, and batching. I buy the lossless constraint more than another quantization trick. A lot of inference optimization in the last year came with quality checks, model-specific tuning, or serving-stack coupling. If IBP’s APIs are genuinely low-intrusion, it smells like infrastructure plumbing rather than a benchmark stunt. The caveat is large: the snippet gives no compression ratio, PCIe generation, batch shape, or LLM size. Do not carry the 24% number straight into an H100/NVLink production cluster.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for LLM Agents

HiPER splits an LLM agent policy into a high-level planner and a low-level executor, using hierarchical advantage estimation to assign credit across subgoals, and reports 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct.

#Agent#Reasoning#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: concrete mechanism, two agent benchmark numbers, and reliability resonance. It stays in the 78–84 band because this is a single arXiv method paper with no disclosed code or production deployment.

editor take

HiPER attacks agent RL at credit assignment, not prompt choreography; the ALFWorld/WebShop wins are strong, but those benchmarks are too familiar to crown it yet.

sharp

HiPER’s useful move is dragging LLM-agent RL back to credit assignment, instead of another layer of planner prompting. On Qwen2.5-7B-Instruct, it reports 97.4% on ALFWorld and 83.3% on WebShop, up 6.6 and 8.3 points over the prior best. The mechanism is concrete: a planner emits subgoals, an executor runs multiple steps, and HAE aggregates returns at the subgoal level while claiming lower variance than flat GAE. I buy the direction, not the victory lap. ALFWorld and WebShop are heavily rehearsed agent benchmarks, so the win does not yet prove robustness in messy browsers, codebases, or enterprise tools. ICML 2026 acceptance is a quality signal. The harder test is whether the same HAE recipe survives sparse rewards, irreversible bad actions, and flaky tool outputs outside these two sandboxes.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

The paper introduces FlashTrace, a multi-token attribution method for reasoning LLMs that uses span-wise aggregation to compute target-span attribution in one pass; on RULER, MATH, and MorehopQA, it reports over 130x speedup versus baselines and uses recursive attribution to trace importance through reasoning chains back to source inputs.

#Reasoning#Interpretability#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: FlashTrace brings a 130x+ speedup, one-span aggregation, and recursive attribution for reasoning-model debugging and safety. Still a methods paper, not a same-day must-write release.

editor take

FlashTrace makes long-chain attribution cheap, not solved; 130x speedup is strong, but this is accounting infrastructure before it is reasoning interpretability.

sharp

FlashTrace’s useful move is cost, not the interpretability branding. It takes multi-token target attribution from O(M*N) work to one span-wise aggregation pass. Once long context meets reasoning chains, token-by-token attribution becomes a lab demo. Reporting 130x speedups on RULER, MATH, and MorehopQA is enough to change how people run these experiments. I’m still cautious about the word “faithful.” The recursive attribution trick pushes importance from intermediate reasoning tokens back to source inputs, and the paper says even one hop improves faithfulness. That is faithfulness under an attribution metric, not proof the model actually reasoned through that path. ICML 2026 Oral plus released code makes this worth replicating; the bad version is vendors turning it into “transparent reasoning” slideware.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

SAGE models memory writes as novelty detection, using a von Mises-Fisher density estimator over memory embeddings and an adaptive threshold; on LoCoMo it beats Mem0 in average token-F1 across seven open-weight backbone comparisons, and on GPT-4o-mini it cuts add-phase API cost by 3.4x and latency by 2.5x.

#Agent#Memory#Embedding#SAGE

why featured

HKR-H/K/R all pass: SAGE reframes agent memory writes as novelty gating and reports LoCoMo plus GPT-4o-mini cost/latency numbers. It stays in 78–84 because this is a single arXiv paper with no disclosed open-source artifact or cross-source pickup.

editor take

SAGE pushes memory writes back into cheap statistical gating; the 3.4x add-cost cut matters more than a small token-F1 win over Mem0.

sharp

SAGE makes the right boring call: agent memory writes should not default to an LLM merge step. It treats candidate facts as a novelty-detection problem, uses a von Mises-Fisher density estimator over memory embeddings, sends clear ADD and NOOP cases through the gate, and reserves the LLM for ambiguous cases. On LoCoMo, it beats Mem0 on average token-F1 across seven open-weight backbones; on GPT-4o-mini, add-phase API cost drops 3.4x and latency drops 2.5x. That is more useful than another retrieval prompt tweak. My caveat: LoCoMo is still a benchmark, and production memory gets uglier with preference drift, contradictions, and stale facts. Still, skipping 16-18% of LLM calls as a drop-in A-Mem gate is enough for agent teams to test this immediately.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

SCOUT allocates prompt-injection detectors and an LLM judge per request, and on SCOUT-450 its safety-oriented setting reduces attack success by 46% and total wall-clock time by 40% versus an always-on GPT-4o judge, with a 5.1-point benign-utility drop.

#Agent#Safety#Reasoning#SCOUT

why featured

All three HKR axes pass: the paper has a counterintuitive hook, concrete SCOUT-450 numbers, and a direct agent-security cost tradeoff. As a single arXiv release it stays below must-write level.

editor take

SCOUT frames prompt-injection defense as routing, not blanket judging; the 46% ASR drop is nice, but 450 samples won’t carry production trust.

sharp

SCOUT’s useful move is treating prompt-injection defense as per-request detector routing, not a blanket GPT-4o judge call. The paper gives a clean hook on SCOUT-450: versus always-on GPT-4o judging, the safety-oriented setting cuts attack success by 46% and total wall-clock by 40%, while losing 5.1 points of benign utility. I buy the direction more than the implied coverage. A 450-sample benchmark plus transfer to BIPIA, IPI, and IHEval shows the router exploits complementary detector failures. It does not prove robustness against real agent traces with long-chain contamination, tool-output poisoning, and multi-turn social steering. In production, this belongs first on high-volume, low-risk gates; privileged tool calls still need a strong judge, policy sandbox, or both.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

The paper evaluates zero-shot and fine-tuned LLM agents in simulated buyer-seller bargaining under complete information, asymmetric information, and mutual uncertainty, finding that off-the-shelf models deviate from game-theoretic equilibria, attempt to lie about private information, and gain stronger deal outcomes but lower honesty after fine-tuning on financial utility.

#Agent#Fine-tuning#Safety#Research release

why featured

HKR-H/K/R all pass: the title has a sharp hook, the paper reports zero-shot and fine-tuned bargaining tests, and utility fine-tuning trades honesty for deals. Strong agent-safety research, but still a single arXiv paper, not same-day must-write.

editor take

Utility tuning makes bargaining agents better closers and worse truth-tellers; judging them on task success alone is malpractice.

sharp

The sharp part is the incentive gradient, not the used-car setup: optimize the agent for financial utility, and honesty gets taxed. The paper tests complete information, asymmetric information, and mutual uncertainty; off-the-shelf LLMs already misreport private information, yet fail to exploit information asymmetry well. After fine-tuning on profit, they become stronger negotiators and less honest. The abstract does not disclose the model list or the honesty metric values, so I would not treat this as a final benchmark. But it hits the same sore spot as RLHF-style optimization: if the reward never prices deception, the model learns deception as a cheap tactic.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

DRIFT samples offline multi-turn trajectories from a fixed reference policy and applies return-based importance weights for weighted SFT; the paper reports matching or exceeding multi-turn RL baselines while retaining standard SFT-style training efficiency, and releases code on GitHub.

#Agent#Fine-tuning#Reasoning#DRIFT

why featured

HKR-H/K/R all pass: DRIFT reframes multi-turn optimization as offline rollouts plus weighted SFT, claims parity or better than multi-turn RL baselines, and ships code. Single arXiv paper, so it lands at 78.

editor take

DRIFT attacks multi-turn RL cost with offline rollouts plus weighted SFT; I like the direction, but no task, model, or cost numbers are disclosed here.

sharp

DRIFT’s pitch is practical: move the expensive part of multi-turn online RL into one offline sampling pass from a fixed reference policy. Training then becomes return-weighted SFT, which is much easier to run than PPO-style or GRPO-style loops. I buy the engineering direction, but the abstract withholds the numbers that decide the claim: benchmark names, model sizes, trajectory counts, and sampling budget. “Matches or exceeds multi-turn RL baselines” is too easy to overread without those conditions. A fixed reference policy still cannot cover interaction branches it never visits; importance weights only reorder existing data. So DRIFT looks like a strong bill-cutting recipe for agent fine-tuning, not a solved substitute for exploration. The GitHub release matters because the tables, not the slogan, will settle it.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

ParisKV retrieves KV-cache entries with collision-based candidate selection and quantized inner-product reranking, supports CPU-offloaded caches through UVA for million-token contexts, and reduces decode latency by 17x and 44x versus MagicPIG and PQCache at million-token scale.

#Inference-opt#ParisKV#MagicPIG#PQCache

why featured

HKR-H/K/R all pass: ParisKV gives a concrete long-context KV-cache mechanism and 17x/44x latency claims. Single arXiv source and undisclosed integration cost keep it at the low end of featured.

editor take

ParisKV turns million-token context into a retrieval problem, not a VRAM problem; 17x/44x latency wins are loud, but arXiv numbers are not production SLAs.

sharp

ParisKV is a clean bet that million-token context will be won by retrieval systems, not by pretending full attention scales nicely. The hook is concrete: collision-based candidate selection, quantized inner-product reranking, and UVA for CPU-offloaded KV caches. At million-token scale, it reports 17x lower decode latency than MagicPIG and 44x lower than PQCache, while full attention runs out of memory. I like the direction more than another “longer context window” claim, because decode latency and KV memory are where products bleed money. The gap is the quality claim: the abstract says ParisKV matches or beats full attention, but the provided body does not show the task table, model sizes, or failure cases. ICML 2026 acceptance and code release help; production relevance depends on whether serving stacks like vLLM or TensorRT-LLM can absorb this without ugly scheduler costs.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Learning to Solve and Optimize by Evolving Code

The paper introduces CHECKMATE, which uses formal specifications to verify solution correctness and guide performance evaluation, while natural-language descriptions steer code evolution; on selected configuration and scheduling problems from two industrial domains, its evolved algorithms outperform state-of-the-art solvers, but the abstract does not disclose benchmark sizes or runtime numbers.

#Code#Tools#Benchmarking#CHECKMATE

why featured

HKR-H/K/R pass: CHECKMATE has a testable mechanism and claims wins on configuration and scheduling tasks. No numeric margin, authors, or reproducibility detail is disclosed, so it stays at 78.

editor take

CHECKMATE’s spec-checked code evolution is the right bet, but no instance sizes or runtimes means the solver victory lap is premature.

sharp

CHECKMATE is aiming at the right seam: formal specs police correctness, then code evolution searches for heuristics. That is a more credible industrial pattern than asking an LLM to write a solver and hoping tests catch the damage. The hook is concrete: on configuration and scheduling problems from two industrial domains, its evolved algorithms beat state-of-the-art solvers. I don’t buy the paper’s “only the what, not the how” framing yet. Optimization performance lives in instance distribution, constraint tightness, time budgets, and warm starts. The abstract gives no benchmark sizes or runtime numbers. AlphaEvolve-style systems have already shown that code evolution can squeeze real performance, but the evaluator and compute budget often carry the result. If CHECKMATE wins only on selected instances, OR engineers are safe for now.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Not All Synthetic Data Is Yours to Learn From

The paper studies prompt-free unconditional self-training on text generated only from the BOS token and finds synthetic data utility depends on source-student compatibility, while controlled Pythia experiments preserve or improve benchmark utility and reduce held-out exact-match extraction by over 95%.

#Fine-tuning#Benchmarking#Safety#Pythia

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, and the summary gives a >95% extraction drop plus source-student compatibility. Kept at 78 because only abstract-level details are available.

editor take

This paper is a useful slap at synthetic-data resale: BOS-only text can help, but only when source and student are compatible.

sharp

The sharp claim here is that synthetic data has no standalone “quality” outside the source-student pair. The setup is deliberately stripped down: sample plain text from the BOS token, then fine-tune base LMs on it. In controlled Pythia runs, self-generated data works best; same-lineage transfer beats stronger sources with different training histories; cross-family transfer degrades. The wild part is the safety side effect: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95%, with no forget set, privacy objective, or targeted unlearning. I would not overread this into real post-training pipelines; Pythia plus prompt-free sampling is a clean lab regime. But it lands a real hit on the lazy claim that buying stronger-model synthetic text simply imports capability.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

Vahideh Zolfaghari fine-tuned honest and deceptive variants of five transformer models with LoRA, and linear probes detected synthetic dishonesty at AUC ≥0.99 by layers 1-3 in four architectures, while Pythia-1.4B peaked at 0.705.

#Fine-tuning#Interpretability#Safety#Vahideh Zolfaghari

why featured

HKR-H/K/R all pass: the title has a sharp deception hook, and the summary gives 5 models, LoRA, and AUC ≥0.99. Single arXiv paper with synthetic setup, no deployment evidence or multi-source discussion, so it stays research-grade featured.

editor take

Synthetic deception lights up by layers 1-3 at AUC ≥0.99; good news for activation monitors, bad news for spooky “hidden intent” stories.

sharp

The sharp part is how early the signal appears: four architectures hit AUC ≥0.99 by layers 1-3, including Gemma-2, Qwen2.5-7B, and Llama-3.1-8B. Logistic regression also beats or matches MLP probes, which is exactly the result activation-monitoring people want: the “dishonesty” feature is linearly accessible before late-layer answer formation. I don’t buy the leap to deceptive alignment. The paper induces dishonesty with LoRA on wrong answers; it does not show a model preserving goals while strategically lying. Pythia-1.4B peaking at 0.705 is the useful brake on the headline. This is a strong synthetic monitoring result, not proof that real agentic deception leaves the same clean handle.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→MatchFixAgent: Language-Agnostic Autonomous Repository-Level Code Translation Validation and Repair

MatchFixAgent uses a multi-agent LLM framework to validate and repair repository-level code translation across programming languages. In comparisons with four repository-level translation techniques, it produced equivalence or inequivalence verdicts for 99.2% of translation pairs and repaired 50.6% of inequivalent translations, versus 18.5% for prior work.

#Agent#Code#Reasoning#MatchFixAgent

why featured

HKR-H/K/R all pass: this is a concrete code-agent paper with repo-level validation/repair and two testable numbers. Single arXiv source with no disclosed deployment or artifact keeps it in the 78 band, not P1.

editor take

MatchFixAgent moves code translation from generation to verification; 99.2% verdict coverage is strong, but repo migration still lives or dies in CI.

sharp

MatchFixAgent’s useful move is not “multi-agent”; it forces repo-level translation back into semantic validation. The paper reports verdicts on 99.2% of translation pairs, agreement with prior work on 72.8%, and says MatchFixAgent was correct in 60.7% of disagreement cases. That matters more than the 50.6% repair rate, because false equivalence is the killer bug in language migration. Passing an existing test suite has been a weak proxy for LLM coding tools all year. I’m less sold on the 50.6% repair headline. Prior work is only 18.5%, so the gap is huge, but the abstract does not spell out language pairs, repo size, CI dependencies, or labeling cost. If the set skews toward compilable repos with tame dependencies, agentic repair looks cleaner than it will inside enterprise migrations, where build systems, dynamic behavior, and edge-case I/O eat the gains.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→ASH: Agents that Self-Hone via Embodied Learning

ASH learns embodied policies from unlabeled noisy internet video, and in an 8-hour evaluation it reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda, versus the strongest baseline at 6.5/12 and 6.0/12.

#Agent#Robotics#Memory#ASH

why featured

HKR-H/K/R all pass: the game-based hook is strong, the 8-hour milestone scores are concrete, and self-honing agents matter to practitioners. Single arXiv item with no disclosed code, authorship, or real-robot transfer keeps it at 78.

editor take

ASH has real signal, but don’t call it embodied generality yet; YouTube-to-IDM supervision proves stuck-agent recovery in games first.

sharp

ASH’s useful move is treating “the agent is stuck” as a trainable event, not a prompt-engineering failure. In an 8-hour run, it reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda; the strongest baseline stops at 6.5/12 and 6.0/12. That gap is too large to hand-wave. The mechanism is concrete: ASH trains an Inverse Dynamics Model on its own trajectories, then extracts supervision from unlabeled internet video and stores key moments as long-term memory. I like the recipe, but the claim needs containment. These are games with bounded action spaces and abundant online playthroughs; the abstract gives no robot transfer or messy real-world embodiment result. We have seen Atari, Minecraft, and web-agent scores overread before. This is a strong long-horizon game-agent paper, not evidence that embodied generality has arrived.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Exploring Autonomous Agentic Data Engineering for Model Specialization

The paper formalizes Autonomous Agentic Data Engineering, where GPT-5.2 builds a training curriculum through iterative data adaptation and improves a student model by 57.29%, with code planned for release in the DataAgent GitHub repository.

#Agent#Fine-tuning#GPT-5.2#DataAgent

why featured

HKR-H/K/R all pass: the hook is agent-run data engineering, with AADE and a 57.29% gain stated. It stays below p1 because this is a single arXiv paper and the code is only promised.

editor take

GPT-5.2 as a data engineer is the right direction, but 57.29% is too shiny without the student model, domains, and baselines exposed.

sharp

GPT-5.2 doing the dirty work of fine-tuning data is a direction I buy; the 57.29% gain is the part I do not trust yet. The paper says GPT-5.2 plans, generates, and iteratively adapts training data, then uses student post-training performance as the feedback signal. That closes a loop most synthetic-data pipelines still hard-code. The thin part is measurement. The arXiv page gives the 57.29% improvement and a planned DataAgent repo, but not the student size, domains, eval sets, or human-designed baselines. Data agents can overfit a benchmark just as fast as they improve a curriculum. Compared with last year’s LLM-as-judge data curation work, this is a cleaner agent loop; its value sits in the reproducibility bill after the code drops.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

The paper reports that Sparse Autoencoders reach near-LoRA steering performance on AxBench when features are selected and labeled with a supervised pipeline, and it challenges Wu et al. 2025, where SAEs underperformed simple baselines.

#Interpretability#Alignment#Benchmarking#AxBench

why featured

HKR-H/K/R all pass, but this is still a specialist arXiv paper and the feed gives no metric deltas. It fits the 78–84 research-discussion band, not P1.

editor take

SAE steering gets a partial comeback: near-LoRA results come from supervised feature selection, not sparse representations magically redeeming themselves.

sharp

SAEs get rescued here, but the rescue is engineered. The AxBench result reaches near-LoRA steering and pushes back on Wu et al. 2025, where SAEs lost to simple baselines. The concrete hook is the supervised feature-selection and labeling pipeline. That is a very different claim from “SAE features steer well out of the box.” The more useful result is the causal-looking label behavior from the interpretability-only components. That gives SAE work a path beyond pretty feature dashboards. The awkward part is their low-l0 finding: high sparsity may not matter much, which cuts against Wang et al. 2025 and weakens the clean mechanistic story people liked. My read: SAE steering is alive, but the winning unit is now feature discovery plus labeling machinery, not sparse purity.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

IAPO assigns token-wise advantages using each token’s conditional mutual information with the final answer, reducing reasoning length by up to 36% while improving accuracy across multiple reasoning datasets.

#Reasoning#Fine-tuning#Inference-opt#IAPO

why featured

HKR-H/K/R all pass: the paper pairs a concrete 36% reasoning-length cut with a per-token CMI advantage mechanism and clear inference-cost resonance. It stays below must-write because it is a single arXiv method paper.

editor take

IAPO turns token saving into per-step information accounting; if the 36% cut holds, reasoning RL loses some of its reward-shaping folklore.

sharp

IAPO’s sharp move is tying token-level advantage to conditional mutual information, not merely asking models to think shorter. The paper claims up to a 36% reduction in reasoning length while improving accuracy across multiple reasoning datasets, with code released on GitHub. That is a cleaner handle than sequence-level reward shaping, because it gives a reason for suppressing specific low-utility reasoning tokens. I’m cautious on the “monotonic reductions without harming correctness” line. The abstract does not expose model sizes, dataset names, or the full baseline table here. If the wins sit on small models or narrow reasoning sets, deployment traffic will eat part of that 36%. Still, the direction is right: a lot of long-CoT work paid for verbose wandering, and IAPO at least moves the accounting down to the token.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Alignment Tampering: How RLHF Is Exploited to Optimize Misaligned Biases

The paper introduces alignment tampering as a vulnerability in RLHF. An LLM can influence preference data built from its own outputs, while pairwise labels mark which answer is better, not why. The authors report amplification across keyword bias, sexism propaganda, brand promotion, and instrumental goal-seeking; existing robust RLHF methods do not fully fix it without quality loss.

#Alignment#Safety#Fine-tuning#Research release

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the summary states a concrete RLHF tampering mechanism, and the topic hits alignment-safety nerves. Single arXiv item with limited experimental detail keeps it in the 72–77 band.

editor take

RLHF takes another clean hit: if labels only say “better,” a model can bundle bias with quality and make the reward model buy both.

sharp

This ICML 2026 paper lands because it attacks RLHF at the label interface, not the vibes. Preference data comes from the model’s own outputs, and pairwise labels record which answer wins, not why it wins. A high-quality biased answer can beat a lower-quality clean answer, then the reward model treats the bias as part of the reward. The concrete list is ugly: keyword bias, sexist propaganda, brand promotion, and instrumental goal-seeking all get amplified. The paper says both RL optimization and best-of-N sampling can trigger it. That should make RLAIF and constitutional setups uncomfortable too, because swapping human annotators for model judges does not separate quality from value signals. The mitigation result is the nastiest part: robust RLHF methods still miss cases or eat response quality.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→The Distillation Game: Adaptive Attacks and Efficient Defenses

The paper formulates distillation attacks as a minimax game and derives PoE, a forward-pass-only defense; on GSM8K and MATH, adaptive students recover more capability than passive evaluation reports, narrowing the robustness gap between expensive defenses and PoE.

#Reasoning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv research item with no disclosed code, lab pedigree, or visible debate. It sits at the featured threshold for practical safety research, below must-write status.

editor take

Anti-distillation still looks brittle: passive students flatter defenses, while cheap forward-pass PoE forces a harsher test.

sharp

The useful cut here is moving anti-distillation evaluation from passive students to adaptive students. The paper casts distillation as a minimax game: the student reweights high-value examples, while the teacher suppresses outputs useful for imitation. On GSM8K and MATH, adaptive students recover more capability than passive evaluation reports, so several state-of-the-art defenses were getting flattered by weak attackers. PoE is the pragmatic part. It combines the teacher with a proxy student during generation and needs only forward passes, according to the abstract. I buy that framing more than another expensive defense recipe: model extraction has always rewarded attackers who choose queries, not students who politely sample a fixed set. The missing piece is scale. The abstract gives no recovery percentages or cost ratios, so the strong claim lives or dies in the PDF tables.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents

The paper tests GPT-4o-mini and Claude Haiku across 20 scenarios and 460 trials, finding GPT-4o-mini attack success drops from 60% at injection depth 1 to 0% at depths 4 and 5.

#Agent#Tools#Safety#OpenAI

why featured

HKR-K is strong with 20 scenarios and 460 trials; HKR-H/R pass for the depth-dependent attack result. Single arXiv source with no replication or cross-source cluster keeps it at the low featured band.

editor take

This frames prompt injection as agent scheduling risk: GPT-4o-mini hits 60% ASR at depth 1, then 0% at depths 4/5.

sharp

The useful claim here is that indirect injection risk is governed by ReAct ordering, not just text sanitization. Across 460 trials, GPT-4o-mini drops from 60% ASR at injection depth 1 to 0% at depths 4 and 5; the paper says sanitizing only the first tool observation covers 67% of measured successful injections. That is a concrete design hint for agent runtimes: inspect early tool returns hardest, and do not treat every observation as equal risk. Claude Haiku shows 0% ASR at every depth, but the paper attributes that to both conservative tool use and instruction resistance, so I would not read it as a clean model-safety win. The persona framing swing from 25% to 75% also lacks significance at N=20 per condition.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

The arXiv paper formalizes LLM reliability with 2 propositions and 1 corollary: no finite intervention dictionary covers all distinguishable failure modes in an unbounded domain, while within bounded operational patches the per-hard-decision intervention budget can grow polylogarithmically with sequence length and become domain-constant after the patch catalogue saturates.

#Safety#Alignment#Reasoning#Research release

why featured

HKR-H/K/R all pass: the paper has a clean impossibility-to-locality hook and a concrete formal claim. It stays in the lower featured band because only the arXiv summary is available, with no author signal, experiments, or reproduction path.

editor take

This paper makes universal LLM reliability look mathematically fake, while enterprise reliability becomes a patch-catalog job. Generic guardrails take the hit.

sharp

Universal LLM reliability gets a formal death sentence here; enterprise reliability survives as local patch work. The paper uses 2 propositions and 1 corollary: no finite intervention dictionary covers all distinguishable failures in an unbounded task domain, while bounded patches like legal review, medical RAG, code repair, and contract extraction can need only a polylogarithmic intervention budget per hard decision. After the patch catalogue saturates, that budget becomes domain-constant. That is rough for generic guardrail vendors. A lot of agent safety stacks still sell a single policy layer across tools and workflows. This paper’s usable recipe is narrower: define the operational patch, mine the recurring failure catalogue, then cover the head. The authors also refuse the cheap long-context victory lap; when hard decisions grow with task length, reliability stays hard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→FML-bench: A Controlled Study of AI Research Agent Strategies from Search Dynamics

FML-Bench evaluates six AI research agents on 18 fundamental ML research tasks across 10 domains. The benchmark separates agent strategy from execution infrastructure and adds 12 process-level metrics; an adaptive agent switches to broader exploration after detecting improvement stagnation and outperforms the other six agents.

#Agent#Benchmarking#FML-Bench#arXiv

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark paper with details limited to task count, agent classes, and process metrics. Featured fit, not same-day must-write.

editor take

FML-Bench drags research agents back to search behavior; 18 tasks is small, but enough to puncture strategy-complexity worship.

sharp

FML-Bench makes the useful cut: it treats research agents as search policies, not as one blended end-to-end score. The setup uses 18 fundamental ML tasks across 10 domains, fixes execution infrastructure, and adds 12 process metrics for six agent types. The annoying result for fancy-agent papers: a simple greedy hill-climber nearly matches the best tree-search agent, while broader diversity and compute cost do not explain final performance. The adaptive agent wins by widening exploration after stagnation, which sounds closer to a tunable research loop than most “AI scientist” demos. I would not overclaim from 18 tasks, but the benchmark attacks the right failure mode.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→End-to-End Compression for Tabular Foundation Models

TACO compresses tabular training datasets into a latent space and tests on the TabArena benchmark, where it reports up to 94x faster inference and up to 97% lower memory use than a state-of-the-art tabular transformer architecture, while retaining performance without significant degradation.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R all pass, but the scope is tabular foundation-model compression rather than a broad model release. Concrete speed, memory, and benchmark claims put it just above the featured threshold.

editor take

TACO hits the actual tabular ICL bottleneck: attention over rows. 94x faster is loud, but don’t draft XGBoost’s obituary yet.

sharp

TACO’s sharp move is not another “tabular foundation model beats trees” claim. It compresses the training set into latent space, then attacks the row-level attention cost that makes tabular ICL ugly at scale. On TabArena, the paper reports up to 94x faster inference and up to 97% lower memory with no significant performance drop. If reproducible, that finally makes tabular transformers look less like a demo and more like a batch-scoring candidate. I’d still be careful with “up to.” The abstract does not give dataset sizes, feature counts, latency methodology, or real-world messy-table comparisons against GBDT. TabPFN-style models have long hit context-size walls; TACO sounds like a real engineering patch. But production tabular ML is won on calibration, missing values, categorical drift, and retraining loops, not one peak speedup on a benchmark.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→EchoRL: Reinforcement Learning via Rollout Echoing

EchoRL identifies an EchoClip from verified-success rollouts using step-level entropy, then feeds the clip back as auxiliary supervision in the RL objective; the paper reports consistent RLVR post-training gains with minimal overhead across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR methods.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-H/K/R pass: the method is counterintuitive, the eval scope is concrete, and RLVR efficiency matters to model teams. No lift size, code, or author authority is disclosed, so it stays below the 78–84 band.

editor take

EchoRL pokes the right RLVR bruise: all-correct rollouts still carry signal, but PPO-style advantage math throws it away.

sharp

EchoRL makes the right diagnosis: late-stage RLVR does not only suffer from sparse rewards; it also silences useful all-correct rollouts. The paper names the failure mode advantage-degenerated: every rollout for a prompt passes verification, reward standard deviation becomes zero, and the policy-gradient vanishes. The concrete hook is step-level entropy. EchoRL extracts an EchoClip from verified-success trajectories and feeds it back as auxiliary supervision inside the RL objective. The authors claim consistent gains across 10 benchmarks, 5 LLM backbones, and 4 RLVR methods with minimal overhead. That is a more practical post-training knob than adding another verifier. But the abstract gives no gain sizes, backbone list, or token-cost accounting, so “consistent” should be read as directional until the tables carry it.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning

The paper introduces a Test-Time Control layer that performs finite-horizon LQR planning over latent states at inference time, integrates as an adapter into pretrained LLMs, and reports up to a 27.8% gain on MATH-500 plus 2–3x Pass@8 improvements on AMC and AIME.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but the evidence is a single arXiv abstract with no code, model list, or reproduction cost disclosed. It fits featured as a reasoning-time optimization paper, not same-day must-write news.

editor take

LQR inside an inference adapter is clever, but +27.8% on MATH-500 is not a verdict; latency, base models, and training cost decide it.

sharp

TTC’s useful move is turning “think longer at inference” into a trainable control layer, not another CoT sampling recipe. It runs finite-horizon LQR over latent states, plugs in as an adapter to pretrained LLMs, and reports up to +27.8% on MATH-500 plus 2–3x Pass@8 gains on AMC and AIME. The fused CUDA kernel detail matters; without it, this would read like elegant control theory bolted onto an expensive search loop. I’d keep the hype capped. MATH-500 and AIME reward search budget, verifier-like effects, and sampling tricks. Pass@8 also hides cost inside the metric. If TTC adds near-one-forward-pass latency, it is a serious inference-time architecture idea. If it behaves like implicit multi-step search, it competes with test-time compute methods under a cleaner math wrapper. Full latency, base-model mix, and training cost are not disclosed here.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Skill Availability and Presentation Granularity in LLM Agents: A Controlled SkillsBench Study

The paper tests GPT-5.5 and DeepSeek V4-Flash on a 30-task SkillsBench subset, where skill conditions raise task-mean pass rates over no-skill by 26.7-36.0 and 18.0-26.0 percentage points, while presentation-granularity contrasts remain small with 95% bootstrap confidence intervals crossing zero.

#Agent#Reasoning#Benchmarking#GPT-5.5

why featured

HKR-H/K/R all pass, but this is a single arXiv study on a 30-task SkillsBench subset, not a model launch or broad industry event. It sits at the lower featured band for agent-evaluation research.

editor take

Stop fetishizing prompt granularity: GPT-5.5 and DeepSeek V4-Flash gain from having skills, while formatting barely clears noise.

sharp

This paper cuts through a lot of agent prompt folklore: ship executable skills first, then worry about presentation. The setup is small but clean: 30 SkillsBench tasks, six skill conditions, five trials per task-condition-model cell, and 1,800 rows total. Against no-skill baselines, GPT-5.5 gains 26.7 to 36.0 pass-rate points, while DeepSeek V4-Flash gains 18.0 to 26.0 points. The granularity story is much weaker: low-abstraction versus high-abstraction guidance lands at +0.7 points for GPT-5.5 and -6.7 for DeepSeek V4-Flash, with 95% bootstrap intervals crossing zero. A lot of agent teams still spend cycles tuning whether a procedure has three bullets or nine. These numbers say coverage beats polish, at least on this controlled SkillsBench slice.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→When Are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

The paper introduces PromptPO, which prompts an LLM with Python descriptions of state spaces, action spaces, and rewards to generate executable policies from rollout feedback; it matches or exceeds standard RL baselines with fewer environment interactions in hard exploration, Meta-World, and real control tasks, but underperforms in MuJoCo continuous-control domains.

#Reasoning#Robotics#Code#Research release

why featured

HKR-H/K/R all pass: PromptPO has a clear mechanism and tests across hard exploration, Meta-World, real control, and MuJoCo. Kept at 76 because no concrete sample-efficiency numbers are disclosed, and this is still an arXiv research item.

editor take

PromptPO is not proof that LLMs replace RL; it shows many control tasks collapse into program synthesis until MuJoCo exposes the limit.

sharp

PromptPO makes a lot of RL look like code search, and that is the sharp result here. The method gives an LLM Python descriptions of state space, action space, and reward, then uses rollout feedback to refine executable policies. It matches or beats standard RL baselines with fewer interactions in hard exploration, Meta-World, and real control tasks. That win smells less like learned policy optimization and more like prior retrieval: proportional controllers, rule plans, and value iteration are already in the model’s code-shaped memory. The MuJoCo failure matters. Fine-grained continuous control removes the easy templates, so PromptPO falls back toward black-box trial and error. I buy this for clean abstractions and programmable rewards; I don’t buy it as a replacement story for classical RL.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench generates planning data from more than 30 task types, subtasks, constraint families, and difficulty factors, with adaptive difficulty control, quality filtering, and instance-level verification checklists; the paper reports that current frontier LLMs still struggle to produce complete solutions under coupled constraints.

#Reasoning#Benchmarking#Alignment#PlanningBench

why featured

HKR-H/K/R all pass: the paper gives a concrete planning-failure hook plus 30+ task types and verification mechanisms. It remains a single arXiv benchmark, not a lab launch or cross-source event, so it sits near the featured floor.

editor take

PlanningBench moves planning evals from static sets to generated constraints; 30+ task families is real, but without a model table, don’t crown it SWE-bench for planning.

sharp

PlanningBench is aiming at a planning data factory, not another frozen benchmark. The concrete hook is the 30+ task types, subtasks, constraint families, and difficulty factors, plus instance-level verification checklists. That matters because planning failures in agents usually come from coupled constraints, not from missing a single step. I buy the framing more than the usual travel-planning mini-evals. Verified instances can become RL data, and the paper claims gains on unseen planning benchmarks and broader instruction following. The caveat is sharp: the snippet gives no model table, scores, training size, or cost. Without GPT-5, Claude Sonnet 4.5, Qwen 3.5, or open-weight baselines shown side by side, PlanningBench reads as a promising generation framework, not yet a field standard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→CoMem: Context Management with a Decoupled Long-Context Model

CoMem decouples memory management from the primary agent workflow and uses a k-step-off asynchronous pipeline to overlap summarization with inference, delivering a 1.4x latency improvement over vanilla long-context baselines on SWE-Bench-Verified while preserving most task performance.

#Agent#Memory#Inference-opt#CoMem

why featured

HKR-H/K/R all pass: k-step-off async memory management and a 1.4x SWE-Bench-Verified latency cut give a testable agent-systems claim. Single arXiv source with no disclosed release artifact keeps it below the 78 band.

editor take

CoMem’s 1.4x latency win is modest, but the bet is right: agent memory won’t be fixed by longer context alone; it needs pipeline engineering.

sharp

CoMem is aimed at the deployment tax of long-context agents, not the context-window race. It splits summarization from the main agent loop, then uses a k-step-off async pipeline to hide memory decoding beside inference. On SWE-Bench-Verified, it reports a 1.4x latency gain over vanilla long-context baselines while preserving most task performance. That 1.4x is not a victory-lap number, but the mechanism is credible. Many agent demos over the last year just stuffed more history into prompts and paid for it in latency and tokens. CoMem treats memory compression as a separately trained and separately scalable module, with reward-driven training to keep it useful for the agent’s next action. The paper snippet gives no absolute latency, throughput curve, or failure cases, so I would not call this a general memory solution yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Block-Based Double Decoders

The paper proposes block-based double decoders, using doubly causal block attention masks for full loss supervision and static sequence packing, and reports at least a 2/3 reduction in KV-cache memory and per-token compute at inference without losing prefill caching.

#Inference-opt#Research release

why featured

HKR-H/K/R all pass: the paper gives a testable inference-optimization mechanism and a ≥2/3 reduction claim. Single arXiv source and no disclosed production validation keep it in the lower featured band.

editor take

Block-Based Double Decoders attack KV-cache cost at the architecture level; the 2/3 cut is juicy, but an 8-page arXiv paper hasn’t proven scale survival.

sharp

Block-Based Double Decoders make the right cost move: push inference savings into the pretraining architecture, instead of patching decoder-only models with quantization, paged KV, or speculative decoding. The paper uses doubly causal block attention masks to keep full loss supervision and static sequence packing. At inference, it claims at least a 2/3 cut in KV-cache memory and per-token compute, while preserving prefill caching. That lands exactly on long-context serving pain. I have doubts about the scale claim. The arXiv entry lists 8 main pages and 13 total pages, and the abstract says scaling-law experiments closely track decoder-only models. The excerpt does not give parameter sizes, token budgets, or downstream task tables. Encoder-decoder efficiency has always looked clean on paper; the hard part is training stability and ecosystem migration. Until someone reproduces this above 10B parameters, it reads like a serious architecture lead, not a deployment answer.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Diffusion Models Preferentially Memorize Prototypical Examples, or Why Does My Diffusion Model Love Slop?

Researchers trained diffusion models on strings generated by the Random Hierarchy Model and found that samples made of common substrings were memorized first; even when every training sample was unique, data-point-level deduplication did not provide a meaningful privacy guarantee.

#Safety#Interpretability#Research release#Safety/alignment

why featured

HKR-H/K/R all pass, but this is a single arXiv research item with mechanism and claim only, not a lab release or cross-source cluster. It lands in low featured as a practical warning on memorization and dedup privacy.

editor take

Dedup is the wrong comfort blanket here: diffusion models memorize common substrings first, so “every sample is unique” doesn’t buy real privacy.

sharp

This paper punctures the lazy intuition that models memorize rare outliers first. In the Random Hierarchy Model setup, diffusion models memorize samples built from common substrings first, even when every full training sample is unique. The clean hook is the fat-tail result: the authors predict and observe delayed memorization in fatter-tailed datasets, with the effect stronger when fat tails enter higher-level production rules. That is bad news for data-governance theater. Sample-level dedup has been treated like a privacy and copyright safety valve; this says the leakage channel sits below the row and above the token, in reusable fragments and production rules. The “slop” angle is also useful, not cute: in the partial-memorization regime, common substrings get learned early and overproduced. Bland output is not just weak taste; it is the model flattening toward prototypical training structure.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Data-Free Online Backdoor Defense

The paper proposes PRISM, an external semantic auditing framework that uses independent VLMs for data-free online backdoor defense; evaluations span 17 datasets and 11 attack types, reporting under 1% attack success rate on CIFAR-10 while improving clean accuracy.

#Vision#Multimodal#Safety#PRISM

why featured

HKR-H/K/R all pass, but this is a single arXiv security paper with high technical load and no cross-source discussion; PRISM’s data-free online audit and <1% ASR result clear the featured bar.

editor take

PRISM’s external VLM auditor is the right instinct, but sub-1% CIFAR-10 ASR is not a proxy for messy production traffic.

sharp

PRISM makes the right architectural bet: stop asking the compromised model to diagnose itself, and move the decision to an independent VLM auditor. The paper reports 17 datasets, 11 attack types, and sub-1% attack success rate on CIFAR-10, while improving clean accuracy. That is a cleaner security boundary than test-time repair tied to corrupted victim weights. I don’t fully buy the “online defense” framing yet. The abstract names a Hybrid VLM Teacher, an Adaptive Router, and statistical margin monitoring, but gives no latency, VLM cost, or failure rate under distribution drift. Backdoor defenses often look great on CIFAR-10, then break when triggers mutate in real traffic. ICML ’26 acceptance says the method is serious; the deployment math is still missing.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

The paper introduces Fatigue Index to diagnose long-horizon generation degradation online, reporting AUROC 0.95 for task degradation and Spearman rho 0.94 for repetition across nine 1B-13B models.

#Inference-opt#Benchmarking#Interpretability#Research release

why featured

HKR-H/K/R all pass, but the source is only an arXiv abstract with no replication or product adoption. Concrete diagnostics make it featured, not same-day must-write.

editor take

Fatigue Index is a useful runtime probe, but the 0.95 AUROC comes from 1B–13B models; don’t pretend it validates frontier-scale monitoring yet.

sharp

Fatigue Index is a decent name for a failure mode practitioners already see: long generations drift, repeat, and stop obeying the prompt. The concrete hook is strong: three runtime signals—prompt-attention decay, representation drift, and entropy miscalibration—predict task degradation at AUROC 0.95 and repetition at Spearman rho 0.94 across nine 1B–13B models. I buy the diagnostic direction more than the claimed production readiness. The scale result is useful: instruction-tuned models under 3B collapse faster than base models, then the trend reverses at 7B. But the paper never touches frontier systems like Claude, GPT, or Gemini-class models, where serving stacks, context management, and decoding policy are different beasts. Treat FI as an alerting probe for eval harnesses, not as a solved reliability layer for production agents.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Light Interaction accelerates interactive video world models by up to 2.59x on HY-WorldPlay and Matrix-Game-3.0 without retraining, using adaptive context management, denoising cache reuse, and hardware-software co-designed 3D block sparse attention with fused Triton kernels.

#Inference-opt#Vision#Light Interaction#HY-WorldPlay

why featured

HKR-H/K/R all pass, but the target is a research-heavy inference method for interactive video world models, narrower than a major model release. The 2.59x no-retraining claim and concrete mechanisms justify featured.

editor take

A 2.59x training-free speedup is nice; the bottleneck for interactive world models is now cheap persistence, not raw generation.

sharp

Light Interaction targets the right pain point: interactive video world models do not fail only on generation quality; they fail when context, attention, and denoising costs grow across long trajectories. The paper claims up to 2.59x speedup on HY-WorldPlay and Matrix-Game-3.0, with no retraining, using adaptive context, denoising-cache reuse, 3D block sparse attention, and fused Triton kernels. That is a more practical route than training a larger world model, because interactive camera control makes users pay the inference bill every turn. I would discount the “competitive visual quality” line until the latency tails and memory curves are shown clearly. The abstract does not give p95 latency, trajectory length limits, or how often the benchmark revisits familiar regions. Cache reuse looks great in closed maps; open exploration eats that advantage fast.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

LongTraceRL generates multi-hop questions with knowledge-graph random walks and trains on search-agent trajectories, then beats strong baselines across three 4B–30B reasoning LLMs and five long-context benchmarks.

#Agent#Reasoning#Benchmarking#THU-KEG

why featured

Single-source arXiv research with no major-lab or cluster signal, so it stays near the featured floor. HKR-H/K/R pass because the method, model scale, and benchmark setup are concrete.

editor take

LongTraceRL treats long context as evidence discipline, not token stuffing; KG walks plus search traces beat random distractor soup.

sharp

LongTraceRL’s useful bit is the data recipe, not another long-context leaderboard bump. It builds multi-hop questions with knowledge-graph random walks, then separates distractors into documents a search agent read but did not cite and documents retrieved but never opened. That matches RAG and agent failure modes better than random stuffed context. The reward design is also sane: entity-level rubric rewards apply only when the final answer is correct, so the model cannot farm process points while breaking the answer. The paper reports gains across three 4B–30B reasoning LLMs and five long-context benchmarks, but the abstract gives no absolute scores or ablation sizes. Without those, I’d treat this as a strong training-data paper, not proof that RLVR has solved long-context reasoning.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Biases in the Blind Spot: Detecting What LLMs Fail to Mention

The paper introduces a black-box automated pipeline that detects unverbalized biases across seven LLMs and three decision tasks, using LLM autoraters, positive and negative input variations, multiple-testing controls, and early stopping to flag statistically significant performance differences absent from the models’ chain-of-thought justifications.

#Reasoning#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv methods paper. The summary gives the pipeline, 7 LLMs, 3 task types, and testing mechanics, but no model ranking or adoption signal, so it lands in low featured.

editor take

CoT monitoring takes another hit: across 7 LLMs and 3 decision tasks, bias changed outcomes without showing up in the rationale.

sharp

This ICML 2026 paper lands because it attacks the lazy safety habit of reading the model’s rationale as evidence. The pipeline runs black-box tests across 7 LLMs and 3 decision tasks: hiring, loan approval, and university admissions. It uses LLM autoraters to propose bias concepts, creates positive and negative input variants, then applies multiple-testing controls and early stopping. The flagged biases include Spanish fluency, English proficiency, and writing formality, while also recovering gender, race, religion, and ethnicity. I buy the direction because it does not start from a hand-written bias list. A lot of fairness evals still miss proxy variables like writing formality. The catch is obvious: the autorater is another model, and the provided body does not list the 7 model names or the exact significance thresholds. For audit work, those PDF details decide whether this is a useful harness or another brittle eval loop.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

The paper proposes a bi-level adversarial training framework that extrapolates jailbroken activations via unsupervised latent direction discovery; across three LLMs and six classical jailbreak families, attack success rates stay mostly below 5%.

#Safety#Alignment#Interpretability#Research release

why featured

HKR-H/K/R all pass: the paper gives a concrete jailbreak-defense mechanism and measurable ASR results. Single arXiv source, no artifact or broader debate disclosed, so it stays at the low featured band.

editor take

Safety steering got a sharper weapon: synthesize jailbreak activations before prompts appear. Treat the sub-5% ASR as a lab result, not deployment proof.

sharp

This ICML 2026 paper makes the right bet: defend in activation space instead of stacking more prompt filters. It discovers unsupervised latent directions, extrapolates from refusal-state harmful-request activations into simulated jailbreak states, then trains a bi-level steering field to push them back into refusal regions. The hard hook is clean: three LLMs, six classical jailbreak families, attack success rates mostly below 5%. I buy the direction, not the deployment claim. Classical jailbreak families do not cover tool-use agents, long-context poisoning, or RAG injection chains. Compared with the last year of safety classifiers plus system-prompt patching, activation steering at least touches the model’s internal representation. The abstract still omits model names, baseline deltas, and benign-task degradation. Without utility cost, sub-5% ASR is only half a scorecard.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→PithTrain: A Compact and Agent-Native MoE Training System

PithTrain introduces a compact agent-native MoE training framework and ATE-Bench for real training-framework tasks, matching production-framework throughput while reducing Agent Turns by up to 62% and Active GPU Time by 64% on the benchmark.

#Agent#Code#Benchmarking#PithTrain

why featured

HKR-K/R pass: the paper gives concrete reductions in Agent Turns and Active GPU Time, tied to training cost. Source authority is limited to an arXiv abstract, and system details or reproduction conditions are not disclosed, so it stays low-featured.

editor take

PithTrain attacks the right bottleneck: agent cost inside MoE framework work. The 62% turn reduction only matters if ATE-Bench is messy enough.

sharp

PithTrain makes the right cut: MoE training frameworks should be judged by agent-operability, not only throughput. The paper claims production-framework throughput while cutting Agent Turns by up to 62% and Active GPU Time by 64% on ATE-Bench. That is closer to how teams actually burn money when agents touch Megatron- or DeepSpeed-style code: reading abstractions, finding the right shard path, compiling, failing, and trying again. Tianqi Chen on the author list also fits the framing; this is a systems-shape paper, not a model-quality paper. I’m holding back on the headline number. The abstract does not disclose task mix, baseline frameworks, agent model, or which task produced the “up to” result. If ATE-Bench is too aligned with PithTrain’s four design principles, 62% becomes home-field advantage rather than a durable systems result.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

GradMem writes context into a small set of prefix memory tokens using a few test-time gradient descent steps while keeping model weights frozen, and the paper reports competitive results on bAbI and SQuAD variants without access to the original context at inference time.

#Memory#Reasoning#Benchmarking#GradMem

why featured

HKR-H/K/R all pass, but the body gives mechanism and task names only; no metrics, code, or production replacement claim is disclosed, so it sits at the low featured band.

editor take

GradMem makes memory a test-time optimization problem; clever, but the latency bill is hidden, so don’t crown it a KV-cache replacement yet.

sharp

GradMem’s useful claim is that forward-only memory writing is too weak, so it spends test-time gradient descent on prefix memory tokens. The model weights stay frozen; only the memory tokens move. In the context-removal setup, inference has no original context, yet the system answers bAbI and SQuAD variants. That is a stricter test than ordinary long-context retrieval. I buy the research direction, not the systems pitch yet. The paper says GradMem beats forward-only writers at the same memory size, and extra gradient steps scale capacity better than repeated forward writes. The RSS snippet gives no memory-token count, step count, wall-clock latency, or GPU memory curve. KV-cache burns memory; GradMem burns per-sample backprop. That trade works for agents reusing one context many times. It looks ugly for one-shot QA.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

The paper proposes the Discrete Transformer, which injects discreteness through temperature-annealed sampling and uses hypothesis testing plus symbolic regression to extract human-readable programs from model weights.

#Interpretability#Reasoning#Research release

why featured

HKR-H/K/R pass, but the body lacks authors, tasks, success rates, or benchmark comparisons. This clears featured as an interpretability research release, not the 78+ band.

editor take

Discrete Transformer is a clean bet: force representations into testable symbols first, then claim interpretability. Don’t confuse this with frontier-model mechanistic insight.

sharp

Discrete Transformer makes interpretability tractable by shrinking the problem into a controlled sandbox, and I buy that trade. The paper injects discreteness with temperature-annealed sampling, then uses hypothesis testing and symbolic regression to recover executable programs from weights. Version 3 landed on May 29, 2026; the abstract claims performance comparable to the RNN-based MIPS baseline on shared discrete tasks, plus coverage for continuous-valued intermediate computations. The catch is scope. The abstract does not disclose task scale, model size, or extraction success rates, so this is not a direct bridge to frontier Transformer internals. It belongs closer to Circuits and Toy Models work: useful because the architecture is built to be taken apart, not because it explains messy superposition in GPT-5-class systems.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

The paper proposes Gap-K% for LLM pretraining data detection, using the log probability gap between the top-1 predicted token and the target token with a sliding window, and reports state-of-the-art results on WikiMIA and MIMIR across model sizes and input lengths.

#Benchmarking#Interpretability#Safety#arXiv

why featured

HKR-H/K/R all pass, but the summary gives only the mechanism and WikiMIA/MIMIR outperformance, with no effect size or code. This fits low featured research, not same-day must-write.

editor take

Gap-K% tracks the scar left by next-token training, not vague memorization; WikiMIA and MIMIR wins still fall short of copyright-grade evidence.

sharp

I buy half of Gap-K%: the top-1-versus-target log-prob gap is closer to the training objective than plain likelihood scoring. If pretraining penalizes cases where the model’s preferred next token diverges from the observed token, that gap is a reasonable scar to measure. The sliding window also addresses a real weakness in token-level membership signals: local text correlations swamp isolated probabilities. The catch is the evidence level. The paper reports state-of-the-art results on WikiMIA and MIMIR across model sizes and input lengths, but the snippet gives no AUC, low-FPR TPR, or cross-domain stress test numbers. That is fine for a research leaderboard. It is nowhere near a copyright or privacy claim against GPT-5.4 mini, Claude Sonnet 4.5, or any closed model with unknown filtering. The method sounds sharper than likelihood baselines; the legal narrative is still outrunning the measurement.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Weight Decay Improves Language Model Plasticity

The paper reports systematic experiments showing that larger weight decay during pretraining improves a base language model’s plasticity after fine-tuning, while the abstract attributes the effect to more linearly separable representations, regularized attention matrices, and reduced training-set overfitting.

#Fine-tuning#Alignment#Research release

why featured

HKR-H and HKR-K pass: the title has a counterintuitive training hook, and the summary names mechanisms for fine-tuning plasticity. Missing model scale, gains, and reproducible setup keeps it at the featured threshold.

editor take

Stop tuning weight decay only against pretrain loss; this paper makes the annoying case that a worse base model can fine-tune better.

sharp

The sharp part is the attack on a lazy default: tuning pretraining only for validation loss. In arXiv:2602.11137 v2, Han et al. change a boring knob, weight decay, and report that larger pretraining weight decay improves plasticity after fine-tuning. They also claim the annoying trade-off: a worse base model after pretraining can land better after downstream training. The evidence hook is mechanistic, not leaderboard theater: more linearly separable representations, regularized attention matrices, and less training-set overfitting. That matters for alignment and domain fine-tuning teams still using base validation loss as the cheap proxy for HPO and early stopping. My pushback is scale: the abstract does not give parameter sizes, data scale, or the task matrix, so I would not assume this survives cleanly at 7B+ without checking the PDF.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Interpretability Without Tradeoffs: Disentangling Polysemanticity at Equal Predictive Performance

The paper introduces ELUDe, an explicit, lossless, unsupervised disentanglement method that splits latent representations in vision models such as DINOv2 and supervised ViT-B/16 into inspectable sub-units while guaranteeing identical outputs by construction.

#Interpretability#Vision#DINOv2#ViT-B/16

why featured

HKR-H/K/R all pass, but this is a single arXiv vision-interpretability paper with no disclosed tooling, external uptake, or production replacement claim, so it stays at the low featured band.

editor take

ELUDe attacks the interpretability-tax story in vision, but without code and LLM results, don’t crown it an SAE replacement yet.

sharp

ELUDe’s sharp claim is not cleaner features; it guarantees identical outputs on models including DINOv2 and supervised ViT-B/16. That targets the usual interpretability tax: sparse autoencoders can make activations easier to read, but they add a learned layer and can move downstream behavior. ELUDe says no training, no labels, and unchanged accuracy by construction. I’m holding back on the victory lap. The snippet says “several vision models” and “runs efficiently,” but gives no accuracy table, runtime, code status, or feature-eval protocol. SAEs have at least been stress-tested across LLM activation work; ELUDe is, for now, a clean surgical trick shown on ViT latents. Whether the same re-routing survives GPT-5-style residual streams is not answered in the disclosed abstract.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→Consolidating Rewarded Perturbations for LLM Post-Training

CoRP consolidates rewarded weight-space perturbations into one deployable LLM update, avoiding RandOpt’s K-pass inference ensemble. Across five models from 0.5B to 8B and five math, code, and creative-writing tasks, it improves base models by 8.1 points on average; with one tenth of RandOpt’s perturbation budget, it beats single-inference RandOpt by 6.5 points at one forward pass per test example.

#Fine-tuning#Inference-opt#Reasoning#CoRP

why featured

HKR-K/R pass: the paper gives concrete metrics and a mechanism, and single-forward deployment versus perturbation search matters for cost. HKR-H is weak and this is a single arXiv source, so it sits at the lower featured edge.

editor take

CoRP is promising, but don't crown it an RLHF replacement; the 8.1-point gain is on 0.5B–8B models, not frontier-scale preference work.

sharp

CoRP’s sharp move is freezing RandOpt’s inference-time lottery into weights. RandOpt needs top-K specialist ensembling, with a 50-pass majority vote for its bigger gains; CoRP uses one tenth of the perturbation budget, runs one forward pass, and still beats single-inference RandOpt by 6.5 points. Across five 0.5B–8B models and five math, code, and creative-writing tasks, it reports an 8.1-point average lift. That is a serious deployment story for small models, especially where batch generation cost matters. I don’t buy the implied challenge to PPO or GRPO yet. The abstract says RandOpt was competitive under matched training compute, then evaluates CoRP on five tasks. Preference alignment, safety refusal behavior, and multi-turn stability are not shown here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 06·01

→The Long-Term Effects of Data Selection in LLM Fine-Tuning

The paper compares six LLM fine-tuning data selection families under a unified multi-stage protocol and finds rank reversal: selectors that improve the current stage can slow later adaptation and increase forgetting.

#Fine-tuning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv fine-tuning methods paper with no lab release, artifact, or production replacement claim; score sits at the featured floor.

editor take

Stop treating fine-tuning data selection as a cost hack; this paper frames today’s best selector as tomorrow’s forgetting bug.

sharp

Fine-tuning data selection fails when teams treat one-stage lift as long-term gain. The paper compares six selector families—random, loss-based, gradient-based, diversity-based, quality-based, and utility-diversity—under a unified multi-stage protocol, then shows rank reversal: selectors that improve the current stage slow later adaptation and increase forgetting. That lands hard for continuous post-training pipelines at OpenAI, Anthropic, and model labs shipping frequent preference or domain updates. LHAS adds coverage, future-proxy transfer, and anti-concentration terms, which is the right shape of fix. But the RSS abstract gives no model sizes, datasets, or benchmark numbers, so I would not treat this as an implementation recipe yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→KernelCraft: Benchmarking Agentic Close-to-Metal Kernel Generation on Emerging Hardware

KernelCraft evaluates LLM agents generating low-level kernels for three emerging accelerators, across more than 20 machine-learning tasks and five configurations per task. The strongest reasoning models produced correct kernels for unseen ISAs within a few refinement steps, and their optimized kernels matched or beat compiler baselines.

#Agent#Code#Benchmarking#KernelCraft

why featured

HKR-H/K/R pass: unseen-ISA kernel generation, 3 accelerators, 20+ tasks × 5 configs, and compiler baselines give substance. The close-to-metal hardware niche lowers accessibility, so it stays below featured.

editor take

KernelCraft tests 3 accelerators and 20+ tasks; unseen-ISA kernels matching compilers is wild, but model names and failure rates aren't disclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning

The paper proposes MeSP, which recomputes LoRA’s intermediate projection h=xA during backward passes; on Qwen2.5 0.5B–3B models, it cuts average memory by 49% versus MeBP while producing mathematically identical gradients.

#Fine-tuning#Inference-opt#Qwen#Research release

why featured

HKR-K/R pass: the paper gives a 49% memory cut, gradient equivalence, and Qwen2.5 0.5B–3B test setting. HKR-H is weak, and this remains a single method paper, below featured.

editor take

MeSP cuts memory 49% on Qwen2.5 0.5B–3B; LoRA on-device tuning should squeeze backward caches first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers

The paper introduces VMoER, a Bayesian approach that confines inference to MoE expert routing, adding under 1% FLOPs while reducing calibration error by 94%, improving routing stability under noise by 38%, and increasing out-of-distribution AUROC by 12% across fine-tuned foundation models.

#Reasoning#Inference-opt#Safety#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv methods paper with no named-lab impact or replication scope. The <1% FLOPs and 94% calibration-error drop place it above routine papers, below featured.

editor take

VMoER confines Bayes to MoE routing at under 1% FLOPs; the 94% calibration drop needs open reproduction.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→What Is Missing? Explaining Neurons Activated by Absent Concepts

The paper proposes two extensions to attribution and feature visualization methods to detect neuron activations caused by absent concepts, then tests them on ImageNet models; the abstract says mainstream XAI methods miss these encoded absences in their standard form and reports improved debiasing when absences are considered, but the snippet does not disclose model counts or metric values.

#Vision#Interpretability#Alignment#arXiv

why featured

HKR-H/K/R pass, but the post only gives method direction and ImageNet setting; no effect size, code, or major-lab signal is disclosed. This stays just below featured.

editor take

The paper adds two XAI extensions, but omits model counts and metrics; absence-activated neurons expose a real blind spot.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

The paper analyzes two spurious-feature channels in DPO-style preference learning for log-linear policies: mean spurious bias and causal-spurious correlation leakage, then proposes tie training with equal-utility preference pairs as data-driven regularization.

#Alignment#Safety#Fine-tuning#Research release

why featured

HKR-K and HKR-R pass: it offers concrete DPO spurious-correlation mechanisms and tie training. As a single arXiv paper with no disclosed results or broad uptake, it stays in the lower 60–71 band.

editor take

DPO gets two spurious-feature channels under log-linear policies; tie pairs look clean, but equal-utility labeling cost is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Skill Reuse as Compression in Agentic RL

The paper introduces ReuseRL, an MDL-based agentic RL method that extracts a shared skill dictionary from successful trajectories and adds a segmentation cost to penalize poorly compressible behaviors. On ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL improves in-distribution and out-of-distribution success over vanilla GRPO and round-length baselines.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the MDL skill-dictionary framing and three agent benchmarks add signal. Kept in all because the summary lacks gain sizes, author context, code, or real-task validation.

editor take

ReuseRL beats GRPO on 3 benchmarks; I buy the MDL angle, but the snippet hides effect sizes.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

IntAttention replaces floating-point softmax with IndexSoftmax in an integer-only attention path. Armv8 CPU experiments report up to 3.7x speedup and 61% lower energy than FP16 baselines, plus up to 2.0x speedup over conventional INT8 attention pipelines.

#Inference-opt#IntAttention#Research release#Open source

why featured

HKR-H/K/R pass, but this is a narrow inference-optimization paper rather than a broad model or product release. The Armv8 speed and energy numbers lift it to the high end of 60–71.

editor take

IntAttention reports 3.7x speedup and 61% less energy on Armv8; the 65% softmax detour is the edge bottleneck to kill.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→LVSA: Training-Free Sparse Attention for Long Video Diffusion

LVSA replaces dense self-attention with training-free block-sparse attention for video diffusion transformers. It cuts compute by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, and enables single-GPU HunyuanVideo 1.5 generation at a 2x horizon where dense attention runs out of memory.

#Vision#Inference-opt#Benchmarking#Wan

why featured

HKR-H/K/R are present: training-free sparse attention, 3.17x compute reduction, and single-card long-video inference hit real GPU-cost nerves. This remains an arXiv method paper without disclosed code, adoption cost, or production validation, so it stays in 60–71.

editor take

LVSA cuts Wan 2.1 1.3B compute 3.17x at 6x horizon; training-free is strong, but VQeval needs outside replication.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter routes LLM requests with a LinUCB contextual bandit over lexical and sentence-embedding features, and its May 20, 2026 RouterArena submission ranked second with a 72.08 arena score, 75.54% accuracy, and a cost of USD 1.00 per 1,000 queries.

#Agent#Embedding#Inference-opt#OrcaRouter

why featured

HKR-K and HKR-R pass: the paper gives a concrete routing mechanism, rank, accuracy, and cost. HKR-H is weak, and no open-source artifact, deployment case, or cross-source cluster is disclosed, so it stays high-all.

editor take

OrcaRouter scored 72.08 for second on RouterArena; LinUCB routing keeps making giant-model-only inference stacks look wasteful.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Token Sparse Attention compresses per-head Q/K/V into a smaller token set during attention, then decompresses outputs to the original sequence, reaching up to 3.23x attention speedup at 128K context with less than 1% accuracy degradation.

#Inference-opt#Research release

why featured

HKR-K and HKR-R pass via a concrete 128K speed result and cost/latency relevance. HKR-H is weak, and a single arXiv inference paper without adoption evidence stays in the 60–71 band.

editor take

Token Sparse Attention hits 3.23x at 128K with under 1% loss; reversible token selection beats one-shot eviction.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Expand Neurons, Not Parameters

The paper shows that increasing neuron count while keeping total non-zero parameters fixed improves accuracy on symbolic Boolean tasks, classifiers over CLIP embeddings, CNNs, and deeper MLPs, with gains tied to lower feature interference and reduced polysemanticity from splitting neurons into sparser sub-neurons.

#Interpretability#Inference-opt#Benchmarking#arXiv

why featured

Single arXiv architecture paper with HKR-H/K/R, but no concrete gain sizes, model scale, or replication detail in the feed. Useful for efficiency-minded practitioners; not same-day must-write.

editor take

More neurons at fixed nonzero parameters improve accuracy; random splits nearly work, which makes superposition look like an engineering constraint.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

SimulCost introduces 4,878 physics-simulation tuning tasks across 13 simulators; frontier LLMs reach 72-81% success in multi-round mode, but run 1.5-2.5x slower than traditional scanning.

#Agent#Reasoning#Benchmarking#Rose-STL-Lab

why featured

HKR-H/K/R all pass: SimulCost has a clear speed-vs-success hook, concrete benchmark scale, and an agent cost lesson. It stays below featured because the physics-simulation scope is narrow and lacks major-lab or cross-source weight.

editor take

SimulCost has 4,878 tasks; 72-81% multi-round success still costs 1.5-2.5x slower than scanning.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Expert Merging in Sparse Mixture of Experts with Nash Bargaining

The paper introduces NAMEx, a Nash Bargaining framework for weighting and merging experts in Sparse MoE models. It reports experiments on language modeling, text and image classification, corruption robustness, and large-scale tests on Qwen1.5-MoE 14B and DeepSeek-MoE 16B in zero-shot and fine-tuning settings.

#Inference-opt#Benchmarking#Qwen#DeepSeek

why featured

HKR-H and HKR-K pass: the Nash Bargaining mechanism is specific, with tests on two MoE bases under zero-shot and fine-tuning settings. HKR-R is weaker because latency, memory, and deployment gains are not disclosed.

editor take

NAMEx merges experts on Qwen1.5-MoE 14B and DeepSeek-MoE 16B; without effect sizes, the Nash framing stays unproven.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Cost-Aware Learning

The paper proposes Cost-Aware SGD and Cost-Aware GRPO, sampling finite-sum components by gradient norms and costs, and reports that experiments on 1.5B, 4B, and 8B LLMs reduce policy-optimization tokens while matching or exceeding baseline accuracy.

#Fine-tuning#Inference-opt#Research release

why featured

HKR-K/R pass: the methods and 1.5B/4B/8B experiments add real signal, and token savings map to team costs. No reduction percentage or artifact is disclosed, so this stays high-all rather than featured.

editor take

Cost-Aware GRPO cuts policy-optimization tokens on 1.5B/4B/8B; no ratio disclosed, but cost-weighted sampling beats batch fiddling.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→The SuperActivator Mechanism: Transformers Concentrate Reliable Concept Signals in the Tail

The paper presents the SuperActivator mechanism: concept-aligned attention heads amplify activation gaps, and detection typically peaks using 5–10% of in-concept token activations, with F1 improving by up to 0.14 over standard aggregators and prompting baselines.

#Interpretability#Multimodal#Benchmarking#Research release

why featured

HKR-H/K pass: the tail-signal mechanism is a real hook and the abstract gives 5–10% token and +0.14 F1 claims. Single arXiv paper with limited application context keeps it in the 60–71 band.

editor take

SuperActivator peaks at 5–10% concept tokens and adds up to 0.14 F1; I buy the tail-signal claim, pending replication.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→XLGoBench: Detecting Cross-Lingual Skill Gaps with Algorithmic Tasks

XLGoBench detects cross-lingual skill gaps in large language models with synthetic algorithmic tasks, where each task can vary in complexity and has an objective correctness criterion; the abstract says extensive experiments expose persistent gaps across multiple state-of-the-art models.

#Benchmarking#Reasoning#XLGoBench#Research release

why featured

HKR-K and HKR-R pass: the paper adds an objective cross-lingual algorithmic benchmark with generated complexity. HKR-H is weak, and the summary gives no gap numbers or model ranking, so it stays in the 60–71 band.

editor take

XLGoBench uses synthetic algorithmic tasks for cross-lingual gaps; model names aren’t disclosed, so trust the auditable templates, not “SOTA.”

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→The Illusion of Generalization in Tabular Language Models

The paper re-evaluates Tabula-8B on 165 UniPredict datasets and reports near-zero median lift over majority-class baselines for binary and categorical classification, with aggregate gains driven by quartile tasks, pervasive train-test overlap, task-level leakage, and instruction tuning without tabular exposure recovering 92.2% of standard classification performance.

#Benchmarking#Reasoning#Fine-tuning#Tabula-8B

why featured

HKR-H/K/R pass: the paper offers a concrete benchmark critique of Tabula-8B on 165 UniPredict datasets. Scope is niche, so it stays in the 60–71 band rather than featured.

editor take

Tabula-8B shows near-zero median lift on 165 UniPredict datasets; I don’t buy TLM generalization when non-tabular tuning recovers 92.2%.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Forgetting Has Neighbors: Localized Collateral Forgetting in Machine Unlearning

The paper compares unlearned models with models retrained after deletion and finds pointwise discrepancies grow near the forget set for gradient-ascent and random-labeling methods, with or without retain-set fine-tuning; it proposes Local Teacher Distillation using soft labels from a small teacher trained on retained neighbors.

#Safety#Fine-tuning#Research release#Safety/alignment

why featured

HKR-H/K/R are present, but this is a single arXiv machine-unlearning paper; the article discloses no code, affiliations, or cross-source pickup. The localized forgetting mechanism keeps it in all, below featured.

editor take

This pins unlearning failure to local neighborhoods; CIFAR-100 numbers aren’t disclosed, but aggregate-only unlearning evals deserve demotion.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models

The paper evaluates 11 models on 4 math, legal, and medical reasoning datasets. Higher-authority misleading endorsements reduce accuracy and increase confidence in wrong answers.

#Reasoning#Interpretability#Benchmarking#Research release

why featured

HKR-H/K/R pass, but the post gives only 4 datasets, 11 models, and directional results; model names, effect sizes, and reproducibility details are not disclosed, keeping it below featured.

editor take

11 models across 4 reasoning sets follow high-authority wrong endorsements; expert labels are now an attack surface.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→MLIPilot: LLM-Driven Auto-Research for Machine-Learned Interatomic Potentials

MLIPilot uses tool-calling LLM agents to propose hypotheses, edit MLIP training code, launch HPC jobs, and accept or revert changes with a fixed physics-constrained scorecard across MACE optimization benchmarks.

#Agent#Code#Tools#OpenAI

why featured

HKR-H/K/R all pass: the agent loop is concrete and relevant to research automation. Kept in 60–71 because MLIP/MACE/HPC is niche, and the post gives no result numbers, open artifact, or reproducibility detail.

editor take

MLIPilot tests four LLM families on MACE optimization; I buy the physics scorecard, not the “auto-research” framing.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Aligning Dense Retrievers with LLM Utility via Distillation

The paper proposes Utility-Aligned Embeddings, which trains a bi-encoder with perplexity-reduction distillation, improving Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over BGE-Base on QASPER.

#RAG#Embedding#Fine-tuning#QASPER

why featured

HKR-H/K/R all pass, but this is a single arXiv retrieval paper with evidence limited to QASPER vs BGE-Base, not a must-write product or framework release; lower-band score is 70.

editor take

UAE lifts BGE-Base Recall@1 by 30.59% on QASPER; distilling perplexity gain into a bi-encoder cuts reranking cost 180x.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems

The paper proposes an accountability attribution framework for multi-stage AI development, using counterfactual estimators to quantify how pretraining, fine-tuning, and alignment stages affect model behavior without retraining the model.

#Alignment#Interpretability#Safety#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with no disclosed metrics, author signal, or visible debate; useful research signal, below the featured threshold.

editor take

This paper attributes behavior across pretraining, fine-tuning, and alignment without retraining; I want proof it survives billion-scale models.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Automating Formal Verification with Reinforcement Learning and Recursive Inference

The thesis uses RLVR and verifier-guided search to improve Dafny and Lean generation. Dafny verified reward rose from 2.2% to 58.1%, filtered multi-turn RLVR raised pass rate from 9.7% to 31.1%, and a Lean scaffold improved VeriCoding pass rate from 46.2% to 69.2%.

#Code#Reasoning#Tools#arXiv

why featured

HKR-K is strong with concrete Dafny gains, and HKR-R fits code-agent reliability. Kept below featured because Lean/Dafny formal verification is specialist, with no code, authors, or reproducible setup disclosed.

editor take

RLVR lifted Dafny verified reward to 58.1%, but spec hacking broke the story; formal-verification rewards need adversarial specs, not pass-rate worship.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

The paper uses graph-constrained path selection to generate multi-hop training data from plain unannotated text, then fine-tunes Qwen3-32B on 80K CUAD legal-contract examples and raises closed-book Token F1 from 21.66% to 38.58%, with the full-scale gain attributed to a 4.4× expansion of usable corpus rather than higher per-chain quality.

#Reasoning#Fine-tuning#Embedding#Qwen

why featured

HKR-H and HKR-K pass: the method and CUAD numbers are concrete for synthetic training data work. HKR-R is weaker, and a single arXiv paper without code or cross-source traction stays in the 60–71 band.

editor take

Qwen3-32B gets 80K CUAD samples and Token F1 jumps 21.66 to 38.58; the gain is corpus yield, not better chains.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

WAV decomposes action-conditioned state prediction into state plausibility and action reachability checks. Across nine MiniGrid, RoboMimic, and ManiSkill tasks, it reports 2x higher sample efficiency and over 22% better downstream policy performance.

#Robotics#Reasoning#Benchmarking#Research release

why featured

HKR-H/K pass: the title has a self-improving world-model hook, and the article gives WAV’s mechanism plus nine-task results. HKR-R is narrow, and this remains a single arXiv paper below the featured threshold.

editor take

WAV reports 2x sample efficiency across 9 tasks. Video-derived subgoals plus inverse checks beat brute forward prediction.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

DARTS uses distribution-aware trajectory sampling and adaptive redundancy allocation to shorten long-tail rollout distributions in LLM reinforcement learning, reporting up to 1.77x acceleration over state-of-the-art systems without compromising model performance.

#Reasoning#Inference-opt#DARTS#arXiv

why featured

HKR-K/R pass: 1.77x speedup and rollout-tail shaping are concrete and cost-relevant. HKR-H is weak, and the arXiv systems angle is specialized, so this stays in all.

editor take

DARTS reports up to 1.77x faster RL rollouts; I care whether it cuts verbosity or silently narrows exploration.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→AMNESIA: A Large-Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

AMNESIA introduces an open-source medical unlearning benchmark with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories, evaluating four unlearning methods at random-patient and disease levels.

#Fine-tuning#Safety#Benchmarking#AMNESIA

why featured

HKR-K and HKR-R pass: the dataset scale and evaluation setup are concrete, and medical unlearning ties to privacy compliance. As a single arXiv benchmark without visible adoption or debate, it stays in the interesting-but-not-featured band.

editor take

AMNESIA ships 70,560 medical unlearning QAs; patient-level forgetting damages same-disease knowledge, a concrete failure mode benchmarks often dodge.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

The paper trains a lightweight linear probe on one language to predict answer correctness from intermediate representations, then transfers it zero-shot to unseen languages, with ablations showing confidence features concentrate in middle layers.

#Reasoning#Interpretability#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers a testable cross-lingual confidence-estimation mechanism and touches multilingual reliability. No models, datasets, or numbers are disclosed, so it stays in the 60–71 band.

editor take

A monolingual linear probe transfers zero-shot across languages; models and datasets aren’t disclosed in the snippet, so I’d audit the middle-layer confidence-subspace claim first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

arXiv 2605.31021 proposes an evaluation framework using synthetic cognitive profiles, replacing a single assessment function with a state-space constrained manifold, and reports that sequential inference and stochastic prompt perturbations degrade persona coherence through state-space drift and semantic inconsistency.

#Alignment#Benchmarking#Safety#Research release

why featured

HKR-K/R pass: the paper offers a new eval mechanism and testable drift conditions tied to alignment. Single arXiv item lacks models, sample size, and metrics, so it stays in the 60–71 band.

editor take

arXiv 2605.31021 discloses only the abstract, no models or sample size; persona eval lives or dies on drift reproducibility.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→CacheProbe: Auditing Prompt Cache Isolation in Gateway APIs

CacheProbe audits prompt-cache isolation in OpenRouter’s API gateway, testing whether shared organizational credentials create global cache sharing across all OpenRouter users; the RSS snippet describes the threat model and cites Gu et al. at ICML 2025, but does not disclose empirical results.

#Inference-opt#Safety#OpenRouter#Gu et al.

why featured

HKR-H and HKR-R pass because prompt-cache isolation is a real AI API risk. HKR-K fails: no CacheProbe results, sample size, or vulnerability conclusion are disclosed, so this stays in the 60–71 band.

editor take

CacheProbe tests OpenRouter prompt-cache isolation, but results are undisclosed; I’d inspect the gateway credential model before buying the vuln headline.

HKR breakdown

hook ✓knowledge —resonance ✓

→ open source

SCORE

H1·K0·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

GLIDE unifies PPI++, Stratified PPI, Predict-Then-Debias, Active Statistical Inference, and four sampler types in a scipy-style Python API for mean estimation. The paper says an agentic evaluation case study reduces human annotation at equivalent precision, but the RSS snippet does not disclose the exact savings rate.

#Agent#Benchmarking#Tools#GLIDE

why featured

HKR-K/R pass: GLIDE packages several PPI methods into a scipy-style API for agent evaluation costs. But it is a single arXiv source, technically narrow, and lacks a labeling-savings number, so it stays in 60–71.

editor take

GLIDE unifies 4 PPI estimator families and 4 samplers; savings rate is undisclosed, so treat it as eval plumbing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

The paper introduces a learning-to-refine framework that uses compiler outputs to compress diverse proof attempts into structured failure modes. Under comparable test-time budgets, the method reports state-of-the-art PutnamBench results among publicly reported roughly 8B and 32B parameter models, while avoiding long histories of proof attempts.

#Reasoning#Code#Tools#PutnamBench

why featured

HKR-H and HKR-K pass: the mechanism and benchmark condition are concrete. HKR-R is weak because formal theorem proving is niche, with no absolute lift or usable artifact disclosed.

editor take

Compile to Compress turns compiler errors into failure modes; 8B/32B PutnamBench SOTA is reported, but rollout budgets lack detail.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Retriever Portfolios: A Principled Approach to Adaptive RAG

The paper introduces Retriever Portfolios for adaptive RAG, using an expected best-of-k objective to select a small diverse retriever subset, and reports better retrieval metrics and answer quality than single-retriever and naive multi-retriever baselines across multiple QA benchmarks.

#RAG#Inference-opt#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers a concrete adaptive RAG mechanism and benchmark claim. HKR-H is weak, and this is still an arXiv-level retrieval optimization result, so it stays in all.

editor take

Retriever Portfolios uses expected best-of-k to pick few retrievers; RAG tuning hurts most at latency and token cost.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→SemStruct: Contextualizing Semantic Embeddings with Structural Information for Schema Matching

SemStruct models tables as heterogeneous graphs with column and value nodes, trains only a lightweight structural encoder, keeps the PLM frozen, and outperforms fully fine-tuned baselines on the Valentine and SOTAB-SM schema-matching benchmarks.

#Embedding#Benchmarking#SemStruct#Valentine

why featured

HKR-H and HKR-K pass: a frozen PLM plus a lightweight structural encoder beating full fine-tuning is a concrete mechanism and claim. The schema-matching niche limits HKR-R, so it stays in the 60–71 all band.

editor take

SemStruct freezes the PLM and trains a structural encoder; beating Valentine and SOTAB-SM baselines is a clean jab at text-only table matching.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

The paper proposes DMoA, a multi-agent framework that sparsely activates agents at each reasoning step, uses predictive entropy as a self-supervised routing signal, and reports state-of-the-art results across 9 benchmarks.

#Agent#Reasoning#Inference-opt#Research release

why featured

HKR-H/K/R pass, but only arXiv-level facts are available: SOTA on 9 benchmarks and a routing mechanism, with no code, model scale, cost curve, or real-task replication disclosed.

editor take

DMoA reports SOTA on 9 benchmarks, with no cost disclosed; adaptive routing is neat, but agent swarms still need a bill.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS uses a two-stage training framework and GRPO to optimize in-context object localization without category supervision; its 7B-parameter model outperforms models up to 72B parameters in experiments, while the snippet does not disclose dataset names.

#Vision#Multimodal#Benchmarking#FOCUS

why featured

HKR-H/K/R are present, but this is a single arXiv vision-localization paper with no dataset name, code, or outside validation disclosed. It stays in the 60–71 band.

editor take

FOCUS 7B beats up to 72B; datasets aren’t disclosed, so hold applause—the anti-category-supervision direction is right.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

The paper frames circuit discovery as statistical estimation built on causal mediation analysis and reports that exact single-input CMA scores have high intrinsic variance, while small input-data or hyperparameter perturbations yield different circuits.

#Interpretability#Research release

why featured

HKR-H/K/R all pass: the paper makes a concrete reliability claim about circuit discovery. It stays in the 60–71 band because it is a technical arXiv interpretability paper with no disclosed code, scale, or debate signal.

editor take

The paper recasts circuit discovery as CMA estimation; high single-input variance undercuts MI’s tidy deterministic circuit diagrams.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

The paper proposes Don't Fool Me Twice for mobile robots facing embodiment-specific disturbances in unstructured environments. The agent records disturbance effects, queries a VLM with visual context for causes, models local anomalies with kernel regression, and validates four hypotheses in simulation and hardware across embodiments and adversity modes.

#Robotics#Reasoning#Vision#Research release

why featured

HKR-H and HKR-K pass: the paper offers experience-driven reasoning with VLM attribution and kernel-regression anomaly modeling, tested in simulation and hardware. Its academic robotics focus lacks broad practitioner resonance, so it stays in all.

editor take

Don't Fool Me Twice validates 4 hypotheses in sim and hardware; I buy online attribution, but baselines and failure rates are undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→How Can Embedding Models Bind Concepts?

The paper analyzes why CLIP fails at concept binding: scene embeddings decompose additively into object representations, but CLIP’s binding function remains high-complexity; controlled Transformers trained from scratch learn multiplicative interactions and generalize when data coverage is sufficient.

#Embedding#Multimodal#Vision#CLIP

why featured

HKR-H and HKR-K pass: the paper gives a concrete mechanism for CLIP binding failures. It remains research-heavy with limited product or competitive impact, so it fits the 60–71 band.

editor take

CLIP decomposes scene embeddings additively, yet binding stays high-complexity; I buy this diagnosis over another retrieval leaderboard.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Fixed-Point Masked Generative Modeling

CoFRe replaces part of the denoiser with a fixed-point solver and cuts OpenWebText parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 under 96 transformer-block forward passes versus MDLM.

#Inference-opt#Multimodal#Fine-tuning#arXiv

why featured

HKR-H/K/R pass via a novel mechanism, concrete efficiency numbers, and cost resonance. Still, this is a specialist arXiv architecture paper with evidence limited to OpenWebText/CoFRe, so it stays in the 60–71 band.

editor take

CoFRe cuts OpenWebText params 38.8% and hits 101.8 PPL; masked LMs finally get a credible compute story.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→On the “Induction Bias” in Sequence Models

The paper compares transformers and RNNs on state-tracking data efficiency, finding that transformers require training data that grows faster with state-space size and sequence length, while cross-length weight sharing is negligible or harmful even when train and test distributions match.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass, but the post gives conclusions without experiment scale, datasets, or reproduction details. It is core ML research signal, fit for all but below the featured threshold.

editor take

Transformers lose to RNNs even in-distribution on state tracking; no multiplier disclosed, but failed length weight sharing cuts deep.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→The Information Geometry of Softmax: Probing and Steering

arXiv:2602.15293v2 introduces dual steering, a linear-probe method for steering representations toward a target concept, and proves it minimizes changes to off-target concepts while empirically improving controllability and stability.

#Interpretability#Alignment#Research release

why featured

HKR-K/R pass: dual steering is a testable steering mechanism tied to model control. HKR-H misses, and the single arXiv post gives no experiment scale, model list, or product path, so it stays in all.

editor take

arXiv:2602.15293v2 proves dual steering; I’d check replication before treating linear probes as control knobs again.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Quantifying the Uncertainty of Foundation Models with Singular Value Ensembles

The paper proposes Singular Value Ensemble, freezing singular vectors and training only per-member singular values, keeping the base model’s parameter increase below 1% while improving calibration on NLP and vision tasks without reducing predictive accuracy.

#Benchmarking#Vision#Research release

why featured

HKR-K and HKR-R pass: an under-1% parameter-overhead ensemble method is concrete and relevant to reliability. As a single arXiv paper with a technical title, it stays below featured.

editor take

SVE adds <1% parameters for calibration; I like the engineering, if singular vectors really hold as “knowledge directions.”

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

REAL uses a generalized policy gradient to optimize regression rewards for LLM-as-a-Judge, and on Qwen3-32B it improves over the SFT baseline by +8.40 Pearson and +7.20 Spearman.

#Reasoning#Fine-tuning#Benchmarking#Qwen

why featured

HKR-K and HKR-R pass: the post gives a training mechanism and Qwen3-32B metric gains, and it hits eval trust. It remains a narrow arXiv method paper without tooling or production proof, so it sits in 60–71.

editor take

REAL beats SFT on Qwen3-32B by +8.40 Pearson; binary RL rewards are a bad fit for 5-point judge scoring.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Covariance Structure and Coordinate Heterogeneity Govern Binary Quantization of Contrastive Embeddings

The paper analyzes binary quantization for contrastive embeddings with a Gaussian model. Experiments cover 18 datasets and 9 embedding families; off-diagonal covariance contributes 30–50% of the signal, while coordinate heterogeneity governs the value of extra bits and whether random rotation helps or hurts.

#Embedding#Inference-opt#Benchmarking#arXiv

why featured

HKR-K is solid with 18 datasets, 9 embedding types, and a 30–50% signal claim. HKR-R fits embedding cost/quality tradeoffs, but HKR-H fails and the mechanism is specialized, so it stays all.

editor take

The paper tests 18 datasets and 9 embedding families; 30–50% of signal sits off-diagonal, so stop blindly rotating BQ embeddings.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→How Does Bayesian Sampling Help Membership Inference Attacks?

The paper proposes Bayesian Membership Inference Attack, which uses Laplace approximation on a single reference model to estimate a posterior over parameters; experiments span image, text, and tabular datasets, and the authors report state-of-the-art effectiveness and efficiency.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-K/R pass: the paper adds a single-reference-model MIA with Laplace posterior sampling across image, text, and tabular tests. HKR-H fails because the angle is a specialist methods paper, so it stays in 60–71.

editor take

BMIA uses one reference model plus Laplace posterior sampling; multi-reference MIA just got cheaper, so average privacy risk reports look weaker.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Re-examining Low-Rank Adaptation for Private LLM Fine-Tuning

The paper proposes restoring the fast singular-value decay of gradients during DP-SGD private fine-tuning, and evaluates it on GLUE, E2E, and DART with RoBERTa, Qwen, and Llama models up to 4B parameters while keeping the same privacy guarantees.

#Fine-tuning#Safety#Inference-opt#RoBERTa

why featured

HKR-K is clear via DP-SGD private tuning, singular-value decay, and tests up to 4B parameters. HKR-R comes from privacy and sample efficiency, but HKR-H is weak, so this stays in the 60–71 band.

editor take

The paper restores gradient singular-value decay in DP-SGD; I buy it, since DP-LoRA controls rank but ignores spectral damage from noise.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

The paper introduces CoSee to audit read-write-verify loops in document VQA with 4B–8B weak learners, and finds that without explicit verification, shared workspaces can amplify hallucinations and make extra compute correlate negatively with accuracy.

#Agent#Vision#Benchmarking#CoSee

why featured

HKR-H/K/R pass, but this is a single arXiv paper with only setup and headline finding disclosed; benchmark size, effect numbers, and artifacts are missing, so it stays in the 60–71 band.

editor take

CoSee tests 4B–8B document VQA: without explicit verification, shared workspaces amplify hallucinations; small-agent teams shouldn’t add rounds first.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

DetAS-X models object detection as a dynamic decision process, uses an MLLM to select restoration modules and specialized detectors, and reports a 28.36% average F1 gain across six benchmarks, with a 37.01% gain on DarkFace.

#Agent#Multimodal#Vision#Research release

why featured

HKR-H and HKR-K pass: DetAS-X has a clear agentic routing mechanism and six-benchmark gains. It remains a single arXiv vision paper with limited industry spread or HKR-R resonance, so it stays in all.

editor take

DetAS-X lifts F1 by 28.36% across six benchmarks; I’d scrutinize toolbox cost, since inference latency is undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

Auto-Discovery-Bench tests agents on three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery; across models, performance declines as variable count, trajectory length, and distractors increase.

#Agent#Reasoning#Benchmarking#Auto-Discovery-Bench

why featured

HKR-K/R pass: it gives reproducible stress factors for agent state tracking. Single arXiv paper, with no model list, scores, or code disclosed in the summary, so it stays in the 60–71 band.

editor take

Auto-Discovery-Bench tests 3 discovery tasks; I buy the split: skip science-agent hype until long-range state tracking holds.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Is the Last Layer Sufficient for Uncertainty Quantification?

The paper compares full-network and last-layer linearized GLMs for epistemic uncertainty quantification, using random matrix theory and large-scale empirical evaluation; it finds no meaningful UQ gain from full linearization, while the last-layer approximation delivers comparable performance with lower computational cost.

#Safety#Benchmarking#Research release

why featured

HKR-H/K/R pass: the paper claims last-layer linearized GLMs can approximate full-network UQ with lower compute. It remains a single arXiv research item with high theory overhead and no product or open-source artifact, so it stays in 60-71.

editor take

arXiv 2605.30741 finds no UQ gain from full linearization; last-layer GLMs deserve baseline status until tasks are disclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

The paper decomposes an LLM policy into internal layer and modular policies via the Transformer residual stream, reports progressive reasoning in Qwen versus abrupt convergence in Llama, and proposes BuPO to optimize internal layers during early RL stages on complex reasoning benchmarks.

#Reasoning#Fine-tuning#Interpretability#Qwen

why featured

HKR-H and HKR-K pass: the title has a counterintuitive hook, and the summary gives a residual-stream decomposition plus BuPO. No experiment numbers or code are disclosed, and HKR-R is weak, so this stays in the 60–71 research-signal band.

editor take

BuPO claims gains on complex reasoning, but scores aren’t disclosed; if layer-level RL holds, Qwen/Llama divergence is the sharp part.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

RayDer uses one feed-forward transformer to combine camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video; across multiple model sizes and orders of magnitude in data, the paper reports clean power-law scaling and zero-shot open-set results competitive with supervised methods.

#Vision#Multimodal#Benchmarking#RayDer

why featured

HKR-H/K pass: the mechanism is concrete and the scaling-law claim is testable. HKR-R is weak because RayDer is still a niche NVS research paper without product implications or major-lab pull, so it stays in 60–71.

editor take

RayDer folds 3 NVS modules into one transformer; if its power laws reproduce, video self-supervision gets a scalable shape.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Representation Collapse in Sequential Post-Training of Large Language Models

The paper defines a measurement suite for hidden states, logits, token trajectories, and LoRA updates. It analyzes five post-training settings: supervised fine-tuning, preference optimization, safety/refusal tuning, math and code specialization, and long chain-of-thought tuning under controlled stage orderings.

#Fine-tuning#Alignment#Interpretability#Research release

why featured

HKR-H and HKR-K pass: the collapse angle is clickable, and the post gives a concrete measurement suite across 5 stages. It remains a single arXiv methods paper with no disclosed model list, experiment scale, or production impact.

editor take

The paper tests collapse across 5 post-training regimes; I buy the setup, but RSS omits models, scale, and effect sizes.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

The paper evaluates test-time compute across seven VLMs and six benchmarks, testing feature scoring and majority voting. It proposes ETTC, an entropy-based selector that beats majority voting and the best single model in ensembles.

#Vision#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv benchmark paper with no disclosed code, source authority, or cross-source pickup. The 7-model/6-benchmark ETTC result is useful, not same-day featured.

editor take

Seven VLMs, six benchmarks: single-model voting barely helps; ETTC’s entropy selector beats brute-force sampling as the cleaner TTC bet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

The paper proposes self-captioning multimodal interaction tuning, using a Multimodal Interaction Gate to convert unique interactions into redundant ones, reducing visually induced errors by 38.3% and improving consistency by 16.8% under ambiguous or corrupted modalities.

#Multimodal#Vision#Alignment#Research release

why featured

HKR-K and HKR-R pass: the paper offers a concrete mechanism plus 38.3%/16.8% results, tied to VLM robustness. Single arXiv paper, jargon-heavy title, and no disclosed artifact or deployment keep it in 60–71.

editor take

This paper reports 38.3% fewer visually induced errors via redundancy amplification; I buy the angle, robustness beats purity-of-grounding dogma.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Speculative Decoding Across Languages

The paper compares three draft-model strategies for speculative decoding across 11 languages. Task-specific distillation improves translation efficiency but generalizes poorly to story generation; n-gram draft models have lower acceptance rates yet deliver large speed-ups because draft generation is much faster.

#Inference-opt#Fine-tuning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers concrete experimental axes and inference-speed findings. HKR-H is weak, and speculative decoding remains specialized, so it stays in the lower all band.

editor take

The paper tests spec decoding on 11 languages; I’d bet on n-grams here: lower acceptance, faster drafts, less fine-tune debt.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

The paper proposes OSRM to constrain LoRA subspaces before fine-tuning and evaluates model merging on 8 datasets, 3 widely used LMs, and 2 large LMs.

#Fine-tuning#OSRM#LoRA#Research release

why featured

HKR-K/R pass: OSRM gives a testable mechanism and concrete evaluation scope, tied to LoRA merge pain. Single arXiv paper and narrow title keep it below featured.

editor take

OSRM tests LoRA merging on 8 datasets and 5 LMs; pre-constraining subspaces is practical, but gains are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

The paper proposes a goal-directedness evaluation framework for LLM agents. In a 2D grid-world case study, it compares behavior with optimal policies across grid sizes, obstacle densities, and goal structures, then uses probes to decode coarse spatial maps and multi-step action plans from internal representations.

#Agent#Interpretability#Reasoning#Research release

why featured

HKR-H and HKR-K pass: testing whether agents are genuinely goal-directed is a clean hook, with a 2D gridworld and probe findings. No major lab, tool release, or production validation, so it stays below featured.

editor take

The evidence is a 2D grid world with probes decoding coarse maps and plans; don’t sell it as general agent-goal measurement.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

MASPOB optimizes prompts for multi-agent systems using UCB bandits, GNN topology representations, and coordinate ascent. The paper says it reduces search complexity from exponential to linear and outperforms existing baselines across multiple benchmarks, but the RSS snippet does not disclose benchmark names or exact scores.

#Agent#Tools#Benchmarking#MASPOB

why featured

HKR-K and HKR-R pass: the mechanism and complexity claim are concrete, and agent prompt tuning is a real pain. Single arXiv paper with no code, named lab, or discussion keeps it in all.

editor take

MASPOB claims exponential-to-linear MAS prompt search, but names no benchmarks or scores; I’d file it as promising plumbing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Efficient Benchmarking Is Just Feature Selection and Multiple Regression

The arXiv paper reframes efficient LLM benchmarking as feature selection plus multiple regression, then uses kernel ridge regression for score prediction and mRMR for question subset selection; outside very data-poor settings, the method reports lower MAE and RMSE plus higher Spearman ρ and Kendall τ than existing efficient benchmarking approaches.

#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the title has a contrarian framing and the summary gives mRMR/KRR under a stated data condition. It is eval-method research with no concrete gains disclosed, so it stays in the 60–71 band.

editor take

KRR+mRMR beats prior efficient benchmarking methods; honestly, this reads like statistics catching up with LLM eval folklore.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence

The paper introduces BaLoRA, which projects LoRA iterates onto a balanced manifold to preserve the adapted matrix and improve conditioning; the abstract says it converges faster than standard LoRA across fine-tuning tasks, but the snippet does not disclose exact speed gains.

#Fine-tuning#Research release

why featured

HKR-K passes on the balanced-manifold projection mechanism, and HKR-R passes for fine-tuning cost pressure. The post gives no concrete speedup numbers, so this stays in the 60–71 band.

editor take

BaLoRA projects LoRA onto a balanced manifold; no speed numbers disclosed, so I’d file it as a plug-in training trick.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Differentially Private Preference Data Synthesis for Large Language Model Alignment

The paper introduces DPPrefSyn, an algorithm that uses a Bradley-Terry preference model, public prompts, and DP-PCA to synthesize differentially private preference data for LLM alignment; the code is available on GitHub.

#Alignment#Safety#Fine-tuning#DPPrefSyn

why featured

HKR-H/K/R all pass because DPPrefSyn gives a concrete DP preference-synthesis mechanism and code. Single arXiv source, with no experiment numbers, data scale, or production replacement claim, keeps it in the all band.

editor take

DPPrefSyn uses BT modeling and DP-PCA for preference synthesis; ε, baselines, and model scale are absent, so “strong DP” is not deployment evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→dgMARK: Decoding-Guided Watermarking for Diffusion Language Models

The paper proposes dgMARK, a decoding-guided watermarking method for discrete diffusion language models that steers unmasking order with a binary-hash parity constraint and uses sliding-window detection for insertion, deletion, substitution, and paraphrasing edits.

#Safety#Inference-opt#Research release

why featured

Single arXiv paper with a concrete dLLM watermarking mechanism, but no disclosed metrics, artifact, or deployment path; HKR-K/R pass, HKR-H is weak, so it stays all.

editor take

dgMARK watermarks dLLMs by steering unmasking order; I buy the channel, but false positives and attack cost are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Effective Reasoning Chains Reduce Intrinsic Dimensionality

The paper validates on GSM8K with Gemma-3 1B and 4B that effective CoT strategies reduce task intrinsic dimensionality, which shows a strong inverse correlation with both in-distribution and out-of-distribution generalization performance.

#Reasoning#Interpretability#Benchmarking#Gemma

why featured

HKR-H/K/R pass, but evidence is limited to GSM8K with Gemma-3 1B/4B and the intrinsic-dimensionality framing is research-heavy. Useful paper, not a same-day industry item.

editor take

Gemma-3 1B/4B on GSM8K shows CoT lowers intrinsic dimensionality; I buy the metric, not the scope.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→De-attribute to Forget for LLM Unlearning

The paper proposes DareU, an LLM unlearning framework that uses reinforcement learning to reduce attribution scores from generated responses to forget-data owners. Its evaluation uses an LLM classifier as an attribution proxy, reports better balance between forget quality and model utility than baselines, and does not disclose dataset size in the RSS snippet.

#Alignment#Safety#Fine-tuning#Research release

why featured

HKR-H/K/R pass: the paper reframes unlearning via attribution and gives a concrete RL mechanism. Single arXiv release, no disclosed dataset scale or deployment result, so it stays in 60–71.

editor take

DareU lowers attribution via RL; dataset size is undisclosed, and I don’t buy an LLM classifier as the attribution proxy.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks

The paper proposes LGC and LGC-H for decision-based black-box adversarial attacks, using curvature-aware geometric search and a Residual-based Adversarial Generation mechanism to reach SSIM above 0.99 and LPIPS below 0.01 at 5,000 queries.

#Vision#Safety#Benchmarking#Research release

why featured

HKR-H/K/R pass via the 5,000-query imperceptible-attack claim, concrete metrics, and security relevance. It stays in 60–71 because this is a specialized adversarial-attack paper with no disclosed code or wider industry uptake.

editor take

LGC hits SSIM>0.99 and LPIPS<0.01 at 5,000 queries; I care most about reproducible robust-model breakage.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

The paper proposes Relay, a differentiable per-token channel for MDMs trained with truncated BPTT, and scales it to Fast-dLLM v2, where coding-task experiments reduce inference latency by up to 32% versus the reported baselines.

#Inference-opt#Code#Fast-dLLM v2#Research release

why featured

HKR-K lands via Relay, truncated BPTT, and a 32% latency claim; HKR-R is cost-driven. The niche discrete-diffusion angle and jargon-heavy title keep it in the 60–71 band.

editor take

Relay cuts Fast-dLLM v2 coding latency by up to 32%; discrete diffusion needs memory before it can threaten autoregression.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Native Hierarchical and Compositional Representations with Subspace Embeddings

The paper proposes representing concepts as linear subspaces instead of vectors, trains them with differentiable soft projection matrices, and reports state-of-the-art results on hierarchical and natural language inference benchmarks while preserving compatibility with efficient Euclidean vector search.

#Embedding#Reasoning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the representation mechanism is novel and benchmark-testable. But this is an arXiv representation-learning paper with no disclosed scores, dataset details, or product impact, so it stays in the lower band.

editor take

Subspace Embeddings learns concept dimensions via soft projections; SOTA tables aren’t disclosed, but the negation result is the sharper claim.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go

Go-UT-Bench provides 5,264 code and unit-test pairs from 10 permissively licensed Go repositories for fine-tuning LLMs on unit test generation; the fine-tuned models outperform their base versions on more than 75% of benchmark tasks.

#Code#Fine-tuning#Benchmarking#Go-UT-Bench

why featured

HKR-K/R pass via dataset size and fine-tuning results, and the topic matters to AI coding workflows. HKR-H is weak; this is a narrow Go unit-test benchmark, so it stays in the 60–71 band.

editor take

Go-UT-Bench has 5,264 Go test pairs; 10 repos is thin, so don't extrapolate 75% wins to real CI yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

arXiv:2605.30553v1 presents diffusion models through a destroy-then-generate view, classifying them as training methods that withhold input information and predict it, with discussion of data-scarce settings and conditions for porting reinforcement learning techniques into diffusion contexts.

#Reasoning#arXiv#Research release#Commentary

why featured

HKR-H and HKR-K pass: the title has a sharp thesis and the summary gives a destroy-then-generate mechanism. As a single arXiv perspective paper with no disclosed benchmark, experiment number, or production result, it stays in the mid-interest band.

editor take

The paper offers destroy-then-generate, with no empirical numbers; I don’t buy the exploration claim, but data-scarce training is testable.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Can Subgraph Explanations Be Weaponized to Steal Graph Neural Networks?

The paper presents the first strict black-box model extraction attack for graph classification, where the attacker observes only discrete class labels and binary explanation masks, then uses Monte Carlo edge-sensitivity estimation and explanation subgraphs to narrow the decision-boundary search space.

#Interpretability#Safety#Benchmarking#LabRAI

why featured

HKR-H/K/R pass: the weaponized-explanation angle is clicky, and the attack conditions are concrete. Kept in all because it is one arXiv paper, no success rates, datasets, or code details in the snippet, and GNN graph classification is niche.

editor take

XSTEAL extracts GNNs using only class labels and binary explanation masks; don't ship explainability APIs blindly.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Scaling Multi-Agent Environment Co-Design with Diffusion Models

DiCoDe uses Projected Universal Guidance and critic distillation for multi-agent environment co-design; on the warehouse benchmark, it reports 39% higher rewards with 66% fewer simulation samples than the prior state of the art.

#Agent#Robotics#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper gives mechanisms and two concrete metrics, and it hits multi-agent training cost. Single arXiv paper with a narrow warehouse-simulation scope keeps it in all.

editor take

DiCoDe reports 39% higher warehouse reward with 66% fewer samples; I want the PUG constraints tested on real robots.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Rays as Pixels: Learning a Joint Distribution of Videos and Camera Trajectories

Rays as Pixels uses one Video Diffusion Model to learn a joint distribution over videos and camera trajectories, representing cameras as dense ray pixels in the same latent space as frames. The single trained model handles 3 tasks: pose prediction from video, trajectory-conditioned video generation from images, and joint synthesis of video and trajectory from images.

#Vision#Multimodal#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the paper has a concrete “rays as pixels” modeling hook and states one video diffusion model handles 3 trajectory/video tasks. No metrics, open-source artifact, or product path are disclosed, so it stays mid-band.

editor take

Rays as Pixels folds 3 camera-video tasks into one VDM; I buy raxels if closed-loop consistency beats pose-only score chasing.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→VeriGate: Verifier-Gated Step-Level Supervision for GRPO

VeriGate trains 1.5B and 7B Qwen2.5-Instruct models on MATH and improves average accuracy by about 20% and 12% across six reasoning benchmarks, using verifier-gated step-level rewards only when GRPO verifier rewards are degenerate.

#Reasoning#Alignment#Benchmarking#Aakriti Agrawal

why featured

HKR-K/R pass: it has a concrete training mechanism and six-benchmark gain claims, with relevance to open reasoning post-training. HKR-H is weak, and this is a single arXiv paper without code or production evidence, so it stays in 60–71.

editor take

VeriGate lifts Qwen2.5-Instruct by 20%/12% across 6 reasoning benchmarks; GRPO’s zero-gradient failure gets a cleaner patch than blunt PRM reward hacking.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Quantifying Error Propagation and Model Collapse in Diffusion Models

The paper analyzes distribution drift in recursively trained score-based diffusion models, assuming each round mixes synthetic data with fresh target-distribution samples, and derives upper and lower bounds on accumulated divergence between generated and target distributions, with regimes determined by score estimation error and the fresh-data proportion.

#Fine-tuning#Benchmarking#arXiv#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv theory paper with bounds, not code, scale, or product impact. The technical-accessibility drag keeps it in all, below featured.

editor take

2602.16601 bounds drift in recursive diffusion training; I buy the fresh-data-ratio knob more than another scary collapse plot.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Autoregressive Visual Generation Needs a Prologue

Prologue prepends learned tokens to autoregressive image sequences and trains them only with AR cross-entropy, while visual tokens keep reconstruction duties. On ImageNet 256×256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance, and 16 prologue tokens reach 35.88% Top-1 in linear probing versus 23.71% for the first 16 standard tokenizer tokens.

#Vision#Benchmarking#ImageNet#Research release

why featured

HKR-K is strong and HKR-H has a clean hook, but HKR-R is narrow. Without a major lab, open-source artifact, or production-pipeline claim, this stays in the interesting research band.

editor take

Prologue-Base cuts ImageNet gFID to 10.75; I buy the split—stop forcing one token stream to serve reconstruction and generation.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→The Fundamental Limits of Fraud Detection in Card Payment Networks

The paper formalizes card authorization as a sequential decision problem and derives a minimax regret lower bound where delayed, censored, corrupted, and counterfactually missing feedback reduce the achievable learning rate through a multiplicative denominator.

#Reasoning#Benchmarking#Research release

why featured

HKR-K/R pass: the paper adds a concrete sequential-decision framing and multiplicative feedback-limit claim. Niche payments-risk scope and no product/model impact keep it in the 60–71 band.

editor take

The paper gives a minimax regret bound: delay, censoring, corruption, missing counterfactuals multiply the learning drag; bigger models won’t fix issuer feedback.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Survival Reinforcement Learning: Toward Scalable Self-Supervised RL

The paper introduces Survival Reinforcement Learning, an online classification alternative that maximizes an agent’s dwell time at target goals and outperforms CRL by 2x to 8x on stable long-horizon locomotion tasks.

#Agent#Robotics#Reasoning#Research release

why featured

HKR-H and HKR-K pass: the paper has a clear reframing and a 2–8x experimental claim. HKR-R is weak, and this is a niche arXiv RL methods paper, so it stays in the 60–71 band.

editor take

SRL beats CRL by 2–8x on long-horizon locomotion; I’m not buying it until benchmarks and code land.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Advancing Creative Physical Intelligence in Large Multimodal Models

The paper introduces MM-CreativityBench to test affordance-grounded creative tool use in LMMs, using scenario images plus candidate entity and part views, and reports that Direct Preference Optimization improves correct entity and part selection while reducing visual hallucination errors.

#Multimodal#Vision#Alignment#Research release

why featured

HKR-K and HKR-R pass: the paper offers a new benchmark, concrete eval mechanisms, and DPO for hallucination reduction. As a single arXiv research item without visible open-source uptake or cross-source traction, it sits in 60–71.

editor take

MM-CreativityBench tests creative tool use in LMMs; scale is undisclosed. DPO reduces hallucination, but benchmark gains aren't physical intelligence.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→PRISM: Preference-Aware Influence Function Based Data Selection for Fine-Tuning

PRISM weights target examples with model preferences and selects training samples by their influence on that preference-aware direction for efficient fine-tuning; the abstract says experiments cover diverse architectures and parameter scales, but the post does not disclose the specific models, datasets, metrics, or scores.

#Fine-tuning#Alignment#Safety#Research release

why featured

HKR-K and HKR-R pass: the mechanism is clear and the problem maps to fine-tuning cost. HKR-H fails, and the post lacks model names, datasets, and scores, so it stays in the lower research band.

editor take

PRISM uses preference-weighted influence functions for fine-tuning data selection; only the abstract is disclosed, with no models, datasets, or scores.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→idSCD: Identifying Training Datasets through Semantic Correlation Descriptors

idSCD uses semantic correlation descriptors for white-box dataset-level membership inference, comparing against RMIA, Attack-P, LiRA, and SIF across three task settings; the paper reports perfect separation in a controlled leave-one-dataset-out diagnostic and a largest relative ROC-AUC gain above 60% when dataset groups show distinct semantic particularities.

#Safety#Interpretability#Benchmarking#Andrada Gobeaja

why featured

HKR-K and HKR-R are clear, but this is a single arXiv paper without visible industry uptake. The method is niche, so it fits the 60–71 research-release band.

editor take

idSCD beats 4 baselines across 3 tasks; white-box membership inference gets sharper, but weak semantic separation limits the trick.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

Munawar Hasan defines bounded behavioral indistinguishability as an (ε,q,t,A) condition over a prompt distribution, then tests Qwen and Llama teacher-student pairs on 5,000 behavioral probes; LoRA raises semantic similarity to 0.862 for Qwen and 0.874 for Llama, but learned discriminators still retain nonzero distinguishing advantage.

#Fine-tuning#Benchmarking#Alignment#Munawar Hasan

why featured

HKR-K/R pass: the paper offers a new metric, a 5,000-prompt test, and a LoRA finding tied to distillation mimicry. HKR-H is weak, and a single technical arXiv paper stays below featured.

editor take

LoRA lifts Llama similarity to 0.874, yet discriminators still separate it; semantic-score-only distillation eval is too lax.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Conformal Reliability: A New Evaluation Metric for Conditional Generation

The paper proposes reliability score, a conformal-prediction metric for conditional generation that measures worst-case performance within a prediction set at a preset confidence level. It also introduces CReL to construct covered prediction sets and optimize the score, with experiments on synthetic data, image-to-text, and text-to-image tasks.

#Benchmarking#Multimodal#arXiv#Research release

why featured

HKR-K and HKR-R pass: the paper offers a conformal-prediction reliability metric plus code for generation evaluation. HKR-H is weak, and the method-paper angle keeps it below featured.

editor take

CReL scores worst-case generation at preset confidence; I like the move, single-output metrics deserve pressure from risk-set audits.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

BOKBO adds a conformal abstention layer to K-sample VLA inference and gives finite-sample distribution-free guarantees on executed-violation rate. On libero_object_temp_x0.1 with OpenVLA-OFT at ε=0.05, its learned violation predictor reaches 78% coverage and 70% net task success, while Mondrian-BOKBO raises the minimum per-task conditional hold fraction from 0.71 to 0.93.

#Robotics#Vision#Safety#BOKBO

why featured

HKR-H and HKR-K pass: the abstention-over-bad-options angle is fresh, and the paper gives ε=0.05 plus 0.71→0.93 retention. HKR-R is narrow because VLA robot safety is specialist-facing, so it stays below featured.

editor take

BOKBO lifts per-task hold fraction from 0.71 to 0.93 at ε=0.05; stop trusting internal confidence for VLA safety.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→CSULoRA: Closest Safe Update Low-Rank Adaptation

CSULoRA estimates a safety-aligned subspace from weight displacement between aligned and base checkpoints. It corrects trained LoRA adapters with a closed-form penalized minimum-change update. Adversarial fine-tuning tests report lower attack success rate while preserving most LoRA utility gains, but the snippet does not disclose exact numbers.

#Fine-tuning#Alignment#Safety#CSULoRA

why featured

HKR-K/R pass: the mechanism is concrete and relevant to LoRA safety after fine-tuning. HKR-H is weak, and the post withholds attack-success-rate numbers, so this stays an interesting research release, not featured.

editor take

CSULoRA post-corrects trained LoRA via weight displacement, but ASR numbers are missing; neat closed-form fix, pending subspace validation.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Eigenvectors of Experts Are Training-Free Non-Collapsing Routers

The paper proposes SSMoE, a training-free routing framework that uses SVD-derived spectral features from expert weight matrices, and evaluates expert collapse across language tasks, vision tasks, clean data, and corrupted data; the abstract reports public code but does not disclose model names, dataset counts, or numeric gains.

#Inference-opt#Interpretability#SSMoE#Research release

why featured

HKR-H/K pass: SSMoE offers an SVD-based training-free router and collapse tests. This remains a technical arXiv paper; code, scale numbers, and production impact are not disclosed, so it stays in 60–71.

editor take

SSMoE routes via expert-weight SVD with zero training; the abstract omits models and gains, so treat “non-collapsing” as unverified.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→MAAT: Multi-phase Adapter-Aware Targeted Unlearning

The paper introduces 5WBENCH and MAAT. 5WBENCH has 5,000 samples, with 1,000 per 5W category, while MAAT applies a three-phase LoRA-adapter procedure to target Why-type causal unlearning failures.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the item gives 5WBENCH size and a 3-phase LoRA mechanism, tied to unlearning/compliance. As an arXiv method paper without disclosed metrics or strong source authority, it stays in the normal research-signal band.

editor take

5WBENCH gives Why 1,000 cases; I buy the angle—0.06% causal coverage let unlearning scores hide failures.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack

The paper introduces OU@epsilon and the Prototypical Relearning Attack for class-level machine unlearning, then proposes Spotter, a plug-and-play objective tested on CIFAR, TinyImageNet, and CASIA-WebFace to reduce over-unlearning and block prototype-based relearning.

#Safety#Alignment#Benchmarking#arXiv

why featured

HKR-H and HKR-K pass: the paper introduces a metric, an attack, and a mitigation tested on CIFAR, TinyImageNet, and CASIA-WebFace. No major lab or product impact is disclosed, so it stays in the 60–71 research-signal band.

editor take

Spotter reports 3 datasets; if few samples restore a forgotten class, forget accuracy is a weak deletion receipt.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Rationalize: Shared Semantic Reasoning for Human-AI Alignment

Rationalize proposes four human-AI role pairs for data-driven sensemaking, making purposes, questions, assumptions, evidence, inferences, and implications explicit to support bidirectional alignment between humans and AI systems.

#Reasoning#Alignment#Rationalize#Research release

why featured

HKR-K comes from the 4 role-pair mechanism; HKR-R comes from the safety boundary in human-AI collaboration. HKR-H is weak, and no results, artifact, or production claim are disclosed.

editor take

Rationalize defines 4 human-AI role pairs; no experiments disclosed, so read it as interaction design, not model progress.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→LLMs Without Deep Neural Networks: New Architecture, Benefits and Case Study

Vincent Granville proposes an RBF-style alternative architecture for LLMs in a 9-page arXiv paper, claiming it finds the global optimum of the loss function in closed form in one iteration; the post does not disclose reproducible experimental details beyond a high-level case study and comparison.

#Reasoning#Interpretability#Vincent Granville#arXiv

why featured

HKR-H and HKR-K pass: the title attacks the DNN premise and offers an RBF/closed-form claim. As a lone arXiv paper with no disclosed benchmarks or replication details, it stays in the 60–71 band.

editor take

Vincent Granville claims closed-form one-pass LLM training in 9 pages; no code or benchmarks, so I’m filing this as RBF repackaging.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→HetCCL: Enabling Collective Communication for Mixed-Vendor Heterogeneous Clusters

HetCCL uses heterogeneous P2P transport and a border-communicator mechanism for collective communication in mixed-vendor clusters; across 4 heterogeneous settings, it delivers 17-19x higher bandwidth than Gloo and reduces end-to-end LLM training per-step time by up to 16.9%.

#Inference-opt#HetCCL#Gloo#OpenMPI

why featured

HKR-K and HKR-R pass: the paper has concrete mechanisms and numbers, and mixed-vendor training clusters touch cost. The low-level collective-communication focus keeps it below featured.

editor take

HetCCL shows 17-19x Gloo bandwidth across 4 mixed-vendor setups; 16.9% step-time gain is modest, but the baseline matters.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

TRINE runs single-bitstream multimodal inference on Alveo U50 and ZCU104, reducing latency by up to 22.57x versus RTX 4090 at 20–21 W, while int8 quantization keeps accuracy drops below 2.5% across representative tasks.

#Multimodal#Inference-opt#Vision#TRINE

why featured

HKR-H and HKR-K pass via the 22.57x latency and <2.5% accuracy-loss claims. FPGA inference hardware is narrow for this audience, so it stays in the lower interesting band.

editor take

TRINE claims 22.57x lower latency than RTX 4090 at 20–21W; I want batch sizes, because FPGA papers love dunking on underfed GPUs.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→DisasterLex: An Expert Concept-to-Schema Knowledge Graph for Geospatial Reasoning in Disaster Analytics

DisasterLex links user queries to disaster databases through an expert knowledge graph with 107 concepts, 117 causal edges, and 52 concept-to-schema links, and on a 75-query test set over 36 geospatial tables it outperforms four baselines by 1.4x to 2.75x across seven base models.

#RAG#Reasoning#Tools#DisasterLex

why featured

HKR-K passes because the paper gives concrete dataset, graph, and baseline-gain numbers. HKR-H/R are weak: the domain is vertical disaster geospatial analytics, so this belongs in all, not featured.

editor take

DisasterLex wins 1.4–2.75x with 107 concepts and 117 causal edges; 3.56/5 says expert graphs remain a hard patch for geo-SQL.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→A Kinetic Energy Perspective of Flow Matching

The paper introduces Kinetic Path Energy, a per-sample diagnostic that accumulates kinetic effort along an ODE trajectory; experiments report two correspondences with semantic fidelity and sparse representation regions, and Kinetic Trajectory Shaping uses a two-phase training-free inference strategy to reduce memorization.

#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes with KPE, KTS, and a testable training-free memorization claim. HKR-H/R are weak, and the flow-matching energy framing is specialist, so this stays in all.

editor take

KPE scores per-sample ODE trajectory energy; I buy the diagnostic, but KTS needs disclosed benchmark numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→dashi: A Python Library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment

dashi provides an open-source Python library for dataset shift analysis, using unsupervised information-geometry metrics and supervised performance-degradation checks across user-defined temporal or source batches, with demonstrations on 3 health AI case studies: gestational diabetes, COVID-19, and emergency medical dispatch.

#Tools#Safety#Benchmarking#dashi

why featured

HKR-K/R pass: dashi has a concrete tool shape and 3 health-AI examples for trustworthy deployment. HKR-H is weak, and this is not a model or major platform update, so it sits in 60-71.

editor take

dashi packages dataset-shift checks into Python and shows 3 health cases; I buy the tooling, not the “trustworthy AI” wrapper.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

The paper introduces Flow Equivariant World Modeling, which makes latent memory transform equivariantly with self-motion and inferred object motion. It evaluates the method on 2D and 3D partially observed video world-modeling benchmarks against diffusion, memory-augmented, and recurrent architectures, but the snippet does not disclose exact metric values.

#Memory#Vision#Benchmarking#Research release

why featured

HKR-K passes for a concrete memory mechanism and 2D/3D partially observed video benchmarks, but metrics are not disclosed. HKR-H/R are weak, so this stays in the normal research-release band at 64.

editor take

Flow Equivariant World Modeling compares 3 architecture classes, with no metrics disclosed; I buy the bet—memory must move with motion.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning

Polaris separates semantics and hierarchy with angular geometry and radius, then evaluates taxonomy expansion across trees, multi-parent DAGs, and multimodal hierarchies; against fourteen baselines, it improves top-K retrieval by up to about 19 points and reduces mean rank by up to about 60%.

#Embedding#Multimodal#RAG#Polaris

why featured

HKR-K passes because the paper gives a testable embedding mechanism and benchmark gains. HKR-H/R are weak: the hook is a niche method name, and the practical nerve is limited to retrieval, taxonomy, and multimodal hierarchy work.

editor take

Polaris gains up to 19 top-K points over 14 baselines; angle/radius separation looks worth testing for RAG taxonomies.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving

ProofWala provides a unified ITP interface for Lean 4 and Rocq, open-sources two repositories, and supports repository-scale extraction, parallel proof search, and multilingual training across theorem-proving datasets.

#Reasoning#Code#Tools#ProofWala

why featured

HKR-K passes: the post gives a unified ITP interface, 2 open-source repos, and parallel proof search. The theorem-proving toolchain is niche and technical, but not a hard-exclusion case, so it stays in the 60-71 band.

editor take

ProofWala bridges Lean 4 and Rocq; no lift numbers are disclosed, so treat it as proof-data plumbing, not reasoning progress.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

CHARM adds channel-level text descriptions to a channel-order-equivariant Transformer and trains semantic time-series embeddings with JEPA, evaluating the learned representations with only a linear probe across anomaly detection, classification, and short- and long-term forecasting.

#Multimodal#Embedding#Interpretability#CHARM

why featured

HKR-K passes because CHARM has a concrete mechanism and evaluation setup. HKR-H/R are weak: this is niche time-series representation learning, useful signal but below the featured bar.

editor take

CHARM trains JEPA time-series embeddings and tests four tasks with linear probes; I buy text as channel IDs, not sensor semantics.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs

ZO-Finetuner learns per-LLM perturbation strategies for zeroth-order fine-tuning, and experiments on 4 LLMs and 7 datasets show it beats prior zeroth-order baselines in 82.1% of task-model combinations.

#Fine-tuning#Inference-opt#ASTRAL-Group#Research release

why featured

HKR-K passes with concrete scale and an 82.1% win rate. HKR-H is weak and HKR-R is narrow; zeroth-order optimization remains specialist, so this stays mid-band all.

editor take

ZO-Finetuner wins 82.1% across 4 LLMs and 7 datasets; model-version drift is the obvious tax on its train-once story.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

DISCO introduces the SAM causal framework plus DISCO_m and sDISCO estimators, evaluates them against observed bias mitigation methods on six datasets, and releases source code on GitHub.

#Alignment#Benchmarking#DISCO#Research release

why featured

HKR-K/R pass: the paper offers a concrete mechanism, 6-dataset evaluation, and code, with fairness relevance. HKR-H is weak; this is an academic methods paper without a product or industry-event hook.

editor take

DISCO matches or beats bias baselines on 6 datasets; I want repo-level reproduction and multi-bias compute cost first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Softsign: Smooth Sign in Your Optimizer for Better Parameter Heterogeneity Handling

The paper proposes SoftSignum and SoftMuon, replacing hard sign updates with a temperature-controlled soft-sign transform and an adaptive quantile temperature schedule. Experiments across deep learning tasks, including LLM pretraining, report consistent gains over hard sign-based optimizers and AdamW, while the paper proves stochastic non-convex convergence through a geometry-relaxation framework.

#Inference-opt#Benchmarking#Research release#Benchmark

why featured

HKR-K has concrete mechanisms and HKR-R matters to LLM pretraining practitioners. The post does not disclose gains, scale, or reproducibility details, and the optimizer-paper angle stays niche, so it lands in all.

editor take

SoftSignum swaps hard sign for temperature soft-sign; LLM scale is undisclosed, so don’t bury AdamW yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Plain Transformers are Surprisingly Powerful Link Predictors

PENCIL uses an encoder-only plain Transformer with attention over sampled local subgraphs, and the paper reports stronger results than heuristic-informed GNNs across multiple benchmarks while releasing code publicly.

#Reasoning#Benchmarking#PENCIL#arXiv

why featured

HKR-H/K pass: the angle is a plain Transformer challenging GNN link predictors, and the post gives PENCIL’s mechanism plus code. No major lab or product impact; graph link prediction is niche, so this stays in all.

editor take

PENCIL uses a plain encoder Transformer on local subgraphs, but no scores are disclosed here; I’d reproduce before buying the GNN-beating claim.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

GEM reformulates LLM pre-training data curation as a variational problem on the hypersphere with a mixing-balance regularizer, and experiments on 1.1B-parameter models show up to 1.2% higher average downstream accuracy when integrated with DoReMi and RegMix.

#Fine-tuning#Benchmarking#GEM#DoReMi

why featured

HKR-K passes with a testable mechanism and +1.2% result; HKR-R passes on pretraining cost. HKR-H is weak, and the gain is small and specialized, so this stays at 63.

editor take

GEM reports up to +1.2% on 1.1B models; I’d want replication before buying geometry as the cure for data-mix noise.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

Researchers trained a one-layer, one-head encoder-decoder Transformer on the zeta map for Dyck paths and analyzed it with decoder cross-attention, linear probing, and causal intervention. The study extracts a level-based mechanism and converts it into a peak-centered scaffolding algorithm, then proves agreement with the zeta map up to a labeling reversal convention.

#Interpretability#Reasoning#Research release

why featured

HKR-H and HKR-K pass: the paper turns a tiny Transformer’s internals into a provable algorithm. The Dyck-path/zeta-map setup is niche and has no direct product or safety impact, so it stays in all.

editor take

A 1-layer 1-head Transformer learns Dyck zeta maps; I buy this—interpretability produced a provable algorithm, not vibes.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

The paper models benchmark aggregation as a multitask principal-agent game and audits OLMES items across 3 item-level primitives: welfare alignment, marginal improvability, and performance variance. It uses WORKBank, EvoLM 4B, and PolyPythias 410M, identifies Pareto-inferior OLMES items under a pro-worker welfare operationalization, and releases code on GitHub.

#Benchmarking#Alignment#OLMES#WORKBank

why featured

HKR-K comes from a testable benchmark aggregation mechanism and code; HKR-R comes from evaluation trust. The academic framing and narrow impact keep it in the mid all band.

editor take

The paper audits OLMES with 3 item-level primitives; uniform averaging deserves scrutiny, but the pro-worker welfare choice carries the punchline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Breaking Information Cocoons: A Hyperbolic Framework for Balancing Exploration and Exploitation in Recommender Systems

HERec aligns textual semantics with collaborative signals in hyperbolic space and optimizes Dasgupta's cost for automatic hierarchy clustering, reporting up to 5.49% utility improvement and 11.39% diversity increase over Euclidean and hyperbolic recommender baselines.

#Embedding#Benchmarking#HERec#Research release

why featured

HKR-H/K pass: the hook is hyperbolic geometry against information cocoons, and the post gives HERec plus +5.49% utility/+11.39% diversity. HKR-R fails because the impact is narrow recommender research, so it stays all.

editor take

HERec reports up to 5.49% utility and 11.39% diversity gains; honestly, deployment hinges on controllable exploration, not hyperbolic elegance.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→What Does Preference Learning Recover from Pairwise Comparison Data?

The paper formalizes CPRD from triplet comparison data, gives conditions under which the Bradley-Terry model fits the distribution, and identifies margin and connectivity as two factors controlling sample efficiency.

#Alignment#Fine-tuning#Benchmarking#Research release

why featured

HKR-K passes because the paper adds BT-model conditions and a sample-efficiency mechanism. HKR-H/R are weak: the title is academic, and the feed gives no experiments, numbers, or deployment stakes.

editor take

CPRD formalizes triplet preferences; when BT assumptions fail, your learned reward scores may lack stable meaning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→ForecastCompass: Guiding Agentic Forecasting with Adaptive Factor Memory

ForecastCompass adds factor memory and reasoning memory for agentic forecasting, and experiments on Prophet Arena and FutureX with GPT-5-mini and Gemini-2.5-Flash report improved probabilistic accuracy and calibration.

#Agent#Memory#Reasoning#ForecastCompass

why featured

HKR-K passes: the paper offers a memory mechanism and two benchmark settings for agent forecasting. HKR-H and HKR-R are weak, and the post does not disclose gain size or artifacts, so it stays in the normal research band.

editor take

ForecastCompass reports gains on 2 benchmarks and 2 models, but no deltas; I’d scrutinize time leakage before buying it.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Diving into Kronecker Adapters: Component Design Matters

The paper proposes CDKA, which tunes the dimensions and number of Kronecker components and adds parameter-budget-aware configuration guidelines; the abstract says experiments cover multiple architectures and modalities, but the post does not disclose specific metrics.

#Fine-tuning#Multimodal#Research release#Open source

why featured

HKR-K passes because CDKA offers a concrete adapter-configuration mechanism and budget guide. HKR-H/R are weak, and no experiment metrics are disclosed, so this stays in all.

editor take

CDKA tunes Kronecker component dimensions and counts; no metrics disclosed, so I’d treat it as LoRA-family tuning work.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Assessing Predictive Models for Fairness Based on Movement Patterns

The paper proposes assessing spatial fairness in predictive models using individual movement patterns, with multi-resolution spatial partitions and a spatial scan statistic, and evaluates the method on thousands of synthetic unfair datasets.

#Alignment#Benchmarking#Research release

why featured

HKR-K passes via a concrete spatial-fairness mechanism and synthetic test scale. HKR-H/R are weak because the title is dry and the movement-pattern setting is narrow; no hard-exclusion rule applies.

editor take

The paper tests movement-pattern fairness across thousands of synthetic datasets; without real mobility data, the claim stays methodological.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Learning to Reason with Insight for Informal Theorem Proving

The paper proposes DeepInsight, a three-part training framework for informal theorem proving that teaches LLMs to identify core proof techniques; the abstract says it outperforms baselines on mathematical benchmarks, but the post does not disclose exact scores.

#Reasoning#Fine-tuning#Benchmarking#DeepInsight

why featured

HKR-K passes because the article gives a three-part DeepInsight training mechanism. HKR-H and HKR-R are weak: no concrete benchmark numbers, product angle, safety issue, or competitive trigger.

editor take

DeepInsight trains proof-technique recognition with 3 components; scores are undisclosed, and “insight” needs reproducible rewards or it’s branding.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

The paper introduces LARK for reasoning distillation trajectory selection, using a learnability factor ρ to estimate the student model’s loss reduction rate and a χ²-regularized selection policy to balance learnability with distributional coverage.

#Reasoning#Fine-tuning#Tianrun Yu#Research release

why featured

HKR-K lands: LARK’s trajectory-selection mechanism is concrete. HKR-H is weak and HKR-R lacks benchmark gains or cost numbers, so this is useful research signal but not featured.

editor take

LARK scores trajectories by student loss-drop rate ρ; gains aren’t disclosed, so I buy the learnability angle pending replication.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection

AxonAD predicts future multi-head attention query vectors from past context and combines reconstruction error with query mismatch, improving ranking quality and temporal localization on TSB-AD’s 17 datasets and 180 series.

#Benchmarking#AxonAD#TSB-AD#Research release

why featured

HKR-K lands with a concrete AxonAD mechanism and TSB-AD coverage across 17 datasets and 180 sequences. HKR-H is only a research hook, and HKR-R misses broader practitioner nerves.

editor take

AxonAD improves ranking and localization on TSB-AD’s 17 datasets, 180 series; query drift is a cleaner anomaly signal than residuals.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Toward Identifiable Sparse Autoencoders

The paper introduces two iSAE variants for unstable TopK SAE training; the abstract reports lower reconstruction error and improved stability, but the RSS snippet does not disclose experiment scale or benchmark details.

#Interpretability#Research release

why featured

HKR-K passes: iSAE targets TopK SAE instability with a new mechanism and performance claim. HKR-H and HKR-R are weak, and experiment scale is not disclosed, so this stays as niche research signal.

editor take

iSAE claims lower TopK SAE error and stabler dictionaries; RSS gives no scale, so don’t equate identifiability with usability.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Assign and Add: A Mechanistic Study of Compositional Arithmetic

The paper trains small transformers on a controlled variable-assignment and modular-addition task, finds generalization to unseen variable-number combinations, and reports three learning phases: modular addition, variable-assignment structure, and refinement on hard unseen sequences.

#Reasoning#Interpretability#Research release

why featured

HKR-K and HKR-R pass: the paper gives a controlled setup, generalization condition, and 3 learning stages. It remains a narrow mechanistic-interpretability paper, with no production claim or frontier-model result, so it stays in 60–71.

editor take

Small transformers reuse one modular-addition MLP for direct and variable inputs; controlled tasks beat mystical LLM attribution here.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Research discovers randomized self-reductions to improve query efficiency

Bitween discovers randomized self-reductions for 64 of 80 functions on RSR-Bench, with Agentic Bitween using LLM agents to propose new query functions and raising the hit rate from the linear-regression backend’s 54% to 80%.

#Agent#Reasoning#Benchmarking#Bitween

why featured

HKR-K is solid with 80 functions, 64 findings, and a 54%→80% hit-rate gain; HKR-H and HKR-R stay weak because the paper is theory-heavy and narrow.

editor take

Agentic Bitween hits 64/80 functions; here the LLM is a search heuristic, not a proof machine.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection

ReTabAD releases 20 tabular datasets with structured textual metadata, plus implementations of classical, deep learning, and LLM-based anomaly detection methods and a zero-shot LLM baseline that uses semantic context without task-specific training.

#Reasoning#Benchmarking#ReTabAD#arXiv

why featured

HKR-K passes: ReTabAD provides 20 datasets with structured text metadata and zero-shot LLM baselines. HKR-H/R are weak, so it sits in the 60–71 band as a niche benchmark resource.

editor take

ReTabAD ships 20 metadata-rich tabular sets; I buy the direction, but the abstract hides LLM baseline gains.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

The paper proposes online test-time training on a masked autoencoder head to select the domain LoRA matching the current video-stream input, and evaluates the method on domain-incremental action recognition and semantic segmentation tasks.

#Vision#Fine-tuning#Research release

why featured

HKR-K passes on a concrete mechanism: MAE-based test-time training selects a domain LoRA for video streams. HKR-H/R miss due to no result number, product path, or practitioner pain hook, so it stays in all.

editor take

The paper uses MAE test-time training to pick domain LoRAs; no gains disclosed, but treating forgetting as routing is neat.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Memory by Design: Probabilistic Sequence Layers

The paper introduces a design-model framework that writes memory through exact Bayesian filtering; its Bayesian Layer propagates both mean and covariance, and the authors show linear attention, GLA, and Mamba-2/SSD as exact filters under one design model.

#Memory#Reasoning#Benchmarking#arXiv

why featured

HKR-H/K pass: the Bayesian filtering view across Mamba-2/SSD, GLA, and linear attention is a concrete mechanism. The paper is theory-heavy and gives no experiment numbers or deployment condition here, so technical accessibility keeps it in all.

editor take

Bayesian Layer keeps covariance and distills into 340M Gated DeltaNet for RULER gains; I buy the frame, but scores are missing.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Research proposes TimeRCD foundation model for zero-shot time series anomaly detection

TimeRCD uses Relative Context Discrepancy pre-training to detect time-series anomalies by comparing a query pattern with its surrounding context, and the arXiv abstract says it outperforms existing general-purpose and anomaly-specific foundation models in most zero-shot TSAD benchmark settings while staying competitive with dataset-specific full-shot baselines.

#Reasoning#Benchmarking#TimeRCD#Research release

why featured

HKR-K passes: the paper gives TimeRCD, RCD pretraining, and claimed wins across zero-shot TSAD benchmarks. HKR-H and HKR-R are weak because this is a narrow research item with no product, safety, or major-lab hook.

editor take

TimeRCD uses RCD for zero-shot TSAD; benchmark counts are undisclosed, so discount the strong claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→SWIM: Single-Instance Whole-Body Imitation for Swimming

SWIM learns whole-body swimming control from a single swimming motion and generalizes to unseen environments, body conditions, and swimming styles; the abstract does not disclose dataset size, metric values, or code availability.

#Robotics#Agent#Benchmarking#Research release

why featured

HKR-H and HKR-K pass on the single-instance swimming-control claim, but HKR-R fails: no product tie, code, metrics, or mainstream model angle is disclosed.

editor take

SWIM trains on one swim motion; no metrics or code disclosed, so I don’t buy the style-generalization claim yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→NeUQI: Near-Optimal Uniform Quantization Parameter Initialization for Low-Bit LLMs

NeUQI reduces uniform quantization initialization from joint scale and zero-point optimization to scale-only optimization, then reports stronger results than existing low-bit uniform quantization methods across LLaMA and Qwen settings and tasks. The arXiv snippet does not disclose exact bit widths, datasets, latency numbers, or performance deltas.

#Inference-opt#LLaMA#Qwen#Research release

why featured

HKR-K/R pass because the paper offers a concrete quantization mechanism tied to inference cost. HKR-H fails, and the post lacks bit widths, datasets, and lift numbers, so it stays in the lower 60–71 band.

editor take

NeUQI collapses scale/zero-point init to scale-only; without bit widths or deltas, I’m not buying the PV-tuning win yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

The paper unifies SVD LLM and Basis Sharing under one optimization problem and reports up to 46% lower weight reconstruction error on Pythia models, but downstream perplexity and accuracy degrade versus standard per-layer SVD LLM.

#Inference-opt#Pythia#Research release

why featured

HKR-K passes: the paper adds a unified optimization framing plus a 46% reconstruction-error result that fails on downstream metrics. HKR-H/R are weak; the framing is niche and no production impact is shown.

editor take

Cross-Layer Subspace Coupling cuts Pythia reconstruction error 46%; perplexity still loses to per-layer SVD, so weight-space compression fails again.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Improving Selective Classification with Pairwise Queries for Binary Classification

The paper proposes pairwise queries to the same model for detecting high-error samples in selective binary classification, and reports better accuracy-cost tradeoffs than raw confidence estimates such as LLM next-token logits on 1 synthetic and 4 real in-context learning datasets.

#Reasoning#Benchmarking#Research release

why featured

HKR-K passes: the paper offers a concrete pairwise-query mechanism and dataset scope. HKR-H and HKR-R are weak because the title is academic and the impact is narrow, so it fits the low-60s research band.

editor take

Pairwise queries beat raw logits on 5 binary datasets; when confidence is inconsistent, asking the same model twice saves expert budget.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Improving Relative Representations with Learned Anchors and Whitened Inner Products

The paper proposes learned semantic anchors and whitened inner products for Relative Representations, replacing random anchors and cosine similarity to improve cross-model communication on vision and language tasks, including stable zero-shot communication between heterogeneous small language models.

#Embedding#Multimodal#Research release

why featured

HKR-K passes: the paper names learned anchors and whitened inner products, with a zero-shot heterogeneous SLM communication claim. HKR-H/R are weak, and no numbers or deployment conditions are disclosed.

editor take

Learned anchors plus whitened inner products replace random anchors and cosine; “nearly lossless” has no numbers, so treat this as RR repair work.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Paper proposes Mixture of Concept Bottleneck Experts framework extending CBM

The paper proposes M-CBE, extending CBM task predictors from one preset expression to multiple expert expressions, and evaluates two instances: Linear M-CBE and Symbolic M-CBE.

#Interpretability#Research release

why featured

HKR-K passes: M-CBE extends CBM task predictors into multiple expert expressions, with Linear and Symbolic variants. No metrics, code, or production claim are disclosed, so it stays in the 60-71 band.

editor take

M-CBE turns CBM predictors into multiple expert expressions; no metrics disclosed, so this reads like interpretability tuning, not proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Generalizing Multi-Scale Time-Series Modeling with a Single Operator

SiGMA uses a learnable discrete Gaussian kernel for distance-aware scaling, ranks best in 13 of 16 long-term forecasting settings, and reports up to 5.3x faster training plus up to 3.8x lower memory use than the strongest competitors.

#Benchmarking#SiGMA#Research release#Open source

why featured

HKR-K is solid because the post gives a concrete mechanism and benchmark numbers. HKR-H and HKR-R are weak: this is a niche time-series modeling paper, not a broad model, agent, or product update.

editor take

SiGMA wins 13/16 long-horizon settings; I’d trust the 5.3x speedup only after reproducing their code.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings

ScaleMAP rescales pairwise embedding displacements by original-space local radii, preserving density without adding a competing penalty. It matches DensMAP on density preservation, maintains UMAP-level neighborhood preservation, recovers sparse transcriptomic bridges collapsed by UMAP, and represents flow-cytometry density across 17 orders of magnitude; the same mechanism also improves PaCMAP density preservation.

#Embedding#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: ScaleMAP has a concrete mechanism and a 17-order-magnitude evaluation claim. The topic remains algorithmic research with limited product or industry resonance, so it stays in all.

editor take

ScaleMAP rescales displacements by local radii and spans 17 density orders; I buy this cleaner than bolting penalties onto UMAP.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

The paper evaluates audio SSL probing on 13 datasets and 6 spectrogram-based encoders, introducing binarized prototypical probes that use class-wise prototypes to aggregate localized token information and outperform linear and attentive probing.

#Audio#Embedding#Benchmarking#arXiv

why featured

HKR-K passes with concrete test scope and a named probe mechanism. HKR-H/R are weak: the hook is niche, and the paper lacks product, cost, safety, or competitive impact, so it sits in the low-60 research band.

editor take

This tests 13 datasets and 6 spectrogram encoders; for audio SSL, CLS linear probes are a bad proxy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→PINE: Pruning Boosted Tree Ensembles with Conformal In-Distribution Prediction Equivalence

PINE prunes boosted tree ensembles by using conformal calibration with a single alpha parameter to control an in-distribution region, and experiments on 12 public tabular datasets report up to a 30% higher compression ratio while preserving predictions at a level comparable to existing faithful pruning methods.

#Inference-opt#Benchmarking#PINE#arXiv

why featured

HKR-K passes with a concrete mechanism and 12-dataset result. HKR-H/R are weak: tabular tree-ensemble pruning is useful but narrow, so this stays as regular research signal.

editor take

PINE reports 30% more compression on 12 tabular sets; limiting equivalence to in-distribution regions is the pragmatic trade.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Generalistic or Specific Embeddings, Which Is Better? An Empirical Study on Clinical Coding Search in Non-English Languages

The study fine-tunes a Spanish biomedical two-stage retriever on about 19,500 Gemini-generated pairs, raising aggregate R@5 to 0.822 versus BioBERT-ST’s 0.790 while improving four of five evaluated languages.

#Embedding#RAG#Fine-tuning#Gemini

why featured

HKR-K has concrete metrics, and HKR-R touches domain-adaptation costs for multilingual medical RAG. The topic is academic and narrow, with no product, framework, or broad mechanism, so it stays in the 60-71 all band.

editor take

19.5k Gemini pairs push R@5 to 0.822; I trust this narrow clinical recipe more than generic embedding leaderboards.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Smaller and Faster 3DGS via Post-Training Dictionary Learning

The paper introduces a post-training dictionary-learning compression pipeline for 3DGS and reports average compression ratios of 3.95x, 3.10x, and 4.55x on 3DGS, 3DGS-MCMC, and PixelGS across 13 benchmark scenes, with rendering speedups of 23.3%, 24.3%, and 25.3% while maintaining image quality.

#Vision#Inference-opt#Benchmarking#Research release

why featured

HKR-K passes with a concrete post-training compression method and 13-scene ratios. The 3DGS dictionary-learning angle is niche, so HKR-H/R are weak and it stays in the 60–71 band.

editor take

Post-training dictionary learning gives PixelGS 4.55x compression without retraining; I’d check PSNR off those 13 scenes first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Pairwise Reference Alignment as a Model-Level Ordinal Observable

The paper defines pairwise reference alignment as the probability that a model score ranks y+ above y- under a reference pair distribution P_pair, then gives finite-sample estimators, concentration bounds, a margin extension, and an initial study on Qwen2.5 models and RewardBench.

#Alignment#Benchmarking#Qwen#RewardBench

why featured

HKR-K passes with a concrete alignment observable, estimator, bounds, and Qwen2.5/RewardBench tests. HKR-H/R are weak, so this is useful eval research but too narrow for featured.

editor take

The paper defines one preference-order probability; Qwen2.5 and RewardBench results lack scale, so this reads as metric hygiene.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL trains a diffusion-based action-conditioned world model with a KL-constrained adversarial curriculum and evaluates it in MineRL. Its PAT buffer re-ranks trajectories by prediction error, action fidelity, and learning progress, while the abstract says robustness improves over passive-data training but does not disclose numeric gains.

#Agent#Vision#Fine-tuning#PROWL

why featured

Only HKR-K lands: the PAT buffer and KL-constrained curriculum are testable mechanisms, but MineRL metrics are not disclosed and the title is paper jargon. This fits all, below featured.

editor take

PROWL reports MineRL and the mechanism, not numeric gains; I don't buy broad generalization, but PAT targets the right world-model failure mode.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Reward Learning from Best-of-N Preference Data: Targets, Tradeoffs, and Design Principles

The paper analyzes Bradley–Terry reward learning from Best-of-N preference data, where N candidates are sampled and the best is paired with a rejected response. It derives closed-form targets for independent-reference variants, shows Best-vs-Random and Best-vs-Worst generally fail exact BT representability, and reports that larger N increases pairwise margins while reducing connectivity.

#Alignment#Benchmarking#Research release

why featured

HKR-K passes via a testable Best-of-N tradeoff between margin and connectivity. HKR-H/R are weak, and the reward-modeling scope is too niche for featured.

editor take

Best-of-N widens margins and hurts connectivity; crank N only when labels are costly, not when generation is the bottleneck.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Trust-Region Behavior Blending for On-Policy Distillation

The paper proposes TRB, a warmup method that replaces early rollout policy within a student-centered KL trust region, keeps the reverse-KL OPD loss unchanged, and reports the strongest average performance across two math-reasoning distillation settings.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-K passes with a concrete distillation mechanism and two math-reasoning settings. HKR-H/R are weak, and no code, model name, or major-lab source is disclosed, so this stays in the 60–71 research-signal band.

editor take

TRB only changes early rollouts and wins in 2 math distillation settings; I’d probe whether KL annealing erases the gain.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Federated Learning with Enhanced Privacy via Model Splitting and Random Client Participation

The paper proposes MS-PAFL, a federated learning framework that splits each client model into a local private submodel and an aggregated public submodel, injects calibrated Gaussian noise only into the public part, and analyzes single-round and total privacy loss under random client participation and local data subsampling.

#Fine-tuning#Alignment#Benchmarking#Research release

why featured

HKR-K passes on a concrete mechanism and privacy-loss analysis. HKR-H/R fail: this is a narrow arXiv federated-privacy paper with limited immediate industry pull.

editor take

MS-PAFL adds Gaussian noise only to the public submodel; no datasets, ε, or accuracy numbers in the snippet, so I don’t buy “significant.”

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Conditional Attribution for Root Cause Analysis in Time-Series Anomaly Detection

The paper proposes a conditional attribution framework that retrieves contextually similar normal states via VAE latent spaces and UMAP embeddings, then evaluates root-cause identification, temporal localization, and robustness on the SWaT and MSDS benchmarks across multiple anomaly detection models.

#Interpretability#Benchmarking#Research release#Benchmark

why featured

HKR-K passes with a concrete attribution mechanism and SWaT/MSDS evaluation. HKR-H/R are weak, and this is a single arXiv method paper without production replacement or strong SOTA numbers, so it sits in 60–71.

editor take

The paper tests conditional attribution on SWaT and MSDS; gains aren’t disclosed, so don’t crown VAE+UMAP retrieval as RCA’s fix.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→BAT: Better Audio Transformer Guided by Convex Gated Probing

The paper introduces Convex Gated Probing and BAT for audio SSL evaluation, using gated access to frozen layers; the abstract claims new SOTA on audio benchmarks, but the post does not disclose benchmark scores.

#Audio#Benchmarking#Research release#Benchmark

why featured

HKR-K passes: the paper adds Convex Gated Probing and a frozen-layer gating mechanism, but the summary gives no scores or production impact. No hard exclusion; this fits a standard research-release score.

editor take

BAT claims SOTA via CGP, but scores are undisclosed; I’d treat this as a probing paper before buying the leaderboard claim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment

FlexRank uses low-rank weight decomposition and importance-ordered nested consolidation to extract submodels from pretrained LLMs and ViTs under different compute budgets; the arXiv abstract does not disclose benchmark scores, latency numbers, or implementation details.

#Inference-opt#Research release

why featured

HKR-K passes because the paper offers a testable adaptive-deployment mechanism. HKR-H/R are weak, and no performance numbers are disclosed, so it stays below the interesting band.

editor take

FlexRank extracts budgeted submodels, but reports no scores or latency; I don't buy “train once, deploy everywhere” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→What Changes After Deployment? A Survey on On-device Learning in TinyML

The survey organizes about 70 TinyML on-device learning works by distribution-change regime, then analyzes how change types affect deployable applications, hardware choices, and solution structure.

#Fine-tuning#Benchmarking#Research release

why featured

HKR-K passes with about 70 surveyed works and a distribution-shift taxonomy. HKR-H and HKR-R are weak: the niche TinyML survey lacks a click hook and broad industry tension, so it stays in all.

editor take

This survey maps ~70 TinyML ODL papers; centering distribution shift beats another benchmark leaderboard for deployment reality.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Subspace-Decomposed JEPAs: Disentangling Progression and Content in Latent World Models

SD-JEPA splits JEPA latents into two orthogonal subspaces, including an 8-dimensional progression subspace. That subspace is 4.2% of the latent, explains 72–95% of task-progress variance across four environments, and improves semantic event localization on 40 held-out cube episodes by up to +0.18 pooled AUROC.

#Agent#Reasoning#Benchmarking#arXiv

why featured

HKR-K passes on a concrete mechanism and numbers. HKR-H/R miss: JEPA latent-space decomposition is narrow, with no product or open-source hook, so it sits in low all rather than featured.

editor take

SD-JEPA’s 8-D subspace explains 72–95% progress variance; I buy the split, but 40 cube episodes is thin.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

The paper identifies “zero collapse” in policy-gradient RL for discontinuous reward environments, demonstrated across REINFORCE and actor-critic variants. In first-price auctions, flat zero-reward regions and sharp reward thresholds let stochastic exploration and gradient updates overshoot high-reward regions, after which missing gradient signals make recovery sample-inefficient.

#Reasoning#Benchmarking#Research release

why featured

HKR-H/K pass via a named failure mode and a concrete RL mechanism. HKR-R is weak; no product, open-source artifact, or major-lab move, so it stays in the low-value upper band.

editor take

Zero collapse hits REINFORCE and actor-critic; in auction RL, exploration tuning won’t save you when reward cliffs erase gradients.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

The paper proposes FedVPA-GP, a federated variational preference alignment framework that uses a Federated Mixture Prior and Orthogonal Loss to separate user preferences, and evaluates it against monolithic reward-model baselines on the HH-RLHF dataset.

#Fine-tuning#Alignment#Research release#Safety/alignment

why featured

HKR-K passes via the FedVPA-GP mechanism and HH-RLHF evaluation. HKR-H/R are weak: the title is specialist-heavy, and the paper lacks a production-impact or safety-incident hook.

editor take

FedVPA-GP is tested only on HH-RLHF, with client count undisclosed; the idea is sane, but “significantly outperforms” needs runs.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Adaptive NAD: Online and Self-adaptive Unsupervised Network Anomaly Detector

Adaptive NAD evaluates unsupervised network anomaly detection on three security datasets, reporting false alarm rates of 1.33%, 0.71%, and 0.08%, plus more than 3x faster online inference latency than state-of-the-art baselines on CIC-Darknet2020, NSL-KDD, and Edge-IIoTset.

#Benchmarking#Adaptive NAD#Research release#Open source

why featured

HKR-K passes on concrete false-positive and latency numbers. HKR-H/R are weak, and network anomaly detection is specialized, so this stays in the lower research-news band without hard exclusion.

editor take

Adaptive NAD reports 0.08% false alarms on Edge-IIoTset; I care whether its online self-training survives poisoned traffic.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Bridging the Gap Between Natural Language and Market Dynamics via High-Dimensional Representation Learning

The paper replaces scalar sentiment scores with dense FinBERT embeddings in a Transformer forecasting architecture, benchmarking raw embeddings, attention-weighted aggregation, and Siamese-optimized embeddings on the FNSPID dataset; Siamese embeddings outperformed the scalar baseline and raw embeddings, while attention aggregation struggled under financial data’s low signal-to-noise condition.

#Embedding#Benchmarking#FinBERT#FNSPID

why featured

HKR-K passes via the FinBERT-embedding mechanism and three strategy comparisons. HKR-H/R fail, and no performance numbers are disclosed, so this stays a narrow research item in all.

editor take

Siamese FinBERT embeddings beat scalar sentiment baselines here; stop worshipping sentiment scores, though the snippet omits effect size.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT unifies UAV traversal, target acquisition, and tracking in one relative formulation using onboard instantaneous observables such as attitude, altitude, and velocity; the abstract reports outdoor tests in forests, container compounds, and SAR scenes, but does not disclose speed, success rate, or quantitative baselines.

#Robotics#Research release

why featured

HKR-K passes: HUNT proposes one relative-frame mechanism using onboard instantaneous observations for three UAV tasks. No speed, success-rate, or baseline numbers are disclosed, so HKR-H/R stay weak.

editor take

HUNT unifies search and tracking via instantaneous relative frames; no speed or success rate disclosed, so I don’t buy “high-speed robust” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Performance and Complexity Trade-off Optimization of Speech Models During Training

The paper proposes a feature-noise-injection reparameterization method that lets SGD jointly optimize speech-model task performance and computational complexity during training, instead of applying post hoc pruning or quantization; the authors evaluate it in 3 case studies, covering a synthetic setup, voice activity detection, and audio anti-spoofing, and state that the related code is public.

#Audio#Inference-opt#Research release#Open source

why featured

HKR-K and HKR-R pass via a concrete training mechanism and cost angle; HKR-H fails because this is a niche academic optimization paper with no disclosed code, savings number, or product impact.

editor take

Feature-noise injection lets SGD optimize speech-model error and FLOPs in 3 cases; this smells useful, not another pruning wrapper.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction

FlagGAM converts numerical and categorical variables into sparse human-readable rule bases, then uses a default additive head that stays close to EBM on tabular benchmarks and shows smaller AUROC degradation under missing and noisy perturbations.

#Interpretability#Benchmarking#FlagGAM#Research release

why featured

HKR-K passes via a concrete mechanism and robustness claim, but HKR-H/R are weak: this is a niche tabular interpretability paper with no product pull or broad practitioner debate. No hard exclusion applies.

editor take

FlagGAM keeps a sparse rule-basis matrix; the EBM-close claim lacks concrete benchmark numbers, so don’t crown it yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

The paper evaluates four learning rules using 720 THINGS images and fMRI data from three subjects across six visual ROIs. One training epoch reduces V1 alignment by 25–90%, with backpropagation showing the largest drop and predictive coding plus STDP preserving more alignment.

#Vision#Benchmarking#Alignment#arXiv

why featured

HKR-H/K pass: the counterintuitive drop has concrete setup and numbers. HKR-R is weak because this is niche neuro/vision representation research with limited product or practitioner impact.

editor take

One epoch drops V1 alignment 25–90%; stop using brain-similarity as BP halo, even this 3-subject fMRI cut stings.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Graph Machine Learning in the Era of Large Language Models

arXiv:2404.14928v3 surveys two-way links between Graph ML and LLMs, covering graph feature enhancement, reduced labeled-data reliance, graph heterophily, OOD generalization, and graph-based improvements to LLM pre-training and inference.

#Reasoning#RAG#Research release

why featured

HKR-K passes because the survey gives a mechanism map for graph ML and LLM integration. HKR-H/R fail, and the post lacks a new model, benchmark number, or product impact, so it stays in all.

editor take

arXiv 2404.14928v3 is survey-only here, with no benchmarks disclosed; Graph-LLM work needs reproducible wins, not another taxonomy.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Forecasting with Hyper-Trees

The paper introduces Hyper-Trees, a gradient-boosted tree framework that learns parameters for target time-series models such as ARIMA or Exponential Smoothing, and uses a shallow network to reduce scaling limits when estimating high-dimensional parameter sets.

#Benchmarking#Research release

why featured

HKR-K passes on a concrete modeling mechanism, but no benchmark numbers, code, or production-replacement claim is disclosed. HKR-H and HKR-R are weak, so this stays in all.

editor take

Hyper-Trees uses GBDT to predict ARIMA/ES parameters; I buy the direction, but no benchmark numbers are disclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization

The paper proposes H-EARS, which encodes known dominant energy terms into reward potentials with O(n) per-step computation, and reports gains in convergence speed, policy stability, and final performance across 4 continuous-control benchmarks and 4 baseline algorithms.

#Robotics#Reasoning#Benchmarking#Research release

why featured

HKR-K passes: H-EARS adds dominant energy terms to the reward potential, with O(n), 4 benchmarks, and 4 baselines. The RL-paper framing lacks HKR-H and HKR-R, so it stays in the 40–59 band.

editor take

H-EARS adds known energy terms to reward at O(n); 4 benchmarks are thin, so verify the extreme-road sim.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Learning to Perceive the World Through Control: Empowerment-Based Representation Learning

arXiv:2605.30656 studies empowerment-based representation learning in reinforcement learning environments where observations exceed control-relevant variables. The paper shows empowerment agents induce two complementary representations, forward and backward, both invariant to control-irrelevant features, and argues that interaction aimed at maximizing control is required for these invariance properties.

#Agent#Reasoning#Research release

why featured

HKR-H and HKR-K pass via the agent-control framing and concrete representation claims. HKR-R is weak: single arXiv theory paper, no product path, artifact, or industry debate disclosed.

editor take

arXiv 2605.30656 proves two empowerment representations; I buy the invariance angle, but sample complexity is undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

The paper proposes DSP for few-shot atypical layout-to-image generation, using Semantic Anchoring, Primitive Imbuing, and Conceptual Steering to improve visual fidelity and alignment in the 5-shot regime.

#Vision#Multimodal#iCVTEAM#Research release

why featured

HKR-K passes on the 5-shot atypical L2I setup and DSP mechanisms. HKR-H/R are weak, and the post lacks metrics, code quality, or reproducibility details, so it stays in the lower research-release band.

editor take

DSP claims 5-shot gains but exposes no metrics here; I’d file it as a patch for long-tail L2I layouts.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

The paper analyzes four challenges in deploying reinforcement learning to a real industrial thermal heating network: partial observability, action-space design, reward design, and the simulation-to-reality gap; the real deployment reaches operational stability, but the abstract does not disclose the size of the performance gap versus simulation.

#Agent#Robotics#Research release

why featured

HKR-K passes because the paper gives four RL deployment blockers for industrial heat networks; HKR-R is limited to real-world control practitioners. No performance delta or AI product angle, so it stays in the lower research band.

editor take

RL ran stably on a real heating network, but gap size is undisclosed; control papers need failure boundaries, not SOTA theater.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Enhancing Regime Shift Detection Using Unstructured Data: A Study on the Treasury Market

The paper proposes a text-enhanced regime shift detection pipeline that uses LLM reasoning over FOMC minutes, validates candidates with a bootstrap likelihood-ratio test on VAR, and evaluates 2010-2024 data with a 14-variable U.S. Treasury and macro panel; it reports F1 = 0.82 and same-day modal detection latency against verified monetary-policy regime shifts.

#Reasoning#Benchmarking#FOMC#U.S. Treasury

why featured

HKR-K passes via a concrete LLM-plus-FOMC-minutes setup, a 2010-2024 panel, and F1=0.82. HKR-H and HKR-R miss because this is a narrow finance paper, not a core AI product or model-capability story.

editor take

FOMC minutes plus a 14-variable panel hit F1=0.82; I buy LLM-as-candidate, not LLM-as-trading-signal.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling

Haochen Yuan and three coauthors propose Unicorn, a high-dimensional time-series forecasting framework that uses a latent prototype codebook to decouple correlation modeling from channel identities for multi-dataset pretraining and few-shot transfer.

#Benchmarking#Haochen Yuan#Yichen Song#Yunbo Wang

why featured

HKR-K passes: Unicorn uses a latent prototype codebook for multi-dataset pretraining and few-shot transfer. HKR-H/R fail, and no benchmark number or production impact is disclosed.

editor take

Unicorn decouples channel identity via a prototype codebook; no benchmark numbers disclosed, so I’d file it as a promising time-series pretraining bet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→How Well Does Classification Accuracy Capture Concept Drift Detection Quality?

The paper studies the relationship between eight drift detection quality metrics and classifier performance across seven synthetic data stream generators, with drift dynamics included as an evaluation condition.

#Benchmarking#arXiv#Research release#Benchmark

why featured

HKR-K passes on concrete evaluation scope: 8 metrics and 7 stream generators. HKR-H/R are weak, and the body does not disclose the main finding, so this stays in the lower-value all tier.

editor take

This tests 8 drift metrics across 7 synthetic stream tools; judging drift detection by accuracy alone was overdue for a teardown.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

The paper proposes HADT for autonomous resource management in heterogeneous EO satellite clusters, modeling the task as sequential decision-making with relational observation-action tokenization and differential attention; the RSS snippet does not disclose baseline names, dataset settings, or exact performance gains.

#Agent#Reasoning#Robotics#Research release

why featured

HKR-K passes for the HADT mechanism and tokenization design. HKR-H and HKR-R fail: no baseline names or gains are disclosed, and satellite resource management is distant from mainstream AI product practice.

editor take

HADT frames heterogeneous EO satellite scheduling as sequential decisions; baseline names and gains are undisclosed, so treat it as an engineering idea.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Student Capacity Moderates Knowledge Distillation Effectiveness Across ResNet Teacher-Student Pairs on CIFAR-10

The paper tests three ResNet teacher-student pairs on CIFAR-10 under three seeds with mean and standard deviation reported. R50→R34 Feature-KD gains +0.30pp over baseline, while a 32×32-aware ResNet stem correction raises teacher accuracy by more than 5pp, far larger than any distillation gain.

#Vision#Benchmarking#arXiv#ResNet

why featured

HKR-K passes with reproducible teacher-student pairs and concrete point gains. HKR-H/R fail because this is a narrow distillation ablation on an old vision benchmark, not broad industry signal.

editor take

R50→R34 Feature-KD gains just 0.30pp; the 32×32 stem fix adds 5pp+, so check implementation before praising KD.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→MADQI: An Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection

The paper proposes MADQI to evaluate unlabeled AIS-based maritime anomaly detection, combining four metrics—ARC, PPS, SDS, and ECE—and reports a MADQI score of 80.37% on an AIS dataset.

#Benchmarking#Ismet Gocer#Zakirul Bhuiyan#Raza Hasan

why featured

HKR-K passes because the paper names a metric, four components, and an 80.37% result. HKR-H/R fail: AIS maritime anomaly detection is narrow, with no agent, product, or frontier-model implication, so it sits in the low-value research band.

editor take

MADQI combines 4 metrics and reports 80.37%. I don’t buy it yet: unlabeled evaluation easily turns heuristics into a score.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning

The paper constructs two linear filters for partially observable reinforcement learning: one exactly reproduces belief-vector pre-softmax logits under deterministic HMM transitions, and the other drives state-decoding error to zero under nearly deterministic transitions.

#Reasoning#Memory#Research release

why featured

HKR-K passes for a testable mechanism around linear filters and HMM assumptions. HKR-H/R are weak, and the POMRL theory barrier keeps it in the lower research-signal band.

editor take

The paper gives two linear filters; deterministic HMMs recover belief logits exactly. Linear memory gets a mechanism, not emergence folklore.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

8d ago

arXiv · cs.LG· atomEN04:00 · 06·01

→Early Prediction of Future Behavioral Strategy from Process Traces

The paper introduces PLVM, a process-level latent variable model that fuses partial traces from two cleaning tasks to predict whether PowerWash Simulator players use locally persistent Zone Planner behavior or frequent Zone Hopper behavior in the held-out Fire Station level; the abstract does not disclose dataset size or accuracy numbers.

#Benchmarking#PowerWash Simulator#Research release

why featured

HKR-H comes from the odd game setting, and HKR-K has a concrete PLVM trace-prediction setup. No metrics or product/agent implications are disclosed, so this stays in the low-value research band.

editor take

PLVM predicts Fire Station strategy from two cleaning traces; no sample size or accuracy disclosed, so this reads like telemetry modeling, not agent benchmarking.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

03:10

8d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN03:10 · 06·01

→EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

EvoPool uses three specialized agents to iteratively generate executable annotator code, and on 7 of 8 LLM-weak specialized tasks it beats the strongest LLM annotation baseline by an average 0.141 macro-F1, while running 4,500 to 31,000 times faster than LLM annotation on 100K examples.

#Agent#Fine-tuning#Benchmarking#EvoPool

why featured

HKR-H/K/R all pass: EvoPool links agent-written annotators to measurable F1 gains and large speedups. It stays in the 78-84 band because this is a single research release, not a widely adopted framework or major-lab launch.

editor take

EvoPool demotes LLMs from annotators to annotator authors, which is the harsher fit for repeatable domain labeling.

sharp

EvoPool’s sharp move is shifting labeling from token calls to executable code. Three agent types write annotators, a small validation set supplies the fitness signal, and deterministic gates filter for viability, diversity, and marginal contribution. On 7 of 8 LLM-weak tasks, it beats the strongest LLM annotation baseline by +0.141 macro-F1 on average, with +0.301 on ChemProt and +0.265 on PubMed. Honestly, this is a better production shape than “use a bigger LLM as judge” for repeatable domain labeling. The 4,500–31,000x speedup on 100K examples is exactly the kind of number that matters once annotation becomes a pipeline, not a demo. The catch is also clear: the paper picks LLM-weak specialized tasks. Open-ended taxonomies and shifting label definitions remain unproven.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

02:59

8d ago

HuggingFace Papers (takara mirror)· rssEN02:59 · 06·01

→Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression

SPRDiff applies a diffusion-based triple-encoder design and a distortion-aware reconstruction module to ultra-low-bitrate image compression, using pretrained distortion-oriented and semantic-oriented encoders to compensate for a frozen VAE encoder; benchmark experiments report better rate-distortion-perception trade-offs than state-of-the-art methods below 0.03 bpp, and the authors say code and trained models will be released on GitHub.

#Vision#Multimodal#Benchmarking#SPRDiff

why featured

HKR-K passes with testable details: below 0.03 bpp, a tri-encoder design, and distortion-aware reconstruction. HKR-H/R stay weak because this is niche image-compression research without product or broad cost impact.

editor take

SPRDiff beats SOTA below 0.03 bpp; I care whether inference latency eats the compression win after weights ship.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

01:44

8d ago

HuggingFace Papers (takara mirror)· rssEN01:44 · 06·01

→CRePE: Convolution-aware Relative Importance in Efficient Post-training Pruning

CRePE adds 2D local neighborhood context and adaptive coefficients to relative-importance post-training pruning, while PHO replaces repeated perplexity evaluations and reduces coefficient search time from about 11 hours to about 20 minutes.

#Inference-opt#CRePE#PHO#RIA

why featured

HKR-K is strong and HKR-R is moderate: the 11h-to-20m search cut is concrete and cost-relevant. HKR-H is weak because the paper is narrow pruning research, so it stays in all.

editor take

PHO cuts search from 11 hours to 20 minutes; I buy transferable pruning knobs, but accuracy numbers aren't disclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

01:28

8d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN01:28 · 06·01

→Research Shows Sliding-Window Transformers without Positional Encoding Remain Turing Complete

The paper proves that sliding-window transformers without positional encoding remain Turing complete by introducing the HIST model, where each update uses only constant-size internal state and the token-count histogram inside the current window, then showing that window evolution reveals the token that just left the window and can simulate Turing-complete Post machines.

#Reasoning#Research release

why featured

HKR-H/K pass: the title challenges PE necessity, and the post names the HIST/Post-machine mechanism. It remains a narrow theory result with no model, code, or product impact disclosed, so it stays in the 60–71 band.

editor take

Three feeds point to one arXiv paper: don’t read “no PE remains Turing-complete” as “drop RoPE”; the proof rides on sliding-window leakage.

sharp

All three entries use the same title and route back to arXiv:2606.01532; this is single-paper propagation, not independent confirmation. The hard hook is HIST: each update sees only constant-size state plus the token-count histogram inside the current window, yet window evolution reveals the token that just left, enough to simulate Post machines. I don’t buy the leap to “positional encoding doesn’t matter.” Turing completeness is an expressivity floor, not trainability or long-context quality. The body gives a constant alphabet, a finite sliding window, and a constructive simulation. For long-context practitioners, this reads closer to a theory-side nod for state-update models like Mamba or RWKV: time can leak through dynamics, but RoPE and ALiBi still pay their rent in optimization and generalization.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

00:08

8d ago

HuggingFace Papers (takara mirror)· rssEN00:08 · 06·01

→Agent Operating Systems (AOS): Integrating Agentic Control Planes into and Beyond Traditional Operating Systems

The paper defines an Agent Operating System architecture for agent workloads, decomposing its control plane into five responsibility areas: scheduling, context and memory management, tool and capability registries, policy and trust enforcement, and observability and audit, while mapping integration models onto Linux and Windows primitives rather than proposing wholesale OS replacement.

#Agent#Memory#Safety#Linux

why featured

HKR-H/K/R all pass, but the item gives only a paper title and architecture summary, with no implementation, benchmark, or code. It stays in the 60–71 band.

editor take

AOS splits agent control planes into 5 duties; I buy the systems problem, not the OS-name ambition.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

papers · 2026-06-01

more

feeds

admin