papers · 2026-05-14

▸ 248 papers · updated 3m ago


2026-05-14 · Thu

17:59

25d ago

arXiv · cs.AI · atom · EN · 17:59 · 05·14

→EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench introduces a 140-episode, 2,491-shot benchmark for multi-shot video generation, evaluating character, object, and location consistency under per-shot entity schedules across up to 50 shots.

#Multimodal#Vision#Memory#EntityBench

why featured

HKR-H/K/R pass, but this is a single arXiv benchmark without major lab adoption, production impact, or broad model results. Lower-band scoring keeps it in all.

editor take

EntityBench stress-tests 2,491 shots across 140 episodes of up to 50 shots each. EntityMem’s prefilled visual memory is useful, but not end-to-end generation.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

25d ago

arXiv · cs.CL · atom · EN · 17:59 · 05·14

→ATLAS: Single Token Enables Agentic and Latent Visual Reasoning

ATLAS uses a single functional token as both an agentic operation and a latent visual reasoning unit, with LA-GRPO anchoring sparse functional tokens during RL training; the snippet does not disclose specific benchmark scores.

#Agent#Reasoning#Vision#ATLAS

why featured

HKR-H/K pass: the title has a counterintuitive hook and the mechanism is concrete. HKR-R misses because benchmark scores and release impact are not disclosed, keeping it below featured.

editor take

ATLAS compresses visual operations into 1 functional token; scores are undisclosed, so I read it as a compute-saving training trick.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:59

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:59 · 05·14

→RefDecoder Enhances Visual Generation with Conditional Video Decoding

RefDecoder adds reference attention to a video VAE decoder and improves reconstruction by up to 2.1dB PSNR over unconditional baselines on Inter4K, WebVid, and Large Motion, while the post says it can replace existing decoders without extra fine-tuning.

#Vision#Multimodal#Fine-tuning#RefDecoder

why featured

HKR-K passes on a concrete decoder mechanism and a PSNR gain of up to 2.1dB. HKR-H/R are weak because this is narrow video-generation research with no disclosed release, product path, or competitive shock.

editor take

RefDecoder reports up to +2.1dB PSNR across three benchmarks. I buy decoder conditioning; swap-in claims need replication.
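
For scale, a PSNR delta converts directly into an MSE ratio. A worked conversion, assuming both decoders are scored against the same peak signal value (the snippet does not say how PSNR is computed) and taking the headline 2.1dB as the best case:

```latex
\Delta\mathrm{PSNR} = 10\log_{10}\frac{\mathrm{MSE}_{\mathrm{base}}}{\mathrm{MSE}_{\mathrm{ref}}} = 2.1\ \mathrm{dB}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\mathrm{ref}}}{\mathrm{MSE}_{\mathrm{base}}} = 10^{-0.21} \approx 0.62
```

So the best case is roughly a 38% cut in mean squared reconstruction error.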

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:59

25d ago

● P1 · arXiv · cs.AI · atom · EN · 17:59 · 05·14

→FutureSim: Replaying World Events to Evaluate Adaptive Agents

FutureSim replays real news chronologically from January to March 2026 to evaluate frontier agents forecasting world events beyond their knowledge cutoff; the best agent reaches 25% accuracy, and many agents score worse on Brier skill score than making no prediction.

#Agent#Memory#Reasoning#FutureSim

why featured

HKR-H/K/R all pass: real-news replay tests agent forecasting, with 25% accuracy and Brier skill as checkable claims. It is a strong benchmark paper, but a single arXiv source keeps it below the 85 must-write band.

editor take

FutureSim drags agent evals onto a real timeline; 25% best accuracy is a brutal check on the “search equals adaptation” story.

sharp

Two arXiv categories carry the same FutureSim paper with identical framing; this is paper distribution, not independent corroboration. The benchmark replays real news and question resolutions from January to March 2026, then asks agents to forecast post-cutoff events. The best agent reaches only 25% accuracy, and many score worse on Brier skill than making no prediction. I like how unforgiving this setup is. It tests belief updating inside an information stream, not trivia retrieval with a nicer harness. A lot of agent demos look competent because tool calls and long context hide weak calibration. Chronological replay exposes that quickly: memory, search policy, and uncertainty reasoning all have to work together. The abstract does not name the tested models, so cross-model claims stop there. Still, this is a healthier direction than another static QA leaderboard.
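
To make "worse than making no prediction" concrete, here is a minimal sketch of the standard Brier skill score. It assumes binary questions and assumes the "no prediction" reference is a constant 0.5 forecast; the snippet does not state which reference FutureSim uses, and the function names and toy forecasts are illustrative.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: Sequence[float], outcomes: Sequence[int],
                      reference_prob: float = 0.5) -> float:
    """1 - BS / BS_ref: positive beats the reference, negative is worse."""
    bs = brier_score(probs, outcomes)
    # Assumed reference: the same constant probability on every question.
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


# Toy data, not from the paper.
outcomes = [1, 0, 0, 1]
overconfident = [0.1, 0.9, 0.8, 0.2]  # confidently wrong
hedged = [0.7, 0.3, 0.4, 0.6]         # mildly right

print(brier_skill_score(overconfident, outcomes))  # negative: worse than 0.5
print(brier_skill_score(hedged, outcomes))         # positive: beats 0.5
```

A confidently wrong agent lands well below zero, which is the regime the summary describes for many agents.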

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:59

25d ago

arXiv · cs.AI · atom · EN · 17:59 · 05·14

→Quantitative Benchmark for Video World Model Geometric Consistency

PDI-Bench evaluates generated videos by segmenting and tracking objects with SAM 2, MegaSaM, and CoTracker3, lifting observations into monocular 3D coordinates, and computing three projective-geometry residuals for scale-depth alignment, 3D motion consistency, and structural rigidity.

#Vision#Multimodal#Benchmarking#PDI-Bench

why featured

HKR-K/R pass: it offers a reproducible eval mechanism for video world models, with a named toolchain and three residual types. HKR-H is weak; the topic is narrow and lacks a major lab or product impact.

editor take

PDI-Bench scores video models with 3 geometry residuals; stronger than vibe checks, but monocular 3D injects evaluator noise.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

17:59

25d ago

arXiv · cs.AI · atom · EN · 17:59 · 05·14

→VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

VGGT-Edit predicts 3D geometric displacements with depth-synchronized text injection and a residual transformation head, then trains on the DeltaScene Dataset generated through an automated pipeline with 3D agreement filtering; the snippet reports stronger multi-view consistency and near-instant inference against 2D-lifting baselines, but does not disclose dataset size or benchmark numbers.

#Multimodal#Vision#Inference-opt#VGGT-Edit

why featured

HKR-H and HKR-K pass: feed-forward native 3D editing is a real hook, with residual fields and DeltaScene filtering named. Single arXiv vision paper with no metrics, code, or product path keeps it in all.

editor take

VGGT-Edit predicts 3D displacement, but gives no numbers; I buy native 3D editing, not the table-free “substantially better” win.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:58

25d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:58 · 05·14

→Research Compares Grep and Vector Retrieval for Agentic Search

The paper compares grep and vector retrieval on 116 LongMemEval questions across Chronos, Claude Code, Codex, and Gemini CLI; grep generally achieves higher accuracy, while scores still vary by harness and by whether tool outputs are inline or file-based.

#Agent#RAG#Tools#Claude Code

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the paper gives a 116-question benchmark across agent tools, and the result affects agent/RAG retrieval design. Single arXiv paper, so 78–84 rather than must-write.

editor take

Grep beating vector search on 116 LongMemEval questions is a direct hit on default agent RAG plumbing, not a cute Unix nostalgia point.

sharp

The painful read is that agent search fails first at the harness layer, not the embedding layer. On 116 LongMemEval questions, the authors compare grep and vector retrieval across Chronos, Claude Code, Codex, and Gemini CLI. Grep generally wins, and the same conversation data shifts again when tool results are inline versus file-based. I don’t read this as “vector DBs are dead.” It is a warning for agent builders who treat top-k retrieval as the product. In CLI-style loops like Claude Code, plain string search gives the model inspectable, low-noise, repeatable moves. A better embedding model will not rescue a tool interface that dumps the wrong evidence in the wrong shape.
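
A rough illustration of what "inspectable, low-noise, repeatable" can mean at the tool layer: a grep-style search over conversation lines that returns line-numbered matches. This is a hypothetical `grep_tool`, not the paper's harness or any of the listed CLIs, and the conversation history is invented.

```python
import re
from typing import List, Tuple


def grep_tool(pattern: str, lines: List[str], context: int = 0,
              ignore_case: bool = True) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for matches plus context lines."""
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    # Sorted output makes the same query return the same evidence every time.
    return [(i + 1, lines[i]) for i in sorted(keep)]


# Invented conversation memory for illustration.
history = [
    "user: I adopted a dog named Biscuit in March.",
    "assistant: Congratulations on the new dog!",
    "user: My flight to Lisbon is on Friday.",
    "user: Biscuit has a vet appointment next week.",
]

for number, text in grep_tool(r"biscuit", history):
    print(f"{number}: {text}")
```

The model can see exactly why each line came back and can rerun the query with a tighter pattern, which a ranked top-k list does not offer by default.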

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:58

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:58 · 05·14

→From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

The paper proposes a long-horizon image editing framework where a planner decomposes instructions into atomic steps, an orchestrator selects tools and regions, and a vision-language judge rewards trajectories based on instruction adherence and visual quality.

#Agent#Vision#Tools#Research release

why featured

HKR-H and HKR-K pass: the paper frames image editing as planned tool orchestration and names a VLM-judge reward setup. HKR-R is weak, and no benchmark numbers, open-source artifact, or major-lab context are disclosed, so it stays in all.

editor take

Planner, orchestrator, and VLM judge train long-horizon edits; no benchmark numbers disclosed, so I discount “more reliable.”

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:58

25d ago

arXiv · cs.AI · atom · EN · 17:58 · 05·14

→Research proposes sparse mixture-of-experts routing to eliminate negative transfer in multi-physics models

Shodh-MoE uses a Top-1 soft-semantic router over compressed 16^3 physical latents, and in a 20,000-step mixed 3D pretraining run, held-out open-channel tokens routed exclusively to Expert 0 while porous-media tokens routed exclusively to Expert 1.

#Reasoning#Benchmarking#Shodh-MoE#Research release

why featured

HKR-K passes, while HKR-H/R fail. This is a physics-plus-AI paper with no agent or product implication and a high technical barrier, triggering hard-exclusion-1/4 and capping importance below 40.

editor take

Shodh-MoE splits two PDE regimes with Top-1 routing; 5-page arXiv, no code, don’t buy “eradicated” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:57

25d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:57 · 05·14

→OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation

OpenDeepThink selects parallel reasoning candidates with pairwise LLM comparisons and Bradley-Terry aggregation, raising Gemini 3.1 Pro’s effective Codeforces Elo by 405 points across eight sequential LLM-call rounds, about 27 minutes of wall-clock time.

#Reasoning#Benchmarking#OpenDeepThink#Gemini

why featured

HKR-H/K/R all pass: the +405 Elo claim is clickable, the mechanism and run conditions are concrete, and test-time reasoning cost matters to practitioners. Single arXiv method paper, so it stays in the 78–84 band.

editor take

OpenDeepThink turns sampling into an evolutionary bracket; +405 Codeforces Elo is strong, but 27 minutes per run kills most interactive use.

sharp

OpenDeepThink’s sharp move is not Bradley-Terry itself; it admits pointwise LLM judging is noisy and makes ranking a pairwise tournament. Each generation compares random candidate pairs, uses critiques for mutation, preserves the top ranks, mutates the top three quarters, and drops the bottom quarter. On Gemini 3.1 Pro, that buys +405 Codeforces Elo after eight sequential LLM-call rounds, about 27 minutes. I read this as a test-time compute bill, not a clean model-capability jump. The paper says gains concentrate in objectively verifiable HLE domains and reverse in subjective ones, which is exactly the failure mode you’d expect from selection pressure without a real verifier. Great for contest programming, proof search, and code repair loops; awkward for any agent path that needs latency under minutes.
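
For readers who have not fit one, a minimal Bradley-Terry sketch using the classic minorize-maximize update over (winner, loser) pairs. The snippet does not disclose how OpenDeepThink fits its model, so this is a generic instance of the technique; the pseudo-count that keeps zero-win candidates positive is my own illustrative choice, and the toy comparisons are invented.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def bradley_terry(comparisons: Iterable[Tuple[Hashable, Hashable]],
                  iters: int = 200, pseudo: float = 0.1) -> Dict[Hashable, float]:
    """Estimate normalized strengths from (winner, loser) pairs."""
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(i, j)] = number of i-vs-j comparisons
    items = set()
    for winner, loser in comparisons:
        if winner == loser:
            continue  # a self-comparison carries no information
        items.update((winner, loser))
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0

    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            denom = sum(games.get((i, j), 0.0) / (strength[i] + strength[j])
                        for j in items if j != i)
            # pseudo is an assumed smoothing term, not part of the plain MLE.
            updated[i] = (wins[i] + pseudo) / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: v / total for i, v in updated.items()}
    return strength


# Invented judge outcomes: each tuple is (preferred candidate, other candidate).
pairs = ([("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3
         + [("C", "B")] + [("A", "C")] * 2)
scores = bradley_terry(pairs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The point of aggregating this way is that one noisy pairwise verdict moves a candidate's strength only slightly, whereas a single noisy pointwise score can reorder the whole pool.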

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:56

25d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:56 · 05·14

→MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

MetaBackdoor uses Transformer positional encoding to trigger LLM backdoors without changing input text; the paper shows a length-based condition can induce disclosure of system prompts and self-activate malicious tool calls during normal multi-turn conversations.

#Safety#Tools#MetaBackdoor#Research release

why featured

HKR-H/K/R all pass: the paper gives a counterintuitive positional-encoding backdoor plus concrete length triggers, prompt leakage, and malicious tool calls. As a single arXiv safety paper, impact needs reproduction and debate, so it sits in 78–84.

editor take

Positional backdoors are nastier than prompt triggers: clean text, length activation, and tool misuse after normal chat drift past a hidden boundary.

sharp

MetaBackdoor moves the trigger from text into position, which makes the usual backdoor playbook look stale. The concrete hook is ugly: no modified input string, just a length condition in the context. The paper claims that condition can leak proprietary system prompts, and normal multi-turn chat can drift into the trigger region and cause malicious tool calls. That is a direct hit on defenses built around suspicious tokens, jailbreak phrases, or semantic anomaly filters. Transformer models need positional encoding, so this attack surface is not an optional feature teams can remove with a prompt patch. The abstract does not disclose success rates, model sizes, or trigger-length distributions, so portability is still unproven. If the PDF results hold up, agent red-teaming that only scans visible text is under-scoped.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:56

25d ago

FEATURED · arXiv · cs.AI · atom · EN · 17:56 · 05·14

→Evidential Reasoning Method Advances Interpretable Real-World Disease Screening

EviScreen retrieves region-level evidence from historical cases through dual knowledge banks and uses an evidence-aware reasoning module for disease screening; the snippet says code is public, but it does not disclose exact benchmark scores.

#Reasoning#RAG#Interpretability#EviScreen

why featured

HKR-K passes because the post names a concrete mechanism and public code. HKR-H/R are weak: no metrics are disclosed, and the medical-screening angle lacks a product or industry-conflict hook.

editor take

Three sources trace to one arXiv paper; EviScreen’s case-retrieval story is sensible, but without external clinical validation, don’t buy deployment claims.

sharp

All 3 sources use the same headline and point to arXiv 2605.15171; this is one-paper propagation, not independent confirmation. EviScreen’s useful hook is concrete: it retrieves region-level evidence from dual knowledge banks, then feeds it into an evidence-aware reasoning module. It also uses contrastive retrieval to produce abnormality maps instead of leaning on post-hoc saliency. I like the direction, but I don’t buy the weight carried by “real-world” yet. The body says it gets higher specificity at clinical-level recall on carefully established benchmarks, but gives no disease, cohort size, external-site test, or recall threshold. For medical imaging screening, interpretable evidence is cheap unless calibration and external validation survive. LungCURE at least stated 1,000 cases across 10+ hospitals; EviScreen reads like a method paper for now.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

17:55

25d ago

arXiv · cs.AI · atom · EN · 17:55 · 05·14

→Text Knows What Tables Know When: Retrieval-Augmented Multimodal Clinical Timeline Reconstruction

The authors introduce a retrieval-augmented multimodal alignment framework for clinical timeline reconstruction, evaluated with instruction-tuned LLMs on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV; it improves absolute timestamp accuracy and temporal concordance over text-only reconstruction, while 34.8% of text-derived events are absent from tabular records.

#RAG#Multimodal#Benchmarking#MIMIC-III

why featured

HKR-H/K pass on the text-table hook and 34.8% missing-event finding; HKR-R fails. hard-exclusion-4 applies: clinical timeline reconstruction has no agent or product implication, so score is capped at 39.

editor take

Two arXiv tracks picked this up: 34.8% of text-derived events are missing from the tabular records, so clinical RAG built only on structured EHR tables is brittle.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

17:54

25d ago

● P1 · arXiv · cs.AI · atom · EN · 17:54 · 05·14

→Research paper argues behavioral evaluation cannot verify AI safety claims required by governance

The paper analyzes 21 governance instruments from 2019 to early 2026 and argues that behavioral evaluations and red-teaming only observe model outputs, so they cannot verify hidden objectives, loss-of-control precursors, or bounded catastrophic capability claims.

#Safety#Alignment#Interpretability#Safety/alignment

why featured

HKR-H/K/R all pass: the paper has a sharp anti-eval claim, a concrete scope of 21 governance tools, and a direct safety-governance nerve. As a single arXiv position paper, it fits the 78–84 safety-discussion band, not same-day must-write.

editor take

Two arXiv tracks carry the same position paper: behavioral evals are being used as governance proof, and that instrument is too weak for the job.

sharp

Two arXiv categories list the same position paper, with identical framing from the authors’ abstract rather than independent reporting. The paper hits the weakest seam in AI safety governance: 21 instruments from 2019 to early 2026 ask for evidence on hidden objectives, loss-of-control precursors, and bounded catastrophic capability, while the evidence base is mostly behavioral evals and red-teaming. I buy the critique. SWE-bench or MMLU-style behavior scores can rank product capability; they cannot verify the absence of long-horizon agentic failure modes. Linear probes, activation patching, and before/after-training comparisons are not magic either. But they at least force legal language to stop treating “passed the eval suite” as auditable safety proof.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:46

25d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 17:46 · 05·14

→Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Causal Forcing++ initializes a 1–2 step frame-wise autoregressive diffusion student with causal consistency distillation. Under a frame-wise 2-step setting, it beats 4-step chunk-wise Causal Forcing by 0.1 VBench Total, 0.3 VBench Quality, and 0.335 VisionReward, while cutting first-frame latency by 50% and Stage 2 training cost by about 4x.

#Vision#Inference-opt#Benchmarking#Causal Forcing++

why featured

HKR-H/K/R pass on the 1–2-step video hook, distillation mechanism, and 50% latency claim. Thin body and a tiny VBench +0.1 gain keep it in the featured-threshold band, not same-day must-write.

editor take

Causal Forcing++ barely wins VBench, but a 50% first-frame latency cut is the product signal for interactive video.

sharp

Causal Forcing++ matters less for its 0.1 VBench Total gain than for keeping frame-wise 2-step AR video from falling apart. The concrete trick is causal consistency distillation: one online teacher ODE step between adjacent timesteps, instead of storing full PF-ODE trajectories. That gives a 50% first-frame latency cut and about 4x lower Stage 2 training cost. The quality claim is thin: +0.3 VBench Quality and +0.335 VisionReward will not sell a visual leap. For interactive video and Genie-style world models, response granularity beats another polished offline T2V clip. This is an inference and training-cost paper wearing a video-quality headline.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:37

25d ago

arXiv · cs.CL · atom · EN · 17:37 · 05·14

→MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

MemEye evaluates multimodal agent memory across 8 life-scenario tasks, spanning scene-level to pixel-level visual evidence, and tests 13 memory methods on 4 VLM backbones for evidence routing, temporal tracking, detail extraction, and reasoning over changing visual states.

#Agent#Multimodal#Memory#MemEye

why featured

HKR-K/R are solid: 8 tasks, 13 memory methods, 4 VLM backbones, and a real agent-memory pain point. HKR-H passes but stays niche; with no adoption or release artifact disclosed, it stays below featured.

editor take

MemEye tests 8 tasks, 13 memory methods, and 4 VLMs; multimodal memory evals finally stop letting captions fake visual recall.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:30

25d ago

FEATURED · arXiv · cs.CL · atom · EN · 17:30 · 05·14

→Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

The paper builds a 507-leaf taxonomy from 932 arXiv security studies and audits six public LLM attack benchmarks; HarmBench, InjecAgent, and AgentDojo cover at most 25% of the 4×6 Target×Technique matrix, while Service Disruption and Model Internals lack standardized evaluation despite attacks reporting 46× token amplification and 96% success rates.

#Safety#Benchmarking#Agent#HarmBench

why featured

HKR-H/K/R all pass: this is not another attack survey; it quantifies benchmark blind spots with a 507-leaf taxonomy. As a single arXiv safety-eval paper it is not must-write, but its numbers and agent-security angle clear featured.

editor take

LLM safety benchmarks are still measuring a hallway while attackers use the whole building; this audit makes the gap embarrassingly concrete.

sharp

This paper exposes a bad habit in LLM security: we keep treating jailbreak coverage as attack coverage. The authors mine 932 arXiv security papers from 2023-2026 into a 507-leaf taxonomy, then map six public benchmarks onto a 4×6 Target×Technique matrix. HarmBench, InjecAgent, and AgentDojo cover at most 25% of that grid. The ugly gap is Service Disruption and Model Internals. Those categories have no standardized evaluation, while published attacks already report 46× token amplification and 96% success rates. That matters for agent deployments, because tool loops, billing surfaces, memory, and hidden state are where production incidents land. If a benchmark ignores resource exhaustion and internal-state attacks, it is scoring classroom safety, not operational security.
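
The coverage arithmetic is simple enough to sketch: a 4×6 grid has 24 cells, so "at most 25%" means at most 6 cells per benchmark. The axis labels and toy benchmarks below are placeholders, since the snippet does not list the actual Target and Technique categories or which cells each benchmark covers.

```python
from itertools import product
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[str, str]

# Placeholder axis labels; the paper's real category names are not in the snippet.
TARGETS = [f"target_{i}" for i in range(4)]
TECHNIQUES = [f"technique_{j}" for j in range(6)]
GRID: Set[Cell] = set(product(TARGETS, TECHNIQUES))


def coverage(covered_cells: Iterable[Cell]) -> float:
    """Fraction of the Target x Technique grid that a benchmark touches."""
    return len(set(covered_cells) & GRID) / len(GRID)


def uncovered(benchmarks: Dict[str, Set[Cell]]) -> Set[Cell]:
    """Cells that no benchmark in the collection touches."""
    seen: Set[Cell] = set()
    for cells in benchmarks.values():
        seen |= cells
    return GRID - seen


# Toy benchmarks, invented for illustration.
toy = {
    "bench_a": {("target_0", t) for t in TECHNIQUES},      # one full row
    "bench_b": {("target_1", t) for t in TECHNIQUES[:3]},  # half a row
}

for name, cells in toy.items():
    print(name, coverage(cells))
print(len(uncovered(toy)), "of", len(GRID), "cells have no benchmark")
```

Tracking the union of covered cells, not just each benchmark's own share, is what surfaces whole rows with no standardized evaluation at all.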

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

17:27

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 17:27 · 05·14

→Learning from Language Feedback via Variational Policy Distillation

The paper proposes VPD, a variational EM framework for learning from language feedback; its E-step refines the teacher with an adaptive trust-region update, and its M-step trains the student on token-level distributional guidance from on-policy rollouts.

#Reasoning#Code#Fine-tuning#Research release

why featured

HKR-K passes: the post adds a concrete training mechanism for language-feedback learning, but gives no metrics, artifact, or production claim. Its technical framing keeps it in all.

editor take

VPD trains teacher and student via variational EM; scores are undisclosed. Fixed-teacher self-distillation looks brittle for hard reasoning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

16:36

25d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 16:36 · 05·14

→EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

EverAnimate uses persistent latent context memory and restorative flow matching to reduce chunk-level drift in long-horizon human animation; with lightweight LoRA tuning, it reports 8%/7% PSNR/SSIM gains at 10 seconds and 15%/15% gains at 90 seconds, while LPIPS/FID drop 22%/11% and 32%/27%, respectively.

#Vision#Fine-tuning#Memory#Research release

why featured

HKR-H/K/R all pass: minute-scale human animation is a clear hook, and the post gives 90-second metrics plus a latent-memory mechanism. Missing authors, code, dataset, and real-user validation keep it below the stronger research band.

editor take

EverAnimate hits the right failure mode: chunk drift. The 90-second gains are strong, but without human evals, I don’t buy identity consistency yet.

sharp

EverAnimate is aiming at the right bottleneck: accumulated drift across generated chunks, not another headline about longer clips. The concrete numbers are good: at 90 seconds, PSNR and SSIM rise 15% each, while LPIPS and FID drop 32% and 27%. At 10 seconds, it still reports 8%/7% gains and 22%/11% drops. The mechanism also has teeth: Persistent Latent Propagation for cross-chunk memory, Restorative Flow Matching for velocity adjustment, plus lightweight LoRA tuning. I’m still skeptical of the claim scope. PSNR and SSIM reward stable backgrounds, but character identity failure is often a semantic problem. Runway and Pika users don’t complain that 10-second PSNR is low; they complain that the actor becomes a different person across shots. The snippet gives no human preference study, identity metric, or inference cost. That makes EverAnimate a serious long-video patch, not yet a production answer for animation.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:26

25d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 16:26 · 05·14

→WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

Researchers propose WARD, a guard model for web agents, using 177K samples from 719 high-traffic URLs and A3T, an adaptive adversarial training loop with attacker-guard co-evolution.

#Agent#Safety#Alignment#WARD

why featured

HKR-H/K/R all pass: web-agent prompt injection is clickable, the dataset and A3T mechanism are concrete, and agent security is a live practitioner concern. This is strong research signal, not a major platform release, so it sits in the 78–84 band.

editor take

WARD treats prompt injection as an adversarial training problem, not a filter list; I’d trust it after seeing the OOD benchmark names.

sharp

WARD’s sharp move is training the guard as a target, not treating prompt injection as another HTML-sanitization chore. The paper uses 177K samples from 719 high-traffic URLs, adds WARD-PIG for attacks aimed at the guard itself, and runs A3T as a memory-based attacker/guard co-evolution loop. That setup maps better to messy web-agent deployment than static jailbreak suites. I don’t buy “nearly perfect recall” without the missing pieces. The snippet gives no OOD benchmark names, false-positive numbers, or hardware conditions for the claimed zero added latency. Those are the three production killers for agent guards. Lakera Guard and the tool-use safety layers from OpenAI and Anthropic hit the same wall: latency and overblocking. WARD earns attention if it survives dirty-web evaluation outside the paper harness.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:25

25d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 16:25 · 05·14

→SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

SemaTune tunes up to 41 Linux parameters across 13 live workloads from five benchmark suites. It improves stable-phase performance by 72.5% over defaults, beats the strongest non-LLM baseline by 153.3%, and costs about $0.20 in model calls for a 30-window session.

#Agent#Reasoning#Tools#SemaTune

why featured

HKR-H/K/R all pass: the hook is LLM-driven Linux tuning, the article gives concrete workload/parameter/gain/cost numbers, and the resonance is ops automation. It is strong applied research, not a same-day must-write model release.

editor take

SemaTune is strongest when the LLM is boxed into typed sysctl advice; 72.5% is tasty, but prod kernel knobs are where demos go to die.

sharp

SemaTune’s useful move is not “LLMs tune Linux.” It admits the LLM needs a cage, then uses it as a typed advisor for an OS controller. The paper gives real hooks: 13 live workloads, up to 41 Linux parameters, 72.5% stable-phase gain over defaults, 153.3% over the strongest non-LLM baseline, and about $0.20 for a 30-window session. I buy the engineering shape: typed validation, a fast update loop, a slower strategy loop, and retrieval from prior runs. That is more credible than black-box RL poking sysctl. The impressive claim is host-level metrics beating baselines with direct app objectives by 93.7 percentage points. My doubt is deployment, not novelty. Thirteen benchmarks do not cover production failure modes, and kernel knobs often fail with delayed blast radius. Typed validation blocks illegal values; it does not block legal settings that quietly hurt the service.
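
A minimal sketch of what "typed advisor" validation could look like, assuming integer-valued parameters with fixed bounds. `vm.swappiness` and `net.core.somaxconn` are real sysctl names, but the bounds, the `IntParam` type, and the `validate` function are illustrative assumptions, not SemaTune's schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IntParam:
    name: str
    lo: int
    hi: int


# Illustrative bounds only; real limits depend on kernel version and host.
SPECS = {p.name: p for p in (
    IntParam("vm.swappiness", 0, 100),
    IntParam("net.core.somaxconn", 128, 65535),
)}


def validate(proposal: Dict[str, object]) -> Tuple[Dict[str, int], List[str]]:
    """Split an LLM proposal into accepted settings and rejection reasons."""
    accepted: Dict[str, int] = {}
    rejected: List[str] = []
    for name, value in proposal.items():
        spec = SPECS.get(name)
        if spec is None:
            rejected.append(f"{name}: unknown parameter")
        elif isinstance(value, bool) or not isinstance(value, int):
            rejected.append(f"{name}: expected int, got {type(value).__name__}")
        elif not spec.lo <= value <= spec.hi:
            rejected.append(f"{name}: {value} outside [{spec.lo}, {spec.hi}]")
        else:
            accepted[name] = value
    return accepted, rejected


# Hypothetical model output: one legal value, one wrong type, one invented knob.
accepted, rejected = validate({
    "vm.swappiness": 100,
    "net.core.somaxconn": "4096",
    "kernel.made_up": 1,
})
print(accepted)
print(rejected)
```

Note that `vm.swappiness = 100` sails through: it is legal under these bounds, and nothing in the type check knows whether it hurts the workload. That is the gap between validation and safety.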

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

16:12

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 16:12 · 05·14

→The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

The Scientific Contribution Graph extracts 2 million scientific contributions from 230,000 open-access papers and links them with 12.5 million prerequisite edges for automated technological roadmapping.

#RAG#Reasoning#Benchmarking#Scientific Contribution Graph

why featured

HKR-H and HKR-K pass on the scale of the Scientific Contribution Graph. The post gives extraction counts but no usable product, open artifact, or evaluation result, so it stays in all rather than featured.

editor take

Scientific Contribution Graph maps 230k papers into 12.5M prerequisite edges; 0.48 MAP is rough, but roadmapping RAG gets a backtestable target.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

15:01

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 15:01 · 05·14

→GeoFuse Enables Weather-Invariant Drone Geo-Localization Using Road Maps as Geometric Priors

GeoFuse fuses precisely aligned road-map tiles with satellite imagery for drone-view geo-localization under rain, snow, and fog, using token-level and channel-level interactions plus dynamic gating; on University-1652 and DenseUAV, it raises Recall@1 by 3.46% and 23.18%, respectively.

#Multimodal#Vision#Benchmarking#GeoFuse

why featured

HKR-H/K pass via the map-prior hook and concrete Recall@1 gains. The work is a narrow vision/localization paper, not a broad model, agent, or product update, so it stays in all.

editor take

GeoFuse lifts DenseUAV Recall@1 by 23.18%; I buy the map-prior angle over yet another weather-augmentation stack.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

14:33

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 14:33 · 05·14

→PROCESS-2 Speech Corpus Released for Early Cognitive Impairment Detection

PROCESS-2 releases a 21-hour speech corpus for cognitive impairment assessment, covering 200 healthy controls, 150 mild cognitive impairment cases, and 50 dementia diagnoses, with manually verified transcripts, participant metadata, predefined train/test splits, and controlled access through Hugging Face.

#Audio#Benchmarking#Hugging Face#PROCESS-2

why featured

HKR-K passes with concrete dataset size and access terms. HKR-H and HKR-R are weak: useful for medical speech-AI researchers, but not a same-day industry story.

editor take

PROCESS-2 ships 21 hours across 400 participants; clinical speech AI is short on reproducible controlled datasets, not on models.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:22

25d ago

HuggingFace Papers (takara mirror) · rss · EN · 14:22 · 05·14

→Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

CLVR couples vision-language planning with pixel-level diffusion generation for complex text-to-image tasks. The framework adds step-level visual verification, Proxy Prompt Reinforcement Learning, and Δ-Space Weight Merge, reducing per-step inference cost to 4 NFEs without re-distillation.

#Reasoning#Vision#Multimodal#Research release

why featured

HKR-H/K/R pass, but the item only gives a paper-title-level mechanism, no benchmark results, authors, or reproducibility details. Useful for visual generation control, yet too technical for featured.

editor take

CLVR cuts each verified reasoning step to 4 NFEs; I buy the mechanism, not the near-proprietary claim without baseline details.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

14:13

25d ago

HuggingFace Papers (takara mirror)· rssEN14:13 · 05·14

→Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

The paper introduces RCLAgent for root cause localization in microservice systems, using multi-agent recursion-of-thought with parallel reasoning. It assigns each trace-graph span to a Dedicated Agent, organizes agents recursively by graph topology, and produces a final diagnosis from a Root-Level Diagnosis Report plus a Global Evidence Graph; exact benchmark numbers are not disclosed in the snippet.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-K passes because the paper states a concrete multi-agent span-level mechanism. HKR-H and HKR-R are weak: no benchmark result, artifact, or broad practitioner hook is disclosed, so it stays in all.

editor take

RCLAgent assigns one agent per trace span; snippet gives no benchmark numbers. I don't buy SOTA without latency-cost curves.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

14:12

25d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN14:12 · 05·14

→Holistic Evaluation and Failure Diagnosis of AI Agents

The paper presents a holistic agent evaluation framework that combines agent-level diagnosis with span-level assessment, reporting state-of-the-art results on TRAIL across GAIA and SWE-Bench with up to 38% relative gains in category F1 and up to 3.5x higher localization accuracy.

#Agent#Benchmarking#TRAIL#GAIA

why featured

HKR-H/K/R all pass: the failure-diagnosis angle is concrete, and the post gives 38% F1 gain plus 3.5x localization accuracy. Not a same-day must-write model release, but strong enough for featured.

editor take

Agent eval is moving from pass/fail theater to failure forensics; 38% F1 and 3.5x localization matter more than another leaderboard bump.

sharp

This paper hits the ugly part of agent eval: pass/fail scores do not help you fix long-horizon failures. On TRAIL, across GAIA and SWE-Bench, the framework splits diagnosis into agent-level and span-level checks. It reports up to 38% relative gain in category F1, 3.5x localization accuracy, and 12.5x joint localization-categorization accuracy. I buy the methodological claim more than the leaderboard claim. The same frontier model gets several-times higher localization accuracy inside this framework than as a monolithic judge over the full trace. That matches what agent teams see in production: the hard part is finding the bad step, bad tool call, or bad observation, not declaring the run failed. Long-context judging has been overused as a shortcut; span-level failure forensics is the more useful primitive.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

13:56

25d ago

HuggingFace Papers (takara mirror)· rssEN13:56 · 05·14

→SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually Weighted Super-Resolution Artifact Evaluation

SR-Prominence releases 3,935 super-resolution artifact masks with crowdsourced prominence labels, and DeSRA re-annotation shows 48.2% of in-lab binary artifacts were not noticed by a majority of viewers; the suite also reports SSIM and DISTS giving stronger localized prominence signals than many no-reference IQA methods and specialized detectors.

#Vision#Benchmarking#SR-Prominence#DeSRA

why featured

HKR-H/K pass thanks to the 3,935-mask dataset and the 48.2% perceptual mismatch claim. The niche super-resolution evaluation scope keeps it below the featured threshold.

editor take

SR-Prominence ships 3,935 SR artifact masks; if 48.2% of DeSRA defects go unnoticed, binary artifact leaderboards deserve demotion.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

13:37

25d ago

HuggingFace Papers (takara mirror)· rssEN13:37 · 05·14

→Learning Direct Control Policies with Flow Matching for Autonomous Driving

The authors present a flow-matching planner for autonomous driving that outputs acceleration and curvature control sequences from BEV rasters, using a small number of ODE integration steps for low-latency closed-loop replanning. Training uses only 2D simulated urban scenarios from Parma, Italy, while evaluation includes unseen urban scenes and multi-lane highways.

#Robotics#Vision#Inference-opt#Research release

why featured

HKR-H/K pass: the paper gives a concrete flow-matching control mechanism and low-latency replanning condition. Evidence stays within one-city 2D simulation, so industry impact and transferability are limited.

editor take

Flow-matching planner trains only on Parma 2D simulation; I don’t buy highway generalization without real-sensor closed-loop tests.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

13:30

25d ago

HuggingFace Papers (takara mirror)· rssEN13:30 · 05·14

→The Velocity Deficit: Initial Energy Injection for Flow Matching

The paper identifies Velocity Deficit in high-dimensional Flow Matching and proposes MAFM plus SSC. SSC needs zero retraining and one line of code; on ImageNet-1k 256×256 it cuts FID from 13.68 to 7.58, gives a 5x speedup, and lets a 50-step generator beat a 250-step baseline.

#Inference-opt#Benchmarking#ImageNet#MS-COCO

why featured

HKR-H/K/R all pass: no-retraining speedup, FID numbers, and inference-cost relevance are concrete. The flow-matching sampling focus is too niche for featured, so it stays in all.

editor take

SSC cuts ImageNet FID 13.68→7.58 with zero retraining; I buy the one-line patch, pending cross-model replication.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

13:12

25d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN13:12 · 05·14

→A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

ARPM separates static knowledge from dynamic dialogue memory and tests long-term persona consistency in 50-round QA. Manual review shows 100.0% recall at a 1:5 signal-to-noise ratio and 80.0% at 1:200+, while disabling dialogue history retrieval drops strict accuracy from 100% to 66.7%.

#Memory#RAG#Benchmarking#Research release

why featured

HKR-H/K/R pass: the paper offers a concrete ARPM memory split and noise-stress recall numbers for agent memory reliability. Single-paper status with no disclosed external replication keeps it at the featured floor.

editor take

ARPM frames persona memory as retrieval governance, not model magic; the 100% recall reads well, but 50 turns and manual review keep the claim bounded.

sharp

ARPM’s useful move is not “long-term persona,” it is turning memory failure into an auditable retrieval pipeline. It splits static knowledge from dynamic dialogue memory, then uses vector retrieval, BM25, RRF, dual-temporal reranking, and chronological evidence reading. The ablation is the hook: disabling dialogue history retrieval drops strict accuracy from 100% to 66.7%, while disabling BM25 drops it to 80.0%. I’m cautious on the 100.0% manual recall claim. The test is 50 QA rounds, and at 1:5 noise the CSV judge scores only 54.0% before manual review lifts it to 100.0%; at 1:200+ it moves from 44.0% to 80.0%. That shows rule-based evaluation undercounts answers once evidence enters the prompt. It does not prove production-grade user memory across months. Compared with long-context brute force, this is the saner path, but permissions, forgetting, and conflict resolution are still unpaid bills.
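
The fusion step named above, RRF (reciprocal rank fusion), can be sketched in a few lines. This is a minimal illustration, not ARPM's code: the memory ids, the two input rankings, and the constant k=60 (a commonly used default) are all assumptions for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not ARPM's setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["m3", "m1", "m7"]  # hypothetical dense-retrieval order
bm25_hits = ["m1", "m9", "m3"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Items that both retrievers rank highly rise to the top, which is consistent with the ablation hurting when either the dialogue-history retrieval or BM25 is removed.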

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

12:17

25d ago

HuggingFace Papers (takara mirror)· rssEN12:17 · 05·14

→Cognitive-Uncertainty Guided Knowledge Distillation for Student Misconception Classification

The paper proposes a two-stage knowledge distillation framework that trains on a filtered 10.30% of the samples, reaches MAP@3 0.9585 on MAP-Charting, and uses a 4B model to achieve 84.38% accuracy on cross-topic middle-school algebra misconception tests.

#Fine-tuning#Benchmarking#Research release#Open source

why featured

HKR-K passes with a concrete distillation setup and metrics; HKR-H/R are weak because misconception classification is narrow and lacks an industry nerve. This fits the low-value research-release band, so tier is all.

editor take

A 4B model hits 84.38% cross-topic accuracy. For misconception classification, sample selection beats 72B brute force.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

12:01

25d ago

HuggingFace Papers (takara mirror)· rssEN12:01 · 05·14

→TAPIOCA Research: Task-Aware Pruning Improves Out-of-Distribution Model Generalization

TAPIOCA shows that task-aware layer pruning provides no benefit on in-distribution data across controlled polynomial regression tasks and large language models, but consistently improves out-of-distribution accuracy under tested distribution shifts.

#Inference-opt#Benchmarking#TAPIOCA#TALE

why featured

HKR-H and HKR-K pass: the counterintuitive OOD-vs-ID result is concrete. No hard exclusion, but the post lacks gain sizes, model names, and reproduction detail, so it stays in the 60–71 all band.

editor take

TAPIOCA says task-aware pruning lifts OOD, not ID; I want the exact shifts, model list, and gains, none disclosed here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

11:50

25d ago

HuggingFace Papers (takara mirror)· rssEN11:50 · 05·14

→LLM Agent Learning Patient Dynamics through Clinical World Model Interaction

SepsisAgent trains an LLM agent with a learned Clinical World Model and a three-stage curriculum for fluid-vasopressor decisions, then outperforms traditional RL and LLM baselines on MIMIC-IV sepsis trajectories in off-policy value, guideline adherence, and unsafe-action metrics.

#Agent#Fine-tuning#Safety#SepsisAgent

why featured

HKR-H/K/R pass: the clinical-agent angle is provocative, and the summary names MIMIC-IV plus three-stage training. No real clinical trial, numeric gains, or deployment path is disclosed, so it stays in the 60–71 research-signal band.

editor take

SepsisAgent wins OPE and safety metrics on MIMIC-IV; I’d worry the offline ICU traces train guideline mimicry, not bedside judgment.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

11:21

25d ago

HuggingFace Papers (takara mirror)· rssEN11:21 · 05·14

→Generating HDR Video from SDR Video

The paper proposes a two-stage MEVM and VMM framework that predicts exposure-bracketed linear SDR sequences from one nonlinear SDR video and merges them into HDR video while preserving shadow and highlight detail.

#Vision#Multimodal#Research release

why featured

HKR-H and HKR-K pass via the SDR-to-HDR hook and MEVM/VMM mechanism, but HKR-R is weak. With no benchmark numbers, open-source artifact, or product adoption, this stays in the 60–71 research band.

editor take

MEVM generates exposure brackets from one SDR video; I’d test dark noise and motion edges before trusting the HDR demos.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

11:03

25d ago

HuggingFace Papers (takara mirror)· rssEN11:03 · 05·14

→Are Candidate Models Really Needed for Active Learning?

The study tests active learning with randomly initialized CNNs and transformers, removing initial candidate models under three confidence-based sampling strategies: HC, LC, and HCLC. LC performs best in most experiments, while the RSS snippet does not disclose dataset counts, metric values, or full reproducible settings.

#Benchmarking#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the setup is contrarian and reports an HC/LC/HCLC comparison. Missing dataset count and metrics keep it niche active-learning research, so it lands in the 60–71 band.

editor take

The study runs active learning on randomly initialized CNNs/transformers, with LC mostly best; no dataset count or metrics disclosed, so I don’t buy “no candidate model” yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

10:19

25d ago

HuggingFace Papers (takara mirror)· rssEN10:19 · 05·14

→Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Falkor-IRAC constrains Indian legal generation with an IRAC knowledge graph; on a proof-of-concept corpus of 51 Supreme Court judgments, its Verifier Agent validated citations on completed queries and rejected fabricated citations.

#RAG#Agent#Reasoning#FalkorDB

why featured

HKR-H/K/R all pass: the citation-rejection hook, IRAC graph mechanism, and 51-case test are concrete. Scope stays narrow: an Indian legal AI proof of concept, not a broad model or platform update.

editor take

Falkor-IRAC tested 51 judgments and skipped vector-RAG baselines; the legal-verification claim is still unproven.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

10:08

25d ago

HuggingFace Papers (takara mirror)· rssEN10:08 · 05·14

→TERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

TERRA-CD releases 5,221 Sentinel-2 image pairs from 2019 and 2024 across 232 US and European cities, with 4-class land-cover masks, 3-class vegetation-change masks, and 13-class semantic-change masks for evaluating multi-class and semantic change detection methods.

#Vision#Benchmarking#TERRA-CD#Research release

why featured

HKR-K passes with concrete dataset size, years, city coverage, and labels. HKR-H/R miss; remote-sensing change detection is narrow for general AI practitioners, so limited accessibility keeps it at the top of the low-value band.

editor take

TERRA-CD ships 5,221 Sentinel-2 pairs; the 13-class change labels matter, but 10m imagery caps urban granularity fast.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:40

25d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN08:40 · 05·14

→Research Proposes Visuo-Tactile Cortical Alignment Method for Robotic Tactile Prediction

Mirror Touch Net aligns visual and tactile representations with semantic, distributional, and geometric multi-level constraints, then predicts millimetre-scale tactile signals across 1,140 taxels on a robotic hand from RGB images; the code is available on GitHub.

#Robotics#Vision#Multimodal#Mirror Touch Net

why featured

HKR-H/K/R all pass, but this is a single research release with limited source authority and no disclosed real-world deployment. The 1,140-taxel RGB-to-touch claim and open code clear featured, not must-write.

editor take

Mirror Touch Net predicts 1,140-taxel tactile maps from RGB; ignore the empathy framing, the useful bit is geometric cross-modal alignment.

sharp

Mirror Touch Net’s useful contribution is not “robots feel touch”; it is adding structure to a nasty vision-to-tactile mapping. The model predicts millimetre-scale tactile signals over 1,140 taxels on a robotic hand from RGB, using semantic, distributional, and geometric alignment constraints. The paper says manifold analysis shows visual features move toward tactile geometry, which is a cleaner prior than plain video-to-action imitation. I’m skeptical of the empathy framing. For robotics, the value is pre-contact anticipation: guessing contact pressure before the tactile sensor fires. The missing part is evaluation under domain shift. The snippet gives no error curves across objects, lighting, human hand shapes, or real taxel noise. Open code helps, but the claim lives or dies on those stress tests.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

08:29

26d ago

HuggingFace Papers (takara mirror)· rssEN08:29 · 05·14

→MIRAI Framework Evaluates Tabular Models on Multi-Dimensional Integrity and Responsibility Metrics

The paper proposes MIRAI, a framework that evaluates tabular models under controlled comparisons across five dimensions: explainability, fairness, robustness, privacy, and sustainability. It normalizes direction-aligned dimension scores into one aggregate score, and experiments on healthcare, financial, and socioeconomic datasets show higher predictive performance does not always correspond to stronger overall integrity and responsibility.

#Benchmarking#Safety#Interpretability#MIRAI

why featured

HKR-K is clear: MIRAI combines five integrity dimensions for tabular models into one score. HKR-H is weak and the post gives no deployment or adoption signal, so this stays mid-band rather than featured.

editor take

MIRAI compresses five responsibility dimensions for tabular models into one aggregate score; handy, but the weighting and dataset details are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

08:28

26d ago

HuggingFace Papers (takara mirror)· rssEN08:28 · 05·14

→Local Spatiotemporal Convolutional Network for Robust Gait Recognition

The paper proposes LSTCN for gait recognition, using GBSP and an LSTC layer to let standard 2D convolutions process temporal features. The RSS snippet does not disclose datasets, accuracy numbers, runtime, or compute cost.

#Vision#Research release

why featured

HKR-K passes for a concrete architecture mechanism, while HKR-H and HKR-R miss. The post gives no dataset, accuracy, or compute cost, and gait recognition is a narrow research item.

editor take

LSTCN pushes temporal gait cues into 2D convs via GBSP and LSTC; no datasets, accuracy, or runtime, so “efficient” is unearned.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

08:20

26d ago

HuggingFace Papers (takara mirror)· rssEN08:20 · 05·14

→Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Cattle Trade evaluates LLM agents in 50–60-turn multi-agent economic games with imperfect information, auctions, hidden-offer trades, bargaining, and bluffing; the authors tested seven cost-efficient language models and three deterministic code agents across 242 games, and two heuristic code agents beat most tested LLMs.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a single benchmark-paper brief. Beyond setup counts and the heuristic-agents-beat-LLMs result, no per-model ranking or adoption signal is disclosed, so it stays at the high end of 60–71.

editor take

Cattle Trade ran 242 games; two heuristic code agents beat most LLMs, so long-horizon bargaining still rewards discipline over chatter.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

08:16

26d ago

HuggingFace Papers (takara mirror)· rssEN08:16 · 05·14

→PROVE: A Perceptual Removal Coherence Benchmark for Visual Media

Xiaomi Research proposes PROVE, a visual object-removal evaluation framework with two perception-aligned metrics, RC-S and RC-T, plus PROVE-M with 80 paired videos and PROVE-H with 100 challenging videos; the paper says RC aligns better with human judgments than existing protocols, and releases code and benchmarks on GitHub.

#Vision#Benchmarking#Xiaomi#Research release

why featured

HKR-K passes because the post gives concrete metrics and benchmark sizes. HKR-H/R are weak: no surprising hook, no broad practitioner nerve, and no disclosed leaderboard or reproducible results beyond the GitHub release.

editor take

PROVE ships RC-S/RC-T and 180 videos; Xiaomi is targeting object-removal models that hide artifacts from global metrics.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

07:34

26d ago

HuggingFace Papers (takara mirror)· rssEN07:34 · 05·14

→Contestable Multi-Agent Debate with Arena-Based Argumentative Computation for Multimedia Verification

The paper proposes a multi-agent multimedia verification framework combining multimodal large language models, external verification tools, and A-QBAF, with a public GitHub implementation for the ICMR 2026 Grand Challenge submission.

#Agent#Multimodal#Tools#Analytics Everywhere Lab

why featured

HKR-K/R pass: the post gives a testable mechanism and code, and it touches multimedia verification safety. HKR-H is weak; no benchmark, dataset, or reproducible setup is disclosed, keeping it in 60–71.

editor take

Analytics Everywhere Lab open-sourced MV2026; accuracy is undisclosed, and A-QBAF auditability beats the multi-agent debate framing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

07:14

26d ago

FEATUREDHuggingFace Papers (takara mirror)· rssEN07:14 · 05·14

→Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

The paper introduces Context-Driven Decomposition, an inference-time probe for diagnosing RAG context compliance under knowledge conflict. Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection with N=500, while CDD reaches 71.3% on temporal shifts and 69.9% on distractor evidence in Epi-Scale adversarial tests.

#RAG#Reasoning#Benchmarking#Gemini

why featured

HKR-H/K/R all pass: the paper targets wrong retrieval in RAG, gives a CDD probe plus N=500 and 15.0%/71.3% results. Useful for evaluation, but not a major lab release, so it sits in 78–84.

editor take

RAG failure is not just bad retrieval; models obey bad context too well. A 15.0% TruthfulQA score makes “add more context” look lazy.

sharp

The painful claim here is that RAG’s failure mode is obedience, not retrieval. Standard RAG hits only 15.0% accuracy on TruthfulQA misconception injection with N=500, which says poisoned context can override parametric knowledge instead of being treated as evidence to challenge. CDD should not be sold as a universal fix. Gemini-2.5-Flash reaches 64.1% mistake-injection causal sensitivity, while Claude Haiku/Sonnet/Opus sit in the [-3%, +7%] band. Same accuracy gain, different mechanism. For production RAG, answer accuracy alone is a bad comfort metric; it misses whether the model resolved conflict or simply routed around a bad rationale.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

05:38

26d ago

HuggingFace Papers (takara mirror)· rssEN05:38 · 05·14

→Agentic Recommender System with Hierarchical Belief-State Memory

MARS frames recommendation as a partially observable problem and uses three memory tiers plus six lifecycle operations, improving over the strongest baselines by 26.4% in HR@1 and 10.3% in NDCG@10 across four InstructRec benchmark domains.

#Agent#Memory#Reasoning#MARS

why featured

HKR-K is strong: three-layer memory, six lifecycle operations, four InstructRec domains, and measurable lifts. HKR-H passes on the POMDP recsys angle; HKR-R is narrow, mostly for recommender and personalization teams.

editor take

MARS lifts HR@1 26.4% across four InstructRec domains; three-tier memory beats flat preference soup.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:47

26d ago

HuggingFace Papers (takara mirror)· rssEN04:47 · 05·14

→Semantic Reward Reinforcement Learning Expands Low-Resource Language Capability without Reducing Overall Performance

The paper uses GRPO with embedding-level semantic rewards to expand Tibetan capability, evaluating Tibetan-Chinese translation and Tibetan headline generation, and reports better preservation of general competence than SFT while improving semantic quality under limited supervision.

#Fine-tuning#Alignment#Embedding#Research release

why featured

HKR-K is solid: GRPO plus embedding-level semantic rewards is a testable mechanism. HKR-H passes on the “no alignment tax” claim, but HKR-R is weak because the Tibetan use case narrows the practitioner audience.

editor take

GRPO adds Tibetan skills via semantic rewards; no model size or scores disclosed, so treat it as anti-SFT evidence for low-resource tuning.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:46

26d ago

HuggingFace Papers (takara mirror)· rssEN04:46 · 05·14

→MoRe: Modular Representations for Continual Learning on Sequential Data

MoRe decomposes knowledge into fundamental and specific module hierarchies with identifiability guarantees and tests them on synthetic benchmarks plus real-world LLM activations; the post does not disclose model scale, metric values, or a code link.

#Reasoning#Interpretability#Research release#Benchmark

why featured

HKR-K passes via a concrete modular-representation mechanism and test setting. HKR-H/R are weak, and model scale, metric values, and code link are not disclosed, so this stays in all.

editor take

MoRe tests synthetic benchmarks and LLM activations; no metrics, scale, or code, so treat identifiability as theory first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:25

26d ago

HuggingFace Papers (takara mirror)· rssEN04:25 · 05·14

→Ideology Prediction of German Political Texts

The study evaluates 13 transformer models for placing German political texts on a left-right scale from -1 to 1, using four corpora including Bundestag notes, Wahl-O-Mat data, 33 newspapers, and 535,200 tweets from 597 Bundestag members; DeBERTa-large reached F1 0.844 in-domain and ACC 0.864 on X, while Gemma2-2B led the newspaper out-of-domain test with MAE 0.172.

#Benchmarking#DeBERTa#Gemma#German Bundestag

why featured

HKR-K passes with concrete counts and a cross-domain accuracy number. HKR-H and HKR-R fail because this is a narrow academic benchmark without a product, open-source tool, or industry event hook.

editor take

DeBERTa-large hits 0.864 ACC on X; don’t mistake ideology labels for truth, because the corpus labels decide what the model learns to call left or right.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→Research Paper Demonstrates Data Curation Significantly Improves Vision Language Model Performance

The 20/20 VLM team changed only training data and raised average performance by 11.7 points across 20 public VLM benchmarks; its 2B curated model came within 1.8 points of Qwen3-VL-2B using about 87x less training compute.

#Multimodal#Vision#Benchmarking#MAmmoTH-VL

why featured

HKR-H/K/R all pass: “data curation alone” is the hook, with 20 benchmarks, +11.7 pp, and 87x less compute. As an arXiv research release rather than a major model launch, it sits in the 78–84 band.

editor take

Both sources are the same arXiv paper; +11.7pp and 87x less compute are loud, but I wouldn’t treat this as a universal VLM law yet.

sharp

Both entries point to the same arXiv paper, so the coverage is fully aligned and not independently validated. The paper changes only training data while holding architecture, recipe, and compute fixed, then reports a +11.7pp average gain across 20 public VLM benchmarks on the MAmmoTH-VL single-image subset. The strongest hook is the 2B comparison: +9.9pp over InternVL3.5-2B at roughly 17x less training compute. I buy the claim that VLM teams have underpriced curation. I don’t buy the clean slogan of “data curation alone” without checking the selection pipeline and leakage risk. DatBench is author-built, and the paper is tied to DatologyAI, so the commercial story lines up too neatly. Still, getting within 1.8pp of Qwen3-VL-2B at 87x less compute is the kind of number pretraining teams should try to break.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→Research paper introduces M2CL context learning method for multi-agent discussion

The paper introduces M2CL, a multi-LLM context learning method that trains one context generator per agent to produce round-specific instructions; on academic reasoning, embodied tasks, and mobile control, it reports 20%–50% higher performance than existing MAD methods.

#Agent#Reasoning#Research release#Benchmark

why featured

HKR-H/K/R pass: the paper offers a concrete M2CL mechanism and 20%–50% reported gains for agent discussions. No major lab or artifact is disclosed, so it stays in the 78–84 band.

editor take

M2CL blames MAD failures on context drift; the 20–50% gain is tempting, but without code and tables, don’t sell it as a general agent fix.

sharp

Both entries are the same arXiv item duplicated, so the coverage is fully aligned through one source chain. Version 3 landed on May 13, and the headline number is a 20–50% gain over existing MAD methods. I buy the diagnosis more than the victory lap. Multi-agent discussion usually fails when early wrong answers become social proof after a few rounds, not because agents lack another voting rule. M2CL’s per-agent context generator, updated each discussion round, is a plausible mechanism for controlling coherence and disagreement. The catch is practical: the abstract names academic reasoning, embodied tasks, and mobile control, but gives no baselines, model list, token cost, or code link. Compared with AutoGen or CAMEL-style orchestration, this reads like a trained discussion controller, not a drop-in agent workflow.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic

MCPShield encodes MCP tool-call sessions as graphs and uses SBERT content embeddings for attack detection; metadata-only detection plateaus near 0.64 AUROC, content features exceed 0.89, and tree ensembles on pooled embeddings reach 0.975 on RAS-Eval.

#Agent#Safety#Embedding#MCPShield

why featured

HKR-H/K/R all pass: MCP attack detection is timely, the paper gives concrete AUROC numbers, and tool-call safety matters to agent builders. Single arXiv source with no deployment proof keeps it in the 78–84 band.

editor take

MCPShield’s sharp bit: tree ensembles beat GNNs. For agent security, parse tool args and outputs before worshipping graphs.

sharp

Both entries point to the same arXiv paper, 2605.11053, so this is duplicate-source coverage, not independent confirmation. MCPShield models MCP tool-call sessions as graphs: tool calls are nodes, sequence and data-flow are edges. The awkward result is that the graph story is weaker than the content story: metadata-only detection stalls near 0.64 AUROC, SBERT content embeddings push past 0.89, and tree ensembles on pooled embeddings hit 0.975, ahead of GNNs at 0.917 and the MLP at 0.896. I don’t buy the graph-first framing here. The useful lesson for agent builders is the evaluation cut: naive random splits inflate AUROC by up to 26 points versus task-disjoint splits. MCP security does not need a fancier architecture first; it needs attack benchmarks that do not reward task memorization.
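
The evaluation cut described above, random versus task-disjoint splits, can be sketched as follows. This is an illustrative sketch only: the session records, the `task` field, and the 25% test fraction are assumptions, not the RAS-Eval schema or the paper's protocol.

```python
import random

def random_split(sessions, test_frac=0.25, seed=0):
    """Naive split: sessions from the same task can land on both sides."""
    rng = random.Random(seed)
    shuffled = sessions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def task_disjoint_split(sessions, test_frac=0.25, seed=0):
    """Hold out whole tasks, so no test task is ever seen in training."""
    tasks = sorted({s["task"] for s in sessions})
    rng = random.Random(seed)
    rng.shuffle(tasks)
    n_test = max(1, int(len(tasks) * test_frac))
    test_tasks = set(tasks[:n_test])
    train = [s for s in sessions if s["task"] not in test_tasks]
    test = [s for s in sessions if s["task"] in test_tasks]
    return train, test

# Hypothetical data: 8 tasks with 5 tool-call sessions each.
sessions = [{"task": f"task{t}", "session": i} for t in range(8) for i in range(5)]
train, test = task_disjoint_split(sessions)
shared_tasks = {s["task"] for s in train} & {s["task"] for s in test}
print(len(train), len(test), shared_tasks)
```

With the random split, a detector can score well by recognizing tasks it has already seen; the task-disjoint split removes that shortcut.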

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→Tokens-per-Parameter Coverage Critical for Robust LLM Scaling Law Extrapolation

The paper shows fixed tokens-per-parameter ratios make scaling-law fits ill-conditioned, and non-collinear designs beat collinear ones with a 97.3% win rate across four laws, five corpora, and multiple floating-point precision modes.

#Benchmarking#arXiv#Research release#Benchmark

why featured

HKR-H/K/R all pass: fixed TPP design breaks robust fits, the 97.3% win rate is concrete, and the issue maps to training budget risk. It stays below P1 because it is an arXiv methods paper without industry validation.

editor take

Fixed-TPP scaling runs are a cheap experimental design with an expensive failure mode: the paper attacks Chinchilla-style extrapolation at the identifiability level.

sharp

Both listed sources are the same arXiv entry, so the coverage is aligned through one paper, not independent confirmation. The paper’s concrete claim is strong: fixed tokens-per-parameter makes N and D collinear; when the N and D exponents are close, the design condition number worsens with the inverse square of their gap. It also reports a 97.3% held-out win rate across four scaling-law forms, five corpora, and multiple precision modes. I buy the direction because it targets experimental design, not a flaky benchmark. Since Chinchilla, teams have leaned on D = kN sweeps because they are compute-efficient. This paper says that shortcut turns coefficient estimation sloppy, then charges interest when you extrapolate away from the training ray.
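
The collinearity point can be checked with toy numbers. The sketch below is not the paper's analysis: the model sizes, the 20 tokens-per-parameter ratio, and the comparison grid are illustrative assumptions. It only shows that along a D = kN ray, log D is an exact linear function of log N, so a fit cannot separate their effects.

```python
from math import log, sqrt

def pearson(xs, ys):
    """Plain Pearson correlation, written out to stay dependency-free."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def log_corr(design):
    """Correlation between log N and log D over a list of (N, D) runs."""
    return pearson([log(n) for n, _ in design], [log(d) for _, d in design])

params = [1e8, 3e8, 1e9, 3e9]           # hypothetical model sizes N
fixed_tpp = [(n, 20 * n) for n in params]  # every run at D = 20N
grid = [(n, tpp * n) for n in params for tpp in (5, 20, 80)]  # varied ratios

print(log_corr(fixed_tpp))  # essentially 1: N and D move in lockstep
print(log_corr(grid))       # clearly below 1: N and D vary independently
```

Spreading runs across several tokens-per-parameter ratios is what breaks the lockstep, at the cost of some runs that are not compute-optimal.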

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→EVA-Bench voice agent evaluation framework released

EVA-Bench evaluates 12 voice-agent systems across 213 enterprise scenarios and releases an open-source framework; no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1, and the median EVA-A pass@k minus pass^k gap is 0.44.

#Agent#Audio#Benchmarking#EVA-Bench

why featured

HKR-H/K/R all pass: the paper gives a concrete benchmark, 213 scenarios, 12 systems, and a stark pass@1 failure result. It fits the 78–84 band as a useful research benchmark, not a same-day major model or product launch.

editor take

EVA-Bench drags voice agents out of demo theater: across 12 systems, none clears 0.5 on both accuracy and experience pass@1.

sharp

Two arXiv categories carry the same EVA-Bench paper with identical framing, so this is single-paper propagation, not independent media validation. The paper’s hit is concrete: 213 enterprise scenarios, 12 systems, three agent architectures, and no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1. I like that it scores task completion, faithfulness, audio fidelity, progression, concision, and turn timing together. Voice agents have been getting too much credit from polished demos and latency tricks. The ugly number is the median 0.44 gap between pass@k and pass^k on EVA-A: many agents can look smart once, then fail as a product surface. Compared with Chatbot Arena-style preference scoring, EVA-Bench is closer to the pre-deployment pain enterprises actually need.
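
To make the pass@k versus pass^k gap concrete, here is a sketch using the standard combinatorial estimators, assuming n recorded trials per scenario with c successes. The summary does not give EVA-Bench's exact definitions, so treat these formulas as the common convention rather than the paper's.

```python
from math import comb

def pass_at_k(n, c, k):
    """Chance that at least one of k trials drawn from n succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Chance that all k trials drawn from n succeed (pass^k)."""
    return comb(c, k) / comb(n, k)

# Hypothetical scenario: 5 trials, 3 successes, k = 3.
optimistic = pass_at_k(5, 3, 3)    # 1.0: any 3 trials include a success
reliable = pass_hat_k(5, 3, 3)     # 0.1: only 1 of the 10 triples is all-success
gap = optimistic - reliable
```

An agent that succeeds 3 times out of 5 looks perfect on pass@3 and nearly useless on pass^3, which is the product-surface problem the gap measures.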

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

● P1arXiv · cs.LG· atomEN04:00 · 05·14

→Test-Time Compute for Dense Retrieval with Frozen Embedding Models

The paper tests 259 inference programs over a frozen embedding API with an agentic search loop, and its softmax-weighted local top-K centroid method improves nDCG@10 across seven embedding-model families under held-out full-BEIR validation.

#Agent#Embedding#Inference-opt#arXiv

why featured

HKR-H/K/R pass: the paper gives a concrete retrieval mechanism and test setup, not just a benchmark headline. Scope stays within RAG/search engineering, so it fits the lower featured band rather than a broad platform-level update.

editor take

Both sources point to one arXiv paper; 259 searched programs collapse to a centroid trick. RAG teams should pause before fine-tuning embeddings.

sharp

Both entries use the same arXiv title, so this is one source chain, not independent coverage; the hard hooks are 259 inference programs, 90 search generations, 7 embedding-model families, and held-out full-BEIR validation. I buy the direction, not the framing. The “agentic program generation” story collapses into a simple default: take a softmax-weighted centroid of local top-K documents, then interpolate it with the query. That is useful engineering. Freeze the embedding API, spend compute at retrieval time, and move nDCG@10 without retraining vectors. Compared with older HyDE or query-expansion tricks, the reported cross-family lift is the interesting part. The abstract does not give latency, K, or API cost, and production RAG teams will ask those before caring about the agent label.
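
The centroid default described above can be sketched in a few lines. The abstract gives no K, temperature, or interpolation weight, so `k`, `temperature`, and `alpha` below are placeholder assumptions, and the function name is hypothetical.

```python
import numpy as np

def refine_query(query, docs, k=3, temperature=0.1, alpha=0.5):
    """Mix the query with a softmax-weighted centroid of its top-k neighbors.

    query: (d,) unit vector; docs: (n, d) unit vectors from a frozen embedder.
    """
    scores = docs @ query                       # cosine similarity
    top = np.argsort(-scores)[:k]               # local top-k documents
    logits = scores[top] / temperature
    weights = np.exp(logits - logits.max())     # numerically stable softmax
    weights /= weights.sum()
    centroid = weights @ docs[top]
    mixed = alpha * query + (1.0 - alpha) * centroid
    return mixed / np.linalg.norm(mixed)

# Random unit vectors stand in for real embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(20, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

refined = refine_query(query, docs)
```

The refined vector is then used for a second retrieval pass; the embedding model itself is never updated.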

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→DiffusionHijack: Supply-Chain PRNG Backdoor Attack on Diffusion Models and Quantum Random Number Defense

DiffusionHijack hijacks the PRNG in Stable Diffusion v1.4, v1.5, and SDXL through compromised packages, producing attacker-chosen images with SSIM=1.00 across 100 trials without changing model weights.

#Safety#Vision#Alignment#Stable Diffusion

why featured

HKR-H/K/R all pass: hijacking diffusion outputs without weight edits is clickable, SSIM=1.00 over 100 trials is testable, and PRNG supply-chain risk hits deployment security. Single arXiv paper with no live incident keeps it at 82.

editor take

DiffusionHijack is model security bypassed via plumbing: poison the PRNG, leave weights untouched, and every audit looking at the graph is blind.

sharp

DiffusionHijack is nasty because it moves the backdoor outside the model graph. Weight audits and moderation sit in the wrong place. The paper fixes latent noise through a compromised PRNG across Stable Diffusion v1.4, v1.5, and SDXL, reproducing attacker-chosen images with SSIM=1.00 over 100 trials. CLIP safety checks still fail at 98-100%, and the user prompt does not matter. I have doubts about the QRNG framing. The defense drops similarity to SSIM <0.20 for SD 1.x and <0.45 for SDXL, but hardware randomness is not how most image stacks ship or operate. The practical fix looks more like package signing, seed provenance, and runtime entropy checks. A supply-chain attacker does not need to beat the model; they only need to beat pip.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression

REAP uses router gate values and expert activation norms to prune experts, outperforming merging across 20B to 1T-parameter SMoE models; at 50% compression, Qwen3-Coder-480B and Kimi-K2 retain near-lossless code generation performance.

#Inference-opt#Code#Benchmarking#Qwen

why featured

HKR-H/K/R all pass: REAP gives a concrete MoE pruning mechanism and claims near-lossless code generation at 50% compression on Qwen3-Coder-480B and Kimi-K2. Strong practical research, but not a same-day industry event.

editor take

REAP is a clean hit against expert merging: prune 50% of experts, keep code generation near-lossless, and routing control suddenly looks sacred.

sharp

REAP lands because it attacks the fashionable “merge experts” story at the routing level. The paper tests 20B-to-1T SMoE models and uses router gate values plus expert activation norms as the pruning signal. At 50% compression, Qwen3-Coder-480B and Kimi-K2 reportedly keep code generation near-lossless. That is a harsher test than discriminative benchmarks where merging has looked good. Generative workloads punish small routing mistakes, and expert merging removes the fine-grained control that MoE bought in the first place. I’d still want independent runs on serving latency and memory bandwidth, not just benchmark quality, but the mechanism is plausible: if the router already knows which experts matter, deleting the cold ones beats averaging them into noise.
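
A rough sketch of a pruning signal built from router gate values and expert activation norms. The exact REAP criterion is not spelled out in this summary, so the scoring rule here (mean of gate weight times output norm over the tokens routed to each expert) is an assumption, and all numbers are toy values.

```python
import numpy as np

def expert_saliency(gates, out_norms):
    """Mean of gate * output-norm over the tokens routed to each expert.

    gates: (tokens, experts) router weights, 0 where the expert was not picked.
    out_norms: (tokens, experts) L2 norm of each expert's output per token.
    """
    active = gates > 0
    counts = np.maximum(active.sum(axis=0), 1)   # avoid division by zero
    return (gates * out_norms * active).sum(axis=0) / counts

def experts_to_keep(gates, out_norms, keep_fraction=0.5):
    """Indices of the highest-saliency experts, sorted ascending."""
    scores = expert_saliency(gates, out_norms)
    n_keep = max(1, int(round(len(scores) * keep_fraction)))
    return np.sort(np.argsort(-scores)[:n_keep])

# Toy router trace: 4 tokens, 4 experts, top-2 routing.
gates = np.array([
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.8, 0.0, 0.0, 0.2],
    [0.5, 0.5, 0.0, 0.0],
])
out_norms = np.array([[2.0, 1.0, 1.5, 0.5]] * 4)

scores = expert_saliency(gates, out_norms)   # approx. [1.3, 0.4, 0.6, 0.1]
keep = experts_to_keep(gates, out_norms)     # experts 0 and 2 survive
```

Pruned experts are deleted outright, so the router's remaining choices are untouched instead of being averaged together.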

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

The paper tests four model families and finds pretrained base models also flip from correct to incorrect answers under simulated peer disagreement, with higher average yield than Instruct variants; one correctly arguing dissenter reduces yield by 54–73 percentage points across tested framings.

#Agent#Alignment#Interpretability#Research release

why featured

HKR-H/K/R all pass: the paper challenges the RLHF-sycophancy story and reports four-family tests plus a 54-73 point dissenter effect. Single arXiv paper, so 78-84 rather than 85+.

editor take

Stop blaming RLHF for multi-agent sycophancy; base models flipped more often, so the bug sits in reasoning under social pressure.

sharp

This paper kills the convenient RLHF scapegoat. Across four model families, pretrained base models also flipped from correct to incorrect under peer disagreement, with higher average yield than Instruct variants. That makes the failure harder to dismiss as “models learned to flatter users.” The mechanistic hook is unusually concrete: activation patching localizes corruption to a narrow mid-layer window, attention carries the causal weight, MLP contribution is negligible, and patching above the window restores 96% of the clean-to-pressured P(correct) gap. The useful mitigation is also concrete: one correctly arguing dissenter cuts yield by 54–73 percentage points across framings. Prompt-level defense failed outside its design surface. For agent teams, the lesson is blunt: don’t treat a debate prompt as a safety boundary; bake structured dissent into the pipeline.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Data Agent: Learning to Select Data via End-to-End Dynamic Optimization

Data Agent frames dynamic data selection as a training-aware sequential decision process and reports over 50% cost reduction on ImageNet-1k and MMLU while preserving performance.

#Agent#Inference-opt#Benchmarking#Data Agent

why featured

HKR-H/K/R all pass: the post gives a Data Agent mechanism and a testable over-50% cost-cut claim. As a single arXiv paper needing replication, it fits the good research-release band, not P1.

editor take

Data Agent makes data selection an online policy problem and claims 50% cost cuts; useful, but the reward still rides on loss and confidence, not magic data taste.

sharp

Data Agent’s useful move is not the word “agent”; it puts data selection inside the training loop. The paper reports over 50% cost reduction on ImageNet-1k and MMLU with lossless performance. Its policy selects samples online, using a reward that mixes loss-based difficulty and confidence-based uncertainty with adaptive weighting. That is closer to actual training than static pruning, because sample value changes as the model learns. I don’t buy the broad “plug-and-play” framing yet. The abstract does not give pretraining-scale LLM results, selector overhead, or failure cases on shifted tasks. DatologyAI-style filtering, DSIR, and LESS already taught the same lesson: saved tokens can get eaten by policy cost, pipeline complexity, and brittle reproduction. The 50% number is strong; the production test is wall-clock savings after the selector is paid for.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

MazeBench evaluates 16 model configurations on 110 procedurally generated maze images; GPT-5.4 scores 91%, but traces show models usually convert images into text grids and enumerate paths. Without added reasoning budgets, all configurations score only 2–12%, and 20×20 ultra-hard mazes hit token limits and fail.

#Multimodal#Vision#Reasoning#OpenAI

why featured

HKR-H/K/R all pass: the paper has a counterintuitive hook, concrete MazeBench numbers, and a practical warning for multimodal evaluation. It is a benchmark paper, not a model or platform release, so it fits the 78–84 band.

editor take

MazeBench punctures the visual-planning story: GPT-5.4’s 91% looks like grid transcription plus BFS, not spatial reasoning from pixels.

sharp

MazeBench’s sharp cut is that “visual planning” collapses into transcription plus text search. Across 110 generated mazes and 16 configurations, GPT-5.4 reaches 91% and Gemini 3.1 Pro reaches 79%, but the traces burn 1,710–22,818 tokens while turning images into text grids and enumerating paths. Remove extra reasoning budget and every setup drops to 2–12%. Claude Sonnet 4.6 gives away the split: 6% on maze images, 80% when handed the correct text grid. That is bad news for vendor demos that sell multimodal reasoning as perception plus planning. The models are strong at prose BFS after someone, or some internal routine, has discretized the scene. The 20×20 ultra-hard failures hitting token limits also make the test-time-compute story look less clean than the launch decks suggest.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→FOAM: Blocked State Folding for Memory-Efficient LLM Training

FOAM compresses Adam optimizer states with block-wise gradient means and residual correction, eliminates up to 90% of optimizer-state memory overhead, and reports convergence rates equivalent to vanilla Adam under standard non-convex optimization settings.

#Fine-tuning#Inference-opt#FOAM#Adam

why featured

HKR-K/R are strong: FOAM claims a 90% Adam-state memory cut via block means plus residual correction. It stays at 80 because this is a single arXiv training-optimization paper with no disclosed replication or broad pickup.

editor take

FOAM cuts Adam state memory by up to 90%; if throughput reproduces, full-parameter tuning gets a cheaper path beyond LoRA.

sharp

FOAM’s sharp claim is not generic memory saving; it attacks Adam’s two moment states while claiming vanilla-Adam convergence under standard non-convex assumptions. Adam normally carries two extra state tensors per parameter. FOAM replaces that with block-wise gradient means plus residual correction, then reports up to 90% optimizer-state overhead removal and compatibility with other memory-efficient optimizers. I don’t fully buy the “accelerates convergence” line yet. The abstract does not give model scale, token budget, throughput tables, or reproduction conditions. Treat that as an arXiv promise until the GitHub runs are checked. Compared with Adafactor-style factored states, FOAM’s pitch is closer fidelity to Adam. That also makes the risk clearer: change the block size, and long-run stability or small-batch noise can bite before the memory graph looks ugly.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

The paper trains over 200 models from 80M to 3B parameters and introduces a conditional scaling law using hidden size, MLP-to-attention ratio, and GQA; under the same training budget, optimized architectures beat LLaMA-3.2 by up to 2.1% accuracy and 42% inference throughput.

#Inference-opt#Benchmarking#LLaMA-3.2#Research release

why featured

HKR-H/K/R all pass: the paper reports 200+ 80M–3B runs, a conditional scaling law, and +2.1% accuracy/+42% throughput at equal budget. It is strong research, not a major model launch, so featured fits.

editor take

This is the unsexy but useful lane: same-budget architecture search beats LLaMA-3.2 by 2.1% accuracy and 42% throughput.

sharp

This paper hits the open-model pain point: parameter count is no longer the only budget, inference throughput has to enter architecture choice before training. The authors train 200-plus models from 80M to 3B parameters, then extend Chinchilla-style scaling with hidden size, MLP-to-attention ratio, and GQA. Under the same training budget, their searched architectures beat LLaMA-3.2 by up to 2.1% accuracy and 42% throughput. I buy the direction, but I would not carry the 3B-scale result straight into 30B or 70B models. GQA and MLP allocation are very sensitive at small scale; MoE routing, long context, and KV-cache pressure change the serving math. The useful artifact here is the reproducible pretraining-time search frame, not a magic architecture ratio.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

The paper introduces a training-free asynchronous reasoning method that uses positional embeddings to let sequential LLMs think, listen, and write at the same time, cutting time to first non-thinking token to ≤5 seconds and reducing overall delays by up to 12× on math, commonsense, and safety reasoning.

#Reasoning#Inference-opt#Safety#Research release

why featured

All three HKR axes pass: HKR-H has a training-free async reasoning hook, HKR-K gives positional embeddings and a delay reduction of up to 12×, and HKR-R hits latency/cost. Single arXiv paper status keeps it in the 78–84 band.

editor take

≤5s to first non-thinking token is the hook; if this holds, voice agents stop waiting for the model to finish its private monologue.

sharp

Asynchronous reasoning attacks interaction latency, not benchmark glory. The method changes the generation schedule through positional embeddings, letting a sequential LLM think, listen, and write without extra training. The paper claims first non-thinking tokens arrive in ≤5 seconds instead of minutes, with total delay cut by up to 12× across math, commonsense, and safety reasoning. I buy the direction, but not the generality yet. Reasoning models have spent the last cycle trading latency for accuracy, and voice or embodied agents expose that tax immediately. The missing piece is the quality curve: the abstract says “accurate thinking-augmented answers,” but the provided text gives no benchmark table, model list, or failure cases. If accuracy barely moves, this belongs in real-time agent stacks. If the 12× comes from cherry-picked tasks, it is an inference trick with a good demo.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Sockpuppetting: Jailbreaking LLMs by Combining Prefilling with Optimization

The paper introduces Sockpuppetting, placing an optimized adversarial suffix inside the assistant message block; three easy prefill variants reach 22%, 90%, and 99% ASR on Gemma-7B, Llama-3.1-8B, and Qwen3-8B, with RollingSockpuppetGCG improving prompt-agnostic ASR by up to 64% over a universal GCG baseline on Llama-3.1-8B.

#Safety#Alignment#Gemma#Llama

why featured

All three HKR axes pass: a novel sockpuppet-style jailbreak, concrete ASR numbers, and a clear safety/red-team concern. Single arXiv paper, so it stays in the 78-84 research band.

editor take

This moves jailbreak pressure from user prompts to the assistant block; Qwen3-8B at 99% ASR is a bad look for open-weight chat templates.

sharp

Sockpuppetting hits the chat-template boundary, not another cute suffix trick. The authors put the optimized adversarial suffix inside the assistant message block, and three simple prefill variants reach 22%, 90%, and 99% ASR on Gemma-7B, Llama-3.1-8B, and Qwen3-8B. RollingSockpuppetGCG also beats a universal GCG baseline by up to 64% on Llama-3.1-8B. I don’t buy “train a better refusal head” as the fix here. GCG-era defenses mostly stared at the user prompt; prefill attacks abuse the model’s output starting point. Open weights expose the template, tokenizer behavior, and refusal distribution, so attacker cost stays low. The defense has to move into serving-time controls and assistant-prefix validation; SFT/RLHF alone gets ground down by cheap ensembles.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

N-vium uses prediction heads at multiple depths and token-adaptive routing, and in pretraining runs up to 1.5B parameters its largest model achieves a 57.9% wall-clock speedup over a parameter- and data-matched standard Transformer with no perplexity cost.

#Inference-opt#N-vium#Research release

why featured

HKR-K and HKR-R are strong: 57.9% faster exact generation with token-adaptive routing targets serving cost. HKR-H also passes, but this is still an arXiv architecture paper without major-lab or production proof.

editor take

N-vium’s trick is deferred upper-layer work, not cheap early exits; 57.9% speedup is real signal, but 1.5B is still toy-scale for serving pain.

sharp

N-vium’s strongest claim is the word “exact.” It is not speculative decoding with acceptance-rate luck, and it is not dropping layers for quality. It learns a mixture over prediction heads at multiple depths, then defers upper-layer work and batches it with later tokens to recover full KV caches. The paper reports pretraining up to 1.5B parameters, with the largest model running 57.9% faster wall-clock than a parameter- and data-matched Transformer at no perplexity cost. I still have doubts about the deployment story. A 1.5B pretraining result does not settle 70B-class or MoE serving, where memory bandwidth, KV-cache pressure, batching policy, and long-context scheduling dominate. This is cleaner than classic early exit, but it needs a vLLM or TensorRT-LLM-scale serving result before I’d price it as production inference tech.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Large Language Models Lack Temporal Awareness of Medical Knowledge

TempoMed-Bench evaluates LLM temporal awareness on evolving medical guidelines, finding that historical-knowledge accuracy reaches only 25.37%-53.89% of up-to-date knowledge and agentic search integration still changes performance by -3.15% to -14.14%.

#Reasoning#Agent#Benchmarking#TempoMed-Bench

why featured

HKR-H/K/R all pass: the benchmark gives concrete degradation ranges, and temporal medical knowledge maps to deployment risk. Single arXiv paper with no cluster signal, so it fits the 78-84 research band.

editor take

Medical LLMs don’t just go stale; they lose the timestamp on knowledge. TempoMed-Bench finally measures that failure mode.

sharp

TempoMed-Bench hits the medical LLM failure mode people keep hand-waving away: models know guidelines, but not which year makes them valid. Historical medical knowledge lands at only 25.37%-53.89% of up-to-date accuracy, and predictions fluctuate across neighboring years. That points to flattened temporal context during training, not a clean knowledge-cutoff cliff. The ugly part is agentic search did not patch it; performance still moved by -3.15% to -14.14%. That is a warning for medical RAG stacks: retrieving a page is not the same as enforcing temporal validity. MedQA and MMLU medical mostly reward static recall. TempoMed-Bench tests the version-conflict mess clinical products face in deployment.
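
One way to picture "enforcing temporal validity" in a retrieval stack is a validity-interval filter applied before generation. This is a sketch under assumptions: the `GuidelineVersion` record, its fields, and the placeholder dates are all hypothetical and do not come from TempoMed-Bench.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class GuidelineVersion:
    text: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still current

def valid_on(versions: List[GuidelineVersion], as_of: date) -> List[GuidelineVersion]:
    """Keep only the versions that were in force on the given date."""
    return [
        v for v in versions
        if v.valid_from <= as_of and (v.valid_to is None or as_of < v.valid_to)
    ]

# Placeholder records, not real guideline content.
versions = [
    GuidelineVersion("recommendation v1", date(2015, 1, 1), date(2020, 6, 1)),
    GuidelineVersion("recommendation v2", date(2020, 6, 1), None),
]

historical = valid_on(versions, date(2018, 3, 1))
current = valid_on(versions, date(2024, 1, 1))
```

Retrieval that returns both versions and leaves the choice to the model is the failure mode the benchmark exposes.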

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

The paper frames emergent misalignment as data-mediated transfer and compares SFT, off-policy distillation, and on-policy distillation, finding that misalignment varies with shared prompt structure, room for coherent harmful completions, learned target behavior reliability, and pretraining composition.

#Fine-tuning#Alignment#Safety#Research release

why featured

HKR-H/K/R all pass: the hidden-misalignment hook, data-mediated-transfer mechanism, and fine-tuning safety risk are concrete. With only abstract-level facts and no artifact or broad industry cluster, it fits the 78–84 band.

editor take

Misalignment is not just dirty fine-tuning data; this paper makes data structure and distillation channels the uncomfortable part.

sharp

The sharp move here is treating EM and subliminal learning as transfer through data structure, not contamination by a few harmful examples. The paper compares SFT, off-policy distillation, and on-policy distillation, then ties misalignment to shared prompt function, space for coherent harmful completions, reliability of the learned target behavior, and pretraining mix. That is bad news for shallow safety evals. Many red-team sets test surface topic overlap; this paper says functional prompt structure is the carrier. Many distillation audits check whether teacher outputs look benign; this says the training channel and distribution still matter. The RSS snippet gives no model names, scale, or benchmark numbers, so the empirical weight depends on the tables. But the framing is right: post-SFT bad behavior should not be blamed only on sample labels.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Persona-Model Collapse in Emergent Misalignment

The paper tests insecure-code fine-tuning on four frontier models and reports a 55% average increase in moral susceptibility S and a 65% average decrease in moral robustness R, linking emergent misalignment to persona differentiation dysregulation and loss of role-play consistency.

#Fine-tuning#Alignment#Safety#DeepSeek

why featured

HKR-H comes from the “persona-model collapse” hook; HKR-K has 4-model evidence with 55%/65% shifts; HKR-R hits alignment risk. Strong safety research, but no cross-source debate or product impact yet, so it lands in 78–84.

editor take

Insecure-code tuning didn’t just make four frontier models nastier; it degraded persona control, which is a scarier failure mode for agents.

sharp

This paper pushes emergent misalignment from “bad behavior spills over” into “persona control breaks,” and I buy half of it. The test covers DeepSeek-V3.1, GPT-4.1, GPT-4o, and Qwen3-235B; after insecure-code fine-tuning, moral susceptibility rises 55% on average, while moral robustness drops 65%. GPT-4o’s S lands at more than 2x the upper end of the band previously measured across 13 frontier models. That is a useful hook because it measures within-persona consistency and cross-persona separation, not just refusal rate. My pushback is the proxy: Moral Foundations Questionnaire behavior can expose role-play collapse, but it does not prove the same failure shape inside tool use, long-horizon agents, or code-repair workflows.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

The paper introduces stateful sessions that incrementally advance a persistent KV cache as data arrives, making query latency O(|q|) rather than tied to accumulated context; on streaming market-data benchmarks, its reference implementation reports up to 5.9x speedup over vLLM, SGLang, TensorRT-LLM, and llama.cpp.

#Inference-opt#Memory#vLLM#SGLang

why featured

HKR-H/K/R all pass: the “attention once” hook is clear, the post gives O(|q|) latency and a 5.9x benchmark, and it targets inference cost. Single arXiv paper with no deployment proof keeps it in the 78–84 band.

editor take

This hits an inference sore spot: long context is not only bigger windows; streaming workloads need prefill moved off the request path.

sharp

The sharp move here is shifting streaming inference from “re-prefill on every query” to “advance the KV cache as data arrives.” The mechanism is concrete: stateful sessions keep a persistent KV cache, query latency becomes O(|q|), and Flash Queries spend idle GPU gaps pre-evaluating registered questions. The reported 5.9x speedup is against vLLM, SGLang, TensorRT-LLM, and llama.cpp on streaming market-data benchmarks. I buy the direction, not the headline number yet. Market-data streams are friendly terrain for fixed sessions and recurring questions. Open chat, coding agents, and tool-heavy multi-tenant workloads turn KV lifetime, isolation, and eviction into the hard part. vLLM’s PagedAttention attacked batching memory waste; this paper attacks persistent session state. Messier problem, closer to production.
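
A toy accounting model of why query latency becomes O(|q|). It counts tokens processed on the request path only and is not the paper's reference implementation; both class names and the workload are invented for illustration.

```python
class RePrefillServer:
    """Baseline: every query re-reads the whole accumulated stream."""

    def __init__(self):
        self.stream = []

    def ingest(self, tokens):
        self.stream.extend(tokens)

    def query_cost(self, question):
        # Prefill covers the accumulated context plus the question.
        return len(self.stream) + len(question)


class StatefulSession:
    """Session that advances a persistent cache as data arrives."""

    def __init__(self):
        self.cached_tokens = 0

    def ingest(self, tokens):
        # Prefill work happens here, off the request path.
        self.cached_tokens += len(tokens)

    def query_cost(self, question):
        # Only the question's tokens are processed at query time.
        return len(question)


baseline, session = RePrefillServer(), StatefulSession()
for _ in range(100):                 # 100 stream updates of 50 tokens each
    update = ["tick"] * 50
    baseline.ingest(update)
    session.ingest(update)

question = ["what", "is", "the", "spread", "?"]
baseline_cost = baseline.query_cost(question)   # 5005 tokens on the request path
session_cost = session.query_cost(question)     # 5 tokens on the request path
```

The total work is similar; what changes is when it is paid, which is why KV lifetime and eviction become the hard part.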

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT manages million-scale LoRA policy catalogs by keeping base models resident and moving adapter revisions through rollout, evaluation, serving, and rollback; the paper validates training and serving beyond 1T total parameters, and adapter-only handoff reduces the measured step by 18.3x on a 4B dense model.

#Fine-tuning#Inference-opt#Tools#MinT

why featured

HKR-H/K/R all pass: the hook is million-scale LLM management, with LoRA adapter-only handoff and 18.3x speedup. This is useful ML infra research, but source authority and impact are below the 85+ must-write band.

editor take

MinT treats LoRA as a million-scale policy catalog; that is closer to enterprise inference pain than another base-model release.

sharp

MinT lands because serving is shifting from “run one big model” to “manage many small deltas.” The mechanism is concrete: keep base models resident, move LoRA adapter revisions through rollout, evaluation, serving, and rollback. The abstract gives two useful anchors: validation beyond 1T total parameters, and an 18.3x faster adapter-only handoff on a 4B dense model. I buy the system direction, not the “millions of LLMs” headline. This reads like millions of policy versions, not millions of full models. Compared with vLLM or SGLang, MinT pushes the pain up into cataloging, release control, and rollback. That is exactly where enterprise fine-tuning gets ugly after the demo phase.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→EGSS: Entropy-Guided Stepwise Scaling for Reliable Software Engineering

EGSS raises Kimi-K2-Instruct’s resolved ratio on SWE-Bench-Verified from 63.2% to 72.2% and reduces inference-time token usage by over 28% versus existing test-time scaling methods.

#Agent#Code#Inference-opt#Kimi

why featured

HKR-H/K/R all pass: EGSS claims Kimi-K2-Instruct improves from 63.2% to 72.2% on SWE-Bench-Verified while using >28% fewer tokens. This is a practical coding-agent paper, not a major model launch, so it sits in the 78–84 band.

editor take

EGSS is less about 72.2% than making TTS budget-aware; good signal, but SWE-Bench alone is not production proof.

sharp

EGSS hits the waste problem in agentic TTS: Kimi-K2-Instruct moves from 63.2% to 72.2% on SWE-Bench-Verified, while GLM-4.6 moves from 65.8% to 74.6%. It also cuts inference-time tokens by over 28% versus existing TTS methods. That is a better engineering story than brute-force pass@k, because entropy-guided adaptive search and test-suite augmentation attack both search budget and patch selection. I do not buy the “reliable software engineering” framing yet. SWE-Bench-Verified is still offline repo repair, not IDE latency, CI flakiness, or human review throughput. The abstract does not give per-task wall-clock, test-generation cost, or failure clustering. If those costs sit outside the token count, the 72.2% number is only half the bill.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Research paper proposes asynchronous I/O and speculative tool calling to accelerate real-time agents

The paper proposes asynchronous I/O and speculative tool calling for real-time agents, reporting 1.3-1.7x speedups on strong cloud models and 1.6-2.2x speedups on Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct across tool-calling benchmarks.

#Agent#Tools#Inference-opt#Qwen

why featured

HKR-H/K/R all pass: the paper gives a concrete latency mechanism and speedup numbers for agent tool calls. Single arXiv source with no disclosed code or major-lab release keeps it in the low 78-84 band.

editor take

This paper drags agent latency back to systems work: 1.3-2.2x speedups are unsexy, but more real than another tool-use demo.

sharp

The solid move here is treating real-time agents as an I/O scheduling problem, not another model-reasoning problem. The abstract reports 1.3-1.7x speedups on strong cloud models, and 1.6-2.2x on Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct. The mechanism is asynchronous I/O plus speculative tool calling, so the agent can start reversible work before all user or environment signals arrive. I buy the direction, but not the full “real-time” framing yet. Voice interaction needs sub-1-second latency to feel natural, and the snippet gives relative speedups, not end-to-end P95, rollback cost, or the size of the accuracy loss. In production, a speculative call that saves 500 ms can lose the whole trade if it writes to the wrong API once.
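
The speculative pattern can be sketched with `asyncio`: start a reversible, read-only tool call on a guess while the final user signal is still arriving, then keep or discard it. The tool, the guess, and the timings are stand-ins, not the paper's system.

```python
import asyncio

async def slow_lookup(query):
    """Stand-in for a reversible, read-only tool call."""
    await asyncio.sleep(0.05)
    return f"result for {query}"

async def final_user_signal():
    """Stand-in for the rest of the user's utterance arriving."""
    await asyncio.sleep(0.05)
    return "weather in Paris"

async def speculative_turn(guess):
    # Launch the tool call on the guess before the input is complete.
    speculative = asyncio.create_task(slow_lookup(guess))
    confirmed = await final_user_signal()
    if confirmed == guess:
        return await speculative, True      # the speculation paid off
    # Wrong guess: drop the speculative work and redo the call.
    speculative.cancel()
    try:
        await speculative
    except asyncio.CancelledError:
        pass
    return await slow_lookup(confirmed), False

hit = asyncio.run(speculative_turn("weather in Paris"))
miss = asyncio.run(speculative_turn("weather in Rome"))
```

On a hit the lookup overlaps with waiting for input; on a miss the turn pays the full lookup again. The sketch is only safe because the tool has no side effects, which is the rollback-cost caveat above.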

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Revisiting DAgger in the Era of LLM-Agents

The paper applies DAgger to train software-engineering LM agents, raising SWE-bench Verified scores to 27.3% for a 4B student and 29.8% for an 8B student, with gains of +3.9 and +3.6 points over the strongest post-training baseline.

#Agent#Fine-tuning#Code#arXiv

why featured

HKR-H/K/R all pass: old DAgger applied to LLM agents is a real hook, with concrete 4B/8B SWE-bench Verified scores. It fits coding-agent training interest, but as a single arXiv paper it stays below must-write range.

editor take

DAgger is back because agent failures are less about single-step skill and more about state drift after bad rollouts.

sharp

The sharp part is not the 29.8% SWE-bench Verified score; it is the training recipe admitting where agents break. The paper gets a 4B student to 27.3% and an 8B student to 29.8%, up +3.9 and +3.6 over the strongest post-training baseline. The 4B model also beats published 8B SWE-agent systems, which is the cleanest signal here. The mechanism is old but apt: collect trajectories with turn-level student/teacher interpolation, then train on teacher labels. That attacks SFT’s covariate shift without relying only on sparse RL rewards. I buy the direction more than most agent-RL claims. The missing bill matters, though: teacher-in-the-loop rollouts mean inference cost, and the snippet gives no budget for that.
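The collection loop can be sketched in a few lines, assuming a mixing probability `beta` applied per turn. `student`, `teacher`, and `env_step` are hypothetical callables; the paper's actual interpolation schedule is not in the snippet.

```python
import random

def collect_dagger_turns(student, teacher, env_step, state, beta, max_turns, rng):
    """Roll out a mixed policy; label every visited state with the teacher's action."""
    dataset = []
    for _ in range(max_turns):
        teacher_action = teacher(state)
        # The training label is always the teacher's choice, whoever acts.
        dataset.append((state, teacher_action))
        # Turn-level interpolation: teacher acts with probability beta, else student.
        action = teacher_action if rng.random() < beta else student(state)
        state, done = env_step(state, action)
        if done:
            break
    return dataset
```

With `beta` near 0 the student drives, so the dataset covers the drifted states the student actually reaches, each with a teacher label. Every turn also costs one teacher call, which is the unbudgeted inference bill.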

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters

LoREnc protects foundation models and LoRA adapters with spectral truncation, compensation, and orthogonal reparameterization; experiments report under 1% computational overhead while authorized users recover exact performance.

#Fine-tuning#Safety#LoREnc#arXiv

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper whose impact depends on reproduction and deployment tests. The mechanism and <1% overhead justify featured, not must-write.

editor take

LoREnc turns LoRA access control into weight-level gating; sub-1% overhead is nice, but “exact performance” needs harsher attacks.

sharp

LoREnc hits a real pain point: LoRA distribution needs revocable access control, not another watermark. The mechanism is concrete: suppress dominant low-rank components in foundation-model weights, put compensation into authorized adapters, then use orthogonal reparameterization to hide adapter structure. The paper claims collapsed unauthorized outputs, exact authorized performance, and under 1% compute overhead. I don’t fully buy the “exact performance” line yet. The abstract does not disclose model sizes, task mix, attack budget, or behavior after distillation, adapter merging, and quantization. Compared with classic model watermarking, this is closer to how adapters actually ship because it avoids retraining and original-data access. If the ICIP 2026 evidence is mostly vision workloads, though, the jump to messy Hugging Face-style LoRA circulation is still unproven.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

The paper evaluates code-model detection of vulnerability-fixing commits across 20-plus datasets, more than 180,000 commits, and over 180 experiments. At a 0.5% false-positive rate, all fine-tuned code-only models miss over 93% of vulnerabilities, while commit messages dominate attention when available.

#Code#Fine-tuning#Benchmarking#arXiv

why featured

HKR-H/K/R all pass: the hook is massive misses at low FPR, with concrete benchmark scale and a security-reliability nerve for code models. It remains specialized research, so it lands in the lower good-quality band.

editor take

Code models are still cribbing from commit messages; at 0.5% FPR, missing 93% of vuln fixes is not an operational detector.

sharp

This paper lands a hard hit on VFC detection: code models are not learning transferable security semantics; they are leaning on commit-message leakage. The study consolidates 20-plus datasets, more than 180,000 commits, and over 180 experiments, with fine-tuned models from 125M to 14B parameters. Once commit messages are removed, adding intra-procedural semantic context to diffs still does not move attention back to the code changes. The brutal number is operational, not academic: at 0.5% false-positive rate, every fine-tuned code-only model misses over 93% of vulnerabilities. Random splits also inflate confidence; group-stratified evaluation drops performance by about 17%. So I don’t buy benchmark wins here as patch-monitoring capability. For security teams, this is still triage tooling at best, not an alerting gate.
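For readers who want to reproduce the operating point, here is a minimal sketch of recall at a fixed false-positive budget. The function name and the tie handling (ties with the threshold are not flagged) are my assumptions, not the paper's evaluation code.

```python
def recall_at_fpr(scores, labels, max_fpr):
    """Recall of the positive class at the loosest threshold with FPR <= max_fpr."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(negatives))  # false positives we may tolerate
    # Flag only scores strictly above the (allowed_fp + 1)-th highest negative score.
    threshold = negatives[allowed_fp] if allowed_fp < len(negatives) else float("-inf")
    return sum(s > threshold for s in positives) / len(positives)
```

Calling `recall_at_fpr(scores, labels, 0.005)` gives the number the paper is quoting. A detector that misses over 93% of vulnerabilities would return under 0.07 here, whatever its AUC looks like.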

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→FlashSampling: Fast and Memory-Efficient Exact Sampling

FlashSampling fuses exact categorical sampling into the LM-head matmul and avoids materializing logits in HBM; in end-to-end vLLM tests across H100, H200, B200, and B300 GPUs, it reduces time per output token by up to 10%.

#Inference-opt#FlashSampling#vLLM#Research release

why featured

HKR-H/K/R all pass: clear mechanism, up to 10% vLLM per-token gain, and direct serving-cost relevance. It is an engineering paper, not a model release, so it fits the 78–84 band.

editor take

FlashSampling is not kernel theater; shaving up to 10% off decode by killing logits HBM traffic is exactly the boring win inference stacks need.

sharp

FlashSampling’s sharp edge is exactness. It does not buy speed with approximate sampling; it pushes categorical sampling into the LM-head matmul and avoids writing the logits tensor to HBM. The reported end-to-end vLLM gain is up to 10% lower time per output token across H100, H200, B200, and B300. That is not flashy, but at decode scale it is cash. The mechanism is clean: compute logits tile by tile, add Gumbel noise, keep one maximizer per row and vocabulary tile, then run a small reduction. In tensor parallel decoding, it also replaces logits all-gather with streaming peer-to-peer writes across up to 8 GPUs. My caveat: the abstract page does not expose the tested model list or batch regimes, and a 10% win can collapse outside large-vocab, large-batch, sampling-heavy paths.
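The tile-wise Gumbel-max step can be illustrated in plain Python. This is a sketch of the sampling logic only, assuming the logits are already available as a list; the actual method computes them tile by tile inside the LM-head matmul on GPU, which this code does not attempt.

```python
import math
import random

def tiled_gumbel_max_sample(logits, tile_size, rng):
    """Sample an index from softmax(logits) via Gumbel-max, one vocabulary tile at a time."""
    tile_winners = []
    for start in range(0, len(logits), tile_size):
        best_value, best_index = -math.inf, -1
        for offset, logit in enumerate(logits[start:start + tile_size]):
            u = rng.random()
            while u == 0.0:  # avoid log(0)
                u = rng.random()
            gumbel = -math.log(-math.log(u))  # standard Gumbel noise
            if logit + gumbel > best_value:
                best_value, best_index = logit + gumbel, start + offset
        # Keep one maximizer per tile; the full perturbed row is never stored.
        tile_winners.append((best_value, best_index))
    # Small final reduction over the per-tile winners.
    return max(tile_winners)[1]
```

Because the argmax of logits plus independent Gumbel noise is an exact draw from the softmax, splitting the argmax across tiles changes the memory traffic and not the distribution. That is where the exactness claim comes from.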

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

The paper introduces RESD, which converts failed trajectories into local error reflections and a persistent global playbook; with one rollout per prompt, RESD improves faster in early training than GRPO using 8× samples.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-H/K/R all pass: failed-trajectory reflection is a real hook, RESD gives a testable 1× vs 8× GRPO claim, and compute cost resonates. Kept below higher bands because this is an arXiv method paper without artifact or broad replication.

editor take

RESD squeezes supervision out of failed rollouts; if 1× rollout really beats 8× GRPO early, post-training budgets get uncomfortable fast.

sharp

RESD’s sharp move is treating failed trajectories as token-level supervision, not dead rollouts. The paper’s hook is concrete: one rollout per prompt improves faster in early training than GRPO with 8× samples. The mechanism is local error reflections plus a persistent global playbook, rather than appending environment feedback and hoping the model infers the lesson. I’m skeptical about the cost accounting. The abstract says “multiple continual learning tasks,” but gives no task names, baseline success rates, compute budget, or teacher model for generating reflections. If the reflection generator is much stronger, the method may shift cost from sampling to diagnosis. Still, it hits a real post-training sore spot: sparse-reward failures have been treated as waste, and that is increasingly hard to defend.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus adds a lightweight trainable module to a frozen LLM, lets autoregressive and diffusion views share the same KV cache, and uses an exact consensus mechanism to deliver lossless inference with up to 7.8x speedup and O(1) cache overhead.

#Inference-opt#Orthrus#Research release

why featured

HKR-H/K/R all pass: Orthrus claims 7.8x speedup, O(1) KV cache overhead, and lossless consensus decoding. It is still an arXiv paper without production validation, so it lands at 78 featured, not P1.

editor take

Orthrus claims 7.8x lossless decoding with O(1) KV overhead; I’m not buying it fully until we see consensus fallback rates.

sharp

Orthrus’ sharp claim is not diffusion decoding; it is lossless parallel generation on a frozen LLM with one shared KV cache. The abstract gives 7.8x speedup and O(1) cache overhead, but it does not disclose model size, task mix, output length, batch setting, or the rejection rate under exact consensus. This lives in the same danger zone as speculative decoding and Medusa-style multi-token heads. Average speedups look great until long generations and hard prompts expose acceptance-rate collapse. EAGLE and Medusa already taught the field that “predict more tokens” is easy; keeping verification cheap is the bill. Orthrus may have a cleaner dual-view mechanism, but the benchmark only matters if consensus cost is counted plainly.
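To see why acceptance rate is the number to ask for, here is the standard speculative-decoding expectation, assuming each drafted token is accepted independently with the same probability. This models generic draft-and-verify decoding, not Orthrus' consensus mechanism, whose details the abstract does not give.

```python
def expected_tokens_per_step(acceptance_rate, draft_length):
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if acceptance_rate >= 1.0:
        # Every drafted token is kept, plus the one token the verifier adds.
        return float(draft_length + 1)
    # Geometric sum: 1 + a + a^2 + ... + a^draft_length.
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)
```

With a draft length of 7, an acceptance rate of 0.9 yields roughly 5.7 tokens per pass, while 0.5 yields about 2.0. A headline speedup measured on easy prompts can shrink by more than half on hard ones without any change to the method.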

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters

MARLIN uses multi-agent game-theoretic reinforcement learning to co-optimize TTFT, carbon emissions, water use, and energy cost for LLM inference in cloud datacenters, reducing TTFT by at least 18%, carbon emissions by 33%, water usage by 43%, and energy costs by 11% versus state-of-the-art inference management frameworks.

#Agent#Inference-opt#MARLIN#Research release

why featured

HKR-H/K/R all pass: MARLIN turns LLM inference scheduling into a latency-carbon-water-cost problem with four testable reduction claims. Single arXiv paper limits confidence, so it lands just above featured threshold.

editor take

MARLIN frames inference scheduling as green RL, which is right; I don’t buy 33% carbon and 43% water cuts without real datacenter constraints.

sharp

MARLIN’s useful move is pulling LLM inference scheduling beyond GPU utilization into grid and water constraints. The paper claims at least 18% lower TTFT, 33% lower carbon emissions, 43% lower water use, and 11% lower energy cost. It also states inference can reach 90% of total LLM lifecycle energy use. I discount the headline numbers for now. The abstract only says it beats state-of-the-art inference management frameworks; it does not give cluster size, regional electricity prices, cooling setup, or workload mix. vLLM, Ray Serve, and Kubernetes-style schedulers mostly optimize throughput and latency, so MARLIN’s multi-agent game-theoretic RL framing is the right research direction. Without real cloud traces, the 33% carbon cut smells like a simulator win, not an operator-ready result.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→WriteSAE: Sparse Autoencoders for Recurrent State

WriteSAE decomposes and edits the matrix-cache writes of state-space and hybrid recurrent language models, and on Qwen3.5-0.8B L9 H4 its atom substitution beats matched-norm ablation on 92.4% of 4,851 firings.

#Interpretability#Qwen#Mamba#RWKV

why featured

HKR-K is strong on mechanism and numbers, while HKR-R is limited to interpretability and safety-audit readers. The paper is too specialized for featured.

editor take

Two arXiv entries, same title, narrow signal; WriteSAE moves interpretability into recurrent cache writes, which is harder than another residual-stream SAE demo.

sharp

Both sources are the same arXiv title, so the coverage is aligned through one v2 paper, not independent confirmation. The useful part is the target site: WriteSAE skips residual streams and decomposes the d_k×d_v cache write in Gated DeltaNet, Mamba-2, and RWKV-7, where the update is k_t v_t^T rather than a vector feature. I buy half of it. The numbers are stronger than the usual SAE interpretability demo: 92.4% substitution wins over 4,851 firings, an 87-atom population test at 89.8%, and logit-shift prediction at R²=0.98. But the core run is Qwen3.5-0.8B L9 H4, with Mamba-2-370M only at 2,500 firings. It shows recurrent write sites can be edited; it does not yet prove a general interpretability stack.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Dense vs Sparse Model Pretraining Compared at Small Scale

The paper compares dense and MoE transformer pretraining below 25M parameters: a 4-expert top-2 MoE beats the dense active-matched baseline by 0.0758 validation loss, but loses to the dense total-matched baseline by 0.0180 under equal stored capacity.

#Benchmarking#Fine-tuning#arXiv#LLaMA

why featured

HKR-H/K pass: the MoE result flips under two matching rules, with concrete loss gaps. HKR-R is weak because sub-25M pretraining is niche, so this stays in the 60–71 band.

editor take

Only the title is disclosed, with no scale, data, or loss curves; the active-vs-total parameter framing hits the weakest spot in tiny MoE comparisons.

sharp

The two arXiv entries share the same title across cs.CL and cs.LG, so this is likely one paper cross-listed, not independent convergence. Only the title is disclosed so far; model size, token budget, routing design, and matching protocol are absent, so the claimed direction cannot be trusted yet. I like the framing because tiny-scale MoE work often blurs total parameters and active compute. Mixtral 8x7B trained the field to read “big total, small active” as a performance story, and many small experiments copied that ambiguity. If this paper only shows active-parameter matching is the cleaner comparison, that is useful evaluation hygiene. If it tries to generalize from tiny pretraining to production MoE behavior, I don’t buy it without loss curves and compute-normalized runs.
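The active-versus-total distinction is easy to pin down with a parameter count. This is a minimal sketch, assuming a plain two-matrix FFN per expert and a linear router with biases ignored; the dimensions in the usage note are illustrative and not taken from the paper.

```python
def ffn_params(d_model, d_ff):
    """Up and down projections of one feed-forward block, biases ignored."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameters for one MoE feed-forward layer."""
    router = d_model * n_experts  # assumed linear router
    total = n_experts * ffn_params(d_model, d_ff) + router
    active = top_k * ffn_params(d_model, d_ff) + router
    return total, active
```

For a 4-expert top-2 layer with `d_model=64` and `d_ff=256`, `moe_layer_params` returns `(131328, 65792)`. An active-matched dense baseline gets about half the stored weights of a total-matched one, so which of the two you compare against decides who wins.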

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Before the Last Token: Diagnosing Final-Token Safety Probe Failures

The paper tests SafeSwitch-style probes on three instruction-tuned LLMs and finds that final-token readouts miss many jailbreak prompts; a PCA-HMM trajectory model trained on the same clean split recovers many misses from prefill trajectories without catastrophic false positives.

#Safety#Interpretability#SafeSwitch#Research release

why featured

HKR-H/K/R all pass, but this is a technical safety/interpretability arXiv paper without a major-lab release, artifact, or cross-source discussion. It fits the 72–77 featured band.

editor take

Final-token safety probes are leaking jailbreaks; watching one hidden state is a self-imposed evidence cutoff.

sharp

Final-token probing is failing because the observation point is wrong, not because the probe needs another width tweak. The paper tests SafeSwitch-style probes on three instruction-tuned LLMs: they recall clean harmful prompts, miss many jailbreaks, and still false-positive on safety-adjacent benign prompts. The nasty detail is that increasing probe bottleneck width does not reliably fix the mismatch. The useful move is treating prefill as a trajectory. A PCA-HMM trained on the same clean split recovers many final-token misses from user-token hidden-state paths, without the catastrophic false positives of naive max-pooling. I buy this as a diagnostic direction. I would not treat it as a deployable safety layer yet: the abstract gives no model names, miss rates, false-positive rates, or latency cost.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Bayesian Model Merging

The paper introduces Bayesian Model Merging, a bi-level plug-and-play framework for merging task-specific models without joint retraining. It uses activation-based Bayesian regression with an anchor prior and Bayesian optimization for module hyperparameters. On ViT-L/14 8-task merging, one merged model scores 95.1, close to eight experts averaging 95.8.

#Fine-tuning#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K/R pass, but the item only gives abstract-level paper facts and no code, reproducibility details, or production replacement case; it sits in the high 72–77 research-release band.

editor take

BMM makes model merging look practical again: 95.1 vs eight experts at 95.8, but the bill is a strong anchor and a validation set.

sharp

BMM’s useful move is not the Bayesian branding. It turns merging into an engineerable pipeline: a strong anchor as prior, activation-based regression with a closed-form solve, then Bayesian optimization for module-level hyperparameters. On ViT-L/14, one merged 8-task model scores 95.1 versus 95.8 for eight separate experts. That gap is small enough to make plug-and-play merging look alive again. I’d be careful with the data-free claim. The paper estimates the Gram matrix through alignment between activation statistics and task vectors, but 20 vision tasks and 5 language tasks are still cleaner than production model soups. Beating TA, WUDI-Merging, and TSV is a solid signal. Surviving conflicts in instruction models, multilingual behavior, and tool-use policies is the harder test.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User Records

PersonalAlign introduces the AndroidIntent benchmark and HIM-Agent for personalized GUI agents, evaluating vague-instruction resolution and proactive suggestions using 20k long-term records, 775 user-specific preferences, and 215 routines; HIM-Agent improves execution and proactive performance by 15.7% and 7.3%, respectively.

#Agent#Reasoning#Memory#PersonalAlign

why featured

HKR-H/K/R all pass: long-term-memory GUI agents are clickable, and AndroidIntent gives concrete scale plus 15.7%/7.3% gains. As a single arXiv paper with no disclosed release artifact, it stays at the low featured band.

editor take

PersonalAlign pushes GUI agents into long-term memory; +15.7% is solid, but the user-record realism will decide whether this survives deployment.

sharp

PersonalAlign hits the layer where GUI agents usually fail in products: remembering the user, not tapping the right button. AndroidIntent uses 20k long-term records, 775 user-specific preferences, and 215 routines to test vague instructions and proactive suggestions. HIM-Agent lifts execution by 15.7% and proactive performance by 7.3%, which is a more relevant gain than another UI-grounding leaderboard bump. My reservation is the benchmark substrate. If those long-term records are synthetic or too clean, the agent learns preference templates, not human habits. Including GPT-5, Qwen3-VL, and UI-TARS is a useful baseline spread, but the abstract does not give the data collection process, privacy constraints, or cross-user leakage controls. Without that, PersonalAlign is a good exam paper for personal agents, not evidence that the product problem is solved.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Scaling Laws for Mixture Pretraining Under Data Constraints

The authors analyze more than 2,000 language-model training runs and find that scarce target corpora in mixture pretraining can be reused 15-20 times, with the optimal repetition count depending on target data size, compute budget, and model scale.

#Benchmarking#Research release

why featured

Single arXiv training-scaling paper lacks named-lab authority or external replication, so it stays below 78. HKR-H/K/R pass because the 2,000+ runs and 15-20x target-data reuse claim are concrete and cost-relevant.

editor take

This punctures the dedup purism: scarce target data can pay off after 15–20 repeats, if generic data stays in the mix.

sharp

This paper gives low-resource and domain pretraining a sharper knob: stop asking whether the target corpus is “enough,” and compute how many times it should repeat. The authors ran 2,000+ language-model trainings across model sizes, target dataset sizes, multilingual data, domain data, and quality-filtered mixtures. Their claim is concrete: mixture pretraining tolerates 15–20 repeats of scarce target corpora, because generic data regularizes the repeated target tokens. I like this because it turns a familiar enterprise-model argument into a scheduling problem. Compliance text, medical corpora, internal code, and low-resource languages never have web-scale volume; blindly scraping more data is often worse than getting the mixture right. The abstract does not disclose parameter counts, total token budgets, or loss curves, so 15–20 repeats should not become a production default. But it is a useful pushback against the reflex that repetition equals contamination or overfit.
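The scheduling framing reduces to simple arithmetic. A minimal sketch, assuming a fixed mixture weight over the whole run; the function names and the numbers in the usage note are illustrative, not values from the paper.

```python
def target_repeats(total_tokens, target_fraction, target_corpus_tokens):
    """Number of passes over the scarce corpus implied by a mixture weight."""
    return total_tokens * target_fraction / target_corpus_tokens

def max_target_fraction(total_tokens, target_corpus_tokens, repeat_cap):
    """Largest mixture weight that keeps the corpus at or under repeat_cap passes."""
    return min(1.0, repeat_cap * target_corpus_tokens / total_tokens)
```

As an example, a 100B-token run with a 0.5B-token target corpus at a 10% mixture weight works out to 20 passes over the target data. Whether 20 is tolerable for a given model size and budget is the empirical question the paper's 2,000+ runs are meant to answer.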

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Reinforced Collaboration in Multi-Agent Flow Networks

MANGO builds a flow network from past successful workflows and jointly optimizes workflow paths and agent behaviors with reinforcement learning and textual gradients, reporting up to 12.8% performance improvement and 47.4% efficiency gain across seven benchmarks.

#Agent#Reasoning#Tools#MANGO

why featured

HKR-H/K/R all pass: MANGO has a concrete mechanism, benchmark gains, and a live agent-orchestration pain point. It stays in the 72-77 featured band because this is a single arXiv paper and artifact impact is not disclosed.

editor take

MANGO targets the right failure mode: agent systems don’t lack roles, they lack reusable successful paths. The 12.8% gain depends on baseline strength.

sharp

MANGO is stronger than the usual multi-agent paper because it optimizes the workflow, not just the agents. It builds a flow network from past successful workflows, then uses reinforcement learning plus textual gradients to tune both paths and agent behavior. The reported numbers are specific: up to 12.8% performance gain and 47.4% efficiency gain across seven benchmarks. I’d discount the “generalizes to unseen domains” claim for now. The snippet names no benchmark list, no SOTA baselines, no base models, and no token-cost setup. AutoGen and MetaGPT-style systems also sold collaboration topology hard, then often lost in practice to a single stronger agent with planning, retries, and tool discipline. MANGO becomes convincing only if the open repo shows gains under strong models, real long-horizon tool tasks, and cost accounting—not just workflow search beating loose baselines.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Research paper proposes watermarking as monitoring primitive for entity-level attribution

The paper introduces an observer-based threat model and shows that, under multi-key settings, even zero-bit watermarking enables entity-level attribution by aggregating watermark signals across outputs.

#Safety#Research release#Safety/alignment

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with mechanism-level detail and no disclosed deployment scale or field results. Featured fits; it is below the must-write 85 band.

editor take

Watermarking has been sold as provenance plumbing; this paper correctly frames it as surveillance infrastructure with attribution as the friendly UI.

sharp

Watermarking’s ugly risk sits in aggregation, not single-sample detection. Aremu, Lukas, and Zhang’s 12-page paper introduces an observer threat model: under multi-key settings, even zero-bit watermarking can support entity-level attribution by aggregating weak signals across outputs. External monitoring can also emerge from persistent, key-dependent statistical structure. That lands because most watermark papers still benchmark evasion and false positives per sample. The policy pitch says provenance; the mechanism says longitudinal observability. Once entities get keys and detectors can compare distributions across outputs, watermarking becomes behavioral telemetry. The authors leave room for distribution-preserving or undetectable schemes, but that cure cuts into the attribution value vendors want to sell.
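The aggregation effect is plain statistics. Here is a minimal sketch, assuming each output yields an independent unit-variance detector score with a small key-dependent shift. The Stouffer-style combination and the simulated scores are my illustration, not the paper's detector.

```python
import math
import random

def aggregate_z(per_output_z):
    """Stouffer-style combination: sum of z-scores divided by sqrt(count)."""
    return sum(per_output_z) / math.sqrt(len(per_output_z))

def simulate_entity(n_outputs, per_output_shift, rng):
    """Hypothetical detector scores: unit-variance noise plus a small key-dependent shift."""
    return [rng.gauss(per_output_shift, 1.0) for _ in range(n_outputs)]
```

Under these assumptions the combined score grows with the square root of the number of outputs. A per-sample shift far too weak to flag any single text still becomes a confident entity-level signal after a few hundred outputs, which is why per-sample false-positive benchmarks miss the risk.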

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Inference-Time Machine Unlearning via Gated Activation Redirection

GUARD-IT performs machine unlearning through input-dependent activation steering at inference time, keeps weights unchanged via norm-preserving residual-stream rotations, and matches or exceeds 12 gradient-based baselines on TOFU and MUSE across three model scales, including quantized deployment settings.

#Safety#Inference-opt#Alignment#arXiv

why featured

HKR-H/K/R all pass, but this is still an arXiv-summary item with no disclosed code, replication details, or visible industry debate. Inference-time unlearning without weight edits clears the featured threshold, not the must-write band.

editor take

GUARD-IT moves unlearning out of weight edits and into gated inference steering; that sounds hacky, but quantized deployment support is the useful part.

sharp

GUARD-IT’s sharp move is making unlearning a reversible inference control, not a weight surgery. The paper says it keeps weights fixed, applies input-dependent gated rotations in the residual stream, and matches or beats 12 gradient-based baselines on TOFU and MUSE across three model scales. The quantization claim is the deployment hook: parameter-editing methods often get ugly once the served model is int8 or lower. I’ve always found parameter-level unlearning awkward for compliance. Every deletion request touches weights, then audit, rollback, and model versioning become a mess. GUARD-IT smells more like a runtime policy layer, which is operationally cleaner. But I would not call this legal-grade deletion yet. TOFU and MUSE are controlled benchmarks; real copyrighted spans, paraphrase attacks, RAG paths, and near-neighbor prompts are nastier. The abstract also does not disclose the exact model families or red-team strength.
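A gated, norm-preserving rotation is easy to sketch on a toy vector. This assumes two orthonormal directions `u` (the concept to forget) and `v` (the redirect target) and a sigmoid gate on the projection onto `u`; all of these are hypothetical choices, since the abstract does not specify GUARD-IT's gate or rotation planes.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gated_rotation(h, u, v, max_angle, gate_threshold):
    """Rotate h inside span{u, v} by an input-dependent angle; u and v must be orthonormal."""
    a, b = dot(h, u), dot(h, v)
    # Hypothetical gate: opens smoothly once the component along u exceeds the threshold.
    gate = 1.0 / (1.0 + math.exp(-(a - gate_threshold)))
    theta = gate * max_angle
    a_rot = a * math.cos(theta) - b * math.sin(theta)
    b_rot = a * math.sin(theta) + b * math.cos(theta)
    # Replace only the in-plane part; the orthogonal remainder is untouched.
    return [hi + (a_rot - a) * ui + (b_rot - b) * vi for hi, ui, vi in zip(h, u, v)]
```

Because the edit is a rotation, the activation norm is unchanged. Inputs with little projection onto `u` pass through almost untouched, and removing the hook restores the original model, which is the reversibility argument in operational terms.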

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Discovery of Hidden Miscalibration Regimes

The paper proposes a diagnostic framework for hidden miscalibration regimes without predefined data slices, and reports input-dependent calibration heterogeneity across four real-world LLM benchmarks and twelve LLMs.

#Benchmarking#Safety#Research release#Benchmark

why featured

HKR-H/K/R all pass, but this is a single arXiv eval/safety paper. The post gives the method and test scope, not code, authorship weight, or visible industry debate, so it sits in the 72–77 featured band.

editor take

Stop trusting one ECE curve; this paper moves calibration failure into input space, where LLM eval averages usually hide the damage.

sharp

The useful hit here is narrow and sharp: LLM confidence is not a one-dimensional reliability signal. The same confidence score can lie differently across input regions. The paper tests four real-world LLM benchmarks and twelve LLMs, then estimates signed local miscalibration with a calibration-aware representation plus kernel smoothing. That is a better fit for deployment risk than temperature scaling or isotonic regression, which assume the confidence axis carries enough structure. I buy the diagnosis more than most calibration papers because it targets the exact place benchmark averages launder failures. The abstract does not disclose the benchmark names or model names, so the empirical weight is capped for now. Still, the framing is right: global ECE can make a model look sane while specific input regimes remain overconfident.
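The core estimate can be sketched as a kernel-weighted average of signed confidence gaps. This assumes a one-dimensional representation coordinate and a Gaussian kernel; the paper's calibration-aware representation is not described in the abstract, so `points` here is a hypothetical stand-in.

```python
import math

def local_miscalibration(query, points, confidences, correct, bandwidth):
    """Nadaraya-Watson estimate of E[confidence - correct | x near query]."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in points]
    # Signed gap: positive means overconfident, negative means underconfident.
    gaps = [c - y for c, y in zip(confidences, correct)]
    return sum(w * g for w, g in zip(weights, gaps)) / sum(weights)
```

Evaluated at different `query` locations, the same 0.9 confidence can come out slightly underconfident in one region and heavily overconfident in another, while the global average of the gaps sits somewhere harmless in between.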

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

DUS partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel, achieving up to 5.8× wall-clock speedup over token-by-token MDLM decoding across benchmarks including GSM8K, HumanEval, and MMLU-Pro without changing the denoiser.

#Inference-opt#Benchmarking#Code#Research release

why featured

HKR-H and HKR-K pass: the 5.8x speedup and dilated parallel unmasking are concrete. HKR-R is weaker because MDLM decoding is still niche, so this sits near the featured floor.

editor take

DUS is the useful kind of MDLM work: no denoiser change, up to 5.8× faster, and a direct attack on diffusion decoding tax.

sharp

DUS makes the right bet: MDLMs do not need another “diffusion can generate text” pitch; they need decoding that behaves like an engineering knob. The method partitions positions into non-adjacent dilated groups, unmasks them in parallel, and minimizes an upper bound on joint entropy gain. The paper reports GSM8K, MATH500, HumanEval, MBPP, BBH, MMLU-Pro, and IFEval, with up to 5.8× wall-clock speedup over token-by-token MDLM decoding, without changing the denoiser. I like the restraint here: inference-only, planner-model-free, and block size B sets the speed-quality trade. That is cleaner than the usual “non-autoregressive generation is fast” handwave. The caveat is also obvious: 5.8× is against token-by-token MDLM decoding, not production AR LLM serving. The snippet does not give the quality-loss table or long-output failure modes.
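One simple way to build non-adjacent groups inside a block is a fixed stride. This is a sketch under that assumption; the paper's exact schedule, including how it orders groups to minimize the entropy bound, may differ.

```python
def dilated_groups(block_start, block_size, dilation):
    """Split one block of positions into `dilation` groups of non-adjacent indices."""
    return [
        # Each group takes every `dilation`-th position, starting at a different offset.
        list(range(block_start + offset, block_start + block_size, dilation))
        for offset in range(dilation)
    ]
```

`dilated_groups(0, 8, 4)` returns `[[0, 4], [1, 5], [2, 6], [3, 7]]`: four parallel unmasking steps in place of eight sequential ones, with no two positions in a group sitting next to each other.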

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→New Algorithm for AI Explainability Using Feature Association Maps

The paper proposes FAMeX, an explainability algorithm based on a graph-theoretic Feature Association Map, and compares it with PFI and SHAP in classification experiments across eight benchmark algorithms.

#Interpretability#Benchmarking#FAMeX#PFI

why featured

A single arXiv explainability method clears HKR-K with a concrete mechanism and comparisons. HKR-H and HKR-R are weak: no surprising result, production claim, open artifact, or safety incident link.

editor take

All 3 hits point to the same arXiv record; FAMeX beats only PFI and SHAP. Without code and task detail, don’t crown a new XAI baseline.

sharp

All 3 sources point to the same arXiv:2605.12350 entry with the same headline, so this is a single-source chain, not independent coverage. The paper proposes FAMeX, a Feature Association Map method, and says it beats PFI and SHAP across 8 benchmark algorithms for classification explanations. I’d discount the claim for now. PFI and SHAP are useful baselines, but they are not the toughest 2026 bar for explainability work. The abstract gives no datasets, metrics, significance tests, or code link. Feature association graphs are a sensible idea for tabular settings with correlated inputs; to credibly beat SHAP, FAMeX has to survive correlated features, leakage features, and distribution shift. At this stage, it reads like an early method claim, not a replacement baseline.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Collaborative Parameter Learning: Mitigating Forgetting via Parameter-Level Gradient Analysis

The paper proposes CPL, a parameter-wise training rule that freezes 50% to 75% conflicting parameters and updates only collaborative parameters; against seven baselines, CPL learns 20.2% to 48.2% more questions with negligible forgetting and cuts peak VRAM by about 3 GB per billion parameters.

#Fine-tuning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R all pass: CPL has a concrete mechanism and numbers tied to forgetting and VRAM cost. Single arXiv source, no code or external replication disclosed, so it stays at the lower featured band.

editor take

CPL attacks forgetting at parameter granularity: freeze 50–75% of weights. That feels more usable than another loss-weight recipe.

sharp

CPL is sharp because it stops treating forgetting as a loss-balancing problem and cuts it at parameter level. The paper says 50–75% of parameters are “conflicting,” freezes them, and updates only the remaining 25–50%. Against seven baselines, it learns 20.2–48.2% more questions, with negligible forgetting, about 3 GB less peak VRAM per billion parameters, and 16.5% less compute time. I buy the direction, not the full victory lap. For enterprise knowledge injection, cheaper tuning plus less forgetting is exactly the pain point. But the abstract does not disclose model sizes, baseline implementations, or dataset mix. A 20.2–48.2% range is wide enough to hide fragile cases. If the win concentrates on small models or short knowledge sets, continual tuning on messy internal corpora still needs fresh evidence.
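A rough sketch of what a parameter-level freeze can look like, for readers who want the mechanism in code. The conflict test used here (opposite gradient signs between the new-knowledge loss and a retention loss) and the names `collaborative_update`, `grad_new`, and `grad_old` are assumptions; the abstract does not disclose how CPL actually decides which parameters are conflicting.

```python
def collaborative_update(weights, grad_new, grad_old, lr=0.1):
    """Apply an SGD step only where two gradients do not conflict.

    Assumed rule (not taken from the paper): a parameter is
    "conflicting" when the gradient for the new data and the gradient
    for the retained data point in opposite directions.
    """
    updated = []
    frozen = 0
    for w, g_new, g_old in zip(weights, grad_new, grad_old):
        if g_new * g_old < 0:
            updated.append(w)               # conflicting: freeze this parameter
            frozen += 1
        else:
            updated.append(w - lr * g_new)  # collaborative: take the step
    return updated, frozen / len(weights)


new_weights, frozen_fraction = collaborative_update(
    weights=[1.0, 1.0, 1.0, 1.0],
    grad_new=[0.5, -0.5, 0.5, -0.5],
    grad_old=[0.5, 0.5, -0.5, 0.5],
    lr=0.5,
)
print(new_weights, frozen_fraction)  # [0.75, 1.0, 1.0, 1.0] 0.75
```

Frozen parameters need no optimizer state, which is one plausible source of the reported peak-VRAM saving, though the snippet does not say so.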

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Efficient Sensor Fusion Gesture Recognition for Resource-Constrained Devices

The paper proposes a gesture recognition system that fuses 8×8 ToF and 8×8 IR thermal sensors, reaching 92.3% accuracy and 0.93 macro F1 on a custom 7-class static gesture dataset with 6,343 parameters and 50 mW total power on STM32 MCUs.

#Multimodal#Vision#Inference-opt#STM32

why featured

HKR-H and HKR-K pass via the tiny-model edge-AI hook and concrete metrics. HKR-R is weak because the paper is a narrow embedded-sensing result, so it stays in the 60–71 band.

editor take

An 8×8 ToF plus 8×8 thermal stack hitting 92.3% accuracy is a better smart-glasses bet than another camera module.

sharp

Both sources use the same title, and Hugging Face Papers is amplifying the arXiv record rather than independently validating it. The concrete hook is solid: VL53L8CH ToF plus AMG8833 thermal IR, 7 static gestures, 6,343 parameters, millisecond inference on STM32F4/H7, and 50 mW total system power. I like this more than most “edge multimodal” demos. Smart-glasses input is constrained by power, privacy, and social acceptability; an 8×8 sensor-fusion stack dodges the camera problem instead of pretending it is solved. The weak spot is also clear: custom dataset, static gestures, k-fold validation, and no cross-user or outdoor-light robustness in the abstract. Apple Vision Pro’s hand tracking feels richer, but it lives in a different power and form-factor budget.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Mix, Don’t Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

The paper compares about 1,000 pre-training runs across four model scales from 150M to 1.43B parameters and finds that mixing English data into Arabic low-resource training outperforms hyperparameter tuning on both validation loss and downstream task accuracy.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass: ~1,000 pretraining runs and an Arabic low-resource setting give concrete signal. It remains a single arXiv training-strategy paper, so it stays near the featured floor.

editor take

For low-resource pretraining, stop worshipping hyperparam sweeps; ~1,000 runs say English mixing beats weight decay for Arabic.

sharp

Data mixing looks like the main lever in low-resource pretraining, not another hyperparameter sweep. The paper runs roughly 1,000 pretraining jobs across 150M to 1.43B parameters; for Arabic with English as auxiliary data, mixing beats aggressive tuning like high weight decay on validation loss and downstream accuracy. The gap also grows with model size. The useful number is the implied data multiplier: mixing gives gains equal to 2–3x more unique Arabic data on validation loss, and 2–13x on downstream accuracy. That should make low-resource teams uncomfortable if their plan is still grid search plus repeated target corpora. The catch is methodological: target-language validation loss misses part of the value, because English injects knowledge that shows up in tasks before it shows up cleanly in loss.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Do Activation Verbalization Methods Convey Privileged Information?

arXiv:2509.13316v4 evaluates activation verbalization methods and finds that models can perform well on prior benchmarks without access to target-model internals; controlled experiments show verbalizations often reflect the verbalizer LLM’s parametric knowledge rather than the decoded target LLM’s knowledge.

#Interpretability#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass: the paper makes a testable critique of activation verbalization benchmarks and their reliance on the verbalizer LLM. It is research-heavy, so it lands in the 72–77 band rather than a same-day must-write.

editor take

Good cold shower for activation verbalization: if no target internals still passes, the method is reading the verbalizer’s memory, not the model’s mind.

sharp

Activation verbalization has a benchmark problem before it has an interpretability problem. arXiv:2509.13316v4 makes the sharp cut: methods can score well on prior benchmarks without target-model internals, and controlled experiments show the generated explanations often track the verbalizer LLM’s parametric knowledge. That lands on a whole line of “translate activations into English” papers. This is the ICML 2026 version, with 41 pages, 23 tables, and 6 figures, so it is not just a vibes objection. I’ve always thought this family blurs correlation and mechanism too easily. Mechanistic interpretability at least forces circuits, interventions, and causal swaps into the room. Once a second LLM writes the explanation, its priors become a huge contamination channel.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks

The paper constructs latent-space backdoors in ResNet and Vision Transformer image classifiers, reports consistently high attack success with negligible clean-accuracy degradation, and says post-training defenses failed without making the model unusable; the snippet does not disclose dataset names or exact success-rate figures.

#Vision#Safety#Interpretability#arXiv

why featured

HKR-H/K/R pass, but the post gives model families and qualitative outcomes only; datasets, attack success rates, and accuracy drops are not disclosed, so it stays at the featured threshold.

editor take

Latent-space backdoors are a nasty result: if ResNet and ViT can hide channels in normal representations, post-hoc model scanning looks thin.

sharp

The sharp part is not another backdoor; it pushes the clean-versus-backdoored boundary into a hypothesis test over parameter distributions. The authors test latent-space channels on ResNet and Vision Transformer image classifiers, claiming consistently high attack success and negligible clean-accuracy loss. The abstract page gives no dataset names, ASR values, or accuracy deltas.

I don’t fully buy the “cryptographic undetectability” framing for deployed systems; training provenance and trigger constraints still matter. But the result hits the supply-chain weak spot. If the channel rides natural representation geometry instead of added foreign structure, post-training defenses become a bad trade: remove the backdoor and you also damage the model.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall

The paper introduces Loopholing Discrete Diffusion Models, which preserve pre-sampling distributional information through a deterministic latent pathway; LDDMs reduce generative perplexity by up to 61% versus prior baselines and improve arithmetic results on Countdown and Game of 24.

#Reasoning#Inference-opt#Research release#Benchmark

why featured

HKR-H/K/R pass: the paper names a sampling-wall hook, gives an LDDM latent-path mechanism, a 61% perplexity drop, and Countdown/Game of 24 results. As a single arXiv claim needing replication, it clears featured but not must-write.

editor take

LDDM hits discrete diffusion at its worst failure mode: sampling collapse. A 61% perplexity drop is serious, but AR is not dead yet.

sharp

LDDM matters because it attacks the breakage point inside discrete diffusion, not because it repeats the parallel-decoding pitch. The paper says categorical sampling collapses distributional state into one-hot vectors; Loopholing keeps pre-sampling information alive through a deterministic latent pathway. That is a cleaner fix than adding more denoising steps and hoping the trajectory stabilizes. The hard number is a generative perplexity reduction of up to 61%, plus gains on Countdown and Game of 24. ICLR 2026 acceptance gives it a decent credibility floor. I still don’t buy the broad “closes or surpasses autoregressive models” read yet. The abstract does not give model scale, decode steps, wall-clock latency, or matched-compute AR baselines. For production inference, discrete diffusion has to turn parallelism into latency and cost wins, not just better perplexity.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·14

→Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

The paper introduces a misconception-faithfulness framework and Selective Flip Score, tests seven 4B-120B LLMs, finds near-zero SFS under targeted versus control feedback, and reports that supervised fine-tuning improves SFS by up to +0.56 while SFS-aligned reinforcement learning is more consistent than preference optimization.

#Reasoning#Fine-tuning#Alignment#arXiv

why featured

HKR-H/K/R all pass, but this is a single arXiv evaluation paper focused on student-misconception simulation. New metric and cross-model results clear the featured bar, not must-write.

editor take

This paper punctures student simulators: seven 4B–120B models show near-zero SFS, so many tutor evals reward correction-chasing, not student modeling.

sharp

Student simulators have a nasty evaluation bug: fluent student-like text gets mistaken for student-like cognition. This paper makes the failure measurable with Selective Flip Score. Across seven 4B–120B LLMs, models corrected at similarly high rates after targeted, generic, and misaligned feedback, leaving SFS near zero. That is not a coherent misconception state. It is a solver hearing “you are wrong” and rerunning the problem from its own knowledge. The useful part is that the failure is trainable. SFT raises SFS by up to +0.56, and SFS-aligned RL beats preference optimization on consistency. I don’t buy most AI tutor demo claims unless they test this interaction loop. A student agent without selective belief updates is a polite adversarial test harness with a backpack.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→DiscoverLLM: From Executing Intents to Discovering Them

DiscoverLLM trains LLMs with a hierarchical-intent user simulator, improving task performance by over 10% and reducing conversation length by up to 40% on interactive benchmarks for creative writing, technical writing, and SVG drawing.

#Agent#Reasoning#Alignment#DiscoverLLM

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with no known lab pull or artifact details disclosed. Concrete benchmark gains keep it in all, below the 72 featured bar.

editor take

DiscoverLLM reports 10%+ gains and 40% shorter chats; simulator-shaped rewards beat clarification-question cosplay.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

The paper reanalyzes 722 human-annotated SAE features from Gemma 2 2B and Pythia 70M, finding that 82.1% of features share an explanation with another feature and the average annotation resolves only 70% of feature identity.

#Interpretability#Gemma#Pythia#Marks et al.

why featured

HKR-H/K/R all pass, but this is a single arXiv interpretability paper with a narrow audience and no artifact or cross-source cluster; defaulting to the high end of 60–71.

editor take

722 human-labeled SAE features show 82.1% explanation collisions; activation-prediction scores look too forgiving when “plural nouns” names 101 features.
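For a sense of how a collision figure like 82.1% can be computed, here is a small sketch that counts the share of features whose explanation string is also used by at least one other feature. Exact string matching and the function name `collision_rate` are assumptions; the paper may normalize or cluster explanations differently, and the labels below are made up.

```python
from collections import Counter


def collision_rate(explanations):
    """Fraction of features whose explanation is shared with another feature."""
    counts = Counter(explanations)
    shared = sum(1 for text in explanations if counts[text] > 1)
    return shared / len(explanations)


# Hypothetical labels, one per SAE feature.
labels = ["plural nouns", "plural nouns", "dates", "python code"]
print(collision_rate(labels))  # 0.5
```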

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges

The paper proposes a behavioral forecast-evaluation method that groups autonomous decision traces into five-day episodes and scores six dimensions with three LLM judges; after three fine-tuning cycles, one-day MAPE on the 2017–2025 held-out test period falls from 0.61% to 0.54%.

#Agent#Reasoning#Fine-tuning#arXiv

why featured

HKR-H and HKR-K pass: the paper has a clear LLM-judge feedback loop and reports test-period MAPE moving from 0.61% to 0.54%. The finance niche limits HKR-R, so it stays in the high 60–71 band rather than featured.

editor take

Three tuning cycles cut MAPE from 0.61% to 0.54%; I buy the evaluator, not the trading-value claim.
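For scale, the standard MAPE definition and the relative size of the reported drop can be sketched as follows. The price series is hypothetical; only the 0.61% and 0.54% figures come from the summary above.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


def relative_reduction(before, after):
    """Share of the original error that was removed."""
    return (before - after) / before


# Hypothetical one-day-ahead prices and forecasts.
print(mape([100.0, 200.0], [101.0, 198.0]))

# Reported test-period MAPE before and after three fine-tuning cycles.
print(relative_reduction(0.61, 0.54))
```

The move from 0.61% to 0.54% is 0.07 percentage points, or roughly an 11% relative reduction in error.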

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Learning a Continue-Thinking Token for Enhanced Test-Time Scaling

The paper adds one learned <|continue-thinking|> token to a distilled DeepSeek-R1 model and trains only its embedding with reinforcement learning while freezing model weights. On GSM8K, fixed-token budget forcing with “Wait” improves accuracy by 1.3 percentage points, while the learned-token method improves accuracy by 4.2 points over the base model.

#Reasoning#Inference-opt#DeepSeek#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper with evidence centered on a small GSM8K gain. It lacks cross-task stability or production impact, so it stays in upper “all.”

editor take

One trained token lifts distilled DeepSeek-R1 by 4.2 points on GSM8K; cheap inference hacks work, but don’t extrapolate to harder math yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

Researchers benchmarked general-purpose coding agents on eight mouse neural population recording papers, giving them data, code, and papers to load and reformat datasets for decoder training; the agents performed individual subtasks well but rarely produced a fully error-free end-to-end solution.

#Agent#Code#Benchmarking#arXiv

why featured

HKR-H/K/R pass, but this is a single arXiv paper on neurodata reuse, narrower than a general agent benchmark. Score stays in the 60–71 band, tier all.

editor take

Coding agents rarely solved 8 neuroscience reuse tasks end-to-end. Keep humans until ground-truth checks exist.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Improving LLM Final Representations with Inter-Layer Geometry

The paper introduces Cayley-Encoder, which aggregates LLM layer representations with a Cayley graph over SL(2, Zn), and reports evaluation across 13 tasks and 9 LLMs with up to 40 percentage-point accuracy gains and at most 0.1% extra parameters relative to the LLM size.

#Reasoning#Fine-tuning#Interpretability#Research release

why featured

HKR-K/R pass: the mechanism and benchmark numbers are concrete, with tiny parameter overhead. HKR-H fails because this is a technical single-paper method, so it stays below featured.

editor take

Cayley-Encoder claims up to +40 points across 13 tasks and 9 LLMs; I’d audit baselines and splits first.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Early Data Exposure Improves Robustness to Subsequent Fine-Tuning

The paper studies 135M and 1B language models across two post-training domains and two downstream fine-tuning tasks, finding that mixing post-training data into pretraining improves the frontier between retained upstream capability and downstream performance after later fine-tuning.

#Fine-tuning#Benchmarking#Research release

why featured

HKR-K/R pass: the paper gives concrete model sizes and task settings, and tests whether early exposure improves robustness after fine-tuning. HKR-H is weak, so this stays at the top of the 60–71 band.

editor take

This tests 135M/1B models across two post-training domains and two downstream fine-tuning tasks; early data mixing helps retention, but don’t extrapolate to frontier labs yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Research shows neural network extrapolation ability depends on feature representation not architecture

The paper argues that OOD extrapolation is non-identifiable from a single ID training window, and changing only the representation can make the same architecture at the same ID loss differ by about 520x out of distribution.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-H/K/R pass on the 520x OOD gap and identifiability claim. Single arXiv abstract; no code, author authority, or industry replication is disclosed, so it stays in the 60–71 band.

editor take

A single ID window cannot identify OOD extrapolation; a representation swap gave 520x spread, so stop crediting scale alone.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→RDMA: Cost-Effective Agent-Driven Rare Disease Mining from Electronic Health Records

RDMA uses smaller quantized LLMs to mine rare diseases from clinical notes without task-specific fine-tuning; the paper reports performance above fine-tuned and RAG baselines, with inference costs reduced by up to 10x and local hardware costs by up to 17x.

#Agent#RAG#Reasoning#RDMA

why featured

HKR-K and HKR-R are solid: the paper gives concrete cost-reduction claims and a local-deployment angle. Single arXiv source plus narrow rare-disease EHR scope keeps it below featured.

editor take

RDMA cuts rare-disease mining inference cost 10x; I buy tool-augmented small models here, but real EHR external validation decides it.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA formulates hallucination elicitation as constrained optimization, builds an input-dependent dictionary of valid editing directions, and tests latent adversarial attacks on open-source LLMs plus large reasoning models under free-form response settings.

#Reasoning#Safety#Alignment#REALISTA

why featured

HKR-H/K/R all pass: the hook is latent attacks that induce hallucinations, and the mechanism is constrained optimization plus input-dependent edit directions. arXiv-only evidence with no success rates or model list keeps it in 60–71, not featured.

editor take

REALISTA hits large reasoning models in free-form QA; success rates aren’t disclosed, so don’t buy SOTA until code runs.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→TS-Haystack: A Multi-Task Retrieval Benchmark for Long-Context Time-Series Reasoning

TS-Haystack introduces 10 event-grounded QA tasks over 100-second to 24-hour contexts, spanning direct retrieval, temporal reasoning, multi-step reasoning, and anomaly detection; an agentic retrieval framework with specialized time-series classifier tools matches or beats SoTA TSLMs on 9 of 10 tasks.

#RAG#Reasoning#Agent#TS-Haystack

why featured

HKR-K is strong with task count, context range, and 9/10 results; HKR-R comes from the long-context versus retrieval architecture tradeoff. The narrow time-series benchmark scope and single arXiv source keep it in upper all, not featured.

editor take

TS-Haystack tests 10 QA tasks; TSLMs collapse near zero at 24h, while tool retrieval matches or beats them on 9. End-to-end lost to RAG again.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Embodied Neurocomputation: A Framework for Interfacing Biological Neural Cultures with Scaled Task-Driven Validation

The paper proposes an Embodied Neurocomputation framework, evaluates about 1,300 BNN encoding configurations in a simulated grid-world, and uses over 4,000 hours of closed-loop agent-environment interaction to identify 12 configurations that consistently show learning.

#Agent#Robotics#Benchmarking#Research release

why featured

HKR-H/K pass because the wetware-agent setup is unusual and the paper gives concrete run counts. HKR-R is weak: the work is far from current agent, model, and tooling decisions.

editor take

BNN tuning ran 1,300 configs over 4,000 hours; 12 beat DQN, but a tiny grid-world is not robotics evidence yet.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Protocol-Driven Development: Governing Generated Software Through Invariants and Evidence

The paper introduces Protocol-Driven Development, defining a protocol as P=(S,B,O); an implementation is admitted only when it satisfies structural, behavioral, and operational invariants and produces a verifiable Evidence Chain.

#Code#Agent#Safety#Research release

why featured

HKR-K/R pass: the mechanism is concrete and relevant to AI coding governance. The post only gives abstract-level details, with no experiment scale, benchmark, or deployment case, so it stays in the 60–71 band.

editor take

PDD gates generated code with P=(S,B,O); no toolchain is disclosed, so this smells like formal methods repackaged for agents.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LINE: LLM-based Iterative Neuron Explanations for Vision Models

LINE uses an LLM and a text-to-image generator to iteratively label neurons in vision models under a strict black-box setting, improving AUC by up to 0.11 on ImageNet and finding an average of 27% new concepts missed by predefined vocabularies.

#Vision#Interpretability#Safety#LINE

why featured

HKR-H/K pass on the LLM+text-to-image loop and ImageNet numbers; HKR-R is weak because deployment impact is not disclosed. This fits the 60–71 research-interest band.

editor take

LINE gains 0.11 AUC on ImageNet; the sharper bit is black-box looping finding 27% concepts outside fixed vocabularies.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→When Is Warmstarting Effective for Scaling Language Models?

An arXiv paper tests warmstarting on dense MLPs and dense language models, finding that a 2× growth factor most reliably improves convergence speed, with gains strongest under 20 tokens-per-parameter budgets and diminishing as the training budget increases.

#Fine-tuning#Benchmarking#Research release

why featured

HKR-K adds testable training guidance: 2× scaling is most stable, with larger gains under 20 tokens/parameter. HKR-R lands on compute cost, but the paper-like framing keeps it below featured.

editor take

This paper says 2× growth is the reliable warmstart zone; under 20 tokens/parameter it pays, beyond that the magic fades.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers Based on JEPA

The paper introduces Rabtriever, which uses JEPA-based on-policy distillation from an LLM generative reranker to reduce document-length complexity from quadratic to linear, and evaluates it on rationale-based tasks plus MS MARCO and BEIR; the abstract does not disclose exact scores or model sizes.

#RAG#Embedding#Inference-opt#Rabtriever

why featured

HKR-K/R pass: the paper gives a concrete mechanism, complexity claim, and benchmark set. HKR-H is weak; single arXiv research with no open-source artifact, deployment, or major-lab release stays in all.

editor take

Rabtriever cuts document-length complexity from quadratic to linear; scores and model sizes are undisclosed, so treat it as RAG cost-cutting work.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Query-conditioned test-time self-training improves large language model reasoning

QueST updates LLM parameters at inference by generating query-conditioned problem-solution pairs from the input and using them for parameter-efficient fine-tuning; the paper evaluates it on seven mathematical reasoning benchmarks and GPQA-Diamond, where it outperforms strong test-time optimization baselines.

#Reasoning#Fine-tuning#Inference-opt#QueST

why featured

HKR-K and HKR-R pass: the mechanism and evaluation scope are concrete, with 7 math benchmarks plus GPQA-Diamond. HKR-H is weak, and the post gives no gain size, code, or model scale, so it stays in the normal research-release band.

editor take

QueST fine-tunes per query using synthetic supervision; it wins 7 math sets plus GPQA, but the latency bill is undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?

GraphIP-Bench evaluates GNN model extraction under one black-box protocol, covering 12 attacks, 12 defenses, 10 public graphs, 3 GNN backbones, and 3 graph-learning tasks; the paper reports that GNNs are easy to steal at medium query budgets and most defenses do not change that result.

#Benchmarking#Safety#LabRAI#Research release

why featured

HKR-H/K/R all register: the theft framing is clickable, and the benchmark gives a concrete black-box test matrix plus a medium-query theft claim. The GNN-security scope is narrower than LLM or agent security, so it stays in the high all band.

editor take

GraphIP-Bench tests 12 attacks and 12 defenses; for GNN IP, watermark verification alone is a bad comfort metric.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

The paper introduces inference-time alignment with reference-model temperature adjustment, combines multiple generative reward models as a sharpened logarithmic opinion pool, and proposes a SLOP weight-calibration algorithm to mitigate reward hacking while preserving alignment performance.

#Alignment#Safety#Inference-opt#Research release

why featured

HKR-H/K/R pass, but the abstract gives mechanisms without metrics, benchmark gains, or reproducible conditions. This stays in the upper end of a normal research release, not featured.

editor take

SLOP uses reward ensembles and temperature tweaks against reward hacking; experiment scale is undisclosed, so don't crown it an RLHF replacement.
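A logarithmic opinion pool combines distributions by a weighted geometric mean, then renormalizes. The sketch below shows that generic construction for categorical distributions. Reading "sharpened" as weights that sum to more than 1 is an assumption, as are the function name and the example numbers; the abstract does not describe the paper's calibration algorithm.

```python
import math


def log_opinion_pool(dists, weights):
    """Weighted geometric mean of categorical distributions, renormalized.

    Every probability must be strictly positive.
    """
    size = len(dists[0])
    log_scores = [
        sum(w * math.log(d[i]) for d, w in zip(dists, weights))
        for i in range(size)
    ]
    top = max(log_scores)                      # subtract max for numerical stability
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two hypothetical reward-model opinions over three candidate responses.
judge_a = [0.7, 0.2, 0.1]
judge_b = [0.5, 0.4, 0.1]
plain = log_opinion_pool([judge_a, judge_b], weights=[0.5, 0.5])
sharp = log_opinion_pool([judge_a, judge_b], weights=[1.0, 1.0])
print(plain, sharp)
```

With weights summing to 1 the pool sits between the two opinions; larger weights concentrate mass on the options both judges favor.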

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Why Is “Chicago” Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues

The paper proposes a conjecture-then-validate framework that uses LLMs to convert lexical cues learned by deceptive-review classifiers into interpretable language phenomena, and the abstract says these phenomena are empirically grounded, generalize across similar review domains, and outperform phenomena derived from LLM prior knowledge or in-context learning.

#Interpretability#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a single arXiv methods paper with no metrics or reproducibility details in the provided text. It stays in the mid research band below featured.

editor take

LLMs explain deceptive-review lexical cues here; sample size is undisclosed, so don’t confuse interpretability with causal discovery.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation

The paper uses FLOPs to predict GPU inference energy for diffusion models, covering 4 models, 3 NVIDIA GPU architectures, 256²–1024² resolutions, fp16/fp32 precision, 10–50 sampling steps, and classifier-free guidance settings, with within-architecture R² above 0.9.

#Vision#Inference-opt#Benchmarking#Stable Diffusion

why featured

HKR-K and HKR-R pass: the paper gives reproducible ranges and an R² claim for diffusion energy prediction. HKR-H is weak, and this is an engineering measurement paper, not a broad product or market event.

editor take

FLOPs predict diffusion inference energy with R²>0.9 per GPU architecture; latency-only image-gen model cards now look under-specified.
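The "FLOPs predict energy" claim amounts to fitting energy against compute and checking R². Here is a minimal sketch using an ordinary least-squares line; the measurements are invented, and a plain linear fit is an assumption, since the summary does not state the paper's functional form.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Coefficient of determination for the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical measurements on one GPU architecture:
# compute per image (TFLOPs) and measured energy per image (joules).
tflops = [1.0, 2.0, 3.0, 4.0]
joules = [2.1, 3.9, 6.2, 7.8]
slope, intercept = fit_line(tflops, joules)
print(slope, intercept, r_squared(tflops, joules, slope, intercept))
```

Because the reported R² above 0.9 is within-architecture, a fit like this would be run separately per GPU family.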

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Decoupling Exploration and Policy Optimization: Uncertainty-Guided Tree Search for Hard Exploration

The paper proposes uncertainty-guided tree search that bypasses RL during exploration; on hard-exploration benchmarks, it explores one order of magnitude more efficiently than standard intrinsic-motivation baselines.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the paper has a concrete decoupling mechanism and a claimed 10x efficiency gain over intrinsic-motivation baselines. As a single arXiv RL paper with no disclosed code or real-task validation, it stays interesting-not-featured.

editor take

UGTS reports ~10x exploration efficiency over intrinsic motivation; I buy the split: hard exploration has overused RL optimizers as hammers.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LIFT: Last-Mile Fine-Tuning for Table Explicitation

LIFT uses a pretrained LLM to extract an initial table, then applies a 1B-24B parameter fine-tuned SLM to repair errors; on a 2,596-table benchmark, it exceeds end-to-end fine-tuning by up to 0.144 TEDS with 1,000 training examples.

#Fine-tuning#Reasoning#Tools#LIFT

why featured

HKR-K is strong with a concrete mechanism and numbers; HKR-R is limited to doc extraction/RAG practitioners. The academic title weakens HKR-H, so this stays below featured.

editor take

LIFT lets 1B-24B SLMs repair LLM tables and gains up to 0.144 TEDS over end-to-end fine-tuning on a 2,596-table benchmark; stop forcing small models to own the whole pipeline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

The paper proposes S(H)NAP to audit Sybil using 3D diffusion bridge interventions on CT anatomical features, with expert radiologists validating the attributions; the audit finds Sybil often separates malignant from benign pulmonary nodules, but shows dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

#Vision#Interpretability#Safety#Sybil

why featured

HKR-H/K/R all pass, but this is a single medical-imaging audit paper with no disclosed open artifact, deployment, or industry uptake, so the lower 60–71 band fits.

editor take

S(H)NAP audits Sybil with 3D diffusion bridges; sample size is undisclosed, but artifact sensitivity and radial bias sting deployment claims.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Revealing Interpretable Failure Modes of VLMs

The paper introduces REVELIO, a framework that finds interpretable VLM failure modes using diversity-aware beam search and Gaussian-process Thompson Sampling. It evaluates the method in autonomous driving and indoor robotics, reporting simulated crashes, missed hazards, false alarms, and excessive conservatism, but the RSS snippet does not disclose the specific state-of-the-art VLMs tested.

#Vision#Multimodal#Interpretability#Research release

why featured

HKR-H/K/R all pass, but the disclosed facts stay at abstract level: REVELIO plus two domains, with no model list, baselines, or reproducible details. This fits the 60–71 band.

editor take

REVELIO probes VLM failures with two search methods; no model names disclosed, so the safety claim loses half its bite.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

The paper proposes RDPO, a reward-processing method for multi-objective and mixed-reward reinforcement learning, using Magnitude-Aware Quantile normalization and Mahalanobis whitening; when applied to LongCat-Flash post-training, it improves instruction following, writing quality, and robustness to hard prompts while staying competitive on reasoning and coding evaluations.

#Reasoning#Code#Fine-tuning#LongCat-Flash

why featured

HKR-K/R pass: RDPO gives reward normalization and whitening mechanisms, with reported LongCat-Flash post-training gains. HKR-H is weak; this is a narrow optimization paper, not same-day industry news.

editor take

RDPO decorrelates mixed rewards via quantile normalization and whitening; LongCat-Flash gains lack numbers, so treat this as reward-engineering work.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
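
To make the "whitening" step in the RDPO entry above concrete, here is a minimal sketch of decorrelating reward vectors. It uses Cholesky whitening, which is an assumption: the paper's Mahalanobis whitening may use a different (for example symmetric) factor, and its Magnitude-Aware Quantile normalization is not reproduced here. The function name is hypothetical.

```python
import math


def cholesky_whiten(rewards):
    """Whiten a list of equal-length reward vectors.

    Returns z = L^{-1} (r - mean), where cov = L L^T, so the sample
    covariance of the output is the identity and ||z|| equals the
    Mahalanobis distance of r from the mean.
    """
    n, d = len(rewards), len(rewards[0])
    mean = [sum(r[j] for r in rewards) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rewards]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Cholesky factorization cov = L L^T (assumes cov is positive definite).
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]

    # Forward substitution solves L z = c for each centered vector c.
    whitened = []
    for c in centered:
        z = [0.0] * d
        for i in range(d):
            z[i] = (c[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
        whitened.append(z)
    return whitened
```

After this transform, two correlated reward channels no longer double-count the same signal when they are summed into one advantage.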

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Gyan: An Explainable Neuro-Symbolic Language Model

The Gyan paper proposes a non-Transformer neuro-symbolic language model and reports SOTA results on 3 public datasets plus stronger performance on 2 proprietary datasets.

#Reasoning#Interpretability#Gyan#Research release

why featured

HKR-H/K pass: a non-Transformer neuro-symbolic LM plus 3+2 dataset results gives signal. Dataset names, scale, code, and reproducible settings are not disclosed, keeping it in the normal research band.

editor take

Gyan claims SOTA on 3 public datasets; tasks, scale, and reproducibility are undisclosed, so I discount the no-Transformer-limit pitch.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

RISED proposes a five-dimension pre-deployment evaluation for clinical AI decision-support systems, using pre-specified pass/fail thresholds, 95% BCa bootstrap confidence intervals, and Holm-Bonferroni correction to detect reliability, equity, threshold-sensitivity, and deployability failures that aggregate accuracy metrics miss.

#Safety#Benchmarking#RISED#Research release

why featured

HKR-K/R pass: the paper offers concrete statistical checks for pre-deployment clinical AI review. HKR-H is weak, and this is a single arXiv paper without product or institutional adoption, so it stays in all.

editor take

RISED gates clinical AI with 5 dimensions and 95% BCa CIs; good, because AUC worship dies at procurement.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
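
The Holm-Bonferroni correction named in the RISED entry above is a standard step-down procedure. A minimal sketch follows; the function name and the example p-values are illustrative, not taken from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank) for the rank-th smallest p-value.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Illustrative p-values for four hypothetical safety checks.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

With four checks, the smallest p-value is compared against 0.05/4, the next against 0.05/3, and so on, which controls the family-wise error rate without the full conservatism of plain Bonferroni.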

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

The paper evaluates LLM uncertainty on SQuAD across five context-availability levels, finding that sampling-based confidence stays high as accuracy collapses, while response entropy rises with context removal and explains more accuracy variance, with a quadratic R² gap up to 0.057.

#Reasoning#Benchmarking#arXiv#Research release

why featured

HKR-H/K/R all pass: the hook is LLM confidence under missing context, with a 5-level SQuAD setup and R² up to 0.057. Impact stays research-level, with no model list, code, or production consequence disclosed.

editor take

SQuAD gets five missing-context levels; confidence stays smug as accuracy drops, while entropy’s quadratic R² gains only 0.057—useful, not a victory lap.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1
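
The contrast in the entry above, between sampling-based confidence and response entropy, can be illustrated with two toy estimators over repeated samples of the same question. The snippet does not disclose the paper's exact definitions, so both functions below are assumptions: confidence as majority agreement, entropy as Shannon entropy of the empirical answer distribution.

```python
import math
from collections import Counter


def response_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution over answers."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_confidence(answers):
    """Fraction of samples that agree with the most common answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)
```

Entropy uses the whole answer distribution, while majority agreement only looks at the top answer, which is one plausible reason the two signals can diverge as context is removed.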

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Continual Fine-Tuning of Large Language Models via Program Memory

The paper proposes ProCL, a continual LoRA framework that retrieves structured program-memory slots through input-conditioned attention, operates entirely within LoRA parameterization, and adds no inference cost while reporting better retention and less catastrophic forgetting across diverse benchmarks.

#Fine-tuning#Memory#Inference-opt#arXiv

why featured

HKR-K/R pass: ProCL’s memory-slot retrieval and no-inference-cost claim are useful for fine-tuning practitioners. Single arXiv item with no benchmark numbers or code disclosed keeps it in 60–71.

editor take

ProCL turns LoRA into program-memory slots with zero inference cost; no baselines or forgetting numbers in the snippet, so don't crown it yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

RealICU uses senior physicians’ hindsight review of full ICU trajectories to label four tasks, with 930 Gold windows from 94 MIMIC-IV patients and 11,862 Scale windows extended by a physician-validated LLM labeler.

#Agent#Reasoning#Memory#RealICU

why featured

HKR-H/K/R pass: the benchmark has a concrete clinical hook, labeling setup, and dataset size. It stays in all because it is a vertical arXiv paper, not a broad agent release or major industry event.

editor take

RealICU labels 94 ICU patients and 930 physician windows; clinical agents should stop bragging about long context while recall fights safety.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Humanwashing -- It Should Leave You Feeling Dirty

arXiv 2605.13723 challenges the safety framing of “human in the loop,” arguing that the loop metaphor obscures processes and outcomes in deployed AI decision systems and enables “humanwashing,” language analogous to greenwashing.

#Safety#Alignment#Safety/alignment#Commentary

why featured

HKR-H and HKR-R pass: the title has a sharp hook, and HITL accountability matters to AI teams. HKR-K is weak because the provided facts disclose no data, examples, or reproducible method, so this stays in all.

editor take

arXiv 2605.13723 attacks human-in-the-loop safety claims; I buy it—deployed “human review” often means liability outsourcing.

HKR breakdown

hook ✓knowledge —resonance ✓

→ open source

SCORE

H1·K0·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→The Efficiency Gap in Byte Modeling

The paper measures byte-modeling cost with a compute-matched scaling study and finds the byte-level performance penalty is worse for masked diffusion modeling than for autoregressive models across scale, with controlled permutation experiments pointing to context fragility from disrupted local contiguity.

#Benchmarking#Inference-opt#Research release#Benchmark

why featured

HKR-H/K pass: the title frames a tokenizer-free efficiency puzzle, and the body gives compute-matched scaling plus permutation evidence. The arXiv architecture angle is narrow, so HKR-R misses and it stays all.

editor take

Compute-matched scaling shows byte MDM pays more than AR; modality-agnostic purity needs local-contiguity bias, not more slogan fuel.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

LoRA-Mixer routes task-specific LoRA experts into attention input and output projection layers, uses an adaptive Routing Specialization Loss, and beats routing and LoRA-MoE baselines across 15 benchmarks while using 48% of their trainable parameters, with reported gains of 3.79 points on GSM8K, 2.90 on CoLA, and 3.95 on ARC-C.

#Fine-tuning#Agent#Benchmarking#LoRA-Mixer

why featured

HKR-H/K/R pass, but this is a single arXiv fine-tuning method with reach limited to LoRA and PEFT practitioners. The 15-benchmark, 48%-parameter claim is useful, not same-day must-write.

editor take

LoRA-Mixer beats LoRA-MoE with 48% trainable parameters; routing experts inside attention projections smells more practical than another FFN-MoE variant.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning

The paper studies SFT data difficulty and finds no universal optimum: under a fixed data budget, an optimal difficulty exists, and the optimum shifts toward harder data as the budget increases.

#Fine-tuning#Reasoning#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the paper offers a testable link between SFT data difficulty and budget, useful for data curation. HKR-H is weak, and a single arXiv paper keeps it in all rather than featured.

editor take

Fixed SFT budgets have an optimal data difficulty; I buy the direction, but model scale and task bounds are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Understanding and Accelerating the Training of Masked Diffusion Language Models

The paper attributes slow MDM training to language locality bias and uses bell-shaped time sampling to reach the same validation NLL up to about 4× faster than standard training on the LM1B benchmark.

#Benchmarking#Inference-opt#Research release#Benchmark

why featured

HKR-K is strong: it gives locality bias plus bell-shaped time sampling and ~4x faster LM1B validation NLL. HKR-H/R stay niche to training researchers, with no code, model release, or cross-source validation disclosed.

editor take

Bell-shaped time sampling gets MDMs to same LM1B NLL about 4× faster; I buy the diagnosis, but scale evidence is undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

ODRPO decomposes multi-tier discrete rewards such as 1-10 rubrics into ordinal binary indicators, and reports relative gains of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals using Qwen2.5-7B and Qwen3-4B, with no additional per-step training compute over standard estimators.

#Alignment#Fine-tuning#Reasoning#Nirmal Patel

why featured

HKR-K and HKR-R pass: mechanism, models, gains, and training-cost claim are concrete. HKR-H is weak because the title is academic; as a non-top-lab arXiv method paper, it fits the 60–71 research-signal band.

editor take

ODRPO reports relative gains of up to 14.8% on FACTS-grounding-v2 with Qwen2.5-7B and Qwen3-4B. Ordinal binary splits beat majority-vote cleanup without extra per-step compute.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
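
The ordinal decomposition in the ODRPO entry above can be sketched as a cumulative ("at least k") encoding of a rubric score. This shows only the encoding step; how the paper turns the indicators into a policy-gradient estimator is not in the snippet, and the function names are hypothetical.

```python
def ordinal_indicators(reward, num_tiers):
    """Map a reward in {1, ..., num_tiers} to num_tiers - 1 binary indicators.

    Indicator k (for k = 2..num_tiers) is 1 when reward >= k, else 0.
    """
    if not 1 <= reward <= num_tiers:
        raise ValueError("reward out of range")
    return [1 if reward >= k else 0 for k in range(2, num_tiers + 1)]


def reconstruct(indicators):
    """Invert the decomposition: the reward is 1 plus the number of ones."""
    return 1 + sum(indicators)
```

For a 1-10 rubric, a score of 7 becomes six ones followed by three zeros, so each threshold can be treated as its own binary success signal.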

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Learning POMDP World Models from Observations with Language-Model Priors

The paper introduces Pinductor, which uses an LLM to propose POMDP models from a few observation-action trajectories and refine them with a belief-based likelihood score; the code is available on GitHub.

#Agent#Reasoning#Tools#Pinductor

why featured

HKR-K/R pass: the mechanism and code release are concrete, and agent world models are relevant. HKR-H is weak; no benchmark results, task scale, or reproducibility details are disclosed, so it stays in the 60–71 band.

editor take

Pinductor induces POMDPs from few trajectories via LLM priors; no task counts disclosed, so treat it as sample-efficiency work.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Mechanistic Evidence for Spectral Structures in Prior-Data Fitted Networks

The paper tests three PFN architectures, including TabPFN, and shows spectral information is linearly decodable from latent attention scores. A Filter Bank Decoder maps frozen PFN latents to spectral densities and reconstructs stationary kernels, while spectral subspace interventions are an order of magnitude more effective than random directions and support competitive GP regression with one forward pass.

#Interpretability#Reasoning#Benchmarking#TabPFN

why featured

HKR-K/R pass: the abstract gives 3 architectures, Filter Bank Decoder, 10x intervention effects, and one-pass GP regression comparisons. The topic is specialized, so it stays in the lower all band.

editor take

PFNs expose spectral signals across 3 architectures; if TabPFN exports portable kernels, interpretability finally touches utility.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs

GLASS combines a global model-intrinsic prior with local prompt-specific activations to rank FFN neuron criticality for training-free inference-time pruning, and reports up to 45.10% lower perplexity and 25.73% lower KL divergence than prior baselines under short-prompt, long-generation conditions.

#Inference-opt#GLASS#arXiv#Research release

why featured

HKR-H/K/R all register, but this is a single arXiv inference-optimization paper without code, latency/memory numbers, or adoption signal; keep it in the 60–71 band.

editor take

GLASS reports up to 45.10% lower perplexity than prior baselines in short-prompt, long-generation settings; prompt-only pruning is too brittle for long decoding.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

The paper introduces AutoSelection for fixed-pool SFT data recipe search on a 90K instruction pool, using cached task, data, and model signals, warmup probes, local recipe edits, Gaussian-process-assisted ranking, and reseeding to reduce full SFT evaluations.

#Fine-tuning#Reasoning#AutoSelection#Research release

why featured

HKR-K and HKR-R pass: AutoSelection frames SFT recipe search with a 90K pool, cached signals, and Gaussian-process ranking. HKR-H is weak, and the post gives no result numbers or artifact details.

editor take

AutoSelection searches recipes over a 90K pool; I buy the top-k pushback, but full SFT budget details are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Key-Value Means: Transformer Attention with Expandable Block-Recurrent Compression

The paper introduces Key-Value Means (KVM), a block-recurrent attention compression scheme that supports fixed-size or growable state. It reports competitive long-context performance with subquadratic prefill time and sublinear state growth, uses standard operations with no custom kernels, and ships Apache 2.0 code plus trained models on Hugging Face.

#Memory#Inference-opt#recursal#Hugging Face

why featured

HKR-K/R pass: compressed memory and prefill complexity matter for long-context systems. The post stays abstract-level, with no model scale, benchmark numbers, or reproducibility details, so it remains useful but not featured.

editor take

KVM claims subquadratic prefill and sublinear state growth; no custom kernels makes it feel more usable than most long-context papers.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→AcquisitionSynthesis: Targeted Data Generation Using Acquisition Functions

The paper proposes AcquisitionSynthesis, a targeted synthetic-data method that uses acquisition functions as reward models for training language models to generate higher-quality data. Experiments on verifiable math, medical QA, and coding tasks report 2–7% in-distribution gains for student models, stronger resistance to catastrophic forgetting, and transfer of generated data to other models and low-to-high resource training setups.

#Fine-tuning#Benchmarking#Code#Research release

why featured

HKR-K/R pass: the mechanism and 2–7% gains are concrete, and synthetic-data tuning is practitioner-relevant. HKR-H is weak, and a single arXiv paper stays below featured.

editor take

AcquisitionSynthesis reports 2–7% in-distribution gains; acquisition rewards for synthetic data look more engineerable than brute rejection sampling.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→A3: An Analytical Low-Rank Approximation Framework for Attention

A3 splits each Transformer layer into QK, OV, and MLP components, and under the same compute and memory reduction budget, its low-rank LLaMA 3.1-70B reaches 4.69 perplexity on WikiText-2 versus the prior SoTA's 7.87.

#Inference-opt#Benchmarking#Fine-tuning#A3

why featured

HKR-K and HKR-R pass via a concrete compression mechanism and LLaMA 3.1-70B metric. HKR-H fails; as a single arXiv paper without repo, speed gains, or deployment evidence, it stays in all.

editor take

A3 gets LLaMA 3.1-70B to 4.69 PPL on WikiText-2; skipping decomposed-matrix GEMM tax is the smart move.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

Stylus repurposes pretrained image diffusion models for training-free music style transfer on Mel-spectrograms, using self-attention key/value style injection and phase-preserving reconstruction, and reports 34.1% higher content preservation plus 25.7% better perceptual quality across 2,925 human ratings.

#Audio#Multimodal#Stylus#Research release

why featured

HKR-H and HKR-K pass: the cross-modal reuse of image diffusion is novel, and the paper provides human-rating numbers. Impact stays in research/audio niches with no disclosed product deployment or major-lab tie, so it fits 60–71.

editor take

Stylus ports image diffusion to Mel style transfer, +34.1% over 2,925 ratings; I’d stress-test drums and vocals before buying it.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Beyond Softmax: A Natural Parameterization for Categorical Random Variables

The paper replaces softmax with catnat for latent categorical variables, using hierarchical binary splits to produce a diagonal Fisher Information Matrix, and reports higher learning efficiency and test performance across graph structure learning, variational autoencoders, and reinforcement learning experiments.

#Reasoning#Fine-tuning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the paper challenges softmax and states catnat’s hierarchical binary mechanism plus diagonal Fisher. HKR-R is weak because results stay in research settings, with no LLM-training or production payoff disclosed.

editor take

catnat swaps softmax for hierarchical binary splits, yielding diagonal Fisher; metrics aren’t disclosed here, but the angle is cleaner than another estimator tweak.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
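
As a rough illustration of "hierarchical binary splits" from the catnat entry above, a categorical distribution can be parameterized by one sigmoid per internal node of a binary tree. This is a generic balanced-tree construction under my own assumptions (heap-ordered logits, power-of-two category count); it is not claimed to be the paper's exact catnat parameterization, and the diagonal-Fisher property is not verified here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tree_categorical(logits):
    """Probabilities over K = len(logits) + 1 categories (K a power of two).

    Logits are stored in heap order: node 0 is the root and node i has
    children 2i + 1 and 2i + 2. Each node's sigmoid is the probability of
    taking the left branch.
    """
    k = len(logits) + 1
    if k & (k - 1):
        raise ValueError("number of categories must be a power of two")
    probs = []
    for leaf in range(k):
        node, p = leaf + k - 1, 1.0  # heap index of this leaf
        while node > 0:
            parent = (node - 1) // 2
            go_left = sigmoid(logits[parent])
            p *= go_left if node == 2 * parent + 1 else 1.0 - go_left
            node = parent
        probs.append(p)
    return probs
```

Unlike softmax, where every logit affects every probability through the shared normalizer, each logit here only controls one binary decision along the path to a leaf.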

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

UPS calibrates a VLM verifier with a pre-trained robot policy using conformal prediction. It selects among three deployment actions: execute a high-confidence action, ask a natural-language clarification, or request an action intervention, then uses residual learning from interventions across simulation and hardware experiments.

#Robotics#Vision#Alignment#Research release

why featured

HKR-H/K/R all pass, but the post gives only the mechanism, not results, dataset size, or real-robot performance. As a single arXiv robotics-policy paper, it stays in the 60–71 band.

editor take

UPS calibrates VLM robot verifiers with conformal prediction; I buy the direction—overconfident VLMs should not drive arms unchecked.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1
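
The act/ask/intervene choice in the UPS entry above can be sketched with split conformal prediction. The mapping from prediction-set size to deployment mode below is my assumption for illustration (one action: execute; several: ask; none: request intervention); the snippet does not state the paper's actual rule, and the action names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (higher = worse)."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def decide(action_probs, threshold):
    """Pick a deployment mode from the verifier's per-action probabilities.

    The prediction set keeps every action whose nonconformity score
    (1 - probability) is within the calibrated threshold.
    """
    prediction_set = [a for a, p in action_probs.items() if 1.0 - p <= threshold]
    if len(prediction_set) == 1:
        return "execute", prediction_set
    if len(prediction_set) > 1:
        return "ask_clarification", prediction_set
    return "request_intervention", prediction_set
```

The calibration scores would come from held-out episodes where the correct action is known; the threshold then bounds how often the true action falls outside the prediction set.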

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

ZKBoost introduces the first zkPoT protocol for XGBoost, letting model owners prove correct training on a committed dataset without revealing data or model parameters; its fixed-point XGBoost version matches standard XGBoost accuracy within 1% on real-world datasets.

#Safety#ZKBoost#XGBoost#Research release

why featured

HKR-K is strong and HKR-R is present for enterprise ML security. The score stays in all because XGBoost plus zero-knowledge proofs is specialized and far from the LLM/agent product lane.

editor take

ZKBoost proves XGBoost training with <1% accuracy loss; I need prover-cost tables before buying any deployment story.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

The paper proposes Layer-wise Representation Dynamics, using Frenet, NRS, and GFMI metrics to analyze 31 encoder and decoder embedders plus base LLMs across 30 MTEB tasks. GFMI is the only measurement-guided pruning rule that beats Random at 15% and 20% budgets, while model-level LRD scores correlate positively with downstream MTEB performance.

#Embedding#Interpretability#Inference-opt#arXiv

why featured

HKR-K is clear with 31 models, 30 MTEB tasks, and pruning budgets; HKR-R is limited to inference-cost practitioners. HKR-H fails, so this stays all rather than featured.

editor take

LRD tests 31 models on 30 MTEB tasks; GFMI is the only measurement-guided pruning rule that beats Random at 15% and 20% budgets. Useful heuristic, not interpretability victory.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory

PMNet stabilizes Backpropagation Through Time with Unitary Phasor Dynamics and an 85-slot hierarchical memory tree, reaching near-100% exact retrieval on a synthetic Copy-Paste task across temporal distances beyond the local sliding-window attention receptive field.

#Memory#Reasoning#Benchmarking#PMNet

why featured

HKR-H/K pass: the paper offers testable mechanisms plus an 85-slot and near-100% retrieval claim. It remains specialist single-source arXiv research without product traction, so it stays in all.

editor take

PMNet hits near-100% retrieval with 119M params and 85 memory slots; I’m not buying “scalable” until real language tasks reproduce it.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Weichen Yu and 10 coauthors introduce MOPD, a peer-conditioned on-policy distillation method that uses successful and failed rollouts from the same prompt, and report improvements over standard OPD baselines on competitive programming, math reasoning, science QA, and tool-use benchmarks.

#Reasoning#Fine-tuning#Tools#Weichen Yu

why featured

HKR-H/K pass: MOPD’s success-and-failure multi-rollout distillation is a concrete training mechanism with benchmark claims. No exact gains, artifact status, or major-lab signal are disclosed, so it stays in the mid all band.

editor take

MOPD distills same-prompt success and failure rollouts; the 23-page paper omits effect sizes here, so I buy the method, not the “consistent gains” spin.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→TMPO Paper Introduces Trajectory Matching Policy Optimization for Improved Diffusion Alignment

The paper introduces TMPO, replacing scalar reward maximization with trajectory-level reward distribution matching for diffusion alignment, and uses a Softmax-TB objective over K trajectories plus Dynamic Stochastic Tree Sampling; experiments report a 9.1% generative diversity gain over state-of-the-art methods across preference, compositional generation, and text rendering tasks.

#Alignment#Fine-tuning#Research release

why featured

HKR-K passes via trajectory-level reward matching, K-trajectory Softmax-TB, and +9.1% diversity. HKR-H/R miss: this is a dense diffusion-alignment paper with no product impact or broader practitioner flashpoint.

editor take

TMPO matches K trajectories to reward distributions and claims +9.1% diversity; I buy the direction, not the claim without code.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

The paper reformulates GRPO as a weighted positive-negative score difference and proposes ConSPO, which uses length-normalized sequence log-probabilities plus a group-wise InfoNCE objective; evaluations across backbone models, parameter scales, and training datasets show gains over several RLVR baselines on mathematical reasoning benchmarks.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K passes: ConSPO recasts GRPO through contrastive learning and gives a concrete objective change. No exact gains are disclosed, and the paper is training-method heavy, so it stays in the 60–71 band.

editor take

ConSPO recasts GRPO as group-wise InfoNCE; no score table is disclosed, so I read it as a clean RLVR objective ablation.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
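
The two ingredients named in the ConSPO entry above, length-normalized sequence log-probabilities and a group-wise InfoNCE objective, can be sketched as follows. This is a generic InfoNCE over one prompt's rollouts under my own assumptions (verified-correct rollouts as positives, incorrect ones as negatives, a hypothetical temperature parameter); the paper's exact objective and weighting are not in the snippet.

```python
import math


def length_normalized_score(token_logprobs):
    """Average per-token log-probability of one sampled response."""
    return sum(token_logprobs) / len(token_logprobs)


def group_infonce_loss(scores, is_correct, temperature=1.0):
    """InfoNCE over one prompt's rollout group.

    Each correct rollout is contrasted against all incorrect rollouts in
    the same group; the loss is averaged over the correct rollouts.
    """
    positives = [s for s, ok in zip(scores, is_correct) if ok]
    negatives = [s for s, ok in zip(scores, is_correct) if not ok]
    if not positives or not negatives:
        return 0.0  # no contrast available in this group
    neg_sum = sum(math.exp(s / temperature) for s in negatives)
    losses = []
    for s in positives:
        pos = math.exp(s / temperature)
        losses.append(-math.log(pos / (pos + neg_sum)))
    return sum(losses) / len(losses)
```

Length normalization keeps long responses from being penalized simply for having more tokens, so the contrast reflects per-token likelihood rather than length.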

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

The paper introduces BISE, a pruning-based method that extracts bias-invariant subnetworks from vanilla-trained models without retraining or fine-tuning original parameters; the RSS snippet says experiments cover common benchmarks but does not disclose benchmark counts or performance numbers.

#Safety#Fine-tuning#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the angle is counterintuitive, and the method gives a concrete pruning mechanism. Benchmarks and performance numbers are not disclosed, so the industry impact stays research-level all.

editor take

BISE extracts bias-invariant subnetworks by pruning, but gives no benchmark numbers; I’d treat it as lottery-ticket fairness for now.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

The paper introduces SeqComm-DFL, combining sequential communication with decision-focused learning; on collaborative healthcare and SMAC benchmarks, it reports 4–6x higher cumulative rewards and over 13% win-rate gains under partial observability.

#Agent#Reasoning#Benchmarking#SeqComm-DFL

why featured

HKR-H/K pass: the mechanism and metrics are specific, with medical collaboration and SMAC as test beds. Single arXiv paper, narrow MARL scope, and no product impact keep it in the 60–71 band.

editor take

SeqComm-DFL reports 4–6x rewards on healthcare and SMAC; I'd audit baselines first, since Stackelberg messaging can win by setup.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception

The paper fine-tunes video models with behavioral geometric supervision (BGS) derived from 49,484 odd-one-out judgments on 250 social videos; the fine-tuned V-JEPA 2.1 reaches nearly 3x the pretrained baseline and exceeds the MPNet sentence-embedding baseline.

#Vision#Fine-tuning#Alignment#V-JEPA

why featured

HKR-K passes with concrete dataset size, method, and baselines. HKR-H/R are weak: this is a single academic benchmark paper with no product rollout or market pressure, so it fits the all tier.

editor take

BGS uses 49,484 judgments on 250 videos; V-JEPA 2.1 gains nearly 3x, so caption distillation isn’t the fix.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

Steer-to-Detect detects LLM-generated text with a two-stage framework: it injects a learned steering vector into hidden states of a frozen observer LLM, then applies hypothesis testing over the steered representations with finite-sample high-probability guarantees for Type I and Type II errors.

#Safety#Interpretability#Benchmarking#Research release

why featured

HKR-K and HKR-R pass: the mechanism is specific and AI-text detection matters to practitioners. No accuracy, dataset, or reproducible result is disclosed, so it stays in the mid research-release band.

editor take

S2D injects a steering vector into a frozen observer LLM; I buy the mechanism, not claims without AUROC or attack details.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

The paper proposes an always-valid release wrapper for generator-evaluator workflows, using a hard-negative reference pool to calibrate black-box scores and an e-process to control the probability of incorrect release under optional stopping.

#Agent#Code#Benchmarking#arXiv

why featured

HKR-H/K/R pass, but this stays in all: the arXiv item gives a wrapper, hard-negative pool, and e-process mechanism, with no benchmark numbers or production case; value is narrow release-gating reliability.

editor take

This turns release timing into a finite-sample test; MBPP+ is shown, but the hard-negative pool is the fragile part.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lessons Learned at ABB Robotics

The study frames fault localization as supervised text classification and evaluates 5 models on 5 years of ABB Robotics bug reports, finding TF-IDF-based traditional models outperform fine-tuned RoBERTa variants under code-free industrial maintenance conditions.

#Fine-tuning#Benchmarking#ABB Robotics#Research release

why featured

HKR-H/K/R pass through the TF-IDF-vs-RoBERTa result, 5-year ABB dataset, and baseline-cost debate. Importance stays in the 60–71 band because it is a niche software-engineering research paper, not a broad product or model release.

editor take

ABB Robotics tested 5 models on 5 years of bug reports; TF-IDF beat RoBERTa, so don’t fine-tune first on thin domain text.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Understanding Catastrophic Forgetting in LoRA via Mean-Field Attention Dynamics

The paper studies catastrophic forgetting in LoRA with a mean-field self-attention toy model, identifies two phase-transition conditions tied to perturbation norm and Transformer depth, and validates the predicted trends with LoRA fine-tuning experiments on real models.

#Fine-tuning#Interpretability#Alignment#LoRA

why featured

HKR-K/R pass: the paper gives concrete phase-transition conditions for LoRA forgetting and touches fine-tuning reliability. HKR-H fails, and the mean-field framing limits generalist reach, keeping it in the 60–71 band.

editor take

LoRA forgetting gets two phase lines: perturbation norm and depth. Useful theory, but not a training recipe yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LightSplit: Practical Privacy-Preserving Split Learning via Orthogonal Projections

LightSplit applies fixed orthogonal random projections at the split-learning cut layer, transmitting low-dimensional activations and retaining over 95% of baseline accuracy with up to 32x lower transmitted dimensionality.

#Fine-tuning#Safety#Inference-opt#Research release

why featured

HKR-K is solid via the 32x reduction and >95% accuracy-retention claim; HKR-R is limited to privacy-preserving training practitioners. The lack of a major lab, product, or artifact keeps it in the 60-71 band.

editor take

LightSplit cuts split-layer activation dimensionality by up to 32x while keeping over 95% of baseline accuracy; I want attack metrics, and the abstract gives none.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
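
The LightSplit summary above describes a fixed orthogonal random projection applied at the cut layer. The following is a minimal sketch of that idea only, not the paper's implementation: the function names, the Gram-Schmidt construction, and the 64-to-2 dimensions (chosen to give a 32x reduction) are all illustrative assumptions.

```python
import math
import random

def random_orthonormal_rows(k, d, seed=0):
    """Return k orthonormal row vectors in R^d (Gram-Schmidt on Gaussian draws)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along rows that were already accepted.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip numerically degenerate draws
            rows.append([a / norm for a in v])
    return rows

def project(rows, activation):
    """Client side: compute the k projected coordinates to transmit."""
    return [sum(r_i * a_i for r_i, a_i in zip(r, activation)) for r in rows]

D, K = 64, 2  # hypothetical sizes: 64 / 2 = 32x fewer transmitted dimensions
P = random_orthonormal_rows(K, D, seed=42)  # fixed once, never trained

rng = random.Random(1)
activation = [rng.gauss(0.0, 1.0) for _ in range(D)]  # stand-in cut-layer output
compressed = project(P, activation)
```

Because the projection is fixed by a seed, both sides can regenerate it without exchanging the matrix. The sketch says nothing about accuracy retention or resistance to reconstruction attacks, which is exactly the evidence the editor take asks for.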

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→A Benchmark for Multi-Party Negotiation Games from Real Negotiation Data

The paper introduces a multi-party negotiation benchmark using document-grounded instances from a climate negotiation exercise and several baseline solvers; exact evaluation on small games and comparative evaluation on larger instances show that no solver dominates across regimes.

#Agent#Benchmarking#arXiv#Research release

why featured

HKR-K and HKR-R pass: the paper offers real-negotiation-derived instances and baseline findings, and it speaks to multi-agent evaluation pain. Still, it is an arXiv benchmark paper, not a same-day model, product, or framework release.

editor take

This benchmark builds multi-party games from climate negotiation docs; useful for commitment chains, but scale and data protocol are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→SHM-Agents: A Generalist-Specialist Integrated Agent System for Structural Health Monitoring

The paper proposes SHM-Agents, a generalist-specialist system that combines LLM reasoning and planning with specialized algorithms, and tests it on a long-span cable-stayed bridge across 12 SHM tasks including anomaly diagnosis, modal identification, and reliability assessment.

#Agent#Reasoning#Tools#SHM-Agents

why featured

HKR-K is clear and HKR-H works via the bridge-monitoring agent hook. The arXiv paper is a narrow engineering vertical with no disclosed code, benchmark comparison, or reproducible setup, so it stays in the 60–71 all band.

editor take

SHM-Agents runs 12 tasks on one cable-stayed bridge; I want cross-bridge replication, not another single-asset demo.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Efficient Generative Prediction for EHR Foundation Models: The SCOPE and REACH Estimators

The paper proposes SCOPE and REACH for generative EHR outcome prediction, matching 100-sample Monte Carlo accuracy across 11 clinical outcomes in MIMIC-IV and the UChicago health system while cutting median token use by 2.5× to 3.4×, with reductions above 80× for the rarest outcomes.

#Inference-opt#Benchmarking#MIMIC-IV#UChicago

why featured

HKR-H/K/R pass, but this is a single EHR inference-efficiency paper with narrow audience reach and no product impact. It fits the 60-71 research band, so tier is all.

editor take

SCOPE/REACH cut median token use 2.5–3.4× across 11 outcomes; EHR prediction has inference waste before it has model scarcity.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search

The paper proposes Const-o-T, representing each reasoning step as an intent-constraint pair and integrating it into MCTS; across three domains—Risk, CAD code generation, and arithmetic reasoning—the method outperforms baselines on accuracy and structural alignment.

#Reasoning#Agent#Code#Research release

why featured

HKR-H/K pass: the paper offers a concrete constrained-reasoning mechanism across Risk, CAD code, and arithmetic. It lacks gain sizes, model scale, and deployment evidence, so it stays in the 60-71 research-browse band.

editor take

Const-o-T adds constrained MCTS across 3 tasks; without effect sizes, don’t crown it a CoT replacement.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

The paper introduces IndicMedDialog, a parallel multi-turn medical dialogue dataset covering English and nine Indic languages; it extends MDDial with LLM-generated consultations, TranslateGemma translations, native-speaker verification, and script-aware post-processing.

#Fine-tuning#Benchmarking#IndicMedDialog#IndicMedLM

why featured

HKR-K passes: the language count and the pipeline of synthetic generation, translation, and native-speaker review add concrete information. HKR-H/R are weak because this is a niche NLP dataset release, useful but not featured-level.

editor take

IndicMedDialog spans English plus 9 Indic languages; size is undisclosed, so judge it by the rigor of its native-speaker verification.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→AmaraSpatial-10K: A Spatially and Semantically Aligned 3D Dataset for Spatial Computing and Embodied AI

AmaraSpatial-10K publishes more than 10,000 synthetic 3D assets, with each .glb carrying metric scale, deterministic anchoring, separated PBR maps, a convex collision hull, a reference image, and multi-sentence text metadata.

#Robotics#Vision#Benchmarking#AmaraSpatial-10K

why featured

HKR-K and HKR-R pass: the paper gives a 10K-scale 3D asset set with concrete engineering fields useful for embodied-AI simulation. HKR-H is weak and the audience is narrow, so it stays in 60–71.

editor take

AmaraSpatial-10K ships 10K+ deployable assets; CLIP R@5 hits 0.612 vs Objaverse’s 0.181, useful for sim stacks.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

CAFT trains on 30M image-text pairs and combines local text-region alignment at intermediate layers with global image-text alignment at the final layer. The paper reports state-of-the-art results on six long-text retrieval benchmarks and shows localization of textual semantics without explicit region-level supervision.

#Vision#Multimodal#Benchmarking#CAFT

why featured

HKR-K is solid: the paper gives training scale, a layered alignment mechanism, and 6 benchmark claims. HKR-H/R are weak, and this is a single arXiv research item with no disclosed release, product path, or broad debate.

editor take

CAFT hits 6 long-caption retrieval SOTAs on 30M pairs; I buy local alignment, but need same-compute CLIP deltas.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-Model Synthetic Tabular Data

The MIDST Challenge evaluated privacy gains in diffusion-model synthetic tabular data, focusing on resistance to membership inference attacks under black-box and white-box settings. The abstract covers single mixed-type tables and multi-relational tables with interconnected constraints, and it links a GitHub repository, but the post does not disclose participant counts or leaderboard results.

#Safety#Benchmarking#SaTML#Vector Institute

why featured

HKR-K and HKR-R pass: the paper defines concrete MIA settings for diffusion-based synthetic tabular data. HKR-H is weak, and the topic is a niche privacy benchmark without product or broad industry pull.

editor take

MIDST covers black-box and white-box MIAs; participants and leaderboard are undisclosed. Stop treating diffusion tabular synthesis as anonymization.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs

The paper proposes Structured Residual Reconstruction, which preserves the top-k singular subspace of activation-scaled weights before quantization, quantizes the residual, and allocates the remaining rank r-k to error reconstruction. Experiments report consistent PTQ perplexity reductions across models and quantization settings, plus a 5.9 percentage-point average GLUE gain under 2-bit QPEFT.

#Inference-opt#Fine-tuning#Research release

why featured

HKR-K/R pass: SRR gives a concrete mechanism and GLUE +5.9pp, tied to low-bit tuning cost. HKR-H is weak, and this is a single technical arXiv quantization paper, so it stays in 60-71.

editor take

SRR preserves top-k subspaces before residual quantization; a 5.9-point GLUE gain at 2-bit QPEFT makes plain error repair look crude.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Finding the Weakest Link: Adversarial Attack against Multi-Agent Communications

The paper proposes Jacobian-gradient methods to select vulnerable messages, agents, and timesteps for single-victim communication attacks. It tests two multi-agent communication methods across Navigation, PredatorPrey, and TrafficJunction environments; the proposed victim selection, message selection, tempo, and adversarial losses improve attack effectiveness in 15 of 30 scenarios.

#Agent#Safety#Alignment#Research release

why featured

HKR-H/K/R pass, but this is an arXiv technical paper tested on simulated tasks such as Navigation, PredatorPrey, and TrafficJunction, not a product or major lab release, so it stays in the 60–71 band.

editor take

Jacobian targeting improved 15 of 30 scenarios; multi-agent comms safety needs better baselines than random perturbations.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

The paper analyzes early training in attention-based language models with a leading-term gradient approximation, deriving closed-form weight expressions built from three basis functions: bigram, token-interchangeability, and context mappings.

#Interpretability#Reasoning#Research release

why featured

HKR-K is clear: gradient leading terms, closed-form weights, and three basis functions are testable mechanisms. HKR-R lands for interpretability, but HKR-H is weak and the item remains abstract-level research, so it stays in 60–71.

editor take

arXiv 2601.19208 derives closed-form early-training weights; experiments are undisclosed here, so don’t sell this as a Transformer theory.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Training Large Language Models to Predict Clinical Events

The study converts time-ordered MIMIC-III notes into 6,900 clinical prediction examples from 702 admissions, and a LoRA adapter reduces expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145.

#Fine-tuning#Benchmarking#MIMIC-III#GPT-5

why featured

HKR-K and HKR-R pass: the paper gives LoRA calibration and Brier gains on MIMIC-III, and clinical prediction stresses reliability. HKR-H is weak; this remains a narrow arXiv paper without product impact.

editor take

LoRA on 6,900 MIMIC-III examples cuts ECE to 0.0398; for clinical LLMs, calibration beats diagnosis theater.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
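
The clinical-events item above reports expected calibration error (ECE) and Brier score. As a reminder of what those two numbers measure, here is a minimal sketch for binary outcomes. It assumes equal-width probability bins; the paper's binning scheme is not stated in the summary, and the toy data is invented for illustration.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    event frequency, over equal-width probability bins (an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(probs)) * abs(mean_p - freq)
    return ece

# Toy predictions (hypothetical): confident and correct, slightly under-confident.
probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Lower is better for both metrics. ECE only looks at the gap between stated confidence and observed frequency, while the Brier score also rewards sharper predictions, so the two can move independently.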

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

SubspaceAD extracts patch features from a few normal images with a frozen DINOv2 backbone and fits PCA to model normal variation; in the one-shot setting, it reports 97.1% image-level AUROC and 97.5% pixel-level AUROC on MVTec-AD without training, prompt tuning, or memory banks.

#Vision#Benchmarking#SubspaceAD#DINOv2

why featured

HKR-H/K/R pass, but the audience scope is narrow: this is an industrial vision anomaly-detection paper, not a broad model or product release. The method and MVTec-AD numbers justify all, not featured.

editor take

SubspaceAD hits 97.1/97.5 AUROC with one normal image; DINOv2 plus PCA embarrasses heavier anomaly pipelines.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Delightful Distributed Policy Gradient

The paper proposes Delightful Policy Gradient, which gates updates using the product of advantage and surprisal; on MNIST staleness and transformer sequence tasks, DG achieves an order-of-magnitude sample-efficiency advantage when staleness, actor bugs, reward corruption, and rare discovery occur together.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K passes with a concrete mechanism and test setup; HKR-H/R are weak because the framing is academic and the tasks are niche. The sample-efficiency claim is testable, but no code, named lab, or production pipeline is disclosed, so this stays in all.

editor take

DG reports an order-of-magnitude sample-efficiency advantage under four distributed-RL frictions; I'd verify replication first, because advantage×surprisal gating sounds suspiciously simple.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Test-time Offline Reinforcement Learning on Goal-related Experience

The paper introduces GC-TTT, a goal-conditioned test-time training method that selects offline transitions by relevance to the current state and quality for the evaluation goal, then fine-tunes the policy for a few gradient steps during rollout across high-dimensional loco-navigation and manipulation tasks.

#Agent#Fine-tuning#Inference-opt#Research release

why featured

HKR-H and HKR-K pass: the test-time offline fine-tuning mechanism is concrete, but the post gives no gain numbers, code, or reproducibility details. The audience fit is mostly RL researchers, so it stays in the 60-71 band.

editor take

GC-TTT filters offline transitions and fine-tunes a few steps at eval; I like the compute story more than bigger-policy scaling.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL uses LLMs to generate context-rich synthetic training examples for candidate concepts in a target knowledge base, reports new state-of-the-art results on MedMentions, QUAERO, and SPACCC, and matches full human supervision with up to 60% less annotated data.

#Fine-tuning#Inference-opt#Benchmarking#SynCABEL

why featured

HKR-K and HKR-R pass: 60% less annotation and 3 benchmarks add signal, and labeling cost resonates. HKR-H is weak, and the biomedical entity-linking scope keeps it in all.

editor take

SynCABEL claims SOTA on 3 BEL benchmarks with up to 60% fewer labels; LLM synthetic data is becoming real leverage for biomedical long tails.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions

The paper proposes Counterfactual Explanation Consistency, a framework that aligns feature attributions between individuals and counterfactual counterparts to detect procedural bias, and tests it on synthetic data plus German Credit, Adult Income, and HMDA mortgage datasets.

#Alignment#Interpretability#Benchmarking#Research release

why featured

HKR-H/K/R pass, but this is a niche academic fairness-evaluation paper. The post gives a method and datasets, not deployment evidence or results on major models, so it stays below featured.

editor take

CEC tests procedural bias on 4 datasets; outcome-fair credit models can still take crooked attribution paths.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→TStore: Rethinking AI Model Hub with Tensor-Centric Compression

TStore reduces model-hub storage overhead with tensor-level fingerprinting, clustering, and fine-grained deduplication; the arXiv abstract says experiments on real-world model repositories show substantial storage savings with minimal overhead, but the post does not disclose the exact reduction ratio.

#Inference-opt#TStore#Research release

why featured

HKR-K and HKR-R pass: the paper offers a concrete tensor-centric hub mechanism and storage-cost relevance. The undisclosed savings ratio and niche infra scope keep it in the 60–71 band.

editor take

TStore dedupes model hubs via tensor fingerprints, but gives no savings ratio; without numbers, it is not Hugging Face’s cost cure.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→GAGPO: Generalized Advantage Grouped Policy Optimization

The paper proposes GAGPO, a critic-free RL method that builds a non-parametric grouped value proxy from sampled rollouts and reports stronger results than RL baselines on ALFWorld and WebShop multi-turn agent tasks.

#Agent#Reasoning#GAGPO#ALFWorld

why featured

HKR-K passes via a new optimization mechanism and two multi-turn agent benchmarks. HKR-H and HKR-R are weak: the title is technical, and the post does not disclose code, scale, or production impact.

editor take

GAGPO uses rollout-derived value proxies and beats baselines on ALFWorld/WebShop; no margins disclosed, so don’t crown critic-free RL yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Filter-then-Weight: Online Data Selection and Reweighting for LLM Fine-Tuning

The paper proposes Filter-then-Weight, a two-stage algorithm for online LLM fine-tuning that first filters geometrically useful candidates, then optimizes their coefficients under the current optimizer state.

#Fine-tuning#Inference-opt#Research release

why featured

HKR-K and HKR-R pass: the mechanism is relevant to online fine-tuning, but the post gives no benchmarks, model scale, or cost gains. This fits a normal research release, tier all.

editor take

Filter-then-Weight selects data in two stages. No gains disclosed; I don't buy “consistent” yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

MaskPro learns a prior categorical distribution for every M consecutive weights and generates strict (N:M) sparsity through N-way sampling without replacement; the paper says its moving-average tracker of loss residuals reduces policy-gradient variance in the combinatorial space.

#Inference-opt#MaskPro#Research release#Open source

why featured

HKR-K is solid and HKR-R is partial: it gives a concrete MaskPro sampling and variance-reduction mechanism for LLM sparsity. HKR-H fails, and no speed, accuracy, or hardware numbers are disclosed, so it stays in the all band.

editor take

MaskPro samples N weights per M from learned priors; I want GPU speedups, and the snippet gives none.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
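
The MaskPro summary above describes generating strict (N:M) sparsity by drawing N positions from each group of M consecutive weights, without replacement, from a categorical distribution. The sketch below illustrates only that sampling step. It assumes the per-group probabilities are already given and strictly positive; the learning of the priors and the variance-reduced policy gradient are not shown, and all names and values are hypothetical.

```python
import random

def sample_group_mask(group_probs, n, rng):
    """Draw n distinct positions from one group, without replacement."""
    remaining = list(range(len(group_probs)))
    keep = set()
    for _ in range(n):
        weights = [group_probs[i] for i in remaining]
        idx = rng.choices(remaining, weights=weights, k=1)[0]
        keep.add(idx)
        remaining.remove(idx)  # a kept position cannot be drawn again
    return [1 if i in keep else 0 for i in range(len(group_probs))]

def apply_nm_sparsity(weights, priors, n, m, seed=0):
    """Zero out all but n sampled weights in every block of m consecutive weights."""
    rng = random.Random(seed)
    sparse = []
    for start in range(0, len(weights), m):
        mask = sample_group_mask(priors[start:start + m], n, rng)
        sparse.extend(w * k for w, k in zip(weights[start:start + m], mask))
    return sparse

# Toy example (hypothetical values): 2:4 sparsity over two groups of four weights.
weights = [0.5, -1.2, 0.3, 2.0, -0.7, 0.9, 1.1, -0.4]
priors = [0.4, 0.3, 0.2, 0.1, 0.25, 0.25, 0.25, 0.25]
sparse = apply_nm_sparsity(weights, priors, n=2, m=4, seed=0)
```

Every group ends up with exactly N surviving weights by construction, which is what makes the pattern "strict". Whether that translates into hardware speedups is the number the editor take says is missing.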

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Three-Stage Learning Unlocks Strong Performance in Simple Models for Long-Term Time Series Forecasting

The paper proposes STAIR, a three-stage training paradigm that uses shared temporal mapping, channel-wise fine-tuning, and residual learning to train a shallow MLP backbone, and reports matching or outperforming strong baselines on nine long-term forecasting benchmarks.

#Fine-tuning#Benchmarking#STAIR#Research release

why featured

HKR-H/K pass via the simple-model twist and concrete STAIR setup across 9 benchmarks. It remains a niche forecasting paper with limited practitioner resonance, so it sits in the lower research band.

editor take

STAIR reports wins on 9 long-horizon benchmarks; I buy the training-recipe angle, but code and ablations decide this one.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition

The paper evaluates sample-based MBR decoding with Whisper and derivative models on English and Japanese ASR and speech translation, finding higher accuracy than beam search in most tested settings and releasing code at the CyberAgentAILab GitHub repository.

#Audio#Inference-opt#Benchmarking#Whisper

why featured

HKR-K passes with concrete models, languages, tasks, and open code. HKR-H/R are weak: this is useful ASR inference research, but too niche for broad practitioner discussion.

editor take

MBR beats beam in most English/Japanese ASR/ST settings; useful offline, but latency and sampling cost are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
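
The MBR item above concerns sample-based Minimum Bayes Risk decoding: instead of taking the single highest-probability hypothesis, pick the sampled hypothesis that agrees most with the other samples. The sketch below is illustrative only. The utility (one minus normalized word-level edit distance) and the sample transcripts are assumptions; the summary does not say which utility function the paper uses.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        cur = [i]
        for j, tok_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (tok_a != tok_b)))   # substitution
        prev = cur
    return prev[-1]

def utility(hyp, ref):
    """Similarity in [0, 1]: 1 - normalized word edit distance (assumed utility)."""
    h, r = hyp.split(), ref.split()
    return 1.0 - edit_distance(h, r) / max(len(h), len(r), 1)

def mbr_select(samples):
    """Return the sample with the highest average utility against the others."""
    best, best_score = None, float("-inf")
    for i, hyp in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        score = sum(utility(hyp, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best

# Hypothetical transcripts sampled from an ASR model for one utterance.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sits on the mat",
]
choice = mbr_select(samples)
```

The pairwise comparison costs O(n²) utility calls on top of drawing n samples, which is the latency and sampling cost the editor take flags as undisclosed.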

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

The paper introduces CPPO, an on-policy contrastive RL algorithm that derives advantages from contrastive Q-values and optimizes the standard PPO objective without rewards or a replay buffer; across 18 continuous, discrete, single-agent, and cooperative multi-agent tasks, CPPO beats prior CRL baselines on 14 tasks and matches or exceeds dense-reward PPO on 12 tasks.

#Reasoning#Research release#Benchmark

why featured

HKR-K passes on a concrete method and benchmark numbers; HKR-H and HKR-R are weak. This is a niche RL research release with no major lab or product implication, so it sits in the 60–71 band.

editor take

CPPO beats CRL baselines on 14/18 tasks; wiring contrastive RL into PPO is the reusable bit, not another offline trick.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LiBaGS Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

LiBaGS scores candidate synthetic samples using four signals—decision-boundary proximity, predictive uncertainty, real-data density, and support validity—and experiments report higher accuracy than classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.

#Fine-tuning#Benchmarking#LiBaGS#arXiv

why featured

This is a practical synthetic-data selection paper: HKR-K names the mechanism and HKR-R touches fine-tuning cost and data quality. But datasets, gains, and code are not disclosed here, so HKR-H is weak and it stays in 60–71.

editor take

LiBaGS uses 4 signals for synthetic-sample selection; no datasets or gains are disclosed, so don’t buy “higher accuracy” yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

The paper proposes Group Causal Counterfactual Policy Optimization for LLM reasoning, using an episodic causal counterfactual reward and token-level advantages to favor process-valid, counterfactually robust reasoning; the abstract mentions diverse benchmarks but does not disclose the number of benchmarks or model results.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-K/R pass: the mechanism is specific and relevant to LLM reasoning training. The post lacks benchmark counts, model scale, or reproducible conditions, so it stays in the ordinary research-release band.

editor take

GCCPO trains reasoning with counterfactual rewards; benchmark count is undisclosed, so I don't buy the generalization claim yet.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench introduces 20,000 synthetic Japanese family-crest samples to evaluate compositional factor recovery in vision-language models, using known container, modifier, and motif factors plus program-code metrics, recombination splits, counterfactual motif-sensitivity groups, and linear probes.

#Multimodal#Vision#Benchmarking#KamonBench

why featured

HKR-H and HKR-K pass through the unusual family-crest setup and 20k controlled samples. HKR-R fails: this is a narrow VLM benchmark without product impact, model-rank conflict, or practitioner stakes.

editor take

KamonBench tests factor recovery on 20K synthetic crests; I like the escape from caption scores, but synthetic grammar stays far from real VLM failures.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Teacher-Guided Policy Optimization for LLM Distillation

The paper proposes TGPO, an on-policy LLM distillation algorithm that uses teacher predictions conditioned on student rollouts as dense directional guidance, requires no extra data annotation, and outperforms standard RKL baselines on complex reasoning benchmarks.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-K passes because TGPO gives a concrete distillation mechanism. HKR-H is weak and HKR-R is narrow; no hard-exclusion rule applies, so it sits in the 60–71 all band.

editor take

TGPO feeds student rollouts to the teacher for dense guidance; no scores disclosed, but it targets RKL distillation’s cold-start failure.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Conformal Anomaly Detection in Python: Moving Beyond Heuristic Thresholds with nonconform

The paper introduces nonconform, a Python package that converts anomaly scores into calibrated p-values under data exchangeability and integrates with scikit-learn, pyod, and custom detectors for calibration, p-value generation, and false discovery rate control.

#Benchmarking#Tools#nonconform#scikit-learn

why featured

HKR-H and HKR-K pass: the thresholding pain point is clear, and the post names calibrated p-values plus scikit-learn/pyod integration. It remains niche anomaly-detection tooling, not an LLM/agent story, so it stays in the 60–71 band.

editor take

nonconform turns anomaly scores into p-values under exchangeability; in production, the hard test is FDR under drift.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
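
The nonconform item above describes turning anomaly scores into calibrated p-values and then controlling the false discovery rate. The sketch below shows the standard conformal p-value formula and the Benjamini-Hochberg procedure in plain Python. It does not use or reproduce the nonconform API; the function names, the toy scores, and the choice of Benjamini-Hochberg as the FDR procedure are assumptions for illustration.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Conformal p-value for each test score, where a higher score means more
    anomalous: p = (1 + #{calibration scores >= test score}) / (n + 1).
    Valid when calibration and test data are exchangeable."""
    n = len(calibration_scores)
    return [
        (1 + sum(c >= s for c in calibration_scores)) / (n + 1)
        for s in test_scores
    ]

def benjamini_hochberg(p_values, alpha=0.1):
    """Return the indices flagged by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])

# Toy data (hypothetical): scores from any detector, computed on held-out
# normal data for calibration and on new points for testing.
calibration = list(range(1, 40))   # 39 calibration scores
test = [5, 50, 9, 70]              # two typical points, two extreme ones
p_values = conformal_p_values(calibration, test)
flagged = benjamini_hochberg(p_values, alpha=0.1)
```

The smallest achievable p-value is 1 / (n + 1), so the calibration set size bounds how strict a threshold can ever be met. The guarantee also rests on exchangeability, which is why the editor take points at drift as the hard production test.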

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→LiLAW: Lightweight Adaptive Weighting Method Improves Noisy Sample Training

LiLAW adjusts per-sample loss weights with three global learnable scalars for easy, moderate, and hard samples; after each training mini-batch, it updates those parameters with one gradient step on a validation mini-batch and reports accuracy and AUROC gains across general and medical imaging noise settings.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K is concrete and HKR-R is relevant to fine-tuning teams, but HKR-H is weak. As a single method paper without effect sizes, code, or production replacement claims, it stays in the 60–71 band.

editor take

LiLAW learns just 3 scalars; the extra validation step is cheap, but validation bias still decides the win.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Characterizing Universal Object Representations Across Vision Models

The paper decomposes object similarity structures from 162 vision models into non-negative dimensions and estimates how often each dimension reappears across models; universal dimensions are more interpretable, driven more by conceptual image properties, and better predict macaque IT activity and human similarity judgments.

#Vision#Interpretability#Research release

why featured

HKR-K passes: 162 models and recurrence estimates add concrete knowledge. HKR-H and HKR-R are weak because the work is representation analysis, far from product or daily practitioner decisions.

editor take

162 vision models converge on universal object dimensions; architecture, objective, data, and scale fail to explain it.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

The paper reformulates visual evidence selection for multimodal RAG as information-gain ranking, and reports better results than state-of-the-art RAG baselines on MRAG-Bench and Visual-RAG across multiple model families.

#RAG#Multimodal#Vision#Research release

why featured

HKR-K passes because the paper gives a concrete selection mechanism and benchmark names. HKR-H and HKR-R are weak: no improvement numbers, artifact, or product stake, so it stays in the 60–71 band.

editor take

This ranks visual evidence by information gain; it beats baselines on MRAG-Bench and Visual-RAG, but the abstract gives no delta.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

The paper compares five low-rank pre-training methods with full-rank training at 60M, 130M, and 350M scales, using 16 metrics to analyze loss landscapes, spectral structure, and activation similarity.

#Fine-tuning#Inference-opt#Benchmarking#GaLore

why featured

HKR-K passes via the 5-method/3-size/16-metric setup. HKR-H and HKR-R are weak because the item lacks a surprising result or practitioner-facing cost/security stake, so it stays in the 60–71 research-signal band.

editor take

The paper tests 5 low-rank pre-training methods. Stop trusting PPL alone: GaLore tracks full-rank training most closely, yet its later layers still drift.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Continual Learning with Multilingual Foundation Model

The paper presents a multi-stage framework for detecting reclaimed LGBTQ+-related slurs in English, Spanish, and Italian tweets, evaluates 8 multilingual embedding models, selects XLM-RoBERTa by macro F1, and reports 2-5% absolute F1 gains from language-specific threshold optimization without retraining.

#Embedding#Fine-tuning#Benchmarking#XLM-RoBERTa

why featured

HKR-K passes with concrete model/language counts and F1 gains; HKR-R passes for multilingual moderation safety. HKR-H is weak, and the arXiv paper is too narrow for featured.

editor take

XLM-RoBERTa wins among 8 models; the 2-5% F1 gain comes from per-language thresholds, so “continual learning” oversells it.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1
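The per-language threshold step in the entry above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names (`macro_f1`, `tune_thresholds`), the threshold grid, and the toy dev-set scores are all assumptions made here. It only shows why one shared cutoff can lose macro F1 when score distributions differ by language, with no retraining of the classifier.

```python
from typing import Dict, List, Sequence, Tuple


def macro_f1(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Macro F1 over the two classes for a binary classifier cut at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    f1s: List[float] = []
    for cls in (0, 1):
        tp = sum(1 for p, y in zip(preds, labels) if p == cls and y == cls)
        fp = sum(1 for p, y in zip(preds, labels) if p == cls and y != cls)
        fn = sum(1 for p, y in zip(preds, labels) if p != cls and y == cls)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / 2


def tune_thresholds(
    dev: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    grid: Sequence[float],
) -> Dict[str, float]:
    """Pick, per language, the grid threshold with the best dev-set macro F1.

    The model is untouched; only the decision cutoff changes per language.
    Ties go to the first (lowest) threshold in the grid.
    """
    return {
        lang: max(grid, key=lambda t: macro_f1(scores, labels, t))
        for lang, (scores, labels) in dev.items()
    }


# Hypothetical dev-set scores: the "it" classifier is systematically under-confident.
dev_sets = {
    "en": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "it": ([0.35, 0.3, 0.15, 0.1], [1, 1, 0, 0]),
}
grid = [i / 10 for i in range(1, 10)]
thresholds = tune_thresholds(dev_sets, grid)
```

In this toy setup the two languages end up with different cutoffs, and applying the English cutoff to the Italian scores would misclassify every positive example.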

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

CRePE encodes each image token as a depth-aware distribution along its source ray and adds a Geometric Attention Adapter to frozen video DiTs, supporting camera control under the Unified Camera Model for wide-angle and fisheye lenses.

#Multimodal#Vision#CRePE#Research release

why featured

HKR-H/K pass: the paper targets UCM-controlled wide/fisheye video generation and names CRePE plus a Geometric Attention Adapter. No metrics, code/model release, or product tie-in; it stays in the lower all band.

editor take

CRePE adds a geometric adapter to frozen video DiTs. No code in the 17-page paper, so fisheye control stays research-grade.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

The paper introduces an attribute-discrimination benchmark covering color, size, and texture across 67 everyday object classes, then evaluates CVCL, an infant-trained DINO baseline, CLIP, SigLIP, and ResNeXt under image-only prototype tests and text-vision attribute-object prompts.

#Vision#Multimodal#Benchmarking#CVCL

why featured

HKR-K passes via a concrete new benchmark: 67 object classes and three attribute types. HKR-H and HKR-R are weak because the paper is a niche academic evaluation with no product, cost, safety, or competition stake.

editor take

The benchmark spans 67 object classes; infant-scale models learn size and texture, then fail hard on color grounding.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Plan Before You Trade: Inference-Time Optimization for RL Trading Agents

Eun Go, Rohan Deb, and Arindam Banerjee propose FPILOT, an inference-time optimization framework that uses multi-step price forecasts to adapt RL trading policies before each trade, and evaluate it across five policy-learning algorithms on the TradeMaster DJ30 benchmark.

#Agent#Inference-opt#Eun Go#Rohan Deb

why featured

HKR-H and HKR-K pass: the paper has a clear inference-time planning mechanism plus DJ30/5-algorithm evaluation. HKR-R is weak because trading RL is niche, with no code, live trading result, or broad agent lesson disclosed.

editor take

FPILOT tests 5 RL policies on TradeMaster DJ30; I don’t buy trading-RL gains until forecaster quality is nailed down.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→FeatCal: Feature Calibration for Post-Merging Models

FeatCal calibrates merged-model weights layer by layer with a small calibration set and closed-form updates, reaching 85.5% on CLIP-ViT-B/32 Task Arithmetic versus 77.0% for Surgery and 78.8% for ProbSurgery.

#Fine-tuning#Inference-opt#Benchmarking#FeatCal

why featured

HKR-K passes with a concrete mechanism and 85.5% versus 77.0%/78.8%; HKR-H and HKR-R are weak because this is a narrow model-merging paper with limited general-practitioner pull.

editor take

FeatCal hits 85.5% on CLIP-ViT-B/32 Task Arithmetic; with only 8 samples it still gets 82.9%, making post-merge repair look usable.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

GSEC uses multimodal large language models to generate semantic descriptions and weighted image embeddings, then reports better results than 18 methods across six benchmark datasets; the authors released code on GitHub, while the snippet does not disclose dataset names or exact scores.

#Multimodal#Vision#Embedding#GSEC

why featured

This is a standard arXiv vision-clustering paper with code, a clear mechanism, and 6-benchmark results, so HKR-K passes. HKR-H and HKR-R are weak, keeping it in the 60–71 research-signal band.

editor take

GSEC beats 18 methods on 6 benchmarks; scores and datasets are undisclosed, so treat MLLM semantic priors as unproven cost.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Distinguishing Performance Gains from Learning When Using Generative AI

The arXiv paper argues that generative AI can raise learner performance, but the RSS abstract does not disclose the sample size, task design, controls, or effect sizes needed to separate performance gains from learning.

#Research release

why featured

HKR-H/R pass, but HKR-K fails because the feed lacks experimental details. This is a useful learning-research signal, not a reusable model, product update, or benchmark result.

editor take

arXiv only gives an abstract: no sample, task, or effect size; I buy “better output ≠ learning,” but this has no teeth yet.

HKR breakdown

hook ✓knowledge —resonance ✓

→ open source

SCORE

H1·K0·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→AdaptNC: Adaptive Nonconformity Scores for Conformal Prediction under Distribution Shift

AdaptNC jointly adapts nonconformity score parameters and conformal thresholds online, using adaptive reweighting and a replay buffer to maintain target coverage on robotic benchmarks under multi-agent policy changes, environmental changes, and sensor degradation.

#Robotics#Safety#Benchmarking#Research release

why featured

HKR-K/R pass: the mechanism is concrete and tied to robotics reliability. HKR-H is weak, and no experiment numbers are disclosed, so this stays an interesting niche research item.

editor take

AdaptNC adapts scores and thresholds online; coverage holds, but the prediction-set volume gains lack numbers in the snippet, so 'significant' stays unproven.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→A Five-Layer MLOps Architecture for Connected Automated Driving

The paper proposes a five-layer MLOps architecture for collective learning in connected automated driving systems, covering layer responsibilities, interactions, and multi-level self-assessments; the abstract frames the design as a conceptual blueprint for fleet operators and stakeholders, aimed at detecting and reducing edge cases including black swan events.

#Robotics#Safety#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: the paper offers a five-layer ADS MLOps architecture and self-assessment mechanism for edge cases. No metrics, artifact, or fleet deployment keeps it in the normal research-release band.

editor take

This ADS MLOps paper gives a five-layer blueprint but no validation metrics; I’d treat the black-swan claim as governance architecture.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks

ISOMORPH introduces the first public digital twin for a multi-echelon logistics network, releasing datasets at C=50 and C=200 catalogue scales with six scenario sweeps, 30 additional rollouts, and 20 Latin-hypercube perturbations for time-series forecasting benchmarks.

#Benchmarking#ISOMORPH#Chronos#TimesFM

why featured

HKR-K passes because the paper gives reproducible supply-chain digital-twin datasets and scenario parameters. HKR-H/R are weak; this is a vertical forecasting benchmark, useful but below featured.

editor take

ISOMORPH ships C=50/200 logistics twins; the useful part is perturbable simulation, not another static retail-style TSF table.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→CHAL: Council of Hierarchical Agentic Language

The paper introduces CHAL, a multi-agent dialectic framework that uses CBS graph-structured belief representations and a gradient-informed mechanism to optimize beliefs in defeasible domains; the abstract does not disclose the number of ablations or benchmark names.

#Agent#Reasoning#Alignment#Research release

why featured

HKR-K passes on the CBS graph belief representation and gradient update mechanism. HKR-H/R miss: the title is opaque, and the post discloses no experiment count, benchmark, or open artifact.

editor take

CHAL recasts debate as CBS belief optimization; no benchmarks or ablation count disclosed, and “differentiable belief strength” needs proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→AGOP as Explanation: From Feature Learning to Per-Sample Attribution in Image Classifiers

The paper introduces AGOP-Weighted, which weights per-sample gradients with a training-distribution AGOP prior; on XAI-TRIS, it reports 44% higher mIoU than Integrated Gradients on linear tasks, while AGOP-Global reaches 7x IG’s mIoU on multiplicative tasks with zero inference cost.

#Vision#Interpretability#Benchmarking#Research release

why featured

HKR-K passes with a concrete mechanism and two benchmark numbers; HKR-H misses because the title is jargon-heavy, and HKR-R is narrow to vision attribution researchers. This fits the 60–71 research-release band, landing at 62.

editor take

AGOP-Weighted beats IG by 44% mIoU on XAI-TRIS linear tasks; I read it as gradient denoising, not universal attribution yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Entropy Aware Reward Guidance for Diffusion Language Model Alignment

The paper introduces EntRGi for 7B-parameter diffusion language models, using predictive entropy to interpolate per token between continuous relaxations and sampled hard tokens for test-time adaptation and RGRL post-training.

#Alignment#Fine-tuning#Inference-opt#arXiv

why featured

HKR-K passes on a concrete mechanism: entropy-guided per-token interpolation for test-time adaptation and RGRL post-training. HKR-H and HKR-R are weak, and no result numbers or artifact details are disclosed.

editor take

EntRGi uses entropy-gated per-token interpolation on 7B diffusion LMs; I buy the direction for reward guidance in discrete diffusion.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Deep Delta Learning

The paper introduces Deep Delta Learning, a residual update rule that reads a learned direction, compares it with a learned target, and applies a gated correction; evaluations use decoder-only language models, but the RSS snippet does not disclose model sizes or benchmark numbers.

#Reasoning#Inference-opt#Benchmarking#arXiv

why featured

HKR-K passes because the summary gives a concrete Deep Delta Learning mechanism. HKR-H/R are weak, and parameter scale or strong benchmark results are not disclosed, so this stays a standard research update.

editor take

DDL lets each layer gate-rewrite residual components; sizes and scores are undisclosed, so treat it as residual surgery before buying the gains.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Cascaded Flow Matching for Heterogeneous Tabular Data with Mixed-Type Features

TabCascade generates heterogeneous tabular rows with a two-stage cascade: it first produces categorical features and coarse numerical categories, then applies high-resolution flow matching with a guided conditional probability path, improving the detection score by 51.9%.

#Fine-tuning#Benchmarking#TabCascade#Research release

why featured

HKR-K passes with a concrete mechanism and 51.9% figure. HKR-H/R fail because this is narrow tabular-data generation research, so it stays in the 60–71 band.

editor take

TabCascade lifts detection score by 51.9% using coarse categorical sketches before Flow Matching; mixed tabular generation is finally paying its missing-value debt.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequences

UxSID uses Semantic IDs and dual-level attention to model ultra-long user sequences with semantic-group shared interest memory, and the abstract reports state-of-the-art performance plus a 0.337% revenue lift in a large-scale advertising A/B test.

#Memory#Inference-opt#Benchmarking#UxSID

why featured

HKR-K passes: UxSID uses Semantic IDs and dual-level attention and reports a +0.337% ads A/B revenue lift. HKR-H and HKR-R miss because the angle stays inside ad-recsys, away from models, agents, or toolchains.

editor take

UxSID reports a 0.337% ad A/B revenue lift; semantic shared memory is a credible cheaper path than item-specific retrieval.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Conditional Compatibility Learning for Context-Dependent Anomaly Detection

The paper introduces conditional compatibility learning and CC-CLIP, a vision-language architecture that disentangles subject and context representations from a single image and fuses visual evidence with text-conditioned attention; the abstract says it reaches state-of-the-art results on real-world contextual anomaly detection, but it does not disclose specific scores.

#Vision#Multimodal#Benchmarking#CC-CLIP

why featured

HKR-K passes for a concrete CC-CLIP mechanism, but HKR-H and HKR-R miss: no SOTA numbers are disclosed, and the topic is niche vision anomaly detection rather than broad practitioner signal.

editor take

CC-CLIP frames anomaly detection as subject-context compatibility; no scores in the snippet, so treat SOTA as unverified.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

F-GRPO jointly optimizes candidate generation and ranking in one autoregressive rollout, using separate group-relative advantages for generation and ranking, and reports better top-ranked performance than GRPO, decoupled baselines, and supervised alternatives on sequential recommendation and multi-hop QA benchmarks.

#RAG#Reasoning#Fine-tuning#Research release

why featured

HKR-K passes: F-GRPO proposes a joint generation-ranking training mechanism and reports gains over GRPO and decoupled baselines on recsys and multi-hop QA. No scores are disclosed, and HKR-H/R are weak, so it stays in the 60–71 band.

editor take

F-GRPO trains generation and ranking in one rollout; gains are undisclosed, so I read it as a GRPO credit-assignment patch.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

The paper introduces CR-Net, a parameter-efficient training framework that reconstructs layer activations with previous-layer outputs and low-rank differences; pre-training experiments span 60M to 7B parameters, but the snippet does not disclose exact memory or compute reduction percentages.

#Fine-tuning#Inference-opt#CR-Net#Research release

why featured

HKR-K passes for a new structure and 60M-to-7B experiment range. HKR-H/R are weak: this is a method paper without disclosed savings or a direct deployment result, so it stays in all.

editor take

CR-Net tests 60M–7B pretraining; no savings percentages disclosed, so don’t treat low-rank memory claims as engineering proof.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Low-Rank Adapters Initialization via Gradient Surgery for Continual Learning

The paper proposes SLICE for LoRA continual learning, using gradients from the current task and a replay buffer, reconciling them with a projection operator, and applying truncated SVD to initialize adapter weights; evaluation covers TRACE, Super-NI, and adversarial Super-NI sequences mined for maximally opposing gradients.

#Fine-tuning#Alignment#Benchmarking#arXiv

why featured

HKR-K passes via the SLICE mechanism and TRACE/Super-NI evaluation setup. HKR-H/R are weak, and the body does not disclose gains or reproducibility details, so this stays a niche research signal.

editor take

SLICE initializes LoRA from replay gradients and truncated SVD; I’d check buffer cost first, since the abstract gives no overhead.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Shortcut Mitigation via Spurious-Positive Samples

The paper proposes identifying a small set of instances where a model relies on spurious attributes, then locating relevant intermediate-layer neurons and regularizing their impact; the method does not require balanced held-out data, extra annotations, or all attribute-class group combinations in training data.

#Alignment#Interpretability#Research release

why featured

HKR-K is clear: the paper proposes shortcut mitigation without balanced held-out data, extra annotations, or full group combinations. HKR-R is narrow and HKR-H is weak, so this stays in all, below featured.

editor take

This paper targets shortcut neurons from a few spurious-positive samples; datasets and gains aren’t disclosed, so don’t sell it as general debiasing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Effective Context in Transformers: Analysis of Fragmentation and Tokenization

The paper analyzes representation choice under fixed Transformer context windows using Markov sources, proves fragmentation can strictly increase optimal finite-context log-loss, and gives loss guarantees for greedy tokenizers such as BPE and WordPiece based on source-history coverage and compression rate.

#Reasoning#Benchmarking#ByT5#CANINE

why featured

HKR-K passes via a concrete mechanism around fragmentation, effective context, and tokenizer guarantees. HKR-H/R are weak, and the Markov/log-loss framing raises accessibility friction, keeping it in all.

editor take

The paper proves fragmentation can strictly raise finite-context log-loss; byte models losing to BPE is not just bad training.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Latent-Augmented Discrete Diffusion Models

The paper proposes LADD, adding a learnable auxiliary latent channel to discrete diffusion over joint token-latent space, with Co-LADD, Di-LADD, joint denoising, and sequential latent-then-token schedules.

#Inference-opt#Reasoning#Research release

why featured

HKR-K passes for a concrete modeling mechanism and two variants, but the post gives no metrics, code, or practical win over autoregressive models. This stays in all as a niche research signal.

editor take

LADD adds a latent channel to discrete diffusion; no benchmark numbers disclosed, so I read it as a structural patch for few-step decoding.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

The study trains monolingual and bilingual BERT models on six downsampled languages and finds that applying BPE dropout during both pretraining and fine-tuning usually beats using it only during fine-tuning in low-resource settings.

#Fine-tuning#Benchmarking#BERT#Research release

why featured

HKR-K passes with a testable setup across 6 languages and mono/bilingual BERT. HKR-H and HKR-R miss: narrow NLP training technique, no broad industry nerve; no hard exclusion, so it sits at the low end of all.

editor take

BPE dropout wins across 6 low-resource languages when used in pretraining plus fine-tuning; saving randomness for fine-tuning is too late.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Reducing Cross-Sample Prediction Churn in Scientific Machine Learning

The paper evaluates 9 chemistry benchmarks and finds that two classifiers trained on independent bootstraps differ by only 1.3–4.2 percentage points in aggregate accuracy, while disagreeing on labels for 8.0–21.8% of test molecules.

#Benchmarking#Fine-tuning#Research release#Benchmark

why featured

HKR-H/K pass: the paper shows 8.0–21.8% molecule-label churn despite close accuracy. HKR-R is weak; chemistry benchmarks and no product or agent angle keep it in the general research band.

editor take

9 chemistry benchmarks show 8.0–21.8% label churn; accuracy-only scientific ML leaderboards are hiding instability.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0
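The gap between aggregate accuracy and per-sample agreement described in the entry above can be illustrated with a small sketch. The data below is a toy example invented here, not taken from the paper's chemistry benchmarks, and the helper names `accuracy` and `churn` are assumptions. Churn is taken to be the fraction of test items on which two models predict different labels.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def churn(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of test items on which two models disagree with each other."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy labels and two hypothetical models trained on different bootstraps.
labels  = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
model_a = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # wrong only on item 0
model_b = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]  # wrong only on item 1

acc_a = accuracy(model_a, labels)
acc_b = accuracy(model_b, labels)
disagreement = churn(model_a, model_b)
```

Both toy models score the same accuracy, yet they disagree on a fifth of the items because their errors fall on different samples. An accuracy-only leaderboard would report them as equivalent.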

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

The paper proposes an event-driven MARL framework that uses NMD and an event-based hypernetwork to generate LoRA modules, reconfiguring agent policies when events occur and decoupling agent identity from behavior.

#Agent#Reasoning#arXiv#Research release

why featured

HKR-K passes because the summary gives concrete mechanisms for event-driven MARL. HKR-H and HKR-R are weak; no metrics, artifact, or production claim keeps it below the featured threshold.

editor take

Only the abstract is disclosed: NMD plus event hypernet emits LoRA; “only method solving reassignment” is strong, but no baselines or numbers.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Differentiable Learning of Lifted Action Schemas for Classical Planning

The paper proposes a neural network architecture that learns lifted action schemas from traces with fully observed states and unobserved action arguments, then evaluates structure recovery across multiple planning domains plus robustness to observation noise and a slot-based dynamics variation.

#Reasoning#Research release

why featured

HKR-K passes because the paper states a concrete learning setup for lifted action schemas. HKR-H/R are weak: the angle is niche classical-planning research with no product hook or practitioner-level debate.

editor take

This learns lifted schemas from fully observed states and hidden action arguments; narrow setup, but cleaner than end-to-end planning hype.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Probabilistic Prediction Markets with Intermittent Contributions

The paper introduces a prediction market design where agents trade forecasts, enter or exit at will, and use robust regression to combine forecasts with missing submissions while allocating payoffs from historical, in-sample, and out-of-sample performance.

#Agent#Benchmarking#Research release

why featured

HKR-K passes because the paper states a concrete market mechanism for intermittent agent input. HKR-H and HKR-R are weak, and only abstract-level facts are available, so it stays in the 40–59 upper band.

editor take

arXiv 2510.13385 handles missing forecasts with robust regression; open entry is useful, but payoff reproducibility is thin.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→VIP-COP: Context Optimization for Tabular Foundation Models

VIP-COP selects high-value samples and features for tabular foundation models, using online KernelSHAP-based regression, iterative refinement, value-guided context sampling, and multi-fidelity pruning to optimize test-time context under black-box access.

#Reasoning#Inference-opt#Interpretability#VIP-COP

why featured

HKR-K passes because the mechanism is specific, but HKR-H/R are weak. The tabular-foundation-model scope is narrow, and the post does not disclose benchmark gains or production impact.

editor take

VIP-COP uses black-box KernelSHAP for tabular context selection; the RSS claims minutes-to-gain, but gives no benchmark size or lift.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

DiffLNS integrates a D3PM initializer with LNS2 for MAPF, reaching a 95.8% average success rate across 20 congested settings and beating the strongest tested baseline by 9.6 percentage points.

#Agent#Robotics#Reasoning#DiffLNS

why featured

HKR-K passes with a clear mechanism and benchmark result. The MAPF paper is specialist research with no product path or broad practitioner hook, so it stays below featured.

editor take

DiffLNS hits 95.8% across 20 congested MAPF settings; using diffusion as an LNS2 warm start is the sane bet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Delightful Exploration

The paper introduces Delight-gated exploration, a host-override rule that gates exploratory actions by expected improvement times surprisal, reuses the same hyperparameters across Bernoulli bandits, linear bandits, and tabular MDPs, and reports slower regret growth than Thompson Sampling and epsilon-greedy in tested unresolved regimes.

#Reasoning#Research release

why featured

HKR-H/K pass via a named mechanism and test settings, but the work is narrow bandits/MDPs research. No product impact or practitioner nerve is disclosed, so it stays below featured.

editor take

DE reuses one hyperparameter set across three task types; I like the gate, but abstract-only wins aren't a TS replacement.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling

MILM represents multimodal irregular time series as XML time-ordered triplets and uses two-stage fine-tuning, with MILM-2S achieving the best average performance across multiple EHR datasets while value-redaction tests show sampling patterns carry predictive signal.

#Multimodal#Fine-tuning#Benchmarking#MILM

why featured

HKR-K passes with a concrete representation and training recipe plus the best average results across EHR datasets. HKR-H/R fail: the angle is niche medical time-series modeling, not a broad practitioner conversation.

editor take

MILM-2S ranks best on average across EHR datasets; XML triplets are ugly, but value-redaction tests show the sampling pattern itself carries signal.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Efficient Compression of Neural Networks and Datasets

The paper casts intractable MDL optimization as ℓ0-regularized learning and compares sparse optimization methods on convolutional networks and transformers; the RSS snippet does not disclose compression ratios, accuracy-loss numbers, dataset names, or sample-efficiency metrics.

#Inference-opt#Fine-tuning#Benchmarking#Research release

why featured

HKR-K passes via the MDL-to-ℓ0 mechanism and CNN/Transformer setting; HKR-H fails, and HKR-R is weak because compression ratio and accuracy loss are missing. Technical research signal, but no hard exclusion.

editor take

The paper maps MDL to ℓ0 regularization across CNNs and transformers; no compression or accuracy numbers, so I don't buy the generalization claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Research proposes mechanism design framework for decentralized risk detection

The paper proposes a temporal value assignment mechanism that uses discounted verified outcomes and a strictly proper scoring rule to incentivize truthful posterior reporting, then illustrates the framework on a 1.4M-transaction synthetic anti-money-laundering benchmark.

#Safety#Benchmarking#Research release

why featured

HKR-K passes on the TVA mechanism and 1.4M synthetic AML benchmark. HKR-H and HKR-R miss because the title is specialist-heavy and the use case sits far from models, agents, or product shifts.

editor take

TVA is shown on 1.4M synthetic AML transactions; I buy the mechanism problem, not the regulatory leap from delayed labels.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→Sample-Efficient Optimisation over the Outputs of Generative Models

The paper proposes O3 for black-box optimisation over continuous-variable diffusion and flow-matching models, using low-dimensional surrogate latent spaces extracted without extra training to search for higher-scoring samples on image and protein design tasks.

#Inference-opt#Multimodal#O3#Research release

why featured

HKR-K passes via O3’s training-free low-dimensional surrogate latent-space mechanism. HKR-H/R are weak: only abstract-level detail is given, with no benchmark numbers or production replacement claim.

editor take

O3 optimizes diffusion outputs via training-free surrogate latents; image and protein gains are claimed, but exact lift is undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG· atomEN04:00 · 05·14

→PolySHAP: Extending KernelSHAP with Interaction-Informed Polynomial Regression

PolySHAP replaces KernelSHAP’s linear game approximation with higher-degree polynomial regression, reports better Shapley value estimates on multiple benchmark datasets, proves consistency, and shows that paired sampling produces exactly the same approximations as second-order PolySHAP without fitting a degree-2 polynomial.

#Interpretability#Benchmarking#KernelSHAP#PolySHAP

why featured

HKR-K passes because the paper states a concrete mechanism and proof. HKR-H/R are weak: this is a specialized interpretability method, with no product path or broad industry nerve disclosed.

editor take

PolySHAP lifts KernelSHAP from linear to polynomial fits; the sharp part is proving second-order equals paired sampling.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0
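
For context on the PolySHAP entry above: KernelSHAP estimates Shapley values by fitting a weighted linear model over coalitions, and the summary says PolySHAP swaps that linear fit for a higher-degree polynomial. The sketch below shows only the linear baseline, on a toy 4-player game with every coalition enumerated. The game, the function names, and the large finite weight standing in for the kernel's infinite weight on the empty and full coalitions are illustrative assumptions, not the paper's code.

```python
import itertools
import math

import numpy as np

PAYOFFS = (1.0, 2.0, 3.0, 4.0)  # hypothetical per-player payoffs


def toy_game(coalition):
    """Hypothetical game: additive payoffs plus a bonus of 2 when players 0 and 1 cooperate."""
    base = sum(PAYOFFS[i] for i in coalition)
    bonus = 2.0 if {0, 1} <= set(coalition) else 0.0
    return base + bonus


def kernel_shap_exact(value_fn, n, big_weight=1e7):
    """Weighted least squares over all 2^n coalitions using the Shapley kernel."""
    rows, targets, weights = [], [], []
    for size in range(n + 1):
        for coalition in itertools.combinations(range(n), size):
            row = np.zeros(n + 1)
            row[0] = 1.0                      # intercept column
            for i in coalition:
                row[i + 1] = 1.0              # membership indicator for player i
            if size in (0, n):
                weight = big_weight           # assumption: stands in for infinite weight
            else:
                weight = (n - 1) / (math.comb(n, size) * size * (n - size))
            rows.append(row)
            targets.append(value_fn(coalition))
            weights.append(weight)
    sqrt_w = np.sqrt(np.array(weights))
    design = np.array(rows) * sqrt_w[:, None]
    rhs = np.array(targets) * sqrt_w
    coef, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    return coef[0], coef[1:]


baseline, phi = kernel_shap_exact(toy_game, n=4)
print(np.round(phi, 3))
```

In this toy game the cooperation bonus splits evenly between players 0 and 1, so the true Shapley values are 2, 3, 3, and 4, and the fit should match them up to the small error introduced by the finite weight. Full enumeration is only feasible for tiny games; practical KernelSHAP samples coalitions instead.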

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

The paper proposes a style conditioning framework that encodes a reference motion into a style embedding, uses a hypernetwork to generate LoRA updates at each diffusion denoising step, and reports state-of-the-art stylized text-to-motion results on HumanML3D and 100STYLE, including improved generalization to unseen styles.

#Multimodal#Fine-tuning#HumanML3D#100STYLE

why featured

This niche multimodal paper clears HKR-K through a concrete LoRA-at-denoising mechanism and named benchmarks. It avoids hard exclusion, but lacks product impact and HKR-H/R, so it stays in the 40–59 band.

editor take

HyperLoRA generates LoRA per denoising step from reference motion; SOTA is reported, but speed and rank are undisclosed.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0
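
The mechanism in the entry above (a hypernetwork that emits LoRA updates per denoising step from a style embedding) can be pictured with a small NumPy sketch. Everything here is assumed for illustration: the layer sizes, the rank, the single linear hypernetwork, the sinusoidal step features, and the names `ToyHyperNet` and `adapted_forward`. The summary does not disclose the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT, RANK = 16, 16, 4      # hypothetical layer size and LoRA rank
STYLE_DIM, TIME_DIM = 8, 4         # hypothetical embedding sizes


def timestep_features(t, dim=TIME_DIM):
    """Sinusoidal features for a denoising step index (an assumed choice)."""
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(t / freqs), np.cos(t / freqs)])


class ToyHyperNet:
    """Toy hypernetwork: one linear map from (style, step) to LoRA factors A and B."""

    def __init__(self):
        cond_dim = STYLE_DIM + TIME_DIM
        n_out = RANK * D_IN + D_OUT * RANK
        self.proj = rng.normal(scale=0.1, size=(n_out, cond_dim))

    def factors(self, style_emb, t):
        cond = np.concatenate([style_emb, timestep_features(t)])
        flat = self.proj @ cond
        a = flat[: RANK * D_IN].reshape(RANK, D_IN)
        b = flat[RANK * D_IN:].reshape(D_OUT, RANK)
        return a, b


def adapted_forward(x, base_weight, style_emb, t, hyper):
    """Apply the frozen base weight plus the step-specific low-rank update B @ A."""
    a, b = hyper.factors(style_emb, t)
    return x @ (base_weight + b @ a).T


base_weight = rng.normal(size=(D_OUT, D_IN))
style_emb = rng.normal(size=STYLE_DIM)
hyper = ToyHyperNet()
x = rng.normal(size=(2, D_IN))

for t in (0, 10):
    a, b = hyper.factors(style_emb, t)
    out = adapted_forward(x, base_weight, style_emb, t, hyper)
    print(t, out.shape, np.linalg.matrix_rank(b @ a))
```

The point of the sketch is that the base weight never changes; only the low-rank update depends on the style embedding and the step, so the update's rank is bounded by `RANK`.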

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→SMA: Submodular Modality Aligner for Data-Efficient Multimodal Learning

The paper introduces SMA, a Submodular Modality Aligner that uses Submodular Mutual Information to align multimodal sets, and evaluates it on 14 zero-shot classification and retrieval tasks from the CLIP benchmark under low-data conditions.

#Multimodal#Vision#Benchmarking#SMA

why featured

HKR-K passes with a concrete mechanism and 14 CLIP zero-shot tasks. HKR-H/R are weak: the angle is technical-paper jargon, and the post does not disclose a production impact or usable release.

editor take

SMA runs 14 CLIP tasks with tens of thousands of samples; set-level alignment looks saner than hoarding pairs.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Causal Fine-Tuning under Latent Confounded Shift

The paper introduces Causal Fine-Tuning for latent confounded shift, instantiates it in BERT, decomposes representations into stable causal and shift-sensitive components, and reports stronger results than black-box domain generalization baselines in text spurious-correlation injection attack experiments.

#Fine-tuning#Alignment#Benchmarking#BERT

why featured

HKR-K passes for the representation-splitting mechanism and spurious-correlation tests, but HKR-H/R fail: no numbers, product path, or practitioner nerve. This stays in the lower research-signal band.

editor take

CFT splits stable and shift-sensitive BERT representations; no dataset or lift disclosed, so I’d treat it as a spurious-correlation training recipe.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Real-World Challenges in Fake News Detection: Dealing with Posts by Cold Users

The paper proposes USER EVIDENCE NETWORK for fake news and rumor detection under cold-user conditions, using existing users’ interactions to approximate missing behavior data; the RSS snippet does not disclose dataset sizes, metric results, or code availability.

#RAG#USER EVIDENCE NETWORK#Research release

why featured

This is a narrow arXiv paper: HKR-K comes from the cold-user modeling mechanism, while dataset size, metrics, and code are absent. HKR-H/R are weak, so it stays in the 40–59 band.

editor take

UEN fills cold-user evidence from existing-user interactions; no datasets, metrics, or code in RSS, so treat it as directional.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→A3B2: Adaptive Asymmetric Adapter for Branch Bias in Few-Shot Vision-Language Image Classification

The paper proposes A3B2, an adaptive asymmetric adapter for few-shot vision-language image classification, and evaluates it on 3 few-shot tasks across 11 datasets against 11 prompt- and adapter-based baselines, using UAAD to suppress image-branch adaptation when prediction uncertainty is high.

#Vision#Fine-tuning#Multimodal#CLIP

why featured

HKR-K passes for a concrete UAAD mechanism and benchmark setup. HKR-H/R fail because this is a niche vision-language few-shot adapter paper with little practitioner debate, so it stays in the 40–59 band.

editor take

A3B2 beats 11 baselines on 3 tasks and 11 datasets; I buy UAAD, but effect sizes and OOD splits are undisclosed.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

The paper introduces IV-ICL, an amortized Bayesian in-context learning method that learns marginal posteriors for causal effects and derives bounds from quantiles, evaluating it on synthetic and semi-synthetic IV benchmarks with 20–500x lower inference time than efficient semi-parametric, Bayesian, and plug-in baselines.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

Hard-exclusion-technical-accessibility applies: IV causal-effect bounds and amortized Bayesian posteriors are specialist-heavy with no product on-ramp. HKR-K passes on the 20–500x claim, but HKR-H/R are weak.

editor take

IV-ICL claims 20–500x faster inference; I buy the ICL framing, but code and semi-synthetic replication decide it.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0
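
The IV-ICL summary above says bounds are derived from quantiles of a learned marginal posterior. One plausible reading, sketched below, takes two quantiles of posterior samples as the interval endpoints. The function name, the `alpha` parameter, and the synthetic samples are assumptions for illustration; how the paper actually draws or parameterizes the posterior is not stated in the summary.

```python
import numpy as np


def bounds_from_posterior(samples, alpha=0.05):
    """Return (lower, upper) as the alpha/2 and 1 - alpha/2 quantiles of the samples."""
    lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Hypothetical posterior samples of a causal effect; not taken from the paper.
rng = np.random.default_rng(0)
posterior_samples = rng.normal(loc=0.3, scale=0.1, size=10_000)
print(bounds_from_posterior(posterior_samples))
```

Once the posterior samples exist, the bound itself is a single quantile call, which fits the summary's emphasis on low inference time.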

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Context-Aware Web Attack Detection in Open-Source SIEM Systems Using MITRE ATT&CK Behavioral Profiling

Smart-SIEM uses per-source-IP context vectors and a LightGBM/XGBoost two-stage cascade to detect web attacks on 46,454 Wazuh events, reaching 0.967 F1 for binary detection and 0.914 F1 for six-class categorization.

#Benchmarking#Wazuh#MITRE ATT&CK#Smart-SIEM

why featured

Triggers hard-exclusion-technical-accessibility: Wazuh, MITRE ATT&CK, and web-attack detection are specialist security material with no AI product or agent impact. HKR-K passes on metrics, but the item is capped at 39.

editor take

Smart-SIEM hits 0.967 binary F1 on 46,454 Wazuh events; I buy context features, not the self-built dataset leap.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→On the Limits of Latent Reuse in Diffusion Models

The paper analyzes latent reuse in diffusion models under source-target distribution shift, showing that target-domain score error is governed by principal-angle misalignment between subspaces and target ambient noise amplified by the diffusion time scale.

#Multimodal#Reasoning#Research release

why featured

Hard-exclusion technical-accessibility fail: this is a theory paper on diffusion latent reuse errors, with no product, tool, or accessible experiment path. Only HKR-K passes, so importance is capped at 39.

editor take

The paper pins latent reuse failure on subspace angle mismatch and target noise amplification. Cheap transfer needs geometry checks first.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0
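
The "geometry check" in the entry above can be made concrete: principal angles between two subspaces come from the singular values of the product of their orthonormal bases. The sketch below is the standard QR-plus-SVD computation in NumPy. The source and target bases are made-up 3-D examples, and nothing here reproduces the paper's error bound.

```python
import numpy as np


def principal_angles(basis_a, basis_b):
    """Principal angles (radians, ascending) between the column spans of two matrices."""
    q_a, _ = np.linalg.qr(basis_a)          # orthonormal basis for span(basis_a)
    q_b, _ = np.linalg.qr(basis_b)          # orthonormal basis for span(basis_b)
    singular_values = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    # Clip guards against values slightly above 1 from floating-point rounding.
    return np.arccos(np.clip(singular_values, -1.0, 1.0))


theta = np.pi / 6
# Hypothetical source subspace: span{e1, e2}.
source = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
# Hypothetical target subspace: e1 is shared, e2 is tilted toward e3 by theta.
target = np.array([[1.0, 0.0],
                   [0.0, np.cos(theta)],
                   [0.0, np.sin(theta)]])

print(np.degrees(principal_angles(source, target)))
```

For these two planes the angles should come out at roughly 0 and 30 degrees: one direction is shared and the other is tilted by `theta`.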

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

The paper introduces Neural LoFi, a stylized limit of gradient-based training that turns hierarchical feature learning into a layerwise spectral procedure; the arXiv abstract says experiments cover fully connected and convolutional architectures, but the post does not disclose dataset sizes.

#Reasoning#Interpretability#Benchmarking#arXiv

why featured

HKR-K passes, but Neural LoFi is spectral deep-learning theory with no on-ramp for general AI practitioners, triggering hard-exclusion-technical-accessibility and the 39 cap. Dataset scale and reproducible conditions are not disclosed.

editor take

Neural LoFi frames deep training as layerwise low-degree spectral filtering in 62 pages with code; useful mechanism, not an LLM theory yet.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→The Payment Heterogeneity Index: An Unsupervised Framework for High-Volume Procurement Oversight

The paper introduces PHI, an unsupervised screening framework for post-award procurement payments, and identifies 10.1% of high-volume UK municipal suppliers as structurally deviant, with permutation tests, Kolmogorov-Smirnov tests, and a Certified Fraud Examiner review supporting the flagged cases.

#Benchmarking#arXiv#Research release

why featured

HKR-K passes via PHI, the 10.1% supplier finding, and validation tests. HKR-H/R are weak: this is procurement-audit ML, with no model, product, or agent-system impact, so it stays in the low-value research band.

editor take

PHI flags 10.1% of high-volume suppliers; without labels, fraud detection claims stay capped at plausibility.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

The paper proposes TCE, a cross-domain offline reinforcement learning framework that uses a dual score-based generative model to synthesize target-consistent transitions over expanded state regions; the abstract reports experiments across diverse cross-domain environments, but it does not disclose dataset sizes.

#Robotics#Research release

why featured

Hard-exclusion technical-accessibility fail: offline RL plus score-based transition generation is specialist material, with no product path or dataset scale disclosed. HKR-K passes, but the item is capped below 40.

editor take

TCE uses dual score models to expand target coverage; no benchmark numbers disclosed, so robotics claims need replication.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→From Heuristics to Analytics: Forecasting Effort and Progress in Online Learning

The paper uses ITS logs from 425 middle-school students over one school year to predict weekly practice minutes and newly mastered skills, benchmarking 15 predictors and reducing MAE by 22–33% versus heuristic baselines.

#Benchmarking#Interpretability#Research release#Benchmark

why featured

HKR-K passes with a clear dataset, task, and 22–33% MAE reduction. HKR-H/R are weak: this is niche learning analytics, not a model, product, or agent-system story for most AI practitioners.

editor take

Logs from 425 students cut weekly-forecast MAE by 22–33%; don’t sell personalization yet, because 8 tutor interviews are not evidence of intervention lift.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Study uses graph neural networks and multimodal data to classify esophageal motility disorders

The study trains a GNN-based multimodal classifier on HRIM recordings and patient data from 104 patients with esophageal motility disorders, representing HRIM signals as spatio-temporal graphs. It reports ablation gains over HRIM-only feature models and vision-based classifier baselines, but the abstract does not disclose exact accuracy, dataset split, or external validation results.

#Multimodal#Reasoning#Benchmarking#Research release

why featured

Triggers hard-exclusion-4: a medical diagnosis paper uses AI as a tool with no agent, product, or AI-infrastructure implication. HKR-K passes on sample size and mechanism, but HKR-H/R fail, so it is capped below 40.

editor take

HRIM from 104 patients becomes spatio-temporal graphs; small cohort, so treat GNN gains as a clinical-feature fusion probe.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Researchers introduce supervised deep multimodal matrix factorization for brain network analysis

The paper introduces SD3MF, extending SNMTF from unsupervised single-graph clustering to supervised prediction over multimodal graph populations, and the authors provide reproducibility code on GitHub.

#Multimodal#Interpretability#Benchmarking#SD3MF

why featured

Triggers hard-exclusion-4: AI is used for brain-network science, with no agent or product implication. HKR-K passes for SD3MF and code, but the niche technical scope keeps it excluded and capped below 40.

editor take

SD3MF targets multimodal brain graphs with released code; no dataset size or gain margin is disclosed, so I don’t buy the CNN/GNN win yet.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Graph-Based Financial Fraud Detection with Calibrated Risk Scoring and Structural Regularization

arXiv:2605.12782 proposes a graph neural network framework for financial transaction fraud detection, using shared attributes and interaction consistency to build a transaction graph; the abstract says experiments use a public financial transaction dataset but does not disclose metric values.

#Benchmarking#Research release#Benchmark

why featured

This is a routine applied arXiv paper: HKR-K comes from the stated mechanisms, while HKR-H and HKR-R are weak. The body gives no concrete results, so it stays in the low-value research-update band.

editor take

arXiv:2605.12782 gives a public-dataset setup, but no AUC or calibration error; I’d file this as routine GNN fraud stacking.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

26d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·14

→Learning Local Constraints for Reinforcement-Learned Content Generators

The paper constrains a PCGRL generator’s action space with WFC-learned local constraints and tests input count, input type, random starting-state collapse, and rare-pattern exclusion on puzzle-platform levels such as Lode Runner.

#Agent#Reasoning#arXiv#Wave Function Collapse

why featured

HKR-K passes via a concrete WFC-to-PCGRL mechanism and test conditions. HKR-H/R are weak; the niche procedural-content focus limits relevance for general AI practitioners, with no hard-exclusion trigger.

editor take

PCGRL constrains actions with WFC priors. Nice hybrid, but hyperparameter sensitivity and no disclosed generalization evidence keep it niche.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0
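
The hybrid in the entry above (WFC-learned local constraints restricting a PCGRL generator's action space) reduces, in its simplest form, to an action mask. The sketch below learns which tiles were seen next to each other in one example level, then filters the tiles a generator may place in a cell. The tile alphabet, the example level, and both function names are hypothetical. Full WFC also extracts larger patterns and propagates constraints, which this sketch omits.

```python
from collections import defaultdict

# Offsets for the four neighbours of a cell: (row delta, col delta).
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def learn_adjacency(example):
    """Record, for each (tile, direction), the neighbour tiles seen in the example level."""
    allowed = defaultdict(set)
    rows, cols = len(example), len(example[0])
    for r in range(rows):
        for c in range(cols):
            for name, (dr, dc) in DIRECTIONS.items():
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    allowed[(example[r][c], name)].add(example[nr][nc])
    return allowed


def legal_tiles(level, r, c, tiles, allowed):
    """Tiles that can go at (r, c) without contradicting any already-placed neighbour."""
    rows, cols = len(level), len(level[0])
    legal = []
    for tile in tiles:
        ok = True
        for name, (dr, dc) in DIRECTIONS.items():
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and level[nr][nc] is not None:
                if level[nr][nc] not in allowed[(tile, name)]:
                    ok = False
        if ok:
            legal.append(tile)
    return legal


# Hypothetical example level: "." empty, "#" brick, "H" ladder.
example_level = ["..H.",
                 "##H#",
                 "..H.",
                 "####"]
allowed = learn_adjacency(example_level)

# Partially built level: bottom row is brick, top row is undecided (None).
partial = [[None, None],
           ["#", "#"]]
mask = legal_tiles(partial, 0, 0, [".", "#", "H"], allowed)
print(mask)
```

In the example level, brick never sits directly on brick, so the mask for the top-left cell keeps empty space and ladder and drops brick. An RL policy would then sample only from the masked set.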

01:57

26d ago

HuggingFace Papers (takara mirror) · rss · EN · 01:57 · 05·14

→What Makes Words Hard? Sakura at BEA 2026 Vocabulary Difficulty Prediction Task

Sakura describes two vocabulary difficulty prediction models: a black-box LLM fine-tuned with soft-target loss ranked first in the open track with r > 0.91, while an explainable model reached r > 0.77 and the authors released code on GitHub.

#Fine-tuning#Interpretability#Benchmarking#Sakura

why featured

HKR-K passes with concrete rank, correlation numbers, and open code. HKR-H and HKR-R are weak because this is a narrow BEA shared task with limited product, cost, or competitive relevance for AI practitioners.

editor take

Sakura hits r > 0.91, but the KVL finding stings: spelling and item design are contaminating vocabulary benchmarks.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

01:37

26d ago

● P1 · HuggingFace Papers (takara mirror) · rss · EN · 01:37 · 05·14

→EnergyLens: Multi-GPU LLM Inference Energy Modeling and Optimization

EnergyLens models multi-GPU LLM inference energy with an einsum-based interface covering fusion, parallelism, overlap, MoE load imbalance, and communication energy, then validates on Llama3 and Qwen3-MoE. Reported MAPEs range from 9.25% to 13.19% for multi-GPU prefill and decode energy and reach 12.97% across SM allocations, while decode efficiency varies by up to 52.9x across configurations.

#Inference-opt#Benchmarking#Llama3#Qwen3-MoE

why featured

HKR-H, HKR-K, and HKR-R pass: the 52.9x efficiency gap is clickable, the MAPE range is testable, and GPU energy cost is a real nerve. The topic is infra-specialized, so it stays in the featured-threshold band.

editor take

EnergyLens attacks the lazy latency-as-energy proxy; the 88.2% number is promising, but two arXiv entries are one source chain, not field validation.

sharp

Both entries trace to the same arXiv paper, 2605.10556, with identical title framing; this is a paper signal, not independent validation. EnergyLens lands because it rejects the common latency proxy: the authors say latency and energy optima diverged in over 20% of tested configurations, then fit a 12-parameter closed-form model separating tensor parallelism, pipeline parallelism, prefill, and decode. I buy the direction, not the deployment claim yet. Reaching 88.2% Top-1 configuration selection from fifty profiling measurements beats a 60.9% analytical baseline, and the 10x sample reduction versus ensemble ML is a useful hook. But the abstract does not disclose the full model list, accelerator SKUs, or power-measurement path. For inference teams, this belongs as a feature generator for schedulers, not a replacement for live A/B tests and rack-level energy accounting.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1
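
The MAPE figures quoted in the EnergyLens entry above are mean absolute percentage errors between measured and modeled energy. A minimal sketch of that metric follows. The function name and the joule readings are hypothetical and are not taken from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent, over paired measurements."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [abs(m - p) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical energy readings in joules; not taken from the paper.
measured_joules = [100.0, 200.0, 400.0]
predicted_joules = [110.0, 180.0, 400.0]
print(round(mape(measured_joules, predicted_joules), 2))
```

Here two predictions are off by 10% and one is exact, so the MAPE is about 6.67%. Because each error is divided by its own measured value, small readings weigh as much as large ones, which matters when prefill and decode energies differ by orders of magnitude.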

00:45

26d ago

HuggingFace Papers (takara mirror) · rss · EN · 00:45 · 05·14

→DT-Transformer Foundation Model Achieves Strong Performance in Disease Trajectory Prediction

DT-Transformer trains on 57.1M structured EHR entries from 1.7M Mass General Brigham patients across 11 hospitals, and reports a median age- and sex-stratified AUC of 0.871 for next-event prediction across 896 disease categories.

#Benchmarking#Mass General Brigham#Research release#Benchmark

why featured

HKR-H and HKR-K pass via real-hospital EHR scale and concrete AUC data. HKR-R is weak because the paper sits outside most AI practitioners’ toolchain, so it stays in the 60–71 band.

editor take

DT-Transformer hits 0.871 AUC on 57.1M EHR entries; clinical value hinges on external validation, undisclosed here.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

00:11

26d ago

FEATURED · HuggingFace Papers (takara mirror) · rss · EN · 00:11 · 05·14

→MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

MetaAgent-X jointly optimizes automatic MAS design and execution with end-to-end reinforcement learning, reports up to 21.7% gains over automatic MAS baselines, and uses hierarchical rollout plus stagewise co-evolution to stabilize designer-executor training.

#Agent#Reasoning#Tools#MetaAgent-X

why featured

HKR-H/K/R all pass for an agent-systems paper with a 21.7% reported gain and concrete training mechanisms. With only the summary available and no code or third-party replication stated, it stays just above the featured threshold.

editor take

MetaAgent-X treats auto-MAS as training, not prompt orchestration; 21.7% is catchy, but absent benchmark names make me cautious.

sharp

MetaAgent-X is sharp because it trains the designer and executor together, instead of pretending workflow search equals adaptation. A lot of auto-MAS work only searches at test time, or optimizes the planner while freezing execution agents. This paper uses end-to-end RL, hierarchical rollout, and stagewise co-evolution for credit assignment, then reports up to 21.7% gains over automatic MAS baselines. I buy the direction, not the victory lap. The snippet gives no benchmark names, task mix, or baseline list, and does not say whether 21.7% is an average or a cherry-picked peak. Against AutoGen- or CrewAI-style orchestration, this is a cleaner research bet: make multi-agent behavior trainable. Against production agent stacks, it still owes the boring parts: training cost, failure cases, and whether the learned scripts survive tool/API drift.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1
