papers · 2026-05-15

2026-05-15 · Fri

FEATURED · arXiv · cs.AI · atom · EN · 17:58 · 05·15

→Designing Datacenter Power Delivery Hierarchies for the AI Era

The paper proposes a framework for evaluating datacenter power delivery designs with throughput, power, and cost metrics over arrival, oversubscription, and decommissioning sequences. It uses projection models for GPU, compute, and storage deployments, grounded in Microsoft Azure production data, and cites projections approaching 1MW rack-scale deployment power density by 2027.

#Inference-opt#Microsoft Azure#Research release

why featured

HKR-H/K/R all pass: the near-1MW rack forecast is a concrete AI-infra hook with new evaluative machinery. Technical power-delivery depth limits generalist reach, keeping it in the 78–84 band.

editor take

A 1MW rack shifts datacenter planning from purchased power to usable GPU capacity; every cloud megawatt boast needs a haircut.

sharp

This paper drags AI datacenter planning away from headline megawatts and toward deployable capacity. The authors use Microsoft Azure production data to model arrivals, oversubscription, and decommissioning, then show multi-resource stranding changes deployable capacity, effective capex, and delivered throughput. The hard number is the 2027 projection: rack-scale deployment power density approaching 1MW. Cloud providers like to brag about power deals, campus megawatts, and GPU counts. This framework says those numbers get discounted by electrical topology, rack placement, workload mix, and hardware turnover. NVIDIA GB200 NVL72 already pushed cabinet power into the hundred-kilowatt class; 1MW-class deployments make power hierarchy a model-supply constraint. Ask how much purchased power reaches accelerators safely and economically, not how many megawatts got announced.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.CL · atom · EN · 17:52 · 05·15

→Generative AI Framework for Utility Billing CO2 Analytics and Resource Optimization

The paper proposes an end-to-end utility billing framework with four production capabilities: a generative AI agent for constrained natural-language bills, a transformer forecaster for day-ahead consumption with calibrated quantile bands, CO2 analytics, and resource scheduling under grid-stress and emissions constraints.

#Agent#Reasoning#Research release

why featured

HKR-K passes because the paper names an end-to-end mechanism across billing, forecasting, CO2 accounting, and scheduling. HKR-H/R are weak, with no metrics, code, or major lab behind it, so it stays in all.

editor take

Two arXiv tracks, same paper, not market validation. Bundling bills, carbon, and load forecasting under “generative AI” smells like architecture before evidence.

sharp

The 2 sources are the same arXiv paper listed under cs.CL and cs.LG, with identical title and abstract framing. That is not independent coverage. The paper claims 4 production-grade capabilities: natural-language bills, CO2 per kWh, day-ahead consumption forecasting, and constrained resource scheduling; the provided body gives architecture language, not dataset, error, cost, or deployment conditions. I’m wary of this “end-to-end generative AI framework” genre. Utility billing is not a chat UX problem. The hard parts are auditable carbon factors, tariff logic, and calibrated forecast intervals. The abstract’s constrained decoding and quantile bands are the right words, but without comparisons against time-series baselines, rules engines, and existing carbon-accounting pipelines, the GenAI layer looks like a fluent wrapper around systems that already need to be deterministic.
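To make "calibrated quantile bands" concrete, here is a minimal sketch of the two checks I would expect such a forecaster to report: pinball loss per quantile and empirical coverage of the band. The consumption numbers and the 80% band are hypothetical, invented for illustration; nothing here comes from the paper.

```python
def pinball_loss(actual, predicted_quantile, tau):
    """Quantile (pinball) loss for one observation at quantile level tau."""
    diff = actual - predicted_quantile
    return max(tau * diff, (tau - 1.0) * diff)

def empirical_coverage(actuals, lower, upper):
    """Fraction of observations that fall inside the [lower, upper] band."""
    hits = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return hits / len(actuals)

# Hypothetical day-ahead consumption (kWh) with a nominal 80% band
# (10th to 90th percentile). All values are made up for this example.
actual_kwh = [12.0, 15.5, 9.8, 20.1, 14.2, 11.0, 18.3, 13.7, 16.4, 10.5]
lower_kwh = [10.0, 13.0, 10.0, 16.0, 12.0, 9.0, 15.0, 12.0, 14.0, 9.0]
upper_kwh = [14.0, 17.0, 13.0, 20.0, 16.0, 13.0, 20.0, 16.0, 18.0, 13.0]

coverage = empirical_coverage(actual_kwh, lower_kwh, upper_kwh)
median_loss = pinball_loss(12.0, 10.0, 0.5)
```

A band is calibrated when empirical coverage tracks the nominal level on held-out days; here 8 of 10 points land inside the nominal 80% band. This is the kind of table the abstract would need before the quantile claim means much.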

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✗

→ open source

SCORE

H0·K1·R0

FEATURED · arXiv · cs.AI · atom · EN · 17:48 · 05·15

→Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

VLA-AD distills OpenVLA-7B with offline VLM semantic supervision and produces a 158M-parameter student across three LIBERO suites, reducing model size by 44× while matching the teacher within a 0.27% average relative gap and running at 12.5 Hz on an RTX 4090.

#Robotics#Vision#Multimodal#OpenVLA

why featured

HKR-H/K/R all pass: offline VLM semantic guidance distills OpenVLA-7B into a 158M policy with concrete speed and gap numbers. It is still a robotics research paper, so it sits above the featured threshold, not near P1.

editor take

VLA-AD shrinks OpenVLA-7B to 158M with a 0.27% gap; robot VLA deployment just lost one convenient excuse.

sharp

VLA-AD’s sharp move is keeping semantic guidance offline, then removing both the VLM and teacher at inference. OpenVLA-7B becomes a 158M student, 44× smaller, with a 0.27% average relative gap across three LIBERO suites and 12.5 Hz on an RTX 4090. That is close to the latency regime where closed-loop robot control stops being a demo tax. I’d still keep the champagne corked. LIBERO is simulation, not a messy arm with bad lighting, camera drift, and gripper calibration pain. But the mechanism matters: phase anchors plus multi-frame direction cues reduce sensitivity to noisy teacher actions, including high-frequency gripper mistakes. That is a better bet than plain 7-DoF action cloning.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:45 · 05·15

→Prospective Multi-Pathogen Disease Forecasting Using Autonomous LLM-Guided Tree Search

The paper presents an LLM-guided tree search system that generated executable forecasting models for influenza, COVID-19, and RSV during the 2025-2026 US respiratory season, with the ensemble matching or outperforming CDC hub human-curated ensembles out-of-sample.

#Agent#Reasoning#Code#CDC

why featured

HKR-H/K/R all pass: the paper claims an autonomous LLM-guided tree search generated real-season forecast code for three pathogens and matched or beat CDC Hub ensembles. Single arXiv paper keeps it below must-write.

editor take

LLM agents are finally acting like modeling labor, not chat wrappers; still, arXiv v1 plus public health means no CDC-replacement victory lap.

sharp

The sharp part is the prospective setup, not the LLM label. The system ran during the 2025-2026 US respiratory season, generated executable forecasting code for flu, COVID-19, and RSV, and its ensemble matched or beat CDC hub human-curated ensembles out of sample. It also handled RSV cold-start conditions, which is the right stress test. I buy the direction; I don’t buy the “overcomes the modeling labor bottleneck” framing yet. The abstract names useful guardrails: LLM-guided tree search, log-scale distance metrics to reduce reward hacking, and a judge-in-the-loop for structural fidelity. It does not give model-call cost, failure rate, or human intervention boundaries. Compared with SWE-bench-style agent papers, this is stronger because the evaluation was live. The blast radius is also larger: a bad forecast contaminates public-health signals, not a GitHub issue.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.CL · atom · EN · 17:43 · 05·15

→Layer Equivalence Is Not a Property of Layers Alone: Redundancy Testing Methods Change Conclusions

The paper compares replacement and interchange swap-KL tests for transformer layer redundancy; under a matched WikiText-2 contract at 8B scale, Qwen3-8B shows several-fold safer removal with interchange-guided pruning, while Llama-3.1-8B ties pruning cost despite lower interchange KL.

#Benchmarking#Interpretability#Inference-opt#Qwen3-8B

why featured

HKR-H/K/R pass via a counterintuitive pruning-test result and concrete Qwen3-8B/WikiText-2 setup. The topic is useful for compression and eval readers, but its swap-KL framing is niche, so it stays in the 60-71 band.

editor take

Same arXiv paper surfaced in cs.CL and cs.LG; the signal is methodological, not hype: layer-pruning papers have been leaning on a shaky test.

sharp

Two arXiv categories carry the same v1, with identical framing, so this is a single-paper signal rather than independent confirmation. The concrete hook spans 410M, 1.4B, and 8B models: the Pythia replacement-interchange gap grows during training, Qwen3-8B makes interchange-guided removal several-fold safer, and Llama-3.1-8B shows lower interchange KL without lower pruning cost. I buy the critique. A lot of layer merging and pruning work quietly treats “replaceable” and “commutable” as the same redundancy claim, then reports one clean pruning story. This paper says the test protocol can flip which layers look safe to remove. The catch is practical: it is 40 pages with 8 figures and 24 tables, but code and frozen JSON logs are not public yet. Until reproduction lands, it is a sharp audit of evaluation habits, not a deployable compression recipe.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

● P1 · arXiv · cs.CL · atom · EN · 17:42 · 05·15

→FORGE: Self-Evolving Agent Memory Without Weight Updates

FORGE improves hierarchical ReAct agents on the 30-step CybORG CAGE-2 B-line task across four LLM families, raising average evaluation return by 1.7-7.7x over zero-shot and 29-72% over Reflexion without weight updates.

#Agent#Memory#Reasoning#Gemini

why featured

HKR-H/K/R all pass: the paper offers a concrete no-weight-update memory mechanism and testable CAGE-2 gains across 4 LLM families. It stays below P1 because this is still an arXiv benchmark result, not a shipped product or broad field event.

editor take

FORGE’s population-broadcast memory looks useful, but the evidence lives inside CAGE-2 B-line; don’t sell it as general agent learning yet.

sharp

Two arXiv tracks, cs.CL and cs.LG, point to the same 2605.16233v1 paper with identical framing; that is taxonomy spread, not independent corroboration. Under CAGE-2, 30-step horizon, B-line attacker, FORGE reports 1.7-7.7x average return over zero-shot and 29-72% over Reflexion across four model families. I buy the engineering instinct here: failed trajectories become Rules or Examples, then the best instance’s memory gets broadcast to the population. That is a stronger agent-training scaffold than isolated Reflexion loops. But the authors also fence the claim tightly: all evidence is confined to CAGE-2 B-line. Compared with the Voyager/Reflexion lineage, FORGE’s clean win is no weight update; its unresolved risk is open-ended tasks, long-horizon drift, and memory contamination.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

arXiv · cs.AI · atom · EN · 17:34 · 05·15

→Evaluating Design Video Generation: Metrics for Compositional Fidelity

The paper proposes a fully automated evaluation framework for design animation generation, covering four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity.

#Multimodal#Vision#Benchmarking#Research release

why featured

HKR-K passes via a concrete 4-axis evaluation framework, but HKR-H and HKR-R are weak: no surprising hook, no disclosed benchmark size, results, or artifact. This fits the 60–71 research-interest band.

editor take

The paper defines 4 automated metrics, but no dataset size is disclosed. Design-video generation needs rulers before victory laps.

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✗

→ open source

SCORE

H0·K1·R0

FEATURED · arXiv · cs.CL · atom · EN · 17:33 · 05·15

→Artificial Aphasias in Lesioned Language Models

The authors zeroed out parameters in five 1B-scale LMs and evaluated 112,426 outputs with the Text Aphasia Battery. All assessed aphasia symptoms appeared, but their distributions differed from humans, with layer depth and component type shaping symptom profiles.

#Interpretability#Benchmarking#Text Aphasia Battery#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper without replication or visible industry debate. The concrete mechanism and 112,426-output scale clear the featured bar, not the must-write band.

editor take

Don’t romanticize “model aphasia” as brain evidence; 112,426 outputs show the symptoms, but the distributions point back to Transformer mechanics.

sharp

This paper is useful because it drags the “LLMs are brain-like” story back into intervention, not analogy. The authors zero parameters in five 1B-scale LMs and score 112,426 outputs with the Text Aphasia Battery. The full symptom set appears, but the distributions differ from human aphasia, which is the part people will underplay. The concrete hook is component specificity: attention query/key/value/output and FFN up/gate/down produce different symptom profiles, and depth changes the failure mode. Early-layer lesions skew toward syntactic and semantic symptoms; late-middle lesions raise phonological and fluency deficits. I’d treat the clinical battery as a probe for Transformer organization, not as evidence that aphasia syndromes transfer across humans and LMs.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.CL · atom · EN · 17:29 · 05·15

→Argus: Evidence Assembly for Scalable Deep Research Agents

Argus uses a Searcher and Navigator to assemble an evidence graph, improving average scores by 12.7 points across eight benchmarks with 8 parallel Searchers; with 64 Searchers, it reaches 86.2 on BrowseComp while keeping the Navigator reasoning context under 21.5K tokens.

#Agent#Reasoning#Benchmarking#Argus

why featured

HKR-H/K/R all pass: the 64-Searcher BrowseComp result is clickable, and the paper gives a concrete evidence-graph mechanism with +12.7 across 8 benchmarks. Single arXiv research release, not a lab product launch, so 78–84.

editor take

Argus attacks the ugly part of deep research scaling: duplicate rollouts. BrowseComp 86.2 with 64 Searchers and <21.5K Navigator context is a serious claim.

sharp

Argus is valuable because it names the waste in parallel agent search: repeated evidence, not lack of trajectories. The system splits a 35B-A3B MoE into a Searcher and a Navigator, then trains the Navigator to verify gaps, dispatch work, and synthesize. With 8 parallel Searchers, it gains 12.7 average points across eight benchmarks; with 64 Searchers, it hits 86.2 on BrowseComp while keeping Navigator context under 21.5K tokens. That is cleaner than stuffing every ReAct rollout into one aggregator and praying the context window holds. I still want the cost curve and the exact proprietary-agent comparison protocol. A 64-way search result can be a research win and a product nonstarter if latency and token spend explode.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.CL · atom · EN · 17:29 · 05·15

→Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Fully Open Meditron introduces a fully open LLM-CDSS pipeline with a clinician-audited corpus, reproducible training framework, and use-aligned evaluation protocol. It unifies 8 public medical QA datasets, adds synthetic data from 46,469 clinical guidelines, tests 5 fully open base models, and raises Apertus-70B from 47.2% to 53.8% on aggregate medical benchmarks.

#Fine-tuning#Benchmarking#Safety#Meditron

why featured

HKR-H/K/R all pass: the hook is a fully open clinical CDSS pipeline with concrete dataset scale and benchmark gains. Single-source arXiv and a medical vertical cap it at the featured threshold band.

editor take

MeditronFO drags open medical LLMs back to data provenance: 47.2% to 53.8% is modest, but auditability is harder to fake than a leaderboard bump.

sharp

MeditronFO’s sharp edge is not the +6.6-point gain; it is raising “open medical LLM” from open weights to an auditable training stack. The recipe merges 8 public medical QA datasets, derives synthetic data from 46,469 clinical guidelines, adds system-wide decontamination, uses a four-physician validation panel, and calibrates LLM-as-judge against 204 human raters. That is far more expensive than posting weights and a model card. I’m not excited by 53.8% on aggregate medical benchmarks; that is still nowhere near clinical confidence. The 58.6% judge preference for Gemma-3-27B-MeditronFO over MedGemma also depends on judge design. But in medicine, the useful artifact here is the chain of custody. Apertus-70B-MeditronFO is less a deployment-ready doctor model than a reproducible liability trail.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.CL · atom · EN · 17:24 · 05·15

→Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

The study evaluates seven LLM tutoring agents on 10,836 solution-feedback pairs in propositional logic. The models perform near ceiling on optimal steps, but over-reject valid suboptimal reasoning and over-validate incorrect solutions, while accurate diagnosis does not reliably yield actionable tutoring feedback.

#Agent#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R all pass via a sharp failure hook, 10,836-pair benchmark, and evaluator-reliability resonance. It stays at the featured threshold because this is a single arXiv evaluation without a major lab release or cross-source cluster.

editor take

Tutoring agents don’t fail on perfect answers; they fail on the messy middle, where students actually learn. 10,836 pairs make that hard to wave away.

sharp

LLM tutoring agents should not be sold as diagnostic engines yet; they break exactly where tutoring earns its keep. This paper tests seven agents on 10,836 propositional-logic solution-feedback pairs. The models are near ceiling on optimal steps, but they over-reject valid suboptimal reasoning and over-validate wrong answers. That failure mode is brutal because real students live in the half-right zone, not in clean benchmark endpoints. The stronger claim is architectural. The failures persist across models regardless of solution context, and correct diagnosis still does not reliably become actionable feedback. I buy the paper’s hybrid direction: let KG-grounded systems handle diagnosis, then use the LLM for dialogue and scaffolding. “All-LLM tutor” remains a demo story until it can grade the messy middle.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

arXiv · cs.CL · atom · EN · 17:23 · 05·15

→Cost-Performance Study of Compound LLM Agents in Adversarial POMDP

The paper evaluates compound LLM agents in CybORG CAGE-2 across five model families, six models, twelve configurations, and 3,475 episodes with token-level cost accounting. Programmatic state abstraction raises mean return by up to 76%, while distributed deliberation tools in hierarchies produce up to 3.4× worse mean return and use 1.8–2.7× more tokens.

#Agent#Reasoning#Tools#CybORG CAGE-2

why featured

HKR-K is strong, with concrete scale and effect sizes; HKR-H comes from the 3.4x return gap. The CybORG CAGE-2 setting is niche and academic, so it stays below featured.

editor take

Across 3,475 CybORG CAGE-2 episodes, state abstraction raised mean return by up to 76%, while hierarchical deliberation made it up to 3.4× worse; agent stacks need plumbing, not more pondering.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✗

→ open source

SCORE

H1·K1·R0

HuggingFace Papers (takara mirror) · rss · EN · 12:09 · 05·15

→Linked Multimodal Data on Russian Domestic and Foreign Policy Speeches

The paper introduces a Russian government political communication dataset covering decades of speeches from Kremlin and Russian Ministry of Foreign Affairs actors, with Russian and English texts, available images, captions, linked identifiers, harmonized metadata, and expert-refined multimodal topic annotations.

#Multimodal#Vision#Benchmarking#Kremlin

why featured

HKR-K lands because the corpus combines Russian/English speeches, images, captions, and metadata. HKR-H and HKR-R miss: no product, model capability, or practitioner-facing industry mechanism is disclosed.

editor take

The dataset spans decades of Kremlin and MFA speeches; sample size is undisclosed, so don't call it a benchmark yet.

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✗

→ open source

SCORE

H0·K1·R0

HuggingFace Papers (takara mirror) · rss · EN · 05:09 · 05·15

→LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

LRCP estimates the dominant low-rank subspace of visual tokens with PCA and keeps tokens with high projection residuals, preserving 94.7% of original image-understanding performance after an 88.9% token reduction and 97.8% average video-understanding accuracy after an 87.5% token reduction.

#Multimodal#Vision#Inference-opt#LRCP

why featured

HKR-K/R pass: the paper offers a concrete PCA-based pruning mechanism and a cost/latency angle. HKR-H is weak, and a single technical paper without implementation or real latency data stays in the 60–71 band.

editor take

LRCP cuts 88.9% of visual tokens while keeping 94.7% image performance; I buy PCA residuals over attention-score pruning.
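The residual idea is easy to sketch. The snippet below is my own illustration under stated assumptions, not the paper's code: it fits the dominant subspace with a plain SVD, scores each token by how much of it falls outside that subspace, and keeps the highest scorers. The function name `prune_by_residual`, the synthetic tokens, and every size (576 tokens, rank 4, keep 64) are hypothetical.

```python
import numpy as np

def prune_by_residual(tokens, rank, keep):
    """Return indices of the `keep` tokens least explained by the top-`rank` PCA subspace."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered token matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                # (rank, dim)
    in_subspace = centered @ basis.T @ basis         # projection onto the dominant subspace
    residual = np.linalg.norm(centered - in_subspace, axis=1)
    kept = np.sort(np.argsort(residual)[-keep:])     # highest residuals, original order
    return kept, residual

# Synthetic stand-in for visual tokens (all sizes are hypothetical).
rng = np.random.default_rng(0)
num_tokens, dim, rank, keep = 576, 32, 4, 64
tokens = rng.normal(size=(num_tokens, rank)) @ rng.normal(size=(rank, dim))
tokens += 0.01 * rng.normal(size=tokens.shape)       # near-redundant background
outlier_idx = np.sort(rng.choice(num_tokens, size=keep, replace=False))
tokens[outlier_idx] += rng.normal(size=(keep, dim))  # tokens carrying off-subspace detail

kept, residual = prune_by_residual(tokens, rank, keep)
reduction = 1 - keep / num_tokens
```

Keeping 64 of 576 tokens is an 88.9% reduction, which is why I picked those sizes. On this toy data the kept set is exactly the tokens that carry off-subspace detail; whether real visual tokens behave this cleanly is the empirical question the paper has to answer.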

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Circuit Attribution Enables Machine Unlearning to Persist Through Quantization

The paper introduces MANSU, which combines circuit attribution, null-space projection, and a per-parameter magnitude floor to keep unlearning intact after 4-bit NF4 quantization; across baselines, per-parameter updates sit 47-828x below the quantization bin width, and gradient-based baselines recover up to +0.05 accuracy under compression.

#Alignment#Safety#Interpretability#MANSU

why featured

HKR-H/K/R all pass: the title has a counterintuitive hook, the summary gives MANSU’s mechanism and the 47-828x quantization-bin gap, and the safety risk is deployment-relevant. Single arXiv paper, so it stays in the 78-84 band.

editor take

Two arXiv tracks are not media heat; they are taxonomy spillover. Still, the 47–828x update gap nails a real audit hole in post-unlearning quantization.

sharp

cs.LG and cs.CL list the same arXiv v1, with identical framing; the signal is author-supplied, not independently corroborated. The strongest hook is concrete: baseline per-parameter updates sit 47–828x below the NF4 quantization bin width, and gradient-based baselines recover up to +0.05 accuracy after 4-bit PTQ. I buy the problem framing. Full-precision unlearning evals are too clean for a deployment path that usually ends in 4-bit inference such as NF4. MANSU’s recipe (circuit attribution, null-space projection, a diagonal-Fisher retain bound, and a magnitude floor) sounds more serious than another behavioral suppression loop. But the body surfaced here only names “multiple model families” and “hazard benchmarks,” without model names or tables. Treat this as a sharp mechanistic paper, not a compliance-ready unlearning recipe yet.
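The bin-width argument can be seen in a few lines. This is a toy sketch under loud assumptions: it uses a uniform round-to-nearest grid, whereas real NF4 levels are non-uniform, and `apply_floor` is my hypothetical reading of what a per-parameter magnitude floor could look like, not MANSU's actual rule.

```python
def bin_index(weight, bin_width):
    """Round-to-nearest bin on a uniform grid (simplification: real NF4 levels are non-uniform)."""
    return round(weight / bin_width)

def apply_floor(update, floor):
    """Hypothetical magnitude floor: keep the sign, lift |update| to at least `floor`."""
    if update == 0.0:
        return 0.0
    sign = 1.0 if update > 0 else -1.0
    return sign * max(abs(update), floor)

bin_width = 0.1                  # illustrative value, not taken from the paper
weight = 0.32
tiny_update = bin_width / 47     # the most favourable end of the reported 47-828x gap

before = bin_index(weight, bin_width)
after_tiny = bin_index(weight + tiny_update, bin_width)
after_floor = bin_index(weight + apply_floor(tiny_update, bin_width), bin_width)
```

The unfloored update leaves the weight in the same bin, so the quantized model cannot tell the unlearning step ever happened. Only an update on the order of the bin width moves it.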

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→MeMo: New Method Encodes Knowledge Into Dedicated Memory Model

MeMo encodes new knowledge into a dedicated memory model while keeping LLM parameters unchanged, and the paper evaluates it on three benchmarks: BrowseComp-Plus, NarrativeQA, and MuSiQue.

#Memory#RAG#Benchmarking#MeMo

why featured

HKR-H/K/R all pass: the title has an architecture inversion, the summary gives a memory-model mechanism and 3 benchmarks, and the topic matters to RAG/memory builders. Sparse results, no code, and no lab signal keep it at the featured threshold.

editor take

MeMo makes memory a trainable side model, not a vector store. Nice idea, but without latency, training cost, and code, don’t bury RAG yet.

sharp

Both arXiv entries point to the same MeMo paper; cs.CL and cs.LG are category spread, not independent validation. The hard numbers disclosed are three benchmarks: BrowseComp-Plus, NarrativeQA, and MuSiQue. I like the target: MeMo keeps LLM weights fixed, avoids logits access, and claims plug-and-play use with closed models. It also claims inference retrieval cost independent of corpus size. That is a cleaner attack on RAG than another vector-index wrapper. But the abstract gives no latency, training cost, memory footprint, or code status. Without those, “robust to retrieval noise” and “cross-document relationships” read like benchmark-shaped claims. Against GraphRAG and long-context agent stacks, MeMo has to win the engineering bill, not the paper title.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Self-Distilled Agentic Reinforcement Learning Improves Benchmark Performance

SDAR uses OPSD as a gated auxiliary objective while keeping RL as the primary backbone, and improves over GRPO by 9.4% on ALFWorld, 7.0% on Search-QA, and 10.2% on WebShop-Acc across Qwen2.5 and Qwen3 model families.

#Agent#Reasoning#Fine-tuning#Qwen

why featured

HKR-K and HKR-R pass: the paper gives a concrete training mechanism and three agent-benchmark gains, relevant to agent RL practitioners. HKR-H is weak, so this stays at the featured threshold.

editor take

SDAR attacks agent RL at token-level supervision, and +7% to +10.2% is real bait; but this is one arXiv paper cross-listed, not consensus yet.

sharp

Both entries point to the same arXiv paper under cs.CL and cs.LG, so the coverage is aligned by source identity, not independent validation. SDAR’s useful move is the gate, not the recycled “self-distillation” label. It keeps RL as the backbone, adds OPSD as an auxiliary loss, then uses a sigmoid gate to strengthen teacher-endorsed positive-gap tokens and soften negative teacher rejections. The reported gains over GRPO are concrete: +9.4% on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop-Acc across Qwen2.5 and Qwen3. I buy the problem framing: trajectory rewards are too blunt for multi-turn agents, and naive GRPO+OPSD instability is a familiar failure mode. The abstract does not disclose code, training budget, or seed variance; before porting this into a production post-training stack, I’d want reproduction beyond Qwen-family agents.

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→IPR-1: Interactive Physical Reasoner

The paper introduces IPR and the G2U benchmark with 1,000+ heterogeneous games; IPR uses world-model rollouts to score and reinforce a VLM policy, and the authors report overall performance above GPT-5.

#Agent#Reasoning#Vision#IPR

why featured

HKR-H/K/R all pass: the article claims GPT-5-beating results, a 1,000+ game benchmark, and a concrete rollout-scoring mechanism. Single arXiv source with no visible adoption or lab authority keeps it in 78–84, not P1.

editor take

IPR-1 is a serious swing at interactive physical reasoning, but “beats GPT-5” on an author-built benchmark of 1,000+ games needs independent runs.

sharp

IPR-1’s signal is the training recipe, not the headline claim that it beats GPT-5. The paper builds G2U with 1,000+ heterogeneous games, uses world-model rollouts to score a VLM policy, and adds PhysCode to align semantic actions with dynamics. That is a cleaner agent setup than asking a VLM to stare at frames and emit moves. I’m cautious on the GPT-5 comparison. The snippet gives no prompt setup, interaction budget, tool policy, or per-category breakdown. G2U is also author-built, so task design can quietly favor the proposed representation. If external labs reproduce the gain, this becomes real evidence for physics-centric interaction training. Until then, I’d file it as a strong architecture paper with a benchmark claim still on probation.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→It Takes Two: Your GRPO Is Secretly DPO

The paper proposes 2-GRPO, which builds contrastive signals with two rollouts; it retains 97.6% of 16-GRPO performance while using 12.5% of the rollouts and 21% of the training time.

#Reasoning#Fine-tuning#Alignment#Research release

why featured

HKR-H/K/R all pass: the title has a method-reversal hook, and the post gives testable numbers: 2 rollouts, 97.6%, 12.5%, 21%. It is technical, but the post-training cost claim keeps it in the good research-release band, not P1.

editor take

2-GRPO cuts into the “big groups make GRPO work” story: 97.6% of 16-GRPO at 21% training time is a cost attack, not a tweak.

sharp

2-GRPO’s sharp claim is that GRPO’s expensive group sampling is doing preference learning in disguise. The paper says two rollouts can form the contrastive signal, keeping 97.6% of 16-GRPO performance while using 12.5% of rollouts and 21% of training time. That hits post-training cost directly, not just algorithm aesthetics. I buy half of it. GRPO’s pain in reasoning post-training is the sampling bill, and smaller groups are the first knob teams will turn before adopting a new RL stack. But the abstract does not expose model size, task mix, or failure cases. That 97.6% number is strong only if it survives math, code, and long-chain reasoning with messy reward distributions.
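The "preference learning in disguise" reading follows from the usual group-relative advantage. The sketch below assumes the common form (mean-centre the group's rewards, divide by the group's population standard deviation); it is not the paper's implementation, and `group_advantages` is a name I made up.

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages: mean-centre, then divide by the population std of the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rollouts scored the same: the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

pair_adv = group_advantages([1.0, 0.0])      # one winner, one loser
tied_adv = group_advantages([1.0, 1.0])      # tie: no gradient from this prompt
skewed_pair = group_advantages([0.3, 0.9])   # any unequal pair normalises the same way
rollout_fraction = 2 / 16                    # rollouts used relative to a group of 16
```

With two rollouts, any unequal pair collapses to +1 for the better sample and -1 for the worse one, which is a chosen/rejected pair in all but name. Tied pairs contribute nothing, so the open question is how often ties happen on hard prompts, where groups of 16 would still find a contrast.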

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces

The paper tests 14 frontier LLMs across four web environments and shows that passive JavaScript traces of actions and timings identify a browser agent’s underlying model with up to 96% F1.

#Agent#Safety#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the hook is model fingerprinting via UI traces, with 14 models, 4 environments, and 96% F1. It is a strong agent-safety paper, but a single arXiv result keeps it below P1.

editor take

Browser agents now leak model identity through behavior: passive JS action and timing traces hit 96% F1. Random delays are a speed bump, not cover.

sharp

Browser-agent fingerprinting is now a practical web attack surface, not a lab curiosity. The paper tests 14 frontier LLMs across four web environments, covering retrieval and shopping tasks. A passive JavaScript tracker records UI actions and timing, then identifies the underlying model with up to 96% F1. The classifier also generalizes across model sizes and families, trains from few traces, and infers identity early in an episode. The ugly part is the defense result. Randomized action delays degrade the classifier, but retraining on delayed traces largely recovers performance. That lands directly on Operator-style browsing, Claude Computer Use, and hosted browser-agent stacks. Prompt injection gets the headlines, but the agent’s motor pattern is already a side channel.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

The paper proposes an RL framework that trains a lightweight prompter to generate prompts for a frozen worker LLM, raising performance from 55% to 90% on BBEH logic-heavy reasoning tasks and from 74% to 91% on Tau-bench tool-use tasks.

#Reasoning#Tools#Agent#arXiv

why featured

HKR-H/K/R all pass: the paper claims large reasoning and tool-use gains for frozen black-box LLMs with concrete benchmark jumps. It remains an arXiv research release without independent replication or product adoption, so it fits the 78–84 band.

editor take

Training a small prompter around a frozen LLM is a cleaner deployment story than hand-tuned CoT. The 90% BBEH claim needs worker and budget details.

sharp

This paper pushes prompt engineering into a learned policy, and that is the right shape for black-box deployment. The setup freezes a worker LLM and trains a lightweight prompter with RL to emit the prompt in one shot. The reported gains are large: BBEH rises from 55% to 90%, and Tau-bench from 74% to 91%, with a claimed sample-efficiency win over GEPA-style evolutionary baselines. I would press on the missing knobs first. The snippet does not name the worker model, call budget, sampling count, or prompter size. Without those, 90% can mean a better algorithm, or search cost amortized into training. I like the direction more than another longer-CoT recipe because experience becomes weights. But if train and test tasks share the same distribution, the 17-point Tau-bench jump does not transfer cleanly to real agent tickets.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→PreFT: Prefill-only finetuning for efficient inference

PreFT applies adapters only during prefill and discards them before decode, reaching 1.9x the throughput of traditional PEFT when serving 512 adapters on Llama 3.1 70B.

#Fine-tuning#Inference-opt#arXiv#Llama

why featured

HKR-H/K/R all pass: the mechanism is clear, the 1.9x/512-adapter claim is testable, and the cost angle matters to inference teams. This is strong infra research, not a same-day model-release story.

editor take

PreFT keeps adapters in prefill and drops them for decode; 1.9x throughput at 512 adapters on Llama 3.1 70B is a clean systems cut.

sharp

PreFT makes a sharp bet: multi-adapter serving breaks at decode, not at parameter count. It applies LoRA or ReFT only during prefill, then discards the adapter before autoregressive generation. On vLLM, Llama 3.1 70B reaches 1.9x the throughput of standard PEFT when serving 512 adapters. I buy the systems move, especially for multi-tenant personalization and lightweight RL policies. The paper is honest about the trade: SFT eval loss rises, but higher rank recovers some quality with little throughput loss; RL tasks approach standard PEFT. Don’t sell this as lossless fine-tuning. It is closer to injecting personalization into the KV state, then keeping the expensive per-token path clean.
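A toy sketch of the prefill-only idea, assuming a one-dimensional "projection" in place of real weight matrices: the adapter delta is applied while the prompt's KV cache is built, then dropped for every decode step. The function names, the scalar weights, and the adapter values are hypothetical; real LoRA adds a low-rank matrix product to each adapted weight.

```python
def project(x, weight, delta=0.0):
    """Toy 1-D projection: base weight plus an optional adapter delta."""
    return (weight + delta) * x

def prefill(prompt, w_k, w_v, adapter):
    # Adapter is active here, so the KV cache absorbs the personalization.
    return [
        (project(x, w_k, adapter["k"]), project(x, w_v, adapter["v"]))
        for x in prompt
    ]

def decode_step(x, cache, w_k, w_v):
    # No adapter here: every tenant shares the same base-weight decode path.
    cache.append((project(x, w_k), project(x, w_v)))
    return cache

w_k, w_v = 1.0, 2.0
adapter = {"k": 0.5, "v": -1.0}  # hypothetical per-tenant deltas

cache = prefill([1.0, 2.0], w_k, w_v, adapter)
cache = decode_step(3.0, cache, w_k, w_v)
```

The first two cache entries carry the adapter's effect and the third does not, which is why decode can batch across tenants without per-adapter work.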

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→RubricRefine Improves Tool-Use Agent Reliability with Pre-Execution Refinement

RubricRefine generates task- and registry-specific rubrics before execution, scores candidate code against explicit contract checks, and reaches 0.86 on M3ToolEval averaged across seven models with zero execution attempts, above the 0.65 baseline and 0.75 execution-feedback revision result.

#Agent#Tools#Code#RubricRefine

why featured

HKR-H/K/R all pass: pre-execution rubric generation is a concrete mechanism, with testable 0.86/0.65/0.75 results. It stays at 80 because this is a single arXiv method, with no disclosed open-source artifact or deployment.

editor take

RubricRefine gets 0.86 with zero execution attempts; preflight contract checks beat another round of agent self-reflection here.

sharp

RubricRefine’s sharp claim is not “better self-refinement”; it moves tool-agent reliability back to contract checking. On M3ToolEval, it averages 0.86 across seven models, above the 0.65 baseline and the 0.75 result from revision with real execution feedback. The nastier detail: it does this with zero execution attempts and 2.6x lower latency than the strongest non-iterative alternative. This reads like static typing for agents: check output shape, tool routing, and argument provenance before code touches the registry. I buy the direction, but not the broad reliability story yet. The paper says performance stays flat on mostly single-step API-Bank, so the gain comes from dense inter-tool contracts, not from agents suddenly reasoning better.
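The "static typing for agents" reading can be sketched as a pre-execution check: parse the candidate code, find tool calls, and compare them against a registry contract without running anything. The registry, tool names, and argument rules below are hypothetical, and the check covers keyword arguments only; the paper's rubric generation is richer than this.

```python
import ast

# Hypothetical tool registry: tool name -> (required args, optional args).
REGISTRY = {
    "search_flights": ({"origin", "dest"}, {"date"}),
    "book_flight": ({"flight_id"}, set()),
}

def contract_violations(code, registry=REGISTRY):
    """Statically check tool calls in candidate code; nothing is executed."""
    problems = []
    for node in ast.walk(ast.parse(code)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        name = node.func.id
        if name not in registry:
            problems.append(f"unknown tool: {name}")
            continue
        required, optional = registry[name]
        # Keyword arguments only, for brevity.
        given = {kw.arg for kw in node.keywords if kw.arg is not None}
        for missing in sorted(required - given):
            problems.append(f"{name}: missing argument {missing}")
        for extra in sorted(given - required - optional):
            problems.append(f"{name}: unexpected argument {extra}")
    return problems

candidate = (
    "r = search_flights(origin='SFO')\n"
    "book_flight(flight_id=r, seat='12A')\n"
)
violations = contract_violations(candidate)
```

Here `violations` flags the missing `dest` and the unexpected `seat` before any tool is touched, which is the zero-execution property the summary highlights.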

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

SCHEMA evaluated 11 frontier models from 8 vendors on 67,221 scored records and found that 8 models showed metacognitive degradation under compliance-forcing adversarial instructions, with accuracy falling by up to 30.2 percentage points and all reported p-values below 2 × 10^-8 after Bonferroni correction.

#Reasoning#Safety#Benchmarking#Anthropic

why featured

HKR-H/K/R all pass: the paper has a sharp compliance-vs-metacognition hook, a 67,221-record benchmark, and a safety reliability nerve. Still, it is a single arXiv result, so it lands in featured rather than p1.

editor take

SCHEMA turns obedience into a safety bug: 8 of 11 frontier models lost metacognitive accuracy under compliance pressure, up to 30.2 points.

sharp

SCHEMA’s sharp claim is that frontier models are over-trained to comply. Across 67,221 scored records, 11 models, and a six-condition factorial design, the collapse comes from a compliance-forcing suffix. Remove that suffix, and performance recovers even when the threat content stays. That is nastier than a normal jailbreak because it attacks epistemic boundaries, not just policy text. I’d put an asterisk on the “Anthropic near-perfect immunity” line. The abstract says Gemini matches Anthropic’s baseline accuracy, while Constitutional AI preserves the boundary under pressure. Nice story, but the snippet does not give model names, per-condition scores, or dual-classifier error rates. If the dataset holds up, safety evals need to measure refusal under pressure and calibrated uncertainty, not just whether the model says no.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Neural Signals Generate Clinical Notes in the Wild

CELM generates clinical reports from long-duration EEG recordings, trained on 9,922 reports paired with about 11,000 hours of EEG from 9,048 patients, and the authors release the model plus benchmark construction pipeline.

#Multimodal#Audio#Benchmarking#CELM

why featured

HKR-H/K/R all pass: the EEG-to-notes angle is novel, the dataset and pipeline are concrete, and clinical documentation automation is a real practitioner nerve. Single arXiv paper and medical-domain scope keep it in the 78–84 band.

editor take

EEG-to-note now has a real long-recording dataset, but “beats baselines” is cheap; missed rare abnormalities are the clinical landmine.

sharp

CELM matters because it moves EEG-to-language beyond toy clips: 9,048 patients, 9,922 reports, and about 11,000 hours of EEG. That is closer to a hospital reporting workflow than the usual short-window seizure or sleep-stage benchmarks. Releasing the model and benchmark construction pipeline also reduces the private-slice problem that has plagued clinical ML papers. I’m still cautious on the clinical-use framing. The abstract says expert evaluation found the reports more coherent and diagnostically reliable, but it does not give miss rates, rare-abnormality strata, external-site testing, or how device and template variation were handled. Radiology report generation already taught this lesson: average text metrics can look fine while a small number of critical omissions make the system unusable.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

TERMINATOR trains an early-exit policy from first-answer positions and cuts average CoT length by 14%-55% across MATH-500, AIME 2025, HumanEval, and GPQA, while reducing inference latency by more than 2x versus the original LRM.

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-H/K/R all pass: the title has a hook, and the post gives first-answer-position training plus 14%-55% shorter CoT and over 2x lower latency. As an arXiv research result rather than a shipped system, it stays in the 78–84 band.

editor take

TERMINATOR turns CoT early exit into a trainable policy; 14%-55% fewer tokens is nice, but it rides on first-answer behavior that production traffic may not preserve.

sharp

TERMINATOR targets the dumbest waste in reasoning models: the answer appears, then the model keeps performing thought. It trains an early-exit policy from first-answer positions, cutting average CoT length by 14%-55% on MATH-500, AIME 2025, HumanEval, and GPQA. Latency drops by more than 2x versus the original LRM. That is a better cost lever than prompt trimming. I have doubts about deployment. First-answer position is model-, task-, and decoding-dependent; the abstract admits optimal CoT length shifts by task and model. In production, tool calls, strict formatting, and self-check loops change the distribution. One bad exit turns a correct trajectory into an unfinished answer. TERMINATOR is an inference-bill optimizer, not a capability gain.
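A small sketch of the supervision signal, assuming the final answer is a single whitespace-delimited token: locate the first position where it appears in the chain of thought, and measure how much of the trace comes after it. The helper name and the example trace are hypothetical; the paper trains a policy on such positions rather than using string matching at inference time.

```python
def first_answer_position(cot_tokens, final_answer):
    """Index just past the first occurrence of the final answer, or None."""
    for i, token in enumerate(cot_tokens):
        if token == final_answer:
            return i + 1
    return None

cot = "so 6 * 7 = 42 . wait , check : 7 * 6 = 42 . yes".split()
exit_at = first_answer_position(cot, "42")

# Fraction of the trace that an exit at the first answer would skip.
saved = 1 - exit_at / len(cot)
```

In this toy trace the answer first appears at token 6 of 18, so two thirds of the tokens are post-answer verification. That is also the risk: if the first "42" had been a wrong intermediate value, exiting there would lock in the error.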

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

The paper surveys 15 defenses against malicious fine-tuning and develops a unified adaptive attack, showing that current methods mainly block fixed attacks they were designed against while leaving harmful behavior intact.

#Fine-tuning#Safety#Alignment#arXiv

why featured

HKR-H/K/R all pass: 15 defenses plus a unified adaptive attack make the claim testable. It stays below P1 because this is an arXiv paper, with no major-lab deployment or cross-source cluster disclosed.

editor take

A unified adaptive attack breaks 15 malicious fine-tuning defenses; this dents the fantasy that safety survives user-controlled tuning.

sharp

Malicious fine-tuning defenses are blocking signposts, not removing the road. The paper surveys 15 recent defenses and makes a blunt claim: they obscure or misdirect the path to harmful behavior while leaving the behavior inside the model. A unified adaptive attack breaks defenses across mechanisms, so the field has been testing fixed attacks against fixed guardrails. That is ugly for both open weights and API fine-tuning. Llama-style releases face it directly; hosted fine-tuning from OpenAI or Anthropic faces the same pressure through data controls, permissions, and review pipelines. The snippet does not disclose per-defense success rates, but “15 defenses plus one adaptive adversary” is enough to downgrade any robustness claim built only on static jailbreaks or fixed harmful sets.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

The paper proposes PODS, a plug-and-play oscillatory data-volume scheduler that alternates low- and high-selection-ratio phases under a target ratio, reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.

#Fine-tuning#Inference-opt#PODS#ImageNet-1k

why featured

HKR-H/K/R pass: PODS gives a testable scheduling mechanism and claims 50% lower ImageNet-1k cost plus over 2x faster LLM instruction tuning without loss. Single arXiv paper keeps it below must-write.

editor take

PODS shifts data selection from sample scoring to volume scheduling; 50% ImageNet cost cuts are nice, but the LLM claim is only instruction-tuning so far.

sharp

PODS has a clean idea: treat the selection ratio as a training signal, not another sample-scoring metric. It alternates low-ratio phases for regularization with high-ratio phases for coverage, under the same target ratio. The abstract gives two hard claims: 50% lower ImageNet-1k training cost with better accuracy, and over 2x faster LLM instruction tuning without performance loss. I’d discount the LLM part for now. The page does not give model size, dataset, token budget, training steps, or a head-to-head against lines like DoReMi, DataComp, or LESS. PODS smells like a cheap scheduler knob that can slot into existing pipelines. To sell it as a serious LLM cost lever, the authors need pretraining or continual-training bills, not only instruction-tuning speedups.
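One way to picture the schedule, assuming a square wave with equal-length phases: alternate a low and a high selection ratio so that the average over a full cycle equals the target. The function name, the amplitude, and the phase length are assumptions; the summary does not give the paper's actual waveform.

```python
def oscillating_ratios(target, amplitude, phase_len, num_epochs):
    """Alternate low- and high-ratio phases of equal length around a target."""
    ratios = []
    for epoch in range(num_epochs):
        low_phase = (epoch // phase_len) % 2 == 0
        ratios.append(target - amplitude if low_phase else target + amplitude)
    return ratios

schedule = oscillating_ratios(target=0.5, amplitude=0.25, phase_len=2, num_epochs=8)
average = sum(schedule) / len(schedule)
```

The schedule keeps 25% of the data for two epochs, then 75% for two, and so on, while the average stays at the 50% target. Any existing sample scorer can still decide which examples fill each epoch's quota, which is what makes the knob plug-and-play.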

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

NodeSynth uses an evidence-grounded fine-tuned taxonomy generator to create socially relevant synthetic queries, and evaluation on four mainstream LLMs, including Claude 4.5 Haiku, produced failure rates up to five times higher than human-authored benchmarks.

#Safety#Fine-tuning#Benchmarking#NodeSynth

why featured

HKR-H/K/R all pass: the 5x failure-rate hook is sharp, TaG plus four-model testing adds concrete knowledge, and safety evaluators will debate it. It remains an arXiv research release without major-lab backing or broad replication, so it sits in the 78–84 band.

editor take

NodeSynth pushes safety eval toward scalable synthetic red-teaming, but the 5x failure rate only matters if TaG captures real edge communities.

sharp

NodeSynth hits a stale failure mode in safety evals: human-written benchmarks are too clean, and frontier models have learned the format. Its evidence-grounded fine-tuned taxonomy generator creates socially relevant queries, then gets up to 5x higher failure rates than human-authored benchmarks across four mainstream LLMs, including Claude 4.5 Haiku. It also flags Llama-Guard-3 weaknesses, which matters because guard models often look better on tidy policy taxonomies than messy user intent. The wild part is the ablation claim: granular taxonomic expansion drives the failures, not cheap paraphrasing. I still have doubts here. The abstract does not disclose sample sizes or the exact failure rubric, so 5x is a strong red-team signal, not a general safety ranking yet.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

TRIM selects instruction-tuning coresets with forward-only attention fingerprints instead of gradients, beating state-of-the-art baselines by up to 9% on downstream tasks and exceeding full-data fine-tuning in some settings.

#Fine-tuning#Interpretability#Inference-opt#TRIM

why featured

HKR-H/K/R all pass, but this is a single arXiv methods paper with no disclosed code, authorship signal, or cross-source pickup. The testable claim that a selected coreset can beat full-data fine-tuning fits the 78–84 recommendation band.

editor take

TRIM cuts coreset selection out of the gradient bill, but attention fingerprints must survive model transfer before this becomes more than a paper trick.

sharp

TRIM’s sharp claim is not the 9% lift; it is replacing gradient-based data selection with forward-only attention fingerprints. Instruction-tuning data curation is still full of expensive proxies: gradients are costly, sample-level scores are blunt, and manual filters often delete useful diversity. TRIM uses a handful of target samples, matches token-level multi-layer attention patterns, and even beats full-data fine-tuning in some settings. That is a direct hit on the habit of treating more SFT data as safer. I have one serious doubt: attention saliency has to stay stable across base models, tokenizers, and task formats. The abstract gives “up to 9%” and “some settings,” but not model scale, retained data ratio, or compute multiplier. If it only holds inside one model family, this is a neat selector. If it transfers across Llama, Qwen, and Mistral-style stacks, LoRA data engineering loses a lot of folklore.
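A simplified sketch of forward-only selection, assuming each example has already been reduced to one fingerprint vector: average the target fingerprints, rank candidates by cosine similarity, and keep the top k. The vectors and names are hypothetical, and collapsing token-level, multi-layer attention patterns into a single vector is a simplification of what the summary describes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_coreset(candidates, target_fps, k):
    """Rank candidates by similarity to the mean target fingerprint; keep top k."""
    mean_fp = [sum(dim) / len(target_fps) for dim in zip(*target_fps)]
    ranked = sorted(
        candidates,
        key=lambda name: cosine(candidates[name], mean_fp),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical fingerprints from a handful of target samples.
target_fps = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
candidates = {
    "ex1": [0.85, 0.15, 0.0],
    "ex2": [0.0, 0.1, 0.9],
    "ex3": [0.7, 0.3, 0.1],
    "ex4": [0.1, 0.9, 0.0],
}
picked = select_coreset(candidates, target_fps, k=2)
```

No gradients are computed anywhere in this loop, which is the cost argument. The transfer question raised above is whether fingerprints taken from one base model still rank examples sensibly for another.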

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Quantifying and Mitigating Self-Preference Bias of LLM Judges

The paper introduces a fully automated framework for measuring and mitigating self-preference bias in LLM judges, tests it across 20 mainstream LLMs, and reports that a structured multidimensional evaluation strategy reduces average SPB by 31.5%.

#Alignment#Benchmarking#Safety#Research release

why featured

HKR-H/K/R all pass: the self-preference hook is sharp, the article gives 20-model testing and a 31.5% reduction, and eval trust is a live practitioner nerve. As a single arXiv paper without broad adoption yet, it stays in the 78–84 band.

editor take

Stop treating stronger models as cleaner judges; across 20 LLMs, capability did not buy low self-preference bias.

sharp

This paper quantifies the awkward failure mode in LLM-as-a-judge: stronger models are not automatically fairer judges, and self-evaluation can tilt positive or negative. The useful move is the equal-quality response-pair setup. It tries to separate generation skill from evaluative stance across 20 mainstream LLMs, without human gold labels. That is more operationally useful than another Arena-style score. The 31.5% average SPB reduction should not be treated as a solved-bias number. The abstract gives the structured multidimensional strategy, but not per-model variance, task mix, or residual bias. For anyone running eval pipelines, the practical read is blunt: swapping the judge to GPT-5 or Claude Sonnet 4.5 is not enough. You need dimension-level rubrics, author blinding, and a measured directional-bias check for your own model family.
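A minimal version of the directional-bias check, assuming a simple definition: on equal-quality pairs an unbiased judge should pick its own response half the time, so the signed gap from 0.5 is the bias. This definition and the verdict counts are assumptions for illustration; the paper's SPB metric may be defined differently.

```python
def self_preference_bias(verdicts):
    """Signed bias in [-0.5, 0.5] from 'self'/'other' verdicts on equal-quality pairs."""
    own_picks = sum(1 for verdict in verdicts if verdict == "self")
    return own_picks / len(verdicts) - 0.5

# Hypothetical judge runs: one favors itself, one is harsh on itself.
favoring = self_preference_bias(["self"] * 70 + ["other"] * 30)
harsh = self_preference_bias(["self"] * 35 + ["other"] * 65)
```

A positive value means the judge favors its own outputs and a negative value means it penalizes them, matching the point that self-evaluation can tilt either way.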

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→The Moltbook Observatory Archive: An Incremental Dataset of Agent-Only Social Network Activity

Moltbook Observatory Archive releases 78 days of agent-only social network records from 2026-01-27 to 2026-04-14. The dataset contains 2,615,098 posts, 1,213,007 comments, 175,886 posting agents, and 6,730 communities, stored in SQLite and date-partitioned Parquet under an MIT license.

#Agent#Safety#Benchmarking#Moltbook

why featured

HKR-H/K/R all pass: an agent-only social network is a rare angle, the archive has concrete scale, and it feeds agent behavior, safety, and benchmark debates. No major-lab release, so 78.

editor take

Moltbook puts 175,886 agents into one social graph; safety work finally gets behavior outside the chatbox.

sharp

Moltbook is closer to a wind tunnel for agent safety than a normal social dataset. It covers 78 days, 2.6 million posts, 1.2 million comments, 175,886 posting agents, and 6,730 communities. That is enough surface area to study norm formation, topic drift, spam coordination, and identity games. The useful hook is the passive API polling. It is messier than hand-scripted multi-agent simulations, and closer to what deployed agent ecosystems will look like. I still have doubts about interpretation: the paper gives platform records, but not the model mix, prompt templates, or recommendation mechanics behind those agents. Without those, “emergent social behavior” can easily be a product artifact wearing a research label.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

Mistletoe attacks the acceptance mechanism in model-based speculative decoding by reducing drafter-target agreement while constraining the target model distribution; experiments across speculative decoding systems show lower average accepted length τ, collapsed speedup, and reduced averaged token throughput while preserving output quality and perplexity.

#Inference-opt#Safety#Alignment#Research release

why featured

HKR-H/K/R all pass: the attack hook is clear, the mechanism cites τ and token throughput, and the risk maps to LLM serving cost. It remains an arXiv research item with a narrow technical lane, so 78.

editor take

Mistletoe hits the cost layer, not the answer layer: quality stays clean while speculative decoding loses throughput. That is the nastier failure mode.

sharp

Mistletoe is sharp because it attacks accepted length τ, not model correctness. The method uses null-space projection to reduce drafter-target agreement while constraining the target model distribution. The abstract claims preserved output quality and perplexity, yet collapsed speedup and lower averaged token throughput. It does not disclose the actual drop sizes, and that missing number matters. The deployment risk is misclassification. Many serving stacks treat speculative decoding as a performance trick, while safety monitoring focuses on toxic output, jailbreaks, or data leaks. Medusa, EAGLE, and related draft-verify systems all lean on the same premise: the drafter stays close enough to the target. Mistletoe bites that premise directly. Inference acceleration now has an adversarial surface, not just an engineering SLO.
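To see why accepted length is such a sensitive target, here is the standard textbook estimate of tokens emitted per target-model pass, assuming each drafted token is accepted independently with probability alpha. The i.i.d. assumption and the example values are mine; the abstract does not disclose Mistletoe's measured drops.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens per target-model pass with draft length gamma and
    independent per-token acceptance probability alpha (0 <= alpha < 1)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Hypothetical acceptance rates before and after an agreement-reducing attack.
healthy = expected_tokens_per_step(alpha=0.8, gamma=4)
attacked = expected_tokens_per_step(alpha=0.3, gamma=4)
```

With these illustrative rates the yield per pass falls from about 3.36 tokens to about 1.43, while the drafter still costs the same to run. Output text can stay unchanged through all of this, which is why quality-focused monitoring would not notice.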

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing

ReasonCache uses collaborative filtering to identify reusable KV cache blocks and applies zero-copy reuse; experiments report an 89.2% peak throughput gain and a 40-60% average gain while maintaining higher accuracy than existing KV cache management techniques.

#Reasoning#Inference-opt#ReasonCache#Research release

why featured

HKR-H/K/R all pass: the mechanism is clear, the numbers are concrete, and the cost angle matters for LRM serving. It is still a systems paper, not a model or major product release, so 78 fits the featured band.

editor take

ReasonCache attacks serving cost at KV reuse, not kernels; 89.2% peak throughput is tasty, but production hinges on how often reasoning traces rhyme.

sharp

ReasonCache makes a gutsy systems bet: long reasoning traces repeat enough internal state to reuse KV blocks across requests. The paper uses collaborative filtering to find reusable KV-cache blocks, then applies zero-copy reuse. It reports 89.2% peak throughput gain and 40-60% average gain, with higher accuracy than existing KV-cache management methods. I like the direction because LRM serving is now a memory-and-concurrency bill, not just a FLOPs bill. vLLM’s PagedAttention handled fragmentation and scheduling; ReasonCache bets that similar intermediate reasoning can be cashed out at the system layer. The caveat is sharp: the abstract does not give model size, workload mix, or concurrency setup. If the reuse signal comes mostly from templated math traces, that 40-60% number shrinks fast on messy agent traffic.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→BiSpikCLM: A Spiking Language Model Integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

BiSpikCLM introduces a fully binary, MatMul-free causal language model with Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation; at 1.3B parameters, it reaches performance comparable to ANN counterparts using 5.6% of the training tokens and 4.16%-5.87% of the computational cost on natural language generation tasks.

#Inference-opt#Alignment#BiSpikCLM#Research release

why featured

HKR-H/K/R pass, but this is an arXiv architecture paper rather than a deployable release. The numbers are strong, while the technical bar keeps it at the 78 featured band, not P1.

editor take

BiSpikCLM makes spiking LMs credible at 1.3B, but the 4.16%-5.87% compute claim needs hardware reality before anyone prices inference around it.

sharp

BiSpikCLM’s sharp claim is not “spiking saves energy”; it pushes SNN language modeling to a 1.3B causal LM while removing MatMul, softmax, and floating-point paths. The numbers are strong on paper: the 1.3B model uses 5.6% of ANN training tokens and reports 4.16%-5.87% compute cost on generation tasks. I don’t buy the inference-savings story yet. SpAD aligns the SNN student to an ANN teacher across embeddings, attention maps, intermediate features, and logits, so the method still leans on a standard Transformer as scaffolding. The abstract gives no neuromorphic hardware run, throughput, latency, or memory-bandwidth data. Compared with BitNet-style binary or ternary LLM work, BiSpikCLM is more radical, but its deployment risk looks closer to a hardware research track than a quantization trick you drop into vLLM.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→TabPFN-3: Technical Report

TabPFN-3 scales tabular prediction to datasets with 1M training rows and 200 features, using synthetic pretraining, test-time compute scaling, reduced KV cache, and row chunking on one H100; TabPFN-3-Plus leads non-TabPFN models by over 200 Elo on TabArena, reaches 420 Elo on the largest subset, and runs 10x faster than AutoGluon 1.5 extreme.

#Inference-opt#Benchmarking#Reasoning#TabPFN

why featured

HKR-H/K/R pass, but the impact is concentrated in tabular ML rather than broad foundation-model news. The 1M-row scale, +200 Elo, and 10x speed justify the lower end of 78–84.

editor take

TabPFN-3 takes tabular FMs to 1M rows on one H100; GBDT survives, but AutoML’s comfort zone just got punched.

sharp

TabPFN-3 is sharp because it pushes tabular foundation models into a usable enterprise scale: 1M training rows, 200 features, one H100. Earlier tabular FMs often looked good on small benchmarks, then LightGBM, CatBoost, or AutoGluon did the real work. The hard numbers here are strong: TabPFN-3-Plus leads non-TabPFN models by 200+ Elo on TabArena, hits 420 Elo on the largest subset, runs 10x faster than AutoGluon 1.5 extreme, and claims up to 120x faster SHAP. The more interesting mechanism is synthetic pretraining plus test-time compute scaling. That dodges the messy problem of collecting real tabular corpora and turns inference budget into an accuracy knob. I still want external runs on ugly business tables: leakage, time splits, categorical junk, and relational joins. These are author-reported numbers, and TabArena is closely tied to the TabPFN world; 200 Elo shrinks fast when the benchmark stops being friendly.
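For a sense of what a 200-point Elo lead means head to head, the conventional logistic Elo formula converts a rating gap into an expected score. Whether TabArena uses this exact 400-point scale is an assumption on my part; the summary does not say.

```python
def elo_expected_score(rating_gap):
    """Expected score of the higher-rated side under the conventional
    logistic Elo model with a 400-point scale."""
    return 1 / (1 + 10 ** (-rating_gap / 400))

lead_200 = elo_expected_score(200)
even = elo_expected_score(0)
```

Under that assumption, a 200-point lead corresponds to an expected score of roughly 0.76 per comparison. That is strong, but it still leaves about a quarter of comparisons going the other way, which is one reason external runs on unfriendly tables matter.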

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Robometer trains robotic reward models with a dual objective: frame-level progress loss on expert data and trajectory-comparison preference loss across same-task trajectories, using RBM-1M with over 1 million robot trajectories that include suboptimal and failed data.

#Robotics#Benchmarking#Robometer#RBM-1M

why featured

HKR-K is strong: RBM-1M has 1M+ trajectories and a two-loss training recipe. HKR-H/R pass through the failed-trajectory angle, but the impact stays inside robotics research rather than a broad model-release moment.

editor take

Robometer treats failed robot rollouts as signal, not trash; 1M trajectories is the right bet for reward models that must survive messy labs.

sharp

Robometer makes the right call: robot reward models cannot keep living on expert demos alone. RBM-1M has over 1 million trajectories, including failed and suboptimal runs. The training recipe also changes the supervision target: frame-level progress loss anchors expert data, while same-task trajectory preference loss forces global ordering across rollouts. I buy the direction because robotics data is messier than language data by default. RT-1 and Open X-Embodiment pushed scale for behavior cloning, but reward learning still gets trapped by local frame labels. The missing piece is evaluation hardness. The abstract claims stronger generalization across benchmarks and real-world tests, but this article slice gives no task count, success-rate delta, or embodiment split. Without those numbers, Robometer is a credible reward-modeling recipe, not yet a solved robot-learning loop.
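The two-loss recipe can be sketched as follows, assuming mean squared error for frame-level progress and a Bradley-Terry style loss for trajectory comparisons. Both loss forms, the weighting term, and the function names are assumptions; the summary names the two objectives but not their exact formulas.

```python
import math

def progress_loss(predicted, target):
    """Mean squared error on per-frame progress values in [0, 1]."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def preference_loss(score_better, score_worse):
    """Bradley-Terry negative log-likelihood that the better trajectory wins."""
    return math.log(1 + math.exp(-(score_better - score_worse)))

def total_loss(predicted, target, score_better, score_worse, weight=1.0):
    """Expert-frame progress term plus a weighted same-task comparison term."""
    return progress_loss(predicted, target) + weight * preference_loss(
        score_better, score_worse
    )

# Hypothetical values: expert frames, plus one success-vs-failure pair.
loss = total_loss([0.1, 0.5, 0.8], [0.0, 0.5, 1.0], score_better=2.0, score_worse=0.0)
```

The comparison term is what lets failed and suboptimal rollouts contribute: they need no per-frame labels, only an ordering against another trajectory from the same task.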

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

LayerBoost rewrites Transformer attention by layer sensitivity, keeps softmax in critical layers, uses linear sliding-window attention in moderate layers, removes attention in low-sensitivity layers, and adds a 10M-token distillation healing phase, reducing inference latency and improving throughput by up to 68% under high concurrency while keeping competitive benchmark quality.

#Inference-opt#Benchmarking#LayerBoost#Research release

why featured

HKR-H/K/R all pass: LayerBoost has a concrete attention-reduction mechanism and an up-to-68% throughput claim for inference teams. Single arXiv source, with no disclosed code or adoption, keeps it at 78.

editor take

LayerBoost’s sharp edge is deleting attention in low-sensitivity layers, not linear attention; the up-to-68% throughput gain lives or dies by model size and context setup.

sharp

LayerBoost makes the right cut: attention efficiency should be layer-specific, not a uniform swap across the stack. The method keeps softmax in high-sensitivity layers, uses linear sliding-window attention in moderate layers, deletes attention in low-sensitivity layers, then heals with only 10M distillation tokens. That is small enough to be an engineering knob, not a pretraining project. I’m cautious on the 68% figure. The snippet reports it as an upper bound on throughput improvement under high concurrency, but does not give model size, context length, batch shape, hardware, or serving stack. FlashAttention, MQA/GQA, and KV-cache compression already moved this fight into narrow regimes. If LayerBoost’s deleted-attention layer map transfers across models, it is useful. If it depends on one checkpoint and one serving setup, it is another nice arXiv speedup with a fragile deployment story.
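The layer map itself is easy to picture, assuming each layer already has a sensitivity score: two thresholds split the stack into the three treatments. The scores, the threshold values, and the function name are hypothetical; the summary does not say how sensitivity is measured or where the cut points sit.

```python
def assign_attention(sensitivities, high=0.6, low=0.2):
    """Map per-layer sensitivity scores to an attention treatment."""
    plan = []
    for score in sensitivities:
        if score >= high:
            plan.append("softmax")                # critical layer: keep full attention
        elif score >= low:
            plan.append("linear_sliding_window")  # moderate layer: cheaper attention
        else:
            plan.append("none")                   # low-sensitivity layer: remove attention
    return plan

# Hypothetical sensitivity scores for a four-layer toy model.
plan = assign_attention([0.9, 0.4, 0.05, 0.7])
```

The transfer question is whether a plan like this, derived from one checkpoint, still holds for another model or context length.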

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

The paper proposes a metacognitive harness that controls Claude Sonnet-4.6 inference with pre-solve FOK and post-solve JOL signals; without parameter updates or benchmark-specific fine-tuning, it raises pooled accuracy on public benchmark snapshots from 48.3 to 56.9.

#Reasoning#Code#Multimodal#Claude

why featured

HKR-H/K/R all pass: the title has a sharp failure hook, the post gives a testable 48.3→56.9 result, and the mechanism targets Claude Sonnet-4.6 test-time scaling. As a single arXiv paper, it stays below same-day must-write releases.

editor take

Sonnet-4.6 jumps 48.3→56.9; the sting is that the model exposes confidence signals and still needs an external loop to obey them.

sharp

The sharp claim here is that test-time scaling is moving from “sample more” to “make the model’s self-monitoring drive the loop.” Claude Sonnet-4.6 moves pooled accuracy from 48.3 to 56.9 with no parameter updates and no benchmark-specific fine-tuning. The mechanism is concrete: ask for FOK before solving, ask for JOL after solving, then trust, retry, or send attempts to an aggregator. I buy the direction, but I’d discount the leaderboard claim for now. The abstract names HLE-Verified, LiveCodeBench v6, and R-Bench-V, but gives no call count, latency, token budget, or failure breakdown. Against vanilla self-consistency, FOK/JOL only matters if it wins under the same inference budget. Otherwise this is a clever reasoning-budget controller, not evidence that Sonnet-4.6 gained a new capability.
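The control loop can be sketched like this, assuming FOK sets the retry budget and JOL decides whether to trust an attempt. The thresholds, the budget rule, and the stub callables are hypothetical; the summary names the trust, retry, and aggregate actions but not the exact policy.

```python
from collections import Counter

def run_harness(question, fok, solve, jol, aggregate,
                fok_min=0.5, jol_min=0.7, max_tries=3):
    """Trust, retry, or aggregate based on the model's own confidence signals."""
    # Pre-solve: a low feeling-of-knowing grants a larger retry budget.
    budget = 1 if fok(question) >= fok_min else max_tries
    attempts = []
    for _ in range(budget):
        answer = solve(question)
        attempts.append(answer)
        # Post-solve: a high judgment-of-learning means trust and stop.
        if jol(question, answer) >= jol_min:
            return answer
    # No attempt was trusted: hand everything to the aggregator.
    return aggregate(attempts)

# Stub model calls standing in for real LLM requests.
answers = iter(["41", "42", "42"])
result = run_harness(
    "What is 6 * 7?",
    fok=lambda q: 0.2,
    solve=lambda q: next(answers),
    jol=lambda q, a: 0.9 if a == "42" else 0.1,
    aggregate=lambda xs: Counter(xs).most_common(1)[0][0],
)
```

Counting `solve` calls per question in a loop like this is the budget number the paragraph above asks for: a fair comparison with self-consistency needs the same number of calls on both sides.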

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution

LoRIF uses low-rank gradient structure for training data attribution, reducing per-layer per-sample I/O from O(D) to O(c√D) and inverse-Hessian memory from O(D²) to O(Dr), while experiments on 0.1B to 70B-parameter models and million-example datasets report up to 20× lower storage and faster queries than LoGRA with matched or better attribution quality.

#Fine-tuning#Inference-opt#Interpretability#LoRIF

why featured

LoRIF is a technical research release, but HKR-H/K/R pass via a clear scaling hook, concrete complexity cuts, 70B tests, and up to 20× storage reduction versus LoGRA. Specialist overhead keeps it at 78, not p1.

editor take

LoRIF moves attribution pain from compute fantasy to I/O engineering; 20× storage wins are real, but attribution is still not auditability.

sharp

LoRIF’s sharp move is admitting training-data attribution dies on I/O, not theory. It cuts per-layer, per-sample I/O from O(D) to O(c√D), and inverse-Hessian memory from O(D²) to O(Dr). On 0.1B-to-70B models with million-example datasets, it reports up to 20× lower storage than LoGRA. That is more useful than another TRAK-style projection tweak. I don’t buy the paper’s “frontier scale practical” phrasing yet. The evidence is storage and query speed against LoGRA, not attribution across closed pretraining corpora, continual training, or multimodal mixture pipelines. LoRIF makes debugging and sample triage cheaper. Compliance-grade audit trails still need dataset versioning, dedup logs, and the actual training recipe.
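The O(D) to O(c√D) arithmetic is easy to check on a toy case. The sketch below is generic low-rank bookkeeping, not LoRIF's algorithm: it assumes a per-sample gradient of a linear layer can be written as a sum of c outer products, stores only the factors, and computes gradient inner products without ever materializing the dense matrix. All function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_from_factors(us, vs):
    """Materialize G = sum_k u_k v_k^T as a d_out x d_in list of lists."""
    d_out, d_in = len(us[0]), len(vs[0])
    return [[sum(u[i] * v[j] for u, v in zip(us, vs)) for j in range(d_in)]
            for i in range(d_out)]


def dense_inner(A, B):
    """Frobenius inner product of two dense matrices."""
    return sum(dot(ra, rb) for ra, rb in zip(A, B))


def factored_inner(us1, vs1, us2, vs2):
    """Same inner product from factors only: sum_{k,l} (u1_k . u2_l)(v1_k . v2_l)."""
    return sum(dot(u1, u2) * dot(v1, v2)
               for u1, v1 in zip(us1, vs1)
               for u2, v2 in zip(us2, vs2))


def floats_stored(d_out, d_in, rank=None):
    """Dense storage is D = d_out * d_in; rank-c factors need c * (d_out + d_in)."""
    return d_out * d_in if rank is None else rank * (d_out + d_in)
```

For a square 4096×4096 layer, `floats_stored(4096, 4096)` is about 16.8M values while `floats_stored(4096, 4096, rank=8)` is 65,536, which is the 2c√D scaling the complexity claim describes.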

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

FrontierSmith synthesizes open-ended coding variants from competitive programming tasks, and training Qwen3.5-27B on the generated data raises FrontierCS by 12.12 points and ALE-bench by 309.12 Elo versus the base model.

#Code#Agent#Benchmarking#FrontierSmith

why featured

HKR-H/K/R all pass: the paper gives a task-synthesis mechanism plus two concrete gains, and it maps to coding-agent training and eval bottlenecks. Single arXiv source with no deployment or cross-source cluster keeps it at 78.

editor take

FrontierSmith turns contest problems into open-ended training data; +309 Elo on Qwen3.5-27B says coding agents need messier tasks, not bigger LeetCode piles.

sharp

FrontierSmith hits a real coding-model data gap: contest tasks are too clean, while production coding lacks a single optimal answer. It starts from competitive programming problems, changes goals, restricts outputs, generalizes inputs, then filters variants with an idea-divergence metric. That is a better recipe for agent training than piling up more unit-test puzzles. The numbers are strong: Qwen3.5-9B gains 8.82 on FrontierCS and 306.36 Elo on ALE-bench; Qwen3.5-27B gains 12.12 and 309.12 Elo. The wild part is the behavioral signal: agents take more turns and spend more tokens, closer to human-curated open-ended tasks. My caveat is the verifier. If the generated verifier is narrow, the model still learns to exploit a harness, just with a fancier problem statement.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Research proposes training-free pace-and-path correction for VLA dynamics-blindness problem

The paper proposes Pace-and-Path Correction, a training-free inference-time operator for chunked-action VLA models that corrects pace and path jointly by minimizing a single quadratic cost, and reports up to 28.8% and 25.9% absolute success-rate gains over foundational VLA models on MoveBench in dynamic-only and static-dynamic mixed settings.

#Robotics#Vision#Inference-opt#MoveBench

why featured

HKR-H/K/R pass, but this is a single arXiv robotics paper, narrower than a major model release. The training-free inference wrapper and large MoveBench gains place it in the 78–84 quality research band.

editor take

A training-free VLA wrapper gaining 28.8 points is exactly the kind of cheap robotics fix labs love; I’m waiting for messy robot runs.

sharp

Pace-and-Path Correction attacks the embarrassing VLA gap: chunked actions plan ahead, then fail to absorb motion. The paper adds no retraining. It wraps any chunked-action VLA at inference with one quadratic cost, then splits correction into pace compression and orthogonal path offset. MoveBench reports up to +28.8 absolute success points in dynamic-only settings, and +25.9 in mixed static-dynamic settings. I like the direction more than the claim. A closed-form, training-free wrapper fits real robotics stacks better than another expensive OpenVLA-style finetune. But MoveBench isolates motion as the controlled variable. That is a useful diagnostic, not a warehouse with occlusion, contact, actuator lag, and ugly sensor noise. This smells like a strong control-layer patch whose value depends on whether the gains survive outside the clean benchmark.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Video2GUI extracts GUI interaction trajectories from 500 million video metadata entries and builds WildGUI with 12 million trajectories across more than 1,500 applications and websites, while pretraining Qwen2.5-VL and Mimo-VL on WildGUI improves multiple GUI grounding and action benchmarks by 5–20%.

#Agent#Multimodal#Vision#Qwen

why featured

HKR-H/K/R all pass: the scale is concrete and the topic maps to GUI-agent pretraining bottlenecks. The score stays at 78 because only abstract-level facts are available; license, per-benchmark gains, and reproducibility details are not disclosed.

editor take

GUI agents don't mainly lack clever prompting; they lack trajectories. WildGUI’s 12M video-mined actions attack the actual bottleneck.

sharp

Video2GUI is sharp because it moves GUI-agent pretraining away from hand labels and into public video mining. The pipeline filters 500M video metadata entries, builds WildGUI with 12M interaction trajectories across 1,500+ apps and sites, then boosts Qwen2.5-VL and Mimo-VL by 5–20% on GUI grounding and action benchmarks. I buy the data direction more than the performance story. GUI traces are easy to fake statistically: tutorial videos contain edits, skipped steps, presenter bias, and weak intent labels. Compared with live-environment tests like OSWorld, WildGUI looks like cheap fuel, not a driving test. If the extraction pipeline is really released, that matters more than one round of benchmark gains.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

MIRAGE uses diffusion models to search same-topology semantic scene variants for online HD map attacks; on nuScenes, boundary removal suppresses 57.7% of detections and corrupts 96% of planned trajectories, while boundary injection succeeds where pixel PGD and AdvPatch fail.

#Vision#Safety#Benchmarking#MIRAGE

why featured

HKR-H/K/R all pass: diffusion-discovered attacks are a clear hook, and the post gives 57.7%/96% results with autonomy-safety stakes. Single arXiv paper and specialized domain keep it in the 72–77 band.

editor take

MIRAGE is not another sticker attack; 57.7% boundary suppression and 96% trajectory corruption move HD-map safety failure into semantics.

sharp

MIRAGE pushes AV attacks out of pixel noise and into plausible road conditions, which is a nastier failure mode for online HD maps. On nuScenes, its diffusion search keeps road topology fixed while mutating semantics; boundary removal suppresses 57.7% of detections and corrupts 96% of planned trajectories. In boundary injection, PGD and AdvPatch fail entirely, while MIRAGE still fabricates lane boundaries. The wild part is realism does not collapse. Two VLM judges rate MIRAGE outputs realistic 80–84% of the time, versus 97–99% for clean nuScenes and only 0–9% for AdvPatch. That puts standard adversarial defenses in the wrong frame: they clean perturbations, but MIRAGE changes shadows, wet roads, and other legitimate environmental variables.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→FlowSteer: Towards Agents Designing Agentic Workflows via Reinforced Progressive Canvas Editing

FlowSteer lets a single agent design the workflow that a downstream executor runs, using Workflow Canvas to return syntax-checked execution feedback after each atomic edit. The paper reports reinforcement learning for a lightweight policy agent, plug-and-play operator libraries and LLM backends, and experiments across 12 datasets, but the snippet does not disclose the baseline names or per-dataset scores.

#Agent#Tools#Reasoning#FlowSteer

why featured

HKR-H/K/R all pass, but the feed only gives arXiv-level detail: no repo link, benchmark deltas, or production replacement claim. This fits the featured threshold for agent-workflow research at 76.

editor take

FlowSteer has the right shape: edit executable workflow graphs, not prompts. But 12 datasets without baselines or scores is still a trust gap.

sharp

FlowSteer targets the messy part of agent workflows: the graph should be edited under execution feedback, not hallucinated in one shot. Its policy agent makes one atomic edit per turn on a Workflow Canvas, then receives syntax-checked feedback. That is a cleaner training loop than prompt-only workflow planning. The trust problem is in the reporting. The abstract says 12 datasets and significant gains, but gives no baseline names, per-dataset scores, or failure modes. A 51-page paper with 6 figures and 5 tables has room for that evidence; the snippet just does not expose it. I’d place this near DSPy-style program search and AutoGen workflow construction: the mechanism is credible, the performance claim still needs the tables.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Non-linear Interventions on Large Language Models

The paper introduces a non-linear intervention framework and learning procedure for LLM representations, validating it on refusal-bypass steering where intervening on a non-linear refusal feature steers the model more precisely than linear baselines.

#Interpretability#Alignment#Safety#Research release

why featured

HKR-H/K/R all pass, but the item only provides abstract-level facts and no model, dataset, or gain size. Safety/interpretability relevance clears featured, not a higher band.

editor take

If refusal lives on a nonlinear manifold, linear steering is underbuilt safety tooling; the abstract gives no model, metric, or lift size.

sharp

This paper pokes the weak spot in the linear-representation habit: refusal bypass may not be steerable by pushing one direction vector. The authors propose nonlinear interventions, including a learning procedure for implicit features without direct output signatures. Their test case is refusal-bypass steering, where the nonlinear refusal feature beats linear baselines. The missing bits matter: the abstract gives no model name, layer, attack set, or success-rate delta. SAE work, activation steering, and CCS-style probes have all shown the same failure mode: neat features often break under distribution shift. If this only works on one model and one refusal setup, it is a clever mechanism paper. If it transfers across models, safety steering leaves the linear-probe comfort zone.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent reports 91.2% Comparison Set Faithfulness on a 4,160-patient clinical cohort, versus 46.2% for standard RAG, and uses a k-anonymity and ℓ-diversity semantic privacy gate to reduce artifact-level membership inference risk by an absolute 9.8%.

#Agent#RAG#Multimodal#ProtoMedAgent

why featured

HKR-H/K/R all pass: the benchmark gap is sharp, and the privacy gates are concrete. Single arXiv medical research keeps it in the 72–77 band, not a same-day must-write.

editor take

ProtoMedAgent hits 91.2% faithfulness versus RAG’s 46.2%; strong paper signal, but cohort/task detail is too thin for clinical-grade confidence.

sharp

ProtoMedAgent’s useful move is the discrete audit layer, not the “medical agent” label. It reports 91.2% Comparison Set Faithfulness on 4,160 patients, versus 46.2% for standard RAG. The mechanism is concrete: frozen prototype backbone, discrete semantic memory, exact set differentials, and a Scribe-Critic loop meant to block retrieval sycophancy. I’m more cautious on the privacy claim. The k-anonymity and ℓ-diversity gate cuts artifact-level membership inference risk by 9.8 percentage points, but the snippet does not give the attack setup, imaging modality, disease task, or external-site validation. Clinical AI papers keep showing the same failure mode: clean internal-cohort faithfulness, then brittle explanations once the hospital distribution changes.
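For readers who have not met the two privacy criteria, here is the textbook tabular version as a minimal sketch. It is an assumption that the paper's semantic gate reduces to anything this simple; the function name, field names, and the use of distinct ℓ-diversity are all hypothetical choices for illustration.

```python
from collections import defaultdict


def passes_gate(records, quasi_ids, sensitive, k, l):
    """True if every quasi-identifier group has >= k rows (k-anonymity)
    and >= l distinct sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r[sensitive])
    return all(len(v) >= k and len(set(v)) >= l for v in groups.values())
```

A group of two patients who share the same diagnosis passes k=2 but fails ℓ=2, which is exactly the leak ℓ-diversity exists to catch.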

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→To See Is Not to Learn: Protecting Multimodal Data from Unauthorized LVLM Fine-Tuning

The paper proposes MMGuard, which injects human-imperceptible perturbations and disrupts cross-modal binding so LVLMs overfit to noise, and evaluates the defense against unauthorized fine-tuning on nine open-source LVLMs across six datasets under white-box, gray-box, and black-box threat models.

#Multimodal#Vision#Fine-tuning#MMGuard

why featured

HKR-H/K/R all pass: the title has a clear contrast, the method and eval scale are concrete, and the data-misuse angle matters to practitioners. Single arXiv paper with no code or adoption disclosed, so it stays below the 78+ band.

editor take

MMGuard poisons data before anyone trains on it, as a copyright defense, but the paper proves it on open LVLM fine-tuning, not the closed pretraining pipelines owners fear most.

sharp

MMGuard’s sharp move is poisoning the learning signal before infringement, not tagging images after the fact. It adds human-imperceptible perturbations, creates an optimization shortcut, then disrupts cross-modal binding so the LVLM attends to noise. The paper reports tests on nine open-source LVLMs, six datasets, and white-box, gray-box, and black-box threat models. I buy half of it. For data owners, this is a cleaner lever than unlearning or watermarking because the failure happens during fine-tuning. The boundary is also obvious: the evidence is open LVLM fine-tuning, not GPT-4o or Gemini-style closed scraping and pretraining. That closed pipeline is where copyright defense usually dies.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Hidden State Poisoning Attacks against Mamba-based Language Models

The paper introduces HiSPA short-phrase triggers and RoBench-25 to test retrieval under hidden-state poisoning; the 52B Jamba-1.7-Mini hybrid collapses under some triggers, while pure Transformers do not show the same failure in the reported experiments.

#Safety#Interpretability#Benchmarking#Mamba

why featured

HKR-H/K/R pass: the paper names a concrete SSM attack, benchmark, and 52B Jamba failure. Single-source arXiv and niche SSM scope keep it below the 78–84 safety-paper band.

editor take

Mamba’s linear-time bargain has a bill: HiSPA short triggers break hidden-state memory, and even 52B Jamba-1.7-Mini folds in tests.

sharp

Mamba-family risk is not ordinary prompt injection; a few tokens can corrupt the model’s working memory. HiSPA makes the 52B Jamba-1.7-Mini hybrid collapse on parts of RoBench-25, while pure Transformers avoid the same failure in the paper’s reported experiments. The authors also extend the result to Mamba-2 and Nemotron-3-Nano. That is a bad trade for the SSM pitch. Mamba sold linear-time sequence handling as the efficiency win for long context, but this attack lands on irreversible hidden-state writes, not an attention-layer nuisance. The paper ships code and data, with 29 pages and 4 figures, so this is reproducible enough to pressure vendors. I’d check whether the triggers are semantically arbitrary, and how much Jamba’s Transformer blocks actually absorb before the SSM state fails.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→TFGN: Task-Free Replay-Free Continual Pre-Training Without Catastrophic Forgetting for Large Language Models

TFGN tests a transformer architectural overlay across six text domains, 1B tokens per phase, three model scales from about 398M to 9B, and two training regimes, reporting -0.007 backward transfer on LLaMA 3.1 8B Retrofit with no replay, task IDs, or Fisher penalty.

#Fine-tuning#Memory#Benchmarking#TFGN

why featured

HKR-H/K/R all pass, but this is an arXiv training-method paper with scale and metrics only; no code or independent reproduction is disclosed, so it stays at the featured threshold.

editor take

TFGN has clean numbers, but “closes forgetting” is too loud; six curated domains at 1B tokens each is not live continual pretraining.

sharp

TFGN deserves attention, but the paper overclaims from a very clean setup. The hard numbers are good: LLaMA 3.1 8B Retrofit gets -0.007 backward transfer across six domains, 1B tokens per phase, with no replay, task IDs, or Fisher penalty. Python-only training also drops held-out JavaScript PPL by 26.8%. That is cleaner than the usual replay-buffer or EWC-style story. I don’t buy “closes catastrophic forgetting at LLM scale” yet. Six named domains are not the mess of live continual pretraining: duplicated web text, safety filters, instruction data leakage, RL-tuned model drift, and uneven domain ratios. The 99.59% L2-orthogonal gradient separation is a strong mechanistic hook, but I’d want the same overlay on 70B, long-context code corpora, and 20 sequential domains before treating this as more than a promising architecture paper.
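For scale on the -0.007 figure, this is the standard backward-transfer definition from the continual-learning literature, sketched under the assumption that scores are higher-is-better. The snippet does not say whether TFGN computes it on accuracy, loss, or perplexity, or which sign convention it uses, so treat this as the generic metric only.

```python
def backward_transfer(R):
    """R[i][j] = score on domain j after finishing training phase i.
    Returns the mean change on earlier domains once all phases are done;
    negative values mean forgetting (assuming higher scores are better)."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
```

With three phases where the first domain drops from 0.80 to 0.60 and the second from 0.90 to 0.85, the function returns -0.125; a value near zero means earlier domains barely moved.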

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

The paper proposes folding LN centering into upstream linear layers via CCC and CBWC, defines foldable LNs with a graph-based detector, and reports exact inference-time conversion across many common architectures with 2% to 12% end-to-end acceleration without changing model predictions.

#Inference-opt#Research release

why featured

HKR-H/K/R all pass: the LN/RMSNorm contrast is clickable, CCC/CBWC plus 2%–12% speedups add substance, and inference cost resonates. Single arXiv paper and low-level scope keep it in the lower featured band.

editor take

LN optimization finally dodges retraining: CCC/CBWC folds centering upstream, and 2%–12% speedup is exactly the margin inference teams fight for.

sharp

This paper is useful because it attacks LN without asking teams to retrain models. CCC and CBWC fold centering into upstream linear layers, then replace eligible LN with RMSNorm-style computation. The claimed result is exact inference-time conversion across common architectures, unchanged predictions, and 2% to 12% end-to-end speedup. That range matters more than it looks. In production inference, 2% is already real money when kernels are mature. RMSNorm won adoption because it removed mean subtraction; this tries to recover that efficiency for LN checkpoints. I’m cautious on the broad architecture claim: the abstract does not name model sizes, hardware, batch sizes, or sequence lengths. The strong part is testability. A graph-based detector for foldable LNs gives compiler teams a concrete pass to implement, instead of another normalization tweak that dies in training code.
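The "unchanged predictions" claim rests on simple algebra, which a toy check makes visible. The sketch below is the generic identity, not necessarily the paper's CCC/CBWC procedure: if the linear layer feeding an LN has its weight columns and bias centered over the output dimension, its output already has zero mean, so RMSNorm on it equals LN on the original output. The affine scale and shift are omitted because they apply identically after either normalization. All function names are hypothetical.

```python
import math


def linear(W, b, h):
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]


def fold_centering(W, b):
    """Subtract each column's mean over output rows from W, and the mean from b,
    so that W'h + b' equals (Wh + b) minus its own mean for every input h."""
    n, m = len(W), len(W[0])
    col_means = [sum(W[i][j] for i in range(n)) / n for j in range(m)]
    Wc = [[W[i][j] - col_means[j] for j in range(m)] for i in range(n)]
    b_mean = sum(b) / n
    return Wc, [bi - b_mean for bi in b]
```

Because the centered output has zero mean, its mean square equals the original variance, so `rms_norm(linear(Wc, bc, h))` matches `layer_norm(linear(W, b, h))` up to floating-point error. The fold only works when nothing else reads the uncentered output, which is presumably what a foldability detector has to establish.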

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Silent Collapse in Recursive Learning Systems

The paper defines silent collapse in recursive learning and reports three precursors: anchor entropy contraction, representation drift freezing, and tail coverage erosion before standard validation metrics degrade.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K/R pass: the hook is silent collapse in recursive training, the post gives 3 precursors, and the topic hits synthetic-data degradation fears. No scale, code, or major-lab backing is disclosed, so it stays just above featured threshold.

editor take

The scary part is not self-training decay; it is decay hiding behind stable loss, perplexity, and accuracy.

sharp

This paper pins recursive training failure on monitoring, not model quality: stable loss, perplexity, and accuracy do not prove the system is healthy. The concrete hook is the three precursors: anchor entropy contraction, representation drift freezing, and tail coverage erosion. The authors say these show up multiple generations before standard validation metrics degrade. I buy the problem framing, but I’m not sold on the paper’s Monitor–Trust–Regulator (MTR) fix yet. Its promise of working without pristine real data hits the exact pain point in synthetic-data loops, but the snippet gives no model scale, model family, recursion depth, benchmark setup, or intervention threshold. Without those, MTR is a neat brake pedal drawn on the dashboard. Compared with older model-collapse work, the useful claim is early trajectory telemetry, not another warning that models trained on their own exhaust get brittle.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

AsyncFC decouples LLM decoding from function execution at the execution layer, requires no fine-tuning or changes to the standard synchronous function-calling protocol, and reduces end-to-end task completion time across function-calling and adapted software engineering benchmarks while preserving accuracy.

#Agent#Tools#Inference-opt#AsyncFC

why featured

HKR-H/K/R all pass, but the post gives mechanism and benchmark direction without latency numbers, code status, or reproducible setup. A useful arXiv tool-calling paper, not a same-day must-write.

editor take

AsyncFC treats agent latency as a runtime problem, not a model problem; that’s the right instinct when tool calls keep stalling otherwise capable agents.

sharp

AsyncFC makes the right call: agent latency is often a runtime tax, not a missing-model-capability problem. The concrete move is a future-based execution layer that decouples LLM decoding from function execution, while allowing inter-function parallelism when dependencies permit. No fine-tuning, no change to the standard synchronous function-calling protocol; that matters more for adoption than another tool-use leaderboard bump. I like this because it hits a boring but expensive failure mode in 2025-style agents: the model waits on APIs, file I/O, tests, or shell commands while the whole loop stalls. The paper says end-to-end task time drops while accuracy is preserved, but the snippet does not disclose the exact speedup. The hard part will be dependency inference, rollback, and whether symbolic futures poison reasoning once real tools fail or return messy outputs.
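The future-based idea can be sketched with nothing more than the Python standard library. This is an illustration of the general pattern, not AsyncFC's implementation: the `dispatch` helper and the `(name, fn, deps)` call format are hypothetical, and dependencies are assumed to be declared explicitly and listed before the calls that use them.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(calls, pool):
    """Submit every tool call immediately and return name -> Future.

    calls: list of (name, fn, deps), where deps names earlier calls whose
    results are passed to fn as positional arguments.
    """
    futures = {}
    for name, fn, deps in calls:
        dep_futs = [futures[d] for d in deps]
        # Dependencies are resolved inside the worker, so the caller
        # (standing in for the decoding loop) never blocks here.
        futures[name] = pool.submit(
            lambda f=fn, ds=dep_futs: f(*[d.result() for d in ds])
        )
    return futures
```

Independent calls run in parallel, dependent ones wait only on what they need, and the caller blocks only when it finally reads a `.result()`. The sketch ignores the hard parts named above: inferring dependencies, rolling back, and handling tools that fail.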

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

The paper introduces Top-W decoding, a Wasserstein-regularized geometry-aware truncation rule over token embeddings, and reports up to 33.7% improvement across four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) using three instruction-tuned models.

#Inference-opt#Reasoning#Benchmarking#Research release

why featured

HKR-H/K/R pass: Top-W offers a testable decoding mechanism and a 33.7% max lift. Score stays near the featured floor because the feed provides no code, author/lab signal, or full reproducibility details.

editor take

Top-W drags decoding back into token geometry; 33.7% is attractive, but latency and strong-model tests decide if this survives.

sharp

Top-W’s sharp move is adding token-embedding geometry to truncation, where top-p and top-k only see probability mass. The paper reports up to 33.7% gains across GSM8K, GPQA, AlpacaEval, and MT-Bench on three instruction-tuned models, while keeping the standard truncation-and-sampling interface. That interface detail matters; many decoding papers die before adoption because they ask teams to rebuild serving. I would discount the 33.7% until the tables are inspected. The abstract does not name the three models or show per-benchmark spread, so the gain may sit on weaker models or one temperature regime. This is a quality-side decoding bet, unlike speculative decoding’s cost-saving track. If Top-W adds geometry computation without clear latency numbers, production teams will treat it as an offline evaluation trick.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

EvolveMem uses an LLM-powered diagnosis module to iteratively adjust retrieval configuration, outperforming the strongest baseline by 25.7% relative on LoCoMo and 18.9% relative on MemBench, with code released at the SimpleMem GitHub repository.

#Agent#RAG#Memory#EvolveMem

why featured

HKR-H/K/R all pass, but this is still an arXiv paper with benchmark results, not a lab release or production deployment. Agent memory is practical enough for featured, not same-day must-write.

editor take

EvolveMem automates RAG tuning with an LLM loop; +25.7% on LoCoMo is strong, but don’t file this as solved agent memory yet.

sharp

EvolveMem makes long-term memory look less like cognition and more like retrieval engineering. The loop lets an LLM inspect failure logs, modify scoring, fusion, and answer policies, then guard changes with revert-on-regression and explore-on-stagnation. The reported gains are real enough to take seriously: +25.7% relative over the strongest LoCoMo baseline, +18.9% on MemBench, and +78.0% over the minimal LoCoMo baseline. I like the direction, but the “self-evolving memory” label is doing extra work. The system evolves retrieval configuration, not an agent’s durable world model or planning habits. Compared with GraphRAG or MemGPT-style storage changes, this smells closer to eval-driven RAG tuning done automatically. The missing hooks are cost, evolution rounds, and failure-log quality. Without those, +25.7% is either a useful memory layer or a very competent benchmark tuner.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→Reinforcement Learning with Verifiable Rewards Improved via Random Few-Shot Guidance

FEST guides RLVR with 128 randomly selected demonstrations from an SFT dataset, combining supervised signals, on-policy signals, and decaying few-shot SFT weights to improve sample efficiency on difficult math and coding-style tasks across several benchmarks.

#Reasoning#Fine-tuning#Benchmarking#FEST

why featured

HKR-H/K/R all pass, but this is a single arXiv training-method paper and impact depends on code and replication; it fits the lower featured band.

editor take

FEST gets RLVR moving with 128 random demos; that’s useful training plumbing, not magic. I want the cost curve before buying the claim.

sharp

FEST targets the boring RLVR bottleneck that actually hurts: sample efficiency on hard tasks. The hook is concrete: 128 randomly selected SFT demonstrations, mixed with supervised signal, on-policy signal, and decaying few-shot SFT weights to avoid overfitting across epochs. The paper says it beats baselines across math and coding-style benchmarks, sometimes matching full-dataset SFT. I buy the direction, not the “128 demos are enough” headline. Correct rollouts are scarce in RLVR, so FEST is basically giving the policy a small verified ramp before on-policy learning takes over. But the abstract does not expose model size, token budget, failure split, or wall-clock cost. Compared with many DPO/GRPO-flavored papers, this smells less like algorithm branding and more like a practical data-efficiency trick.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→The Pitfalls of KV Cache Compression

The paper evaluates five KV cache compression methods on Llama3.1 8B and Qwen2.5 14B with multi-instruction IFEval, and finds that some instructions degrade much faster under compression until the LLM ignores them.

#Inference-opt#Benchmarking#Safety#Llama

why featured

HKR-H/K/R all pass, but this is a single arXiv inference-optimization paper with scope limited to KV-cache compression. Concrete models, 5 methods, and IFEval failures justify featured, not must-write.

editor take

KV cache compression is not a free lunch; it starts by dropping the instructions product teams treat as safety rails.

sharp

KV cache compression ties inference savings to instruction reliability, and that is a deployment risk. The paper tests StreamingLLM, SnapKV, TOVA, H2O, and K-Norm on Llama3.1 8B and Qwen2.5 14B with multi-instruction IFEval. Some instructions degrade faster under compression until the model ignores them. That is exactly the failure mode hidden by single-score benchmark reporting. The uncomfortable part is the safety angle. The authors use system prompt leakage as the case study, then name compression method, instruction order, and KV eviction bias as drivers. Vendors sell KV compression as throughput work; this paper says eviction policy changes the model’s obedience profile. The abstract does not give throughput gains or degradation curves, so any production claim needs a local rerun.
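A toy position-based example shows how an eviction policy can silently drop one instruction while keeping others. The sketch assumes a StreamingLLM-style rule that keeps a few initial sink tokens plus a recent window; the function names and the token positions are made up for illustration, and score-based methods such as SnapKV, TOVA, or H2O are not modeled.

```python
def streaming_keep(n_tokens, n_sink, window):
    """Indices retained by a sink-plus-recent-window policy."""
    return [i for i in range(n_tokens) if i < n_sink or i >= n_tokens - window]


def survives(span, kept):
    """True only if every token of an instruction span is still cached."""
    kept_set = set(kept)
    return all(i in kept_set for i in span)
```

With 20 tokens, 4 sinks, and a window of 6, an instruction at positions 0 to 3 survives and one at positions 8 to 11 is gone entirely, which is the order-dependent failure an aggregate score would average away.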

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATUREDarXiv · cs.LG· atomEN04:00 · 05·15

→From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

The paper introduces Pluralistic Repair Score and tests two RLHF-trained frontier models, Claude Sonnet 4.5 with 198 cases and GPT-4o with 100 cases, measuring principled revision versus capitulation on contested-value prompts.

#Alignment#Safety#Benchmarking#Claude

why featured

HKR-H/K/R all pass: the angle is conflict-rich, the post gives a new score plus two sample sizes, and the topic fits alignment evals. As a single arXiv paper without visible uptake, it stays in the 72–77 band.

editor take

Pluralistic alignment cannot stay a polling problem; Claude Sonnet 4.5 and GPT-4o still bend too fast on contested values.

sharp

This paper hits the awkward part of alignment evals: the failure is not thin value coverage, it is models smoothing conflict away. The authors test Pluralistic Repair Score on Claude Sonnet 4.5 with 198 cases and GPT-4o with 100 cases, measuring whether a model revises on principled grounds after user pressure rather than just yielding. That is the right pressure point for RLHF assistants, because agreeableness has become a product incentive, not a rare bug. The weak spot is also obvious: the sample is small, and “principled” still needs a judge with values. But compared with another preference-distribution benchmark, PRS is closer to the deployment failure practitioners actually see.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→LLMs Should Express Uncertainty Explicitly

The paper trains LLMs to emit <uncertain> during reasoning or verbalize a confidence score at the end, and reports that both methods reduce overconfident errors on factual reasoning tasks while serving as RAG triggers.

#Reasoning#Alignment#RAG#Research release

why featured

HKR-H/K/R all pass, but the body gives no experiment scale, model list, or error-reduction size. This is a practical safety/alignment paper, enough for featured but below must-write.

editor take

Training uncertainty as an output interface is practical; without model size or error-rate deltas, don’t sell this as solved reliability.

sharp

This paper pushes uncertainty from an evaluation artifact into an engineering hook. The authors train models to emit `<uncertain>` during reasoning or give a final confidence score; the abstract says both reduce overconfident factual errors and can trigger RAG. The useful detail is mechanistic: final confidence sharpens a structure already present in the pretrained model, while `<uncertain>` changes concentrate in late layers. I like the `<uncertain>` path more. A final confidence score often becomes UI theater; a RAG system needs a mid-generation brake, not a post-hoc apology. The missing pieces are big: no model size, benchmark names, error-rate deltas, or retrieval cost in the snippet. So this proves trainability, not deployability.
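To make the mid-generation brake concrete, here is a minimal sketch, assuming the serving loop can inspect tokens as they stream. `generate_with_brake` and `fake_retrieve` are hypothetical names of mine, and using the text so far as the query is my assumption; only the `<uncertain>` marker comes from the abstract.

```python
from typing import Callable, Iterable, List

UNCERTAIN = "<uncertain>"  # marker token named in the abstract


def generate_with_brake(
    token_stream: Iterable[str],
    retrieve: Callable[[str], str],
) -> List[str]:
    """Pass tokens through; on each marker, retrieve and splice evidence in."""
    out: List[str] = []
    for tok in token_stream:
        if tok == UNCERTAIN:
            query = " ".join(out)  # assumption: the text so far is the query
            out.append(f"[evidence: {retrieve(query)}]")
            continue
        out.append(tok)
    return out


def fake_retrieve(query: str) -> str:
    # Stand-in for a real retriever; returns a canned passage.
    return "Canberra is the capital of Australia"


tokens = ["The", "capital", "of", "Australia", "is", UNCERTAIN, "Canberra"]
result = generate_with_brake(tokens, fake_retrieve)
```

The point of the sketch is placement: retrieval fires where the model flags doubt, not after the answer is already written.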

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

The paper introduces SP-KV, which scores each key-value pair with a lightweight utility predictor, keeps recent KVs in a local window, writes older KVs to global attention only above a threshold, and typically reduces KV cache size by 3 to 10 times.

#Inference-opt#Research release

why featured

HKR-H/K/R all pass, but the article gives abstract-level detail only: no benchmarks, model sizes, or accuracy tradeoffs. This fits a practical inference-optimization paper at the featured threshold.

editor take

SP-KV cuts KV cache by 3–10x, which is the right pressure point; joint training keeps it from being a drop-in serving hack.

sharp

SP-KV attacks the invoice, not the context-length headline. It adds a lightweight utility predictor for each key-value pair, keeps recent KVs in a local window, and writes older KVs to global cache only above a threshold. The paper claims 3–10x KV-cache reduction, with longer sequences often compressing better. That is a cleaner target than blunt eviction because it learns write decisions at KV-pair granularity. The catch is deployment. The LLM and predictor are trained jointly end-to-end from pretrained checkpoints using next-token loss. That makes SP-KV a model-side architectural bet, not a patch vLLM or TensorRT-LLM can simply bolt on tomorrow. Meta FAIR on the author list makes the result harder to dismiss, but the abstract does not disclose model scale or production throughput numbers. I would not translate “vast decoding speed” into serving SLA yet.
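A toy sketch of the write rule as the abstract describes it: recent positions stay in a local window, and a position leaving the window is written to the global cache only if its utility score clears a threshold. The scores here are hard-coded; in SP-KV they would come from the learned predictor. `gated_kv_cache` and all values are hypothetical, not the authors' implementation.

```python
from collections import deque
from typing import Deque, List, Tuple


def gated_kv_cache(
    utilities: List[float], window: int, threshold: float
) -> Tuple[List[int], List[int]]:
    """Return (global_positions, local_positions) after a full pass."""
    local: Deque[int] = deque()
    global_cache: List[int] = []
    for pos in range(len(utilities)):
        local.append(pos)
        if len(local) > window:
            old = local.popleft()            # position falls out of the window
            if utilities[old] > threshold:   # write only above the threshold
                global_cache.append(old)
    return global_cache, list(local)


# Made-up utility scores for eight token positions.
utilities = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.1]
kept_global, kept_local = gated_kv_cache(utilities, window=3, threshold=0.5)
ratio = len(utilities) / (len(kept_global) + len(kept_local))
```

In this toy run five of eight positions survive, so the reduction is far below the claimed 3–10x; the real ratio depends on how selective the trained predictor is.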

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

ECHO reframes speculative execution as a budgeted scheduling problem inside SGLang; evaluations across multiple model scales, including Qwen3-235B, report up to 5.35x walltime speedup and more than 20% relative speedup gain under low-load and high-load serving conditions.

#Inference-opt#SGLang#Qwen#Research release

why featured

HKR-H/K/R all pass, but this is a specialized arXiv systems paper rather than a broad product or model release. The 5.35x speedup, Qwen3-235B eval, and SGLang integration justify low featured.

editor take

ECHO attacks the ugly part of speculative decoding: verification waste under concurrency. The 5.35x number is big; the scheduler idea matters more.

sharp

ECHO hits the part of speculative decoding papers that usually gets sanded down: single-request speedups do not survive messy serving. It turns speculative execution inside SGLang into budgeted scheduling, then uses sparse confidence gating to treat the batch as one super-tree. The concrete hook is strong: Qwen3-235B, up to 5.35x walltime speedup, and over 20% relative gain under both low-load and high-load settings. I’d still be careful with the headline number. The abstract does not expose workload mix, concurrency levels, draft-model cost, or acceptance-rate details, so 5.35x may be tied to a friendly decode regime. This smells less like a model breakthrough and more like a serving-stack correction. For teams running SGLang or vLLM-class infrastructure, the question is whether ECHO keeps verification waste down under real traffic bursts.
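The abstract does not spell out ECHO's scheduler, so the following is only my sketch of what "budgeted scheduling with confidence gating" could look like: pool draft tokens from every request, drop chains whose cumulative confidence falls below a gate, then spend a fixed verification budget on the most confident survivors. `allocate_verify_budget`, the gate, and the probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple


def allocate_verify_budget(
    draft_probs: Dict[str, List[float]], budget: int, gate: float
) -> Dict[str, int]:
    """Return how many draft tokens per request are sent to verification."""
    candidates: List[Tuple[float, int, str]] = []
    for req_id, probs in draft_probs.items():
        path_conf = 1.0
        for depth, p in enumerate(probs, start=1):
            path_conf *= p
            if path_conf < gate:
                break  # gate: stop extending a chain once confidence is too low
            candidates.append((-path_conf, depth, req_id))
    # Most confident first; ties resolve toward shallower tokens, so a selected
    # token's ancestors are always selected before it.
    candidates.sort()
    picked = {req_id: 0 for req_id in draft_probs}
    for _, depth, req_id in candidates[:budget]:
        picked[req_id] = max(picked[req_id], depth)
    return picked


draft_probs = {
    "a": [0.9, 0.8, 0.7, 0.9],
    "b": [0.6, 0.5, 0.9],
    "c": [0.3, 0.9],
}
picked = allocate_verify_budget(draft_probs, budget=4, gate=0.4)
```

Under load the budget is the binding constraint, which is why a shared pool can beat a fixed per-request draft length.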

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

IsoNet extracts user-selected target speech on a compact 4-microphone array, combining multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and direction-of-arrival supervision; trained on 25,000 simulated VoxCeleb mixtures, IsoNet-CL1 reaches 9.31 dB SI-SDR on a -1 to 10 dB SNR hard test set, improving the mixture by 4.85 dB.

#Multimodal#Audio#Vision#VoxCeleb

why featured

HKR-K passes with clear metrics and setup details for multimodal audio readers. HKR-H/R are weak: the title is academic, and the use case is too narrow for featured.

editor take

Three entries trace to the same arXiv paper, not independent validation; IsoNet’s useful claim is beating beamforming on a tiny 4-mic array.

sharp

All 3 entries use the same title and point back to arXiv 2605.14736, so the breadth is aggregator echo, not independent confirmation. IsoNet’s sharp claim is narrow but useful: a compact 4-mic array plus GCC-PHAT, face-conditioned visual embeddings, and DOA supervision inside a U-Net beats classical spatial filtering where aperture is tiny. On the -1 to 10 dB SNR hard set, IsoNet-CL1 reports 9.31 dB SI-SDR, a 4.85 dB gain over the mixture; oracle delay-and-sum and MVDR degrade SI-SDRi by 4.82 and 6.08 dB. I buy the direction, not the deployment story yet. The training set is 25,000 simulated VoxCeleb mixtures, and the paper itself flags phase reconstruction, multi-interferer cases, and sim-to-real transfer as open barriers.
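For readers who do not live in speech metrics, a small sketch of the standard SI-SDR definition behind those dB figures. The signals are toy values of mine; the last line only does the arithmetic implied by the reported numbers, assuming the 4.85 dB gain is measured against the unprocessed mixture.

```python
import math
from typing import Sequence


def si_sdr_db(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB between an estimate and a clean reference."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy  # optimal rescaling of the reference
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]  # reference plus a small error
score = si_sdr_db(estimate, reference)
louder = si_sdr_db([2.0 * e for e in estimate], reference)  # same score

# Implied input quality if 4.85 dB is the gain over the raw mixture.
implied_mixture_db = 9.31 - 4.85
```

Because the metric is scale-invariant, making the output louder earns nothing; only removing interference moves the number.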

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Widening the Gap: Exploiting LLM Quantization via Outlier Injection

The paper introduces an outlier-injection attack that triggers malicious behavior after AWQ, GPTQ, and GGUF I-quants quantization, with evaluation across three attack scenarios and multiple LLMs where prior quantization-conditioned attacks failed.

#Safety#Inference-opt#Research release#Safety/alignment

why featured

HKR-H/K/R all pass: the paper frames quantization as an attack surface, names outlier injection across 3 scenarios and major formats, and hits deployment-safety concerns. Missing success rates and model lists keep it in the lower featured band.

editor take

Quantization is no longer a harmless deployment chore; AWQ, GPTQ, and GGUF I-quants now sit inside the model supply-chain threat model.

sharp

This paper moves quantization attacks out of toy territory and into the deployment path people actually use. The mechanism is blunt: ship a benign-looking full-precision model, inject large outliers into selected weight blocks, then let AWQ, GPTQ, or GGUF I-quants round neighboring weights to zero after the user quantizes it. That turns compression into the trigger, not the payload carrier. The authors claim evaluation across three attack scenarios and multiple LLMs, and say prior quantization-conditioned attacks failed on these schemes. The abstract does not disclose exact success rates or model names, so I would not call this a live supply-chain incident yet. But the common Hugging Face flow—download weights, make a local 4-bit build, run through llama.cpp or similar—now needs a security check at the quantization step, not only at model upload.
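The rounding effect is easy to reproduce with a deliberately simplified quantizer. This sketch uses plain per-block absmax round-to-nearest at 4 bits, which is my stand-in and not how AWQ, GPTQ, or GGUF I-quants actually work; the weights are made up. It only shows why one large outlier can erase its neighbors.

```python
from typing import List


def quantize_block(weights: List[float], bits: int = 4) -> List[float]:
    """Symmetric absmax round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax  # one scale per block
    return [
        max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights
    ]


clean = [0.05, -0.03, 0.04, -0.02]
poisoned = [0.05, -0.03, 0.04, 10.0]  # one injected outlier

deq_clean = quantize_block(clean)
deq_poisoned = quantize_block(poisoned)
```

In the clean block every weight survives with some rounding error. In the poisoned block the outlier stretches the scale so far that the three ordinary weights all round to zero.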

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Krause Synchronization Transformers

The paper introduces Krause Attention, replacing softmax global aggregation with distance-based localized sparse interactions, reducing sequence-length complexity from quadratic to linear, and validating it on ViT for CIFAR/ImageNet, autoregressive image generation on MNIST/CIFAR-10, Llama/Qwen, and 100M/200M language models trained from scratch.

#Inference-opt#Reasoning#Vision#Llama

why featured

HKR-H/K/R all pass: the mechanism, complexity claim, and tested model families are concrete. Score stays at the featured threshold because benchmark numbers, code release, and larger-scale results are not disclosed.

editor take

Krause Attention attacks softmax at the interaction rule, not the kernel trick layer. Nice shape, but Llama/Qwen without hard long-context numbers is still a lab claim.

sharp

Krause Attention’s useful move is changing the interaction rule, not selling another faster attention kernel. It replaces globally normalized softmax with distance-based local sparse interactions, claims O(n) sequence scaling instead of O(n²), and reports tests on ViT, MNIST/CIFAR-10 autoregressive generation, Llama/Qwen, plus 100M and 200M language models trained from scratch. I buy the target: attention sinks and representation collapse are not fixed by FlashAttention-style kernel work. But the excerpt gives no concrete Llama/Qwen perplexity, context length, or throughput numbers. That matters. Mamba, RetNet, and Hyena all taught the same lesson: linear scaling is the pitch; preserving decoder quality after replacement is the test.
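The excerpt does not give the exact update rule, so this is a sketch of the general idea only: each token averages values from nearby positions whose keys fall within a distance radius, in the spirit of bounded-confidence averaging. The causal window, the Gaussian weighting, and the name `bounded_local_attention` are my assumptions, not the paper's formulation.

```python
import math
from typing import List

Vector = List[float]


def bounded_local_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    window: int,
    radius: float,
) -> List[Vector]:
    """Average values over in-window positions whose key is within `radius`."""
    outputs: List[Vector] = []
    for i, q in enumerate(queries):
        start = max(0, i - window + 1)  # causal local window
        idxs: List[int] = []
        weights: List[float] = []
        for j in range(start, i + 1):
            dist_sq = sum((a - b) ** 2 for a, b in zip(q, keys[j]))
            if dist_sq <= radius ** 2:  # ignore far-away tokens entirely
                idxs.append(j)
                weights.append(math.exp(-dist_sq))
        if not idxs:  # fall back to the token itself
            idxs, weights = [i], [1.0]
        total = sum(weights)
        dim = len(values[i])
        outputs.append(
            [
                sum(w * values[j][c] for w, j in zip(weights, idxs)) / total
                for c in range(dim)
            ]
        )
    return outputs


points = [[0.0], [0.1], [5.0], [5.1]]
vals = [[1.0], [1.0], [3.0], [3.0]]
mixed = bounded_local_attention(points, points, vals, window=3, radius=1.0)
```

Work per token is bounded by the window, so cost grows linearly with sequence length. The two clusters in the toy input never mix, which is the behavior a global softmax cannot give.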

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Fast Adversarial Attacks with Gradient Prediction

The paper proposes predicting input gradients from forward-pass hidden states with lightweight linear regression, removing the backward pass and recovering much of FGSM attack performance while reporting a 532% throughput increase under wall-clock constraints.

#Safety#Benchmarking#Research release#Safety/alignment

why featured

HKR-H/K pass: the no-backprop attack mechanism and 532% throughput figure are concrete. HKR-R is weak because this remains a niche robustness paper with no cross-source pickup or production replacement evidence.

editor take

A 532% throughput jump would slash red-team cost, but “much of FGSM” is doing work; attack strength needs the full ledger.

sharp

The sharp part is not another FGSM variant; it removes backprop from the attack loop. The method predicts input gradients from forward-pass hidden states using lightweight linear regression. The abstract reports a 532% throughput increase while recovering much of FGSM’s attack performance. For robustness evaluation and large-batch red-teaming, that hits the cost center directly: more samples and perturbation settings under the same wall clock. I’m wary of the phrase “much of FGSM.” The excerpt does not give datasets, model sizes, epsilon values, success-rate loss, or how this relates to multi-step attacks like PGD. The NTK-exact argument is elegant, but finite-width models and defended models are where gradient shortcuts usually get exposed. If the win only holds near clean classifiers, this is an evaluation accelerator, not a replacement for strong attacks.
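A deliberately tiny sketch of the loop described above, assuming a single scalar hidden feature per example so the regression has a closed form. The real method regresses from full hidden states. All names and numbers here are hypothetical; only the shape (fit a linear predictor once, then take sign steps without backprop) follows the abstract.

```python
from typing import List, Sequence


def fit_gradient_predictor(
    hidden: Sequence[float], grads: Sequence[Sequence[float]]
) -> List[float]:
    """Least-squares slope per input coordinate: grad[d] ~ coef[d] * hidden."""
    denom = sum(h * h for h in hidden)
    dims = len(grads[0])
    return [
        sum(h * g[d] for h, g in zip(hidden, grads)) / denom
        for d in range(dims)
    ]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm_step(x: Sequence[float], grad: Sequence[float], eps: float) -> List[float]:
    """One FGSM step: move each coordinate by eps in the gradient's sign."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


# Calibration set: true input gradients collected once with backprop.
hidden = [1.0, 2.0, -1.0]
true_grads = [[0.5, -1.0], [1.0, -2.0], [-0.5, 1.0]]
coef = fit_gradient_predictor(hidden, true_grads)

# Attack time: forward pass only, gradient predicted from the hidden feature.
new_hidden = 3.0
predicted_grad = [c * new_hidden for c in coef]
x_adv = fgsm_step([0.2, 0.7], predicted_grad, eps=0.1)
```

FGSM only consumes the sign of the gradient, so a predictor that gets signs mostly right already recovers much of the attack.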

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

The paper identifies a mismatch between ascending-quality curricula and learning-rate decay in LLM pretraining; on 1.5B-parameter models trained for 30B tokens, moderate decay plus weighted averaging of final checkpoints improves average benchmark scores by 1.64% over random shuffling.

#Reasoning#Benchmarking#arXiv#Research release

why featured

Single arXiv pretraining paper where HKR-H/K/R all pass: a counterintuitive title, concrete 1.5B/30B-token evidence, and a cost-efficiency angle. Its reach is mainly training teams, so it sits just above the featured threshold.

editor take

Curriculum pretraining was not the dud; the lazy LR schedule was. A 1.64% gain is small, but it hits a real training-recipe nerve.

sharp

This paper moves the blame from “curriculum ordering does not work” to “LR decay wastes the good tail.” On 1.5B models trained for 30B tokens, ascending-quality curricula lose their edge under standard decay because the best data arrives when the step size is already small. A milder decay plus weighted averaging of the final checkpoints beats random shuffling by 1.64% on average benchmarks. I buy the mechanism more than the headline gain. This is not another data-filter paper claiming a cleaner pile; it explains why curriculum pretraining often fails to beat shuffle in practice. The caveat is scale: 1.5B / 30B tokens is useful, not frontier evidence, and the body gives no 10B-plus replication. Still, for teams tuning pretraining recipes, this is a cheaper lever than another round of data curation.
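The two levers can be sketched in a few lines. The cosine shape, the floor ratios, and the averaging weights below are my placeholders; the feed does not give the paper's actual schedule or weighting.

```python
import math
from typing import List, Sequence


def cosine_lr(step: int, total_steps: int, peak_lr: float, floor_ratio: float) -> float:
    """Cosine decay from peak_lr down to floor_ratio * peak_lr."""
    floor = peak_lr * floor_ratio
    progress = step / total_steps
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def weighted_average(
    checkpoints: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of parameter vectors from the final checkpoints."""
    total = sum(weights)
    dim = len(checkpoints[0])
    return [
        sum(w * c[i] for w, c in zip(weights, checkpoints)) / total
        for i in range(dim)
    ]


# Aggressive decay leaves almost no step size for the high-quality tail.
harsh_final = cosine_lr(100, 100, peak_lr=1e-3, floor_ratio=0.1)
# Milder decay keeps the tail learnable; averaging smooths the extra noise.
mild_final = cosine_lr(100, 100, peak_lr=1e-3, floor_ratio=0.5)
merged = weighted_average([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
```

The averaging step matters because a higher final learning rate leaves the last checkpoint noisier than a fully decayed one.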

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→ASH: Agents that Self-Hone via Embodied Learning

ASH learns embodied policies from unlabeled noisy internet video without reward shaping or expert annotations, and in 8-hour evaluations reaches 11.2/12 milestones in Pokémon Emerald and 9.9/12 in The Legend of Zelda, while the strongest baselines stop at 6.5/12 and 6.0/12 respectively.

#Agent#Robotics#Memory#ASH

why featured

HKR-H/K/R pass, but this is a single arXiv paper with only the mechanism and two game scores disclosed; no code or real-robot validation is stated, so it stays at the lower featured band.

editor take

ASH’s punchline is not gaming skill; it turns unlabeled video into supervision. Two games and 8-hour runs still leave robotics claims exposed.

sharp

ASH deserves attention for its supervision pipeline, not for being good at Pokémon. When stuck, it trains an inverse dynamics model on its own trajectories, then mines relevant internet video for action supervision. That dodges two scaling traps: reward shaping and expert action labels. The numbers are clean: 11.2/12 milestones in Pokémon Emerald and 9.9/12 in Zelda over 8-hour runs, while the strongest baselines stall at 6.5/12 and 6.0/12. I don’t buy the broad “scalable recipe for embodied learning” claim yet. These games are long-horizon, but they are resettable, visually stable, and cheap to fail in. Robotics has uglier gaps: hidden actions, contact dynamics, embodiment mismatch, and messy camera distributions. ASH shows unlabeled web video can carry real signal for long-horizon agents. It has not shown that signal survives the jump to physical embodiment.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

LiSA improves a fixed guardrail with structured memory, converts sparse failures into reusable policy abstractions, and reports gains on PrivacyLens+, ConFaide+, and AgentHarm while remaining robust under 20% label-flip noise.

#Agent#Safety#Memory#LiSA

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with no disclosed code, institution, or production deployment. It sits near the featured threshold for safety/alignment research.

editor take

LiSA is the right safety shape for agents: fixed guardrails, sparse feedback, structured memory. I’d trust the direction before trusting the claimed generalization.

sharp

LiSA makes the right bet: agent safety will not be fixed by constant fine-tuning, but by an auditable local memory layer. It turns rare failures into policy abstractions, then gates reuse with conflict-aware local rules and a posterior lower bound. The paper reports gains on PrivacyLens+, ConFaide+, and AgentHarm, plus robustness at 20% label-flip noise. That shape matches deployment better than another bigger backbone. Enterprise feedback is sparse, dirty, and tied to local policy. I still don’t buy the generalization claim without the full numbers. The abstract gives no effect size, and it does not show whether feedback resembles real tickets. A wrong safety memory can become stickier than a bad refusal.
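One way to read "posterior lower bound" is as a conservative gate on rule reuse. The abstract does not give the formula, so this sketch swaps in a Beta posterior with a normal approximation to its lower quantile; the prior, the `z` value, and the 0.8 cutoff are all my assumptions.

```python
import math


def posterior_lower_bound(successes: int, failures: int, z: float = 1.645) -> float:
    """Approximate lower credible bound on a rule's success rate.

    Uses a Beta(1 + successes, 1 + failures) posterior and a normal
    approximation (mean - z * std), not an exact Beta quantile.
    """
    a = 1 + successes
    b = 1 + failures
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean - z * math.sqrt(var)


def should_reuse(successes: int, failures: int, min_bound: float = 0.8) -> bool:
    """Reuse a stored safety rule only when the lower bound clears the bar."""
    return posterior_lower_bound(successes, failures) >= min_bound


fresh_rule = should_reuse(successes=2, failures=0)    # perfect but thin record
proven_rule = should_reuse(successes=50, failures=1)  # long, mostly clean record
```

The gate refuses a rule with a perfect but two-sample record and accepts one with a long track record, which is the stickiness control the last sentence above is worried about.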

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Your CLIP Has 164 Dimensions of Noise: Exploring VLM Embedding Covariance Eigenspectra

The paper uses spectral decomposition of covariance matrices to split VLM latent spaces into semantic signal and shared noise components, and the title reports 164 noisy CLIP dimensions; the abstract says pruning shared noise dimensions is mainly harmless and can improve downstream task performance, but the RSS snippet does not disclose datasets or metric values.

#Multimodal#Vision#Embedding#CLIP

why featured

HKR-H and HKR-K pass: the title has a sharp numeric hook and the post gives a pruning mechanism. HKR-R is weaker because downstream gains, datasets, and reproducible settings are not disclosed.

editor take

CLIP having 164 noisy dimensions is a useful slap at embedding worship: not every high-dimensional direction deserves runtime trust.

sharp

This paper takes a clean swing at the “more embedding dimensions are better” habit. The authors use covariance eigenspectrum decomposition to split CLIP-style VLM latent space into semantic signal and shared multimodal noise, with the title calling out 164 noisy dimensions. The sharper claim is subgroup invariance: the noise geometry reportedly persists across data subsets, and pruning those directions is mostly harmless or improves downstream tasks. I buy the research direction, not the strength of the claim yet. The excerpt gives no datasets, metric deltas, CLIP variant, or base embedding width. A 164-dimensional cut means different things for a 768-wide ViT-L/14 projection versus a larger head. For image-text retrieval, deduping, and multimodal RAG, spectral cleanup is cheaper than another finetuned head. But without reproduction details, this is a promising diagnostic, not a drop-in compression recipe.
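A stdlib-only sketch of the pruning mechanics on 2-D toy embeddings: estimate the covariance, find a dominant eigen-direction by power iteration, and project it out. Treating the top direction as the "shared noise" is purely illustrative; the excerpt does not say which part of the spectrum the authors prune, and all names here are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def covariance(rows: Matrix) -> Matrix:
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvector(matrix: Matrix, iters: int = 100) -> List[float]:
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project_out(vec: List[float], direction: List[float]) -> List[float]:
    """Remove the component of vec along a unit-length direction."""
    coeff = sum(a * b for a, b in zip(vec, direction))
    return [a - coeff * b for a, b in zip(vec, direction)]


# Toy embeddings dominated by one shared direction, roughly (1, 1).
rows = [[3.0, 3.1], [-3.0, -2.9], [1.0, 0.9], [-1.0, -1.1]]
shared = top_eigenvector(covariance(rows))
pruned = [project_out(r, shared) for r in rows]
```

At real scale you would use a proper eigensolver and pick the pruned directions by whatever criterion the paper defines.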

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→MathAtlas: A Benchmark for Autoformalization in the Wild

MathAtlas collects about 52k theorems, definitions, exercises, examples, and proofs from 103 graduate mathematics textbooks, and adds roughly 178k dependency relations; strong baselines reach at most 9.8% correctness on theorem statements, while the best model scores only 2.6% on the 700-entity MA-Hard subset.

#Reasoning#Benchmarking#MathAtlas#Research release

why featured

HKR-H/K/R all pass, but this is an arXiv benchmark with reach concentrated in reasoning and formal math. Strong numbers support featured; niche technical scope keeps it in the low 70s.

editor take

MathAtlas drags autoformalization out of contest math; 9.8% and 2.6% correctness say theorem-prover demos are still far from library work.

sharp

MathAtlas lands a clean hit on autoformalization hype: dependency depth, not theorem glamour, is where current systems break. The dataset pulls about 52k entities from 103 graduate textbooks and adds roughly 178k dependency edges. Strong baselines top out at 9.8% correctness on theorem statements, 16.7% on definitions, and only 2.6% on the 700-entity MA-Hard split. That is a brutal read for Lean/Isabelle agent demos. MiniF2F and ProofNet-style sets often test isolated translation or proof search; MathAtlas forces models to carry definitions, notation, and chapter context. Honestly, getting one theorem formalized is a very different job from feeding a graduate text into mathlib without human cleanup.
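"Dependency depth" here can be read as the longest chain of prerequisites behind an entity. A small sketch over a made-up dependency map (the entity names are mine, not drawn from the dataset) shows how a single theorem drags a chain of definitions with it.

```python
from functools import lru_cache
from typing import Dict, List


def dependency_depths(deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest prerequisite chain per entity; assumes the graph is acyclic."""

    @lru_cache(maxsize=None)
    def depth(name: str) -> int:
        parents = deps.get(name, [])
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    return {name: depth(name) for name in deps}


# Hypothetical textbook fragment: each entity lists what it depends on.
deps = {
    "group": [],
    "subgroup": ["group"],
    "normal subgroup": ["subgroup"],
    "quotient group": ["group", "normal subgroup"],
    "first isomorphism theorem": ["quotient group"],
}
depths = dependency_depths(deps)
```

Formalizing the theorem at depth 4 means every definition beneath it has to be formalized correctly first, so errors compound down the chain.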

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Research Proposes Universal Semi-Supervised Learning Framework with Structural Inference

The paper formalizes Universal Semi-supervised Learning and proposes SAGE. It avoids unlabeled-distribution estimation, uses high-order inter-sample structure, and reports an 8.52% average accuracy gain across five standard benchmarks.

#Fine-tuning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes via a new method and +8.52% across 5 benchmarks. HKR-H/R miss: the item is thin, academic, and lacks a broader practitioner hook.

editor take

SAGE reports +8.52% average accuracy on 5 benchmarks. I buy the structural constraint angle, but no code is disclosed yet.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

The authors post-train a multi-turn CodeAct agent with reinforcement learning for FHIR-AgentBench, raising answer correctness from 50% with o4-mini to 77% with the smaller Qwen3-8B under execution-grounded LLM-judge rewards and data-integrity constraints.

#Agent#Reasoning#Tools#Qwen

why featured

HKR-K and HKR-R pass: the benchmark delta is concrete, and small vertical tool agents are relevant to practitioners. The narrow FHIR scope and single arXiv source keep it below featured.

editor take

RL-trained Qwen3-8B reaches 77% on FHIR-AgentBench against 50% for o4-mini; healthcare agents need trained traversal discipline, not another tool wrapper.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→GEAR: Self-Distillation Method for Granularity-Adaptive Advantage Reweighting in LLM Agents

GEAR reshapes trajectory-level GRPO advantages with self-distillation signals, and experiments on eight mathematical reasoning and agentic tool-use benchmarks using Qwen3 4B and 8B models report consistent gains over GRPO, self-distillation baselines, and token- or turn-level credit assignment, with improvements reaching about 20% over GRPO on harder long-horizon settings.

#Agent#Reasoning#Fine-tuning#Qwen

why featured

HKR-K and HKR-R pass through a concrete GEAR mechanism and gains over GRPO across 8 benchmarks, reaching about 20% on the harder long-horizon settings. HKR-H is weak, and the narrow RL-training scope keeps it below featured.

editor take

GEAR reports up to 20% over GRPO on 8 benchmarks; I buy the direction—long-horizon credit assignment gets a usable scalpel.
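For context, this is the trajectory-level signal GEAR starts from: the usual group-relative advantage, where each rollout's reward is normalized against its sampling group. GEAR's self-distillation reweighting is not specified in the feed, so it is not shown. The use of population standard deviation and the `eps` term are my choices.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token in a rollout inherits the same number, which is exactly the coarse credit assignment that finer-grained reweighting tries to fix on long horizons.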

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

The paper defines minimal cores for overcomplete reasoning traces across six reasoning benchmarks, finding that 46% of steps are removable on average while preserving the original answer in 86% of cases, and the top three steps account for 65% of measured necessity mass.

#Reasoning#Interpretability#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with abstract-level numbers only; no tool, code, or adoption evidence is disclosed, so it stays in the lower 60–71 band.

editor take

Six benchmarks drop 46% of CoT steps with 86% answer retention; long traces carry dead tokens.
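A greedy sketch of what "removable while preserving the answer" means operationally. The paper's definition of a minimal core may differ, and a single greedy pass is not guaranteed to find the smallest one. `minimal_core` and the toy answer function are hypothetical stand-ins for re-running a model on a trimmed trace.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, int]  # (label, contribution to the toy answer)


def minimal_core(
    steps: Sequence[Step], answer_fn: Callable[[Sequence[Step]], int]
) -> List[Step]:
    """Drop each step whose removal leaves the final answer unchanged."""
    target = answer_fn(steps)
    core = list(steps)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if answer_fn(trial) == target:
            core = trial  # step was removable; retry the same index
        else:
            i += 1        # step is necessary; keep it
    return core


def toy_answer(steps: Sequence[Step]) -> int:
    # Stand-in for "re-run the model on the trimmed trace".
    return sum(delta for _, delta in steps)


trace = [("restate", 0), ("add 2", 2), ("double-check", 0), ("add 3", 3), ("musing", 0)]
core = minimal_core(trace, toy_answer)
removable_share = 1 - len(core) / len(trace)
```

In a real setting each `answer_fn` call is a model invocation, so the cost of finding the core scales with trace length.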

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

Langzhou He and nine coauthors propose ActFocus, a token-level energy-informed reweighting method for agentic reinforcement learning; across four environments and multiple model sizes, it beats PPO and GRPO by up to 65.2 and 63.7 percentage points at the final step without extra runtime or memory cost.

#Agent#Reasoning#Fine-tuning#Langzhou He

why featured

HKR-K is strong and HKR-H has a concrete method hook, but this is a single arXiv training paper with no disclosed code, reproduction setup, or adoption signal, so it stays in the 60–71 band.

editor take

ActFocus beats PPO by up to 65.2 points across 4 environments; I buy action-token bottlenecks, pending task complexity in the PDF.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

HandITL blends human corrective intent with autonomous policy execution for bimanual dexterous manipulation, reducing takeover jitter by 99.8%, grasp failures by 87.5%, mean completion time by 19.1%, and producing policies that outperform standard teleoperation-trained policies by 19% on average across three long-horizon tasks.

#Robotics#Agent#Multimodal#HandITL

why featured

HKR-H/K/R all pass, but this is a single arXiv robotics paper. The post gives the mechanism and four headline metrics across three tasks, not baseline details or code, so it stays at the high end of 60–71.

editor take

HandITL cuts takeover jitter 99.8%; strong result, but three long-horizon tasks is not general dexterous VLA yet.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

VER distills multiple vision foundation models into an expert library for robot learning, fine-tunes only a routing network with fewer than 0.4% of parameters for downstream tasks, and reports state-of-the-art results across 17 robotic tasks with multiple policy heads.

#Vision#Robotics#Fine-tuning#Research release

why featured

HKR-H/K/R all pass: the paper reports <0.4% tuning, 17-task SOTA, and dynamic routing. It stays below featured because it is a single arXiv research item without a named lab, artifact, or cross-source pickup.

editor take

VER tunes under 0.4% of parameters across 17 robot tasks; expert routing is practical, but SOTA needs real-robot replication.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

The paper proposes QAOD, a single-pass white-box hallucination detection framework that removes question-aligned directions from answer representations; on BioASQ out-of-distribution transfer, its orthogonal-only probe beats the best white-box baseline by up to 21% while using under 25% of generation cost.

#Safety#Interpretability#Benchmarking#QAOD

why featured

HKR-H/K/R all pass, but this is a single arXiv method paper limited to BioASQ OOD and white-box detection. Without a major lab release, tool artifact, or cross-source uptake, it stays in the 60–71 band.

editor take

QAOD beats the best white-box baseline by up to 21% on BioASQ OOD; hallucination probes are finally taking domain shift seriously.
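The core operation is plain linear algebra: subtract from the answer representation its projection onto the question direction and probe what is left. This sketch removes a single direction on toy 2-D vectors; the paper may remove a whole subspace, and the function name is mine.

```python
from typing import List, Sequence


def remove_question_component(
    answer: Sequence[float], question: Sequence[float]
) -> List[float]:
    """Return the part of `answer` orthogonal to `question`."""
    qq = sum(q * q for q in question)
    coeff = sum(a * q for a, q in zip(answer, question)) / qq
    return [a - coeff * q for a, q in zip(answer, question)]


answer_vec = [3.0, 4.0]
question_vec = [1.0, 0.0]
residual = remove_question_component(answer_vec, question_vec)
overlap = sum(r * q for r, q in zip(residual, question_vec))
```

A probe trained on the residual cannot lean on question features, which is one plausible reason it would transfer better out of distribution.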

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Test-Time Learning with an Evolving Library

EvoLib lets large language models accumulate skills and reflective insights across test instances without parameter updates or external supervision, using a shared library plus weighting and consolidation to turn instance-specific abstractions into reusable knowledge over time.

#Reasoning#Code#Agent#EvoLib

why featured

HKR-H/K/R pass: the no-parameter test-time learning angle is clickable, and EvoLib adds a concrete shared-library mechanism. Score stays in 60–71 because the feed gives no benchmark results, code, or adoption signal.

editor take

EvoLib accumulates skills across tasks without parameter updates; no benchmark numbers disclosed, so I file it under memory engineering beating fine-tune iteration.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Collider-Bench evaluates LLM agents by asking them to reproduce LHC experimental analyses using only public papers and open scientific software, then scores predicted collision event yields with histogram metrics, per-task compute cost, and an LLM judge for qualitative failures; the paper reports that no agent reliably beats the physicist-in-the-loop solution on average.

#Agent#Code#Benchmarking#Collider-Bench

why featured

HKR-H/K/R all pass, but this is an arXiv domain benchmark with a high particle-physics barrier and weaker spread than general agent evals. No hard exclusion; it fits the 60–71 interesting band.

editor take

Collider-Bench makes agents reproduce LHC analyses and submit event yields; none reliably beats physicist-in-the-loop on average.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

The paper evaluates federated fine-tuning on the Sherpa.ai Federated Learning platform across four healthcare and finance datasets: MedQA, MedMCQA, FPB, and FiQA-SA. It compares LoRA, QLoRA, and IA3 under non-IID institutional settings, and reports performance close to centralized training, better results than isolated single-institution learning, and higher efficiency from QLoRA and IA3 with limited accuracy loss.

#Fine-tuning#Benchmarking#Sherpa.ai#Research release

why featured

HKR-K/R pass: the paper adds a healthcare/finance federated fine-tuning benchmark and method comparison, tied to private-data training pain. HKR-H is weak, and as a single arXiv item with no code or cross-source pickup, it stays below featured.

editor take

The paper tests federated tuning on 4 health/finance datasets; I don’t buy the “next frontier” label without node counts or privacy-attack evals.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Proxy Compression for Language Modeling

The paper introduces proxy compression, training one language model on raw byte sequences and externally compressed views while using only raw bytes at inference; code language modeling experiments show better fixed-compute efficiency than pure byte-level baselines, but the RSS snippet does not disclose exact improvement numbers.

#Inference-opt#Code#Research release#Open source

why featured

HKR-H and HKR-K pass: the train-time proxy versus raw-byte inference setup is concrete. HKR-R fails because no gain numbers, deployment target, or named lab push it beyond an interesting research item.

editor take

Proxy compression trains on bytes plus compressed views, then infers on bytes only; no gains disclosed, so don’t bury tokenizers yet.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Researchers introduce evolutionary multi-agent system for code solving

The paper introduces EvE, a decentralized co-evolving system for existing coding agents; it maintains two populations, code solvers and guidance states, and evaluates marginal gains through synchronous races with empirical Elo updates.

#Agent#Code#Reasoning#EvE

why featured

HKR-K and HKR-R pass: EvE has a concrete mechanism and targets coding-agent orchestration. The post lacks performance numbers, an open artifact, or production evidence, so it stays in the 60–71 band.

editor take

EvE scores agent marginal gains via synchronous races and Elo; ICON is neat, but benchmarks, code, and cost are undisclosed.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Conformal Thinking: Risk Control for Reasoning on a Compute Budget

The paper frames reasoning token budget selection as a risk-control problem, using a target risk and validation set to set upper and parametric lower stopping thresholds, and reports compute-efficiency gains across multiple reasoning tasks while keeping error rates within the user-specified risk target.
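To make the risk-control framing concrete, the sketch below picks an early-stopping confidence threshold from a validation set using the generic conformal risk control rule (smallest threshold whose corrected empirical risk is at most the target). The loss definition, the function name, and the candidate grid are assumptions for illustration; the paper's upper and parametric lower thresholds are not reproduced here.

```python
def calibrate_stop_threshold(confidences, early_wrong, alpha, candidates):
    """Pick the lowest confidence threshold whose corrected risk is <= alpha.

    confidences[i]: model confidence at the candidate stopping point (assumed).
    early_wrong[i]: True if stopping there would give a wrong answer.
    The loss is 1 when an example is stopped early AND is wrong, so it can
    only shrink as the threshold rises (monotone, bounded by 1).
    """
    n = len(confidences)
    for t in sorted(candidates):
        losses = sum(1 for c, w in zip(confidences, early_wrong) if c >= t and w)
        # Finite-sample correction: (n * empirical_risk + 1) / (n + 1).
        if (losses + 1) / (n + 1) <= alpha:
            return t
    return None  # no candidate meets the target; never stop early


confs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.75, 0.65, 0.55]
wrong = [False, False, True, True, False, False, False, True, True]
threshold = calibrate_stop_threshold(confs, wrong, alpha=0.25,
                                     candidates=[0.5, 0.6, 0.7, 0.8, 0.9])
```

Lower thresholds save more tokens but admit more early-stop errors, which is why the search starts from the most aggressive candidate.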

#Reasoning#Inference-opt#Research release

why featured

HKR-K and HKR-R pass: it reframes reasoning-token budgets as risk control, relevant to cost-sensitive teams. No concrete savings rate, task list, or code is disclosed, so it stays in the 60–71 band.

editor take

Conformal Thinking sets stopping thresholds from target risk plus validation data; I like the framing, but the abstract omits token savings.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

CurveBench introduces 756 images of non-intersecting Jordan curves and asks models to recover the full rooted containment tree from visual input; Gemini 3.1 Pro reaches 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard.

#Vision#Reasoning#Benchmarking#Gemini

why featured

HKR-H/K/R all pass, but this is a niche arXiv benchmark rather than a major model or product release. Concrete scores justify the upper 60–71 band, not featured.

editor take

Gemini 3.1 Pro scores 19.1% on Hard; CurveBench is another clean reminder that VLM vision still fails exact topology.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

MoMo uses a scalar user preference to modulate plan conservativeness at inference time without retraining; the paper reports results across six environments, where MoMo adjusts plan safety smoothly and improves temporal and preferential consistency over state-augmentation baselines.

#Reasoning#MoMo#Research release

why featured

HKR-H/K/R pass, but this is still an arXiv methods paper. The 6-environment result and no-retraining mechanism are useful; product impact or broad replication is not shown.

editor take

MoMo tunes plan conservativeness with one scalar across six environments. Nice no-retrain knob; failure rates stay undisclosed.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

RxEval evaluates LLM medication recommendation with 1,547 multiple-choice questions covering 584 patients, 18 diagnostic categories, and 969 unique medications; across 16 LLMs, F1 ranges from 45.18 to 77.10, and the best Exact Match reaches only 46.10%.
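The gap between F1 and Exact Match is easier to read with a worked case. The sketch below assumes each question has a set of correct medications and scores a predicted set against it; that multi-select setup is an assumption, since the summary does not describe the answer format.

```python
def f1_and_exact_match(predicted, gold):
    """Set-level F1 and exact match for one multi-select question."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, predicted == gold
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, predicted == gold


# One extra drug: partial credit under F1, zero credit under exact match.
f1, exact = f1_and_exact_match({"A", "B", "C"}, {"A", "B"})
```

A model that is usually close but rarely perfect can post a high F1 and a low Exact Match at the same time, which matches the shape of the reported numbers.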

#Reasoning#Benchmarking#RxEval#Research release

why featured

HKR-K and HKR-R pass: the benchmark gives concrete numbers and targets high-risk medical use. HKR-H is weak, and a single arXiv benchmark without product impact stays in 60-71.

editor take

RxEval tests 16 models; best Exact Match is 46.10%. Medication copilots still fail on stated patient facts.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Selective Safety Steering via Value-Filtered Decoding

The paper proposes value-filtered decoding, a test-time steering method that filters tokens with a value-based safety criterion and uses one threshold hyperparameter to control an explicit bound on false-intervention probability.
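A minimal sketch of the filtering step, assuming a per-token safety value is already available: tokens below the threshold are dropped and the rest renormalized. The value scores, the fallback when every token is filtered, and the function name are assumptions; the paper's bound on false-intervention probability is not implemented here.

```python
def filter_and_renormalize(probs, values, threshold):
    """Drop next-token candidates whose value estimate is below threshold.

    probs:  token -> probability from the base model.
    values: token -> estimated safety value (hypothetical scores).
    """
    kept = {tok: p for tok, p in probs.items() if values[tok] >= threshold}
    if not kept:
        # Assumed fallback: if everything is filtered, leave the
        # distribution untouched rather than emit nothing.
        return dict(probs)
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


next_token = filter_and_renormalize(
    probs={"a": 0.5, "b": 0.3, "c": 0.2},
    values={"a": 0.9, "b": 0.2, "c": 0.8},
    threshold=0.5,
)
```

Raising the threshold intervenes more often, so a single knob trades safety coverage against mangled safe answers.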

#Safety#Alignment#Inference-opt#Research release

why featured

HKR-K/R pass: it offers a concrete inference-time safety decoding mechanism and speaks to over-refusal cost. HKR-H is weak, and the feed gives no experiment scale or model results, so this stays in the interesting research band.

editor take

Value-filtered decoding bounds false interventions with one threshold; I buy the target, since safety steering often mangles safe answers.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym provides a framework for Claw-style personal agent development, with 13.5K synthesized tasks, 200 benchmark instances, and ClawGym-Agents trained via supervised fine-tuning plus a lightweight reinforcement-learning rollout pipeline.

#Agent#Tools#Benchmarking#ClawGym

why featured

HKR-K passes with task counts, benchmark size, and training recipe; HKR-R passes because agent evaluation tooling is a live pain point. HKR-H is weak, and this is a single arXiv paper, so it stays in all.

editor take

ClawGym ships 13.5K tasks and a 200-case bench; I buy the data loop, not the “soon released” IOU.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→AIS: Adaptive Importance Sampling for Quantized RL

AIS adds three real-time diagnostics to GRPO to tune importance sampling per batch, and on LLaDA-8B-Instruct, Qwen3-8B, and Qwen3.5-9B it matches the BF16 baseline on most mathematical reasoning and planning tasks while retaining FP8 rollout speedups from 1.5x to 2.76x.

#Reasoning#Fine-tuning#Inference-opt#LLaDA

why featured

HKR-K and HKR-R pass: the paper gives 3 GRPO diagnostics and 1.5–2.76x FP8 rollout speedups, tied to post-training cost. HKR-H fails, and the method is too technical for featured.

editor take

AIS uses 3 diagnostics to tune GRPO weights; keeping 1.5-2.76x FP8 rollout speed makes this a stability patch, not mere quantization thrift.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

The paper introduces conformal aggregation for Chain-of-Thought reasoning, replacing majority voting with weighted score aggregation and a conformal abstention rule, and reports finite-sample guarantees on confident-error rate across four benchmarks, four open-source models, and three score classes; on GSM8K, it reaches 90.1% selective accuracy while abstaining on under 5% of problems, versus 82% accuracy for majority voting.
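As an illustration of the aggregation idea, the sketch below sums a score per candidate answer and abstains when the leading answer's share of total score falls below a threshold. The scores and the fixed threshold `tau` are placeholders; in the paper the abstention rule is calibrated conformally, which this sketch does not do.

```python
from collections import defaultdict


def aggregate_or_abstain(answers, scores, tau):
    """Weighted vote over sampled chains; return None to abstain.

    answers[i]: final answer of chain i.
    scores[i]:  positive confidence score of chain i (assumed given).
    tau:        minimum share of total score needed to answer (assumed).
    """
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    best = max(totals, key=totals.get)
    share = totals[best] / sum(totals.values())
    return best if share >= tau else None


confident = aggregate_or_abstain(["12", "12", "15"], [0.9, 0.8, 0.3], tau=0.6)
unsure = aggregate_or_abstain(["12", "15"], [0.5, 0.5], tau=0.6)
```

Selective accuracy is then measured only on the questions where the function returns an answer, which is how abstaining on a small fraction can lift the headline number.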

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-H and HKR-K pass: the abstention mechanism and GSM8K figure are concrete. Impact remains an arXiv methods paper without major-model, cost, or deployment evidence, so it stays in the 60–71 band.

editor take

Conformal CoT hits 90.1% selective accuracy on GSM8K; abstaining under 5% for +8.1 points is an engineering trade I buy.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Boosting LLM Reasoning via Human-Inspired Reward Shaping

The paper introduces T2T, a reward-shaping framework that encourages broader search on incorrect attempts and applies length penalties after correctness; experiments across 5 mainstream LLMs on MATH-500, AIME, and AMC report better performance than standard GRPO and recent baselines.

#Reasoning#Alignment#Benchmarking#Research release

why featured

HKR-H/K pass: the mechanism is concrete and tested on 5 models across 3 math benchmarks. HKR-R is weak because gains, code, and training cost are not disclosed, keeping it in the normal research band.

editor take

T2T expands search on failures and penalizes length after correctness; 5 LLMs beat GRPO on 3 math sets, but gains aren't disclosed.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

V2M-ZERO trains a text-to-music model on intra-modal music event curves, swaps in video event curves at inference, and reports state-of-the-art results on OES-Pub, MovieGenBench-Music, and AIST++ without paired video-music data, including 21-52% better temporal synchronization and 28% higher beat alignment on dance videos.

#Multimodal#Audio#Fine-tuning#V2M-ZERO

why featured

HKR-H and HKR-K pass: zero-pair training plus 21-52% sync gains give a concrete mechanism and number. HKR-R is narrow, limited to music-generation research, so it stays below featured.

editor take

V2M-ZERO claims 21–52% better sync with zero paired training; clever shortcut, but benchmark bias can flatter event-curve methods.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems

The paper proposes population-aware coordination interfaces that condition learned primal and dual maps on compact population summaries, reducing forecast error by 16–19% and capacity violations by 20–51% versus population-unaware baselines in a supply-chain capacity-control case study, while 20K-agent cohorts coordinate 500K-agent populations and simulator-trained primal maps reach 11.1% MAPE on real observations.

#Agent#Robotics#Benchmarking#arXiv

why featured

HKR-K is strong: the paper gives a concrete coordination mechanism and supply-chain deltas. HKR-R passes for constraint failures in multi-agent deployment, but HKR-H is weak and single-source arXiv limits the score.

editor take

Population summaries let 20K agents coordinate 500K; I buy the direction—constrained MAS needs less policy flexing, more planner interfaces.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

InfoSFT changes the SFT objective with medium-confidence token weighting and reports better generalization than vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks, while preserving prior capabilities; the abstract describes a one-line token-wise loss modification but does not disclose exact scores in the RSS snippet.

#Fine-tuning#Reasoning#Code#InfoSFT

why featured

HKR-K/R pass: InfoSFT offers a concrete SFT loss mechanism and claims gains on math, code, and CoT. No effect sizes, author authority, or reproducibility details are disclosed, so it stays in the interesting-research band.

editor take

InfoSFT changes one token-loss line; RSS gives no scores. I buy the direction, not the free-lunch framing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Diagnosing Training-Inference Mismatch in LLM Reinforcement Learning

The paper introduces VeXact to isolate training-inference mismatch in LLM reinforcement learning, where rollout generation and policy optimization assign different token probabilities under identical weights. The authors report that small token-level numerical disagreements can independently cause training collapse, alter the effective optimization problem, and require systems-level remedies rather than being treated as benign numerical noise.
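One well-known way identical weights can yield different numbers is floating-point reduction order. The toy below is not VeXact; it only shows that summing the same products in a different order changes a logit, which is the kind of token-level disagreement the summary describes.

```python
def dot(xs, ws, reverse=False):
    """Dot product accumulated left-to-right or right-to-left."""
    pairs = list(zip(xs, ws))
    if reverse:
        pairs.reverse()
    total = 0.0
    for x, w in pairs:
        total += x * w
    return total


activations = [0.1, 0.2, 0.3]
weights = [1.0, 1.0, 1.0]

# Same inputs, same weights, different accumulation order.
logit_rollout = dot(activations, weights)
logit_trainer = dot(activations, weights, reverse=True)
mismatch = logit_rollout - logit_trainer
```

A rollout engine and a trainer that fuse or tile kernels differently can diverge in this way on every token, so the importance ratio between them is not exactly 1 even before any policy update.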

#Alignment#Inference-opt#Research release

why featured

HKR-H/K/R all pass, but the item is an arXiv paper with abstract-level facts only; no code, scale, or external replication is disclosed. It stays in the upper all band, below featured.

editor take

VeXact reproduces token-probability drift under identical weights. Stop blaming the reward first; inference-stack numerics need acceptance tests.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs

UQ4CT calibrates confidence in the functional space induced by prompt-dependent mixtures of LoRA experts, and the paper reports over 25% lower Expected Calibration Error across four multiple-choice benchmarks and two open-ended generative QA tasks while preserving high accuracy under distribution shift.
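For readers unfamiliar with the metric, Expected Calibration Error bins predictions by confidence and averages the gap between accuracy and mean confidence per bin, weighted by bin size. A minimal sketch with equal-width bins follows; the bin count of 10 is a common default assumed here, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Four answers at 90% confidence, three of them right: overconfident by 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A 25% relative drop would take this toy value from 0.15 to roughly 0.11.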

#Fine-tuning#Alignment#Benchmarking#Research release

why featured

HKR-K/R pass: the paper gives a prompt-dependent LoRA expert-mixture mechanism and a >25% ECE drop, touching fine-tuned LLM deployment risk. HKR-H is weak because the title is technical and lacks a product hook.

editor take

UQ4CT cuts ECE by over 25% on 6 tasks; useful for LoRA calibration, but the generalization bill stays unpaid.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

GIFT combines GRPO-style group sampling, DPO-style implicit rewards, and UNA-style MSE to replace GRPO’s externally tuned beta with prompt-adaptive beta(x), and reports faster convergence than GRPO, DAPO, and GSPO on 7B-32B backbones.

#Fine-tuning#Reasoning#Alignment#GIFT

why featured

HKR-K and HKR-R pass: the mechanism and 7B-32B comparisons add signal, and GRPO alternatives matter to post-training teams. HKR-H fails because the title is jargon-heavy, so this stays below featured.

editor take

GIFT reports faster convergence on 7B-32B than GRPO, DAPO, and GSPO; endogenous beta(x) attacks a real RLVR tuning tax.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

DASE uses adaptive stopping for iterative LLM ensembles, committing on consensus and falling back to global frequency under fragmented evidence; on GPQA-Extended with N=546 and a 70B ensemble, its commit-type partition produced an 81.1% right-wall accuracy versus 41.5% left-wall accuracy, a 39.5 percentage-point routing gap.
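The commit-or-fallback logic can be sketched as a simple loop: accumulate votes round by round, commit as soon as one answer holds a large enough share, and otherwise fall back to the most frequent answer once the budget runs out. The threshold, the minimum vote count, and the labels "consensus" and "fallback" are assumptions for illustration, not DASE's actual stopping rule.

```python
from collections import Counter


def adaptive_consensus(rounds, commit_share=0.8, min_votes=4):
    """Return (answer, commit_type, rounds_used) for a non-empty list of rounds.

    rounds: list of per-round vote lists from the ensemble.
    """
    counts = Counter()
    for used, votes in enumerate(rounds, start=1):
        counts.update(votes)
        answer, top = counts.most_common(1)[0]
        total = sum(counts.values())
        if total >= min_votes and top / total >= commit_share:
            return answer, "consensus", used  # stop early, budget saved
    # Evidence stayed fragmented: use global frequency over all rounds.
    answer, _ = counts.most_common(1)[0]
    return answer, "fallback", len(rounds)


early = adaptive_consensus([["A", "A", "B"], ["A", "A", "A"]])
late = adaptive_consensus([["A", "B", "C"], ["B", "C", "A"], ["B", "B", "D"]])
```

The commit type is itself a signal: answers reached by consensus can be routed differently from answers reached by fallback.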

#Reasoning#Inference-opt#Benchmarking#Research release

why featured

HKR-K/R pass: the paper has a concrete DASE mechanism and GPQA-Extended numbers, plus relevance to ensemble inference budgets. HKR-H is weak, and the feed lacks code, cost savings, or reproduction details, so it stays in 60–71.

editor take

DASE shows a 39.5pp routing gap on GPQA-Extended; I buy adaptive stopping, not more deliberation by default.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

The paper introduces M²RNN, a nonlinear recurrent architecture with matrix-valued hidden states for language modeling; in a 7B MoE hybrid model, Hybrid M²RNN beats equivalent Gated DeltaNet hybrids by 0.4–0.5 perplexity points while using 3× smaller recurrent-layer states.

#Reasoning#Memory#Benchmarking#M²RNN

why featured

HKR-K is solid: the paper gives comparable perplexity and state-size claims. HKR-R is moderate for model-cost debates, but HKR-H is weak and this is a single arXiv architecture paper, so it stays below featured.

editor take

M²RNN cuts 0.4–0.5 PPL at 7B MoE with 3× smaller state; nonlinear RNNs just beat linear-attention hybrids.


HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Finding Interpretable Prompt-Specific Circuits in Language Models

The paper introduces ACC++, a circuit-tracing method that extracts attention-causal communication circuits from a single forward pass, without replacement models or patching. Across multiple models and a four-language IOI case study, ACC++ finds many low-dimensional signals with short natural-language descriptions, prompt-specific IOI circuit clusters, reused components across languages, and often language-specific signals.

#Interpretability#Reasoning#arXiv#Research release

why featured

HKR-H/K pass: ACC++ offers a concrete one-forward-pass circuit method and four-language IOI tests. HKR-R is weak, and this is a specialist research release rather than a product or lab-scale milestone.

editor take

ACC++ traces attention-causal circuits in one forward pass; the four-language IOI split between reused heads and language-specific signals is the hook.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→An Interpretable Latency Model for Speculative Decoding in LLM Serving

The paper proposes an interpretable latency model for speculative decoding in LLM serving. It infers effective batch size from request rate via Little’s Law, decomposes prefill, drafting, and verification demand, and validates the model with vLLM measurements across verifier and drafter sizes, sequence lengths, request rates, draft lengths, and acceptance probabilities.
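Little's Law says the mean number of requests in a system equals arrival rate times mean time in system (L = λW). Since latency itself grows with batch size, the effective batch is a fixed point. The sketch below solves it for a made-up linear latency model; the function name and the linear form are assumptions, not the paper's decomposition into prefill, drafting, and verification.

```python
def effective_batch_size(request_rate, base_latency, per_request_latency, iters=200):
    """Solve B = rate * (base + slope * B) by fixed-point iteration.

    request_rate:        requests per second (lambda).
    base_latency:        seconds a request takes at batch size ~0 (assumed).
    per_request_latency: extra seconds per concurrent request (assumed).
    """
    if request_rate * per_request_latency >= 1.0:
        raise ValueError("system is overloaded; no finite steady-state batch")
    batch = 1.0
    for _ in range(iters):
        latency = base_latency + per_request_latency * batch  # W as a function of B
        batch = request_rate * latency                        # Little's Law: L = lambda * W
    return batch


batch = effective_batch_size(request_rate=2.0, base_latency=1.0, per_request_latency=0.1)
```

Under this toy model the closed form is λ·base / (1 − λ·slope), so batch size climbs sharply as load approaches saturation, which is where speculative-decoding gains tend to erode.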

#Inference-opt#Benchmarking#vLLM#Research release

why featured

HKR-K/R pass: the paper gives a testable latency mechanism for speculative decoding and flags degraded gains under load. HKR-H is weak, and the LLM-serving focus keeps it in the 60–71 band.

editor take

The paper uses Little’s Law to estimate batch size and shows vLLM load erodes speculative-decoding speedups; cleaner than offline speedup charts.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines

arXiv:2605.13981 presents an end-to-end energy accounting framework for LLM distillation pipelines, logging GPU power by stage and measuring two methods: classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, with energy-quality Pareto frontiers and an open-source measurement harness.

#Fine-tuning#Benchmarking#Inference-opt#Research release

why featured

HKR-K and HKR-R pass: the accounting method and tool are useful for distillation work and touch GPU-cost pain. No quantified savings or broad deployment claim keeps it in the 60–71 research-signal band.

editor take

This paper accounts for full-pipeline GPU energy in logit distillation and synthetic-data SFT. Good: teacher-side cost belongs in the bill.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor

The paper tests seven KV-cache compression mechanisms on MATH-500 with Qwen-7B and Llama-8B DeepSeek-R1-Distill variants at budgets 64 and 128, rejects all seven, then reports that α with λ=0.5 passes Bonferroni in two of four model-budget cells without significant negative cells.

#Reasoning#Inference-opt#Benchmarking#Qwen

why featured

HKR-K is strong: models, budgets, MATH-500, and Bonferroni-tested negative results are concrete. HKR-R is moderate on inference cost, but HKR-H is weak and the arXiv paper is narrow, so it stays in 60-71.

editor take

Seven KV compressors failed on MATH-500 small budgets; α wins 2/4 cells, so trust the protocol before the method.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

OPT-ENGINE introduces a controllable-complexity benchmark covering 10 canonical operations research problems; its experiments show pure-text reasoning loses robustness as complexity increases, external tools fix local arithmetic only, and solver-integrated reasoning is mainly bottlenecked by automated constraint formulation.

#Reasoning#Tools#Benchmarking#OPT-ENGINE

why featured

HKR-H/K pass: the paper brings a new benchmark, 10 problem classes, and robustness findings under complexity scaling. The OR-modeling focus is niche and PTR/SIR are not unpacked, so it stays in 60–71.

editor take

OPT-ENGINE spans 10 OR tasks. I don’t buy pure CoT for optimization; constraint formulation is the wall for solver-integrated reasoning (SIR).

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

MPU addresses dual non-disclosure constraints in LLM machine unlearning with perturbed model copies and update aggregation; experiments on seven unlearning algorithms show most algorithms keep average degradation below 1% under noise up to 10%.

#Fine-tuning#Safety#MPU#Research release

why featured

HKR-K and HKR-R pass: the paper gives a concrete mechanism and experiment numbers, tied to privacy deletion and model safety. HKR-H is weak, and this is still a single arXiv paper with no disclosed artifact or adoption.

editor take

MPU holds under 10% noise across seven unlearning algorithms; dual non-disclosure feels closer to deployment than another forgetting metric.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

MALLVI uses a multi-agent LLM/VLM closed-loop framework for robotic manipulation, taking a natural-language instruction and an environment image to generate atomic robot actions, while a Reflector agent performs targeted error recovery by reactivating only relevant agents instead of triggering full replanning.

#Agent#Robotics#Vision#MALLVI

why featured

Single arXiv robotics-agent framework with a concrete mechanism, but no disclosed metrics, task suite, or reproducibility details. HKR-K/R pass, HKR-H is weak, so it stays in all.

editor take

MALLVI discloses the loop, not success-rate numbers; targeted agent restarts smell like a practical robotics patch.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

The paper presents a dynamic abstention framework for LLM reasoning, terminating low-value chain-of-thought traces at each token position and using an abstention reward parameter to trade off compute against information.

#Reasoning#Inference-opt#Safety#Research release

why featured

HKR-H/K/R pass, but the body gives only the framework mechanism, with no experiment numbers, model scope, or artifact. As a single arXiv research item, it stays in the 60–71 band.

editor take

The paper gives token-level abstention, but no metrics in the snippet; CoT compute control needs value functions, not post-hoc confidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Researchers introduce higher-order linear attention mechanism reducing computational complexity

The paper introduces Higher-order Linear Attention, where the second-order case keeps a constant-size streaming state, computes each token in linear time, and avoids materializing any n×n attention matrix.
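For context, this is what a constant-size streaming state looks like in ordinary first-order linear attention: a running sum of key-value outer products plus a running key sum. It is the standard baseline, not the paper's second-order variant, and the elu+1 feature map is an assumed choice.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (assumed; common in linear attention)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class LinearAttentionState:
    """Causal first-order linear attention with O(d_k * d_v) memory."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) v^T
        self.z = [0.0] * d_k                        # sum of phi(k)

    def step(self, q, k, v):
        fk = phi(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        # Output is phi(q)^T S / phi(q)^T z; no n-by-n matrix is ever built.
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(len(v))]
```

Each `step` costs the same regardless of how many tokens came before, which is the property the higher-order version claims to retain.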

#Reasoning#Inference-opt#Research release

why featured

HKR-H and HKR-K pass: the mechanism is concrete and relevant to long-context inference efficiency. HKR-R is weak because no benchmark, code, or model-scale test is disclosed, so this stays in the 60–71 research band.

editor take

HLA claims constant-state second-order streaming per token; no benchmarks disclosed, so don’t confuse algebraic elegance with long-context wins.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

The study mines 5,382,249 contiguous Gherkin slices from 339 repositories and 276 upstream owners, collapsing them into 692,020 recurring patterns; its XGBoost classifier reaches 0.891 out-of-fold F1 under 5-fold cross-validation, beating a tuned rule baseline at 0.836 and the better open-weight LLM judge at 0.728.

#Code#Benchmarking#Sentence-BERT#XGBoost

why featured

HKR-H/K/R pass via the classic-ML-beats-LLM angle and concrete F1 data. The BDD test-suite refactoring niche limits reach, so this stays in all rather than featured.

editor take

XGBoost hits 0.891 F1 on a 200-slice labeled pool; LLM Judge gets 0.728. Small labels wobble, but don't worship LLM judges for code hygiene.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→NeuroAtlas Benchmarks Foundation Models for Clinical EEG and Brain-Computer Interfaces

NeuroAtlas evaluates foundation models on 42 EEG datasets and 260k hours across epilepsy, sleep medicine, brain age estimation, and brain-computer interfaces; the paper reports that EEG-specific FMs do not consistently beat generic time-series FMs, standard ML metrics miss clinical utility, and current models still lack an out-of-the-box unified EEG capability.

#Benchmarking#NeuroAtlas#Research release#Benchmark

why featured

HKR-H and HKR-K pass: the large EEG benchmark gives a counterintuitive result with concrete scale and comparisons. The clinical EEG/BCI focus narrows audience fit, so it stays below featured.

editor take

NeuroAtlas tests 42 datasets and 260k EEG hours; EEG-specific FMs still fail to reliably beat generic time-series FMs.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

ScaLoRA accumulates high-rank updates from consecutive low-rank increments and analytically scales LoRA columns; tests on LLMs up to 12 billion parameters report consistent gains and faster convergence versus LoRA variants across NLU, commonsense reasoning, and math tasks.

#Fine-tuning#Inference-opt#Reasoning#ScaLoRA

why featured

HKR-K and HKR-R pass: ScaLoRA offers a testable fine-tuning mechanism and 12B-parameter evaluation context tied to LoRA efficiency. HKR-H is weak, and the summary lacks concrete benchmark numbers.

editor take

ScaLoRA tests up to 12B parameters; low-rank increments stack into high-rank updates, and LoRA tuning still has convergence debt.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

LQM-ContextRoute models same-function tool-provider routing as a contextual bandit and improves F1 by 2.18 percentage points over SW-UCB on the main web-search load benchmark.
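For reference, the SW-UCB baseline named above can be sketched as a UCB rule computed only over a recent window of observations, so stale provider quality ages out. This is a generic version with assumed window and exploration constants; it ignores context and latency, which is exactly what the paper's contextual method adds.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Generic sliding-window UCB over interchangeable tool providers."""

    def __init__(self, n_arms, window=50, c=2.0):
        self.n_arms = n_arms
        self.c = c                           # exploration constant (assumed)
        self.history = deque(maxlen=window)  # only recent (arm, reward) pairs

    def select(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm                   # try anything unseen in the window
        t = len(self.history)

        def ucb(arm):
            bonus = math.sqrt(self.c * math.log(t) / counts[arm])
            return sums[arm] / counts[arm] + bonus

        return max(range(self.n_arms), key=ucb)

    def update(self, arm, reward):
        self.history.append((arm, reward))
```

In use, the caller alternates `select()` and `update(arm, reward)`, where the reward would be some per-call quality score for the chosen provider.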

#Agent#Tools#RAG#arXiv

why featured

HKR-K and HKR-R pass: the mechanism and +2.18 F1 result are concrete, and the problem maps to production agents. HKR-H is weak, and a single arXiv paper stays below featured.

editor take

LQM-ContextRoute gains +2.18 F1 pp on web-search; I buy the setup—tool routing should price quality per service cycle.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→DMAP: A Distribution Map for Text

The paper presents DMAP, a method that maps text through a language model into unit-interval samples, and evaluates it in 3 case studies covering generation-parameter validation, machine-generated text detection, and forensic analysis of statistical fingerprints from synthetic-data post-training.

#Benchmarking#DMAP#Research release

why featured

HKR-K and HKR-R pass: DMAP offers a testable text-distribution mapping mechanism for detection and synthetic-data fingerprints. No effect sizes or released artifacts are disclosed, so it stays in the mid research band.

editor take

DMAP maps text into unit-interval samples and tests 3 cases; I buy the direction—perplexity is too blunt for text forensics.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Paper introduces dynamic latent routing method for improved low-data fine-tuning

The paper introduces Dynamic Latent Routing, a post-training method that jointly learns discrete latent codes, routing policies, and model parameters; in low-data fine-tuning across four datasets and six models, DLR matches or beats supervised fine-tuning with a mean gain of 6.6 percentage points.

#Fine-tuning#Reasoning#Tools#Research release

why featured

HKR-K is solid with 4 datasets, 6 models, and a +6.6-point gain; HKR-R fits low-data fine-tuning cost concerns. HKR-H is weak, and the single arXiv paper lacks code or production evidence, so it stays in all.

editor take

DLR beats SFT by 6.6 points across 4 datasets and 6 models; I’d wait for ablation replications before adopting it.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Synthetic Sociality: How Generative Models Privatize the Social Fabric

arXiv:2605.14090 proposes a Synthetic Sociality framework for analyzing how generative models automate “social doing” and either substitute for or mediate social relations; the abstract cites existing empirical research but does not disclose sample sizes or evaluation conditions.

#Alignment#Safety#arXiv#Silicon Valley

why featured

HKR-H/K/R pass, but this is an arXiv conceptual frame with no disclosed sample size or reproducible experiment. It belongs in the feed, below the 72 featured threshold.

editor take

arXiv 2605.14090 offers a theory, with no sample size disclosed; I’d test it against Replika-style attachment first.

HKR breakdown

hook ✓knowledge ✓resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration

OMAC defines five optimization dimensions for LLM-based multi-agent systems and uses two actors, the Semantic Initializer and the Contrastive Comparator, to optimize single dimensions and joint multi-dimension settings.

#Agent#Reasoning#Code#OMAC

why featured

HKR-K/R pass: the paper names concrete mechanisms and targets multi-agent collaboration reliability. HKR-H is weak, and the post gives no benchmark numbers, code release, or production impact, so it stays in the 60–71 research band.

editor take

OMAC names 5 MAS optimization dimensions, but the snippet gives no benchmark numbers; treat it as a framework paper, not a new agent baseline.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Multi-Scale Dequant decomposes BF16 activations into low-precision components and removes INT8-to-BF16 weight dequantization from the GEMM path; its two-pass MXFP4 decomposition reaches 6.6 effective bits, and the paper’s latency and HBM models show up to 2.5x lower KV cache HBM traffic in attention.
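The decomposition idea can be illustrated with plain two-pass residual quantization: quantize a value coarsely, then quantize what is left over at a finer scale, and keep both low-precision pieces. The sketch uses a symmetric integer grid as a stand-in; it does not model MXFP4, shared block scales, or the 6.6 effective bits figure, and all names and scales are assumptions.

```python
def quantize(x, scale, levels=7):
    """Round x to the nearest multiple of scale, clipped to +/- levels steps."""
    steps = max(-levels, min(levels, round(x / scale)))
    return steps * scale


def two_pass(x, coarse_scale, fine_scale):
    """Split x into a coarse component and a fine residual component."""
    coarse = quantize(x, coarse_scale)
    fine = quantize(x - coarse, fine_scale)  # second pass sees only the residual
    return coarse, fine


activations = [0.37, -1.42, 2.9]
one_pass_error = max(abs(x - quantize(x, 0.5)) for x in activations)
two_pass_error = max(abs(x - sum(two_pass(x, 0.5, 0.0625))) for x in activations)
```

Because each component is low precision, both can feed low-precision matrix multiplies and the partial results can be summed afterwards, which is the general reason such a split can avoid a dequantization step.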

#Inference-opt#arXiv#Ascend#Research release

why featured

HKR-K/R pass: the paper gives a testable mechanism and up to 2.5x lower HBM traffic tied to inference cost. The low-level quantization angle keeps it below featured.

editor take

MSD splits BF16 activations into low-precision parts, hitting 6.6 effective bits with two-pass MXFP4; Ascend-style dequant stalls get a serious attack.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

The paper presents DUET, a global-to-local method that optimizes LLM fine-tuning data mixtures from multiple feedback rounds on an unseen evaluation task, combining influence functions for data selection with Bayesian optimization. The abstract reports regret analysis and experiments across language tasks, but the snippet does not disclose exact benchmark scores.

#Fine-tuning #Benchmarking #Research release

why featured

HKR-K/R pass: DUET offers a concrete mechanism for fine-tuning data mixtures using unseen-task feedback, tied to cost and generalization. HKR-H is weak, and no experimental numbers are disclosed, so this stays in 60–71.

editor take

DUET tunes fine-tuning mixtures from feedback rounds; scores are undisclosed. I buy the setup: encrypted user tasks break offline data recipes.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Exemplar Partitioning for Mechanistic Interpretability

The paper introduces Exemplar Partitioning (EP), an unsupervised activation-dictionary method using about 10^3 times fewer tokens than comparable SAEs; on AxBench latent concept detection at Gemma-2-2B-it L20, EP reaches mean AUROC 0.881, 0.126 above the canonical GemmaScope SAE entry and 0.030 below SAE-A, at about 10^3 times less build compute.

#Interpretability #Benchmarking #Alignment #Gemma

why featured

HKR-H and HKR-K pass: 10^3 times fewer tokens and 0.881 AUROC provide a concrete mechanism and result. HKR-R is weak, and the mechanistic-interpretability niche keeps it in the interesting band.

editor take

EP hits 0.881 AUROC with ~10^3 times fewer tokens; if SAE-A only wins by 0.030, the compute bill looks ugly.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→A Hardware-Aware, Per-Layer Methodology for Post-Training Quantization of Large Language Models

SOP reports lower weight reconstruction error than an E4M3 FP8 8.0 bpw per-layer-POT baseline across six open model families, using an FP6 E2M3sUE4M4 6.5 bpw operating point with 1.5 bpw less storage.
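
The storage arithmetic behind those two operating points can be checked in a few lines. This is a sketch of the bits-per-weight bookkeeping only; the 7-billion-parameter count is a hypothetical illustration, not a model taken from the paper, and `weight_storage_gb` is an assumed helper name.

```python
def weight_storage_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes) at a given bits-per-weight."""
    return num_params * bits_per_weight / 8 / 1e9


fp8_bpw = 8.0  # E4M3 FP8 baseline from the summary
fp6_bpw = 6.5  # FP6 operating point from the summary
saved_bpw = fp8_bpw - fp6_bpw
relative_saving = saved_bpw / fp8_bpw

params = 7_000_000_000  # hypothetical model size, for illustration only
fp8_gb = weight_storage_gb(params, fp8_bpw)
fp6_gb = weight_storage_gb(params, fp6_bpw)
print(saved_bpw, relative_saving, fp8_gb, fp6_gb)
```

Under these assumptions the 1.5 bpw gap is an 18.75% cut in weight storage, before any metadata or activation memory is counted.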

#Inference-opt #Research release

why featured

HKR-K/R pass: the paper gives model coverage and bpw comparisons tied to inference cost. HKR-H fails because the angle is a specialist PTQ method with no product release or artifact, so it stays in the 60–71 band.

editor take

SOP beats FP8 reconstruction at 6.5 bpw across six model families; I want task scores before calling this deployable.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→TopoPrimer: The Missing Topological Context in Forecasting Models

TopoPrimer feeds the global topological structure of a series population into Chronos and TimesFM, improves forecasting accuracy across four public benchmarks, cuts ECL MSE by up to 7.3%, keeps peak seasonal degradation within 10%, and reduces cold-start MAE by 27% versus a topology-free baseline.

#Benchmarking #Fine-tuning #TopoPrimer #Chronos

why featured

HKR-H and HKR-K pass: the mechanism and metrics are concrete. It remains a single forecasting-model paper with weak broader resonance, so it stays in all rather than featured.

editor take

TopoPrimer adds topology priors to Chronos and TimesFM on 4 benchmarks; 7.3% MSE is modest, 27% cold-start MAE is the signal.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→On the Unreasonable Effectiveness of Last-layer Retraining

The paper tests why last-layer retraining improves worst-group accuracy, rejects the neural-collapse mitigation hypothesis, and attributes the gain to better group balance in the held-out set under LLR, CB-LLR, and AFR.

#Fine-tuning #Alignment #Benchmarking #Research release

why featured

HKR-H/K/R pass, but this is a single arXiv mechanism paper. The feed summary gives the LLR explanation, not model scale, datasets, or reproduction details, so it stays in all rather than featured.

editor take

LLR boosts worst-group accuracy via held-out group balance; stop using neural collapse as the catch-all robustness story.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

Predict-then-Diffuse uses AdaRLP to estimate response length before D-LLM inference, then applies a small data-driven length increase to reduce truncation reruns; experiments on multiple datasets show lower FLOPs than default D-LLM inference while preserving output quality.
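
One way to picture the "predict, then add a small data-driven increase" step is sketched below. It assumes the increase is an empirical quantile of held-out under-predictions, which is a guess at the flavor of the method rather than the paper's rule. `choose_margin`, `budget_length`, the 90% coverage target, and all numbers are hypothetical.

```python
def choose_margin(predicted, actual, coverage_percent=90):
    """Smallest extra length that would have covered the target share of held-out responses."""
    shortfalls = sorted(max(0, a - p) for p, a in zip(predicted, actual))
    # Integer ceiling of coverage_percent * n / 100, converted to a zero-based index.
    index = (coverage_percent * len(shortfalls) + 99) // 100 - 1
    return shortfalls[min(len(shortfalls) - 1, max(0, index))]


def budget_length(predicted_length, margin, max_length=1024):
    """Generation length handed to the diffusion LLM: prediction plus margin, capped."""
    return min(max_length, predicted_length + margin)


# Hypothetical held-out predictions and true response lengths (in tokens).
held_out_predicted = [120, 200, 64, 310, 90, 150, 256, 180, 75, 220]
held_out_actual = [118, 215, 70, 300, 96, 149, 270, 181, 74, 260]

margin = choose_margin(held_out_predicted, held_out_actual)
covered = sum(
    actual <= budget_length(pred, margin)
    for pred, actual in zip(held_out_predicted, held_out_actual)
)
print(margin, covered)
```

A larger margin means fewer truncation reruns but more wasted denoising positions, so the coverage target is the knob that trades the two off in this sketch.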

#Inference-opt #Research release

why featured

HKR-K and HKR-R pass via a concrete inference mechanism and cost angle. HKR-H is weak, and the post lacks FLOP deltas, model sizes, and reproducible settings, so it stays in the 60–71 all band.

editor take

Predict-then-Diffuse predicts length then pads slightly; FLOP numbers are undisclosed, but D-LLM fixed-length tax deserves its own optimizer.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→EMA: Efficient Model Adaptation for Learning-based Systems

EMA reduces adaptation costs by 14.9-42.4% across eight learning-based systems and improves system performance, including network throughput, by 6.9-31.3%, using state transformers for warm-start adaptation and utility-prioritized labeling to balance training and labeling costs.

#Fine-tuning #Inference-opt #Research release

why featured

HKR-K/R pass: the paper gives concrete cost and throughput numbers. HKR-H fails, and as a single arXiv systems paper without release, major-lab backing, or adoption signal, it stays in the 60–71 band.

editor take

EMA cuts adaptation cost 14.9-42.4% across 8 systems; for systems ML, this beats another generic fine-tuning trick.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning

GPart maps a d-dimensional trainable vector directly into the full model weight space with one isometric partition matrix, stores only d+1 values including the vector and a random seed, and reports superior or comparable results against existing PEFT methods on natural language understanding, computer vision, and mathematical reasoning tasks.
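
A toy sketch of how a seeded partition can act as an isometric map from d trainable values to a full weight update. This is one possible construction, not the paper's: it assumes each weight is assigned to exactly one of d groups, given a random sign, and scaled by one over the square root of its group size, so the columns of the implied matrix are orthonormal. `build_partition` and `expand` are hypothetical names.

```python
import math
import random


def build_partition(num_weights, d, seed):
    """Regenerate group assignments and signs from the seed alone (nothing else is stored)."""
    rng = random.Random(seed)
    order = list(range(num_weights))
    rng.shuffle(order)
    group = [0] * num_weights
    for rank, weight_index in enumerate(order):
        group[weight_index] = rank % d  # round-robin keeps every group non-empty
    signs = [rng.choice((-1.0, 1.0)) for _ in range(num_weights)]
    sizes = [0] * d
    for g in group:
        sizes[g] += 1
    return group, signs, sizes


def expand(theta, num_weights, seed):
    """Map the d-dimensional vector theta to a full-size weight update."""
    group, signs, sizes = build_partition(num_weights, len(theta), seed)
    return [
        signs[i] * theta[group[i]] / math.sqrt(sizes[group[i]])
        for i in range(num_weights)
    ]


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


theta = [0.5, -1.25, 2.0, 0.1]  # d = 4 trainable values (hypothetical)
delta = expand(theta, num_weights=1000, seed=42)
print(l2_norm(theta), l2_norm(delta))
```

In this sketch the checkpoint is just `theta` plus the seed, which matches the d+1 stored values described above, and the two printed norms agree because the map preserves length.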

#Fine-tuning #Vision #Reasoning #Research release

why featured

HKR-K and HKR-R pass: the paper gives a d+1 storage mechanism and tests across NLU, vision, and math reasoning. HKR-H is weak, and the technical PEFT framing keeps it in the 60–71 band.

editor take

GPart stores only d+1 values for PEFT; I don’t buy “removing the low-rank bottleneck” without disclosed baselines and model scale.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→SpeakerLLM Audio Language Model for Speaker Understanding and Verification Reasoning

SpeakerLLM uses a hierarchical speaker tokenizer to handle four tasks: single-utterance profiling, recording-condition understanding, utterance-pair comparison, and verification reasoning. The authors state that SpeakerLLM-Base improves profile and condition understanding over general audio-LLMs, and they plan to release the metadata-enriched supervision dataset plus target-construction code.

#Audio #Reasoning #SpeakerLLM #Research release

why featured

HKR-H/K/R pass, but this is a vertical arXiv audio paper. The post gives a mechanism and planned release, not benchmark numbers or production adoption, so it stays in the 60–71 band.

editor take

SpeakerLLM unifies 4 speaker tasks. The sharp bit is forcing verification evidence, not another opaque similarity score.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

→ open source

SCORE

H1·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Silent Neuron Theory and Plasticity Preservation for Deep Reinforcement Learning in Adaptive Video Streaming

The paper proposes ReSiN, which resets silent neurons using forward and backward propagation states, and reports up to 168% higher bitrate and 108% better QoE in an adaptive video streaming system while maintaining comparable smoothness.

#Reasoning #Alignment #arXiv #ReSiN

why featured

HKR-H/K pass: ReSiN links silent-neuron plasticity to streaming QoE, with +168% bitrate and +108% QoE. HKR-R is weak because the DRL streaming setting sits far from mainstream AI tooling, so it stays in 60–71.

editor take

ReSiN claims 168% higher bitrate; I don't buy the generalization story without disclosed baselines or network traces.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→ReMIA: A Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators

ReMIA evaluates privacy risk for tabular synthetic data generators with 2 SDG training runs and auxiliary data no larger than the original training set, while experiments across multiple datasets and SDGs report sensitivity comparable to state-of-the-art membership inference attacks.

#Safety #Benchmarking #Aindo #Research release

why featured

HKR-K/R pass: the 2-run privacy test gives a concrete, testable mechanism and touches synthetic-data compliance. HKR-H is weak, and a single arXiv paper with a technical privacy angle stays in the 60–71 band.

editor take

ReMIA needs 2 SDG training runs and nears shadow-MIA sensitivity; tabular synthetic-data privacy testing gets less ceremonial.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

The paper proposes TraFL, a trajectory-balance objective for diffusion language models that anchors a reward-tilted target distribution to a frozen reference model; across math reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting.

#Reasoning #Code #Fine-tuning #TraFL

why featured

HKR-K passes: TraFL offers a new post-training objective, constraint mechanism, and math/code benchmark claim. HKR-H and HKR-R are weak, so this fits the all tier rather than featured.

editor take

TraFL beats the base model across all math/code length settings; I care more whether trajectory locking reproduces independently.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

The study groups more than 20 time-series anomaly detection metrics into six problem-oriented dimensions and compares score distributions under genuine, random, and oracle detection scenarios; NAB and Point-Adjust show limited resistance to random-score inflation, while most event-level metrics retain stronger separability.
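
The random-score inflation is easy to reproduce with the commonly used Point-Adjust protocol, where a single hit inside a true anomaly segment credits the whole segment. The sketch below is an illustration of that protocol on synthetic data, not the paper's experimental setup; segment positions, the 2% flag rate, and the function names are assumptions.

```python
import random


def point_adjust(pred, truth):
    """Point-Adjust: one hit inside a true anomaly segment marks the whole segment detected."""
    adjusted = list(pred)
    start = None
    for i in range(len(truth) + 1):
        inside = i < len(truth) and truth[i] == 1
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if any(pred[start:i]):
                adjusted[start:i] = [1] * (i - start)
            start = None
    return adjusted


def f1_score(pred, truth):
    """Point-wise F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


# Tiny hand-checkable case: one hit inside a three-point anomaly.
toy_truth = [0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 0]
toy_adjusted = point_adjust(toy_pred, toy_truth)

# Random detector on a synthetic series with two long anomaly segments.
rng = random.Random(0)
truth = [1 if 100 <= i < 200 or 600 <= i < 750 else 0 for i in range(1000)]
random_pred = [1 if rng.random() < 0.02 else 0 for _ in range(1000)]
raw_f1 = f1_score(random_pred, truth)
adjusted_f1 = f1_score(point_adjust(random_pred, truth), truth)
print(raw_f1, adjusted_f1)
```

Long anomaly segments make it likely that even a sparse random detector lands one hit per segment, which is why the adjusted score can only rise relative to the raw one.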

#Benchmarking #Research release #Benchmark

why featured

HKR-K is strong, and HKR-H comes from the random-detector score inflation finding. The scope is methodology-heavy and limited to time-series anomaly detection, so it stays in the 60–71 band.

editor take

This taxonomy sorts 20+ TSAD metrics into six dimensions; NAB and Point-Adjust inflating random detectors should embarrass old leaderboards.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Embedding Perturbation May Better Reflect Intermediate-Step Uncertainty in LLM Reasoning

The paper proposes measuring LLM intermediate-step uncertainty through sensitivity to perturbations on preceding token embeddings, and reports stronger uncertainty quantification performance than probability-based, sampling-based, and Bayesian baselines; the RSS abstract does not disclose datasets, model names, or numeric scores.

#Reasoning #Benchmarking #Research release

why featured

HKR-K and HKR-R pass: the paper offers a concrete uncertainty metric and speaks to reasoning reliability. It remains a single arXiv methods paper without disclosed adoption or strong practical result, so it stays in 60–71.

editor take

Embedding perturbation flags shaky reasoning steps; scores are undisclosed, but this smells better than token-prob confidence.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

MUON+ inserts one normalization step after polar orthogonalization without adding optimizer state; the paper reports lower training and validation perplexity than Muon across GPT and LLaMA pre-training runs from 60M to 7B parameters and token-to-parameter ratios up to about 200.

#Fine-tuning #Inference-opt #Benchmarking #Muon

why featured

HKR-K is clear: mechanism, scale, and perplexity comparison are disclosed; HKR-R is limited to pretraining teams. This is optimizer research, not a model or product launch, so it fits the 60–71 signal band.

editor take

MUON+ adds one post-polar normalization step and beats Muon in 60M–7B pretraining runs, so I’d test it on our small stack first.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

The paper introduces iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning. It uses structured prompts, semantic tokens, a Hierarchical Multi-Scope Transformer Encoder, and a Task-Conditioned Patch Decoder across six task types, including forecasting, imputation, classification, anomaly detection, and source de-mixing.

#Reasoning #Benchmarking #iAmTime #arXiv

why featured

HKR-K/R pass: the post gives a concrete mechanism and 6 task categories, with relevance to unified time-series modeling. It stays in 60–71 because no benchmark gains, code artifact, or major-lab signal are disclosed.

editor take

iAmTime spans six time-series tasks; RSS gives no benchmark numbers, so don’t crown instruction-conditioned ICL as time-series GPT yet.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Do-Undo Bench introduces an image-generation benchmark that requires models to simulate a real-world action and reverse it to the original state, using reversible actions from real scenarios; the arXiv snippet says current models struggle with reversibility but does not disclose benchmark size or scores.

#Multimodal #Vision #Reasoning #Research release

why featured

HKR-H/K pass: Do-Undo offers a fresh reversible-action test for causal understanding in image generation. HKR-R is weak, and sample size or major model results are not disclosed, so this stays in the normal research band.

editor take

Do-Undo Bench tests do-then-undo generation, but gives no size or scores; I buy the setup, not the causality claim.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

The IIQ paper proposes a 0-1000 index for measuring organizational AI integration, combining novelty-weighted time-decayed token stock, usage frequency, a recency gate, organizational leverage, task complexity, and autonomy; it frames IIQ as a deployment metric, not a direct model-capability score or causal productivity estimate.

#Benchmarking #Research release

why featured

HKR-K and HKR-R pass: IIQ proposes a 0–1000 organizational AI-impact index with six listed components. As a single arXiv framework, it lacks disclosed validation, enterprise samples, or adoption, so it stays in all.

editor take

IIQ compresses organizational AI adoption into 0–1000; I don’t buy it without disclosed token and autonomy weights.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Polaris applies experience-abstracted policy repair to a 7B model on MGSM, DROP, GPQA, and LitBench, using auditable policy patches rather than response-level correction or parameter tuning; the abstract reports consistent gains over the base policy and competitive baselines, but the post does not disclose the exact improvement numbers.

#Agent #Reasoning #Code #Polaris

why featured

HKR-K/R pass: 7B agents, experience-abstracted policy repair, and four benchmarks add signal, and small-model agents hit cost concerns. No concrete gains are disclosed, so this stays in the 60–71 research-release band.

editor take

Polaris only discloses a 7B run on four benchmarks; no gain numbers, but auditable policy patches sound less hand-wavy than self-correction.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kairos addresses temporal heterogeneity in time-series forecasting with dynamic patching, mixture-of-size encoding, and dynamic RoPE, and reports stronger zero-shot results with fewer parameters on two benchmarks, GIFT-Eval and Time-Series-Library.

#Reasoning #Benchmarking #Kairos #GIFT-Eval

why featured

HKR-K is clear and HKR-R is modest: the paper offers mechanisms and benchmarks, but it is a single arXiv time-series model item with no production replacement claim, so it stays in the 60–71 band.

editor take

Kairos uses dynamic patching on GIFT-Eval and TSL; parameter counts are undisclosed, so I buy the mechanism before the win.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods

The paper evaluates fine-tuned RoBERTa, Binoculars, text feature analysis, and Random Forest ensembles under paraphrasing attacks, finding that Binoculars-inclusive ensembles achieve the strongest results but suffer the largest performance losses during attacks.

#Safety #Benchmarking #RoBERTa #Binoculars

why featured

HKR-K and HKR-R pass: the paper gives method-level comparisons under paraphrasing attacks and touches detector trust. It remains a routine arXiv benchmark, not a major model, product, or industry-moving release.

editor take

The paper tests RoBERTa, Binoculars, feature methods, and RF ensembles; Binoculars wins clean and bleeds hardest under paraphrasing.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

The paper proposes neighbor distance minimization to learn non-basis-aligned subspaces without supervision, and tests the link between learned subspaces and circuit variables on known GPT-2 circuits and a 2B model.

#Interpretability #GPT-2 #Research release

why featured

HKR-K passes: NDM and GPT-2/2B validation are concrete. HKR-H and HKR-R are weak, and the mechanistic-interpretability topic has a high specialty bar, so it fits the 60–71 interesting band.

editor take

NDM finds subspaces in GPT-2 circuits and a 2B model; I buy the direction, but the abstract gives no scores.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→SEDGE: Structural Extrapolated Data Generation

The paper proposes SEDGE for structural extrapolated data generation, gives reliability and approximate identifiability conditions under conservative assumptions, and tests two algorithmic routes—structure-informed optimization and diffusion posterior sampling—on synthetic data and extrapolated image generation.

#Multimodal #Inference-opt #arXiv #SEDGE

why featured

HKR-K/R pass: SEDGE states reliability conditions for new-spec data and tests two paths. HKR-H misses because the title is technical, and with no result numbers, code, benchmark, or lab signal it stays in 60–71.

editor take

SEDGE formalizes extrapolated generation under conservative assumptions; don’t hype generalization without image scale or failure cases disclosed.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning

Realiz3D trains diffusion models with a domain covariate and small residual adapters to separate control signals from real or synthetic visual domains, targeting the domain gap created when image generators are fine-tuned on rendered 3D assets. The paper evaluates it on text-to-multiview generation and texturing from 3D inputs.

#Vision #Multimodal #Realiz3D #Research release

why featured

HKR-K lands: the summary gives a concrete domain-covariate and residual-adapter mechanism. HKR-H/R miss because the post lacks metrics, datasets, open source status, or production-replacement proof.

editor take

Realiz3D adds a domain covariate and small residual adapters; I buy the target, since photoreal 3D has bled on render-domain bias for years.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Nexus: An Agentic Framework for Time Series Forecasting

Nexus decomposes time-series forecasting into multi-agent stages for macro fluctuations, micro fluctuations, and available contextual signals. The paper evaluates on data dated after LLM knowledge cutoffs, spanning Zillow real estate metrics and volatile equities, and reports that Nexus matches or outperforms state-of-the-art TSFMs and strong LLM baselines.

#Agent #Reasoning #Tools #Nexus

why featured

HKR-K passes because the paper offers a testable mechanism and evaluation setup. HKR-H/R are weak: the title is academic, and the impact stays inside forecasting rather than the broader AI workflow.

editor take

Nexus splits forecasting into 3 agent roles; I don’t buy “beyond sequence modeling” until cutoff data and ablations hold up.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Croissant Baker generates validated Croissant metadata from local dataset directories through a modular handler registry. The paper evaluates it on more than 140 datasets, scaling to MIMIC-IV with 886 million rows and 374 Parquet files, and reports 97–100% agreement against producer-authored or standards-derived ground truth.

#Tools #Croissant Baker #NeurIPS #MIMIC-IV

why featured

HKR-K is solid: 140+ datasets and MIMIC-IV at 886M rows across 374 Parquet files give scale. The topic is data-governance infrastructure, with weak HKR-H and a narrower audience, so it stays in the 60–71 all band.

editor take

Croissant Baker ran on 140+ datasets; local metadata beats upload-first workflows, but 97–100% agreement needs field-level error detail.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes

TILBench evaluates more than 40 imbalanced-learning algorithms across 57 tabular datasets and runs over 200,000 controlled experiments; the study finds that no single method consistently dominates, with performance depending on dataset characteristics and computational constraints.

#Benchmarking #TILBench #arXiv #Research release

why featured

HKR-K is solid: the paper adds scale and a testable “no single method wins” claim; HKR-R applies to tabular ML practitioners. The topic is a conventional ML benchmark, not a model/product industry event, so it stays in the 60–71 band.

editor take

TILBench runs 40+ algorithms on 57 tabular datasets across 200k+ experiments; stop defaulting to SMOTE and profile data plus compute first.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

The paper evaluates a two-tier edge-cloud retinal screening cascade on 733 APTOS 2019 test images, using MobileNetV3-small for local referable-DR triage and sending 49.52% of images to cloud-based RETFoundDINOv2, reducing cloud calls by 50.48% versus a cloud-only pipeline.
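
The routing numbers can be sanity-checked with a short sketch. The image counts below are inferred from the reported percentages rather than quoted from the paper, and the confidence-threshold rule in `route` is an assumed routing policy with a hypothetical threshold, since the summary does not say how images are selected for the cloud tier.

```python
def route(edge_confidence, threshold=0.8):
    """Assumed policy: keep confident edge decisions local, escalate the rest to the cloud."""
    return "edge" if edge_confidence >= threshold else "cloud"


total_images = 733       # APTOS 2019 test images from the summary
cloud_fraction = 0.4952  # share of images sent to the cloud model

cloud_images = round(total_images * cloud_fraction)  # inferred count
edge_only_images = total_images - cloud_images
cloud_call_reduction = 1 - cloud_images / total_images

print(cloud_images, edge_only_images, round(cloud_call_reduction * 100, 2))
print(route(0.95), route(0.50))
```

Under these assumptions the reported 50.48% reduction is simply the complement of the 49.52% escalation rate, so the operating threshold is what actually sets the saving.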

#Vision #Inference-opt #APTOS #MobileNetV3-small

why featured

HKR-K and HKR-R pass: MobileNetV3-small filters on edge, RETFoundDINOv2 verifies in cloud, with clear routing numbers. The medical-imaging scope is narrow, so it stays in the 60–71 band.

editor take

The cascade cuts cloud calls 50.48% on 733 APTOS images, losing 0.0017 Kappa; the rural-care pitch hides threshold risk.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Detecting Overfitting in Neural Networks During Long-Horizon Grokking Using Random Matrix Theory

The paper proposes an overfitting detector that needs no train or test data: it randomizes each layer’s weight matrix, fits the empirical spectrum with a Marchenko-Pastur distribution, and uses Correlation Traps to mark the anti-grokking phase where train accuracy stays high while test accuracy falls.

#Interpretability #Safety #Benchmarking #Research release

why featured

HKR-H/K pass: detecting overfitting without data is a real hook, and the RMT/MP/Correlation Traps mechanism is testable. HKR-R is weak; grokking plus random matrix theory keeps this narrow, so it stays in all.

editor take

The method flags overfitting without train/test data via spectral outliers; unnamed LLM evidence makes the broad claim weak.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→A Hormone-Inspired Emotion Layer for Transformer Language Models (HELT)

The paper introduces HormoneT5, which adds six continuous hormone-like values to a Transformer via specialized attention heads, and reports over 85% per-hormone accuracy within a 0.15 tolerance threshold on its curated emotion-labeled dataset.

#Alignment #Agent #HormoneT5 #T5

why featured

HKR-H and HKR-K pass: the mechanism and metric are concrete, and the title has novelty. It remains a single arXiv paper with no disclosed open-source artifact, replication setup, or production-replacement claim, so HKR-R is weak and the item stays in 60–71.

editor take

HormoneT5 adds 6 continuous hormone values; 85% accuracy is on a curated emotion set, so the endocrine framing smells decorative.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Gonzalez and five coauthors introduce tensor similarity, a weight-based metric for tensor models that is invariant to weight-space symmetries and computed with a recursive algorithm; the 22-page paper with 8 figures says it tracks functional training dynamics such as grokking and backdoor insertion better than existing metrics.

#Interpretability #Benchmarking #ML Nissen Gonzalez #Logan Riggs Smith

why featured

HKR-H and HKR-K pass: the title has a real hook, and the paper gives a new metric plus recursive mechanism. Its math-heavy interpretability focus keeps it in the 60–71 band, not featured.

editor take

Gonzalez et al. turn network similarity into recursive algebra; tensor-model scope keeps this far from real LLM verification.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Research proposes output alignment method for 1-bit post-training quantization of large language models

The paper proposes a PTQ method for 1-bit LLMs that targets two identified failure modes: error accumulation across layers and anisotropic distortion in representation space. Its experiments report consistent gains over existing 1-bit PTQ methods while keeping calibration-based post-training quantization computationally efficient.

#Inference-opt #Research release

why featured

LLM quantization matters for inference cost, so HKR-K/R pass via the stated PTQ mechanisms and cost nerve. HKR-H is weak, and the post gives no speed, memory, model-scale, or open-source details.

editor take

This 1-bit PTQ paper targets layer error and anisotropic distortion; no model sizes or scores in the snippet, so don’t buy “consistent gains” yet.

HKR breakdown

hook — · knowledge ✓ · resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

R-DMesh uses a VAE to separate a conditional base mesh, relative motion trajectories, and a rectification jump offset, then trains on Video-RDMesh with over 500k dynamic mesh sequences to address pose mismatch between an input mesh and the first frame of a reference video.

#Multimodal #Vision #R-DMesh #Video-RDMesh

why featured

HKR-H/K pass: video-guided 3D mesh animation is a clear hook, and K comes from 500k+ sequences plus the three-part VAE decomposition. HKR-R is weak; no code, product path, or production metric is disclosed, so it stays in the 60–71 band.

editor take

R-DMesh trains on 500k dynamic meshes for pose mismatch; I buy the problem, not the abstract’s “solves” claim.

HKR breakdown

hook ✓ · knowledge ✓ · resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Causal Foundation Models with Continuous Treatments

The paper introduces a causal foundation model for continuous treatments. It trains a transformer on a synthetic causal corpus to reconstruct individual treatment-response curves from observational data, without extra training or fine-tuning on unseen tasks.

#Reasoning #Benchmarking #Research release #Benchmark

why featured

HKR-K passes via the continuous-treatment, observational-data, no-finetuning mechanism. HKR-H/R are weak, and the causal-inference arXiv framing is specialized, so this stays in 60–71.

editor take

The paper trains a transformer on synthetic causal data for zero-finetune dose-response curves; benchmarks are undisclosed, so “first” needs receipts.

HKR breakdown

hook — · knowledge ✓ · resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG · atom · EN · 04:00 · 05·15

→Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

arXiv:2605.14075 proposes measuring LLM layer relevance by the accuracy drop after removing a layer, and reports that cosine similarity often has weak or moderate correlation with actual performance degradation across tested LLMs.
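
A toy sketch of the two relevance measures side by side, under loud assumptions: the "model" is two hand-written 2-D layers, labels are defined by the full model so baseline accuracy is 1.0, and every name and data point is hypothetical. It is not the paper's protocol; it only shows that a layer whose output has cosine similarity 1.0 with its input can still cost accuracy when removed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def scale(v):
    return (2.0 * v[0], 2.0 * v[1])  # direction unchanged, so cosine(in, out) == 1


def shift(v):
    return (v[0] - 1.0, v[1])  # moves the decision boundary


LAYERS = [("scale", scale), ("shift", shift)]


def predict(point, layers):
    for _, layer in layers:
        point = layer(point)
    return 1 if point[0] > 0 else 0


def accuracy(points, labels, layers):
    hits = sum(predict(p, layers) == y for p, y in zip(points, labels))
    return hits / len(points)


def layer_report(points, labels, layers):
    """Per layer: mean input/output cosine similarity and accuracy drop after removal."""
    base = accuracy(points, labels, layers)
    report, hidden = {}, list(points)
    for index, (name, layer) in enumerate(layers):
        outputs = [layer(h) for h in hidden]
        mean_cos = sum(cosine(h, o) for h, o in zip(hidden, outputs)) / len(hidden)
        ablated = layers[:index] + layers[index + 1:]
        drop = base - accuracy(points, labels, ablated)
        report[name] = {"mean_cosine": mean_cos, "accuracy_drop": drop}
        hidden = outputs
    return report


points = [(-2.0, 1.0), (-1.0, 0.5), (0.25, 1.0), (0.75, -1.0), (1.5, 0.5), (3.0, -2.0)]
labels = [predict(p, LAYERS) for p in points]  # full model defines the toy task
report = layer_report(points, labels, LAYERS)
print(report)
```

Here a cosine-based ranking would call the `scale` layer redundant, while the ablation shows it matters as much as `shift` on this toy task.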

#Interpretability #Benchmarking #Inference-opt #Research release

why featured

HKR-K passes: the paper offers a testable layer-removal accuracy-drop metric and challenges cosine similarity as a proxy. HKR-H/R are weak, and the arXiv summary alone keeps it in all, below featured.

editor take

arXiv 2605.14075 ranks layers by accuracy drop after deletion; I buy the direction, but models and tasks aren't disclosed here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→PRAETORIAN: GNN Backdoor Defense Using Trigger Internal and External Characteristics

PRAETORIAN reduces average GNN backdoor attack success rate to 0.55% with a 0.62% clean-accuracy drop; under the same conditions, state-of-the-art defenses still leave average ASR above 20% and clean-accuracy loss above 3%.

#Safety#Benchmarking#PRAETORIAN#arXiv

why featured

HKR-K is strong: 0.55% ASR, 0.62% clean-accuracy loss, and SOTA >20% give a testable comparison. HKR-H is narrow, and HKR-R is weak because GNN backdoor defense lacks product or frontier-model impact.

editor take

PRAETORIAN cuts GNN backdoor ASR to 0.55%; I buy the mechanism forcing attackers into >80% ASR with >10% CA loss.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

Kunil Lee and coauthors evaluate six vector-merging variants for multilingual knowledge editing across two backbone LLMs, two editing methods, and 12 languages on MzsRE. Vector summation with shared covariance is the most reliable overall strategy, simple summation performs poorly, and TSVM improves some settings but shows limited mitigation of multilingual interference.

#Fine-tuning#Benchmarking#Kunil Lee#Ki-Young Shin

why featured

HKR-K passes: the paper gives a concrete multilingual knowledge-editing test matrix and result. HKR-H and HKR-R are weak, so this is useful niche research, not a featured item.

editor take

Lee et al. test 6 merging methods across 2 LLMs and 12 languages; shared-covariance summation wins, TSVM barely tames interference.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering

TabClustPFN clusters unseen tabular datasets in one forward pass while inferring both cluster assignments and cluster cardinality, and the paper says its code is available on GitHub.

#Reasoning#TabClustPFN#GitHub#Research release

why featured

HKR-H and HKR-K pass: the paper offers a concrete one-forward-pass clustering mechanism and open code. HKR-R fails because niche tabular clustering lacks a strong LLM/agent practitioner nerve, so it stays in the 60–71 all band.

editor take

TabClustPFN infers cluster count and assignments in one pass; scale is undisclosed, so the real test is messy tabular benchmarks.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Vendor-Conditioned Contrastive Learning for Predicting Organizational Cyber Threat Targets

The paper proposes TRACE, a CySecBERT-based vendor-conditioned contrastive learning framework, to predict seven organizational cyber-threat target categories using 129,126 samples from 352,866 posts across nine exploit databases and hacker forums, and reports 97.00% macro F1 under temporal out-of-distribution evaluation.

#Embedding#Fine-tuning#Benchmarking#CySecBERT

why featured

HKR-K passes with a named method, sample count, and temporal OOD F1; HKR-H and HKR-R are weak. The cybersecurity-targeting niche keeps it in the lower interesting band, so tier is all.

editor take

TRACE reports 97.00% macro F1 under temporal OOD; I’d audit label leakage before celebrating vendor-conditioned contrastive learning.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

The paper proposes L2R for MoE routing, assigning experts in a shared low-rank latent space and using Saturated Inner-Product Scoring to control Lipschitz behavior; experiments on an OLMoE-based language MoE model and an ImageNet vision MoE setting report improved routing geometry, expert discrimination, and overall performance, while the code is not yet released.

#Inference-opt#Benchmarking#OLMoE#ImageNet

why featured

HKR-K is present via a concrete routing mechanism, and HKR-R ties to MoE cost and stability. HKR-H is weak, and the post gives no result numbers, so this stays in all below featured.

editor take

L2R tests low-rank routing on OLMoE and ImageNet; code is unreleased, so the SIPS stability claim stays provisional.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→AMiD: Knowledge Distillation for LLMs with α-mixture Assistant Distribution

AMiD proposes an α-mixture assistant distribution for LLM knowledge distillation, makes α a tunable distribution design variable, generalizes the related divergence family, and releases code for arXiv:2510.15982v3 at the project repository.

#Fine-tuning#Inference-opt#KAIST#Research release

why featured

HKR-K passes with a concrete distillation mechanism and code. HKR-H/R are weak because benchmarks, model scale, and inference gains are not disclosed, so this stays in all.

editor take

AMiD makes KD’s α tunable and ships code; the snippet gives no benchmark numbers, so I don’t buy “superior” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Language-Induced Priors for Domain Adaptation

The paper proposes Language-Induced Prior, which turns textual target-domain descriptions into a choice model and integrates it with EM, validating the framework on three tasks: Gaussian estimation, C-MAPSS, and MuJoCo hopper.

#Reasoning#arXiv#Research release

why featured

HKR-K passes: the method has a concrete mechanism and tests on Gaussian, C-MAPSS, and MuJoCo hopper. HKR-H/R are weak, so this stays in the 60–71 academic-research band.

editor take

LIP plugs target-domain text into EM, tested on 3 tasks; I buy the cold-start need, not the “correct LLM prior” premise.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation

The study builds a six-state U.S. dataset with 9 million accident records and 1 million high-resolution satellite images, then shows multimodal embeddings reach 90.1% average AUROC, a 3.7% gain over graph-only GNN models.

#Multimodal#Vision#Embedding#arXiv

why featured

HKR-K passes with concrete dataset scale and AUROC gains. HKR-H/R are weak because the paper is a niche traffic-prediction application with no model, product, or tooling impact for AI practitioners.

editor take

Six-state data hits 90.1% AUROC; I trust the prediction lift more than the matched 24% precipitation effect.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Towards Fine-Grained and Verifiable Concept Bottleneck Models

The paper proposes a fine-grained CBM framework that grounds each concept in localized visual evidence; experiments use medical imaging benchmarks, but the RSS snippet does not disclose the number of datasets or specific performance metrics.

#Vision#Interpretability#Research release

why featured

HKR-K and HKR-R pass: the mechanism is concrete and medical-AI verifiability has practitioner pull. Missing dataset counts and performance numbers keep it in the 60–71 band.

editor take

FG-CBM grounds concepts in local evidence; RSS gives no dataset count or metrics, so I don’t buy the clinical-readiness leap.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

XFP achieves 138 tok/s single-stream decode on Qwen3.5-122B-A10B in V2 mode on RTX PRO 6000 Blackwell with TP=2, and reports 94.49% GSM8K strict-match across 3 seeds and 3,957 problems.

#Inference-opt#Benchmarking#Qwen#arXiv

why featured

HKR-K/R pass: XFP reports decode throughput for a 122B model and GSM8K strict-match accuracy, with clear serving-cost relevance. HKR-H fails because the angle is dense quantization detail for a narrow infra audience.

editor take

XFP hits 138 tok/s on Qwen3.5-122B; the auto codebook path is neat, but 397B evidence is single-seed GSM8K.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning

The paper shows that, on Needleman-Wunsch matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, while larger post-threshold datasets still generalize but require more gradient updates.
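
For readers unfamiliar with the task, here is a minimal sketch of the Needleman-Wunsch score matrix that the models are asked to generate, plus an exact-match check. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function names are assumptions for illustration; the summary does not state the paper's scoring scheme or sequence alphabet.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch global-alignment score matrix for a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score


def exact_match(predicted, target):
    """True only if every cell of the predicted matrix is correct."""
    return predicted == target


target = nw_matrix("AC", "AGC")
for row in target:
    print(row)
# [0, -1, -2, -3]
# [-1, 1, 0, -1]
# [-2, 0, 0, 1]
```

Exact match is unforgiving: a single wrong cell fails the whole matrix, which is one reason it is a sharp signal for whether a model has learned the recurrence rather than memorized examples.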

#Reasoning#Benchmarking#Research release

why featured

HKR-H/K pass: the title has a paradox hook and the paper gives a concrete data-scale result. HKR-R is weak because the arXiv study is narrow and distant from product, cost, or safety stakes.

editor take

Small Transformers hit NW exact-match fastest at mid-scale data; assuming more data always means faster convergence looks lazy here.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Interestingness as an Inductive Heuristic for Future Compression Progress

The paper formalizes interestingness as an inductive heuristic for future compression progress, proves expected progress changes exponentially with the recency of the last observed breakthrough, and reports experimental confirmation across three universal computational paradigms.

#Reasoning#Benchmarking#Research release

why featured

HKR-H/K pass, but the item is an arXiv theory-paper abstract with limited reproducible detail and no product or agent link. This fits the 60–71 research-interest band.

editor take

This pins interestingness to compression progress across 3 paradigms; the gap to agent task selection is still engineering-sized.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

The paper introduces counterfactual time series forecasting with textual conditions, adds an evaluation framework covering factual and counterfactual settings without ground-truth future series, and proposes a text-attribution mechanism that separates mutable from immutable factors to improve forecasts under stochastic textual conditions.

#Benchmarking#arXiv#SeqML#Research release

why featured

HKR-H and HKR-K pass: the counterfactual setup is clickable, and the post names a new task, evaluation setup, and attribution mechanism. HKR-R is weak; as a single arXiv paper, it fits all, below featured.

editor take

arXiv 2605.14422 adds text-conditioned counterfactual forecasting; I don't buy no-ground-truth evaluation until TADiff shows its guardrails.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Vision-LLMs for Spatiotemporal Traffic Forecasting

The paper proposes ST-Vision-LLM for spatiotemporal mobile traffic forecasting, feeding historical global traffic matrices as image sequences into a Vision-LLM, using single-token floating-point encoding, two-stage numerical alignment, and GRPO, and reporting a 15.6% gain in long-term prediction accuracy.

#Vision#Multimodal#Fine-tuning#Research release

why featured

HKR-K passes via concrete mechanisms and a 15.6% accuracy gain. HKR-H/R are weak: this is domain traffic-forecasting research, not a general agent, product, or foundation-model competition story.

editor take

ST-Vision-LLM reports a 15.6% long-horizon accuracy gain. Treating traffic grids as images beats cramming time series into text.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

The paper formulates diffusion sequence generation as a finite-horizon MDP and derives an exact unbiased policy gradient over denoising steps, then uses entropy-guided step selection and one-step denoising rewards to estimate advantages without explicit sequence likelihoods or costly multi-step rollouts.

#Reasoning#Code#Fine-tuning#Research release

why featured

HKR-K passes because the mechanism is concrete for diffusion-LLM training watchers. HKR-H/R are weak, and the post discloses no result numbers, code artifact, or production impact, so it stays a normal research update.

editor take

DLM-RL gets an unbiased stepwise gradient; SOTA numbers aren’t in the snippet, so I’d inspect repo cost first.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→AudioMosaic: Contrastive Masked Audio Representation Learning

AudioMosaic constructs positive pairs with structured time-frequency masking on spectrogram patches, reduces memory usage for large-batch contrastive pre-training, and reaches state-of-the-art results on several standard audio benchmarks under linear probing and fine-tuning.

#Audio#Embedding#Benchmarking#AudioMosaic

why featured

HKR-K passes on a concrete training mechanism and benchmark claim; HKR-H and HKR-R are weak because the angle is academic and narrow. This is useful research signal, not featured-level industry news.

editor take

AudioMosaic uses structured time-frequency masks for positives; memory savings lack numbers, so hold the SOTA claim lightly.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification

MetaMoE unifies independently trained domain experts with public proxy data, uses diversity-aware proxy selection for router supervision, and outperforms recent privacy-preserving MoE unification methods on computer vision and NLP benchmarks.

#Fine-tuning#Alignment#Benchmarking#MetaMoE

why featured

HKR-K passes with a concrete mechanism and benchmark claim. HKR-H/R are weak: the angle is specialist, and the post lacks numbers, code, or production impact, so it stays in the lower research-news band.

editor take

MetaMoE trains routers with public proxy data; gains are undisclosed. Privacy MoE will hinge on proxy contamination, not expert count.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→MoZoo: Unleashing Video Diffusion Power in Animal Fur and Muscle Simulation

MoZoo synthesizes animal videos from coarse meshes under multimodal guidance, using RAR-RoPE, Asymmetric Decoupled Attention, and MoZooBench with 120 mesh-video pairs to evaluate fur simulation across animal skeletons and layouts.

#Multimodal#Vision#Benchmarking#MoZoo

why featured

HKR-H and HKR-K pass: the angle is novel and the post gives mechanisms plus MoZooBench size. HKR-R is weak because this is graphics-heavy arXiv research with limited near-term industry pull.

editor take

MoZooBench has only 120 mesh-video pairs; fur dynamics are hard, but that scale cannot carry the “cinematic-quality” claim.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→CA2: Code-Aware Agent for Automated Game Testing

CA2 trains a game-testing agent with function call traces and game state, then evaluates it in two instrumented environment types: state-based and image-based.

#Agent#Code#Valliappan Chidambaram Adaikkappan#Vincent Martineau

why featured

HKR-K passes because CA2 adds a concrete mechanism: call stacks plus game state for a testing agent. HKR-H/R are weak, and the excerpt gives no metrics, code, or production-replacement claim, so this stays niche research.

editor take

CA2 feeds call stacks to a testing agent across 2 environment types; I buy the direction, not the vague “consistent improvement.”

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review

The systematic review selected 134 studies from 2,067 papers published over ten years and identifies gaps in synthetic health tabular data evaluation, including no consensus on methods, inconsistent metric use, limited domain expert involvement, incomplete dataset reporting, and limited reproducibility.

#Benchmarking#arXiv#Research release

why featured

HKR-K is solid: 134 reviewed studies produce concrete evaluation gaps. HKR-R is niche to synthetic health tabular data, with no product, model, or open-source artifact, so this stays in all.

editor take

The review keeps 134 studies; synthetic health tabular evaluation is still metric soup, with clinicians and reproducibility missing.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→BioHuman: Learning Biomechanical Human Representations from Video

BioHuman introduces BioHuman10M, a dataset with synchronized video, motion, and muscle activations, and trains an end-to-end model that takes monocular video to jointly predict human motion and muscle activations.

#Vision#Multimodal#Benchmarking#BioHuman

why featured

HKR-H/K pass: the hook extends video human modeling to muscle activation, with BioHuman10M’s synced data modalities. HKR-R is weak; no product, open-source, or robotics deployment detail is disclosed, so it stays low-tier all.

editor take

BioHuman10M syncs video, motion, and muscle activation at 10M scale; activation is simulation-derived, so rehab claims need restraint.

HKR breakdown

hook ✓knowledge ✓resonance —

→ open source

SCORE

H1·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Temporal Fair Division in Multi-Agent Systems: From Precise Alternation Metrics to Scalable Coordination Proxies

The paper introduces Rotational Periodicity and ALT temporal fairness metrics for repeated multi-agent resource competition, evaluates MBoE with 2, 3, 5, 8, and 10 agents, and reports that RP runs 12–25x faster than ALT while exposing Q-learning coordination failures that reward fairness misses.

#Agent#Benchmarking#Research release#Benchmark

why featured

HKR-K passes: new metrics, agent counts, and a 12–25x speed result. HKR-H/R are weak; this is a niche arXiv methods paper without product or major agent-framework impact, so it stays in the 40–59 band.

editor take

RP runs 12–25x faster than ALT on 2–10 agents; stop trusting Reward Fairness for repeated allocation agents.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

Pro-DG infers a facade hierarchy from one image and its segmentation, then uses procedural control maps in Stable Diffusion and ControlNet to perform structural edits such as floor duplication and window rearrangement.

#Vision#Multimodal#arXiv#Stable Diffusion

why featured

HKR-K passes because Pro-DG gives concrete inputs and a control mechanism; HKR-H/R are weak because the use case stays inside architectural facade generation, with no broad product or model-competition signal.

editor take

Pro-DG edits facades from one image plus segmentation. Metrics are undisclosed; the useful bit is procedural rules inside ControlNet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Communication-Efficient Federated Fine-Tuning

The paper proposes the FDA-Opt algorithm family for federated language-model fine-tuning, replacing FedOpt’s fixed exchange intervals with dynamic synchronization and outperforming FedOpt on downstream NLP experiments even when FedOpt uses hyperparameters optimized for those tasks.

#Fine-tuning#Research release

why featured

HKR-K passes on the dynamic-sync FDA-Opt mechanism, but the article gives no gain size, communication rounds, or reproducible setup. HKR-H/HKR-R are weak, so this stays a niche research signal.

editor take

FDA-Opt replaces FedOpt’s fixed exchange interval with dynamic sync; I buy the direction, but rounds and model sizes are undisclosed.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration

UniMamba integrates Mamba, FFT-Laplace Transform, TCN, and spatial-temporal attention for multivariate time-series forecasting, and the paper reports better forecasting accuracy and computational efficiency than prior models on eight public benchmark datasets.

#Reasoning#Benchmarking#UniMamba#Mamba

why featured

HKR-K passes via the concrete architecture mix and 8 public benchmarks. HKR-H/R are weak: this is a routine arXiv methods paper with no production replacement claim or open-source impact, so it stays in the upper 40–59 band.

editor take

UniMamba wins on 8 public benchmarks; without ablations or cost tables here, Mamba+attention+FFT-Laplace smells stacked.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression

RQ-MoE combines a two-level MoE with dual-stream quantization to adapt codebooks per input for high-dimensional embedding compression, and experiments report state-of-the-art or on-par reconstruction and retrieval with 6–14x faster decoding than prior vector quantization methods.
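
As background, the sketch below shows plain residual quantization with fixed codebooks, the baseline idea that RQ-MoE builds on. It is not the paper's method: there is no MoE, no dual-stream quantization, and no per-input codebook adaptation. The codebooks and function names are made up for illustration.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to vector (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], vector)),
    )


def rq_encode(vector, codebooks):
    """Quantize level by level; each level encodes what the previous ones missed."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords across levels."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical two-level codebooks: a coarse level, then a fine level.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
codes = rq_encode([4.9, 4.1], codebooks)
print(codes, rq_decode(codes, codebooks))  # [1, 1] [5.0, 4.0]
```

Decoding here is a handful of table lookups and additions, so any decoding speedup has to come from how codebooks are chosen and organized; that is the part the paper claims to improve.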

#Embedding#Inference-opt#KDEGroup#Research release

why featured

HKR-K/R pass: the paper has a concrete mechanism and 6–14x decoding claim. It remains a narrow embedding-compression paper with no major lab release, ecosystem signal, or production-replacement proof, so it stays below featured.

editor take

RQ-MoE claims 6–14x faster decoding; I’d benchmark ANN latency first, reconstruction scores don’t ship products.

HKR breakdown

hook —knowledge ✓resonance ✓

→ open source

SCORE

H0·K1·R1

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

The paper proposes five 22M-parameter Mini-JEPA models with a router LLM selecting sensors per query; dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions, with Cohen's d=1.10 and p=0.031.

#Agent#RAG#Vision#Google AlphaEarth

why featured

HKR-K passes via the small-model fleet, routing LLM, dual retrieval setup, and effect size; HKR-H/R are weak because the hydrology focus is narrow. No hard exclusion applies, so it lands in low all.

editor take

Five 22M Mini-JEPAs beat AlphaEarth-only retrieval; the router’s perfect hit rate is on curated questions, so “agentic” feels inflated.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

SurF maps event sequences to i.i.d. unit-rate exponential noise via the Time Rescaling Theorem, trains one model across heterogeneous event streams, and reports the best time RMSE on 3 of 6 real-world benchmarks: Earthquake, Retweet, and Taobao.
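
The Time Rescaling Theorem behind that mapping can be checked numerically. In the sketch below the intensity is known and hand-picked (`2 + sin t`), whereas SurF, per the summary, has to learn the mapping; the thinning simulator and all function names are assumptions for illustration only.

```python
import math
import random


def intensity(t):
    """Hand-picked conditional intensity (assumption): always between 1 and 3."""
    return 2.0 + math.sin(t)


def compensator(t):
    """Integral of intensity from 0 to t: 2t + 1 - cos(t)."""
    return 2.0 * t + 1.0 - math.cos(t)


def simulate_by_thinning(t_end, lam_max, rng):
    """Draw event times on [0, t_end] by thinning a rate-lam_max Poisson process."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(lam_max)
        if t > t_end:
            return events
        if rng.random() * lam_max <= intensity(t):
            events.append(t)


def rescale(events):
    """Map event times to compensator increments.

    Under the theorem these increments are i.i.d. unit-rate exponential.
    """
    previous, increments = 0.0, []
    for t in events:
        current = compensator(t)
        increments.append(current - previous)
        previous = current
    return increments


rng = random.Random(0)
events = simulate_by_thinning(t_end=2000.0, lam_max=3.0, rng=rng)
increments = rescale(events)
mean = sum(increments) / len(increments)
print(f"{len(events)} events, mean rescaled interval = {mean:.3f} (theory: 1.0)")
```

Because the compensator is strictly increasing, the map from event times to increments is invertible, which is what makes it usable as a generative model: sample exponential noise, then invert.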

#Reasoning#Benchmarking#SurF#Amazon

why featured

HKR-K passes via a testable mechanism and 6-benchmark result; HKR-H/R miss because this is a niche time-series modeling paper with no product or industry spillover.

editor take

SurF tops time RMSE on 3/6 benchmarks; TRT as a learnable bijection is a credible pretraining handle for async event streams.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

AaSP pre-trains audio spectrogram Transformers on AudioSet with AaPE, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization, then reports state-of-the-art fine-tuning results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while linear evaluation also shows gains on US8K and NSynth.

#Audio#Multimodal#Benchmarking#AudioSet

why featured

HKR-K passes via named mechanisms and three fine-tuning benchmarks. HKR-H/R fail because the paper is niche audio representation work with no code, effect sizes, or broader practitioner nerve.

editor take

AaSP pretrains on AudioSet and wins 3 fine-tuning benchmarks; audio SSL is finally treating patch aliasing as a first-class bug.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Causal Time Series Generation via Diffusion Models

The paper introduces CaTSG, a diffusion-based framework that uses backdoor-adjusted guidance and abduction-action-prediction to generate observational, interventional, and counterfactual time series across synthetic and real-world datasets.

#Reasoning#CaTSG#Research release

why featured

HKR-K passes for a concrete CaTSG mechanism and three generation targets. HKR-H/R are weak, and this single arXiv paper gives no production replacement or open-source impact.

editor take

CaTSG spans observational, interventional, and counterfactual series; smells like Pearl’s ladder inside diffusion sampling, with the causal graph still doing the hard work.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Deep Image Segmentation via Discriminant Feature Learning

The paper introduces DDA, an architecture-agnostic segmentation loss evaluated on DIS5K across multiple architectures, which maximizes between-class variance and minimizes within-class variance to improve segmentation accuracy, boundary sharpness, and model confidence without adding inference cost.
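
The between-class versus within-class idea can be illustrated with a Fisher-style ratio on scalar per-pixel features. This is a sketch of the general criterion only; the summary does not give DDA's actual formulation, so the ratio form, the `eps` term, and the function names are all assumptions.

```python
def class_variances(features, labels):
    """Return (between-class, within-class) variance of scalar features."""
    groups = {}
    for value, label in zip(features, labels):
        groups.setdefault(label, []).append(value)
    n = len(features)
    overall_mean = sum(features) / n
    between = 0.0
    within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        between += len(values) * (group_mean - overall_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in values)
    return between / n, within / n


def discriminant_loss(features, labels, eps=1e-8):
    """Smaller when classes are far apart and each class is tight."""
    between, within = class_variances(features, labels)
    return within / (between + eps)


labels = [0, 0, 1, 1]
separated = [0.1, 0.2, 0.8, 0.9]   # background low, foreground high
overlapping = [0.4, 0.6, 0.5, 0.5]  # class means coincide
print(discriminant_loss(separated, labels) < discriminant_loss(overlapping, labels))  # True
```

A term like this only shapes training, which is consistent with the summary's point that the loss adds no inference cost.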

#Vision#Benchmarking#Research release#Benchmark

why featured

HKR-K passes on the DDA loss mechanism and “no inference cost”; HKR-H/R are weak because the title is academic and the audience is narrow. No hard exclusion, but this is niche vision research, so it lands in the 40–59 band.

editor take

DDA improves DIS5K boundaries across architectures with zero inference cost; honestly, loss-side fixes beat another segmentation head.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection

PaAno uses short temporal patches and a 1D CNN for time-series anomaly detection, training embeddings with triplet loss plus pretext loss and evaluating on TSB-AD across univariate, multivariate, range-wise, and point-wise measures.
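
The two ingredients named there, short temporal patches and a triplet loss, can be sketched as follows. The 1D CNN encoder and the pretext loss are omitted, raw patches stand in for embeddings, and the choice of positives and negatives is left to the caller because the summary does not say how PaAno samples them. All names are hypothetical.

```python
def sliding_patches(series, length, stride=1):
    """Cut a series into overlapping fixed-length patches."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, stride)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


series = [0, 0, 0, 0, 5, 5, 5, 5]
patches = sliding_patches(series, length=2)
normal_a, normal_b, shifted = patches[0], patches[1], patches[-1]

print(triplet_loss(normal_a, normal_b, shifted))  # 0.0: similar patches already closer
print(triplet_loss(normal_a, shifted, normal_b))  # 51.0: wrong pairing is penalized
```

At detection time a method of this kind would typically score a patch by its distance to embeddings of normal patches; that step is not shown and is an assumption about the general approach, not a claim about the paper.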

#Embedding#Benchmarking#PaAno#TSB-AD

why featured

HKR-K passes because the method and benchmark setup are concrete. HKR-H/R are weak: this is a narrow time-series anomaly-detection paper with limited general AI-practitioner pull.

editor take

PaAno claims TSB-AD SOTA but gives no scores here; a 1D-CNN patch method beating heavy models needs code and tables.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Architecture-Aware Explanation Auditing for Industrial Visual Inspection

The paper tests explanation auditing on WM-811K with 9 classes and 172k wafer maps, where ViT-Tiny plus Attention Rollout records a Deletion AUC of 0.211, while Swin-Tiny, ResNet18+CBAM, and DenseNet121 plus Grad-CAM score 0.432-0.525 and RISE compresses all families to about 0.1.
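
For context, a Deletion AUC is conventionally computed by removing inputs in order of attributed importance and integrating the model's score along the way, with lower values indicating a more faithful ranking. The sketch below uses a toy logistic scorer over four features instead of a wafer-map classifier; the model, baseline value, and names are assumptions.

```python
import math


def deletion_auc(model, x, attribution, baseline=0.0):
    """Area under the score curve as features are deleted, most important first."""
    order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    current = list(x)
    scores = [model(current)]
    for i in order:
        current[i] = baseline          # "delete" the feature
        scores.append(model(current))
    steps = len(order)
    # Trapezoid rule with the x-axis as fraction of features deleted.
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(steps)) / steps


weights = [3.0, 2.0, 1.0, 0.5]


def toy_model(x):
    """Hypothetical classifier score: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(weights, x))))


x = [1.0, 1.0, 1.0, 1.0]
faithful = [w * v for w, v in zip(weights, x)]   # ranks features correctly
misleading = [-a for a in faithful]              # ranks them backwards

print(deletion_auc(toy_model, x, faithful) < deletion_auc(toy_model, x, misleading))  # True
```

The result depends on the baseline used for deleted inputs and on the attribution readout, which is the sensitivity the editor take points at.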

#Vision#Interpretability#Benchmarking#WM-811K

why featured

HKR-K passes with dataset size, class count, and Deletion AUC comparison. HKR-H/R are weak: this is a narrow industrial-vision interpretability benchmark, useful but not broad enough for featured.

editor take

ViT-Tiny+Attention Rollout scores 0.211 Deletion AUC on WM-811K; heatmap audits hinge on readout and perturbation choice.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

RoSHAP models SHAP score distributions with bootstrap resampling and kernel density estimation, then uses asymptotic Gaussianity under mild regularity conditions to reduce distribution-estimation cost while ranking features by activity, strength, and stability.

#Interpretability#Research release

why featured

HKR-K passes: RoSHAP introduces a concrete mechanism for stable feature attribution, but the post gives no experiment numbers, code, or production claim. The academic framing keeps it in the 40–59 band.

editor take

RoSHAP adds bootstrap+KDE stability to SHAP ranking; no cost numbers disclosed, so test it first on seed-sensitive feature selection.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→CAKE: Confidence in Assignments via K-partition Ensembles

CAKE evaluates per-point confidence in clustering assignments with K-partition ensembles, combining cross-run assignment stability and local geometric-fit consistency into one interpretable score in [0,1].
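
The cross-run stability half of that score can be sketched with a simple co-assignment agreement measure. This is not CAKE's formula: the geometric-fit component is omitted, and the agreement definition and function name are assumptions chosen because they are invariant to label permutations and land in [0,1].

```python
from itertools import combinations


def assignment_stability(runs):
    """Per-point stability across clustering runs, in [0, 1].

    `runs` is a list of label lists, one per run, over the same points
    (at least two runs and two points). For each point, count how often two
    runs agree on whether it shares a cluster with each other point.
    """
    n_points = len(runs[0])
    run_pairs = list(combinations(range(len(runs)), 2))
    scores = []
    for i in range(n_points):
        agree, total = 0, 0
        for a, b in run_pairs:
            for j in range(n_points):
                if j == i:
                    continue
                together_a = runs[a][i] == runs[a][j]
                together_b = runs[b][i] == runs[b][j]
                agree += together_a == together_b
                total += 1
        scores.append(agree / total)
    return scores


runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first run, labels swapped
    [0, 0, 1, 0],   # last point switches cluster
]
print(assignment_stability(runs))  # the last point gets the lowest score
```

The cost grows with the square of the number of points per run pair, so a practical implementation would sample pairs or use a sparse co-association structure.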

#Benchmarking#CAKE#Research release

why featured

HKR-K passes because the post states a testable mechanism: a [0,1] assignment-confidence score from stability and local geometry. HKR-H and HKR-R are weak; this is a narrow methods paper, not featured.

editor take

CAKE scores each clustered point in [0,1]; no code or datasets disclosed, so don't treat robustness proofs as usability.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signals

Dywave applies wavelet-based hierarchical decomposition to event-aligned dynamic tokenization for heterogeneous IoT sensing signals, and evaluations on five real-world datasets for activity recognition, stress assessment, and nearby object detection report up to 12% higher accuracy and up to 75% shorter input token lengths across mainstream sequence models.

#Inference-opt#Dywave#Research release#Benchmark

why featured

HKR-K passes on mechanism and numbers, but the story is niche IoT time-series research with little product or developer-workflow impact. No hard-exclusion rule is triggered, so it stays in the low-value research-signal band.

editor take

Dywave reports up to 12% higher accuracy and up to 75% shorter token inputs on 5 IoT datasets; sensor-swap robustness is the hard test.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→bde: A Python Package for Bayesian Deep Ensembles via MILE

bde releases a Python package for Bayesian Deep Ensembles, built on a JAX implementation of MILE sampling-based inference, with scikit-learn compatible estimators for tabular regression and classification uncertainty quantification.

#Benchmarking#bde#JAX#scikit-learn

why featured

HKR-K passes via a concrete implementation and supported tasks; HKR-H and HKR-R are weak, with no major lab or broad industry impact. This fits the upper 40–59 band as a niche research-tool release.

editor take

bde ships JAX MILE samplers with scikit-learn estimators; another tabular uncertainty tool, but no benchmarks are disclosed, so don’t buy “fast” yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

The paper introduces PIQL, a framework that adds two train-time privileged-information sources to tabular foundation models: aggregate dataset statistics and encodings of data-generating programs; the abstract says PIQL improves convergence, final loss, and generalization, but the post does not disclose concrete experimental numbers.

#Fine-tuning#Inference-opt#Reasoning#Research release

why featured

HKR-K passes because PIQL gives a testable mechanism using two classes of training-time privileged information. HKR-H/R are weak, and no concrete experiment numbers are disclosed, so this stays in the lower research-signal band.

editor take

PIQL adds two train-time privileged signals for tabular FMs, but reports no numbers here; I don’t buy the “first framework” flex without code.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

The paper constructs three last-layer coefficients to predict class-wise forgetting ranks in rehearsal-based class-incremental learning, and identifies the self-induced interference coefficient as the strongest predictor under controlled experiments.

#Fine-tuning#Interpretability#Research release

why featured

HKR-K passes because the paper names three testable coefficients for forgetting order. HKR-H/R fail: the angle is academic and niche, with no broad product, cost, safety, or competition hook; no hard exclusion triggered.

editor take

3 last-layer coefficients predict forgetting ranks; snippet lacks datasets and effect sizes, so mitigation claims wait.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→AIM Framework for Standardised Explainability Evaluation in Graph Neural Networks

The paper introduces AIM, a framework that evaluates GNN explainability with three measure groups: accuracy, instance-level explanations, and model-level explanations. It applies AIM to graph kernel networks (GKNs) and prototype networks and uses the GKN case study to derive xGKN; the abstract does not disclose benchmark scores or datasets.

#Interpretability#Benchmarking#Research release

why featured

HKR-K passes on AIM metrics and xGKN, but HKR-H/HKR-R are weak. The GNN/GKN explainability angle needs specialist graph-ML background and gives no product path, triggering hard-exclusion-technical-accessibility; capped at 39.

editor take

AIM scores GNNs across accuracy, instance explanations, and model explanations. This 19-page TMLR paper pays down XAI’s benchmark debt.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→On the Burden of Achieving Fairness in Conformal Prediction

The paper derives a conservation law and lower bound for pooled split conformal calibration, showing that cross-group quantile heterogeneity creates irreducible group-wise coverage distortion and that Equalized Coverage conflicts with Equalized Set Size under the studied policy families.

#Benchmarking#Research release

why featured

Hard-exclusion-technical-accessibility applies: conformal-prediction fairness bounds are niche statistical theory with no product, agent, or engineering path. HKR-K passes, but the cap keeps it excluded.

editor take

The paper proves a conservation law and a lower bound: pooled calibration turns group heterogeneity into coverage distortion. Fair conformal prediction has no free lunch.
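
To make the pooled-calibration point concrete, here is a minimal sketch of standard split conformal calibration applied to two groups. It is not the paper's construction: the scores, group names, and `alpha` are hypothetical, and for brevity the same scores stand in for held-out test scores.

```python
import math

def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

def coverage(scores, threshold):
    """Fraction of points whose true-label score falls under the threshold."""
    return sum(s <= threshold for s in scores) / len(scores)

# Hypothetical nonconformity scores; group B's scores are twice as spread out.
group_a = [i / 10 for i in range(1, 11)]  # 0.1 .. 1.0
group_b = [i / 5 for i in range(1, 11)]   # 0.2 .. 2.0
alpha = 0.2  # target coverage 0.8

# One pooled threshold for everyone.
q_pooled = conformal_quantile(group_a + group_b, alpha)
cov_a = coverage(group_a, q_pooled)  # 1.0: group A is over-covered
cov_b = coverage(group_b, q_pooled)  # 0.7: group B is under-covered

# Per-group thresholds equalize coverage, but the thresholds themselves diverge.
q_a = conformal_quantile(group_a, alpha)
q_b = conformal_quantile(group_b, alpha)
print(q_pooled, cov_a, cov_b, q_a, q_b)
```

In this toy setup the pooled threshold over-covers one group and under-covers the other, while per-group thresholds restore equal coverage only by giving group B a threshold twice as large, a rough proxy for larger prediction sets.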

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval

The paper introduces The Spheres dataset with over one hour of multitrack orchestral recordings by Colibrì Ensemble, captured with 23 microphones, and provides isolated stems, estimated room impulse responses, and X-UMX baselines for orchestral family separation and microphone debleeding.

#Audio#Benchmarking#Colibrì Ensemble#The Spheres

why featured

HKR-K passes with concrete dataset size, capture setup, and baseline. HKR-H and HKR-R are weak because the story is niche music source-separation research, so it stays in all.

editor take

The Spheres offers over an hour of 23-microphone orchestral multitracks; small corpus, but stems plus RIRs make it useful.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

WarmPrior replaces the standard Gaussian source distribution with a temporal prior built from recent action history, improving success rates for generative visuomotor robot control; the abstract does not disclose the number of tasks, success-rate gains, or sample sizes.

#Robotics#Inference-opt#WarmPrior#Research release

why featured

HKR-K passes for a testable mechanism in policy generation. The summary discloses no task count, success-rate gain, or sample size, and the angle is specialized robotics research, so it stays in the lower band.

editor take

WarmPrior swaps Gaussian sources for recent action history; no task counts or gains disclosed, but source distributions deserve control-stack attention.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Distributional Principal Autoencoders

The paper proposes Distributional Principal Autoencoder, which uses an encoder to adaptively choose latent dimensions and a decoder to match the conditional distribution given low-dimensional variables, with numerical results on climate data, single-cell data, and image benchmarks showing reconstruction of the original data distribution.

#Embedding#Benchmarking#Research release#Benchmark

why featured

HKR-K passes because the abstract gives a concrete mechanism and benchmark domains. HKR-H/R are weak: this is a technical representation-learning paper with no product, agent, or market hook.

editor take

DPA claims original-distribution reconstruction at any retained dimension; I don’t buy it without disclosed limits beyond climate, single-cell, and image benchmarks.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning

GFMate applies centroid and layer prompts after pre-training for Graph Foundation Models, then tunes them at test time with labeled and unlabeled target-domain data; experiments on 12 benchmark datasets report performance gains up to 30.63%, and the authors provide code on GitHub.

#Fine-tuning#Benchmarking#GFMate#Research release

why featured

HKR-K passes via 12 benchmarks and gains of up to 30.63%, but HKR-H and HKR-R miss: the graph-model prompt-tuning angle is niche and mostly academic. This fits the low-value research band, so tier is all.

editor take

GFMate reports gains of up to 30.63% on 12 graph benchmarks; the useful bit is unlabeled target-graph tuning, not another few-shot prompt wrapper.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Measuring the Stability and Plasticity of Recommender Systems

The paper proposes an offline evaluation protocol that profiles recommender models after retraining by stability and plasticity, then reports preliminary results on three algorithm types using the GoodReads dataset, while the abstract does not disclose the exact metrics, model names, or numerical scores.

#Benchmarking#GoodReads#Research release#Benchmark

why featured

HKR-K passes: the paper offers a stability/plasticity offline evaluation protocol with GoodReads tests. The topic is niche recommender-system evaluation, with no product, open-source, or foundation-model impact shown.

editor take

The paper tests 3 recommender types on GoodReads; metrics and scores are undisclosed, but retraining drift belongs in offline eval.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning

NERVE tokenizes brain functional connectivity matrices into intra- and inter-network blocks and evaluates behavior and psychopathology prediction across three developmental cohorts: ABCD, PNC, and CCNP.

#Embedding#NERVE#ABCD#PNC

why featured

Triggers hard-exclusion-4: brain connectivity prediction is traditional science plus AI, with no agent, product, or engineering implication disclosed. HKR-K passes via the tokenization mechanism, while HKR-H and HKR-R fail.

editor take

NERVE tokenizes FC as network-pair blocks; three cohorts back transfer, and image-MAE defaults look lazy here.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Breaking the Reasoning Horizon in Entity Alignment Foundation Models

Yuanning Cui and four coauthors propose an entity alignment foundation model that uses seed entity pairs as local anchors for parallel encoding; the abstract reports experiments on unseen knowledge graphs, but the post does not disclose dataset counts or performance numbers.

#Reasoning#Yuanning Cui#Zequn Sun#Wei Hu

why featured

HKR-K comes from one mechanism: seed entity pairs as local anchors for parallel encoding; the post gives no datasets, metrics, or code. Niche entity alignment has weak practitioner resonance, so it sits in the 40–59 low-value research band.

editor take

Cui’s team uses seed entity pairs as anchors; no dataset counts or metrics are disclosed, so I don’t buy the “foundation model” label yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

The paper proposes Data-Augmented Game Starts (DAGS), which samples intermediate states from offline demonstrations for two-player zero-sum imperfect-information games, and tests it on long-horizon variants of Kuhn Poker, Goofspiel, and a counterexample game under fixed compute budgets.

#Reasoning#Benchmarking#OpenSpiel#Research release

why featured

HKR-K passes because DAGS gives a concrete mid-state self-play mechanism and 3 test environments. HKR-H/R are weak: dry paper framing and limited relevance beyond niche RL/game research.

editor take

DAGS starts self-play from offline mid-states and reports lower exploitability under fixed compute; I buy the exploration trick, not the demo-coverage assumption.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Time Series Forecasting Through the Lens of Dynamics

The paper proposes the PRO-DYN nomenclature to analyze time-series forecasting models through dynamics, reporting two observations: under-performing architectures learn dynamics only partially, and placing the dynamics block at the model end is critical.

#Benchmarking#Research release

why featured

Only HKR-K lands: the post gives a PRO-DYN taxonomy and a module-placement claim, but no numbers, artifact, or product angle. This is niche forecasting research, so it stays in all.

editor take

PRO-DYN frames forecasting as dynamics-block placement; the snippet gives no benchmark scale, so I don’t buy the design-guide claim yet.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Exploitation of Hidden Context in Dynamic Movement Forecasting: From Recurrent to Graph Neural Networks and General Purpose Transformers

The paper evaluates LSTM, GNN, Transformer, and linear baselines for NBA movement forecasting under forecast horizons up to 2 seconds; a context-augmented hybrid LSTM achieves the lowest final displacement error at 1.51 m, beating TCNN, GAT, and Transformers while using less data and training time than GAT and Transformers.

#Reasoning#Benchmarking#Research release#Benchmark

why featured

HKR-K passes because the paper gives a 2-second forecasting setup and a 1.51 m FDE result. HKR-H/R miss: this is a niche trajectory-forecasting benchmark with unclear product, agent, or platform impact.

editor take

Hybrid LSTM hits 1.51 m FDE on 2 s NBA forecasting; Transformers lose when short-horizon context beats model fashion.
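
For readers unfamiliar with the metric, final displacement error (FDE) is the Euclidean distance between the last predicted position and the last true position of a trajectory. A minimal sketch follows; the trajectories are made-up (x, y) points in metres, not the paper's data.

```python
import math

def final_displacement_error(predicted, actual):
    """Euclidean distance between the final predicted and final true (x, y) point."""
    (px, py), (ax, ay) = predicted[-1], actual[-1]
    return math.hypot(px - ax, py - ay)

def mean_fde(predicted_trajs, actual_trajs):
    """Average FDE over paired trajectories."""
    errors = [
        final_displacement_error(p, a)
        for p, a in zip(predicted_trajs, actual_trajs)
    ]
    return sum(errors) / len(errors)

# Hypothetical trajectories: each is a list of (x, y) positions in metres.
predicted = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)],
]
actual = [
    [(0.0, 0.0), (2.0, 3.0), (5.0, 6.0)],
    [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)],
]

print(final_displacement_error(predicted[0], actual[0]))  # 5.0
print(mean_fde(predicted, actual))                        # 3.0
```

Only the endpoints count, so a model can score well on FDE while drifting mid-trajectory.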

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Exploring Geographic Relative Space in Large Language Models through Activation Patching

The paper uses activation patching to examine how LLMs process relative geographic space; the RSS abstract discloses the mechanistic interpretability method but not the model names, datasets, or evaluation metrics.

#Interpretability#Research release

why featured

HKR-H barely passes on the geographic-representation hook, while HKR-K/R fail because the feed gives no models, datasets, metrics, or practical implication. This is relevant interpretability research, but thin and niche.

editor take

The paper uses activation patching for relative geography, but names no models or metrics; good question, thin evidence so far.

HKR breakdown

hook ✓knowledge —resonance —

→ open source

SCORE

H1·K0·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

The paper proposes a DRL method that routes one truck in real time for pick-up, drop-off, and charging actions in dockless bike-sharing systems; experiments use real-world data, but the RSS snippet does not disclose the exact reduction in availability failures.

#Agent#Robotics#Research release

why featured

HKR-K passes: the paper gives a real-time 1-truck dispatch mechanism tested on real data. HKR-H and HKR-R fail because this is a narrow operations application with no reported performance lift or AI-product implication.

editor take

DRL routes 1 truck for live rebalancing; no failure-rate delta is disclosed, so the engineering claim stays discounted.

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Comparative Evaluation of Machine Learning Approaches for Minority-Class Financial Distress Prediction Under Class Imbalance Constraints

The arXiv paper compares statistical methods, ensemble learning, and exploratory neural models for minority-class financial distress prediction, using SMOTE, five ensemble architectures including XGBoost and LightGBM, and SHAP attribution under severe class imbalance conditions.

#Benchmarking#Interpretability#arXiv#XGBoost

why featured

HKR-K passes weakly because the setup names concrete methods, but there are no result numbers or production implications. The applied finance paper is vertical, not hard-excluded, so it stays low-value but browseable.

editor take

The paper compares 5 ensemble models plus SMOTE; dataset and AUC are undisclosed, so I file it as routine risk-ML replication.
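
As background on the oversampling step, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a stdlib-only illustration of that idea, not the paper's pipeline; the function name `smote_sketch`, the toy points, and the parameter values are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic

# Hypothetical minority-class feature vectors (e.g. two financial ratios).
minority_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
new_points = smote_sketch(minority_points, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two real minority points, which is why SMOTE cannot add information beyond the minority class's existing convex hull.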

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→XAI and Statistical Analysis for Reliable Intrusion Detection in the UAVIDS-2025 Dataset

Zarkadis and Douligeris compare tree ensembles, DNNs, hybrid stacking models, and ensemble neural networks on UAVIDS-2025 with stratified 10-fold cross-validation, then use SHAP and statistical tests to analyze XGBoost errors in Wormhole and Blackhole attacks.

#Interpretability#Benchmarking#Iakovos-Christos Zarkadis#Christos Douligeris

why featured

HKR-K passes via a new UAVIDS-2025 benchmark setup and model ranking; HKR-H/R are weak, and metrics are not disclosed. This is niche security-ML research, so it stays in all.

editor take

Zarkadis and Douligeris use 10-fold CV on UAVIDS-2025. XGBoost wins, but no scores are disclosed; SHAP isn't mechanistic interpretability.
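
Stratified k-fold cross-validation keeps each class's share roughly constant across folds, which matters when attack classes are rare. A minimal stdlib sketch follows, assuming round-robin dealing without shuffling; the label counts are hypothetical and do not come from UAVIDS-2025.

```python
from collections import Counter, defaultdict

def stratified_k_fold(labels, k):
    """Return k lists of test indices, dealing each class round-robin across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical label distribution: 80 normal flows and 10 of each attack type.
labels = ["normal"] * 80 + ["wormhole"] * 10 + ["blackhole"] * 10
folds = stratified_k_fold(labels, k=10)

for fold in folds[:2]:
    print(Counter(labels[i] for i in fold))
```

With these counts every fold holds 8 normal, 1 wormhole, and 1 blackhole sample. A production run would normally shuffle within each class first, as scikit-learn's `StratifiedKFold` can with `shuffle=True`.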

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0

04:00

25d ago

arXiv · cs.LG· atomEN04:00 · 05·15

→Proposal and Study of Statistical Features for String Similarity Computation and Classification

The paper applies co-occurrence matrix (COM) and run-length matrix (RLM) features to string similarity computation. In the first synthetic experiment set, COM and RLM beat the other statistical features, and in 3 of 4 cases their advantage over the second-best group, the distance-based features, was significant at p < 0.001.

#Benchmarking#Research release#Benchmark

why featured

HKR-K passes on concrete experiment details, but HKR-H and HKR-R fail. This is a narrow string-similarity methods paper with no product, agent, or foundation-model industry impact, so it stays in the low-value non-excluded band.

editor take

COM/RLM won 3 of 4 synthetic cases at P<0.001; looks useful for brittle similarity checks, not semantic retrieval.
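
The snippet does not define the features, so the following is only one plausible reading: a co-occurrence table counting adjacent character pairs and a run-length table counting (character, run length) pairs, compared with a Jaccard-style overlap. All function names and the overlap measure are assumptions, not the paper's definitions.

```python
from collections import Counter
from itertools import groupby

def cooccurrence_counts(text):
    """Count adjacent character pairs (an assumed string analogue of a COM)."""
    return Counter(zip(text, text[1:]))

def run_length_counts(text):
    """Count (character, run length) pairs (an assumed string analogue of an RLM)."""
    return Counter((ch, len(list(run))) for ch, run in groupby(text))

def counter_overlap(a, b):
    """Jaccard-style overlap of two count tables: shared counts over union counts."""
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 1.0

print(cooccurrence_counts("aabb"))
print(run_length_counts("aabbb"))
print(counter_overlap(cooccurrence_counts("abab"), cooccurrence_counts("abba")))  # 0.5
```

Features like these are sensitive to character order and repetition but carry no notion of meaning.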

HKR breakdown

hook —knowledge ✓resonance —

→ open source

SCORE

H0·K1·R0
