ax@ax-radar:~/papers $ grep -E 'arxiv|paper' sources/tags

papers · 2026-05-13

284 papers
2026-05-13 · Wed
19:50
HuggingFace Papers (takara mirror) · rss · EN · 19:50 · 05·13
Fair and Calibrated Toxicity Detection with Robust Training and Abstention
The paper compares ERM, reweighted ERM, and Group DRO for toxicity classification, evaluating ranking, calibration, and abstention fairness with subgroup AUC, BPSN/BNSP AUC, error gaps, per-subgroup ECE, and 1,000 bootstrap confidence intervals.
#Safety#Benchmarking#Research release#Safety/alignment
why featured
HKR-K is solid: the paper gives concrete methods, 1,000 bootstrap CIs, and fairness dimensions for toxicity detection. HKR-R is narrow, with relevance mainly to safety/moderation teams; no hard exclusion, but it stays in the 60–71 band.
editor take
ERM posts a global ECE of 0.013, yet subgroup gaps reach 0.134; toxicity papers that report only AUC miss where the fairness cost lands.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
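The subgroup metrics in this summary have standard definitions, so here is a minimal pure-Python sketch of them. It is not the paper's code. Assumptions: 0/1 toxicity labels, a hypothetical boolean `in_group` flag per example marking subgroup mentions, and 10 equal-width bins on the predicted P(toxic) for ECE (one common choice; the post does not give the paper's). The 1,000-sample bootstrap CIs could be layered on top by resampling examples and recomputing each metric; that loop is omitted.

```python
from typing import List, Sequence

def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Mann-Whitney AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select(scores, labels, in_group, group: bool, label: int) -> List[float]:
    """Scores of examples with the given subgroup membership and 0/1 label."""
    return [s for s, y, g in zip(scores, labels, in_group) if g == group and y == label]

def subgroup_auc(scores, labels, in_group) -> float:
    # Ranking quality using only examples that mention the subgroup.
    return auc(select(scores, labels, in_group, True, 1), select(scores, labels, in_group, True, 0))

def bpsn_auc(scores, labels, in_group) -> float:
    # Background Positive vs Subgroup Negative: a low value means non-toxic
    # subgroup mentions score as high as toxic background text.
    return auc(select(scores, labels, in_group, False, 1), select(scores, labels, in_group, True, 0))

def bnsp_auc(scores, labels, in_group) -> float:
    # Background Negative vs Subgroup Positive: a low value means toxic
    # subgroup text scores as low as non-toxic background text.
    return auc(select(scores, labels, in_group, True, 1), select(scores, labels, in_group, False, 0))

def ece(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error over equal-width bins of the predicted P(toxic)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))  # p == 1.0 lands in the last bin
    total = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            total += len(b) / len(probs) * abs(pos_rate - mean_p)
    return total

def subgroup_ece(probs, labels, in_group, n_bins: int = 10) -> float:
    # Same ECE, restricted to subgroup examples; compare it against the global ece().
    pairs = [(p, y) for p, y, g in zip(probs, labels, in_group) if g]
    return ece([p for p, _ in pairs], [y for _, y in pairs], n_bins)
```

A model can post a low global `ece` while `subgroup_ece` stays high, which is the gap the editor take is pointing at.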
19:47
HuggingFace Papers (takara mirror) · rss · EN · 19:47 · 05·13
Distribution-Corrected Offline Data Distillation for Large Language Models
The paper proposes distribution-corrected offline reasoning distillation and evaluates it on GSM8K, MATH, MATH500, AMC, AIME, and OlympiadBench; the post does not disclose exact accuracy gains, model sizes, or training-cost numbers.
#Reasoning#Fine-tuning#Benchmarking#Research release
why featured
HKR-K passes for a named distillation mechanism and benchmark set. HKR-H/R are weak: the post gives no gains or deployment cost, so this stays in the lower ordinary research band.
editor take
The paper tests 6 math benchmarks, but the post reports no gains; I’d file it as a neat offline-distillation hypothesis for now.
58
SCORE
H0·K1·R0
19:25
HuggingFace Papers (takara mirror) · rss · EN · 19:25 · 05·13
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank weight adaptation for multi-task LLM fine-tuning, and reports up to 6.67% average accuracy improvement over MTL-LoRA, MultiLoRA, C-Poly, and MoE on GLUE, SuperGLUE, MMLU, and commonsense reasoning benchmarks.
#Fine-tuning#Benchmarking#PEML#LoRA
why featured
HKR-K and HKR-R pass: the paper provides a concrete PEML mechanism and benchmark gains on GLUE, SuperGLUE, MMLU, and commonsense tasks. HKR-H is weak, and without open-source or production evidence this stays in the 60–71 band.
editor take
PEML reports up to 6.67% average gain; I have doubts, since base models and parameter budgets aren't disclosed.
67
SCORE
H0·K1·R1
17:52
arXiv · cs.AI · atom · EN · 17:52 · 05·13
Quantifying Sensitivity for Tree Ensembles Using Symbolic and Compositional Methods
The paper introduces XCount to quantify sensitivity in decision tree ensembles by discretizing the input space, encoding the problem as an algebraic decision diagram, and splitting it into subproblems under certified error and confidence bounds; the snippet reports speedups over model counters but does not disclose benchmark numbers.
#Safety#Benchmarking#XCount#Research release
why featured
HKR-K passes for a concrete method, but HKR-H/R fail. The symbolic-verification angle on tree-ensemble sensitivity triggers the technical-accessibility exclusion, making it too narrow for general AI practitioners.
editor take
XCount quantifies sensitive regions for tree ensembles with ADDs and certified bounds; benchmark sizes are undisclosed, so I don't buy the speedup claim yet.
50
SCORE
H0·K1·R0
17:45
● P1 · arXiv · cs.AI · atom · EN · 17:45 · 05·13
Research paper introduces AEvo meta-editing framework for agentic evolution with 26% performance gain
The paper introduces AEvo, a meta-editing framework where a meta-agent edits the procedure or agent context that drives future evolution; on agentic and reasoning benchmarks, AEvo outperforms five evolution baselines with a 26% relative improvement over the strongest baseline.
#Agent#Reasoning#Benchmarking#AEvo
why featured
HKR-H/K/R all pass, but this is a single arXiv paper with AEvo and a 26% benchmark claim, not a major lab release or product artifact; keep it in the 72–77 band.
editor take
AEvo edits the search machinery, not the next answer; 26% relative gain is sharp, but the abstract lacks task tables and cost, so don't crown it yet.
sharp
The two records are cs.AI and cs.LG entries for the same arXiv paper, with one abstract and one number. That is category distribution, not independent corroboration. AEvo’s useful claim is mechanical: the meta-agent does not propose the next candidate; it edits the procedure or agent context that drives later evolution. The authors report wins over five baselines on agentic and reasoning benchmarks, with a 26% relative gain over the strongest baseline, plus wins over four baselines on three open-ended optimization tasks. I like the direction because it targets the search loop, not just sampling plus reranking. But the abstract does not expose the benchmark list, token budget, or failure profile. Compared with DSPy-style prompt/program optimization, AEvo is more ambitious and harder to trust without a clean reproduction package.
86
SCORE
H1·K1·R1
17:43
arXiv · cs.AI · atom · EN · 17:43 · 05·13
Neurosymbolic Auditing of Natural-Language Software Requirements
The paper presents VERIMED, a neurosymbolic pipeline that uses LLMs and an SMT solver to audit medical-device software requirements; on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%.
#Reasoning#Tools#Benchmarking#VERIMED
why featured
HKR-K is strong: the paper gives LLM+SMT counterexamples and a 55.4%→98.5% result. HKR-H and HKR-R pass, but the formal-requirements angle is niche, so it stays in all rather than featured.
editor take
VERIMED lifts hemodialysis verified accuracy from 55.4% to 98.5%; SMT counterexamples beat LLM self-consistency for medical audits.
70
SCORE
H1·K1·R1
17:42
HuggingFace Papers (takara mirror) · rss · EN · 17:42 · 05·13
OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation
OmniLiDAR uses one text-conditioned diffusion framework to generate LiDAR scans across 8 domains, covering three distribution-shift types: adverse weather, sensor-configuration changes such as reduced beams, and cross-platform acquisition across vehicles, drones, and quadrupeds.
#Multimodal#Robotics#OmniLiDAR#Research release
why featured
HKR-H and HKR-K pass: 8-domain LiDAR generation and 3 shift types are concrete. HKR-R is weak because the story is specialized robotics sensor-data research, so it stays in all.
editor take
OmniLiDAR trains one generator across 8 LiDAR domains; I buy CDTS, not broad claims on unseen sensors yet.
66
SCORE
H1·K1·R0
17:13
arXiv · cs.CL · atom · EN · 17:13 · 05·13
An LLM-Based System for Argument Reconstruction
The paper presents an end-to-end LLM system that reconstructs arguments from natural-language text into directed acyclic argument graphs with two component types, premises and conclusions, and three relation types, support, attack, and undercut; evaluation uses one manual textbook-based experiment and one quantitative benchmark comparison against prior annotation schemes.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes because the paper gives a testable graph mechanism and evaluation setup. HKR-H and HKR-R are weak: the title is academic, and the application pull for AI practitioners is narrow.
editor take
The system outputs 2 node types and 3 relation types; no scores disclosed, so “adequately recover” is doing too much work.
62
SCORE
H0·K1·R0
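The post names the output schema but not its encoding. Below is a hypothetical in-memory version: typed nodes, typed edges, and an acyclicity check (Kahn's algorithm) on every insert. The class and field names are assumptions. One simplification is flagged in the code: in argumentation theory an undercut attacks the inference between claims, while this sketch stores it as an ordinary node-to-node edge.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NODE_TYPES = {"premise", "conclusion"}
# Simplification: "undercut" is stored as a node-to-node edge here, although it
# conventionally targets an inference rather than a claim.
RELATION_TYPES = {"support", "attack", "undercut"}

@dataclass
class ArgumentGraph:
    """Hypothetical container: typed nodes plus typed directed edges, kept acyclic."""
    nodes: Dict[str, str] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in NODE_TYPES:
            raise ValueError(f"unknown node type: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, dst, relation))
        if not self.is_acyclic():
            self.edges.pop()  # roll back so the graph stays a DAG
            raise ValueError(f"edge {src}->{dst} would create a cycle")

    def is_acyclic(self) -> bool:
        # Kahn's algorithm: a DAG can be emptied by repeatedly removing zero in-degree nodes.
        indegree = {n: 0 for n in self.nodes}
        children: Dict[str, List[str]] = {n: [] for n in self.nodes}
        for src, dst, _ in self.edges:
            children[src].append(dst)
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return removed == len(self.nodes)
```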
17:11
arXiv · cs.AI · atom · EN · 17:11 · 05·13
Di-BiLPS achieves PDE solving under sparse observations with denoising-induced bidirectional latent approach
Di-BiLPS combines a VAE, latent diffusion, and contrastive learning to solve forward and inverse PDE tasks under sparse observations, achieving SOTA results with inputs as low as 3% and supporting zero-shot super-resolution over continuous spatial-temporal domains.
#Reasoning#Inference-opt#Di-BiLPS#Research release
why featured
Triggers hard-exclusion-1 and hard-exclusion-4: a specialist numerical-PDE paper with no product or agent implication. HKR-K passes on the 3% sparse-input claim, but the item stays capped as excluded.
editor take
Di-BiLPS hit 2 arXiv feeds; the post claims SOTA with inputs as low as 3% but names no benchmarks.
46
SCORE
H0·K1·R0
16:41
HuggingFace Papers (takara mirror) · rss · EN · 16:41 · 05·13
Conditional Latent Dynamics Network for Metropolitan Flood Digital Twins and Forecasting
CLDNet reduces a 96-hour basin-wide flood forecast for the Des Plaines River basin from about 55 minutes to about 29 seconds, using a rainfall-driven latent neural ODE and terrain-conditioned decoder, and reaches about 86% critical success index at the 0.5 m inundation threshold.
#Reasoning#Benchmarking#CLDNet#United States Geological Survey
why featured
Hard-exclusion-4 applies: this is an AI surrogate for hydrology simulation, with no agent, product, or general AI tooling implication. HKR-H and HKR-K pass, but the cap keeps it excluded.
editor take
CLDNet cuts a 96-hour flood run from 55 minutes to 29 seconds; ask for code and for tests on storms beyond the 114 in the study.
49
SCORE
H1·K1·R0
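CSI has a standard contingency-table definition: hits / (hits + misses + false alarms). A minimal sketch over flattened per-cell depth maps follows; the flat-list input and the rule that a depth exactly at 0.5 m counts as flooded are assumptions, not details from the post.

```python
from typing import Sequence

def critical_success_index(pred_depth: Sequence[float], obs_depth: Sequence[float],
                           threshold_m: float = 0.5) -> float:
    """CSI = hits / (hits + misses + false alarms) for cells flooded at a depth threshold."""
    hits = misses = false_alarms = 0
    for pred, obs in zip(pred_depth, obs_depth):
        pred_wet = pred >= threshold_m  # assumption: depth equal to the threshold counts as flooded
        obs_wet = obs >= threshold_m
        if pred_wet and obs_wet:
            hits += 1
        elif obs_wet:
            misses += 1
        elif pred_wet:
            false_alarms += 1
        # cells dry in both maps (correct negatives) do not enter CSI
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")
```

Cells that are dry in both maps never enter the score, so a large dry basin cannot inflate CSI the way it would inflate plain accuracy.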
16:10
HuggingFace Papers (takara mirror) · rss · EN · 16:10 · 05·13
Research on Stacked Ensemble Models for Bicuspid Aortic Valve Echocardiographic Diagnosis
The researchers trained a PLAX cine-loop stacked ensemble on 90 TTE patient studies to classify BAV versus TAV, reporting outer-CV F1 of 0.907 and recall of 0.877 across fixed splits and 10 random seeds.
#Vision#Multimodal#Interpretability#Research release
why featured
Hard-exclusion-4 applies: this is medical-imaging AI research with no product, agent, or industry deployment mechanism. HKR-K is supported by sample size and metrics, but HKR-H/R fail, so the score is capped below 40.
editor take
A stacked TTE ensemble hit 0.907 outer-CV F1 on 90 patients; I don’t buy the clinical claim before larger external validation.
47
SCORE
H0·K1·R0
15:43
HuggingFace Papers (takara mirror) · rss · EN · 15:43 · 05·13
The WidthWall: A Strict Expressivity Hierarchy for Hypergraph Neural Networks
The paper uses homomorphism densities to characterize continuous hypergraph invariants and defines a strict hierarchy indexed by hypertree width, called the Width Wall. It analyzes 15 HGNN architectures, identifies information lost by clique expansion, and validates the limit on a real-world hypergraph node classification suite where graph-reduction baselines fail under wider pattern requirements.
#Benchmarking#Research release#Benchmark
why featured
Hard exclusion for technical accessibility: homomorphism density, hypertree width, and HGNN expressivity need niche graph-theory context, with no product or agent hook. HKR-K passes, but HKR-H/R fail, so the item stays below 40.
editor take
WidthWall classifies 15 HGNNs by hypertree width; hidden dims and training tricks won’t patch missing higher-order structure.
46
SCORE
H0·K1·R0
15:06
HuggingFace Papers (takara mirror) · rss · EN · 15:06 · 05·13
Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling
CaAD aligns a stochastic ego policy through ego-centric joint-causal modeling and joint-mode embeddings, reaching an 87.53 Driving Score and 71.81 Success Rate on Bench2Drive and a 91.1 PDMS on NAVSIM.
#Robotics#Reasoning#Benchmarking#CaAD
why featured
HKR-K passes with a concrete mechanism and Bench2Drive/NAVSIM numbers; HKR-H is weak, and HKR-R is limited to the AV niche. This is a useful robotics research item for all, not a broad featured story.
editor take
CaAD scores 87.53 on Bench2Drive; causal modeling is often hand-wavy, but the closed-loop numbers earn a feed slot.
62
SCORE
H0·K1·R0
14:00
HuggingFace Papers (takara mirror) · rss · EN · 14:00 · 05·13
Bayesian Physics-Informed Neural Network for Lung Tumor Growth Prediction
The study uses a Bayesian physics-informed neural network to predict lung tumor growth from sparse longitudinal CT data in 30 National Lung Screening Trial patients, combining Gompertz dynamics, MAP estimation, and HMC sampling to produce posterior predictive distributions with about 0.20 cohort-level log-space RMSE and calibrated 95% credible interval coverage.
#Reasoning#National Lung Screening Trial#Research release
why featured
Hard-exclusion-4 applies: this is a traditional science + AI crossover with no agent, product, or industry deployment angle. HKR-K passes on concrete metrics, but H/R fail, so it stays excluded.
editor take
Bayesian PINN predicts lung tumor growth on 30 NLST patients with ~0.20 log-space RMSE; useful signal, not clinical evidence.
44
SCORE
H0·K1·R0
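For readers unfamiliar with the growth model, here is one common Gompertz parametrization and the log-space RMSE metric, as a sketch. The paper's exact parametrization, priors, and MAP/HMC fitting are not in the post and are not reproduced here.

```python
import math
from typing import Sequence

def gompertz_volume(t: float, v0: float, k: float, beta: float) -> float:
    """Gompertz growth V(t) = K * exp(ln(V0/K) * exp(-beta*t)).

    V(0) = V0, and V(t) approaches the carrying capacity K as t grows.
    """
    return k * math.exp(math.log(v0 / k) * math.exp(-beta * t))

def log_rmse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """RMSE between natural-log volumes, so errors are relative rather than absolute."""
    sq = [(math.log(p) - math.log(o)) ** 2 for p, o in zip(pred, obs)]
    return math.sqrt(sum(sq) / len(sq))
```

Assuming natural logs, a log-space RMSE of 0.20 corresponds to a typical multiplicative error of about e^0.20 ≈ 1.22, i.e. predictions roughly 22% off in volume.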
13:47
HuggingFace Papers (takara mirror) · rss · EN · 13:47 · 05·13
Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
The authors used locale-conditioned rotating three-shot prompts to stop Bonsai-1.7B regurgitation in 482/482 calls, but on the matched English NER subset, hybrid SLM substitution scored F1=0.346 versus faker at 0.506 with p < 0.001.
#Fine-tuning#Inference-opt#Benchmarking#OpenAI
why featured
HKR-K is strong and HKR-R is moderate: it has a reproducible prompt setup and a 482/482 result, plus the F1 weakness versus faker. The scope is narrow and not productized, so it stays in 60–71.
editor take
Bonsai-1.7B hit 0 echoes in 482 locale-rotated 3-shot calls; F1 0.346 vs faker 0.506 says variety beats fluency.
64
SCORE
H0·K1·R1
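The post gives the recipe (locale-conditioned, rotating three-shot prompts) but not the prompt text. A hypothetical sketch: demonstrations are drawn per locale, the window rotates with the call index, and a simple string check flags outputs that copy a demonstration's substitute value. The demo pools, prompt wording, rotation rule, and regurgitation test are all assumptions.

```python
from typing import Dict, List, Tuple

# (input, output, substitute values introduced in the output)
Demo = Tuple[str, str, Tuple[str, ...]]

# Hypothetical demonstration pools; the post does not disclose the real ones.
DEMO_POOLS: Dict[str, List[Demo]] = {
    "en_US": [
        ("Call Tom Reed", "Call Sam Patel", ("Sam Patel",)),
        ("Ship to 12 Oak St", "Ship to 48 Elm Ave", ("48 Elm Ave",)),
        ("Email ann@example.com", "Email lucy@example.org", ("lucy@example.org",)),
        ("Patient Ann Kim", "Patient Maria Lopez", ("Maria Lopez",)),
    ],
    "de_DE": [
        ("Ruf Hans Meier an", "Ruf Jonas Keller an", ("Jonas Keller",)),
        ("Adresse: Lindenstr. 5", "Adresse: Bergweg 9", ("Bergweg 9",)),
        ("Patientin Anna Schulz", "Patientin Lena Vogel", ("Lena Vogel",)),
    ],
}

def pick_demos(locale: str, call_index: int, shots: int = 3) -> List[Demo]:
    """Choose `shots` demos for this locale, sliding the window by one per call."""
    pool = DEMO_POOLS[locale]
    start = call_index % len(pool)
    return [pool[(start + j) % len(pool)] for j in range(min(shots, len(pool)))]

def build_prompt(locale: str, text: str, call_index: int, shots: int = 3) -> str:
    parts = [f"Replace each personal detail with a realistic {locale} substitute."]
    for src, dst, _ in pick_demos(locale, call_index, shots):
        parts.append(f"Input: {src}\nOutput: {dst}")
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)

def is_regurgitation(output: str, locale: str) -> bool:
    """True if the output reuses any substitute value from the locale's demo pool."""
    return any(v in output for _, _, values in DEMO_POOLS[locale] for v in values)
```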
13:46
HuggingFace Papers (takara mirror) · rss · EN · 13:46 · 05·13
AI-Generated Slides: Are They Good? Can Students Tell?
The paper compares slide generation from instructor notes across NotebookLM, Claude, M365 Copilot, Cursor, and Claude Code, finding that coding assistants produced the most accurate, complete, and pedagogically sound slides, while students rated GenAI slides similarly to instructor-created slides and could not reliably identify which slides were AI-generated.
#Code#Benchmarking#NotebookLM#Claude
why featured
HKR-H/K/R all pass through a clear comparison and a surprising student-blindness result. Scope is education-heavy, and sample size, grading rubric, and reproducible setup are not disclosed, keeping it in the interesting band.
editor take
Five tools made slides, coding agents won; sample size is missing, so don't oversell students failing AI detection.
71
SCORE
H1·K1·R1
13:40
HuggingFace Papers (takara mirror) · rss · EN · 13:40 · 05·13
MMSkills: Towards Multimodal Skills for General Visual Agents
The paper introduces MMSkills, a framework that packages textual procedures, runtime state cards, and multi-view keyframes into reusable multimodal skills; experiments cover GUI and game-based visual-agent benchmarks, but the post does not disclose exact scores.
#Agent#Multimodal#Vision#MMSkills
why featured
HKR-K is clear and HKR-R is present through agent reuse pain; HKR-H is weak. The paper offers a testable mechanism, but benchmark scores are not disclosed, keeping it in the interesting-not-featured band.
editor take
MMSkills packages procedures, state cards, and multiview frames; without scores, I’d file it as visual-agent memory engineering.
66
SCORE
H0·K1·R1
13:06
HuggingFace Papers (takara mirror) · rss · EN · 13:06 · 05·13
PersonalAI 2.0: Enhancing knowledge graph traversal and retrieval with planning for personalized LLM agents
PersonalAI 2.0 improves personalized LLM agents with a dynamic GraphRAG pipeline using extracted entities, matched graph vertices, and clue queries; across six benchmarks, enabling the search-planning mechanism raises LLM-as-a-Judge scores by 18% versus disabling it.
#Agent#RAG#Reasoning#PersonalAI 2.0
why featured
HKR-K and HKR-R pass: the item gives 6 benchmarks and an 18% gain, tied to agent memory/RAG practice. HKR-H is weak, and the post lacks open-source artifacts, replication detail, or major-lab weight, so it stays in the 60–71 band.
editor take
PAI-2 gets +18% from search planning across six benchmarks; with LLM-as-a-Judge, I wouldn't call it a personalized-agent win yet.
68
SCORE
H0·K1·R1
12:57
HuggingFace Papers (takara mirror) · rss · EN · 12:57 · 05·13
Twincher: Bijective Representation Learning for Continuous System Inversion
The paper introduces Twincher, an architecture using stacks of structured diffeomorphic transformations and tailored adversarial training to learn bijective representations between y and p, with experiments on synthetic systems showing better data efficiency and robustness than an inverse-modeling baseline.
#Reasoning#Robotics#Inference-opt#Twincher
why featured
HKR-K passes because Twincher includes concrete mechanisms and test conditions. HKR-H/R fail, and hard-exclusion-technical-accessibility applies: continuous-system inversion has no clear product or agent on-ramp.
editor take
Twincher targets robust inversion via bijective representations, but evidence stops at synthetic systems; physical-AI claims need real benchmarks.
46
SCORE
H0·K1·R0
12:34
HuggingFace Papers (takara mirror) · rss · EN · 12:34 · 05·13
Cognifold: Always-On Proactive Memory via Cognitive Folding
Cognifold introduces a three-layer CLS agent memory with a prefrontal intent layer, using graph-topology self-organization to fold event streams, merge similar structures, decay stale ones, and surface intents when concept-cluster density crosses a threshold; the paper evaluates it with CogEval-Bench and 7 benchmarks across five cognitive domains.
#Agent#Memory#Benchmarking#Cognifold
why featured
HKR-H/K/R all pass, but the post stays at abstract level: no author authority, code, effect sizes, or production validation. This fits the upper end of the 60–71 research-release band.
editor take
Cognifold tests three-layer CLS memory on 7 benchmarks; I don’t buy the autonomy framing until CogEval-Bench is reproducible.
71
SCORE
H1·K1·R1
12:23
HuggingFace Papers (takara mirror) · rss · EN · 12:23 · 05·13
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ aligns source and target vocabularies through a bilingual token lexicon, improves multilingual text compression rates across 15 languages, and restores vanilla model performance with as few as 1k fine-tuning steps.
#Fine-tuning#Inference-opt#TokAlign++#Research release
why featured
HKR-K passes: the method and test conditions are concrete for multilingual model or tokenizer migration work. HKR-H and HKR-R are weak, and a single technical paper fits the 60–71 all band.
editor take
TokAlign++ improves compression across 15 languages and recovers in 1k steps; vocab adaptation deserves more attention than tokenizer retraining.
64
SCORE
H0·K1·R0
11:35
HuggingFace Papers (takara mirror) · rss · EN · 11:35 · 05·13
Backbone is All You Need: Assessing Vulnerabilities of Frozen Foundation Models in Synthetic Image Forensics
The paper proposes SIAA, a gray-box attack that uses only the detector’s ViT backbone and crafts adversarial examples in the target feature space; experiments cover multiple ViT-based detectors, few-shot learning, training misalignment, and transferability tests.
#Vision#Safety#Benchmarking#Research release
why featured
HKR-H/K/R all pass, but the post lacks success rates, dataset scale, and artifact details. This is useful safety research, not a same-day model or product event.
editor take
SIAA attacks ViT detectors with backbone knowledge only; no success rates disclosed, but frozen backbones look brittle here.
68
SCORE
H1·K1·R1
11:02
HuggingFace Papers (takara mirror) · rss · EN · 11:02 · 05·13
Hierarchical Transformer Preconditioner for Interactive Physics Simulation
Hierarchical Transformer Preconditioner reaches 17.9 ms per frame on N=8,192 stiff multiphase Poisson systems, running 2.2x faster than GPU Jacobi, about 28x faster than GPU IC/DILU via AMGX multicolor_dilu, and 2.7x faster than neural SPAI retrained per scale on the same benchmark.
#Inference-opt#Research release#Benchmark
why featured
Hard-exclusion-1/4 applies: a multiphase Poisson preconditioner is numerical methods plus physics simulation, with no agent, product, or general-model implication. HKR-K passes on benchmarks, but the item stays below 40.
editor take
Hierarchical Transformer Preconditioner hits 17.9 ms/frame at N=8,192; the serious bit is a full PCG loop captured in one CUDA Graph.
50
SCORE
H0·K1·R0
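For context on the baselines: Jacobi is the simplest preconditioner for the conjugate-gradient loop the editor take mentions. Below is a minimal dense, pure-Python PCG with M = diag(A). It only illustrates where a preconditioner plugs in (the `z = M^{-1} r` step); it is not the paper's hierarchical transformer, its CUDA Graph setup, or its multiphase Poisson matrices.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]

def matvec(a: Matrix, x: Vector) -> Vector:
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg_jacobi(a: Matrix, b: Vector, tol: float = 1e-10,
               max_iter: int = 1000) -> Tuple[Vector, int]:
    """Conjugate gradient on SPD `a`, preconditioned by M = diag(a) (Jacobi)."""
    n = len(b)
    inv_diag = [1.0 / a[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                 # residual b - A x for x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]  # z = M^{-1} r: the preconditioner step
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            return x, it
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

In standard PCG, swapping preconditioners changes only how `z` is computed from `r`; CG's convergence guarantees assume that operator is symmetric positive definite.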
10:53
HuggingFace Papers (takara mirror) · rss · EN · 10:53 · 05·13
Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning
Ego2World converts HD-EPIC egocentric cooking videos into executable symbolic worlds with hidden graph-transition state, evaluating agents that plan from local observations and execution feedback; experiments report that action-overlap scores overestimate physical-state success, while persistent belief memory improves task completion and reduces repeated visual exploration.
#Agent#Robotics#Memory#Research release
why featured
HKR-H/K/R pass, but the body only gives the mechanism; results, release status, and reproducible details are missing. This is useful agent-eval research, not a featured item.
editor take
Ego2World turns HD-EPIC cooking videos into hidden-state symbolic worlds; I buy the benchmark: action overlap is too forgiving a metric for embodied planning.
68
SCORE
H1·K1·R1
09:24
HuggingFace Papers (takara mirror) · rss · EN · 09:24 · 05·13
A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations
IfcLLM converts IFC models into relational and graph representations, and reports 93.3–100% first-attempt accuracy on three IFC models with queries derived from 30 scenarios.
#Agent#Reasoning#Tools#IfcLLM
why featured
HKR-K passes with a concrete hybrid representation and small benchmark results. HKR-H and HKR-R are weak because the IFC/BIM angle is niche, so this stays in all rather than featured.
editor take
IfcLLM reports 93.3–100% first-try accuracy on 3 IFC models; 30 scenarios is too thin for general BIM querying claims.
62
SCORE
H0·K1·R0
09:19
HuggingFace Papers (takara mirror) · rss · EN · 09:19 · 05·13
Improving Code Translation with Syntax-Guided and Semantic-Aware Preference Optimization
The paper introduces CTO, which combines source-code-derived semantic rewards with compiler-based syntax feedback inside DPO, and reports stronger results than existing baselines on C++, Java, and Python translation tasks.
#Code#Fine-tuning#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper states CTO’s training signals and C++/Java/Python translation tests. No open artifact, absolute metrics, or broad replication details are disclosed, so this remains a narrow code-research item.
editor take
CTO puts source-derived semantic rewards and compiler feedback into DPO. No numbers disclosed, so I don’t buy “significantly outperforms.”
66
SCORE
H0·K1·R1
08:41
HuggingFace Papers (takara mirror) · rss · EN · 08:41 · 05·13
DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution
DiffST applies one-step sampling and whole-video processing to real-world STVSR, adds CFCA and VRG for spatiotemporal aggregation and video-level guidance, and reports about 17× faster inference than previous diffusion-based STVSR methods.
#Vision#Multimodal#Inference-opt#DiffST
why featured
HKR-H and HKR-K pass via the 17× speed claim and the one-step whole-video design. Scope stays narrow: a single STVSR paper with no product adoption or broad practitioner debate, so it stays in all.
editor take
DiffST reports 17× faster diffusion STVSR; I buy one-step sampling more than “leading results” without metrics here.
66
SCORE
H1·K1·R0
08:30
HuggingFace Papers (takara mirror) · rss · EN · 08:30 · 05·13
GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
GeoBuildBench evaluates large language models and multimodal agents on 489 Chinese textbook-style geometry problems, requiring each agent to generate a DSL program that constructs diagrams satisfying explicit objects and verifiable constraints; evaluated models still produce structural hallucinations, omit objects, and fail to use visual or constraint feedback for self-correction.
#Multimodal#Reasoning#Agent#GeoBuildBench
why featured
HKR-K/R pass, but GeoBuildBench is a narrow academic benchmark. It gives a concrete dataset size and failure modes, without model-release or product impact, so it sits in 60–71.
editor take
GeoBuildBench tests DSL construction on 489 Chinese geometry problems; I buy the setup because hallucinated diagrams finally hit executable checks.
68
SCORE
H0·K1·R1
08:14
HuggingFace Papers (takara mirror) · rss · EN · 08:14 · 05·13
Research paper introduces Decision Pattern Shift theory explaining model generalization
The paper introduces Decision Pattern Shift, representing each sample with a GradCAM-based channel-contribution vector and measuring deviation from the training class-average pattern; experiments across multiple datasets and architectures report an almost linear correlation between DPS magnitude and the generalization gap, with nearly all Pearson r values above 0.8.
#Vision#Interpretability#Benchmarking#Research release
why featured
HKR-K is strong: DPS uses GradCAM channel-contribution vectors and reports correlations above 0.8. HKR-R is limited to generalization-evaluation readers; HKR-H is weak, so this stays in all.
editor take
DPS links GradCAM channel vectors to generalization gaps at r>0.8; nice, but ViT and non-classification transfer decide its value.
66
SCORE
H0·K1·R1
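The DPS description above maps onto a few small computations. A sketch under loudly stated assumptions: each channel's contribution is the Grad-CAM weight (spatial mean of the gradient) times the channel's mean activation, and "deviation from the training class-average pattern" is taken as Euclidean distance. The post confirms neither choice, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

def channel_contributions(activations: Sequence[Sequence[float]],
                          gradients: Sequence[Sequence[float]]) -> List[float]:
    """Per-channel contribution alpha_k * mean(A_k), where alpha_k = spatial mean of dScore/dA_k.

    alpha_k is the Grad-CAM channel weight; multiplying by the mean activation is an assumption.
    """
    return [(sum(g) / len(g)) * (sum(a) / len(a)) for a, g in zip(activations, gradients)]

def class_average_patterns(vectors: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean contribution vector per class over the training set."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vec, y in zip(vectors, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def decision_pattern_shift(vec: Sequence[float], label: int,
                           patterns: Dict[int, List[float]]) -> float:
    """Distance from a sample's vector to its class's training-average pattern (assumed Euclidean)."""
    return math.sqrt(sum((v - m) ** 2 for v, m in zip(vec, patterns[label])))

def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

To reproduce the headline correlation, some per-model summary of DPS (for example, its mean over test samples) would be correlated with each model's generalization gap; which summary the paper uses is not stated in the post.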
07:37
HuggingFace Papers (takara mirror) · rss · EN · 07:37 · 05·13
SECOND-Grasp: Semantic Contact-guided Dexterous Grasping
SECOND-Grasp combines vision-language reasoning, SGCR, and inverse kinematics to generate 3D contact maps, reaching 98.2% lifting success on seen categories and 97.7% on unseen categories after training on DexGraspNet.
#Robotics#Vision#Reasoning#SECOND-Grasp
why featured
HKR-K is strong and HKR-R applies to embodied-AI practitioners, but this is a single paper summary and DexGraspNet gains are not product proof. Score stays in the interesting-not-featured band.
editor take
SECOND-Grasp hits 98.2%/97.7% on DexGraspNet; I care less about that than its gap to real cluttered bins.
68
SCORE
H0·K1·R1
06:54
HuggingFace Papers (takara mirror) · rss · EN · 06:54 · 05·13
Does Language Matter for Spoken Word Classification? A Multilingual Generative Meta-Learning Approach
The paper applies Generative Meta-Continual Learning to spoken word classification, trains monolingual models on English, German, French, and Catalan plus bilingual and multilingual variants, and finds the multilingual model performs best while unique training hours indicate performance better than the number of languages.
#Audio#Fine-tuning#Benchmarking#Research release
why featured
HKR-H and HKR-K pass on a concrete multilingual speech finding, but HKR-R is weak. The paper is narrow research without product or agent implications, so it stays in the 40–59 band.
editor take
The paper trains EN/DE/FR/CA models; I buy unique hours over language count as the cleaner performance driver.
55
SCORE
H1·K1·R0
06:41
HuggingFace Papers (takara mirror) · rss · EN · 06:41 · 05·13
When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation
The paper evaluates proprioceptive encodings for robotic manipulation and finds that an episode-wise relative frame outperforms baselines in real-robot experiments, while the post does not disclose the number of tasks, robot platforms, or metric values.
#Robotics#Research release
why featured
HKR-H/K pass: the hook is absolute state failing, and the paper adds an episode-relative coordinate mechanism. Missing task counts and metrics keep it niche robotics research, with HKR-R weak.
editor take
The paper says episode-wise relative frames win; no task counts or metrics, so don’t refactor proprioception yet.
62
SCORE
H1·K1·R0
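The post doesn't say which state dimensions the paper encodes or how it handles orientation. As an assumption-laden illustration of an "episode-wise relative frame", the sketch re-expresses planar poses relative to each episode's first pose, so the encoding no longer depends on where the episode starts in absolute coordinates. The planar `(x, y, heading)` pose is a hypothetical stand-in for real proprioceptive state.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading in radians); hypothetical state layout

def to_episode_frame(poses: Sequence[Pose]) -> List[Pose]:
    """Re-express planar poses in the frame of the episode's first pose."""
    if not poses:
        return []
    x0, y0, t0 = poses[0]
    c, s = math.cos(t0), math.sin(t0)
    out = []
    for x, y, t in poses:
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -t0 so the first pose faces along +x.
        rel_x = c * dx + s * dy
        rel_y = -s * dx + c * dy
        rel_t = math.atan2(math.sin(t - t0), math.cos(t - t0))  # wrap angle to [-pi, pi]
        out.append((rel_x, rel_y, rel_t))
    return out
```

Shifting a whole episode in the world leaves its encoding unchanged, which is the property an absolute-state encoding lacks.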
06:08
HuggingFace Papers (takara mirror) · rss · EN · 06:08 · 05·13
An Agentic LLM-Based Framework for Population-Scale Mental Health Screening
The paper proposes a LangChain-agent pipeline for population-scale mental health screening, and its transcript-based depression detection proof of concept uses cosine similarity, dynamic Top-k, and a 0.75 threshold while locking validated stages to prevent regressions.
#Agent#RAG#Tools#LangChain
why featured
HKR-H/K/R all pass, but the post only shows a proof of concept and method details; no real population scale, clinical validation, or shipped product is disclosed, so it stays in the 60–71 band.
editor take
Only a PoC is disclosed: cosine, dynamic Top-k, 0.75 threshold; no cohort size or AUC, so I don’t buy population-scale screening.
66
SCORE
H1·K1·R1
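The PoC's retrieval step, as described, reduces to cosine scoring plus a 0.75 cutoff. The post does not define "dynamic Top-k"; this sketch assumes k is simply the number of passages that clear the threshold, capped by a hypothetical `max_k`. Embeddings are plain lists, and no LangChain API is used or implied.

```python
import math
from typing import List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieve(query: Sequence[float], corpus: Sequence[Sequence[float]],
             threshold: float = 0.75, max_k: int = 5) -> List[Tuple[float, int]]:
    """Keep every passage at or above the threshold, best first, capped at max_k.

    k therefore varies per query; this is one reading of "dynamic Top-k".
    """
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(corpus)]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return kept[:max_k]
```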
05:07
HuggingFace Papers (takara mirror) · rss · EN · 05:07 · 05·13
JEDI Joint Embedding Diffusion World Model for Online Reinforcement Learning
JEDI learns its latent space end to end from a diffusion denoising loss within a JEPA framework and reports competitive Atari100k results; versus the pixel diffusion baseline, it cuts VRAM by 43%, speeds up world-model sampling by more than 3x, and speeds up training by 2.5x.
#Reasoning#Inference-opt#Benchmarking#JEDI
why featured
HKR-K passes on mechanism and efficiency numbers, while HKR-H is weak and HKR-R stays niche to RL researchers. Technical depth limits audience fit, but no hard-exclusion rule is triggered.
editor take
On Atari100k, JEDI cuts VRAM 43% and speeds sampling over 3× against pixel diffusion; I buy the efficiency, but shifted task profiles smell risky.
64
SCORE
H0·K1·R0
04:37
HuggingFace Papers (takara mirror) · rss · EN · 04:37 · 05·13
Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education
The paper presents KITE, a RAG-based tutoring system for algorithm tracing and problem-solving, using a multimodal retrieval pipeline and intent-aware Socratic responses, and evaluates it with three assessment forms: RAGAs metrics, expert pedagogical review, and simulated two-turn student interactions.
#RAG#Multimodal#Agent#KITE
why featured
HKR-K passes because KITE gives a concrete multimodal RAG and intent-aware tutoring mechanism. HKR-H/R are weak: the academic framing lacks a click hook and only lightly touches practitioner stakes, so it stays in 60–71.
editor take
KITE discloses three eval modes and two-turn simulated students; I don’t buy tutoring efficacy without live classroom data.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:36
27d ago
HuggingFace Papers (takara mirror)· rssEN04:36 · 05·13
Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction
The study tested ALM-based coding on five de-identified motivational interviewing audio sessions, generated 12 reasoning trajectories per utterance from four prompts and three stochastic samples, then used majority voting to reach 52.56% accuracy and 46.40% macro-F1.
#Multimodal#Audio#Reasoning#Research release
why featured
HKR-K passes with concrete mechanism and metrics; HKR-H and HKR-R are weak. The clinical-coding niche and five-session sample keep it far from AI product or industry decisions, so it stays in the low-value research band.
editor take
Five sessions and 12 trajectories hit 52.56% accuracy; self-consistency does not pay off the generalization debt in clinical coding.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
04:00
27d ago
● P1arXiv · cs.LG· atomEN04:00 · 05·13
ExploitGym releases 898 real vulnerability exploitation tasks to test AI agents
ExploitGym introduces 898 exploitation tasks from real vulnerabilities across userspace programs, Google V8, and the Linux kernel; Anthropic Claude Mythos Preview produced working exploits for 157 instances, while OpenAI GPT-5.5 completed 120 instances under the evaluated configurations.
#Agent#Reasoning#Benchmarking#Anthropic
why featured
HKR-H/K/R all pass: the paper has a sharp exploit-agent hook, concrete benchmark numbers, and clear safety resonance. It is still a research benchmark, not a major model or product release, so it stays in the 78–84 featured band.
editor take
ExploitGym has 898 real vuln-exploit tasks, and Claude Mythos Preview clears 157; cyber evals are finally leaving CTF theater.
sharp
Two sources cover ExploitGym with the same core numbers: 898 real vulnerability tasks, 157 successes for Claude Mythos Preview, and 120 for GPT-5.5. That alignment reads like one Berkeley RDI paper/blog source chain, not independent reporting. My take: this benchmark will pressure model labs faster than bug-finding evals, because the target is unauthorized code execution, not a crash PoC. The setup is concrete: source code, build instructions, a triggering PoV input, a containerized runtime, and a two-hour cap per task across 520 userspace, 185 V8, and 193 Linux kernel instances. Don’t overread it as live internet compromise; safety filters were disabled under structured research access, and the body does not disclose attacker cost outside the lab. Still, 157/898 is enough to move exploit development from scary slideware into measurable agent capability.
HKR breakdown
hook knowledge resonance
open source
92
SCORE
H1·K1·R1
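As a quick sanity check on the reported numbers: the per-target split (520 userspace, 185 V8, 193 Linux kernel) sums to the 898-task total, and the success counts translate into overall rates of roughly 17.5% for Claude Mythos Preview and 13.4% for GPT-5.5. A minimal Python sketch of that arithmetic; it does not break successes down by target, since the body does not.

```python
# Task split and success counts as reported for ExploitGym.
SPLIT = {"userspace": 520, "V8": 185, "Linux kernel": 193}
SUCCESSES = {"Claude Mythos Preview": 157, "GPT-5.5": 120}

total = sum(SPLIT.values())  # 898 tasks in total


def success_rate(solved: int, total_tasks: int) -> float:
    """Share of tasks with a working exploit, as a percentage."""
    return 100.0 * solved / total_tasks


for model, solved in SUCCESSES.items():
    print(f"{model}: {solved}/{total} = {success_rate(solved, total):.1f}%")
```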
04:00
27d ago
● P1arXiv · cs.LG· atomEN04:00 · 05·13
DECO: Sparse Mixture-of-Experts Achieves Dense Model Performance on Edge Devices
DECO matches dense Transformer performance under identical total parameter budgets and training tokens while activating only 20% of experts, and its specialized acceleration kernel delivers a 3.00× speedup over dense inference on real hardware.
#Inference-opt#THUNLP#Research release#Open source
why featured
HKR-H/K/R all pass: edge-side sparse MoE is a concrete hook, with 20% activation and 3.00x real-hardware inference speedup. It stays in the 78–84 band because this is an arXiv research release, not a deployed product or major lab launch.
editor take
DECO activates 20% of experts for a 3.00× hardware speedup; I buy the direction, not the leap to phone-ready deployment yet.
sharp
All 3 sources are the same arXiv title across cs.CL and cs.LG, so the alignment is indexing breadth, not independent validation. DECO’s concrete claim is strong: under equal total parameters and training tokens, it activates 20% of experts, matches dense Transformer performance, and reports a 3.00× real-hardware speedup over dense inference. I like the direction for on-device MoE, but the phrase “end-side devices” needs pressure-testing. The abstract does not name the chip, batch size, sequence length, memory bandwidth, or comparisons against llama.cpp, MLC, or ExecuTorch-style deployment stacks. ReLU routing, learnable expert-wise scaling, and NormSiLU sound like practical engineering moves. Without a device matrix, 3.00× is still a clean paper win, not proof that sparse MoE is ready for phones.
HKR breakdown
hook knowledge resonance
open source
92
SCORE
H1·K1·R1
04:00
27d ago
● P1arXiv · cs.LG· atomEN04:00 · 05·13
TextSeal: Localized LLM Watermark for Provenance and Distillation Protection
TextSeal adds dual-key generation, entropy-weighted scoring, and multi-region localization on Gumbel-max sampling; its evaluation reports no perceptible quality difference in 6,000 A/B comparisons across 5 languages.
#Safety#Inference-opt#Benchmarking#TextSeal
why featured
HKR-H/K/R all pass: the paper has a concrete localized-watermark hook, mechanisms, and a 6,000-trial multilingual evaluation. It is strong research signal, not a major lab product release, so it stays below P1.
editor take
TextSeal moves watermarking from whole-text detection to segment-level provenance and distillation traces; if it holds up, model laundering gets harder to deny.
sharp
Two arXiv categories carry the same TextSeal paper with identical framing, so this is one paper signal, not independent validation. The authors claim Gumbel-max sampling, dual-key generation, entropy-weighted scoring, multi-region localization, and no perceptible quality loss across 6,000 A/B judgments in 5 languages. The sharp part is the “radioactive” distillation claim. Classic text watermarking, including SynthID-text-style systems, has struggled with paraphrase, mixed authorship, and low-entropy generations. TextSeal says it localizes watermark signal inside heavily mixed human/AI documents, survives distillation, supports speculative decoding and multi-token prediction, and adds zero inference overhead. I like the ambition, but the abstract does not expose false-positive rates, attack budgets, or third-party replication. Until those are visible, this is a strong lab claim, not a production-grade accountability layer.
HKR breakdown
hook knowledge resonance
open source
90
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
AntiPaSTO trains Gemma-3-1B with 800 synthetic contrasting pairs and no preference labels; on DailyDilemmas its Steering F1 reaches 6.9x the prompting baseline's, and it wins on 5 of 6 tested value axes.
#Alignment#Safety#AntiPaSTO#Gemma
why featured
HKR-K is strong: 800 synthetic pairs, no preference labels, and 6.9x F1 are testable. HKR-H/R pass on honesty steering, but scope is limited to Gemma-3-1B and DailyDilemmas, so it stays below featured.
editor take
AntiPaSTO beats prompting by 6.9x using 800 synthetic pairs; I buy the direction, not the Gemma-3-1B deployment story.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
The Scaling Law of Evaluation Failure: How Data Sparsity and Item Difficulty Gaps Break Simple Averaging
The paper runs simulations across 4 domains and shows simple-average rankings drop from Spearman ρ=1.000 at 100% coverage to ρ=0.809 at 67% coverage under high difficulty heterogeneity, while a 2PL IRT model maintains ρ≥0.996 across all tested conditions.
#Benchmarking#Safety#Research release#Benchmark
why featured
HKR-H/K/R all pass: the title challenges leaderboard averaging, the summary gives testable numbers, and the topic hits evaluation trust. Kept in all because this is a single arXiv methods paper with simulations only; no production adoption or cross-source debate is shown.
editor take
Simple averaging falls to ρ=0.809 at 67% coverage; sparse benchmark leaderboards using means are bias machines.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
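A toy version of the failure mode: when models are scored on different item subsets with very different difficulties, a simple average can invert the true ranking, while an ability estimate under a 2PL item response model corrects for which items each model saw. This sketch assumes the item parameters are already known (the paper fits a 2PL model, including item estimation, which is skipped here); the three models, two items, and coverage pattern are all hypothetical.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def ranks(values):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for r, i in enumerate(order):
        out[i] = r
    return out


def spearman(xs, ys):
    """Spearman rho via the squared-rank-difference formula (no ties)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def estimate_theta(scores, items, lo=-10.0, hi=10.0, iters=100):
    """Ability MLE with known item parameters: solve sum a * (x - p(theta)) = 0 by bisection."""
    def score_eq(theta):
        return sum(a * (x - p_correct(theta, a, b)) for x, (a, b) in zip(scores, items))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if score_eq(mid) > 0:  # observed score exceeds expected, so ability is higher
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


EASY, HARD = (1.0, -2.0), (1.0, 2.0)  # hypothetical (a, b) item parameters
true_theta = [-1.0, 0.0, 1.0]
coverage = [[EASY], [EASY, HARD], [HARD]]  # weakest model only sees easy items, strongest only hard

# Expected correctness on each covered item stands in for observed scores.
scores = [[p_correct(t, a, b) for a, b in items] for t, items in zip(true_theta, coverage)]
simple_avg = [sum(s) / len(s) for s in scores]
irt_theta = [estimate_theta(s, items) for s, items in zip(scores, coverage)]

print(spearman(true_theta, simple_avg))  # -1.0: ranking fully reversed
print(spearman(true_theta, irt_theta))   # 1.0: ranking recovered
```

This is an extreme coverage pattern chosen to make the inversion obvious; the paper's ρ=0.809 figure comes from milder sparsity.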
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Reconsidering the Energy Efficiency of Spiking Neural Networks
The paper re-evaluates SNN energy efficiency against functionally equivalent QNNs using log2(T+1)-bit baselines. Under typical neuromorphic hardware, SNNs with T=5–10 need average spike rates below 6.4% to beat QNNs.
#Inference-opt#Benchmarking#Research release#Benchmark
why featured
HKR-H/K pass via the contrarian SNN claim and the 6.4% test condition. HKR-R misses because the topic is hardware-specialist and far from mainstream model or agent workflows.
editor take
SNNs need sub-6.4% spike rates at T=5–10 to beat QNNs; plenty of neuromorphic efficiency claims need an audit.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
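The log2(T+1)-bit baseline is consistent with a simple counting argument: a spiking neuron run for T timesteps emits between 0 and T spikes, i.e. T+1 distinguishable values, which a quantized activation can encode in log2(T+1) bits. A small sketch of that arithmetic for the T=5–10 range the summary cites; rounding up to an integer bit-width is an assumption here, and the 6.4% spike-rate threshold depends on the paper's hardware energy model, which is not reproduced.

```python
import math


def equivalent_bits(timesteps: int) -> float:
    """Information in a spike count over T steps: values 0..T need log2(T + 1) bits."""
    return math.log2(timesteps + 1)


for t in range(5, 11):
    bits = equivalent_bits(t)
    # Assumption: a practical QNN would round up to a whole-bit activation width.
    print(f"T={t:2d}: {bits:.2f} bits -> {math.ceil(bits)}-bit integer activation")
```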
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
The paper introduces CausalPitfalls, a benchmark that evaluates LLM causal inference with structured tasks across difficulty levels and grading rubrics, covering pitfalls such as Simpson’s paradox and selection bias, and using two protocols: direct prompting and code-assisted prompting with executable statistical analysis.
#Reasoning#Code#Benchmarking#CausalPitfalls
why featured
HKR-H/K/R pass: the title has a counterintuitive hook, the benchmark design adds concrete mechanisms, and the topic hits LLM reliability concerns. Kept in all because the summary gives no scores, sample size, or strong finding.
editor take
CausalPitfalls tests LLMs under 2 prompting protocols. No model scores in the snippet, so don’t buy causal-reasoning claims yet.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
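For the first pitfall the benchmark names, the classic kidney-stone data (Charig et al., 1986) shows Simpson's paradox in a few lines: treatment A has the higher success rate within each stone-size group (93% vs 87% for small, 73% vs 69% for large), yet B looks better once groups are pooled (83% vs 78%), because A was mostly given the harder large-stone cases. This is a textbook illustration, not a CausalPitfalls task.

```python
# Kidney-stone treatment data from Charig et al. (1986): (successes, patients) per stratum.
DATA = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(successes: int, patients: int) -> float:
    return successes / patients


def pooled_rate(groups: dict) -> float:
    """Aggregate rate after summing successes and patients across strata."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return success_rate(successes, patients)


for treatment, groups in DATA.items():
    strata = ", ".join(f"{g} {success_rate(*sn):.0%}" for g, sn in groups.items())
    print(f"{treatment}: {strata}, pooled {pooled_rate(groups):.0%}")
```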
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
The paper proposes RACC, which extracts safety representations from LLM hidden states using a small harmful-prompt calibration set and measures jailbreak test-suite quality with six coverage criteria across individual and compositional safety concepts.
#Safety#Benchmarking#Interpretability#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete method and 6 criteria for safety-test coverage. HKR-H is weak, and the feed lacks results, model scope, or debate signal, so it stays in the 60–71 band.
editor take
RACC calibrates safety representations from a small harmful-prompt set and scores six coverage criteria; I buy the direction, pending reproducible code.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Slicing and Dicing: Configuring Optimal Mixtures of Experts
The paper studies MoE configuration across more than 2,000 pretraining runs with models up to 6.6B total parameters; expert count and granularity dominate final quality, while dropless routing gives a consistent gain.
#Reasoning#Benchmarking#arXiv#Research release
why featured
HKR-K is strong via the experiment count and concrete MoE findings; HKR-R holds for training-cost tradeoffs. A single arXiv paper with an architecture-detail angle keeps HKR-H weak, so it stays in all.
editor take
2,000 pretraining runs make the MoE recipe less mystical: expert count and granularity dominate; dropless routing is the small reliable win.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Controllable User Simulation
The paper formalizes controllable user simulation as a causal inference problem, proves that supervised fine-tuning on post-hoc trajectory labels injects look-ahead bias, and shows that under policy shift this failure makes evaluation metric variance grow geometrically.
#Agent#Fine-tuning#Benchmarking#arXiv
why featured
HKR-K/R pass: the paper maps controllable user simulation to causal inference and flags hindsight-label fine-tuning as an agent-eval hazard. HKR-H is weak, and this is a single arXiv theory paper without a disclosed tool or benchmark.
editor take
The paper proves post-hoc trajectory labels inject look-ahead bias; I buy the framing, but the geometric-variance claim still needs testing at scale.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
SoK: Unlearnability and Unlearning for Model Dememorization
arXiv:2605.11592v1 presents the first integrated analysis of model dememorization, covering pre-training unlearnability and post-training machine unlearning, with 3 stated contributions: a unified taxonomy, empirical evaluation of robustness and shallow dememorization, and a theoretical guarantee on dememorization depth for certified unlearning.
#Safety#Alignment#Fine-tuning#Research release
why featured
HKR-K and HKR-R are clear, and HKR-H is modest, but this is still an arXiv SoK without product impact, benchmark numbers, or visible industry pickup; defaulting to the 60–71 band keeps it in all.
editor take
arXiv 2605.11592 splits dememorization into pre/post-training; the useful part is admitting “forgetting” breaks under weight perturbations.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST labels sampled code candidates with sandbox execution and trains ranking with GRPO, improving LiveCodeBench Best-of-4 across 5 models from 4B to 30B. On Qwen3-30B-Thinking and LiveCodeBench v6, judgment gains +6.2 NDCG, single-sample pass@1 gains +3.1, and Best-of-4 accuracy gains +4.1.
#Code#Reasoning#Fine-tuning#Qwen
why featured
HKR-K and HKR-R pass: the mechanism and LiveCodeBench deltas are concrete, and useful for code-model builders. Single arXiv paper with benchmark gains keeps it in the interesting-not-featured band.
editor take
DuST adds +4.1 Best-of-4 on Qwen3-30B-Thinking LCB v6; discriminative GRPO turns wasted samples into training signal.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
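DuST reports its judgment gains in NDCG. As a reminder of what that metric measures, here is a minimal sketch of NDCG for a judge's ordering of sampled code candidates, assuming binary relevance (1 if a candidate passes sandbox execution, 0 otherwise); the paper's exact relevance definition and cutoff are not given in the summary, and the example ordering is hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1) for 1-indexed positions i."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the judge's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical judge ordering of 4 sampled candidates; 1 = passes the sandbox tests.
ordering = [1, 0, 1, 0]
print(round(ndcg(ordering), 4))
```

Under this reading, Best-of-4 simply submits the top-ranked candidate, so its accuracy depends only on whether position 1 is correct, while NDCG also rewards pushing the other correct candidates up.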
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP improves dLLM parallel decoding with training-free early-convergence token detection; versus confidence-based decoding, it reduces average denoising steps by about 30%, and on GSM8K with dParallel it reaches 7.2 tokens per step while preserving model precision.
#Inference-opt#Reasoning#LEAP#GSM8K
why featured
HKR-K and HKR-R pass: LEAP names a concrete mechanism plus ~30% fewer denoising steps and 7.2 tokens/step. HKR-H is weak and the topic is specialist inference research, so it stays in the 60–71 band.
editor take
LEAP hits 7.2 tokens/step on GSM8K+dParallel; I care how much survives outside the dLLM niche.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
The paper proposes EPGS, which perturbs input embeddings with Gaussian noise and measures gradient-magnitude spikes to detect high-confidence factual errors in LLMs; the abstract says it significantly outperforms entropy-based and representation-based baselines, but does not disclose datasets or exact scores.
#Safety#Interpretability#Benchmarking#Research release
why featured
HKR-H/K/R all pass, but this is a single arXiv method paper with no product integration, open-source artifact, or adoption signal disclosed. Score stays in the 60–71 band as all.
editor take
EPGS probes embedding noise for gradient spikes; datasets and scores are undisclosed, so I’d treat it as a neat hypothesis.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
The paper proposes enterprise discovery agents and evaluates enterprise cascade prediction with CascadeBench; the abstract says offline-trained world models perform well in-distribution but degrade when deployment dynamics change, while discovery-based agents read active configuration at inference time.
#Agent#Reasoning#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: the paper has a clear question hook and a new benchmark. HKR-R is weak, and this is a single arXiv paper without adoption or artifact signals, so it stays in 60–71.
editor take
CascadeBench tests enterprise cascade prediction; I buy runtime config reading, because offline world models are brittle under tenant-logic drift.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
TMRL uses Context-Smoothed Pre-training to inject forward-diffusion noise into policy inputs, then modulates diffusion timesteps during RL fine-tuning, giving explicit exploration control and enabling real-world fine-tuning on complex robot manipulation tasks in under one hour.
#Robotics#Fine-tuning#Research release#Open source
why featured
HKR-K/R pass: one-hour real-robot finetuning and timestep-modulated RL are concrete claims. HKR-H is weak due to a jargon-heavy title, and missing code, lab, and benchmark details keep it in the 60–71 all band.
editor take
TMRL claims sub-1-hour real-robot fine-tuning; I’d stress-test the VLA image-policy case, since task counts aren’t disclosed.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Architecture Determines Observability of Transformers
The paper evaluates 14 models and finds that controlling for output confidence removes 60.3% of raw activation-probe signal on average; on downstream QA, a WikiText-trained probe with no task-specific tuning catches about one in eight confident errors missed by output-confidence monitoring at a 20% flag rate.
#Interpretability#Safety#Benchmarking#Pythia
why featured
HKR-H/K/R pass: the paper challenges probe assumptions and gives concrete numbers across 14 models. Kept in all because this is a single arXiv interpretability paper with no product, model release, or external replication.
editor take
Across 14 models, confidence control removes 60.3% of probe signal; stop treating probes as magic: architecture decides observability up front.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
GRAFT: Graph-Tokenized LLMs for Tool Planning
GRAFT maps each tool node to a dedicated special token and trains on the model’s sampled trajectories with on-policy tool context distillation; the paper reports state-of-the-art results on exact sequence matching and dependency legality, while the RSS abstract does not disclose dataset names or numerical scores.
#Agent#Tools#GRAFT#Research release
why featured
HKR-H/K/R pass via the graph-token tool-planning mechanism and dependency-validity claim. Importance stays below featured because this is a single arXiv paper with no disclosed production adoption or ecosystem traction.
editor take
GRAFT tokenizes tool nodes; datasets and scores are undisclosed, so treat the SOTA claim as abstract-level only.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
SkillGen: Verified Inference-Time Agent Skill Synthesis
SkillGen synthesizes one auditable skill from base-agent trajectories, uses contrastive induction over successful and failed trajectories, and verifies impact by comparing the same instances with and without the skill to count both repairs and regressions.
#Agent#Reasoning#Tools#SkillGen
why featured
HKR-K/R pass: the paper gives a concrete skill-synthesis and regression-check mechanism. No model, task set, success rate, or artifact is disclosed, so it stays in the 60–71 band.
editor take
SkillGen synthesizes 1 auditable skill; counting regressions beside repairs is the agent-skill eval hygiene many papers skip.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
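The with/without comparison described above amounts to a paired count over the same instances: a repair is an instance that fails without the skill and passes with it, and a regression is the reverse. A minimal sketch, assuming per-instance boolean outcomes; the names `paired_impact` and `SkillImpact` and the example runs are hypothetical, not SkillGen's code.

```python
from dataclasses import dataclass


@dataclass
class SkillImpact:
    repairs: int      # failed without the skill, solved with it
    regressions: int  # solved without the skill, failed with it
    unchanged: int    # same outcome in both runs

    @property
    def net_gain(self) -> int:
        return self.repairs - self.regressions


def paired_impact(without_skill: dict, with_skill: dict) -> SkillImpact:
    """Compare per-instance success (instance_id -> bool) on the same instance set."""
    if without_skill.keys() != with_skill.keys():
        raise ValueError("both runs must cover the same instances")
    repairs = sum(1 for k in without_skill if not without_skill[k] and with_skill[k])
    regressions = sum(1 for k in without_skill if without_skill[k] and not with_skill[k])
    return SkillImpact(repairs, regressions, len(without_skill) - repairs - regressions)


base = {"t1": False, "t2": True, "t3": False, "t4": True}
skilled = {"t1": True, "t2": False, "t3": True, "t4": True}
impact = paired_impact(base, skilled)
print(impact)  # SkillImpact(repairs=2, regressions=1, unchanged=1)
```

Reporting only the net success delta (+1 here) would hide the regression on `t2`, which is why counting both sides matters.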
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Training-Inference Consistent Segmented Execution for Long-Context LLMs
The paper proposes a training-inference consistent segment-level generation framework that restricts gradient propagation to KV states from the immediately preceding segment, while allowing head-specific forward access to older KV states, and reports about 6x lower peak prefill memory at 128K than full-context attention with FlashAttention.
#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete mechanism and a roughly 6x lower peak prefill memory claim at 128K, tied to long-context serving cost. HKR-H is weak, and no code, major-lab validation, or production adoption is disclosed.
editor take
At 128K, prefill peak memory drops ~6x; I’m watching whether truncated cross-segment credit assignment quietly costs capability.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Latent Chain-of-Thought Improves Structured-Data Transformers
The paper proposes a recurrent latent CoT scheme for structured-data Transformers and evaluates it on 36 time-series and tabular datasets; it beats the baseline on 8 of 9 time-series datasets with a 10.99% average gain and on 22 of 27 tabular datasets with a 5.31% average gain.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-H and HKR-K pass: the mechanism and 36-dataset results are concrete. As a single arXiv paper without named-lab pull or production replacement evidence, it stays in the lower interesting band.
editor take
Latent CoT wins 30 of 36 structured-data datasets; I buy the signal, pending compute-matched depth details.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Toxicity Detection Should Measure Contextual Harm, Not Text-Intrinsic Badness
arXiv:2503.16072v4 proposes the Contextual Stress Framework, defining toxicity as a relation between perceived norm violation and induced stress or disruption, and introduces CSF-Eval to separate text risk, norm violation, disruption, uncertainty, and policy action.
#Safety#Alignment#Benchmarking#Research release
why featured
HKR-H/K/R all pass, but the evidence is an arXiv framework summary only, with no major-lab backing, deployment case, or visible debate. This stays in the upper 60–71 research-release band.
editor take
CSF-Eval splits toxicity into 5 evaluation targets; I buy the direction, but no dataset or metrics are disclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Differentially Private Synthetic Text Generation for Retrieval-Augmented Generation (RAG)
The paper proposes DP-SynRAG, a framework that uses LLMs to generate reusable differentially private synthetic RAG databases, avoiding repeated query-time noise injection and additional privacy loss under a fixed privacy budget.
#RAG#Safety#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete DP-SynRAG mechanism for reusable private RAG stores. No metrics, epsilon settings, or deployment results are disclosed, so it stays below featured.
editor take
DP-SynRAG moves DP noise into a reusable synthetic corpus; no epsilon or datasets disclosed, so I don't buy the SOTA claim yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
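The reuse argument rests on two standard differential-privacy facts: under basic sequential composition, k mechanisms that are each ε-DP on the same data are together kε-DP, while post-processing a DP output (here, retrieving from an already-private synthetic corpus) costs no additional budget. A back-of-the-envelope sketch with hypothetical ε values, since the post discloses none; advanced composition theorems give tighter bounds, but those still grow with the number of queries.

```python
def per_query_noise_cost(per_query_eps: float, num_queries: int) -> float:
    """Total epsilon under basic sequential composition: per-query budgets add up."""
    return per_query_eps * num_queries


def synthetic_corpus_cost(generation_eps: float, num_queries: int) -> float:
    """One-time DP generation; later retrieval is post-processing, so the cost stays flat."""
    return generation_eps  # num_queries deliberately does not enter the total


# Hypothetical budgets: 0.5 per noisy query vs 2.0 spent once on synthesis.
for q in (1, 10, 100):
    print(q, per_query_noise_cost(0.5, q), synthetic_corpus_cost(2.0, q))
```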
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
The paper analyzes two channels of spurious correlation learning in preference optimization for log-linear policies, mean spurious bias and causal-spurious correlation leakage, and proposes tie training with equal-utility preference pairs to reduce reliance on spurious features without degrading causal learning.
#Alignment#Safety#Fine-tuning#Research release
why featured
HKR-K/R pass: the paper gives two DPO spurious-correlation channels and a tie-training mitigation. Single arXiv summary, no experiment numbers or code disclosed, and the topic is technical, so it stays in 60–71.
editor take
DPO gets two spurious-correlation channels in log-linear policies; tie training is neat, but LLM scale is undisclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe
The paper tests decentralized MAPF on POGEMA 8x8 maps with four agents: PPO reaches 95.8% clean success and 2.5% under the strongest attack, while Adv-PPO+MACER raises worst-case success to 77.5% ± 6.0% across three seeds with under one percentage point clean-cost.
#Agent#Robotics#Safety#arXiv
why featured
HKR-H/K/R pass, but this is a narrow MAPF robustness paper rather than a broad agent product or major lab release. Concrete attack and recovery numbers keep it in all, below featured.
editor take
Adv-PPO+MACER lifts strong-attack success from 2.5% to 77.5%±6.0%; tiny 8x8/4-agent setup, but the robustness gain is concrete.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models
FERMI improves membership inference attacks against tabular diffusion models across three architectures and three real-world relational datasets, raising TPR@0.1FPR over single-table baselines by up to 53% in white-box settings and 22% in black-box settings.
#Safety#Benchmarking#FERMI#arXiv
why featured
HKR-K and HKR-R pass: the paper gives a concrete attack setup and +53% TPR@0.1FPR. HKR-H is weak, and the single arXiv paper stays in the interesting-but-not-featured band.
editor take
FERMI lifts TPR@0.1FPR by up to 53% across 3 architectures and 3 datasets; single-table privacy tests look underfit.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
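TPR at a fixed FPR is the usual headline metric for membership inference: pick the score threshold that flags at most the target fraction of non-members, then report what fraction of true members it catches. A minimal sketch, assuming higher attack scores mean "more likely a training member"; the target FPR is a parameter because the summary's "TPR@0.1FPR" notation does not say whether 0.1 means 10% or 0.1%, and the scores below are hypothetical.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR at the strictest threshold that flags at most target_fpr of non-members."""
    nonmember_sorted = sorted(nonmember_scores, reverse=True)
    k = int(target_fpr * len(nonmember_sorted))  # non-members allowed above the threshold
    # Members must strictly exceed the (k+1)-th highest non-member score.
    threshold = nonmember_sorted[k] if k < len(nonmember_sorted) else float("-inf")
    return sum(s > threshold for s in member_scores) / len(member_scores)


nonmembers = list(range(10))   # hypothetical scores for 10 non-members
members = [5, 8.5, 9.5, 12, 3]  # hypothetical scores for 5 members
print(tpr_at_fpr(members, nonmembers, 0.1))  # threshold 8 -> 3 of 5 members caught
```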
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
The paper proposes Instruction Lens Score for detecting object hallucinations in MLLMs, combining a Calibrated Local Score with a Context Consistency Score, and the method requires no auxiliary model or additional training while reporting tests across multiple benchmarks and MLLM architectures.
#Multimodal#Vision#Safety#Research release
why featured
HKR-H/K/R all pass, but the post gives no performance numbers, benchmark results, or code status. This is useful research signal, not a same-day industry story.
editor take
InsLen detects object hallucination without training; the abstract gives no benchmark numbers, so treat it as a candidate to reproduce, not a proven defense.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
The paper introduces Test-Time Personalization, sampling N candidates from a personalized policy model and selecting the best with a personalized reward model; the authors prove oracle selection has expected utility that grows logarithmically with the candidate count.
#Reasoning#Inference-opt#Alignment#Research release
why featured
HKR-K is clear: the paper gives a testable mechanism and a logarithmic utility claim. HKR-R is moderate for personalization builders, but HKR-H is weak and the article is a single arXiv paper with no adoption or concrete experiment numbers.
editor take
TTP samples N candidates then reranks; the log-utility ceiling is clean, but N, task count, and baselines aren’t disclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
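One standard way to see why best-of-N selection grows only logarithmically: if candidate utilities were i.i.d. Exponential(1), the expected maximum of N draws is exactly the harmonic number H_N, which grows like ln N. This is purely an illustrative distributional assumption, not the paper's proof or setting; the sketch compares H_N, ln N, and a Monte Carlo estimate.

```python
import math
import random


def harmonic(n: int) -> float:
    """H_n = 1 + 1/2 + ... + 1/n, the expected max of n i.i.d. Exponential(1) draws."""
    return sum(1.0 / k for k in range(1, n + 1))


def simulated_best_of_n(n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of n i.i.d. Exponential(1) utilities]."""
    rng = random.Random(seed)
    return sum(max(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)) / trials


for n in (1, 4, 16, 64):
    print(n, round(harmonic(n), 3), round(math.log(n), 3), round(simulated_best_of_n(n), 3))
```

Each 4x increase in N adds only about ln 4 ≈ 1.39 to the expected best utility, which is the "diminishing returns" shape a logarithmic bound describes.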
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling
The paper introduces EvoTD, a data-synthesis framework that searches a dual-axis space of algorithmic skills and complexity attributes, using Crossover, Parametric Mutation, and a dynamic ZPD filter to generate learnable reasoning tasks.
#Reasoning#Fine-tuning#EvoTD#Research release
why featured
HKR-K passes via a concrete task-generation mechanism; HKR-R is narrow to reasoning-training practitioners. A single arXiv abstract with no benchmark gains, repo, or reproducibility details stays in all.
editor take
EvoTD turns synthetic tasks into skill×complexity search; no gain numbers in the snippet, so judge it by code reproducibility first.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
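A "zone of proximal development" filter is commonly read as keeping tasks the current model sometimes, but not always, solves. A minimal sketch of a static version, assuming pass rates come from a few sampled attempts per task; the band limits, task names, and outcomes are hypothetical, and EvoTD's filter is described as dynamic, so its band presumably moves during training in ways this sketch does not model.

```python
from typing import Dict, List


def estimate_pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled attempts that solved the task."""
    return sum(outcomes) / len(outcomes)


def zpd_filter(pass_rates: Dict[str, float], low: float = 0.1, high: float = 0.9) -> List[str]:
    """Keep tasks that are neither trivially solved nor hopeless for the current model."""
    return [task for task, rate in pass_rates.items() if low <= rate <= high]


sampled = {  # hypothetical skill-combination tasks at two complexity levels
    "sort+graph, n=10": [True] * 8,
    "sort+graph, n=50": [True, False, True, False, False, True, False, False],
    "dp+graph, n=50": [False] * 8,
}
rates = {task: estimate_pass_rate(o) for task, o in sampled.items()}
print(zpd_filter(rates))  # ['sort+graph, n=50']
```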
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning
The paper proposes GCPO, replacing independent rollout scoring with team-level credit assignment; correct non-redundant rollouts contribute to a determinant-volume coverage over reward-weighted semantic embeddings, and the code is planned for release.
#Reasoning#Alignment#Benchmarking#Research release
why featured
Single arXiv methods paper with a concrete RL mechanism, so HKR-H/K pass. Missing authorship signal, experiment numbers, and released code keep it in the 60–71 band.
editor take
GCPO pays non-redundant correct rollouts via determinant-volume credit; I buy the direction, but the abstract lacks base models and gains.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
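The summary's "determinant-volume coverage" can be read through a Gram determinant: stacking reward-weighted embeddings of correct rollouts as rows of a matrix V, det(V Vᵀ) is the squared volume they span, and a redundant rollout adds nothing to it. A pure-Python sketch of that quantity only; the embeddings are hypothetical, and how GCPO weights rewards and turns volume into per-rollout credit is not shown and is not reproduced here.

```python
from typing import List

Vector = List[float]


def det(m: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, result = len(m), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def gram_volume_sq(vectors: List[Vector]) -> float:
    """Squared volume spanned by the vectors: determinant of their Gram matrix."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    return det(gram)


def weighted(embedding: Vector, reward: float) -> Vector:
    return [reward * x for x in embedding]


rollouts = [  # (hypothetical embedding, reward); reward 1.0 = correct
    ([1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], 1.0),
    ([1.0, 0.0, 0.0], 1.0),  # correct but redundant with the first
]
diverse = [weighted(e, r) for e, r in rollouts[:2]]
redundant = [weighted(e, r) for e, r in (rollouts[0], rollouts[2])]
print(gram_volume_sq(diverse), gram_volume_sq(redundant))  # 1.0 0.0
```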
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
The paper proposes OPEFO, a strict on-policy entropy-flow balancing method that rescales token-level entropy-increasing and entropy-decreasing updates by their contribution to entropy change, and reports improved RLVR training stability and final performance on six mathematical reasoning benchmarks.
#Reasoning#Alignment#Fine-tuning#Research release
why featured
HKR-H/K pass: the paper names a testable RLVR instability mechanism and proposes OPEFO with 6 math benchmarks. The topic is specialized training research; code, model scale, and external replication are not disclosed, so it stays in all.
editor take
OPEFO improves RLVR stability on six math benchmarks; until code and models land, don’t swap out GRPO stacks.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
The paper validates a three-regime framework for context-parametric conflict with 9,970 API calls across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3, reporting Regime 2 certainty gradients for all five models and Regime 3 task framing shifts from near-100% context following to 6–71%.
#Reasoning#Benchmarking#Anthropic#OpenAI
why featured
HKR-K and HKR-R pass via the 9,970-call multi-model evaluation, but HKR-H fails. The summary lacks main findings, effect sizes, and reproducible setup details, so this stays in all rather than featured.
editor take
9,970 calls split context-vs-memory conflict into three regimes; I buy the frame if open task sets reproduce it.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
DreamPolicy: A Unified World-Model Policy for Scalable Humanoid Locomotion
DreamPolicy uses an autoregressive diffusion world model trained on aggregated rollouts from specialized policies to generate future trajectories; experiments report up to 27% higher performance than the strongest baseline on unseen terrains and 38% on combined terrains.
#Robotics#Reasoning#DreamPolicy#Research release
why featured
HKR-K is strong with a concrete mechanism and two benchmark gains. HKR-R is narrower to robotics, HKR-H is weak, and the article only provides abstract-level detail, so it stays below featured.
editor take
DreamPolicy reports +27% on unseen terrains and +38% on combined terrains; I buy the route, but hardware transfer is undisclosed.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics
The paper shows that aggregation distorts behavioral curves: on Goodreads with 3.3M users across 9 genres, individual users peak at about 11 exposures while the aggregate peaks at about 34, and Amazon Electronics with 18M reviews shows a 5.3x distortion driven by survival bias.
#Benchmarking#Goodreads#Amazon#MovieLens
why featured
HKR-H/K/R all pass, but this is a methodological arXiv paper with impact centered on recommender and user-dynamics modeling; concrete datasets and survival-bias mechanism keep it in the high-all band.
editor take
Goodreads peaks at 11 individual vs 34 aggregate exposures; tuning rec frequency on aggregates bakes in survivor bias.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
VideoGPA uses a geometry foundation model to derive dense preference signals and trains video diffusion models with DPO; the abstract says it uses minimal preference pairs, but the post does not disclose the exact count.
#Multimodal#Vision#Alignment#VideoGPA
why featured
HKR-K is solid via the geometry-prior preference signal plus DPO mechanism, and HKR-R lands for video-generation quality pain. Missing metrics, lab context, and exact pair counts keep it in the 60–71 research band.
editor take
VideoGPA feeds DPO with geometry-derived preferences; pair count is undisclosed, so I buy the automation, not the “minimal” claim.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib decouples model-internal tensor observation from the LLM inference hot path using Ring^2, with 0.4%–6.8% overhead in offline batch inference, 6% average overhead in moderate online serving, and 2x–15x lower latency overhead than comparable observability baselines.
#Inference-opt#Interpretability#Tools#DMI-Lib
why featured
HKR-K/R pass: Ring^2 plus overhead numbers make a testable systems claim, and low-overhead internals matter for serving teams. HKR-H is weak; this is a narrow arXiv systems tool, so it stays in 60–71.
editor take
DMI-Lib cuts tensor-observation overhead to 0.4%–6.8% offline and 6% online; observability is becoming serving infrastructure, not debug glue.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
An End-to-End Framework for Building Large Language Models for Software Operations
The paper proposes OpsLLM for software-operations QA and root-cause analysis, using human-in-the-loop data curation, supervised fine-tuning, and a domain process reward model for reinforcement learning; it reports 0.2%–5.7% QA accuracy gains and 2.7%–70.3% RCA gains over existing open-source and closed-source LLMs.
#Fine-tuning#Reasoning#Alignment#OpsLLM
why featured
HKR-K and HKR-R pass via concrete training mechanisms and RCA gains, but HKR-H fails because the angle is a dry framework paper. Single arXiv item, useful but below the 72 featured threshold.
editor take
OpsLLM reports 2.7%–70.3% RCA gains; with only 15K SFT samples, that 70.3% smells like a soft baseline.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
The paper proposes that LLMs update in-context beliefs in a low-dimensional conceptual belief space and tests this on story understanding, reporting 3 findings: belief trajectories lie on structured manifolds, linear probes decode representations to predict behavior, and representation interventions causally steer trajectories.
#Reasoning#Interpretability#Research release
why featured
HKR-H and HKR-K pass: the title has a clear conceptual hook, and the summary gives concrete mechanisms such as probes and interventions. Impact stays research-heavy, with no code, model scale, or applied result disclosed.
editor take
The paper reports 3 story-understanding findings; I like the low-dimensional trajectory hook, but RSS omits models, layers, and task scale.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R0
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
SureLock locks unmasked positions whose posterior has stabilized during Masked Diffusion LM decoding and skips their query projection and feed-forward sublayers, reducing per-iteration cost from O(N²d) to O(MNd), where M is the number of positions not yet locked, with 30–50% lower algorithmic FLOPs on LLaDA-8B at comparable generation quality.
#Inference-opt#Reasoning#LLaDA#SureLock
why featured
HKR-K is strong and HKR-R is present through inference-cost pressure. The scope is narrow Masked Diffusion LM research with no product adoption data, so it stays in the 60–71 band.
editor take
SureLock cuts LLaDA-8B algorithmic FLOPs by 30–50%; diffusion LMs first need to squeeze out wasted decoding compute.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
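A minimal sketch of the locking step, with two assumptions not taken from the paper: a position counts as converged once the total-variation distance between its posteriors on consecutive iterations falls below a hypothetical `tol`, and M is read as the number of positions not yet locked. Locked positions are assumed to keep serving as cached keys, so only the query side shrinks from N to M.

```python
import numpy as np

def update_locks(prev_post, curr_post, unmasked, locked, tol=1e-3):
    """Return an updated boolean lock mask over N positions.

    prev_post, curr_post: (N, V) posteriors from two consecutive decoding iterations.
    unmasked: (N,) True where the token has already been unmasked.
    locked: (N,) True where the position is already locked.
    Hypothetical stability test: total-variation distance below tol.
    """
    tv = 0.5 * np.abs(curr_post - prev_post).sum(axis=-1)
    return locked | (unmasked & (tv < tol))

def attention_cost(n_keys, n_queries, d):
    """Score-matrix cost of n_queries attending to n_keys at width d."""
    return n_queries * n_keys * d

N, V, d = 6, 4, 8
prev = np.full((N, V), 0.25)
curr = prev.copy()
curr[3:] = [0.7, 0.1, 0.1, 0.1]            # positions 3-5 are still changing
unmasked = np.array([True, True, False, True, True, False])
locked = update_locks(prev, curr, unmasked, np.zeros(N, dtype=bool))
M = int((~locked).sum())                    # positions still computed this iteration
# Locked positions stay as keys (assumed cached), so keys remain N.
print(locked.tolist(), attention_cost(N, N, d), attention_cost(N, M, d))
```

With 2 of 6 positions locked, the score-matrix cost drops from N·N·d to M·N·d, a one-third cut in this toy case; the reported savings also include the skipped feed-forward sublayers.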
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
The paper tests target-adaptive text-tabular prediction in controlled bargaining and negotiation games, training on 13 frontier-LLM agents and testing on 91 held-out scaffolded agents; at K=16, Observer features improve response-prediction AUC by about 4 points and reduce bargaining offer-prediction error by 14%.
#Agent#Reasoning#Benchmarking#arXiv
why featured
HKR-H/K/R all pass via the agent-profiling hook, concrete K=16 results, and predictability concerns. The work stays inside controlled bargaining games, so it fits the 60–71 research-signal band rather than featured.
editor take
13 LLMs train, 91 agents test; K=16 adds 4 AUC points, making counterpart modeling feel experimentally real.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Principled Latent Diffusion for Graphs via Laplacian Autoencoders
LG-Flow moves graph diffusion into a latent representation that scales linearly with node count, supports near-lossless reconstruction for undirected graphs and DAGs, and reports up to a 1000x speed-up over state-of-the-art graph diffusion models.
#Reasoning#Inference-opt#LG-Flow#Research release
why featured
HKR-H/K pass on the 1000x speedup and linear latent mechanism. HKR-R fails: graph diffusion is specialized, and the post does not disclose code, benchmark setup, or product impact.
editor take
LG-Flow reports up to 1000x speedup; I want the near-lossless decoder tested on large sparse graphs and constrained DAGs.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R0
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
The paper decomposes the RLHF-DPO performance gap into an explicit representation gap under exact optimization and an implicit representation gap under finite samples, and shows in a sparse ground-truth reward construction that RLHF needs fewer samples than DPO to recover an effective reward model.
#Fine-tuning#Alignment#Reasoning#Research release
why featured
HKR-H/K/R all pass: the paper targets the RLHF/DPO tradeoff with concrete representation-gap and sample-need claims. I keep it at 68 because it is a single theory-heavy arXiv item with no disclosed code, scale, or adoption signal.
editor take
The paper shows sparse-reward cases where RLHF needs fewer samples than DPO; skipping the reward model just moves the bill to data.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
BOOST: Bottleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
BOOST proposes Bottleneck-aware Tensor Parallelism for low-rank bottleneck LLM training, combining online-RMSNorm, linear-layer grouping, and low-rank activation checkpointing; evaluations report 1.46-1.91x speedup over full-rank baselines and 1.87-2.27x over naive 3D parallelism.
#Inference-opt#Research release
why featured
HKR-K/R pass: the paper gives 1.46-1.91x training speedups and concrete optimization mechanisms. HKR-H is weak, and low-rank training infrastructure is too niche for featured.
editor take
BOOST reports 1.46-1.91x training speedups; I want the accuracy ledger, since the abstract only says “minimum impact.”
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Elastic Attention Cores for Scalable Vision Transformers
VECA replaces direct patch-to-patch attention with C learned core tokens, so N image patches exchange information only through the cores and ViT attention complexity drops from O(N²) to O(N) when C is fixed.
#Vision#Inference-opt#Alan Z. Song#Andrew F. Luo
why featured
HKR-H/K/R all pass narrowly: the mechanism and complexity claim are concrete, and cost resonates. Single arXiv paper; excerpt lacks benchmarks, code, and reproducible results, so it stays in the 60–71 band.
editor take
VECA cuts ViT attention from O(N²) to O(N); I buy the direction, but “competitive” lacks numbers here.
HKR breakdown
hook knowledge resonance
68
SCORE
H1·K1·R1
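A single-head NumPy sketch of attention routed through C core tokens, following the summary's description: cores gather from patches, then patches read back from the cores. Learned projections, the multi-head split, and how VECA updates or resizes its cores ("elastic") are not described here, so they are omitted; `core_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def core_attention(patches, cores):
    """Two-hop attention through C core tokens (single head, no projections).

    patches: (N, d) patch tokens; cores: (C, d) core tokens.
    Patches never attend to each other directly, so the score matrices are
    (C, N) and (N, C): cost O(N*C*d), linear in N for fixed C.
    """
    d = patches.shape[-1]
    gather = softmax(cores @ patches.T / np.sqrt(d))            # (C, N)
    updated_cores = gather @ patches                            # (C, d)
    scatter = softmax(patches @ updated_cores.T / np.sqrt(d))   # (N, C)
    return scatter @ updated_cores, updated_cores               # (N, d), (C, d)

rng = np.random.default_rng(0)
N, C, d = 196, 8, 32          # e.g. a 14x14 patch grid with 8 cores
out, cores_out = core_attention(rng.normal(size=(N, d)), rng.normal(size=(C, d)))
print(out.shape, cores_out.shape)
```

Doubling N doubles the size of both score matrices here, whereas full patch-to-patch attention would quadruple its N×N matrix.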
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
The paper proposes a batch-adaptive RL post-training objective that replaces fixed clipping with normalized effective sample size from policy ratios. The same statistic caps score-function weights and sets an off-policy regularizer, so updates tighten when stale or mismatched data concentrate ratios; experiments report matching or exceeding tuned baselines, with no new objective hyperparameters and code released on GitHub.
#Fine-tuning#Alignment#FeynRL#Research release
why featured
HKR-K/R pass: the mechanism is concrete and targets RL post-training clipping and tuning pain. HKR-H is weak, and beyond a GitHub code release the single arXiv item gives no experiment numbers or artifact details, so it stays in all.
editor take
FeynRL swaps fixed clipping for normalized ESS with zero new objective hyperparams; I buy the direction, pending code-level reproduction.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
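Normalized effective sample size is a standard importance-sampling statistic; a minimal sketch of computing it from policy log-probabilities follows. How the paper maps it onto the score-function weight cap and the off-policy regularizer is not given in the summary, so that step is left out.

```python
import numpy as np

def normalized_ess(logp_new, logp_old):
    """Normalized ESS of policy ratios w = pi_new / pi_old.

    ESS = (sum w)^2 / sum w^2, divided by n so the result lies in (0, 1].
    1.0 means all ratios are equal (a fresh on-policy batch); values near 1/n
    mean a few samples dominate (stale or mismatched data).
    """
    log_w = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    w = np.exp(log_w - log_w.max())   # a common rescaling cancels; avoids overflow
    return float(w.sum() ** 2 / (w ** 2).sum() / w.size)

on_policy = normalized_ess(np.zeros(8), np.zeros(8))
stale = normalized_ess(np.array([3.0, 0, 0, 0, 0, 0, 0, 0]), np.zeros(8))
print(on_policy, round(stale, 3))
```

A single sample whose ratio is e³ times the others already pulls the statistic from 1.0 to roughly 0.22, which is the kind of concentration the method uses to tighten updates.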
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
SimDist pretrains action-conditioned robotic world models with physics simulators, then adapts to real-world data by transferring the encoder, reward model, and value function while updating only the latent dynamics model with prediction losses. The paper reports gains across contact-rich manipulation and quadruped locomotion tasks, but the RSS snippet does not disclose task counts, dataset size, or quantitative scores.
#Robotics#Reasoning#Research release#Open source
why featured
HKR-K and HKR-R pass: SimDist’s sim pretraining plus real-phase latent-dynamics update is a concrete robotics mechanism. HKR-H is weak, and the snippet gives no success rate, sample count, or artifact, so it stays in all.
editor take
SimDist updates only latent dynamics; task counts and scores are missing, so I buy the mechanism, not the “rapid” label.
HKR breakdown
hook knowledge resonance
68
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Training Transformers for KV Cache Compressibility
The paper proposes KV-Compression Aware Training, a continued pretraining method that masks KV slots during training so the model uses fewer cache entries; experiments evaluate downstream compression quality-budget tradeoffs on retrieval, long-context QA, and compressed-prefix continuation perplexity.
#Inference-opt#Memory#Reasoning#Research release
why featured
HKR-K/R pass: the mechanism is clear and KV-cache cost is practical. HKR-H is weak, and the body discloses no compression, latency, or accuracy numbers, so this stays below featured.
editor take
KV-CAT masks KV slots during continued pretraining; I buy the bet: cache compression needs training pressure, not post-hoc tricks alone.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
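A minimal sketch of masking KV slots during training, assuming the simplest policy: each slot is dropped with a fixed probability, and a dropped slot stays invisible to every later query, like an evicted cache entry. The paper's actual masking schedule, budget, and loss are not described in the summary; `kv_masked_attention` and `keep_prob` are hypothetical.

```python
import numpy as np

def kv_masked_attention(q, k, v, keep_prob, rng):
    """Causal single-head attention with randomly dropped KV cache slots.

    Hypothetical mask: each slot t is kept with probability keep_prob, and the
    same decision applies to every later query. Each query's own slot is
    always kept so no row of the attention matrix is empty.
    """
    T, d = q.shape
    causal = np.tril(np.ones((T, T), dtype=bool))
    slot_kept = rng.random(T) < keep_prob                   # one draw per slot
    allowed = (causal & slot_kept[None, :]) | np.eye(T, dtype=bool)
    scores = np.where(allowed, q @ k.T / np.sqrt(d), -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                                # exp(-inf) -> 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, allowed

rng = np.random.default_rng(0)
T, d = 6, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out, allowed = kv_masked_attention(q, k, v, keep_prob=0.5, rng=rng)
full, _ = kv_masked_attention(q, k, v, keep_prob=1.0, rng=rng)  # plain causal attention
```

Training on `out` instead of `full` pushes the model to route information through whichever slots survive, which is the pressure the editor take argues post-hoc compression lacks.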
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Efficient Adjoint Matching for Fine-tuning Diffusion Models
The paper proposes Efficient Adjoint Matching for reward fine-tuning of diffusion models, reformulating the SOC problem with a linear base drift and modified terminal cost, and reports up to 4x faster convergence than AM on text-to-image benchmarks including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.
#Fine-tuning#Vision#Alignment#Research release
why featured
HKR-K and HKR-R pass on the 4x convergence claim, SOC rewrite, and training-cost angle. HKR-H is weak because the title is a dense method name, and the audience is mostly diffusion fine-tuning researchers.
editor take
EAM reports up to 4x faster convergence than AM; closed-form adjoints are the cost cut diffusion RLHF needed.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making
The paper proposes MAC, a masked agent collaboration framework that selects Pareto-optimal LLM agents using model size, inference time, diversity score, and throughput ratio, then masks the agent output with the lowest cross-consistency value during medical decision-making collaboration.
#Agent#Reasoning#Research release
why featured
HKR-K and HKR-R pass: the mechanism is concrete and medical decisions sharpen the reliability stakes. Metrics, datasets, and baselines are not disclosed here, so it stays in the 60–71 band.
editor take
MAC selects agents via 4 metrics, then masks lowest consistency; no dataset or gain is disclosed, so I don't buy the medical-decision uplift yet.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
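A sketch of MAC's two selection steps, with hypothetical metric keys and a hypothetical cross-consistency measure (the share of other agents that gave the same answer). The paper's exact consistency definition and how it combines the unmasked outputs are not given here; majority voting over the remaining answers is an assumption.

```python
from collections import Counter

LOWER_BETTER = ("size_b", "latency_s")
HIGHER_BETTER = ("diversity", "throughput_ratio")

def dominates(a, b):
    """True if agent a is at least as good as b on every metric and better on one."""
    at_least = (all(a[m] <= b[m] for m in LOWER_BETTER)
                and all(a[m] >= b[m] for m in HIGHER_BETTER))
    strictly = (any(a[m] < b[m] for m in LOWER_BETTER)
                or any(a[m] > b[m] for m in HIGHER_BETTER))
    return at_least and strictly

def pareto_front(agents):
    """Keep agents that no other agent dominates."""
    return [a for a in agents if not any(dominates(b, a) for b in agents)]

def mask_and_vote(answers):
    """Mask the answer with the lowest cross-consistency, then majority-vote.
    Assumes at least two agents."""
    n = len(answers)
    consistency = [sum(other == a for j, other in enumerate(answers) if j != i) / (n - 1)
                   for i, a in enumerate(answers)]
    masked = min(range(n), key=lambda i: consistency[i])
    kept = [a for i, a in enumerate(answers) if i != masked]
    return Counter(kept).most_common(1)[0][0], masked

agents = [
    {"name": "A", "size_b": 7, "latency_s": 1.0, "diversity": 0.5, "throughput_ratio": 1.0},
    {"name": "B", "size_b": 13, "latency_s": 2.0, "diversity": 0.4, "throughput_ratio": 0.8},
    {"name": "C", "size_b": 70, "latency_s": 5.0, "diversity": 0.9, "throughput_ratio": 0.6},
]
print([a["name"] for a in pareto_front(agents)])
print(mask_and_vote(["pneumonia", "pneumonia", "bronchitis", "pneumonia"]))
```

Agent B drops out because A is smaller, faster, more diverse, and higher-throughput; C survives on diversity alone. The agent metrics and diagnoses are made-up illustrations.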
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Block-R1: Rethinking Block Size in Multi-domain RL for Diffusion Large Language Models
Block-R1 studies block-size conflict in multi-domain RL post-training for diffusion large language models and releases the 41K-sample Block-R1-41K dataset, a Block Size Conflict Score, and a benchmark; experiments cover 13 datasets, 7 RL algorithms, and multiple dLLM backbones.
#Reasoning#Benchmarking#Fine-tuning#Block-R1
why featured
HKR-K is strong: the paper gives a dataset, metric, and benchmark scale. HKR-H comes from the unusual “block size conflict” angle; it stays in all because the dLLM/RL scope is narrow and lacks broad practitioner resonance.
editor take
Block-R1 spans 13 datasets and 7 RL algorithms; dLLM post-training should stop treating block size as an inference knob.
HKR breakdown
hook knowledge resonance
67
SCORE
H1·K1·R0
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Interpretability Can Be Actionable
The paper proposes evaluating interpretability by actionability, defines two dimensions—concreteness and validation—and identifies five domains where interpretability provides unique leverage; the RSS abstract does not disclose the domain list or empirical results.
#Interpretability#Research release#Commentary
why featured
HKR-K/R pass: the paper offers a concrete framework and safety relevance. HKR-H is weak, and the feed discloses no experiments, author pull, or reproducible evaluation, so it stays in the interesting-not-featured band.
editor take
This paper pins interpretability to concreteness and validation; fair move, but the five leverage domains are undisclosed.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Vision2Code introduces a reference-code-free benchmark with 2,169 examples from 15 datasets, where nine open-weight and proprietary models perform better on chart-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams.
#Vision#Code#Benchmarking#Vision2Code
why featured
HKR-K and HKR-R pass: the paper gives concrete benchmark scale and model comparisons for image-to-code reliability. HKR-H is weak, and a single arXiv benchmark stays in the 60–71 band.
editor take
Vision2Code tests 9 models on 2,169 cases; charts pass, while spatial scenes, chemistry, and circuit diagrams still crack.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
The paper proposes HE-SNR, a fine-grained entropy metric for guiding mid-training on SWE-bench, and validates it on models up to 560B parameters with 32K and 128K context windows.
#Code#Benchmarking#Reasoning#SWE-bench
why featured
HKR-K is clear and HKR-R is limited: HE-SNR adds a metric for SWE-bench mid-training with scale details, but the item only gives abstract-level facts and no direct product impact.
editor take
HE-SNR is tested at 560B and 32K/128K; I like the PPL challenge, but SWE-bench gains aren’t disclosed.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Hölder Policy Optimisation
HölderPO uses the Hölder mean to unify token-level probability aggregation and anneals parameter p during training; it reports 54.9% average accuracy across math benchmarks, a 7.2% relative gain over standard GRPO, and 93.8% success on ALFWorld.
#Reasoning#Alignment#Benchmarking#HölderPO
why featured
HKR-K passes with a concrete mechanism and benchmark delta; HKR-H and HKR-R are weak. This is useful RL-optimization research, but technical and not broad enough for featured.
editor take
HölderPO reports 54.9% math average, 7.2% over GRPO; if p-annealing prevents collapse, one GRPO tuning knob stops being folklore.
HKR breakdown
hook knowledge resonance
67
SCORE
H0·K1·R0
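The Hölder (power) mean interpolates between familiar ways of aggregating token probabilities, which is what makes annealing p meaningful. A sketch, assuming a linear schedule from p = 1 (arithmetic mean) to p = 0 (geometric mean); HölderPO's actual range and schedule for p are not stated in the summary, and `annealed_p` is hypothetical.

```python
import math

def holder_mean(probs, p, eps=1e-12):
    """Power (Hölder) mean of token probabilities.

    p = 1: arithmetic mean; p -> 0: geometric mean (exp of the mean log-prob,
    i.e. length-normalized sequence likelihood); p = -1: harmonic mean.
    """
    xs = [max(x, eps) for x in probs]
    if abs(p) < 1e-8:
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def annealed_p(step, total_steps, p_start=1.0, p_end=0.0):
    """Hypothetical linear schedule for p over training."""
    t = min(step / max(total_steps, 1), 1.0)
    return p_start + t * (p_end - p_start)

token_probs = [0.9, 0.5, 0.1]
for step in (0, 500, 1000):
    p = annealed_p(step, 1000)
    print(step, p, round(holder_mean(token_probs, p), 4))
```

For fixed token probabilities, lowering p lowers the aggregate (the power-mean inequality), so low-probability tokens weigh more heavily as training proceeds.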
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
When to Ask a Question: Understanding Communication Strategies in Generative AI Tools
arXiv 2605.11240 proposes a stylized user-LLM interaction model with an objective balancing user burden and preference representation, then uses an empirical evaluation to test the model’s predictions and practical implications.
#Alignment#Reasoning#Research release#Safety/alignment
why featured
HKR-H/K/R all pass because the paper targets a real AI-product UX tradeoff and states a concrete modeling mechanism. Still, the post lacks sample size, effect numbers, and artifact details, so it stays in the 60–71 band.
editor take
2605.11240 puts question count into the objective; I buy the framing, but the snippet gives no eval scale.
HKR breakdown
hook knowledge resonance
67
SCORE
H1·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
FastUMAP: Scalable Dimensionality Reduction via Bipartite Landmark Sampling
FastUMAP reports the lowest runtime on 7 of 9 benchmark datasets under a default-implementation comparison on one workstation; on 70,000-sample MNIST and Fashion-MNIST, it finishes in about 4.6 seconds and reaches 91.4% mean kNN accuracy versus 94.6% for the strongest accuracy baseline.
#Embedding#Inference-opt#Benchmarking#FastUMAP
why featured
HKR-K is strong and HKR-R is present for embedding/visualization workflows, with concrete benchmark numbers. The topic remains a narrow dimensionality-reduction paper, so it stays in the 60–71 band.
editor take
FastUMAP wins runtime on 7/9 sets and embeds 70k samples in 4.6s; 91.4% kNN accuracy makes it a sweep tool, not final evidence.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
FERA: Uncertainty-Aware Federated Reasoning for Large Language Models
FERA coordinates heterogeneous clients with private demonstrations through a training-free federated protocol, using multi-round reasoning traces and uncertainty-weighted aggregation; the abstract says it outperforms federated training and training-free baselines, but the post does not disclose benchmark counts or accuracy numbers.
#Reasoning#Alignment#Benchmarking#FERA
why featured
HKR-K/R pass: the mechanism is concrete and relevant to private-data reasoning workflows. HKR-H fails, and missing benchmark count or accuracy keeps it in the 60–71 research-signal band.
editor take
FERA gives the federated reasoning mechanism, not benchmark counts or accuracy; training-free is appealing, but convergence proof is not evidence.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
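A sketch of one plausible reading of "uncertainty-weighted aggregation": each client reports a distribution over candidate answers, and its vote counts with weight one minus its normalized entropy. The weighting rule is an assumption, not FERA's published formula, and the multi-round exchange of reasoning traces is not modeled.

```python
import math
from collections import defaultdict

def normalized_entropy(dist):
    """Entropy of a client's answer distribution, scaled to [0, 1]."""
    if len(dist) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return h / math.log(len(dist))

def aggregate(client_dists):
    """Hypothetical weighted vote: each client backs its top answer with
    weight 1 - normalized entropy, so confident clients count more."""
    scores = defaultdict(float)
    for dist in client_dists:
        top = max(dist, key=dist.get)
        scores[top] += 1.0 - normalized_entropy(dist)
    return max(scores, key=scores.get), dict(scores)

clients = [
    {"42": 0.95, "41": 0.05},   # confident
    {"41": 0.55, "42": 0.45},   # hesitant
    {"41": 0.60, "42": 0.40},   # hesitant
]
answer, scores = aggregate(clients)
print(answer, scores)
```

Two hesitant clients prefer "41" but one confident client prefers "42", so the weighted vote departs from a plain majority.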
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
From Model Uncertainty to Human Attention: Localization-Aware Visual Cues for Scalable Annotation Review
The study tested localization uncertainty cues with 120 participants, and annotators receiving cues achieved higher label quality while finishing faster overall; box-level analysis showed effort shifted toward high-uncertainty predictions, and the code is available.
#Vision#Alignment#Tools#Research release
why featured
HKR-K is solid: 120 participants, localization-aware uncertainty cues, faster and higher-quality review. HKR-H is weak, and the scope is annotation workflow research rather than a same-day model or product event.
editor take
A 120-person study says localization cues improve quality and speed; annotation tools should stop treating class confidence as enough.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Research paper presents procedural-skill SFT analysis across Qwen3.5 model capacity tiers
The paper measures procedural-skill SFT on 0.8B, 2B, and 4B Qwen3.5 using a 200-task/40-skill holdout, with SFT-attributable gains of +0.070, +0.040, and +0.075 under matched-path LLM-only scoring.
#Fine-tuning#Benchmarking#Reasoning#Qwen
why featured
HKR-K and HKR-R pass: the paper gives concrete SFT gains by Qwen3.5 size and speaks to fine-tuning tradeoffs. HKR-H is weak, and the scope is narrow, so it stays in the interesting band.
editor take
Qwen3.5 0.8B/2B/4B SFT gains are +0.070/+0.040/+0.075; 353 demos show a pattern, but single-seed keeps it provisional.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
Xiantao Jiang proposes QuIDE, a quantized-network evaluation metric using I=(C×P)/log₂(T+1) to score compression, accuracy, and latency; six experiments report 4-bit quantization as optimal for MNIST and Llama-3-8B, while 8-bit performs better for ResNet-18 on ImageNet-1K and 4-bit PTQ fails under the accuracy-gated variant I'.
#Inference-opt#Benchmarking#Xiantao Jiang#Llama-3-8B
why featured
HKR-K and HKR-R pass via a concrete quantization metric and cost/latency relevance, but HKR-H misses. As a single arXiv inference-optimization paper with limited product impact, it stays in the 60–71 all band.
editor take
QuIDE folds compression, accuracy, latency into I=(C×P)/log₂(T+1); I don’t buy one score for deployment trade-offs.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
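The QuIDE formula can be evaluated directly. The summary does not define units, so this sketch assumes C is a compression ratio, P is accuracy in [0, 1], and T is latency in milliseconds; the accuracy gate in `gated_quide_score` is a hypothetical stand-in for I', whose definition is not given. The inputs are illustrative, not the paper's measurements.

```python
import math

def quide_score(compression, accuracy, latency):
    """I = (C x P) / log2(T + 1), as written in the summary (units assumed)."""
    return compression * accuracy / math.log2(latency + 1)

def gated_quide_score(compression, accuracy, latency, min_accuracy):
    """Hypothetical accuracy gate: score is zero below an accuracy floor."""
    if accuracy < min_accuracy:
        return 0.0
    return quide_score(compression, accuracy, latency)

# Illustrative inputs only.
print(quide_score(8.0, 0.90, 7.0))   # e.g. 4-bit: 8x smaller than FP32
print(quide_score(4.0, 0.95, 7.0))   # e.g. 8-bit: 4x smaller than FP32
print(gated_quide_score(8.0, 0.60, 7.0, min_accuracy=0.75))
```

The logarithm makes the score far less sensitive to latency than to compression or accuracy: raising T from 7 to 15 ms only lifts log₂(T+1) from 3 to 4, while halving C halves I.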
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Localization Boosting for Growth Markets: Mitigating Cross-Locale Behavioral Bias in Learning-to-Rank
Adobe Express researchers propose a multi-objective learning-to-rank framework that combines click supervision, VLM-derived relevance labels, and locale-aware boosting; across five locales, the model improves relevance while restoring local content visibility, but the abstract does not disclose metric values or dataset size.
#Vision#Multimodal#Benchmarking#Adobe Express
why featured
Adobe Express’s LTR paper has a concrete mechanism and 5-locale evidence, but it is a narrower search-ranking/localization story. HKR-K/R pass, HKR-H is weak, so it stays in all.
editor take
Adobe Express tested locale-aware boosting across 5 locales; metrics and dataset size are undisclosed, so don’t crown it a localization fix.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation
The paper treats machine learning performance metrics as random variables and evaluates their distributions with quantiles and confidence intervals; its real-data and simulation studies report meaningful statistical inference with 10-25 repeated training runs, while standard nonparametric confidence intervals still apply.
#Benchmarking#Research release#Benchmark
why featured
HKR-K and HKR-R pass: the paper offers a concrete statistical mechanism and a testable 10-25 repeat-training claim, tied to benchmark reliability. HKR-H is weak, and a single arXiv methods paper stays in the all band.
editor take
The paper says 10–25 repeats can estimate quantile CIs; I buy the direction—single-score SOTA tables are overdue for demotion.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
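One standard distribution-free way to put a confidence interval on a quantile of a metric across repeated training runs uses order statistics and the binomial distribution; it may or may not match the paper's exact procedure. `quantile_ci` is a hypothetical helper, and ties are assumed absent (continuous metric values).

```python
from math import comb

def quantile_ci(samples, q=0.5, level=0.95):
    """Distribution-free CI for the q-quantile from order statistics.

    For sorted x_(1) <= ... <= x_(n), [x_(j), x_(k)] covers the true q-quantile
    with probability sum_{i=j}^{k-1} C(n, i) q^i (1 - q)^(n - i).
    Returns (low, high, coverage) for the narrowest index span meeting `level`,
    or None if n is too small for any interval to reach it.
    """
    xs = sorted(samples)
    n = len(xs)
    pmf = [comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(n + 1)]
    best = None
    for j in range(1, n + 1):
        for k in range(j + 1, n + 1):
            coverage = sum(pmf[j:k])
            if coverage >= level:
                if best is None or k - j < best[1] - best[0]:
                    best = (j, k, coverage)
                break  # a larger k only widens the interval for this j
    if best is None:
        return None
    j, k, coverage = best
    return xs[j - 1], xs[k - 1], coverage

runs = list(range(1, 21))   # stand-in for 20 per-seed metric values
print(quantile_ci(runs), quantile_ci([1, 2, 3, 4, 5]))
```

With 20 runs, the 95% interval for the median spans the 6th to 15th smallest scores (coverage about 0.959); with only 5 runs no order-statistic interval reaches 95%, which is why a minimum number of repeats matters.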
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
STRUM converts raw recordings into playable Clone Hero/YARG charts for drums, guitar, bass, vocals, and keys, reaching 0.838 drum onset F1 on a 30-song benchmark at ±100 ms tolerance. The authors release code, model weights, and the full benchmark manifest.
#Audio#Benchmarking#Tools#STRUM
why featured
HKR-H and HKR-K pass: the open-source model turns recordings into playable rhythm-game charts and reports a 30-song benchmark with 0.838 F1. The niche topic misses HKR-R, so it stays in the 60–71 band.
editor take
STRUM hits 0.838 drum F1 on 30 songs, but guitar sits at 0.651; the released weights matter more than the score.
HKR breakdown
hook knowledge resonance
66
SCORE
H1·K1·R0
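Onset F1 at ±100 ms counts a predicted onset as correct if it lies within 100 ms of a not-yet-matched reference onset. A greedy sketch follows; established tools such as `mir_eval.onset.f_measure` use an optimal bipartite matching instead, and STRUM's exact scoring code is not described in the summary.

```python
def onset_f1(reference, estimated, tol=0.100):
    """Onset F1 with a +/- tol (seconds) window and one-to-one matching.

    Greedy two-pointer matching over sorted onset times: an estimate that is
    too early can match no later reference, and a reference that is too early
    can match no later estimate, so each is skipped in turn.
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1
        else:
            i += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(est), hits / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = [0.0, 0.5, 1.0, 1.5]      # drum hits, seconds
estimated = [0.05, 0.52, 1.3, 1.49]   # 1.3 s is 300 ms off and misses
print(onset_f1(reference, estimated))
```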
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows
The researchers trained an expert-in-the-loop LLM extraction workflow on seven annotated molecular glue papers, reached record-level F1 of 0.98, transferred it to PROTACs by terminology substitution with F1 above 0.93, and expanded molecular glue and PROTAC database records by 81% and 92%.
#Agent#RAG#Benchmarking#arXiv
why featured
HKR-K/R pass: the paper gives testable numbers for agentic literature extraction, including F1 0.98 and database growth. The protein-degradation domain is narrow, so audience fit stays in the interesting-but-not-featured band.
editor take
Seven papers to F1 0.98 is neat; the 92% expert-validated new glue records make this a credible curation template.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference
ChunkFlow schedules chunk-granular prefetching for three diffusion transformers on two PCIe H100 GPUs with Ulysses sequence parallelism, delivering up to 1.28x step-time speedup over SGLang layerwise offloading and reducing peak GPU memory by up to 49% versus a no-offload baseline when workloads are large enough.
#Inference-opt#ChunkFlow#SGLang#H100
why featured
HKR-K/R pass on reproducible infra numbers and cost resonance; HKR-H fails because the title is dense systems jargon. No hard-exclusion, but the niche DiT inference scope keeps it in the 60–71 band.
editor take
ChunkFlow hits 1.28x over SGLang on two PCIe H100s; DiT offloading finally treats PCIe contention as the problem.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
A Formal Comparison Between Chain of Thought and Latent Thought
The paper formally compares Chain of Thought and latent thought, showing that latent thought supports more efficient parallel computation, while CoT enables approximate counting and sampling through stochastic decoding.
#Reasoning#Research release
why featured
HKR-H and HKR-K pass: the paper targets the CoT vs latent-thought split and names parallelism plus approximate counting/sampling. The formal research angle lacks product or engineering impact, so it stays in the 60–71 band.
editor take
The paper separates latent thought for parallelism and CoT for stochastic counting; don’t mystify hidden reasoning—task structure decides.
HKR breakdown
hook knowledge resonance
66
SCORE
H1·K1·R0
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
The paper proposes an audit-constrained protocol for LLM reasoning evaluation, generating prompt variants from a finite component grammar under a fixed query budget; across three audited slices, CAPS did not improve audited yield or unique prompt-key discovery over uniform sampling.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K is solid: the paper proposes an audit-constrained reasoning-test protocol and reports CAPS did not beat uniform sampling across 3 slices. HKR-R is limited to eval practitioners, with no product or model-release impact, so it sits in the 60–71 band.
editor take
CAPS lost to uniform sampling across 3 audited slices; prompt-failure hunting needs budgets and audits, not cherry-picked mismatches.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Curriculum Learning-Guided Progressive Distillation in Large Language Models
The paper proposes CLPD, a distillation framework that orders training examples from easy to hard and schedules teachers with increasing capacity; the abstract says CLPD outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple reasoning benchmark settings.
#Reasoning#Fine-tuning#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete distillation mechanism tied to cost-sensitive model work. HKR-H fails, and the post lacks exact gains or source authority, so it stays below featured.
editor take
CLPD orders samples and teacher capacity together; model sizes are undisclosed, so don’t canonize “stronger teachers fail” yet.
HKR breakdown
hook knowledge resonance
66
SCORE
H0·K1·R1
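A sketch of combining easy-to-hard data ordering with a teacher schedule of increasing capacity, assuming a hypothetical per-example difficulty score and an equal split of the sorted data into one stage per teacher. CLPD's actual difficulty measure, stage boundaries, and teacher sizes are not disclosed in the summary; the teacher names are placeholders.

```python
def progressive_schedule(examples, difficulty, teachers, n_stages=None):
    """Pair easy-to-hard data stages with increasingly capable teachers.

    examples: training items; difficulty: callable item -> float.
    teachers: (name, capacity) pairs. Stage s gets the s-th chunk of examples
    in ascending difficulty and the s-th teacher in ascending capacity
    (the largest teacher is reused if there are more stages than teachers).
    """
    ordered = sorted(examples, key=difficulty)
    ranked = [name for name, _ in sorted(teachers, key=lambda t: t[1])]
    n_stages = n_stages or len(ranked)
    size = -(-len(ordered) // n_stages)  # ceiling division
    return [(ranked[min(s, len(ranked) - 1)], ordered[s * size:(s + 1) * size])
            for s in range(n_stages)]

# Hypothetical problems tagged with a step count used as difficulty.
problems = [("p1", 5), ("p2", 1), ("p3", 3), ("p4", 8), ("p5", 2), ("p6", 6)]
teachers = [("teacher-7B", 7), ("teacher-1.5B", 1.5), ("teacher-3B", 3)]
for teacher, chunk in progressive_schedule(problems, lambda e: e[1], teachers):
    print(teacher, [name for name, _ in chunk])
```

The ablations the summary mentions (data ordering alone, teacher scheduling alone) correspond to shuffling `ordered` or fixing `ranked` to a single teacher in this sketch.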
04:00
arXiv · cs.LG · atom · EN · 04:00 · 05·13
Epistemic Uncertainty for Test-Time Discovery
UG-TTT maintains a small ensemble of low-rank adapters over a frozen base model, adds a per-token mutual-information exploration bonus to policy gradients, and raises maximum reward on 3 of 4 scientific discovery benchmarks while preserving higher solution diversity.
#Reasoning#Fine-tuning#Benchmarking#Research release
why featured
HKR-K passes with a concrete mechanism and 3/4 benchmark gains; HKR-H is weak and HKR-R is narrow. As a single arXiv method paper without code, production replacement, or major-lab adoption, it sits in 60–71.
editor take
UG-TTT wins 3 of 4 discovery benchmarks; I buy per-token mutual information over single-model confidence for exploration.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R0
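UG-TTT's exploration bonus is described as per-token mutual information over an ensemble of adapters. The standard ensemble estimate of that quantity is the entropy of the averaged prediction minus the average member entropy. The sketch below computes it from plain probability lists; the `beta` coefficient and the additive reward shaping are hypothetical choices, not the paper's exact objective.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def token_mutual_information(member_probs: Sequence[Sequence[float]]) -> float:
    """Ensemble mutual information at one token position.

    member_probs[k] is the next-token distribution from ensemble member k
    (e.g. one low-rank adapter over a shared frozen base). The result is the
    part of the total uncertainty that comes from members disagreeing.
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    mean = [sum(p[v] for p in member_probs) / k for v in range(vocab)]
    return entropy(mean) - sum(entropy(p) for p in member_probs) / k


def shaped_token_rewards(
    reward: float,
    per_token_member_probs: Sequence[Sequence[Sequence[float]]],
    beta: float = 0.1,
) -> List[float]:
    """Hypothetical shaping: sequence reward plus beta * per-token MI bonus."""
    return [reward + beta * token_mutual_information(m) for m in per_token_member_probs]
```

Members that agree give zero bonus even when each is individually unsure, which is the difference from a single-model confidence signal.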
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Towards Order Fairness: Mitigating LLMs' Order Sensitivity through Dual Group Advantage Optimization
The paper proposes Dual Group Advantage Optimization (DGAO), a reinforcement-learning method that balances an intra-group accuracy advantage against an inter-group stability advantage to train LLMs to produce correct outputs that stay stable across input orderings. Experiments cover RAG, mathematical reasoning, and classification tasks and introduce two metrics, Consistency Rate and Overconfidence Rate; code is released at github.com/Hyalinesky/DGAO.
#RAG#Reasoning#Alignment#Research release
why featured
HKR-K and HKR-R pass: DGAO names a concrete training mechanism for order sensitivity in RAG, math, and classification. The summary gives no lift numbers or reproducible setup, so it stays in the lower research band.
editor take
DGAO optimizes order fairness with two advantages. I don't buy “superior” until baselines and gains are disclosed.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries
The paper analyzes compromised-Provider attacks in SAGA and proposes four mitigations: SAGA-BFT, SAGA-MON, SAGA-AUD, and SAGA-HYB; the abstract describes trade-offs across Byzantine resilience, monitoring, and auditing, but the post does not disclose benchmark numbers.
#Agent#Safety#Alignment#SAGA
why featured
HKR-H/K/R pass via the compromised-provider threat model and named mitigations. No evaluation numbers are disclosed, and Byzantine governance is academic, so this stays in the 60–71 research band.
editor take
SAGA gets 4 mitigations, but no benchmark numbers are disclosed; agent governance that trusts a single Provider invites Byzantine failure sooner or later.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder
ASD-Bench evaluates 17 model configurations on 4,068 AQ-10 records across 3 age cohorts and 4 axes; 10 of 17 models reach F1 and AUC of 1.000 for adults, while AdaBoost still has ECE of 0.302, separating accuracy from calibration.
#Benchmarking#Interpretability#Safety#ASD-Bench
why featured
HKR-H and HKR-K pass via the perfect-score anomaly and concrete benchmark setup. HKR-R is weak: this is a vertical ASD-screening paper with no product, open model, or adoption signal, so it stays in the 60–71 band.
editor take
ASD-Bench tests 17 models on 4,068 AQ-10 records; adult F1=1.000 smells too easy, and clinical validity is unproven.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
CATS accelerates LLM decoding on memory-limited edge devices while keeping peak device memory equal to the target model alone. The paper evaluates real edge devices across five benchmarks and reports up to 5.08x wall-clock speedup with no generation-quality loss, beating the SOTA method by up to 1.45x under edge memory constraints.
#Inference-opt#Research release#Benchmark
why featured
HKR-K and HKR-R pass via concrete speed/memory claims and relevance to edge-deployment cost. HKR-H is weak, and inference optimization is a specialized topic, so it stays in the 60–71 band.
editor take
CATS reports 5.08x max speedup across five benchmarks; edge inference is gated by peak memory, not just smaller models.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
fg-expo: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum
FG-ExPO adds Accuracy-Conditioned KL Scaling and Gaussian Curriculum Sampling to GRPO, evaluates DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base on six math reasoning benchmarks, and raises AIME 2025 pass@32 from 63.33% to 76.67%.
#Reasoning#Fine-tuning#Benchmarking#DeepSeek
why featured
HKR-K is strong and HKR-R lands because the paper gives a testable gain for small reasoning-model RL. HKR-H fails due to jargon-heavy framing; code, training cost, and robustness evidence are not disclosed, so it stays in all.
editor take
FG-ExPO lifts AIME 2025 pass@32 to 76.67%. I buy AKL/GCS tweaks over another round of GRPO folklore.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Rotary Masked Autoencoders are Versatile Learners
RoMAE extends RoPE to continuous positions, enabling MAE-style interpolation and representation learning without time-series-specific architecture changes. It covers irregular multivariate time series, images, and audio, and surpasses specialized time-series architectures on difficult datasets, including the DESC ELAsTiCC Challenge.
#Multimodal#Embedding#RoMAE#RoPE
why featured
HKR-H/K pass: RoMAE extends RoPE to continuous positions across irregular time series, images, and audio, with DESC ELAsTiCC results. HKR-R is weak because this remains an academic architecture paper without a product or deployment hook.
editor take
RoMAE runs continuous RoPE across irregular series, images, and audio; the sharper warning is that learned embeddings can break RoPE’s relative-position property.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Couple to Control: Joint Initial Noise Design in Diffusion Models
The paper proposes joint initial-noise design for diffusion models: each noise stays marginally standard Gaussian, while cross-sample dependence is designed, improving gallery diversity on SD1.5, SDXL, and SD3 without adding sampling cost.
#Multimodal#Vision#Inference-opt#arXiv
why featured
HKR-K is clear and HKR-R applies to image-generation teams, but this is a method paper with abstract-level claims only; no uplift numbers or code are disclosed, so it stays in the 60–71 band.
editor take
Coupled noise boosts diversity on SD1.5, SDXL, and SD3 at zero sampling cost; treating seed independence as designable is overdue.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
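The coupled-noise idea keeps each seed marginally standard Gaussian while designing dependence across the gallery. One classical construction with exactly that property is to center i.i.d. draws across the batch and rescale by sqrt(n / (n - 1)): each coordinate stays N(0, 1), and any two samples get correlation -1 / (n - 1). This is an illustrative coupling chosen for the sketch, not necessarily the one the paper designs; a flat list of floats stands in for a latent tensor.

```python
import math
import random
from typing import List


def coupled_gaussian_noise(n: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Return n noise vectors, each marginally N(0, I), negatively coupled across samples.

    Per coordinate: z_i = sqrt(n / (n - 1)) * (x_i - mean(x)) with x_i i.i.d. N(0, 1).
    Each z_i is a linear combination of Gaussians with variance 1, and
    corr(z_i, z_j) = -1 / (n - 1) for i != j, so the coordinates sum to zero.
    """
    if n < 2:
        raise ValueError("need at least two samples to couple")
    rng = random.Random(seed)
    scale = math.sqrt(n / (n - 1))
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(x) / n
        for i in range(n):
            out[i][d] = scale * (x[i] - mean)
    return out
```

For n = 2 this reduces to antithetic noise (the second seed is the negation of the first); sampling cost is unchanged because only the initial noise differs.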
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Stop Marginalizing My Dreams: Model Inversion via Laplace Kernel for Continual Learning
The paper introduces REMIX for data-free continual learning. It uses a Laplace kernel to model structured feature covariance. Memory scales linearly with feature dimension, and computation adds only a logarithmic factor. The authors report gains on standard DFCIL benchmarks, and the code is available on GitHub.
#Memory#Benchmarking#arXiv#GitHub
why featured
HKR-K is solid: REMIX gives a Laplace-kernel covariance, linear memory, and code. HKR-R is narrow around continual-learning cost, while HKR-H is weak, so this stays in the 60–71 research-interest band.
editor take
REMIX makes covariance memory linear in feature dimension; I buy the direction—DFCIL pseudo-samples outgrew diagonal assumptions.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
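As background on why a Laplace kernel can keep memory linear: over an ordered index, k(i, j) = exp(-|i - j| / ℓ) is exactly the covariance of a stationary AR(1) (discrete Ornstein–Uhlenbeck) process, so a correlated sample can be drawn in O(d) time and memory without forming the d × d matrix. The sketch below shows only that general fact; how REMIX places its kernel over features, and where its logarithmic factor arises, are not described in the summary, and `length_scale` is a hypothetical parameter.

```python
import math
import random
from typing import List


def sample_laplace_kernel_gaussian(dim: int, length_scale: float, rng: random.Random) -> List[float]:
    """Draw x ~ N(0, K) with K[i][j] = exp(-|i - j| / length_scale) in O(dim) memory.

    Uses the AR(1) recursion x_k = rho * x_{k-1} + sqrt(1 - rho^2) * eps_k with
    rho = exp(-1 / length_scale); then Var(x_k) = 1 and Cov(x_i, x_j) = rho^|i - j|.
    """
    rho = math.exp(-1.0 / length_scale)
    innovation_std = math.sqrt(1.0 - rho * rho)
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(1, dim):
        x.append(rho * x[-1] + innovation_std * rng.gauss(0.0, 1.0))
    return x
```

Scaling the output by per-feature standard deviations and adding class means would turn it into structured pseudo-features, which is the kind of use a data-free continual learner might make of it.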
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
The paper introduces Asymmetric Langevin Unlearning, which uses public data to reduce certified unlearning noise costs. It proves an O(1/n_pub^2) suppression factor, claims a computational advantage over retraining, and tests privacy with variational Rényi divergence and membership inference attacks under distribution mismatch.
#Fine-tuning#Safety#Benchmarking#Research release
why featured
HKR-K is concrete via the certified-unlearning noise factor, and HKR-R comes from deletion compliance versus utility. Theoretical arXiv framing limits accessibility and product impact, so it stays in the 60–71 band.
editor take
ALU claims O(1/n_pub²) unlearning-cost suppression; the snippet omits model scale and datasets behind its mass-deletion utility claim.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Grid Games: The Power of Multiple Grids for Quantizing Large Language Models
The paper formalizes PO2 multi-grid 4-bit quantization, where each value group selects among two or more grids, and reports clear gains for small-group MXFP/NVFP-style formats while the advantage vanishes for very large groups; source code is available on GitHub.
#Inference-opt#IST-DASLab#Llama#Research release
why featured
HKR-K and HKR-R pass via a concrete 4-bit quantization mechanism and cost/deployment relevance. HKR-H fails, and the topic stays specialized inference engineering, so it remains below featured.
editor take
PO2 multi-grid 4-bit wins on small MXFP/NVFP groups, then fades at large groups; useful trick, hardware cost decides it.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
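The multi-grid idea lets each value group pick among several 4-bit grids. A minimal sketch, assuming two illustrative 15-level grids (symmetric INT4 and FP4 E2M1 values), absmax scaling per group, and nearest-level rounding; the grids and the selection rule are assumptions, not the paper's PO2 formulation. With two grids, the choice costs one extra metadata bit per group.

```python
import random
from typing import List, Sequence, Tuple

# Two illustrative 4-bit grids (15 levels each): symmetric INT4 and FP4 (E2M1) values.
INT4_GRID: List[float] = [float(v) for v in range(-7, 8)]
_FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID: List[float] = sorted(set(_FP4_MAGNITUDES + [-m for m in _FP4_MAGNITUDES]))


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_group(values: Sequence[float], grid: Sequence[float]) -> Tuple[List[float], float]:
    """Absmax-scale one group onto `grid`, round to the nearest level, dequantize."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0.0] * len(values), 0.0
    scale = absmax / max(abs(g) for g in grid)
    deq = [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]
    return deq, squared_error(values, deq)


def multi_grid_quantize(
    weights: Sequence[float], group_size: int, grids: Sequence[Sequence[float]]
) -> Tuple[List[float], List[int]]:
    """Per group, keep whichever grid gives the lowest squared error."""
    out: List[float] = []
    choices: List[int] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        results = [quantize_group(group, grid) for grid in grids]
        best = min(range(len(grids)), key=lambda i: results[i][1])
        out.extend(results[best][0])
        choices.append(best)
    return out, choices


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
multi, choices = multi_grid_quantize(weights, 16, [INT4_GRID, FP4_GRID])
int4_only, _ = multi_grid_quantize(weights, 16, [INT4_GRID])
fp4_only, _ = multi_grid_quantize(weights, 16, [FP4_GRID])
print(choices)
print(squared_error(weights, multi), squared_error(weights, int4_only), squared_error(weights, fp4_only))
```

Varying `group_size` in this sketch is one way to probe the summary's claim that the gain shrinks for very large groups.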
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
STRABLE: Benchmarking Tabular Machine Learning with Strings
STRABLE introduces a benchmark corpus of 108 real-world tables with strings and numbers and evaluates 445 pipelines; on categorical-dominant tables, advanced tabular learners paired with simple string embeddings deliver good predictions at low computational cost, while large LLM encoders become competitive on free-text-dominant tables.
#Benchmarking#Embedding#STRABLE#Research release
why featured
HKR-K passes because the paper adds a concrete benchmark and result: 108 real tables and 445 pipelines. HKR-H is weak and HKR-R is narrow, so it fits the 60–71 all band rather than featured.
editor take
STRABLE tests 108 tables and 445 pipelines; don’t rush LLM encoders for strings when simple embeddings plus tabular learners win on cost.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness
VPG-EA improves the ε³ comprehensive efficiency metric by 8.73% on DeepSeek-R1-Distill-Qwen-1.5B and 12.37% on 7B, using a parameter-shared dual-stream setup, cross-view filtering of pseudo-efficient paths, and variational distillation to transfer efficient posterior patterns into the prior policy.
#Reasoning#Inference-opt#DeepSeek#Qwen
why featured
HKR-K and HKR-R pass: the paper gives efficiency numbers on DeepSeek-R1-Distill-Qwen 1.5B/7B and targets reasoning cost. HKR-H is weak, and as a single arXiv methods paper it stays in the 60–71 band.
editor take
VPG-EA lifts ε³ by 8.73%/12.37% on two Qwen distills; I’d audit whether ε³ just rewards shorter reasoning.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Hanhan Zhou, Shamik Roy, and Rashmi Gangadharaiah propose an adaptive steering scheduler for discrete diffusion language models, tested on four 124M-8B-parameter DLMs and seven steering tasks; on simultaneous three-attribute control, it reaches up to 93% steering strength, 15 percentage points above the strongest baseline while preserving generation quality.
#Alignment#Interpretability#Inference-opt#Hanhan Zhou
why featured
HKR-H/K pass: the paper has a control-without-breakage hook and concrete numbers across model sizes and tasks. HKR-R is weaker because DLM intervention work is specialized and lacks product, open-source, or deployment detail.
editor take
Zhou et al. hit 93% three-attribute steering across 4 DLMs and 7 tasks; autoregressive-style steering looks sloppy here.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction
The paper proposes MedTPE, which merges frequent co-occurring medical token pairs and fine-tunes only 0.5–1.0% newly introduced token embeddings; across four clinical prediction tasks, it reduces input length by up to 31% and inference latency by 34–63% while maintaining or improving performance.
#Inference-opt#Fine-tuning#MedTPE#Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete compression mechanism and latency numbers, tied to clinical deployment cost. It remains a single arXiv method paper with a narrow domain, below featured threshold.
editor take
MedTPE cuts EHR input tokens by up to 31% and latency by up to 63%; for clinical LLMs, token-pair merging beats risky pruning.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
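MedTPE merges frequently co-occurring medical token pairs into new tokens and fine-tunes only the new embeddings. A BPE-style merge loop, sketched below, shows the compression half of that idea. The `+`-joined token names and the toy clinical tokens are hypothetical; a real tokenizer would assign new vocabulary IDs instead.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(corpus: Sequence[Sequence[str]]) -> Optional[Pair]:
    """Most common adjacent token pair across all sequences (first seen wins ties)."""
    counts: Counter = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: Sequence[str], pair: Pair, new_token: str) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with `new_token`, left to right."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_pair_merges(
    corpus: Sequence[Sequence[str]], num_merges: int
) -> Tuple[Dict[Pair, str], List[List[str]]]:
    """Greedily learn up to `num_merges` pair merges; returns merges and the merged corpus."""
    current = [list(seq) for seq in corpus]
    merges: Dict[Pair, str] = {}
    for _ in range(num_merges):
        pair = most_frequent_pair(current)
        if pair is None:
            break
        new_token = f"{pair[0]}+{pair[1]}"  # hypothetical name for the new vocabulary entry
        merges[pair] = new_token
        current = [merge_pair(seq, pair, new_token) for seq in current]
    return merges, current


def apply_merges(seq: Sequence[str], merges: Dict[Pair, str]) -> List[str]:
    """Apply learned merges to new input in the order they were learned."""
    out = list(seq)
    for pair, token in merges.items():
        out = merge_pair(out, pair, token)
    return out
```

Only the embeddings for the newly created tokens would need training, which is why the summary can report fine-tuning just 0.5–1.0% of new embeddings.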
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Intention-Conditioned Flow Occupancy Models
InFOM uses flow matching to predict an agent’s temporally distant occupancy states with a latent intention variable, and its experiments on 36 state-based and 4 image-based benchmark tasks report a 1.8× median return improvement and a 36% success-rate increase over alternative pre-training methods.
#Agent#Reasoning#Robotics#arXiv
why featured
HKR-K passes because the summary gives a mechanism and benchmark numbers. HKR-H/R are weak: the title is academic, and impact remains at paper-evaluation level, so this fits all rather than featured.
editor take
InFOM reports a 1.8× median return gain across 40 tasks; making intention a sampled latent is neat, but replication will decide its bite.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General
PaPE encodes positions for vision tokens with a parabola-based scheme, and ImageNet-1K extrapolation experiments report up to a 10.5% absolute gain over the next-best encoding.
#Vision#Multimodal#Benchmarking#Research release
why featured
HKR-K passes with a concrete mechanism and +10.5% reported gain. HKR-H/R are weak, and a position-encoding paper is narrow technical research, so it fits all below featured.
editor take
PaPE claims up to +10.5% on ImageNet-1K extrapolation; I’d inspect the 8-dataset table before trusting the encoding.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning
The paper proposes MarsTSC, a VLM agentic reasoning framework for few-shot multimodal time-series classification, using three roles—Generator, Reflector, and Modifier—and a self-evolving knowledge bank; experiments cover 12 time-series benchmarks and 6 VLM backbones, but the snippet does not disclose exact scores or model names.
#Agent#Reasoning#Multimodal#MarsTSC
why featured
HKR-H/K pass: the VLM plus few-shot multimodal time-series angle is fresh, with 3 roles, a self-evolving KB, 12 benchmarks, and 6 backbones. HKR-R is weak because this stays in a niche research setting without product or cost impact.
editor take
MarsTSC spans 12 benchmarks and 6 VLMs, with no scores disclosed; agentic reflection earns skepticism until TSC gains beat simpler test-time tricks.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
LatentHDR uses one diffusion pass to generate a coherent latent scene representation, then maps it to exposure-specific latents with a conditional latent-to-latent head; experiments on synthetic data and the SI-HDR benchmark report state-of-the-art dynamic range and an order-of-magnitude compute reduction.
#Multimodal#Vision#Inference-opt#LatentHDR
why featured
HKR-K passes with a concrete mechanism and a 10x compute reduction on SI-HDR. HKR-H/R are weak because panoramic HDR generation is niche, so this stays in the lower interesting band.
editor take
LatentHDR cuts HDR exposure-stack generation to one diffusion pass; for HDR, latent constraints beat burning samples.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Extending Kernel Trick to Influence Functions
The paper presents a dual representation of influence functions whose computational cost scales with dataset size rather than model size, estimating parameter, output, and loss changes after data-point removal when models are larger than datasets or parameter-space influence evaluation is infeasible.
#Fine-tuning#Interpretability#Research release
why featured
HKR-K is clear: a dual influence-function representation changes the scaling from model size to dataset size. HKR-H is weak, and the paper lacks experiment numbers, code, or product implications, so it stays in all.
editor take
This shifts influence-function cost from parameters to dataset size, but needs linearizable models and an output-dimension × dataset matrix.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Agent-Based Post-Hoc Correction of Agricultural Yield Forecasts
The paper proposes a structured LLM agent for post-hoc correction of agricultural yield forecasts, evaluated on a proprietary strawberry dataset and a public USDA corn harvest dataset, where Llama 3.1 8B produced the strongest corrections and reduced XGBoost strawberry MAE by 20% and MASE by 56%.
#Agent#Tools#Llama#LLaVA
why featured
HKR-K passes with datasets, a baseline, and error reductions; HKR-H/R are weak because crop forecasting is far from mainstream AI products or agent workflows. No hard exclusion applies, but it stays in the 60–71 band as a niche paper.
editor take
Llama 3.1 8B cut strawberry XGBoost MAE 20%; I buy post-hoc agents over retraining for real farm budgets.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
RT-Transformer: The Transformer Block as a Spherical State Estimator
The paper models the Transformer block as directional state estimation on a hypersphere, where attention aggregates evidence, residual connections perform incremental updates, and normalization retracts the updated state back onto the hypersphere.
#Interpretability#Reasoning#Research release
why featured
HKR-H/K pass: the title offers a counterintuitive model and the body gives three module mappings. HKR-R is weak; a single arXiv theory paper without metrics, code, or product impact stays in all.
editor take
RT-Transformer unifies attention, residuals, and normalization as spherical estimation; I buy the geometry, but no empirical gains are disclosed.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Rotation-Preserving Supervised Fine-Tuning
Hangzhan Jin and five coauthors propose RPSFT, which penalizes changes in projected top-k singular-vector blocks of pretrained weight matrices; the 31-page arXiv paper includes 13 figures, reports improved in-domain/OOD trade-offs on math reasoning fine-tuning, and releases code on GitHub.
#Fine-tuning#Reasoning#Hangzhan Jin#Doina Precup
why featured
HKR-K is solid and HKR-R is niche but real for fine-tuning practitioners; the excerpt gives no measured gains, model scale, or benchmark results, so this stays in the lower interesting band.
editor take
RPSFT penalizes top-k singular-vector rotation; plain idea, runnable code, and a cleaner engineering patch than another SFT recipe.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Variance-aware Reward Modeling with Anchor Guidance
The paper proposes Anchor-guided Variance-aware Reward Modeling, using two coarse response-level anchor labels to resolve non-identifiability in Gaussian reward models from pairwise preferences, and evaluates the method on simulation studies plus four real-world diverging-preference datasets.
#Alignment#Fine-tuning#Benchmarking#Research release
why featured
HKR-K passes because the paper states a concrete mechanism: two coarse anchor labels for Gaussian reward-model identifiability, tested on 4 datasets. HKR-H and HKR-R are weak, so this stays in all, not featured.
editor take
AVRM makes Gaussian reward variance identifiable with two response-level anchors; I buy the setup, and on 4 diverging-preference datasets it reads as a better fix than Bradley–Terry margin shrinkage.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
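The non-identifiability AVRM targets can be seen directly. If each response reward is r ~ N(μ, σ²), a pairwise preference only sees (μ_a − μ_b) / sqrt(σ_a² + σ_b²), so any common shift and rescaling leaves every preference probability unchanged. Absolute, response-level labels break that symmetry. The sketch below assumes the anchors take the form P(r > t) at two known thresholds; that form and the thresholds are hypothetical illustrations, not the paper's estimator.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pref_prob(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """P(r_a > r_b) for independent r_a ~ N(mu_a, var_a) and r_b ~ N(mu_b, var_b)."""
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    return STD_NORMAL.cdf(z)


def recover_from_two_anchors(t1: float, p1: float, t2: float, p2: float) -> tuple:
    """Solve (mu, sigma) from P(r > t1) = p1 and P(r > t2) = p2, assuming t1 < t2 and p1 > p2.

    P(r > t) = Phi((mu - t) / sigma), so (mu - t_k) / sigma = Phi^{-1}(p_k) for k = 1, 2,
    which gives sigma = (t2 - t1) / (z1 - z2) and mu = t1 + sigma * z1.
    """
    z1 = STD_NORMAL.inv_cdf(p1)
    z2 = STD_NORMAL.inv_cdf(p2)
    sigma = (t2 - t1) / (z1 - z2)
    mu = t1 + sigma * z1
    return mu, sigma
```

Doubling every mean and sigma and adding a constant shift leaves `pref_prob` unchanged, while two anchor probabilities pin down both the level and the spread.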
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Shaping Zero-Shot Coordination via State Blocking
The paper introduces State-Blocked Coordination, which creates a family of virtual environments via state blocking and improves zero-shot coordination across multiple benchmarks, including generalization to human partners.
#Agent#Reasoning#Benchmarking#Research release
why featured
HKR-K passes for a concrete mechanism: state blocking creates virtual environments for zero-shot coordination. HKR-H and HKR-R are weak because the post gives no metrics, code artifact, or product-facing implication.
editor take
SBC uses state blocking to create virtual environments; with no benchmark names or numbers, I file it as training perturbation, not a ZSC answer.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Demystifying When Pruning Works via Representation Hierarchies
The paper analyzes pruning through three representation spaces—embedding, logit, and probability—and finds that logit-to-probability nonlinear transformation amplifies pruning deviations, which accumulate across generation steps; the abstract says code is available on GitHub but does not disclose model sizes or benchmark scores.
#Inference-opt#Interpretability#Benchmarking#CASE-Lab-UMD
why featured
HKR-K comes from the pruning-failure mechanism; HKR-R comes from model-compression cost pressure. The item reads like an abstract, with no numbers, model list, or reproducible setup disclosed.
editor take
The paper analyzes pruning across 3 representation spaces; softmax error amplification is plausible, but no model sizes or scores are disclosed.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
The paper introduces APIE, an active prompting framework for information extraction that ranks unlabeled samples using format uncertainty and content uncertainty, and reports stronger extraction accuracy and robustness than baselines across four benchmarks.
#RAG#Reasoning#Benchmarking#Research release
why featured
HKR-K is clear: APIE provides a testable sample-ranking mechanism and reports gains on 4 IE benchmarks. HKR-R is limited to IE and annotation workflows, with no broad model, product, or open-source impact disclosed.
editor take
APIE beats strong baselines on 4 IE benchmarks, but gains aren’t disclosed; format uncertainty is the production-shaped bit here.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Fully AI-Generated Image Detection: Definition, Recent Advances and Challenges
The arXiv review surveys fully AI-generated image detection and organizes prior work around two detector-design components: dataset construction and artifact extraction.
#Vision#Safety#Benchmarking#Research release
why featured
HKR-K/R pass: the survey gives a definition and a two-part detection pipeline. HKR-H is weak, and the post lacks a new model, dataset size, or evaluation numbers, so it stays in all.
editor take
This survey narrows detection to datasets and artifacts; model-specific wins still fail when the generator changes.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
A Survey of On-Policy Distillation for Large Language Models
This arXiv survey formalizes On-Policy Distillation as f-divergence minimization over student-sampled trajectories, and organizes distillation, RLHF, and imitation-learning work along three design axes.
#Fine-tuning#Alignment#Reasoning#Research release
why featured
HKR-K passes: the survey formalizes OPD as f-divergence minimization over student-sampled trajectories and uses 3 design axes. It is a methods survey, not a model release or reproducible experiment, so it sits in the 60–71 band.
editor take
This survey maps OPD onto 3 design axes; useful as accounting across distillation, RLHF, and imitation learning, not new algorithmic fuel.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
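The survey's framing of OPD as f-divergence minimization over student-sampled trajectories can be made concrete with reverse KL, one member of the f-divergence family. At a single position, KL(student ‖ teacher) is estimated by sampling tokens from the student and averaging log p_student − log p_teacher. The toy distributions below are hypothetical; in practice these per-token terms are summed along student-generated sequences.

```python
import math
import random
from typing import Dict

Dist = Dict[str, float]


def exact_reverse_kl(student: Dist, teacher: Dist) -> float:
    """KL(student || teacher) for one next-token distribution; teacher must cover student's support."""
    return sum(p * math.log(p / teacher[tok]) for tok, p in student.items() if p > 0.0)


def sampled_reverse_kl(student: Dist, teacher: Dist, n: int, rng: random.Random) -> float:
    """Monte Carlo estimate using tokens drawn from the student (the on-policy part)."""
    tokens = list(student)
    weights = [student[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n)
    return sum(math.log(student[t]) - math.log(teacher[t]) for t in draws) / n
```

Swapping the log-ratio for another f-divergence's per-sample term is where the survey's design axes would enter; that mapping is beyond this sketch.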
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
The paper proposes CREDIT, a contrastive reward for on-policy self-distillation, by showing token rewards sum to conditional pointwise mutual information and using a batch-contrastive baseline to isolate input-specific credit; across coding, scientific reasoning, and tool-use benchmarks on two model families, CREDIT reports the strongest aggregate performance with negligible extra compute.
#Reasoning#Code#Tools#CREDIT
why featured
HKR-K passes for the CREDIT reward mechanism and code/science/tool benchmarks. HKR-H and HKR-R are weak because this is a narrow training-method paper, so it stays in the 60–71 band.
editor take
CREDIT reframes the self-distillation reward as conditional PMI and wins across two model families; I want ablations on the quality of batch negatives.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts a standard pretrained LLM into an encoder, a looped reasoning block, and a decoder, using four components for stable latent looping; the abstract does not disclose specific base models, datasets, or performance numbers.
#Reasoning#Inference-opt#LoopUS#Research release
why featured
HKR-H/K pass: the paper offers a concrete looped latent-refinement mechanism with a 3-part architecture and 4 stabilizers. Missing models, datasets, and performance numbers keep it in the ordinary research-release band.
editor take
LoopUS recasts a pretrained LLM as an encoder, a looped block, and a decoder; no models or scores are disclosed, so treat it as latent test-time compute for now.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
On What We Can Learn from Low-Resolution Data
The paper analyzes low-resolution sample contributions using Kullback-Leibler divergence and derives bounds tied to downsampling information loss. It reports experiments with a vision transformer and a convolutional neural network showing that adding low-resolution data consistently improves performance when high-resolution training data is scarce.
#Vision#Benchmarking#Research release
why featured
HKR-K is present via a concrete mechanism and testable claim, and HKR-R touches training-data scarcity. No exact gains, artifact, or major-lab impact are disclosed, so this stays in the 60–71 band.
editor take
The paper bounds low-res sample value with KL; no datasets or gains disclosed, so treat it as a theory patch for mixed-resolution training.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Learning Adapter Rank via Symmetry Breaking
The paper introduces LRVD and BayesLoRA, which break LoRA rotational gauge symmetry to learn effective adapter rank and predictive uncertainty with O(r) extra parameters, while the abstract says BayesLoRA matches or exceeds low-rank sparsification baselines at comparable training cost.
#Fine-tuning#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the mechanism is concrete and tied to LoRA fine-tuning cost. No benchmark gains, datasets, or released artifact are disclosed, so this stays in the lower research-release band.
editor take
BayesLoRA learns rank and uncertainty with O(r) extra parameters; I buy this over post-hoc LoRA rank pruning.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
More Edits, More Stable: Understanding Lifelong Normalization in Sequential Model Editing
The paper introduces StableEdit, which strengthens Lifelong Normalization with an explicit warm-up stage and full whitening; removing LN causes immediate performance collapse, and the authors provide code on GitHub.
#Fine-tuning#Alignment#StableEdit#MINE-USTC
why featured
HKR-H/K pass: the paper gives StableEdit, warm-up/full whitening, and an LN-removal collapse claim. HKR-R is weak because sequential model editing is niche and no production-scale validation is disclosed.
editor take
StableEdit adds a warm-up stage and full whitening to LN; with no edit-horizon counts disclosed, I’d treat it as mechanism work.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation
The paper studies grokking on modular arithmetic tasks across multiple random seeds and finds that spectral entropy of the representation covariance matrix crosses a stable task-specific threshold before test accuracy rises; a representation-mixing intervention delays both entropy collapse and grokking, including under norm-matched controls.
#Reasoning#Interpretability#Benchmarking#Research release
why featured
HKR-K passes: the paper offers a testable grokking predictor and intervention result. HKR-H/R are weak because the framing is technical and lacks a broad practitioner nerve, so it stays in all.
editor take
Spectral entropy crosses a threshold before test accuracy rises; I buy the diagnostic, but LLM relevance needs non-toy validation.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
TabDLM uses masked diffusion language models for text and categorical fields, and continuous diffusion with specialized numeric token embeddings for numerical fields; the paper reports stronger results than diffusion and LLM baselines across multiple benchmarks, but the abstract does not disclose dataset names or metric values.
#Multimodal#Benchmarking#TabDLM#Research release
why featured
HKR-K passes: TabDLM adds a joint diffusion design for mixed tabular fields and claims wins over diffusion and LLM baselines. HKR-H and HKR-R are weak, so it stays in the lower interesting band.
editor take
TabDLM splits text, categorical, and numeric fields; no datasets or scores in the abstract, so I don’t buy the LLM-baseline win yet.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab handles industrial tabular data with learnable missing-value tokens, twin-path pretraining, and an MoE-augmented loss, reporting +5.04% AUC and +8.28% KS over prior art on industrial-scale benchmarks.
#Embedding#Fine-tuning#Benchmarking#MaskTab
why featured
HKR-K passes on concrete mechanisms and benchmark deltas: +5.04% AUC and +8.28% KS. HKR-H and HKR-R are weak because this is a niche tabular ML paper, so it stays in all.
editor take
MaskTab reports +5.04% AUC and +8.28% KS; I’d wait for replication beyond private industrial benchmarks.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Causal Bias Detection in Generative Artificial Intelligence
The paper formalizes causal fairness for generative AI, derives decompositions by causal pathway and by replacement of real-world mechanisms with model mechanisms, and evaluates race and gender bias in large language models across multiple datasets.
#Safety#Alignment#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper offers a causal-path/model-replacement bias framework and maps to safety/compliance concerns. Sparse result detail and no major-lab signal keep it in the normal research band.
editor take
This paper treats generative models as arbitrary conditional mechanisms, but the models and datasets are undisclosed; a useful framework with thin empirical support.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
DarkQA: Benchmarking Vision-Language Models on Visual-Primitive QA in Low-Light Indoor Scenes
DarkQA provides 9.4K deterministically generated, verifiable question-image pairs across five visual-primitive families to evaluate how VLM perception degrades in indoor scenes at multiple low-light levels. The abstract says code and the benchmark dataset will be released upon acceptance but gives no fixed public release date.
#Vision#Multimodal#Benchmarking#DarkQA
why featured
HKR-K passes via a concrete benchmark size and setup, but HKR-H and HKR-R are weak because the low-light indoor primitive task is niche and the artifact is not yet released. It reads as a routine research/benchmark item, not a featured one.
editor take
DarkQA has 9.4K low-light indoor QA pairs; RAW-space degradation is solid, but no data until acceptance, so don't cite rankings yet.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
The paper introduces AsymGRPO, which splits GRPO advantage estimation into positive and negative outcome-conditioned channels and reports gains over strong RLVR baselines on five mathematical reasoning benchmarks across model backbones.
#Reasoning#Alignment#Benchmarking#Research release
why featured
HKR-K passes via a concrete mechanism and 5 math-reasoning benchmark results. HKR-H and HKR-R are weak, and the RLVR-training focus keeps it in all, below featured.
editor take
AsymGRPO beats RLVR baselines on five math benchmarks; splitting positive and negative advantages gives GRPO a sharper entropy brake.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
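A minimal sketch of the general idea behind splitting GRPO-style advantages into positive and negative channels. It assumes the standard group-normalized advantage and two separate scale factors, `pos_scale` and `neg_scale` (hypothetical names and values); the abstract does not give AsymGRPO's actual modulation rule.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each rollout's reward normalized within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def asymmetric_advantages(rewards, pos_scale=1.0, neg_scale=0.5):
    """Scale positive and negative advantages separately (hypothetical modulation).

    Shrinking one channel changes how strongly successes vs. failures push the
    policy, which is the lever an entropy-calibration method could tune.
    """
    adv = group_advantages(rewards)
    return [a * pos_scale if a > 0 else a * neg_scale for a in adv]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # binary verifiable rewards for 4 rollouts
    print(asymmetric_advantages(rewards))
```

With symmetric scales this reduces to plain group-normalized GRPO advantages, so the asymmetry is the only new ingredient in the sketch.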
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Looking and Listening Inside and Outside: Multimodal AI Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making
arXiv 2602.07668v2 proposes the L-LIO framework, adding audio to the LILO vision framework, and evaluates three safety cases: driver speech classification for impairment states, passenger spoken instructions for planning interfaces, and external-agent guidance where audio disambiguates vision-only cues.
#Multimodal#Audio#Vision#Research release
why featured
HKR-K passes because the paper names a concrete mechanism and 3 test cases for multimodal driver safety. HKR-H and HKR-R are weak: the angle is academic and the practitioner audience link is narrow, so it sits in the low 60s.
editor take
L-LIO tests 3 safety cases, but sample size is undisclosed; in-car audio helps, yet pilot evidence isn’t a safety stack.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
The paper proposes OGLS-SD, an outcome-guided logit-steering framework that contrasts successful and failed on-policy trajectories using verifiable outcome rewards to calibrate teacher logits; the abstract says it improves reasoning performance over standard OPSD and other variants across diverse benchmarks, but the post does not disclose scores.
#Reasoning#Fine-tuning#Alignment#Research release
why featured
HKR-K passes because the mechanism is concrete: verifiable outcome rewards plus logit steering. HKR-H and HKR-R are weak, and benchmark scores are not disclosed, so this stays in the lower all band.
editor take
OGLS-SD steers teacher logits with success/failure traces; no scores disclosed, so I’m filing it as an RL-distillation patch.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting
The paper proposes a hyperparameter-free covariance-weighted GRPO method that uses a Gaussian kernel to down-weight extreme token-level updates; the abstract says it improves downstream performance across reasoning benchmarks over GRPO, but the post does not disclose benchmark scores.
#Reasoning#Alignment#Benchmarking#Research release
why featured
HKR-H/K pass: the title targets extreme-token GRPO instability, and the post gives a Gaussian-kernel advantage-reweighting mechanism. Score stays at 62 because benchmark numbers are not disclosed and appeal is narrow.
editor take
Covariance-weighted GRPO claims no hyperparameters; no scores disclosed, so I read this as a stability patch, not reasoning progress.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H1·K1·R0
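A minimal sketch of Gaussian-kernel down-weighting of extreme token-level updates. It assumes each token carries one scalar score (for instance, a covariance-style term) and that the kernel's center and width come from batch statistics, which is one way to avoid a hand-tuned bandwidth. The paper's exact covariance definition is not disclosed in the post, and `token_scores` is a hypothetical input.

```python
import math
from statistics import mean, pstdev


def gaussian_kernel_weights(token_scores, eps=1e-8):
    """Weight each token by exp(-z^2 / 2), where z is its z-score within the batch.

    Tokens near the batch mean get weight close to 1; extreme tokens decay toward 0.
    Center and width come from batch statistics, so no bandwidth is set by hand.
    """
    mu = mean(token_scores)
    sigma = pstdev(token_scores) + eps
    return [math.exp(-0.5 * ((s - mu) / sigma) ** 2) for s in token_scores]


def reweight_advantages(advantages, token_scores):
    """Scale per-token advantages by the kernel weights before the policy-gradient step."""
    weights = gaussian_kernel_weights(token_scores)
    return [a * w for a, w in zip(advantages, weights)]
```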
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging
The paper introduces ORBIT for GenRetrieval fine-tuning, which tracks distance from the initial model weights and applies weight averaging once that distance exceeds a maximum threshold, constraining drift and reducing rapid forgetting of general language reasoning abilities.
#Fine-tuning#RAG#Reasoning#ORBIT
why featured
HKR-K and HKR-R pass: the post states ORBIT’s drift-threshold and weight-averaging mechanism, tied to GenRetrieval forgetting. As a single arXiv method note with no metrics, code, or product impact disclosed, it stays in the 60–71 band.
editor take
ORBIT caps GenRetrieval drift by thresholded weight averaging; no models or scores in the snippet, so treat it as an anti-forgetting patch.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
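A minimal sketch of a drift check with thresholded weight averaging, as the summary describes it. It assumes flattened parameter lists, an L2 distance, and a fixed mixing weight; `max_dist` and `alpha` are hypothetical, and ORBIT's actual distance measure and merge rule may differ.

```python
import math


def l2_distance(params, init_params):
    """Euclidean distance between current and initial flattened weights."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(params, init_params)))


def origin_regulated_step(params, init_params, max_dist, alpha=0.5):
    """If drift from the initial weights exceeds max_dist, average back toward the origin.

    alpha is the weight kept on the fine-tuned parameters (hypothetical choice).
    Returns the (possibly merged) parameters and whether a merge happened.
    """
    if l2_distance(params, init_params) <= max_dist:
        return list(params), False
    merged = [alpha * p + (1 - alpha) * q for p, q in zip(params, init_params)]
    return merged, True
```

In a training loop this check would run after each optimizer step. With `alpha = 0.5`, each merge halves the offset from the initial weights, so drift stays bounded without freezing fine-tuning.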
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Research paper proposes entropy polarity control method for reinforcement fine-tuning
The paper proposes PAPO, a reinforcement fine-tuning method that uses token-level entropy polarity to control RLVR updates, and reports stronger results than competitive baselines on mathematical reasoning and agentic benchmarks; the abstract does not disclose the specific models, datasets, or reward improvement numbers.
#Fine-tuning#Reasoning#Agent#arXiv
why featured
HKR-K passes on a concrete mechanism: PAPO applies token-level entropy-polarity control to RLVR. HKR-H and HKR-R are weak, and the abstract omits models, datasets, and lift, so this stays in the lower research-release band.
editor take
PAPO moves RLVR entropy control to tokens; only the abstract is disclosed, with no models, datasets, or gains, so treat it as unverified.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
ADMM-Q: An Improved Hessian-Based Weight Quantizer for LLM Post-Training Quantization
ADMM-Q replaces GPTQ in existing LLM quantization pipelines and reduces WikiText-2 perplexity on Qwen3-8B from 12.85 to 10.06 in W3A16, from 9.29 to 8.68 in W4A8 SmoothQuant, and from 66.11 to 19.42 in W2A4KV4 SpinQuant.
#Inference-opt#Qwen#Research release#Benchmark
why featured
HKR-K is strong with testable perplexity numbers, and HKR-R touches low-bit deployment costs. The ADMM/Hessian PTQ angle is specialized and lacks product or framework impact, so it stays in all.
editor take
ADMM-Q cuts Qwen3-8B W2A4KV4 perplexity from 66.11 to 19.42; 2-bit weights aren’t dead, and GPTQ looks like the real bottleneck.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
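The perplexity figures above are easier to compare as relative reductions. This reads each "from X to Y" pair as the GPTQ-based result versus the ADMM-Q result, which is an interpretation of the summary's wording. By this arithmetic, the reductions are roughly 21.7% (W3A16), 6.6% (W4A8 SmoothQuant), and 70.6% (W2A4KV4 SpinQuant).

```python
# WikiText-2 perplexity on Qwen3-8B as given in the summary: (setting, before, after).
# "before" is read as the GPTQ-based pipeline, "after" as the same pipeline with ADMM-Q.
RESULTS = [
    ("W3A16", 12.85, 10.06),
    ("W4A8 SmoothQuant", 9.29, 8.68),
    ("W2A4KV4 SpinQuant", 66.11, 19.42),
]


def relative_reduction(before, after):
    """Percentage drop in perplexity from `before` to `after`."""
    return 100.0 * (before - after) / before


for setting, before, after in RESULTS:
    print(f"{setting}: {relative_reduction(before, after):.1f}% lower perplexity")
```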
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
PriorZero: Bridging Language Priors and World Models for Decision Making
PriorZero injects LLM-derived conceptual priors only at the MCTS root and alternates world-model learning with LLM fine-tuning on Jericho and BabyAI; the abstract says it improves exploration efficiency and asymptotic performance, but the post does not disclose exact gains.
#Agent#Reasoning#Fine-tuning#PriorZero
why featured
HKR-K passes on the MCTS-root LLM-prior mechanism and alternating training loop. HKR-H/R are weak, and the post gives no lift numbers, so this sits in the 60–71 research-release band.
editor take
PriorZero injects LLM priors only at the MCTS root on Jericho and BabyAI; no gains disclosed, so I file it under clever engineering.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Meta-Learning and Targeted Differential Privacy to Improve the Accuracy-Privacy Trade-off in Recommendations
The paper applies targeted DP only to stereotypical user data likely to reveal gender or age, and uses meta-learning to improve robustness to remaining DP noise; the abstract says this improves accuracy and lowers empirical privacy risk versus uniform DP and full-DP baselines, but does not disclose dataset names or numeric results.
#Fine-tuning#Alignment#Research release
why featured
HKR-K comes from targeted DP on gender/age-revealing data plus meta-learning for noise robustness; HKR-R is limited to privacy-utility tradeoff teams. No metrics, artifact, or deployment detail keeps it in all.
editor take
The paper describes targeted DP plus meta-learning but gives no datasets or numbers; singling out “stereotypical” users makes the privacy boundary thornier.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
AVA-DINO adapts frozen DINOv3 visual features with two specialized branches and text-guided routing, reporting tests on nine industrial and medical benchmarks and 93.5% image-AUROC on MVTec-AD without target-specific training.
#Vision#Multimodal#Benchmarking#AVA-DINO
why featured
HKR-K passes because the summary gives a testable method and 9-benchmark result. HKR-H and HKR-R are weak; without a major lab, product path, or disclosed artifact, this sits in the lower interesting band.
editor take
AVA-DINO reports 93.5% image-AUROC on MVTec-AD; the routing regularizer matters more than the frozen DINOv3 wrapper.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning
The paper proposes SAEParate, which uses a concept-aware contrastive objective to organize SAE latent representations into concept-specific clusters, and evaluates text-to-image diffusion unlearning on UnlearnCanvas. The abstract claims state-of-the-art results and stronger joint style-object unlearning but discloses no numerical metrics.
#Vision#Alignment#Safety#SAEParate
why featured
HKR-K and HKR-R pass: the paper offers a concrete mechanism and benchmark, and touches safety/copyright control for image models. HKR-H is weak, and the work remains specialized research without product impact.
editor take
SAEParate tests diffusion unlearning on UnlearnCanvas; no metrics in the abstract, so trust the cluster-separation mechanism first.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
The paper introduces MULTI, a two-stage Textual Inversion method that disentangles lens, sensor, viewpoint, and domain factors, then evaluates the method on the new DF-RICO benchmark for novel image generation.
#Vision#Multimodal#Fine-tuning#MULTI
why featured
HKR-K passes via the two-stage Textual Inversion method and DF-RICO benchmark. HKR-H and HKR-R miss: this is a narrow vision paper with no product tie-in, major lab, or industry nerve.
editor take
MULTI splits lens, sensor, viewpoint, and domain via two-stage Textual Inversion; no scale disclosed, so treat it as a control diagnostic.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Online Continual Learning with Dynamic Label Hierarchies
The paper introduces DHOCL and HALO for online continual learning with dynamic label hierarchies, where taxonomies evolve horizontally and vertically and each sample provides supervision at one hierarchy level; experiments on multiple benchmarks report higher hierarchical accuracy, lower mistake severity, and better continual performance than existing methods.
#Fine-tuning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes: the paper defines DHOCL, proposes HALO, and reports multi-metric benchmark gains. HKR-H/R are weak because the work is a niche academic ML setting with no product or industry-distribution hook.
editor take
HALO claims gains with single-level supervision, but benchmark names and margins are undisclosed; I buy the setting before the SOTA claim.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP modifies visual latent reasoning on Qwen2.5-VL 7B with three alignment levels: feature-level PCA-aligned latent heads, context-level auxiliary visual supervision, and capacity-guided selective latent supervision; the abstract says it achieves the best mean perception and reasoning performance among supervised variants, but it does not disclose exact scores.
#Reasoning#Multimodal#Vision#Qwen
why featured
HKR-K passes because the paper names a three-layer alignment method and Qwen2.5-VL 7B setup, but HKR-H and HKR-R are weak. With no disclosed scores, this stays in the lower interesting band.
editor take
GAP adds three visual-latent alignment layers to Qwen2.5-VL 7B; no scores disclosed, so I read it as a norm-mismatch diagnosis paper.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
The paper introduces MICL, a continual learning scenario for MLLMs with six tasks that span image, audio, and video inputs and both captioning and question-answering. It also proposes MoInCL, which uses pseudo-target generation and instruction-based knowledge distillation to reduce catastrophic forgetting under modality and task-type shifts.
#Multimodal#Memory#Fine-tuning#Research release
why featured
HKR-K passes via the MICL setup and MoInCL mechanism; HKR-H is weak and HKR-R stays niche. Single arXiv method paper, useful for multimodal fine-tuning readers but below featured.
editor take
MICL spans 6 cross-modal tasks; I buy the setup, but no gains are disclosed, so don’t parrot MoInCL as SOTA yet.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows
DiFaReli++ uses a conditional DDIM for single-view face relighting and trains only on 2D images, without light-stage data, relit pairs, multi-view images, or lighting ground truth.
#Vision#Multimodal#DiFaReli++#Multi-PIE
why featured
HKR-K passes because the paper states a concrete 2D-only training setup without light-stage, paired, multiview, or lighting ground truth. HKR-H and HKR-R are weak; no hard-exclusion applies, so this sits in the 60–71 niche research band.
editor take
DiFaReli++ trains single-view relighting on 2D images only; Multi-PIE scores aren’t disclosed, so don’t overbuy the no-lighting-GT claim.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement
FedSurrogate defends federated learning against backdoor attacks by combining bidirectional gradient alignment filtering, layer-adaptive anomaly detection, and downscaled surrogate updates from similar benign clients. It keeps false-positive rates below 10% across all tested datasets and attack types, versus 31–32% for the nearest comparable baseline, while holding attack success rates below 2.1%.
#Safety#Alignment#Benchmarking#FedSurrogate
why featured
HKR-K passes: the method and metrics, including false positives below 10% and ASR below 2.1%, are concrete. HKR-H/R are weak because FL backdoor defense is niche research, so it stays in all.
editor take
FedSurrogate reports <10% false positives; with baselines at 31–32%, I’d demand non-IID reproduction before buying the win.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training
The paper frames OUI as an early, label-free, activation-based structural signal and reports its use across 3 settings: supervised learning for weight-decay regimes, PPO actor-critic for learning-rate regimes, and online control for layer-wise weight-decay adaptation.
#Interpretability#Benchmarking#Research release
why featured
HKR-K passes: OUI is a concrete label-free activation signal across three settings. HKR-H/R are weak; the angle is academic with no product, cost, or safety spillover, so this stays in the lower research band.
editor take
OUI is tested in 3 settings (supervised learning, PPO, and online control); I’d ask for baselines and failure cases first.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
The paper introduces EHR-RAGp, a retrieval-augmented foundation model that uses a prototype-guided retrieval module to select patient-history chunks by prediction task; the abstract says it outperforms EHR foundation models and transformer baselines across multiple clinical prediction tasks, but does not disclose task counts or metric values.
#RAG#Embedding#Benchmarking#EHR-RAGp
why featured
HKR-K passes: EHR-RAGp has a concrete prototype-guided retrieval mechanism. HKR-H and HKR-R are weak, and the post gives only abstract-level benchmark claims without datasets, margins, or reproducibility details.
editor take
EHR-RAGp retrieves patient-history chunks via prototypes; no task counts or metrics disclosed, so I buy the EHR context patch, not the model leap.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
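A minimal sketch of prototype-guided chunk selection. It assumes each prediction task has one prototype embedding and that patient-history chunks are ranked by cosine similarity to it. `select_chunks`, the embeddings, and `k` are hypothetical; the summary does not say how EHR-RAGp builds prototypes or scores chunks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_chunks(chunk_embeddings, task_prototype, k):
    """Return indices of the k history chunks most similar to the task prototype."""
    ranked = sorted(
        range(len(chunk_embeddings)),
        key=lambda i: cosine(chunk_embeddings[i], task_prototype),
        reverse=True,
    )
    return ranked[:k]
```

Because the prototype changes with the prediction task, the same patient history can yield different retrieved chunks for, say, a readmission task and a mortality task.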
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Intrinsic Vicarious Conditioning for Deep Reinforcement Learning
The paper introduces vicarious conditioning as an intrinsic reward mechanism for deep reinforcement learning, implements four steps—attention, retention, reproduction, and reinforcement—and evaluates it in MiniWorld Sidewalk and Box2D CarRacing without requiring the demonstrator agent’s policy or reward function.
#Agent#Memory#Reasoning#Research release
why featured
HKR-K passes: the paper gives a 4-step vicarious-conditioning reward mechanism and two testbeds. HKR-H/R are weak; the angle is academic and lacks product impact or industry tension.
editor take
The paper reports only MiniWorld and CarRacing; I don’t buy it yet. Without learning curves, it smells like observational learning rebranded as intrinsic reward.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Diffusion-State Policy Optimization for Masked Diffusion Language Models
DiSPO branches at selected intermediate masked states and updates only newly filled tokens; experiments on LLaDA-8B-Instruct show it improves over diffu-GRPO and SPG on math and planning benchmarks under matched rollout compute and optimizer steps.
#Reasoning#Fine-tuning#Benchmarking#LLaDA
why featured
HKR-K passes: the post gives DiSPO’s resampling and token-update mechanism plus LLaDA-8B-Instruct comparisons. HKR-H/R are weak because this is a niche training-algorithm paper with no product impact.
editor take
DiSPO reuses cached logits at masked states with no extra rollouts; I buy the trick, but LLaDA-8B is not proof of breadth.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
From Observations to States: Latent Time Series Forecasting
The paper proposes LatentTSF, which shifts time series forecasting from observation-space regression to latent-state prediction; the method uses an AutoEncoder to project observations into a learned state space, and the abstract reports consistent gains in forecasting accuracy and representation quality on widely used benchmarks.
#Benchmarking#Research release#Open source#Benchmark
why featured
HKR-K passes on the LatentTSF mechanism, while HKR-H and HKR-R are weak: no concrete benchmark numbers, adoption context, or practitioner pain point is disclosed. That keeps it in the lower-value all tier.
editor take
LatentTSF forecasts in AE latent space; the snippet gives no numbers. I buy the setup, not the “Latent Chaos” branding.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
The Confusion is Real: GRAPHIC -- A Network Science Approach to Confusion Matrices in Deep Learning
The paper introduces GRAPHIC, an architecture-agnostic method that derives confusion matrices from intermediate layers with linear classifiers and treats them as directed graph adjacency matrices to analyze class confusion across training epochs and layers.
#Interpretability#Benchmarking#GRAPHIC#Research release
why featured
HKR-K passes via a testable mechanism for layerwise confusion analysis. HKR-H and HKR-R are weak, and the item is a sparse arXiv research note with no product impact or industry debate.
editor take
GRAPHIC turns linear-probe confusion matrices into graphs; useful tooling, but the flatfish/man example reads like a visualization win, not reliability evidence.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
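A minimal sketch of the confusion-matrix-as-graph step. It assumes rows index true classes, columns index predicted classes, and edge weights are row-normalized. The matrix would come from a linear probe trained on one intermediate layer's activations. `confusion_to_edges` and `out_strength` are hypothetical helpers, and the paper's exact normalization and graph metrics are not given in the summary.

```python
def confusion_to_edges(confusion, labels, min_weight=0.0):
    """Turn a confusion matrix into weighted directed edges true -> predicted.

    Each row is normalized, so an edge weight is the fraction of that true class
    predicted as another class. Diagonal (correct) entries are dropped because they
    are not confusions. Edges are returned strongest first.
    """
    edges = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            continue
        for j, count in enumerate(row):
            if i != j and count / total > min_weight:
                edges.append((labels[i], labels[j], count / total))
    return sorted(edges, key=lambda e: e[2], reverse=True)


def out_strength(edges):
    """Sum of outgoing confusion weight per node: how much each class leaks to others."""
    strength = {}
    for src, _dst, w in edges:
        strength[src] = strength.get(src, 0.0) + w
    return strength
```

Repeating this per layer and per epoch gives a sequence of graphs, so standard network statistics can track where class confusion forms and dissolves during training.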
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Sparsity and Out-of-Distribution Generalization
The paper proposes three conditions for OOD generalization: distinguished features, sparse hypotheses, and sufficient overlap between the train and test distributions when both are restricted to the relevant or hypothesized features.
#Reasoning#Alignment#Benchmarking#Research release
why featured
HKR-K passes: the paper offers a concrete OOD generalization framework. HKR-H and HKR-R are weak, and only abstract-level detail is disclosed, with no numbers, code, or industry deployment angle.
editor take
The paper gives 3 OOD conditions; extending Blumer sample bounds is useful theory, not a benchmark story.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment
Instruct-ICL uses one MLLM to generate task-specific instructions as Chain-of-Thought guidance for a second MLLM, evaluates post-disaster VQA on FloodNet against a zero-shot baseline, and reports consistent accuracy gains, while the abstract does not disclose model names or numeric accuracy results.
#Multimodal#Vision#Reasoning#arXiv
why featured
HKR-K passes via a reproducible two-MLLM mechanism on FloodNet, but the post gives no improvement number. The application is narrow and lacks product, agent, or major-lab relevance, so it stays below featured.
editor take
Instruct-ICL only claims to beat zero-shot on FloodNet, with no model names or gains. Disaster VQA needs reliability, not prompt-workflow vibes.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing
The researchers used a commodity smartwatch to synchronize microphone audio with 6-axis inertial signals for face-to-face conversation detection. Convolutional and attention-based networks reached macro F1 scores of 82.0±3.0% in an 11-participant lab study and 77.2±1.8% in a 24-participant semi-naturalistic study.
#Multimodal#Audio#Research release
why featured
HKR-H/K/R all land lightly: the study has a privacy hook and concrete F1 results. Its impact stays low because it is wearable sensing/applied ML, not a model, product, or agent workflow update.
editor take
A commodity watch hits 77.2% macro F1 in semi-natural settings; on-device is nice, but 24 people is thin evidence.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H1·K1·R1
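Macro F1, the metric reported above, averages per-class F1 without weighting by class frequency. That matters when conversation windows are a minority of the data. This is a minimal sketch of the standard definition, not the authors' evaluation code. How the ± spread was computed (for example, across folds or participants) is not stated in the summary.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```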
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Improving the Performance and Learning Stability of Parallelizable RNNs for Ultra-Low Power Applications
The paper proposes CMRU and αCMRU, replacing BMRU’s state update with a cumulative formulation that restores gradient flow and creates skip connections through time. Experiments report better convergence stability, lower initialization sensitivity, and performance matching or exceeding LRUs and minGRUs at small model sizes, especially on discrete long-range retention tasks.
#Benchmarking#Inference-opt#Research release#Benchmark
why featured
HKR-K passes via CMRU/αCMRU and the cumulative-update mechanism. HKR-H and HKR-R are weak, and the sequence-model architecture focus limits appeal beyond specialist readers.
editor take
CMRU fixes BMRU’s gradient blocking via cumulative updates. Small-model wins matter, but simulated low power is not silicon proof.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
arXiv:2603.00191v3 proposes LoDA, which uses two energy-based objectives to split LoRA into general and task-specific subspaces, fixes the down-projections, learns the up-projections with Gradient-Aligned Optimization, and applies a closed-form recalibration before merging updates into the backbone. The snippet says experiments beat existing continual-learning methods but does not disclose benchmark numbers.
#Fine-tuning#Memory#Benchmarking#arXiv
why featured
HKR-K passes because the summary names LoDA’s decomposition mechanism and GAO projection learning. HKR-H/R are weak, and no benchmark numbers or practical replacement claim are disclosed, so this stays in all.
editor take
LoDA splits LoRA into shared and isolated subspaces; no scores disclosed, so I buy the mechanism, not the win claim.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning
FLARE evaluates federated-learning client reliability with multi-dimensional reputation, adaptive thresholds, reputation-weighted aggregation, and LDP. Experiments with 100 clients on MNIST, CIFAR-10, and SVHN report up to 16% robustness gains while keeping convergence within 30% of the non-attacked baseline.
#Fine-tuning#Alignment#Benchmarking#FLARE
why featured
HKR-K passes via datasets, 100-client setup, 16% gain, and concrete mechanisms. HKR-H and HKR-R are weak: federated-learning reliability is academically useful but narrow, with no product or agent impact disclosed.
editor take
FLARE reports up to 16% robustness gains on 100-client MNIST/CIFAR/SVHN; I want non-IID runs and code before trusting it.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
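A minimal sketch of reputation-weighted aggregation with an adaptive cutoff. It assumes one scalar reputation per client and a threshold set at mean minus `k` standard deviations of the current reputations. `adaptive_threshold`, `k`, and the single-score simplification are hypothetical. FLARE's reputation is multi-dimensional, and its LDP step is not modeled here.

```python
from statistics import mean, pstdev


def adaptive_threshold(reputations, k=1.0):
    """Hypothetical adaptive cutoff: mean reputation minus k population std devs."""
    return mean(reputations) - k * pstdev(reputations)


def reputation_weighted_average(updates, reputations, threshold):
    """Average client updates weighted by reputation, skipping clients below threshold."""
    kept = [(u, r) for u, r in zip(updates, reputations) if r >= threshold]
    if not kept:
        raise ValueError("no client passed the reputation threshold")
    total = sum(r for _, r in kept)
    dim = len(kept[0][0])
    return [sum(u[i] * r for u, r in kept) / total for i in range(dim)]
```

Weighting by reputation dampens low-trust clients that pass the cutoff, while the threshold drops outliers entirely. Because the cutoff tracks the population, it adapts as reputations shift over rounds.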
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
A Semi-Supervised Framework for Speech Confidence Detection Using Whisper
The paper proposes a semi-supervised framework that fuses Whisper encoder embeddings, eGeMAPS descriptors, and vocal stress and disfluency probabilities, achieving 0.751 Macro-F1 and a 3% minority-class gain over a unimodal Whisper baseline.
#Audio#Embedding#Fine-tuning#Whisper
why featured
HKR-K passes with a concrete architecture and Macro-F1 number. HKR-H/R are weak: this is a narrow speech-classification paper with no product path, code release, or broader industry impact disclosed.
editor take
The Whisper hybrid hits 0.751 Macro-F1; I don’t buy the semi-supervised framing, but the 3% minority-class gain is the useful claim.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA
FedRot-LoRA aligns client LoRA updates with orthogonal transformations before aggregation, reducing aggregation error caused by rotational invariance in low-rank factorizations without increasing communication cost or restricting model expressivity.
#Fine-tuning#Alignment#Research release
why featured
HKR-K passes because the post gives a concrete mechanism: orthogonal alignment before federated LoRA aggregation with no extra communication cost. HKR-H/R are weak: the angle is narrow, with no benchmark gains or deployment stakes disclosed.
editor take
FedRot-LoRA aligns factors before aggregation with zero extra comms; nice trick, but no numbers here, so don’t buy “stable training” yet.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
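A minimal sketch of why rotational invariance hurts naive federated LoRA aggregation and how an orthogonal alignment can fix it. It assumes rank-2 factors, so the best aligning rotation has a closed form (the 2D orthogonal Procrustes solution). FedRot-LoRA's actual alignment procedure, rank, and reference choice are not described in the post. The helpers here are hypothetical and use plain lists to stay dependency-free.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product for small lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def rotation(theta):
    """2x2 rotation matrix [[cos, -sin], [sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def average(X, Y):
    """Elementwise mean of two same-shape matrices (naive FedAvg of one factor)."""
    return [[(a + b) / 2 for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def align_rank2(B, B_ref):
    """Rotation R minimizing ||B R - B_ref|| for rank-2 factors (2D Procrustes)."""
    M = matmul(transpose(B), B_ref)
    theta = math.atan2(M[1][0] - M[0][1], M[0][0] + M[1][1])
    return rotation(theta)


if __name__ == "__main__":
    B1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    A1 = [[1.0, 2.0], [3.0, 4.0]]
    R = rotation(math.pi / 2)
    B2, A2 = matmul(B1, R), matmul(transpose(R), A1)  # same product B2 A2 == B1 A1
    naive = matmul(average(B1, B2), average(A1, A2))
    R_al = align_rank2(B2, B1)
    B2a, A2a = matmul(B2, R_al), matmul(transpose(R_al), A2)  # product unchanged
    aligned = matmul(average(B1, B2a), average(A1, A2a))
    print(naive)    # about half of B1 A1
    print(aligned)  # about B1 A1
```

With a 90-degree rotation between clients, naive factor averaging returns exactly half of the shared update B1 A1. Aligning client 2's factors to client 1's before averaging recovers it, and the alignment leaves each client's own product unchanged because R R^T = I.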
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
The paper compares five self-supervised pretraining objectives for ECG foundation models using up to 11 million public samples; contrastive predictive coding slightly leads JEPA on transfer, and structured state space models outperform transformers and CNNs across tested pretraining methods.
#Benchmarking#Research release#Benchmark
why featured
HKR-K passes because the paper gives concrete scale and model comparisons. HKR-H and HKR-R are weak: ECG foundation-model training is narrow medical-signal work with no product or agent implication disclosed.
editor take
ECG pretraining scales to 11M public samples; SSM beating transformers matters more than CPC edging JEPA.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM
STARC clusters KV pairs by semantic similarity and maps them to PIM-aligned memory regions; on HBM-PIM, it reduces attention-layer latency by 19%–31% and energy use by 19%–27% versus token-wise sparsity methods.
#Inference-opt#STARC#arXiv#Research release
why featured
HKR-K is solid: KV clustering, PIM-bank mapping, and 19%–31% latency plus 19%–27% energy cuts. HKR-H is weak, and HBM-PIM specialization lowers the score.
editor take
STARC cuts HBM-PIM attention latency 19–31%; KV clustering is credible, but this still sits far from today’s GPU serving stack.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R1
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
ξ-DPO: Direct Preference Optimization via Ratio Reward Margin
The paper introduces ξ-DPO, replacing SimPO’s γ margin tuning with a chosen/rejected ratio reward margin; β controls sample filtering, and ξ can be set from the initial reward-gap distribution instead of repeated trial-and-error.
#Alignment#Fine-tuning#Research release
why featured
HKR-K passes: the post gives ξ-DPO, β-based sample filtering, and ξ set from the initial reward-gap distribution. HKR-H/R are weak; as a specialized single arXiv method paper with no benchmark or artifact disclosed, it stays in all.
editor take
ξ-DPO replaces SimPO’s γ margin tuning with a ξ ratio margin; benchmarks aren’t disclosed, so treat it as tuning-cost work.
HKR breakdown
hook knowledge resonance
open source
55
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
KAN-CL: Per-Knot Importance Regularization for Continual Learning with Kolmogorov-Arnold Networks
KAN-CL uses a KAN classification head with bbEWC on a convolutional backbone, reducing forgetting by 88% on Split-CIFAR-10/5T and 93% on Split-CIFAR-100/10T versus a head-only KAN baseline while matching or exceeding baseline accuracy on both benchmarks.
#Fine-tuning#Benchmarking#KAN-CL#Kolmogorov-Arnold Networks
why featured
HKR-K passes with a concrete mechanism and Split-CIFAR numbers; HKR-H/R are weak because the angle is niche research. Technical accessibility drags it down, but it remains ML-relevant rather than excluded.
editor take
KAN-CL cuts forgetting 88%/93% on two Split-CIFAR setups; I’d audit the head-only KAN baseline before crediting KAN.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
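The post names per-knot importance regularization and bbEWC but gives no formula. Below is a minimal sketch of a standard EWC-style quadratic penalty. It assumes each regularized parameter is one KAN spline-knot coefficient with its own importance weight. bbEWC's actual importance estimate is not reproduced, and `ewc_penalty` and `total_loss` are hypothetical helpers.

```python
def ewc_penalty(params, anchor, importance, lam=1.0):
    """EWC-style quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i) ** 2.

    Each index i stands for one spline-knot coefficient of a KAN head (an assumption
    for illustration); anchor holds the values after the previous task and
    importance holds the per-knot weights F_i.
    """
    if not (len(params) == len(anchor) == len(importance)):
        raise ValueError("params, anchor and importance must have equal length")
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, importance)
    )


def total_loss(task_loss, params, anchor, importance, lam):
    # New-task loss plus the penalty that keeps important knots near their old values
    return task_loss + ewc_penalty(params, anchor, importance, lam)
```

Knots with zero importance can move freely, while high-importance knots are pulled back toward their previous-task values.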
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift
The paper introduces DeconDTN-Toolkit to simulate provenance shifts of varying degrees under existing benchmark training protocols, and evaluates ERM vulnerability, a robust out-of-distribution performance indicator, and mitigation methods.
#Benchmarking#Alignment#DeconDTN-Toolkit#Research release
why featured
HKR-K passes for a concrete toolkit mechanism: provenance-shift simulation and ERM/OOD evaluation. HKR-H and HKR-R are weak, and the article stays at abstract-level detail, so it fits all rather than featured.
editor take
DeconDTN-Toolkit targets provenance shift; task count is undisclosed, so I’d first test whether it actually breaks ERM baselines.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Seeing the Needle in the Haystack: Weakly Supervised Log Instance Anomaly Localization via Counterfactual Perturbation
The paper proposes LogMILP, a weakly supervised framework that uses only bag-level labels for log anomaly detection and instance-level localization, and reports experiments on three public datasets with open-source code released on GitHub.
#Interpretability#Benchmarking#LogMILP#Research release
why featured
HKR-K passes via a new method, 3 datasets, and open code. HKR-H/R are weak, and log anomaly localization is too narrow for featured placement.
editor take
LogMILP localizes log anomalies with bag-level labels only; three public datasets and code make this a usable baseline.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Calibrated Multimodal Representation Learning with Missing Modalities
The paper proposes CalMRL for multimodal datasets with missing modalities, explains incomplete alignment through anchor shift, and calibrates representation-level imputation using bi-step learning plus a closed-form posterior solution for shared latent variables.
#Multimodal#Embedding#CalMRL#Research release
why featured
HKR-K passes: CalMRL offers an anchor-shift explanation and a bi-step calibration mechanism for missing modalities. HKR-H and HKR-R fail; the post gives no experiment numbers or artifact details, so this stays niche research signal.
editor take
CalMRL imputes missing modalities at representation level; dataset scale isn’t disclosed, and the anchor-shift diagnosis lives or dies by reproduction.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Investigating Simple Target-Covariate Relationships for Chronos-2 and TabPFN-TS
The paper designs controlled experiments with simple target-covariate relationships to evaluate covariate integration in Chronos-2 and TabPFN-TS; results show TabPFN-TS captures these relationships more effectively than Chronos-2, especially for short forecast horizons.
#Benchmarking#Chronos-2#TabPFN-TS#Research release
why featured
HKR-K passes because the paper reports a controlled covariate-integration test and a short-horizon result. HKR-H and HKR-R miss: the angle is narrow time-series benchmarking with little practitioner-wide tension.
editor take
TabPFN-TS beats Chronos-2 on short horizons; strong Chronos-2 benchmarks don’t prove clean covariate use.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Focusing Influence Mechanism for Multi-Agent Reinforcement Learning
The paper proposes FIM, a multi-agent reinforcement learning framework that uses an entropy-based criterion and eligibility traces to focus agents on under-explored state-space regions under sparse rewards; the abstract says it improves cooperative performance across diverse MARL benchmarks, but the post does not disclose specific scores.
#Agent#Reasoning#Benchmarking#Research release
why featured
HKR-K passes on a testable mechanism, while HKR-H and HKR-R are weak. No benchmark scores are disclosed, and the MARL framing is too specialized for featured treatment.
editor take
FIM uses an entropy criterion and eligibility traces to target under-explored states; no scores are disclosed, so I file it as a sparse-reward exploration patch.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty
arXiv 2605.12281 models English vocabulary difficulty for Spanish, German, and Chinese L1 learners with gradient-boosted models, then uses Shapley values to compare familiarity, meaning, surface-form, and cross-linguistic transfer feature groups.
#Benchmarking#Interpretability#Research release
why featured
Applied-linguistics ML paper that passes HKR-H and HKR-K: the question is readable and the method is concrete. HKR-R is absent; no product, agent, or industry impact is disclosed, so it stays in the low-value research band.
editor take
arXiv 2605.12281 covers 3 L1 groups. Familiarity beats transfer; useful for vocab ranking, not an SLA model.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H1·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
FeatMap: Understanding Image Manipulation in Feature Space and Its Implications for Feature Geometry
FeatMap learns mappings from original feature maps to manipulated feature maps across geometric transforms, photometric changes, local masking, and semantic edits from generative image editing models. The paper reports that global transformer mappings often perform best, while a shared linear model on one feature vector usually reaches similar reconstruction quality with little degradation.
#Vision#Multimodal#Interpretability#arXiv
why featured
HKR-K passes via a concrete mechanism and experiment claim; HKR-H/R are weak because the title is technical and lacks practitioner resonance. No hard exclusion applies, but the audience fit is narrow.
editor take
FeatMap finds that a shared linear map on one feature vector nearly matches transformer mappings; I buy the probe, but the linear-geometry claim needs cross-model replication.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Scaling Laws and Tradeoffs in Recurrent Networks of Expressive Neurons
The paper introduces ELM Network, tuning unit count N, per-unit complexity k_e, and connectivity k_c under a fixed parameter budget P, and evaluates the tradeoff with a three-order-of-magnitude parameter sweep on SHD-Adding and Enwik8 sequence benchmarks.
#Reasoning#Benchmarking#arXiv#Research release
why featured
HKR-K passes: the paper gives a new network setup and a three-order-of-magnitude parameter sweep. HKR-H/R are weak because the angle is academic and lacks product or industry pull; no hard exclusion applies.
editor take
ELM Network sweeps three orders of magnitude in parameters; I buy the allocation question, not the cortex analogy, so replicate beyond Enwik8 first.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Trajectory First: A Curriculum for Discovering Diverse Policies
The paper proposes a two-stage reinforcement-learning curriculum: it first uses a spline-based trajectory prior to produce diverse, high-reward behaviors, then distills them into reactive step-wise policies; the abstract says empirical evaluation shows higher learned-skill diversity while maintaining task performance.
#Agent#Robotics#Fine-tuning#Research release
why featured
HKR-K passes because the abstract gives a concrete training mechanism, but tasks, metric gains, and artifacts are not disclosed. HKR-H and HKR-R stay weak, so this is niche research signal below featured.
editor take
Trajectory First uses two-stage RL for skill diversity; task count and baselines aren’t disclosed, and spline priors feel practical, not novel.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Resilient Vision-Tabular Multimodal Learning under Modality Missingness
The paper proposes a vision-tabular Transformer that uses masked self-attention and modality dropout to handle missing modalities, and evaluates it on MIMIC-CXR paired with MIMIC-IV for multilabel classification of 14 diagnostic findings.
#Multimodal#Vision#MIMIC-CXR#MIMIC-IV
why featured
HKR-K passes with concrete mechanisms and MIMIC-CXR/MIMIC-IV evaluation details. HKR-H and HKR-R are weak; this is niche medical multimodal robustness research with limited product or agent relevance.
editor take
This tests missing-modality robustness on 14 MIMIC labels; no AUC disclosed, so don’t confuse masked attention with clinical reliability.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Exploring Token-Space Manipulation in Latent Audio Tokenizers
The paper proposes LATTE, which appends a fixed set of learnable latent tokens to audio feature sequences, keeps only those tokens for quantization and decoding, and evaluates selected token-position swaps on voice conversion and denoising tasks.
#Audio#LATTE#Research release
why featured
HKR-K passes on the LATTE mechanism, but HKR-H and HKR-R miss: no result numbers, code release, or product impact are disclosed. This stays in the low-value research band.
editor take
LATTE keeps only fixed latent tokens for quantization and decoding; I buy the question, but bitrate, MOS, and failures are undisclosed.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Hypernetworks for Dynamic Feature Selection
The paper proposes Hyper-DFS, a hypernetwork-based dynamic feature selection method that generates classifier parameters for each feature subset and uses a Set Transformer for the conditioning space. The abstract says it beats or matches state-of-the-art methods on synthetic, real tabular, and image benchmarks, but the RSS snippet does not disclose dataset counts or scores.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes on the concrete Hyper-DFS mechanism, but the post gives no scores or reproducible setup. HKR-H and HKR-R fail, so this stays in the lower all band.
editor take
Hyper-DFS generates classifiers per feature subset; scores and dataset counts are undisclosed, so don’t buy the all-SOTA claim yet.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Empirical Study of Non-Uniform Replay Effects in Reinforcement Learning
The paper evaluates three modern off-policy RL algorithms on five benchmark suites and finds non-uniform replay helps most when replay volume is low, while high-entropy sampling remains important at comparable expected recency.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes with concrete benchmarks and conditions, but non-uniform replay is a narrow RL algorithm question with no product or agent link. hard-exclusion-technical-accessibility caps it below 40.
editor take
The paper reduces non-uniform replay gains to three factors: low replay volume, recency, and high entropy. That is more useful than yet another PER variant.
HKR breakdown
hook knowledge resonance
open source
49
SCORE
H0·K1·R0
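The sketch below makes "high-entropy sampling at comparable expected recency" concrete. It is our illustration, not the paper's samplers. It compares uniform sampling over the 21 newest transitions with exponential recency weighting tuned to the same expected age of about 10 steps. The exponential sampler spreads probability over more of the buffer, so it has higher entropy (roughly 3.35 vs. 3.04 nats). The buffer size and decay constant are arbitrary choices.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def recency_weights(buffer_size, tau):
    """Exponential recency weights; index 0 is the newest transition (age 0)."""
    return normalize([math.exp(-age / tau) for age in range(buffer_size)])


def window_weights(buffer_size, window):
    """Uniform sampling over only the `window` most recent transitions."""
    return normalize([1.0 if age < window else 0.0 for age in range(buffer_size)])


def expected_age(probs):
    return sum(age * p for age, p in enumerate(probs))


def entropy(probs):
    # Shannon entropy in nats, skipping zero-probability slots
    return -sum(p * math.log(p) for p in probs if p > 0)


window = window_weights(100, 21)
expo = recency_weights(100, 1.0 / math.log(1.1))  # weights proportional to (10/11) ** age
```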
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis
SurvBench converts four PhysioNet critical-care databases into model-ready tensors for survival analysis, covering time-series vitals and labs, static demographics, ICD codes, and radiology report embeddings, with preprocessing decisions controlled through YAML and train-fold-only fitting for imputation, scaling, and feature filtering.
#Multimodal#Embedding#Benchmarking#SurvBench
why featured
HKR-K passes because the post gives 4 PhysioNet ICU databases and 4 input types; HKR-H/R fail because EHR survival analysis is narrow and distant from mainstream AI product or agent concerns.
editor take
SurvBench wires 4 PhysioNet datasets; for EHR survival models, reproducible preprocessing beats another architecture tweak.
HKR breakdown
hook knowledge resonance
open source
49
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
VNDUQE: Information-Theoretic Novelty Detection Using Deep Variational Information Bottleneck
VNDUQE uses Deep Variational Information Bottleneck models on MNIST with held-out digit classes for OOD detection; KL divergence reaches 100% AUROC on noise, prediction entropy reaches 94.7% AUROC on novel digits, and a parallel two-metric strategy averages 95.3% AUROC.
#Safety#Benchmarking#VNDUQE#Research release
why featured
HKR-K passes with concrete AUROC results and a VIB mechanism; HKR-H and HKR-R fail. This is a narrow MNIST OOD paper without product, agent, or production-pipeline implications, so it stays in the lower research-signal band.
editor take
VNDUQE hits 95.3% AUROC on held-out MNIST; I don’t buy the safety angle until CIFAR/ImageNet-style OOD shows up.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
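The post does not give enough detail to reproduce VNDUQE's VIB-based KL score. The prediction-entropy half of its parallel strategy, and the AUROC used to report it, fit in a few lines. This sketch assumes raw classifier logits. Treating OOD as the positive class is our convention.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    # Higher entropy means a less confident prediction, used here as the OOD score
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)


def auroc(scores_in, scores_out):
    """P(random OOD score > random in-distribution score), ties counted as 1/2."""
    wins = 0.0
    for o in scores_out:
        for i in scores_in:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(scores_in) * len(scores_out))
```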
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Neural Operators Learn Conditioning Mappings for Multiple Densities
The paper proposes a single operator that maps any joint density to its conditional distribution, proves neural operators can approximate this conditioning operator to arbitrary accuracy under suitable density classes, and tests the learned conditioning map on a class of Gaussian mixtures.
#Reasoning#Research release
why featured
Hard-exclusion: technical-accessibility fail. The paper is specialized probabilistic modeling theory; it gives a mechanism and Gaussian-mixture test, but no product, agent, or practical pipeline impact. HKR-K passes only, so the score is capped below 40.
editor take
Tsimpos et al. prove one neural operator can approximate conditioning; tests stop at Gaussian mixtures, so Bayesian foundation-model claims stay early.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
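For the Gaussian-mixture test family, the conditioning map has a closed form, which is what a learned operator would be checked against. Below is a minimal 2-D sketch. It is our illustration; the post does not give the paper's mixture class or dimensionality. Each component is reweighted by how well it explains x, then replaced by its Gaussian conditional over y.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def condition_gmm(components, x):
    """Closed-form p(y | x) for a 2-D Gaussian mixture.

    components: list of (weight, (mu_x, mu_y), (s_xx, s_xy, s_yy)).
    Returns (weight, mean, var) triples of the 1-D conditional mixture over y.
    """
    out = []
    for w, (mx, my), (sxx, sxy, syy) in components:
        resp = w * normal_pdf(x, mx, sxx)     # reweight by marginal likelihood of x
        mean = my + sxy / sxx * (x - mx)      # conditional mean of y given x
        var = syy - sxy ** 2 / sxx            # conditional variance (Schur complement)
        out.append((resp, mean, var))
    total = sum(r for r, _, _ in out)
    return [(r / total, m, v) for r, m, v in out]
```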
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Space Syntax-guided Post-training for Residential Floor Plan Generation
The paper proposes SSPT, which uses SSIO to convert generated floor plans into rectangle-space graphs and feeds configurational metrics back into trained generators through SSPT-Iter and SSPT-PPO. Experiments report higher public-space dominance and functional-hierarchy alignment than the baseline without post-training, with SSPT-PPO showing stronger gains, lower variance, and higher efficiency than iterative retraining.
#Fine-tuning#Robotics#Benchmarking#Research release
why featured
HKR-K passes for concrete SSPT/SSIO mechanisms, but HKR-H and HKR-R are weak because the topic is narrow floor-plan generation with no product, agent, or broad model impact disclosed.
editor take
SSPT-PPO turns space syntax into a reward; sample size is undisclosed, so I’d first audit SSIO for layout gaming.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods
The authors released gym-invmgmt, evaluating optimization, heuristic, and learned inventory controllers under one CoreEnv contract across 22 core scenarios and four supplemental MARL rows; PPO-Transformer shows the strongest learned-policy quality with fast inference, while informed stochastic programming is the strongest non-oracle reference at higher online compute cost.
#Agent#Benchmarking#arXiv#Gymnasium
why featured
HKR-K passes via benchmark size and controller comparison; HKR-H/R are weak because this is vertical OR/inventory-control work, not a broad AI-practitioner story. No hard exclusion, so it lands in the low-value research band.
editor take
gym-invmgmt covers 22 inventory scenarios; PPO-Transformer leads learned policies, while the LLM baseline is just diagnostic gear.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Read, Extract, Classify: A Tool for Smarter Requirements Engineering
The paper presents ReXCL, a requirements engineering tool with two modules for extraction and classification; it processes raw requirement documents into a predefined schema, assigns labels via adaptive fine-tuning of encoder-based models, and exports results to external tools, but the abstract does not disclose concrete efficiency or accuracy numbers.
#Fine-tuning#Tools#ReXCL#Research release
why featured
HKR-K passes on the extract/classify workflow and export mechanism, while HKR-H and HKR-R miss. No hard exclusion applies, but absent metrics and narrow software-engineering scope keep it in the low-value browse band.
editor take
ReXCL has two modules for requirements docs; no accuracy or efficiency numbers, so I treat “significant” as filler.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Pruning Federated Models through Loss Landscape Analysis and Client Agreement Scoring
AutoFLIP prunes federated models using one-time federated loss exploration and client agreement scoring, reducing computational overhead by 52% on average and communication costs by more than 65% under challenging non-IID client data conditions.
#Fine-tuning#Inference-opt#Benchmarking#Christian Internò
why featured
HKR-K passes via concrete mechanisms and cost-reduction numbers. HKR-H and HKR-R are weak; federated pruning is specialist material with no product or flagship-model impact, so it stays in the low-value research band.
editor take
AutoFLIP reports 52% compute and 65% communication cuts; for federated pruning, ask how ugly the non-IID benchmark is.
HKR breakdown
hook knowledge resonance
open source
48
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Assessing the Impact of Dimensionality Reduction on Clustering Performance: A Systematic Study
The paper evaluates five dimensionality reduction methods against four clustering algorithms, using ARI to compare no reduction with k-1, 25%, and 50% dimensional settings; the abstract does not disclose the number of datasets or the best method-algorithm combinations.
#Benchmarking#Research release#Benchmark
why featured
HKR-K passes on concrete experimental setup, while HKR-H/R fail due to a dry angle and weak practitioner stakes. Treat as low-value research release; no hard exclusion triggered.
editor take
The paper tests 5 reducers × 4 clusterers × 3 dimensions; without dataset count or winners, it is not a preprocessing rulebook.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
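ARI is the paper's comparison metric. For readers checking results, here is a self-contained sketch of the standard adjusted Rand index, computed from the contingency table of two labelings. Treat it as illustrative rather than as the paper's evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie ARI: (index - expected) / (max_index - expected)."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two points: treat as perfect agreement
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate, e.g. both labelings trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

ARI ignores label names, so a relabeled but identical clustering scores 1.0. Chance-level agreement scores near 0.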
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits
The paper proves the first worst-case regret upper bound of Õ(m√(NT)) for CTS-G in sleeping semi-bandits and proposes CL-SG, which samples one shared Gaussian seed per round and improves the bound to Õ(√(mNT)).
#Reasoning#Benchmarking#Research release#Open source
why featured
hard-exclusion-technical-accessibility: sleeping semi-bandit regret bounds need specialist context and give no engineering or product hook. HKR-K passes on new bounds, but HKR-H/R fail.
editor take
CTS-G gets its first worst-case Õ(m√(NT)) bound; CL-SG cuts it to Õ(√(mNT)), useful for real routing/recsys bandits.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
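The post says CL-SG shares one Gaussian seed per round but does not say how the seed is scaled. The sketch below works under an assumed scaling. Every awake arm gets the index mean + z / sqrt(pulls). CL-SG style uses one z for all arms; CTS-G style uses an independent draw per arm. The top m awake arms are played. The 1/sqrt(pulls) scale and `select_arms` are hypothetical.

```python
import math
import random


def select_arms(means, counts, available, m, rng, shared_seed=True):
    """Pick up to m awake arms by Gaussian Thompson-style indices (illustrative only).

    shared_seed=True reuses one z ~ N(0, 1) for every arm this round (our reading of
    CL-SG); False draws an independent sample per arm, as in CTS-G. The noise scale
    1 / sqrt(pulls) is an assumption, not taken from the paper.
    """
    z = rng.gauss(0.0, 1.0) if shared_seed else None
    indices = {}
    for arm in available:  # sleeping arms are simply absent from `available`
        noise = z if shared_seed else rng.gauss(0.0, 1.0)
        indices[arm] = means[arm] + noise / math.sqrt(max(counts[arm], 1))
    return sorted(available, key=lambda a: indices[a], reverse=True)[:m]
```

With equal pull counts, the shared seed shifts every index by the same amount, so the choice reduces to the top m empirical means. Under this scaling, exploration comes only from differences in pull counts.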
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Paper on Constructive Conditional Normalizing Flows Published
The paper constructs conditional normalizing flows that approximate a diffeomorphism φ and the pushforward measure φ#μ using a continuity-equation flow whose velocity field is a perceptron network with piecewise constant weights; the v3 abstract does not disclose experimental metrics.
#Reasoning#Research release
why featured
Triggers hard-exclusion-technical-accessibility: the item depends on diffeomorphisms, pushforward measures, and flow construction, with no metrics or product on-ramp. HKR-K passes narrowly, but the score is capped below 40.
editor take
Geshkovski et al. give constructive conditional flows; v3 discloses no experiments, so theorists read, engineers wait.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey
Juan Zhong and three coauthors posted arXiv v2 of a survey on Transformer-based autonomous driving models, covering perception, prediction, and planning, and reviewing five deployment-oriented compression strategies: quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention.
#Robotics#Vision#Inference-opt#Juan Zhong
why featured
HKR-K passes: the post gives a taxonomy of autonomous-driving Transformers and five compression methods. HKR-H/R are weak; this is a v2 revision of a 2023 survey, with no new model, benchmark, or deployment data.
editor take
Juan Zhong’s 4-author survey updates a 2023 paper and lists 5 compression paths; no vehicle latency table, so treat it as referenceware.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Foundation Flow-Matching Models for Inverse Problems
The paper introduces FMPlug, a plug-in framework that applies foundation flow-matching models to inverse problems using instance-guided, time-dependent warm starts and Gaussianity regularization, with evaluation on image restoration and scientific inverse problems under a few-similar-samples condition.
#Inference-opt#FMPlug#Research release
why featured
Hard-exclusion technical-accessibility fail: the post centers on flow-matching priors, Gaussianity regularization, and scientific inverse problems with no product or agent on-ramp. HKR-K passes, but the item is capped below 40.
editor take
FMPlug adds a time-dependent warm start plus Gaussianity regularization for inverse problems; it was accepted to ICML 2026, but the abstract gives no benchmark numbers.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness
OverNaN extends common synthetic oversampling methods to incomplete feature vectors, preserving, propagating, or selectively interpolating missing values through explicit strategies; the abstract does not disclose benchmark scores or dataset sizes.
#Benchmarking#OverNaN#arXiv#Research release
why featured
HKR-K passes because the article states a concrete oversampling mechanism for meaningful missingness. HKR-H/R are weak, and no benchmark numbers or production impact are disclosed, so it stays in the low-value research band.
editor take
OverNaN keeps NaNs during oversampling, but the abstract gives no scores; I buy the setup, not the generalization claim.
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H0·K1·R0
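The abstract does not spell out OverNaN's exact strategies. The sketch below shows SMOTE-style interpolation that tolerates NaNs. The three strategy names echo the summary, but their semantics here are our own assumptions, not OverNaN's definitions.

```python
import math


def synth_sample(x, neighbor, lam, strategy="propagate"):
    """SMOTE-style point x + lam * (neighbor - x) that tolerates NaNs.

    Strategy semantics are illustrative assumptions, not OverNaN's definitions:
      propagate   -> NaN if either endpoint is NaN
      preserve    -> keep x's value (including NaN) when either endpoint is NaN
      interpolate -> fall back to whichever endpoint is observed; NaN only if both are
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    out = []
    for a, b in zip(x, neighbor):
        a_nan, b_nan = math.isnan(a), math.isnan(b)
        if not (a_nan or b_nan):
            out.append(a + lam * (b - a))  # ordinary SMOTE interpolation
        elif strategy == "propagate":
            out.append(math.nan)
        elif strategy == "preserve":
            out.append(a)
        elif strategy == "interpolate":
            out.append(b if a_nan else a)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return out
```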
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
A Comparative Study of Model Selection Criteria for Symbolic Regression
The study compares AIC, AICc, BIC, MDL, and Efron’s bootstrap for symbolic regression model selection on seven synthetic datasets with Gaussian noise; MDL yields the lowest test error and shortest expressions across most datasets, while MDL and BIC show the highest probability of selecting ground-truth expressions.
#Benchmarking#arXiv#Research release#Benchmark
why featured
HKR-K passes with 5 criteria, 7 datasets, and an MDL result. HKR-H/R fail: the topic is narrow, academic, and has no product or agent impact, so it stays in the low-value research band.
editor take
MDL wins on most of 7 Gaussian-noise synthetic sets; in symbolic regression, the selector can matter as much as the search.
HKR breakdown
hook knowledge resonance
open source
44
SCORE
H0·K1·R0
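For reference, here are the Gaussian-noise forms of three of the five criteria compared. MDL and Efron's bootstrap are omitted. The sketch assumes k counts the fitted constants in an expression. Lower is better for all three, and only differences between candidate models are meaningful.

```python
import math


def selection_criteria(rss, n, k):
    """AIC, AICc and BIC for a least-squares fit with Gaussian noise.

    rss: residual sum of squares, n: number of samples, k: number of fitted
    parameters. Additive constants shared by all models are dropped.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc needs n > k + 1")
    base = n * math.log(rss / n)
    aic = base + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = base + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


simple = selection_criteria(10.0, 50, 2)    # fewer constants, slightly worse fit
complex_ = selection_criteria(9.5, 50, 6)   # more constants, slightly better fit
```

In this toy comparison, the small RSS gain does not pay for four extra constants, so every criterion prefers the simpler expression.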
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
Efficient and Adaptive Human Activity Recognition via LLM Backbones
The paper proposes using frozen LLM backbones for sensor-based human activity recognition, with a structured convolutional projection mapping accelerometer and gyroscope time series into the LLM latent space and LoRA handling parameter-efficient adaptation. The RSS abstract states gains in convergence, data efficiency, and cross-dataset transfer under low-data and few-shot settings, but does not disclose model names, benchmark names, or metric values.
#Fine-tuning#Multimodal#Inference-opt#Research release
why featured
HKR-K passes for the frozen-LLM plus conv-projection plus LoRA mechanism on accelerometer/gyroscope streams. No model, dataset, or metric is disclosed, and HAR is peripheral to the AI-product agenda.
editor take
The authors freeze an LLM for HAR but omit models and metrics; I’m not sold that sensor time series inherit gains from language pretraining.
HKR breakdown
hook knowledge resonance
open source
44
SCORE
H0·K1·R0
04:00
27d ago
arXiv · cs.LG· atomEN04:00 · 05·13
TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion
TriBand-BEV reports pedestrian BEV AP of 58.7/52.6/47.2 on KITTI at 49 FPS on one consumer GPU, using a three-height-band BEV tensor, P1-P4 bidirectional fusion, area attention, oriented boxes, and an IQR filter for noisy LiDAR points.
#Vision#Robotics#Benchmarking#Mohammad Khoshkdahan
why featured
HKR-K passes on concrete metrics and architecture details, but this is a narrow vision/robotics paper with high reader friction. No product adoption, open-source impact, or cross-source discussion is disclosed.
editor take
TriBand-BEV hits 49 FPS on KITTI with one consumer GPU; I buy the engineering, not the Complex-YOLO victory lap.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
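The three-height-band BEV tensor and the IQR point filter are easy to picture in code. The toy sketch below filters point heights with the 1.5 × IQR rule, then counts points per band and BEV cell. Band edges, cell size, and grid shape are hypothetical. TriBand-BEV's real values, feature channels, and P1-P4 fusion are not reproduced.

```python
import statistics


def iqr_filter(points, k=1.5):
    """Drop (x, y, z) points whose height lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    zs = [p[2] for p in points]
    q1, _, q3 = statistics.quantiles(zs, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [p for p in points if lo <= p[2] <= hi]


def three_band_bev(points, bands, cell=0.5, grid=(4, 4)):
    """Count points per (band, row, col); bands are half-open (z_min, z_max) ranges."""
    rows, cols = grid
    bev = [[[0] * cols for _ in range(rows)] for _ in bands]
    for x, y, z in points:
        r, c = int(x // cell), int(y // cell)
        if not (0 <= r < rows and 0 <= c < cols):
            continue  # outside the BEV grid
        for b, (z_lo, z_hi) in enumerate(bands):
            if z_lo <= z < z_hi:
                bev[b][r][c] += 1
                break
    return bev
```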
03:11
27d ago
HuggingFace Papers (takara mirror)· rssEN03:11 · 05·13
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
The paper introduces ATD-Trans, a Japanese-English travelogue translation dataset for evaluating machine translation at overall and geo-entity levels across domestic Japan and overseas regions; the post does not disclose dataset size, licensing, or the exact language models tested.
#Benchmarking#Research release#Benchmark
why featured
HKR-K passes on the new dataset and geography-based evaluation angle, but HKR-H/HKR-R are weak. The post does not disclose sample size, baselines, or reproducibility details, so it stays in the lower 40–59 band.
editor take
ATD-Trans covers Japan and overseas travelogues; size and license are undisclosed, but geo-entity errors beat BLEU as a practical MT failure mode.
HKR breakdown
hook knowledge resonance
open source
46
SCORE
H0·K1·R0
02:22
27d ago
HuggingFace Papers (takara mirror)· rssEN02:22 · 05·13
When Do LLMs Generate Realistic Social Networks? A Study of Culture, Language, Scale, and Method
The study generates 192 verified directed networks from 50 personas, testing four cultural contexts, four prompt languages, three GPT-4.1 variants, and four prompting architectures for effects on homophily, connectivity, clustering, modularity, and demographic bias.
#Benchmarking#Reasoning#GPT-4.1#Research release
why featured
HKR-H/K pass: the title asks when LLMs generate realistic social networks, and the abstract gives 192 networks with culture, language, model, and prompt comparisons. HKR-R is weak and there is no product or reusable artifact, so this stays in 60–71.
editor take
192 networks show prompt architecture changes outcomes; if LLMs stand in for humans, prompt design is an experimental treatment.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H1·K1·R0
01:04
27d ago
● P1 HuggingFace Papers (takara mirror)· rssEN01:04 · 05·13
ChipMATE: Reinforcement Learning Multi-Agent Training Enhances RTL Generation
ChipMATE trains Verilog and Python reference-model agents to cross-verify RTL without a golden testbench, builds 64.4K reference-model samples, and reaches 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models.
#Agent#Code#Reasoning#ChipMATE
why featured
HKR-H/K/R all pass: the story has a concrete mechanism, benchmark numbers, and a no-golden-testbench condition. RTL generation is niche EDA, so technical-accessibility pressure keeps it below the 78+ band.
editor take
ChipMATE is strong because it trains verification into RTL generation; 75.0% pass@1 is impressive, but still far from signoff-grade trust.
sharp
Both sources reuse the same arXiv paper title, so this is paper diffusion, not independent confirmation. The key numbers also come from the authors: ChipMATE reports 75.0% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B base models, and claims to beat DeepSeek V4 at 1600B parameters. I buy the direction more than the victory lap. For RTL, the failure mode of API agents is not just prompting; it is air-gapped deployment, missing golden testbenches, and proprietary vendor code that cannot leave the building. Pairing a Verilog agent with a Python reference-model agent, plus backtracking to stop multi-turn error propagation, maps to real verification practice. But VerilogEval V2 is still a benchmark. Timing, CDC, synthesis constraints, and PPA regression are where this claim gets expensive.
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
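VerilogEval scores such as ChipMATE's 75.0% pass@1 are typically computed with the unbiased pass@k estimator. The estimator works from n sampled completions, c of which pass. The sketch below shows it. The post does not disclose ChipMATE's sample count or decoding settings.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 it reduces to c / n, the fraction of passing samples per problem, averaged over problems.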
00:44
27d ago
HuggingFace Papers (takara mirror)· rssEN00:44 · 05·13
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
AssemblyBench introduces a synthetic dataset of 2,789 industrial objects with multimodal manuals, 3D part models, and assembly trajectories, while AssemblyDyno uses manuals and part shapes to predict assembly order and trajectories evaluated through physics-based simulation.
#Multimodal#Robotics#Benchmarking#AssemblyBench
why featured
HKR-K is strong: 2,789 industrial objects plus physics-simulation feasibility checks. HKR-R is present for robotics data scarcity, but the paper is a niche benchmark with no evidence of broad industry pickup, so it stays in 60–71.
editor take
AssemblyBench ships 2,789 synthetic industrial objects; I’d inspect the simulator before trusting AssemblyDyno near a real cell.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
