ax@ax-radar:~/papers $ grep -E 'arxiv|paper' sources/tags
45 src · signal 72% · cycle 04:32

papers · 2026-05-12

500 papers · updated 3m ago
2026-05-12 · Tue
23:48
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 23:48 · 05·12
FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection
FRAME detects image manipulation with multi-path forensic routing, adaptively selecting informative forensic paths per input image and fusing complementary evidence; the post says the code is available on GitHub, but does not disclose specific benchmark scores.
#Vision · #Reasoning · #FRAME · #Research release
why featured
HKR-K/R pass: the mechanism and open code add substance, and authenticity/safety gives it resonance. Metrics are not disclosed and HKR-H is weak, so it stays in all.
editor take
FRAME open-sources multi-path image forensics, but no scores are disclosed; I don't buy the robustness claim until cross-generator tests land.
HKR breakdown: hook · knowledge · resonance
SCORE 61 · H0·K1·R1
20:49
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 20:49 · 05·12
What Do You Think I Think? Accounting for Human Beliefs Using Second-Order Theory of Mind
The paper uses I-POMDP to build a second-order Theory of Mind agent that models a person’s mistaken beliefs about the agent’s knowledge; an in-person user study reports that the ToM-2 learner significantly improves the informativeness of teacher actions.
#Agent · #Reasoning · #Research release
why featured
HKR-H and HKR-K pass: the title has a clean hook, and the summary gives an I-POMDP ToM-2 mechanism plus a user-study claim. HKR-R is weak because no effect size, reproducible setup, or industry deployment angle is disclosed.
editor take
The paper builds a ToM-2 agent with I-POMDP; sample size is undisclosed. I like the direction, not the “significant” claim yet.
HKR breakdown: hook · knowledge · resonance
SCORE 68 · H1·K1·R0
18:56
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 18:56 · 05·12
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
CRAFT uses Clinical Alignment Score rewards to fine-tune medical diffusion models across four modalities, improving CAS and downstream classification over strong baselines and reducing the low-alignment tail versus the strongest baseline by 5.5-34.7 percentage points, a 20.4% average relative reduction.
#Multimodal · #Vision · #Fine-tuning · #CRAFT
why featured
HKR-K passes on the CAS reward method and 5.5-34.7 pp tail improvement. HKR-H and HKR-R are weak because this is a vertical medical-imaging paper, so it stays in the lower all band.
editor take
CRAFT cuts low-alignment tails by 5.5–34.7 points; I'll buy CAS once blinded physicians validate it, and not as a substitute for clinical labels.
HKR breakdown: hook · knowledge · resonance
SCORE 55 · H0·K1·R0
18:09
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 18:09 · 05·12
DocAtlas: Multilingual Document Understanding Dataset and Benchmark Across 82 Languages
DocAtlas builds OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, evaluates 16 state-of-the-art models, and reports persistent gaps in low-resource scripts; DPO with rendering-derived ground truth improves in-domain accuracy by 1.9% and out-of-domain accuracy by 1.8%.
#Vision · #Benchmarking · #Fine-tuning · #DocAtlas
why featured
HKR-H/K pass via the 82-language benchmark and measured DPO gains. This is a useful research release, not a major model/product event, so it stays in the 60–71 band.
editor take
DocAtlas spans 82 languages and 9 tasks; DPO gains 1.9%, while SFT loses up to 21% out-of-domain.
HKR breakdown: hook · knowledge · resonance
SCORE 68 · H1·K1·R0
17:58
27d ago
● P1 · arXiv · cs.AI · atom · EN · 17:58 · 05·12
Research paper introduces Fast-Slow Training framework for continual LLM adaptation
The paper introduces Fast-Slow Training, using model parameters as slow weights and optimized context as fast weights. Across reasoning tasks, FST is up to 3x more sample-efficient than RL-only training, reaches a higher asymptote, and stays closer to the base LLM with up to 70% less KL divergence.
#Reasoning · #Fine-tuning · #Memory · #Research release
why featured
HKR-H/K/R all pass: the paper has a clear hook, a concrete FST mechanism, 3x sample-efficiency, and 70% lower KL divergence. It remains an arXiv method paper without major-model deployment, so featured-low fits.
editor take
Two arXiv tracks cover the same paper, not independent validation; FST’s 3x sample efficiency is tempting, but continual learning is not solved.
sharp
Both sources point to arXiv:2605.12484 with identical framing; this is one paper listed under cs.AI and cs.LG, not independent validation. The concrete hook is Fast-Slow Training: parameters act as slow weights, optimized context acts as fast weights, with up to 3x better sample efficiency and up to 70% lower KL drift on reasoning tasks. I buy the problem framing before I buy the win. RL post-training has kept running into the same tradeoff: task gains arrive with base-model behavior drift. FST’s move—parking task-specific information in an updatable context layer—does look more controllable than parameter-only RL or LoRA-style adaptation. But the abstract does not give model size, task suite, or inference-time cost for maintaining those fast weights. If state management is expensive, the 3x training-sample story gets taxed in production.
HKR breakdown: hook · knowledge · resonance
SCORE 90 · H1·K1·R1
17:57
27d ago
● P1 · arXiv · cs.AI · atom · EN · 17:57 · 05·12
Research proposes sparse-to-dense reward principle for language model post-training beyond GRPO
The paper tests a sparse-to-dense reward allocation rule on Qwen3 and Llama math tasks: scarce labeled data trains an 8B teacher with sparse RL, then a dense bridge distills behavior into a Qwen3-1.7B student, raising MATH from 75.4% to 78.5% after later GRPO and beating a matched replay control by 2.8 points.
#Reasoning · #Fine-tuning · #Alignment · #Qwen
why featured
HKR-H/K/R all pass, but this is a single arXiv post-training paper rather than a product release. The dense-bridge recipe and MATH gain put it at the 72–77 featured threshold.
editor take
Two arXiv categories, narrow signal; still, “don’t burn verifiable labels on a cold student” hits a real waste pattern in small-model RL.
sharp
cs.LG and cs.AI list the same arXiv v1, so this is one paper surfaced twice, not independent corroboration. The hard hooks are Qwen3-1.7B, 8B/14B teachers, and MATH moving from 75.4% to 78.5% after the bridge. I buy the recipe, not the grand “principle” framing. The paper says scarce verifiable labels should first train a stronger teacher with sparse reward, then move behavior through a forward-KL warmup plus OPD, then run student-side GRPO. That is a polite way of saying direct RL on a cold small model often burns compute on sampling noise. The sharp detail is that transfer from the same teacher before RL underperforms, so the gain is teacher-side policy shaping, not distillation magic. For Qwen/Llama small-model post-training, this looks more useful than another round of GRPO hyperparameter folklore.
HKR breakdown: hook · knowledge · resonance
SCORE 94 · H1·K1·R1
17:57
27d ago
arXiv · cs.AI · atom · EN · 17:57 · 05·12
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA reaches 46.85% accuracy on OSWorld-MCP, about a 66% relative improvement over the baseline, by training computer use agents to choose between atomic GUI actions and high-level tool calls through staged SFT, single-turn RL, and online agentic RL.
#Agent · #Tools · #Fine-tuning · #ToolCUA
why featured
HKR-H/K/R all pass, but this is a single arXiv agent-orchestration paper without major-lab backing, product rollout, or cross-source pickup. Concrete benchmark and training details put it at the top of 60–71.
editor take
ToolCUA hits 46.85% on OSWorld-MCP; I buy the angle—GUI agents fail hardest when they keep clicking.
HKR breakdown: hook · knowledge · resonance
SCORE 70 · H1·K1·R1
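The ToolCUA card above reports 46.85% and "about 66%" relative improvement but never states the baseline. A minimal sketch of the implied arithmetic, assuming the relative gain means score = baseline × 1.66; the helper name `implied_baseline` is made up for illustration, and the result is only as precise as the rounded 66% figure.

```python
# Back out the baseline implied by the ToolCUA summary figures.
reported_accuracy = 46.85   # % on OSWorld-MCP, as quoted in the card
relative_gain = 0.66        # "about 66%" relative improvement (rounded in the card)


def implied_baseline(score: float, rel_gain: float) -> float:
    """Return the baseline b such that score == b * (1 + rel_gain)."""
    return score / (1.0 + rel_gain)


baseline = implied_baseline(reported_accuracy, relative_gain)
absolute_gain = reported_accuracy - baseline
print(f"implied baseline ~ {baseline:.1f}%, absolute gain ~ {absolute_gain:.1f} points")
```

Under that assumption the baseline sits near 28%, so the headline is roughly an 18 to 19 point absolute gain.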
17:56
27d ago
arXiv · cs.AI · atom · EN · 17:56 · 05·12
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT proposes three changes to online diffusion RL for joint audio-video generation: modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting, and evaluates them with an LTX-2 backbone on JavisBench and VBench for audio-video quality, alignment, and synchronization.
#Multimodal · #Audio · #Vision · #OmniNFT
why featured
HKR-K passes because the post names three mechanisms plus JavisBench, VBench, and LTX-2. HKR-H and HKR-R are weak, so this stays in all as a niche arXiv research item.
editor take
OmniNFT adds 3 modality-level patches to online diffusion RL; I buy the decomposition, but the RSS summary reports no gains, so don't crown it SOTA.
HKR breakdown: hook · knowledge · resonance
SCORE 63 · H0·K1·R0
17:55
27d ago
● P1 · arXiv · cs.CL · atom · EN · 17:55 · 05·12
MEME: Multi-Entity Evolving Memory Evaluation Benchmark
MEME evaluates six memory tasks across 100 controlled episodes, and six systems under default settings reach only 3% average accuracy on Cascade and 1% on Absence despite adequate static retrieval performance.
#Agent · #Memory · #Benchmarking · #Claude
why featured
HKR-H/K/R all pass: MEME turns agent memory into 100 controlled episodes and reports 3%/1% failure-point accuracy. It is a strong benchmark paper, not yet an industry-level release, so it stays in the 78–84 band.
editor take
MEME hits the sore spot in agent memory: Cascade 3%, Absence 1%. A lot of “memory” stacks are retrieval with a nicer costume.
sharp
MEME appears under both cs.LG and cs.CL with the same title, so the coverage is a single arXiv source, not independent confirmation. The paper tests 6 memory tasks, 6 systems, and 100 controlled episodes; the ugly numbers are Cascade at 3% average accuracy and Absence at 1%. I buy the benchmark’s pressure point. Agent memory has not been about finding an old fact for a while; it is about updating dependent state across many entities without lying to itself. Prompt optimization, deeper retrieval, less filler noise, and stronger LLMs do not close the gap. Only a file-based agent with Claude Opus 4.7 partially recovers, at about 70x baseline cost. That makes plenty of “long-term memory” product claims look like dressed-up retrieval.
HKR breakdown: hook · knowledge · resonance
SCORE 92 · H1·K1·R1
17:53
27d ago
● P1 · arXiv · cs.CL · atom · EN · 17:53 · 05·12
KV-Fold Method Enables KV-Cache Recurrence for Long-Context Inference
KV-Fold treats the KV cache as a left-fold accumulator over sequence chunks, and on Llama-3.1-8B it reports 100% exact-match retrieval across 152 needle-in-a-haystack trials from 16K to 128K tokens, with chain depths up to 511 and within a single 40GB GPU memory limit.
#Inference-opt · #Memory · #Reasoning · #KV-Fold
why featured
HKR-H/K/R all pass: the paper has a clear mechanism, hardware condition, and benchmark numbers tied to long-context cost. It remains a single arXiv release without open-source or cross-source validation, so it stays in the 78–84 band.
editor take
KV-Fold’s 128K/511-step/40GB claim is spicy, but perfect needle retrieval is not proof of real long-context reasoning.
sharp
Two arXiv categories carry the same KV-Fold paper, with fully aligned claims, so this is one research source, not independent validation. The concrete claim is strong: Llama-3.1-8B hits 100% exact-match on 152 needle-in-a-haystack trials from 16K to 128K tokens, up to 511 chain steps, on one 40GB GPU. I think this lands because it attacks long context from inference mechanics, not model scale. No training, no architecture change, just treating KV cache as a left-fold accumulator across chunks. That puts pressure on the million-token-window story vendors have been selling. The pushback is also obvious: needle retrieval is a clean benchmark. Codebase reasoning, multi-hop evidence, and contradictory facts across chunks are where this idea has to earn its keep.
HKR breakdown: hook · knowledge · resonance
SCORE 90 · H1·K1·R1
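The "left-fold accumulator over sequence chunks" phrasing in the KV-Fold card can be made concrete with a toy. This is only an analogy under loud assumptions: the dictionary stands in for a KV cache, the `key=value` token format and the names `step` and `fold_chunks` are invented here, and nothing below reflects the paper's actual tensors or attention code.

```python
from functools import reduce
from typing import Dict, List

Cache = Dict[str, str]  # toy stand-in for a KV cache (assumption, not the paper's structure)


def step(cache: Cache, chunk: List[str]) -> Cache:
    """Process one chunk given the carried cache and return the updated cache."""
    new_cache = dict(cache)  # copy so each fold step is a pure function
    for token in chunk:
        if "=" in token:  # toy "fact" tokens look like key=value
            key, value = token.split("=", 1)
            new_cache[key] = value
    return new_cache


def fold_chunks(tokens: List[str], chunk_size: int) -> Cache:
    """Left fold: cache_n = step(... step(step({}, chunk_1), chunk_2) ..., chunk_n)."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return reduce(step, chunks, {})


tokens = ["filler"] * 10 + ["needle=42"] + ["filler"] * 10
cache = fold_chunks(tokens, chunk_size=4)
print(cache["needle"])  # prints 42
```

The point of the shape is that only one chunk plus the carried state is live at a time, which is why a chain of many steps can fit in a fixed memory budget.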
17:51
27d ago
● P1 · arXiv · cs.CL · atom · EN · 17:51 · 05·12
Solve the Loop: Attractor Models for Language and Reasoning
Attractor Models refine output embeddings by solving for a fixed point and train with implicit differentiation, which keeps training memory constant regardless of effective depth; a 770M model outperforms a 1.3B Transformer trained on twice as many tokens, with up to 46.6% lower perplexity and 19.7% higher downstream accuracy.
#Reasoning · #Inference-opt · #Benchmarking · #Claude
why featured
HKR-H/K/R all pass: the paper offers a concrete fixed-point refinement mechanism and claims a 770M model beats a 1.3B Transformer with up to 46.6% lower perplexity. Single arXiv preprint status keeps it in the 78–84 band.
editor take
Two arXiv listings are category echo, not press consensus; 46.6% PPL gains are spicy, but don’t crown a new architecture from an abstract.
sharp
The 2 sources are the same arXiv paper listed under cs.CL and cs.LG, so the coverage is fully aligned through one abstract, not independent validation. Attractor Models replace fixed-depth looping with a fixed-point solve and use implicit differentiation for constant training memory. The hard claims are big: up to 46.6% lower perplexity, up to 19.7% higher downstream accuracy, and a 770M model beating a 1.3B Transformer trained on twice the tokens. I buy the engineering motivation before I buy the victory lap. The tiny-model reasoning numbers are loud: 91.4% on Sudoku-Extreme and 93.1% on Maze-Hard, while the abstract says Claude and GPT o3 fail completely. But that comparison lives or dies on task format and evaluation protocol. Recursive reasoning papers have burned people before when benchmark structure, not reasoning depth, carried the result.
HKR breakdown: hook · knowledge · resonance
SCORE 91 · H1·K1·R1
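For readers unfamiliar with the "fixed-point solve" in the Attractor Models card, here is a generic sketch of iterating a refinement map until it stops changing. The update rule `tanh(0.5 * z + x)`, the tolerance, and the function names are assumptions chosen so the map is a contraction; the paper's actual map and its implicit-differentiation backward pass are not shown.

```python
import math
from typing import List, Tuple


def refine(z: List[float], x: List[float]) -> List[float]:
    """One refinement step. The 0.5 factor makes this a contraction in z."""
    return [math.tanh(0.5 * zi + xi) for zi, xi in zip(z, x)]


def solve_fixed_point(x: List[float], tol: float = 1e-10,
                      max_iter: int = 1000) -> Tuple[List[float], int]:
    """Iterate z <- refine(z, x) until successive iterates differ by less than tol."""
    z = [0.0] * len(x)
    for n in range(1, max_iter + 1):
        z_next = refine(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, n
    return z, max_iter


x = [0.3, -1.2, 0.8]
z_star, steps = solve_fixed_point(x)
print(f"converged in {steps} steps")
```

The number of iterations is decided by the tolerance rather than a fixed layer count, which is the contrast with fixed-depth looping that the card describes.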
17:50
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 17:50 · 05·12
ScaleSearch: Block Floating Point Scale Factor Search with Mantissa-Bit Granularity
ScaleSearch searches BFP scale factors with mantissa-bit granularity, reducing NVFP4 quantization error by 27% and improving Qwen3-8B post-training quantization by up to 15 points on MATH500.
#Inference-opt · #Fine-tuning · #Benchmarking · #Qwen
why featured
HKR-H/K/R pass, but this is a specialized quantization paper brief. The post gives the mechanism and two results, not code, full reproducibility details, or deployment cost, so technical accessibility keeps it in all.
editor take
ScaleSearch cuts NVFP4 error 27%; I buy it—BFP scaling should stop worshipping block max.
HKR breakdown: hook · knowledge · resonance
SCORE 66 · H1·K1·R1
17:48
27d ago
arXiv · cs.AI · atom · EN · 17:48 · 05·12
Researchers Release Open-Source DR-Gym Environment for Electric Utility Demand Response
The paper introduces open-source DR-Gym to train and evaluate utility-side demand response, using an online Gymnasium-compatible environment with a regime-switching wholesale price model calibrated to extreme events, physics-based building demand profiles, and a configurable multi-objective reward function.
#Agent · #Robotics · #Benchmarking · #DR-Gym
why featured
HKR-K passes because the paper names an open DR-Gym environment, an extreme-event-calibrated price model, and multi-objective rewards. HKR-H/R are weak: utility demand response is niche for AI practitioners, so this stays below featured.
editor take
DR-Gym opens a utility-side demand-response Gymnasium env; useful benchmark gap, but its “realistic” claim needs runs beyond the abstract.
HKR breakdown: hook · knowledge · resonance
SCORE 68 · H0·K1·R0
17:43
27d ago
arXiv · cs.AI · atom · EN · 17:43 · 05·12
Real-world 6G AI-native mobility dataset with handover and beam management measurements released
The paper presents a UE mobility dataset collected from a commercially deployed network, covering five mobility modes: pedestrian, bike, car, bus, and train, with handover, beam management, and timing advance measurements.
#Inference-opt · #Research release
why featured
Hard-exclusion technical-accessibility fail: handover (HO), beam management, and timing advance (TA) are wireless-specialist topics, and the post gives dataset scope without an AI-product or agent angle. HKR-K passes, but the cap applies.
editor take
This 6G dataset spans 5 mobility modes, but sample size is undisclosed; what AI-native mobility lacks is messy real-network data, not models.
HKR breakdown: hook · knowledge · resonance
SCORE 48 · H0·K1·R0
15:59
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 15:59 · 05·12
Overview of the MedHopQA Track at BioCreative IX: Multi-Hop Medical QA Evaluation
BioCreative IX MedHopQA evaluated 48 submissions from 13 teams on 1,000 two-hop medical QA pairs across diseases, genes, and chemicals. The top system scored 89.30% MedCPT F1 and 87.30% exact match, while the zero-shot baseline scored 67.40% and 60.20%.
#RAG · #Reasoning · #Benchmarking · #BioCreative
why featured
HKR-K passes with concrete benchmark scale and F1 results. HKR-H and HKR-R are weak: this is a niche academic track recap with limited product or competitive impact for general AI practitioners.
editor take
MedHopQA shows a 22-point F1 gap on 1,000 cases; biomedical multi-hop QA still lives or dies on retrieval.
HKR breakdown: hook · knowledge · resonance
SCORE 55 · H0·K1·R0
15:28
27d ago
HuggingFace Papers (takara mirror) · rss · EN · 15:28 · 05·12
Reconnecting Fragmented Citation Networks with Semantic Augmentation
The authors build a hybrid citation-graph framework on 662,369 Web of Science papers, adding LLM-based text-similarity edges from small disconnected components and reweighting existing citations by textual similarity.
#Embedding · #Benchmarking · #Web of Science · #Research release
why featured
HKR-K passes via the 662,369-paper dataset and semantic-edge/citation-reweighting mechanism. HKR-H/R are weak: this is a niche citation-network method with limited product or practitioner impact, so it stays in the upper low-value band.
editor take
The authors augment 662,369 papers with semantic edges; I buy the direction, but boundary-preservation metrics are undisclosed.
HKR breakdown: hook · knowledge · resonance
SCORE 56 · H0·K1·R0
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Research Paper Proposes MXFP4 Quantization Method for Large Language Model Pretraining
The paper tests MXFP4 quantization during Llama 3.1-8B pretraining on C4 and finds Wgrad quantization drives convergence degradation; deterministic Hadamard rotations restore stable optimization, while stochastic rounding and randomized Hadamard rotations fail under native MXFP4 support on AMD Instinct MI355X GPUs.
#Inference-opt · #Benchmarking · #Llama · #AMD
why featured
HKR-H/K/R all pass: MXFP4 pretraining is not a routine quantization note, and the post names Wgrad plus Hadamard rotation. Scope is limited to Llama 3.1-8B/C4, so it stays below same-day must-write.
editor take
FP4 pretraining just got a sharper failure mode: Wgrad, not generic quantization pain. If MI355X results hold, one excuse disappears.
sharp
Two arXiv entries point to the same v2 paper, so the coverage is aligned but single-source, not independent confirmation. The setup is concrete: Llama 3.1-8B on C4, native MXFP4 on AMD Instinct MI355X, with FP4 enabled stepwise across Fprop, Dgrad, and Wgrad. I like this paper because it narrows FP4 pretraining failure to a specific path. Fprop and Dgrad add only modest token overhead; Wgrad quantization drives convergence degradation. The mechanism is also testable: stochastic rounding and randomized Hadamard rotations fail, while deterministic Hadamard rotations restore stable optimization. That is a much cleaner story than “4-bit training is unstable.” The caveat is scale: the abstract discloses 8B on C4, not a 70B-class run or multi-dataset sweep.
HKR breakdown: hook · knowledge · resonance
SCORE 92 · H1·K1·R1
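The MXFP4 card leans on "deterministic Hadamard rotations" without saying what a rotation buys. A minimal sketch of the general property, assuming a fixed orthonormal Sylvester Hadamard matrix applied to both operands: the dot product is unchanged while a single outlier gets spread across coordinates, which is friendlier to low-bit block formats. The vectors and helper names are illustrative, and how the paper applies the rotation inside Wgrad is not covered here.

```python
import math
from typing import List

Matrix = List[List[float]]


def hadamard(n: int) -> Matrix:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


H = hadamard(4)
x = [8.0, 0.1, -0.2, 0.1]   # one large outlier (made-up values)
w = [0.5, -1.0, 0.25, 2.0]  # made-up second operand
x_rot, w_rot = matvec(H, x), matvec(H, w)

print("dot before:", dot(x, w), "dot after:", dot(x_rot, w_rot))
print("max |x| before:", max(abs(v) for v in x), "after:", max(abs(v) for v in x_rot))
```

Because the matrix is fixed, the same rotation is applied at every step; the randomized variant the card says fails would change signs between draws.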
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Research paper Metis proposes self-evolving metacognitive policy for LLM jailbreaking
Metis frames jailbreaking as inference-time POMDP policy optimization and reaches 89.2% average ASR across 10 models, while reducing token costs by 8.2x on average and up to 11.4x under the evaluated settings.
#Safety · #Reasoning · #Alignment · #Metis
why featured
HKR-H/K/R all pass: automated jailbreak learning is clickable, the paper gives 89.2% ASR and 8.2x token-cost reduction, and it hits model-safety nerves. It is still a single arXiv paper, so it stays in the 78–84 band.
editor take
Metis turns jailbreaks into inference-time policy optimization, and 89.2% ASR is ugly. Refusal templates keep losing to closed-loop probing.
sharp
Both entries point to the same arXiv paper, so the coverage is aligned by duplication, not independent confirmation: Metis reports 89.2% average ASR across 10 models, with 76.0% on O1 and 78.0% on GPT-5-chat. My read: jailbreak work is moving from prompt folklore to trained attack policy. Metis frames the target as a POMDP, diagnoses the defense during inference, then updates its policy using structured feedback. That is a nastier failure mode than a static suffix or prompt library. The claimed 8.2x average token-cost reduction also says this is directed search, not brute-force sampling. I would still discount the headline ASR until the benchmark setup, judge criteria, and refusal taxonomy are inspected; the supplied body only exposes the abstract.
HKR breakdown: hook · knowledge · resonance
SCORE 92 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace-Bench builds 5 worker profiles, 74 file types, and 20,476 files for 388 workspace tasks, evaluating agents on cross-file retrieval, contextual reasoning, and adaptive decisions; the best agent reaches about 60%, below the human score of 80.7%, while the agent average is 45.1%.
#Agent · #Reasoning · #Benchmarking · #Workspace-Bench
why featured
HKR-H/K/R all pass: the paper has a concrete agent-versus-human gap and a detailed workspace-task setup. It is a useful benchmark release, not a major lab launch, so it stays in the 78–84 band.
editor take
Workspace-Bench drops agents into 20GB workspaces and the best hits only ~60%; that stings more than another web-task leaderboard win.
sharp
Both listed sources use the same arXiv title, so this is a single paper chain, not independent press convergence. The hard payload is clear: 5 worker profiles, 74 file types, 20,476 files, up to 20GB, 388 tasks, and 7,399 rubrics. I like this benchmark because it moves agent evals away from tidy browser chores and into dirty workspace maintenance. The best agent reaches only about 60%, humans hit 80.7%, and the agent average is 45.1%. That gap smells less like a missing reasoning trick and more like failures across retrieval, implicit file dependencies, and state updates. Workspace-Bench-Lite cutting eval cost by ~70% helps adoption, but a 100-task subset will get overfit fast by serious agent harness teams.
HKR breakdown: hook · knowledge · resonance
SCORE 91 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Study Finds Reasoning Models' Refusal Mechanisms Tied to Chain-of-Thought Traces
The paper examines refusal mechanisms in four open-source reasoning models and finds that fixing a specific chain-of-thought trace substantially reduces variance in refusal versus compliance outcomes. In distilled models, the opening CoT sentence can determine refusal decisions, while ablating linear refusal directions increases harmful compliance with non-negligible capability degradation.
#Reasoning · #Safety · #Interpretability · #Research release
why featured
All three HKR axes pass: the title has a refusal-location hook, the summary gives 4-model and CoT/linear-direction mechanisms, and safety practitioners care that refusals can be ablated. Technical but audience-relevant, so it lands in the 78–84 band.
editor take
Both sources trace to the same arXiv paper, but the signal is sharp: refusal behavior lives inside early CoT, and distillation copies that fragility.
sharp
Two entries point to the same arXiv v4 paper, so the coverage is a single-source chain, not independent confirmation. The paper tests four open-source reasoning models and lands on an uncomfortable result: fixing one CoT trace substantially reduces variance in refusal versus compliance, and in distilled models the first CoT sentence can fully determine refusal. That makes safety behavior look less like a stable policy head and more like a brittle trajectory feature. The linear-refusal-direction result adds the punchline: ablation increases harmful compliance, but less cleanly than in non-reasoning chat models and with real capability damage. For teams treating hidden CoT as a safety buffer, this is a warning shot.
HKR breakdown: hook · knowledge · resonance
SCORE 89 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Language Model Uses Internal States for Reinforcement Learning Value Estimation
The paper introduces POISE, which estimates RLVR baselines from a policy model’s hidden states and token-entropy statistics. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B math benchmarks, POISE matches DAPO while using less compute than multi-rollout or LLM-scale critic methods.
#Reasoning · #Fine-tuning · #Benchmarking · #Qwen
why featured
HKR-H/K/R all pass: the title has a sharp hook, and the post gives POISE’s mechanism plus Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B tests. It stays at 79 because this is a single arXiv paper with no code, adoption, or cross-source debate disclosed.
editor take
POISE puts the critic back inside the actor’s hidden states; smart idea, but Qwen3-4B and R1-Distill-1.5B are not frontier-scale proof.
sharp
Both listed sources point to the same arXiv paper, 2605.07579, so this is aligned coverage without independent validation. The concrete move is POISE: train a lightweight probe on the actor’s hidden states, trajectory features, and token-entropy stats, then estimate prompt value from a single rollout instead of paying for a PPO-scale critic or GRPO-style multiple rollouts. I buy the direction, but not the implied victory lap on cheap critics. The evidence is Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B matching DAPO on math RLVR benchmarks. That is useful, not decisive. If the probe stays close to a separate value model at 30B+ or MoE scale, the RL training bill changes; until then this is a promising variance-reduction trick, not a solved recipe.
HKR breakdown: hook · knowledge · resonance
SCORE 89 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
HyperEyes Dual-Grained Reinforcement Learning Improves Multimodal Search Agent Efficiency
HyperEyes-30B surpasses the strongest comparable open-source agent across six benchmarks by 9.9% accuracy and uses 5.3x fewer tool-call rounds on average, after training with a two-stage pipeline, TRACE trajectory-level cost rewards, and token-level corrective signals from On-Policy Distillation.
#Agent · #Multimodal · #Reasoning · #HyperEyes
why featured
HKR-H/K/R all pass: the paper gives concrete benchmark and tool-call numbers tied to agent efficiency. It stays below 78 because this is a single arXiv item with no disclosed code, replication detail, or major-lab signal.
editor take
HyperEyes’ 5.3x fewer tool-call rounds matters more than its 9.9% accuracy gain; parallel retrieval is the agent bottleneck finally getting priced.
sharp
Both entries are the same arXiv paper, so the coverage is a duplicated source chain, not independent validation. HyperEyes-30B claims 9.9% higher accuracy across six benchmarks and 5.3x fewer tool-call rounds on average; that targets the right pain point for multimodal agents: serial per-entity lookup turns retrieval into the latency and cost sink. I buy the problem framing, but not the margin yet. IMEB has only 300 human-curated cases, and TRACE explicitly rewards fewer tool calls, so the training objective can fit the evaluator’s taste. Compared with WebVoyager-style and visual RAG agents, the useful move here is making search width a reinforcement-learning target, not another prompt trick. The code and data are linked; the claim earns attention after reproducible runs.
HKR breakdown: hook · knowledge · resonance
SCORE 86 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Research identifies non-monotonic latency issues in Apple MPS decoding with KV cache interactions
The paper measures up to 21x latency spikes in Apple MPS autoregressive decoding on GPT-2, BLOOM, and OPT, while CPU and NVIDIA T4 CUDA runs show smooth monotonic scaling under identical conditions.
#Inference-opt · #Benchmarking · #Apple · #NVIDIA
why featured
HKR-H/K/R pass: the 21x MPS spike is surprising, measured across GPT-2/BLOOM/OPT, and relevant to Mac inference users. It remains a niche ML-systems paper, so it lands at 76, not the 78+ band.
editor take
Apple MPS shows up to 21x decode latency spikes; that is not a tuning footnote. A lot of Mac-local LLM demos are underpricing tail latency.
sharp
Two listed sources are the same arXiv paper repeated, so the coverage is fully aligned but single-chain. The paper reports Apple MPS decode latency spikes up to 21x on GPT-2, BLOOM, and OPT, while CPU and NVIDIA CUDA do not reproduce the behavior under identical conditions. My read: stop quoting average tok/s for Mac-local inference as if it describes runtime quality. The anomaly is pinned mainly to decode, and KV cache still helps overall, but its speedup collapses inside the bad regimes. That hits the exact blind spot in long-context local apps on MLX, Metal-backed stacks, and llama.cpp-style deployments: users feel adjacent generation budgets suddenly stalling, not the clean mean latency in a benchmark table.
HKR breakdown: hook · knowledge · resonance
SCORE 86 · H1·K1·R1
04:00
28d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·12
Research Identifies Gap Between Generative AI Benchmark Scores and Real-World Utility
The paper analyzes 28 deployment cases across education, healthcare, software engineering, and law, and identifies a gap between benchmark scores and real-world utility. It proposes SCU-GenEval, a four-stage evaluation framework, plus three instruments: deployment protocols, context-conditioned user simulators, and persona- and goal-conditioned proxy metrics.
#Benchmarking · #Research release · #Benchmark · #Commentary
why featured
HKR-H/K/R all pass: the title has a clear contradiction, and the paper gives 28 cases plus a four-stage framework. It stays in the featured-threshold band because it is a single arXiv paper with no broad coverage shown.
editor take
Across 28 deployments, the paper says benchmark gains are not user gains. Evaluation teams have been measuring artifacts, not utility.
sharp
Both listed sources point to the same arXiv record, so the coverage is duplicated, not convergent reporting. The paper uses 28 deployment cases across education, healthcare, software engineering, and law to argue that output benchmarks miss deployed utility. I buy the critique, less the grand framing. SCU-GenEval’s four stages—stakeholder-goal mapping, construct indicators, mechanism modeling, and longitudinal utility measurement—hit the blind spot in MMLU-style and SWE-bench-style leaderboards: they rank systems, but they do not prove users or teams get better over time. The hard part is cost. Once evaluation becomes longitudinal deployment research, it stops being a scriptable leaderboard, and vendors lose the clean marketing number they want.
HKR breakdown: hook · knowledge · resonance
SCORE 85 · H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
CoDistill-GRPO trains large and small models together; on Minerva, Qwen2.5-Math-1.5B gains 6.0 percentage points over GRPO, while Qwen2.5-Math-7B nearly matches standard GRPO using small-model rollouts and reports about an 18% training speedup.
#Reasoning · #Fine-tuning · #Inference-opt · #Qwen
why featured
HKR-K/R pass: the paper gives concrete benchmark gains and training-speed numbers tied to GRPO cost. HKR-H is weak because this is still a dry arXiv method paper, so it stays below featured.
editor take
CoDistill-GRPO adds 6.0 points over GRPO on Minerva for Qwen2.5-Math-1.5B; the roughly 18% training speedup for 7B from small-model rollouts is the sharper claim.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H0·K1·R1
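For readers new to the GRPO baseline this card compares against, here is a minimal Python sketch of the group-relative advantage step as GRPO is commonly described: rewards for a group of rollouts from one prompt are standardized against that group's own mean and standard deviation. The function name, the epsilon, the use of population standard deviation, and the example rewards are illustrative assumptions. It does not reproduce CoDistill-GRPO's co-distillation recipe.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts of one prompt (1.0 = correct answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage, with no learned value model involved.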
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
The Invisible Handshake: Persistent Overpricing by Adaptive Market Agents
arXiv:2510.15995v3 studies a repeated game with two agents, a market maker controlling liquidity and a market taker choosing trade quantities, and gives a sufficient condition under which decentralized learning reaches a persistent overpricing region in finite time, including the case of projected stochastic gradient ascent.
#Agent#Reasoning#arXiv#Research release
why featured
HKR-H/K/R all pass, but this is an arXiv theory paper with only a mechanism summary; no experiment scale, dataset, or real-market validation is disclosed, so it stays at the top of 60–71.
editor take
A two-agent repeated game gives a finite-time overpricing condition that covers projected stochastic gradient ascent (PSGA); collusion risk looks sharper when framed as gradient dynamics.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Elastic MoE trains MoE experts to collaborate across diverse combinations and improves router selection, expanding the effective inference-time k range to 2–3× the training-time k across four 7B–21B MoE architectures and nine benchmarks.
#Inference-opt#Benchmarking#Research release#Benchmark
why featured
HKR-H/K/R all pass, but this is a single arXiv research release whose impact depends on reproduction and framework uptake. Concrete architectures and benchmarks keep it near, but below, the featured threshold.
editor take
Elastic MoE stretches inference k to 2–3× training k; I buy the target—MoE serving needs budget elasticity per model.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
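To make "inference-time k" in the Elastic MoE card concrete, the sketch below shows plain top-k expert routing with k as a runtime argument. It is a generic illustration, not the paper's training method or router: the function name and the logits are hypothetical, and applying softmax after the top-k cut is one common convention among several.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, shifted for numerical stability.
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

logits = [2.0, 0.5, 1.0, -1.0]          # hypothetical router scores for 4 experts
train_time = route_top_k(logits, k=2)   # selects experts 0 and 2
wider = route_top_k(logits, k=4)        # same router, larger inference-time k
```

Raising k at serving time brings in experts that rarely fired together during training. That is the mismatch the card says the paper targets.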
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Seed Hijacking of LLM Sampling and Quantum Random Number Defense
SeedHijack manipulates PRNG outputs for LLM sampling and achieves 99.6% exact token injection in 540 GPT-2 124M trials; the QRNG defense neutralizes the evaluated threat model with +0.6% median latency and +7.7 MB memory.
#Safety#Inference-opt#Alignment#GPT-2
why featured
HKR-H/K/R all pass, but the evidence is limited to GPT-2 124M and a specific threat model. This is a useful safety paper, not yet a featured production-impact story.
editor take
SeedHijack hit 99.6% injection in 540 GPT-2 124M trials; if suppliers touch sampling seeds, alignment is bypassed.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
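The reason seed control matters for sampling can be shown with Python's standard library alone. The sketch below is not the SeedHijack attack or the paper's QRNG defense: the vocabulary, probabilities, and seed are made up, and `random.SystemRandom` (OS entropy) is only a stand-in for an external entropy source.

```python
import random

VOCAB = ["the", "a", "cat", "dog"]
PROBS = [0.4, 0.3, 0.2, 0.1]   # hypothetical next-token distribution

def sample_token(rng):
    """Draw one token from PROBS using the supplied random source."""
    return rng.choices(VOCAB, weights=PROBS, k=1)[0]

# Same seed, same draw: whoever controls the seed controls the "random" pick.
first = sample_token(random.Random(1234))
replay = sample_token(random.Random(1234))

# OS entropy as a stand-in for an external entropy source; it cannot be seeded.
unseeded = sample_token(random.SystemRandom())
```

The two seeded draws always agree. The unseeded draw cannot be replayed from a known seed, which is the property a hardware entropy source is meant to provide.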
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Towards Effective Theory of LLMs: A Representation Learning Approach
The paper proposes Representational Effective Theory, which learns macrostates from LLM hidden-state trajectories using a BYOL/JEPA-style self-supervised objective. The abstract reports temporally consistent states, reasoning-state trajectories, high-level semantic structure, early prediction of sycophancy, and causal handles for steering generations toward interpretable computational phases.
#Interpretability#Reasoning#Alignment#Research release
why featured
HKR-H/K/R all pass: the hook is novel, the mechanism is concrete, and sycophancy touches alignment practice. The single arXiv summary lacks metrics, authorship signal, and reproducibility details, so it stays in the 60–71 band.
editor take
RET learns hidden-state macrostates via BYOL/JEPA; abstract only, with no models, baselines, or effect sizes for sycophancy steering.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
MOOSE-Star: Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
MOOSE-Star reduces scientific hypothesis-generation training from O(N^k) complexity to O(log N) in the best case, using decomposed subtasks, motivation-guided hierarchical search, and bounded composition, and the authors release TOMATO-Star with 108,717 decomposed papers built using 38,400 GPU hours.
#Reasoning#RAG#Inference-opt#MOOSE-Star
why featured
HKR-H/K pass on the O(N^k)→O(log N) claim and 108,717-paper dataset. HKR-R is weak, and this is a single arXiv paper with no production deployment or named lab validation, so it stays at the top of all.
editor take
MOOSE-Star claims best-case O(log N) P(h|b) training; I’d audit the 108,717 TOMATO-Star decompositions before buying the curve.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Skill-R1: Agent Skill Evolution via Reinforcement Learning
Skill-R1 trains a lightweight skill generator with verifiable rewards, keeps the task LLM frozen, and iteratively revises natural-language skills across multiple generations using a bi-level group-relative policy optimization objective that compares intra-generation rollouts and inter-generation revision gains.
#Agent#Reasoning#Tools#Research release
why featured
HKR-H/K/R are present, but the body gives no authors, benchmark numbers, code, or production replacement result. This is an interesting agent-RL paper, not yet a featured-level release.
editor take
Skill-R1 freezes the task LLM and trains a skill generator; no benchmark numbers disclosed, so I buy black-box adaptation, not the “skill evolution” gloss.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
CLR-voyance models inpatient reasoning as a POMDP and post-trains Qwen3-8B and MedGemma-4B with GRPO, and its 8B model scores 84.91% on CLR-POMDP versus GPT-5 at 77.83% and MedGemma-27B at 66.66%.
#Reasoning#Fine-tuning#Alignment#Qwen
why featured
HKR-H/K/R all pass, but this is a narrow clinical decision-support paper centered on a benchmark and post-training result, not a general AI product or model release; it sits at the high end of the 60–71 band.
editor take
CLR-voyance-8B scores 84.91% on CLR-POMDP; I buy the POMDP framing, not the hospital-win framing yet.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Relational In-Context Learning via Synthetic Pre-training with Structural Prior
RDB-PFN trains on over 2 million synthetic single-table and relational tasks from a Relational Prior Generator, then adapts to new databases through in-context learning and outperforms graph-based and single-table baselines on 19 real-world relational prediction tasks under the same DFS-linearized input setting.
#Reasoning#Fine-tuning#Benchmarking#RDB-PFN
why featured
HKR-K/R pass: the paper gives concrete scale and 19 relational prediction evaluations, with clear relevance to structured-data teams. HKR-H is weak, and this is a single arXiv method paper without cross-source traction or product impact.
editor take
RDB-PFN trains on 2M+ synthetic tasks; for relational FMs, priors beat pretending private databases are scrapable.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning
The paper proposes Theorem-SFT to train explicit theorem application, reporting +8.8% on MATH with LLaMA3.2-3B-Instruct and +20.27% on GeoQA with Qwen2.5-VL-7B-Instruct, while MLP-only fine-tuning matches full-layer performance and points to feed-forward layers as the main locus for reasoning rules.
#Reasoning#Fine-tuning#Vision#LLaMA
why featured
HKR-H/K/R all pass, but this is a single arXiv method paper with impact limited to math reasoning and SFT. Concrete gains lift it above filler, not into same-day coverage.
editor take
Theorem-SFT reports +8.8% on MATH and +20.27% on GeoQA; I buy theorem-use supervision, but MLP-only needs replication.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
BoostLLM: Boosting-Inspired LLM Fine-Tuning for Few-Shot Tabular Classification
BoostLLM turns PEFT fine-tuning into multi-round residual optimization with sequential adapters as weak learners; across multiple tabular datasets, its 4B model outperforms GPT-4o-based methods and matches or surpasses XGBoost over a wide range of shot counts.
#Fine-tuning#Reasoning#BoostLLM#XGBoost
why featured
HKR-H/K/R all pass, but this is a single arXiv paper on a narrow tabular fine-tuning setup; datasets, code, and reproducibility details are not disclosed in the feed, so it stays in all.
editor take
BoostLLM trains sequential PEFT adapters as residual learners; a 4B tabular model beating GPT-4o methods makes tree paths as teachers look sane.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code
The paper proposes MetaCompress to test behavioral fidelity in distilled code language models, evaluating two tasks and three distillation methods—Compressor, AVATAR, and MORPH—and finding up to 62% behavioral discrepancies plus up to 285% larger performance drops under adversarial attacks.
#Code#Fine-tuning#Benchmarking#MetaCompress
why featured
HKR-H/K/R all pass, but this is a niche arXiv evaluation paper for code-model distillation and testing. The concrete 62% and 285% numbers keep it above generic research, below featured threshold.
editor take
MetaCompress tests 2 code tasks and 3 distillation methods; 62% behavior drift says accuracy-only compression eval is too thin.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Diffusion Models are Evolutionary Algorithms
arXiv:2410.02543v3 presents a mathematical equivalence between diffusion models and evolutionary algorithms. The abstract says the method covers selection, mutation, and reproductive isolation, and outperforms mainstream evolutionary algorithms, but the post does not disclose benchmark numbers.
#Reasoning#Inference-opt#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: the title has a strong counterintuitive hook, and the body claims a mechanism mapping from diffusion to evolutionary components. No metrics or deployment impact keeps it in the upper 60–71 band.
editor take
arXiv:2410.02543v3 claims diffusion equals evolution; no benchmark numbers are disclosed, so I file it under elegant analogy.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
The paper introduces EntCollabBench, a benchmark with 11 role-specialized agents across six departments, using Workflow and Approval subsets to evaluate enterprise collaboration under access control, stateful systems, and policy-based approvals.
#Agent#Benchmarking#Tools#EntCollabBench
why featured
HKR-H/K/R pass, but the body gives only the benchmark shape; model rankings, task count, and enterprise validation are not disclosed. Useful agent-eval signal, below the featured bar.
editor take
EntCollabBench uses 11 roles across 6 departments; database-state checks beat yet another LLM-judge agent benchmark.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
DiffATS: Diffusion in Aligned Tensor Space
DiffATS trains diffusion models on aligned tensor primitives for images, videos, and PDE solutions, compressing original data by 3.9× to 210× without pretrained compression autoencoders.
#Multimodal#Research release
why featured
HKR-K is strong, and HKR-H comes from 210× compression without an autoencoder. As a technical arXiv method with no open-source artifact, product path, or major-lab signal, HKR-R is weak, so it stays high-all.
editor take
DiffATS compresses fields 3.9×–210× via OP-aligned Tucker factors; clean math, but I want code and FID tables.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
AQUA-Bench evaluates unanswerability in audio question answering across 3 scenarios: missing correct options, categorically incompatible answer choices, and audio-question mismatches where the question lacks grounding in the audio.
#Audio#Benchmarking#AQUA-Bench#Research release
why featured
HKR-H/K/R all pass, but the body gives only title-level facts and no dataset size, model results, or release details. Audio QA benchmarking is relevant but niche, so it stays in all at 70.
editor take
AQUA-Bench tests 3 unanswerable audio-QA cases. No size or leaderboard is disclosed; in production, knowing when to refuse matters more than raw QA accuracy.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
The paper compares 18 LLM code-selection configurations and finds the best execution-based selector beats output-pattern majority voting by 19–52 percentage points, while SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable once candidates run on diverse inputs.
#Code#Inference-opt#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper gives experiment scale, lift, and a statistical result useful for code-selection design. HKR-H is weak, and this is a single arXiv paper, so it stays in all.
editor take
Across 18 configs, execution selectors gain 19–52 points; SemanticVote fails to beat MBR-Exec, so stop fetishizing aggregation rules.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
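The contrast in the Semantic Voting card, execution-based selection versus voting on output patterns, can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the paper's SemanticVote or MBR-Exec: the function name, the candidates, and the test inputs are all hypothetical.

```python
from collections import Counter

def select_by_execution(candidates, test_inputs):
    """Group candidates by their outputs on test_inputs; return one from the largest group."""
    signatures = []
    for fn in candidates:
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(fn(x)))
            except Exception as exc:  # a crash is part of the behavior signature
                outputs.append("error:" + type(exc).__name__)
        signatures.append(tuple(outputs))
    winner, _ = Counter(signatures).most_common(1)[0]
    return candidates[signatures.index(winner)]

# Hypothetical candidates for "absolute value": two agree in behavior, one is wrong.
candidates = [
    lambda x: x if x >= 0 else -x,
    lambda x: abs(x),
    lambda x: x,
]
chosen = select_by_execution(candidates, test_inputs=[-2, 0, 3])
```

The first two candidates differ as text, so a vote over source patterns would split them. Under execution they share one behavior signature and outvote the wrong candidate. The inputs need to be diverse: with only `[0, 3]` all three candidates would look identical.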
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Test-Time Speculation
The paper proposes Test-Time Speculation, an online distillation method that adapts the draft model during verification, and reports up to 72% higher acceptance length and 41% average gains over state-of-the-art speculators across Qwen-3, Qwen-3.5, and Llama3.1 model families.
#Inference-opt#Qwen#Llama#Research release
why featured
HKR-H/K/R all pass, but this is a single arXiv inference-optimization paper without code, independent replication, or deployment proof. I keep it in the 60–71 band at 70.
editor take
TTS distills the draft during verification and lifts acceptance length 41% on average; offline-trained speculators finally get punished on long outputs.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
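"Acceptance length" in the Test-Time Speculation card is the quantity speculative decoding tries to maximize. The Python sketch below shows the greedy-verification version: count how many leading draft tokens match what the target model would have produced. The function name and token ids are hypothetical. Sampling-based verification uses a probabilistic accept rule instead, and the paper's online distillation of the draft model is not shown.

```python
def accepted_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that the target model also chose."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # everything after the first mismatch is discarded
        n += 1
    return n

# Hypothetical token ids: the draft diverges from the target at position 3.
draft = [11, 42, 7, 99, 5]
target = [11, 42, 7, 13, 5]
n_accepted = accepted_length(draft, target)
```

Here three draft tokens are accepted. The later match at position 4 does not count, because verification stops at the first disagreement.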
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
What's the Plan? Metrics for Implicit Planning in LLMs and Their Application to Rhyme Generation and Question Answering
The paper proposes simpler metrics for implicit planning in LLMs, using rhyme generation and question answering cases where steering vectors at the prior line ending alter intermediate tokens before the target rhyme or answer, and reports the mechanism appears in models starting at 1B parameters.
#Reasoning#Interpretability#Safety#Claude
why featured
HKR-H/K/R all pass, but this is a single arXiv methods paper. The provided facts cover metrics, rhyme/QA tasks, and a 1B emergence claim, not broad validation or community traction, so it sits at the top of 60–71.
editor take
The paper finds implicit planning in models from 1B parameters up; the rhyme/QA tasks are narrow, but vector steering gives interpretability a runnable probe.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation
IntroLM enables causal language models to predict output quality during prefilling with introspective tokens; on QA benchmarks, Qwen3 8B reaches 90% ROC AUC for success prediction and beats a DeBERTa classifier by 14%.
#Reasoning#Inference-opt#Fine-tuning#Qwen
why featured
HKR-H and HKR-K pass: the mechanism is specific and the metric is concrete. As a single arXiv research item with no code, deployment cost, or production validation disclosed, it fits the upper “all” band.
editor take
IntroLM reports 90% ROC AUC on Qwen3 8B; if prefill self-eval holds, routers can drop one evaluator.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Layer Collapse in Diffusion Language Models
The paper identifies layer collapse in LLaDA-8B: a few early layers are dominated by one large super-outlier over long token ranges, and pruning it degrades outputs into repetitive random token loops; under 3-bit GPTQ, LLaDA drops 1.8% on GSM8K while Llama-3.1-8B drops 64.7%.
#Inference-opt#Interpretability#Benchmarking#LLaDA
why featured
HKR-H and HKR-K pass: the paper gives a concrete failure mode and a 3-bit GPTQ result. The topic stays niche model diagnostics, so HKR-R fails and the item lands in all, with no hard exclusion.
editor take
LLaDA-8B leans on one early-layer super-outlier; 3-bit GPTQ drops just 1.8%, so Llama compression heuristics break here.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning
The paper proposes the Plasticity-Ceiling Framework to compare SFT and RL use of expert trajectories for mathematical reasoning post-training. Its benchmarks identify sequential SFT-then-RL as superior to synchronized approaches, and give three scaling rules: switch at stable or mild-overfitting SFT, treat data scale as the main driver, and use minimum validation loss for trajectory selection.
#Reasoning#Fine-tuning#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper offers a post-training framework, an SFT/RL ordering claim, and scaling rules for math reasoning. HKR-H is weak, and the summary lacks model names, benchmark numbers, and reproduction details.
editor take
This gives three post-training rules: SFT then RL, scale data first, pick trajectories by min val loss; RSS omits model sizes and benchmark tables.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
SpatiaLab evaluates VLM spatial reasoning with 1,400 real-world visual QA pairs across 6 categories and 30 task types; InternVL3.5-72B reaches 54.93% multiple-choice accuracy versus 87.57% for humans, while GPT-5-mini leads open-ended tests at 40.93% versus 64.93% for humans.
#Vision#Multimodal#Reasoning#SpatiaLab
why featured
HKR-H/K/R pass, but this is an arXiv benchmark whose impact depends on adoption. The 1,400-item setup and 54.93% vs 87.57% gap are useful, below model-release or major product-update weight.
editor take
SpatiaLab puts hard numbers on VLM spatial weakness: InternVL3.5-72B gets 54.93% MCQ accuracy, far below humans at 87.57%.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AI Alignment via Incentives and Correction
The paper models AI alignment as a two-agent solver-auditor fixed point, where a principal selects rewards over joint correction outcomes, and proposes a bandit-based outer loop to search reward profiles from noisy interaction feedback; in an LLM coding pipeline, adaptive rewards maintain oversight pressure and reduce hallucinated incorrect attempts versus static hand-designed rewards, while the abstract does not disclose exact dataset size or reduction rate.
#Agent#Alignment#Code#Research release
why featured
HKR-K/R pass: the paper adds a solver-auditor fixed point and bandit reward search, tied to hallucinated coding attempts. HKR-H is weak and no effect size or experiment scale is disclosed, so it stays in all.
editor take
This frames alignment as a two-agent fixed point; reduction size is undisclosed, so don’t sell bandit reward search as safety.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies
The paper proposes learning multi-indicator weights for instruction data selection using ICL signals from compact tiny-validation sets, and reports that on GSM8K it matches or exceeds full-dataset tuning while using 30% of the training samples across model families including Mistral, Qwen, and Llama.
#Fine-tuning#Reasoning#Mistral#Qwen
why featured
HKR-H/K/R pass on the 30%-data claim, concrete proxy mechanism, and fine-tuning cost angle. As a single arXiv method paper without code or cross-source pickup, it stays below featured.
editor take
The method matches or exceeds full-dataset tuning on GSM8K with 30% of the training samples; I buy task-model selection over static data scores.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA
Tabula RASA uses four-component ablations to show sparse adjacency masking drives most multi-hop KGQA gains, adding +72.5pp on 3-hop MetaQA, +45.5pp on WebQSP, and +53.9pp on CWQ, while learned relation parameters add modest refinement and hurt without structural guidance.
#Reasoning#RAG#Benchmarking#Tabula RASA
why featured
HKR-H/K/R all pass, but this is a narrow arXiv paper on structural inductive bias without a tool release, major-lab model, or product impact. It sits in the 60–71 research band.
editor take
Sparse adjacency masking adds 72.5pp on 3-hop MetaQA; KG reasoning wants topology first, relation weights later.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
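"Sparse adjacency masking" from the Tabula RASA card is sketched below in Python: attention weights are a softmax over scores, but only along edges that exist in the graph. This is a generic illustration, not the paper's architecture. The function name, the 3-entity graph, and the scores are hypothetical, and every node is assumed to have at least one allowed neighbor.

```python
import math

def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over scores, restricted to edges present in adjacency."""
    weights = []
    for row_scores, row_adj in zip(scores, adjacency):
        allowed = [s for s, a in zip(row_scores, row_adj) if a]
        m = max(allowed)  # assumes every node has at least one allowed neighbor
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row_scores, row_adj)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical 3-entity graph: edges 0-1 and 1-2, plus self-loops.
adjacency = [[1, 1, 0],
             [1, 1, 1],
             [0, 1, 1]]
scores = [[0.0, 0.0, 5.0],
          [0.0, 0.0, 0.0],
          [5.0, 0.0, 0.0]]
weights = masked_attention_weights(scores, adjacency)
```

Entity 0 has a high raw score toward entity 2 but no edge to it, so that weight is exactly zero. Information from entity 2 can reach entity 0 only through entity 1, one hop per layer.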
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction
The paper proposes adversarial training on policy-generated trajectories, using a co-evolving discriminator to separate policy trajectories from the data distribution and reduce reward hacking in RL post-training for melody-to-chord accompaniment.
#Fine-tuning#Alignment#arXiv#Research release
why featured
HKR-H and HKR-K pass via the unusual music-interaction reward-hacking setup and a concrete adversarial post-training mechanism. No metrics, dataset details, or artifact are disclosed, so it stays in the 60–71 band.
editor take
GAPT adds a co-evolving discriminator to policy trajectories; narrow music setting, but reward hacking gets a measurable interaction test.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve replaces synchronized stage execution with asynchronous workers and queues, raising proposal throughput on GEPA workloads by 3.5x on local vLLM and 4.9x on API serving versus synchronous GEPA.
#Agent#Inference-opt#FlashEvolve#GEPA
why featured
HKR-H and HKR-K pass: the mechanism and speedup numbers are clear for agent-infra readers. HKR-R is weaker; the post only gives GEPA results, with no code, benchmark breadth, or production deployment disclosed.
editor take
FlashEvolve hits 3.5x/4.9x on GEPA; async queues are old, but treating language staleness as a repairable signal is the sharp bit.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models
The paper tests language models on procedurally generated zero-sum matrix games, where anonymous 2×2, 3×3, and 5×5 payoff matrices cut success to 34%, 18%, and 2%, while supervised fine-tuning on only 2×2 and 3×3 games raises unseen 5×5–7×7 success to 61%.
#Reasoning#Fine-tuning#Benchmarking#Research release
why featured
HKR-H/K pass: the paper quantifies a reasoning failure down to 2% and shows 61% transfer after small-game SFT. HKR-R is weak; this is a single arXiv benchmark without product uptake, so it stays in all.
editor take
Anonymous 5×5 games drop success to 2%; SFT on 2×2/3×3 reaches 61%, so named-game scores look flimsy.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
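The card does not give the paper's definition of an "equilibrium residual". As a stand-in, the Python sketch below computes the standard Nash gap (exploitability) for a zero-sum matrix game: how much each player could still gain by deviating to a best response. The function name and the strategies are illustrative assumptions; payoffs are for the row player.

```python
def nash_gap(payoff, row_strategy, col_strategy):
    """max_i (A q)_i - min_j (p^T A)_j; zero exactly at an equilibrium."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Expected payoff of each pure row against the column mix.
    row_values = [sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
                  for i in range(n_rows)]
    # Expected payoff conceded by each pure column against the row mix.
    col_values = [sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    return max(row_values) - min(col_values)

matching_pennies = [[1, -1],
                    [-1, 1]]
at_equilibrium = nash_gap(matching_pennies, [0.5, 0.5], [0.5, 0.5])
off_equilibrium = nash_gap(matching_pennies, [1.0, 0.0], [0.5, 0.5])
```

The gap is zero for the uniform mix and positive once the row player commits to a pure strategy. A check of this kind needs only the payoff matrix, so it works on anonymous games with no familiar name attached.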
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Continuous Latent Contexts Enable Efficient Online Learning in Transformers
The paper constructs constant-depth transformers that store weighted-majority and Q-learning state in a small number of continuous latent context tokens, then trains a small GPT-2-style model without direct latent-state supervision and reports better performance than Qwen-3-14B and DeepSeek-V3 on long synthetic online prediction sequences.
#Reasoning#Memory#Benchmarking#Qwen
why featured
HKR-K is strong and HKR-H comes from tiny latent contexts beating larger models. The evidence is still long synthetic online prediction, so HKR-R is weak and this stays in the lower research-recommendation band.
editor take
Latent tokens store online-learning state; beating Qwen-3-14B on long synthetic sequences is neat, not deployment evidence.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
The paper introduces Reflective Test-Time Planning for embodied LLMs, scoring multiple candidate actions before execution and updating the reflection model and action policy after execution, with experiments on Long-Horizon Household, MuJoCo Cupboard Fitting, photorealistic HM3D, and a Franka Panda arm.
#Agent#Robotics#Reasoning#arXiv
why featured
HKR-H and HKR-K pass: the hook is test-time trial-and-error reflection, with pre-action scoring and post-execution updates across HM3D, MuJoCo, and Franka Panda. No metrics, release artifact, or major lab angle keeps it below featured.
editor take
RTTP spans 4 settings, but gains lack numbers; I’d scrutinize update cost and reproducibility before buying the reflection story.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Muon Does Not Converge on Convex Lipschitz Functions
The paper proves that Muon does not converge on convex Lipschitz functions under any learning-rate schedule; error feedback restores convergence for Muon and non-Euclidean subgradient methods with momentum, but degrades performance on CIFAR-10 image classification and nanoGPT language modeling on FineWeb-Edu 10B.
#Reasoning#Benchmarking#Muon#CIFAR-10
why featured
HKR-H/K/R all pass, but this is a single arXiv optimizer-theory paper with narrow reach and no cross-source cluster. Technical accessibility keeps it in the 60–71 band.
editor take
Muon fails to converge on convex Lipschitz functions under any LR schedule; error feedback restores convergence in theory but hurts CIFAR-10 and nanoGPT on FineWeb-Edu 10B.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
Flame3D performs training-free 3D scene reasoning by exposing editable visual-textual 3D memories and composable spatial tools to an off-the-shelf MLLM, reports competitive ScanQA results against finetuned 3D-LMM methods, and evaluates multi-hop spatial reasoning on Compose3D, where inference-time synthesis of spatial operations is required.
#Agent#Multimodal#Reasoning#Flame3D
why featured
HKR-H/K pass: zero-shot 3D reasoning plus an editable 3D memory mechanism. No exact scores are disclosed, and the 3D reasoning niche keeps it in the 60–71 research-increment band.
editor take
Flame3D runs ScanQA with zero 3D training; I buy the tool-synthesis path, and finetuned 3D-LMM moats look thinner.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Beyond Multiple Choice: Evaluating Steering Vectors for Summarization
The paper evaluates steering vectors on SAMSum, NEWTS, and arXiv to control topical focus, sentiment, toxicity, and readability in abstractive summaries; high steering strengths consistently induce degenerate repetition and factual hallucinations.
#Inference-opt#Alignment#Benchmarking#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv evaluation with no disclosed model scale, metric numbers, or artifact in the feed. Useful for control/safety work, not featured-level industry news.
editor take
The paper tests steering vectors on 3 summarization sets; high strength causes repetition and hallucination, so multiple-choice steering control does not transfer cleanly.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
The paper proposes MDMF for AI-generated image detection, using a learnable Patch Forensic Signature and Maximum Mean Discrepancy to turn patch-level forensic cues into distributional gaps; the abstract says MDMF beats baseline detectors across multiple benchmarks, but the RSS snippet does not disclose dataset names, metrics, or exact scores.
#Vision#Benchmarking#Research release#Benchmark
why featured
HKR-H/K/R all pass, but this is a single arXiv vision-forensics paper. The mechanism is specific, while scores, code, and cross-source discussion are missing, so it stays in the 60–71 band as all.
editor take
MDMF uses PFS plus MMD for patch anomalies; no scores in RSS, so don’t buy the multi-benchmark win yet.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
TRACE applies KL distillation only to annotated critical spans, uses GRPO on remaining tokens, and improves over GRPO by 2.76 percentage points on average across four held-out math benchmarks plus GPQA-Diamond, while preserving the Qwen3-8B base OOD score on GPQA-Diamond.
#Reasoning#Alignment#Fine-tuning#Qwen
why featured
HKR-K/R pass: the mechanism and 2.76-point gain are concrete, and small-model alignment teams care. Single arXiv paper with incremental gains keeps it in the 60–71 band.
editor take
TRACE beats GRPO by 2.76 pts on five benchmarks; I buy span-KL, but critical-span labeling is the replication tax.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Data Mixing Can Induce Phase Transitions in Knowledge Acquisition
The paper trains LLMs on a synthetic biography dataset mixed with web-scraped data and finds that, once model size or mixing ratio crosses a critical threshold, memorized biographies jump from very few to most rather than scaling smoothly.
#Benchmarking#Reasoning#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv training-dynamics paper; the provided text gives the synthetic-bio/web-data setup but not authors, model sizes, or reproducibility details. Lower-band score: 69, tier all.
editor take
Synthetic bios mixed with web data show threshold jumps in memorization; I buy the setup, and linear recipe extrapolation looks unsafe.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Robust Multi-Agent LLMs under Byzantine Faults
The paper proposes Self-Anchored Consensus, a decentralized iterative filter-and-refine protocol that suppresses Byzantine agents under (F+1)-robust communication-graph conditions on math and commonsense reasoning benchmarks.
#Agent#Reasoning#Safety#Research release
why featured
HKR-H/K/R all pass, but the article only gives abstract-level facts: no effect sizes, dataset scale, or code status. The agent-safety angle is useful, yet not a same-day must-write item.
editor take
SAC needs an (F+1)-robust graph; I care how it labels “reliable messages,” because that filter is the attack surface.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Efficient Evaluation of LLM Performance with Statistical Guarantees
The paper proposes Factorized Active Querying to estimate LLM accuracy under a fixed query budget, using Bayesian factor modeling and active question selection while preserving frequentist CI coverage, and reports up to 5x effective sample size gains on two benchmark suites.
#Benchmarking#Research release#Benchmark#Open source
why featured
HKR-K and HKR-R pass: the paper gives a concrete 5x sample-efficiency claim and targets LLM eval cost. HKR-H is weak, and a single arXiv methods paper stays in the 60–71 band.
editor take
FAQ reports up to 5x effective sample-size gains for LLM accuracy evals; I buy the cost angle, but coverage under missing history is the test.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Nectar: Neural Estimation of Cached-Token Attention via Regression
Nectar replaces cached-token attention with two compact networks per layer and KV-head, and the paper tests it on 1.7B to 8B parameter models across five long-context datasets.
#Inference-opt#Memory#Reasoning#Nectar
why featured
HKR-K and HKR-R pass: the mechanism and test scope are concrete, and KV-cache cost matters. HKR-H is weak, and the summary lacks accuracy, speed, or memory deltas, so this stays in the 60–71 band.
editor take
Nectar makes cached attention cost independent of n; I care about fit cost, and the abstract gives no training budget.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
The paper proposes failure-prefix conditioning for saturated RLVR problems, using prefixes from rare incorrect trajectories to steer exploration toward failure-prone reasoning states; the abstract says it improves performance when standard RLVR stalls and matches gains from newly collected medium-difficulty problems, but the snippet does not disclose exact metrics.
#Reasoning#Alignment#Research release
why featured
HKR-H/K/R pass: failure-prefix training is counterintuitive, the RLVR exploration mechanism is specific, and reasoning-RL plateaus matter. It stays in 60–71 because this is one arXiv paper with no gain numbers, task set, or model sizes disclosed.
editor take
Failure-prefix conditioning mines saturated RLVR tasks with rare wrong prefixes; metrics are undisclosed, so I buy the mechanism, not the claimed magnitude.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
How Instruction and Reasoning Data Shape Post-Training: Data Quality through Layer-Wise Gradients
The paper analyzes LLM post-training data with layer-wise gradient SVD and reports that higher-quality data usually has lower nuclear norms and higher effective ranks, while models within the same family share similar gradient patterns across sizes.
#Reasoning#Fine-tuning#Research release
why featured
HKR-K and HKR-R pass: the paper offers testable gradient-SVD signals for data quality and maps to post-training data selection. HKR-H is weak, with no product or open-source impact, so it stays in 60–71.
editor take
Layer-wise gradient SVD ranks post-training data; effective rank beats nuclear norm, giving data curation a reproducible probe.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
MaD Physics: Evaluating Information Seeking Under Constraints in Physical Environments
MaD Physics evaluates scientific agents across 3 environments with altered physical laws, requiring each agent to measure a system under a fixed budget and infer the underlying law for future-state prediction.
#Agent#Reasoning#Benchmarking#Gemini
why featured
HKR-H/K/R are present: a physics-law twist, 3 constrained environments, and agent-eval relevance. Still, only arXiv-level metadata is disclosed; model results and reproducibility details are missing, so it stays in the 60–71 band.
editor take
MaD Physics uses 3 altered-physics environments; four Gemini models stumble on structured exploration, not textbook recall.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
The paper proposes a self-captioning workflow and a Multimodal Interaction Gate that converts unique interactions into redundant interactions, reporting a 38.3% reduction in visually induced errors and a 16.8% consistency improvement under ambiguous or corrupted modality conditions.
#Multimodal#Vision#Safety#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete mechanism and two measured gains, tied to multimodal reliability. As a single arXiv paper with a jargon-heavy title and no adoption signal, it stays in 60–71.
editor take
This paper trains for multimodal redundancy and cuts visually induced errors by 38.3%; I buy it, since dedup instincts hurt robustness here.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CAMAL: Improving Attention Alignment and Faithfulness with Segmentation Masks
CAMAL uses segmentation masks as an auxiliary regularizer during training to align vision-model attention with ground-truth discriminative regions, and the paper reports statistically significant attention-alignment gains across DL and DRL settings plus over 35% higher attention faithfulness than recent work without extra inference cost.
#Vision#Interpretability#CAMAL#Research release
why featured
HKR-K passes with segmentation-mask regularization and a >35% faithfulness gain; HKR-R is limited to interpretability/reliability. This is academic vision research with no product or artifact, so it stays in 60–71.
editor take
CAMAL reports >35% faithfulness gains via mask regularization; I buy half of it, since the cost moves to labels.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
GLAI: GreenLightningAI for Accelerated Training through Knowledge Decoupling
GLAI replaces conventional MLP blocks by fixing stabilized ReLU activation structure and optimizing only weights and biases, reducing training time by about 40% on average across the reported cases while matching or exceeding equal-parameter MLP accuracy.
#Inference-opt#GreenLightningAI#Research release
why featured
HKR-K/R pass on a concrete ~40% training-speed claim and a mechanism; HKR-H passes on the cost hook. Single arXiv paper, with no code, benchmark scale, or reproduction details disclosed here, keeps it in all.
editor take
GLAI reports 40% average training-time savings. Hold the Transformer hype; the snippet shows no large-scale pretraining proof.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Mistake-Bounded Language Generation
The paper defines mistake-bounded generation, shifts evaluation from eventual consistency to total invalid outputs, and gives a finite-class algorithm with last-mistake time Cdim(L) and mistake bound ⌊log₂|L|⌋.
#Reasoning#Benchmarking#Joshi et al.#Research release
why featured
HKR-H/K/R pass, but this is a theory-heavy arXiv paper. The post gives the objective and finite-class bound, not a usable system, experiment scale, or production evidence, so it stays in all.
editor take
Joshi et al. prove a ⌊log₂|L|⌋ mistake bound for finite language classes; generation evals need this accounting pressure.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
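For a sense of scale, the ⌊log₂|L|⌋ mistake bound quoted in the card above can be evaluated for a few class sizes. This is plain arithmetic on the stated bound, not the paper's algorithm; the function name and the example sizes are illustrative assumptions.

```python
def mistake_bound(num_languages: int) -> int:
    """Return floor(log2(|L|)) for a finite class with |L| >= 1 languages."""
    if num_languages < 1:
        raise ValueError("the class must contain at least one language")
    # For positive integers, bit_length() - 1 equals floor(log2(n)) exactly,
    # which avoids floating-point rounding near powers of two.
    return num_languages.bit_length() - 1


# Hypothetical class sizes, chosen only to show how slowly the bound grows.
for size in (1, 2, 1000, 1_000_000):
    print(size, mistake_bound(size))
```

A class of a million candidate languages allows at most 19 mistakes under this bound, which is why the card frames it as accounting pressure on generation evals.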
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CARL: Criticality-Aware Agentic Reinforcement Learning
CARL uses entropy as a proxy for state criticality and updates only actions from high-criticality states; the paper says a small fraction of states determines final outcomes in multi-step agent tasks, and the source code will be public.
#Agent#Reasoning#CARL#Research release
why featured
HKR-H and HKR-K pass via the critical-state hook and entropy-based update rule. HKR-R is weak because no metrics, task suite, or deployment impact is disclosed, so it stays in the 60–71 research band.
editor take
CARL updates only high-entropy states; metrics are undisclosed. I buy the credit-assignment angle, not entropy as causality.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
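The CARL card above describes using entropy as a criticality proxy and updating only actions from high-criticality states. A minimal sketch of that idea follows, under assumptions: the `keep_fraction` knob, the top-fraction selection rule, and the plain policy-gradient surrogate are hypothetical choices, not the paper's stated procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def critical_state_mask(policies, keep_fraction=0.25):
    """Mark the highest-entropy states as critical.

    policies: one action distribution per visited state.
    keep_fraction: hypothetical knob for the share of states that get updated.
    """
    entropies = [entropy(p) for p in policies]
    keep = max(1, round(keep_fraction * len(policies)))
    ranked = sorted(range(len(policies)), key=lambda i: entropies[i], reverse=True)
    top = set(ranked[:keep])
    return [i in top for i in range(len(policies))]


def masked_policy_loss(log_probs, advantages, mask):
    """Policy-gradient surrogate averaged over critical states only."""
    terms = [-lp * adv for lp, adv, m in zip(log_probs, advantages, mask) if m]
    return sum(terms) / len(terms)


# Hypothetical 4-step trajectory: states 1 and 3 have the most uncertain policies.
policies = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.10, 0.00, 0.00],
    [0.40, 0.30, 0.20, 0.10],
]
mask = critical_state_mask(policies, keep_fraction=0.5)
print(mask)
print(masked_policy_loss([-0.03, -1.39, -0.11, -0.92], [1.0, 2.0, -1.0, 0.5], mask))
```

Note that the mask only tracks policy uncertainty, which is the editor's caveat: a high-entropy state is not necessarily one that decides the outcome.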
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Lattice Deduction Transformers
Lattice Deduction Transformer constrains a recurrent transformer state with lattice projection between passes; its 800K-parameter version reaches 100% accuracy on Sudoku-Extreme and Snowflake Sudoku, while a 1.8M-parameter variant reaches 99.9% on Maze-Hard.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-H/K/R pass, but this is a single arXiv reasoning-architecture paper with evidence centered on Sudoku benchmarks, not agent or product impact. It lands at the high end of 60–71, below featured.
editor take
800K-param LDT hits 100% on two Sudoku sets. Toy benchmark, sure; frontier LLMs scoring 0% is the awkward part.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models
The paper proposes latent visualization by optimization, using sparse autoencoders to split diffusion model layer representations into monosemantic features, and demonstrates the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset with recognizable concepts such as human figures, roses, cables, and waterfall foam.
#Vision#Interpretability#Stable Diffusion#Research release
why featured
HKR-H is the diffusion-feature visualization hook and HKR-K has LVO, SAE, and SD 1.5 Style50 specifics. HKR-R is weak: no product impact, benchmark delta, or safety incident, so it stays in 60–71.
editor take
LVO visualizes SAE features on SD1.5 Style50; out-of-sample evidence is undisclosed, so don’t crown diffusion interpretability yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection
Echo-LoRA injects aggregated boundary hidden states from deeper layers into shallow LoRA or DoRA modules during training, and reports a 5.7-point average gain over LoRA baselines across eight commonsense reasoning benchmarks on LLaMA-7B, LLaMA2-7B, and LLaMA3-8B.
#Fine-tuning#Reasoning#Echo-LoRA#LLaMA
why featured
HKR-K is clear: Echo-LoRA adds cross-layer injection and reports +5.7pp on 8 benchmarks; HKR-R also lands for fine-tuning cost/performance. It remains a single arXiv method paper with no open-source or adoption signal, so it stays in 60–71.
editor take
Echo-LoRA gains 5.7 points on eight commonsense tests; zero inference cost is neat, but reproduced baselines shrink it to 3.0.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Stargazer evaluates eight frontier agents on 120 radial-velocity time-series model-fitting tasks across three difficulty tiers, including 20 archival cases; agents often reach good statistical fits but fail to recover correct physical system parameters, and higher test-time compute brings only marginal gains with frequent recursive failure loops.
#Agent#Benchmarking#Reasoning#Stargazer
why featured
HKR-H/K/R pass through the curve-fit vs parameter-recovery gap and the 120-task, 8-agent setup. The astrophysics constraint keeps it niche, below featured-level agent benchmarks.
editor take
Stargazer tests 8 agents on 120 RV tasks; good fits still miss physical parameters, and extra test-time compute mostly loops.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
PrAg-PO: Prompt Augmented Policy Optimization for Robust and Diverse Mathematical Reasoning
PrAg-PO mixes multiple prompt templates with template-specific format rewards during training, and on an 8.5K-problem MATH Level 3-5 set it outperforms GRPO and DAPO on mathematical reasoning benchmarks.
#Reasoning#Fine-tuning#Benchmarking#PrAg-PO
why featured
HKR-K and HKR-R pass: the paper gives a concrete training recipe and benchmark against GRPO/DAPO. HKR-H fails because the angle is academic, so it stays in the 60–71 band with no hard exclusion.
editor take
PrAg-PO beats GRPO and DAPO on 8.5K MATH problems; I buy the premise—single-template RL is an overfitting trap.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
The paper defines the Overscaling Curse in parallel thinking: the single global sampling budget that maximizes dataset-level accuracy is larger than many individual samples need, because those samples peak at smaller budgets. It proposes LanBo to predict sample-specific optimal budgets before decoding, preserving dataset accuracy while improving latency and memory efficiency.
#Reasoning#Inference-opt#Research release
why featured
HKR-H/K/R pass, but the post gives only the mechanism summary and no benchmark scale or savings numbers. As an arXiv reasoning/inference-optimization paper, it sits high in 60–71, not featured.
editor take
LanBo predicts per-sample budgets before decoding; models, tasks, and savings aren't disclosed, so treat it as early-stop gating for parallel sampling.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
SnareNet: Flexible Repair Layers for Neural Networks with Hard Constraints
SnareNet appends a differentiable repair layer to neural networks, repairs outputs to a user-specified tolerance, and reports more reliable constraint satisfaction on optimization learning and trajectory planning benchmarks than prior work.
#Reasoning#Safety#Benchmarking#SnareNet
why featured
HKR-K and HKR-R pass: the mechanism is clear and hard constraints matter for safe deployment. HKR-H is weak, and the body does not disclose lift size or reproduction details.
editor take
SnareNet adds a differentiable repair layer for user-tolerance constraints; if reproduced, this beats penalty-trained surrogates for hard feasibility.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Validity-Calibrated Reasoning Distillation
The paper proposes validity-calibrated reasoning distillation, comparing student and teacher next-step actions under the same prefix and scaling distillation updates by relative local validity; across math reasoning, code generation, and instruction-following benchmarks, it outperforms strong distillation baselines, while the snippet does not disclose model sizes or benchmark scores.
#Reasoning#Code#Fine-tuning#Research release
why featured
HKR-K is clear: the summary states the validity-calibrated distillation mechanism and task coverage. HKR-R is present via cost/performance pressure, but missing numbers, authorship signal, and artifacts keep it in the 60–71 band.
editor take
VCRD compares teacher-student next-step validity under one prefix; no scores or model sizes disclosed, so don't crown it.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
AU-Harness evaluates Audio LLMs with optimized batch processing and parallel execution, reporting up to 151% speedup over existing toolkits while adding standardized prompting, flexible configurations, and multi-turn dialogue dynamics analysis for fairer benchmark comparisons.
#Audio#Benchmarking#Tools#AU-Harness
why featured
HKR-K is clear: 151% faster evaluation and multi-turn analysis are testable claims. HKR-R is limited to audio-LLM evaluators; with no adoption signal or major-lab backing, this stays in the 60–71 band.
editor take
AU-Harness claims 151% speedup but omits baselines here; audio LLM eval needs reproducible multi-turn decay curves, not another leaderboard.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Consensus Sampling for Safer Generative AI
The paper presents consensus sampling: given k distributions, the black-box sampler abstains when agreement is insufficient and achieves risk competitive with the average risk of the safest s distributions.
#Safety#Alignment#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: consensus sampling gives a concrete abstention rule and a safety/reliability angle. HKR-H fails, and the post shows no experiments, code, or production-pipeline claim, so it stays in 60–71.
editor take
Consensus sampling needs k samplable distributions with likelihoods; safety comes from overlap plus abstention, not inner-model alignment.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
The paper introduces ThinkARM, a framework that uses Schoenfeld's Episode Theory to abstract reasoning traces into steps such as Analysis, Explore, Implement, and Verify, then compares reasoning and non-reasoning models on mathematical problem solving.
#Reasoning#Benchmarking#Interpretability#Schoenfeld
why featured
Single arXiv methods paper with a concrete framework for labeling reasoning traces, but the provided text lacks dataset size, model list, and headline results. HKR-K/R pass; score stays in the interesting-not-featured band.
editor take
ThinkARM segments math traces into steps; sample and model lists aren’t disclosed, so cross-task replication is the test.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
The paper proposes a modular cost-aware regulator that scales agent actions by predicted constraint violations, plugs into off-policy RL methods such as SAC and TD3, and reports up to 126× fewer constraint violations plus over 10× higher returns on sparse-cost Safety Gym locomotion tasks.
#Agent#Reasoning#Safety#arXiv
why featured
HKR-K is solid: adaptive action scaling, SAC/TD3 integration, and up to 126× fewer Safety Gym violations are concrete. HKR-R lands on agent safety, but the narrow RL-benchmark context keeps it in all.
editor take
The regulator cuts violations up to 126× with SAC/TD3; I trust the modular hook before the Safety Gym leaderboard.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Interactive Critique-Revision Training for Reliable Structured LLM Generation
The paper proposes DPA-GRPO, a paired-action training method for a generator-verifier game, and reports higher structured decision accuracy on TaxCalcBench TY24 than zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B.
#Reasoning#Alignment#Benchmarking#Qwen
why featured
HKR-K and HKR-R pass: it has a new training mechanism and reproducible benchmark, but no concrete accuracy delta is disclosed and the framing is academic. Treat as a useful arXiv method paper, below featured.
editor take
DPA-GRPO improves Qwen3-4B/8B on TaxCalcBench TY24, but no deltas are disclosed; useful increment, not a reliability win yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CausalGaze: Unveiling Hallucinations via Counterfactual Graph Intervention in Large Language Models
CausalGaze detects LLM hallucinations with structural causal models, modeling internal states as dynamic causal graphs and applying counterfactual interventions; experiments across 4 datasets and 3 widely used LLMs report a 3.3% AUROC gain on TruthfulQA over state-of-the-art baselines.
#Reasoning#Interpretability#Safety#CausalGaze
why featured
HKR-K and HKR-R pass: the paper gives concrete evaluation scale and AUROC gain, and hallucination detection matters to practitioners. HKR-H is weak; single arXiv paper with no artifact or production claim keeps it in the 60–71 band.
editor take
CausalGaze reports +3.3% AUROC on TruthfulQA; I want the three LLM names and intervention cost before buying it.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CAP: Controllable Alignment Prompting for Unlearning in LLMs
CAP is an end-to-end, prompt-driven unlearning framework that uses reinforcement learning to optimize a prompt generator, suppressing target knowledge while preserving general capabilities without updating model parameters.
#Alignment#Safety#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: the mechanism is concrete and relevant to safety/compliance. No metrics, benchmarks, or artifact are disclosed, and HKR-H is weak, so it stays in the 60–71 band.
editor take
CAP learns unlearning prompts with RL and no weight updates; attractive for closed models, but the abstract gives no baseline numbers.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
AAAC replaces the fixed 4-bit scalar codebook with two learned 64-byte scalar codebooks per layer, selects per weight group by activation-weighted reconstruction error, and finishes quantization in 3–30 minutes on one GPU with no memory beyond the model itself.
#Inference-opt#AAAC#AWQ#GPTQ
why featured
AAAC has clear HKR-K: codebook size, quantization time, and memory condition are specific; HKR-R comes from inference cost. HKR-H is weak, and this is a single arXiv quantization paper, so it fits all, not featured.
editor take
AAAC uses two 64-byte codebooks per layer and quantizes in 3–30 minutes; if accuracy holds, AWQ/GPTQ look lazy.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
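The per-group codebook selection described in the AAAC card above can be sketched as follows. This is an illustration under assumptions: the squared-error form, the activation weights, and the toy 5-entry codebooks are hypothetical (a 4-bit codebook would index 16 entries), and the paper's codebook learning step is not shown.

```python
def quantize(weights, codebook):
    """Map each weight to its nearest codebook entry."""
    return [min(codebook, key=lambda c: abs(c - w)) for w in weights]


def weighted_error(weights, quantized, act_weights):
    """Squared reconstruction error, scaled per weight by an activation statistic."""
    return sum(a * (w - q) ** 2 for w, q, a in zip(weights, quantized, act_weights))


def pick_codebook(weights, act_weights, codebooks):
    """Return the index of the codebook with the lowest activation-weighted error."""
    errors = [weighted_error(weights, quantize(weights, cb), act_weights)
              for cb in codebooks]
    return min(range(len(codebooks)), key=lambda i: errors[i])


# Two toy codebooks: one wide, one concentrated near zero (hypothetical values).
CODEBOOKS = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-0.2, -0.1, 0.0, 0.1, 0.2],
]

small_group = [0.09, -0.12, 0.18, 0.02]
large_group = [0.9, -0.6, 0.4, 1.0]
print(pick_codebook(small_group, [1.0, 1.0, 4.0, 1.0], CODEBOOKS))
print(pick_codebook(large_group, [1.0, 1.0, 1.0, 1.0], CODEBOOKS))
```

Because only one selector bit per group is stored on top of the 4-bit indices, a scheme like this adds little overhead, which is consistent with the card's small-footprint claim.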
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
PRIM: Meta-Learned Bayesian Root Cause Analysis
PRIM frames root cause analysis as Bayesian inference over a synthetic prior of causal models and reports zero-shot inference in 17 ms for systems with up to 100 variables.
#Reasoning#Benchmarking#Fine-tuning#PRIM
why featured
HKR-H/K pass: 17 ms, 100 variables, and zero-shot inference give testable claims. Still, this is a narrow arXiv methods paper with no disclosed open source, production replacement, or major adoption, so it stays in 60–71.
editor take
PRIM reports 17 ms zero-shot RCA at 100 variables; I buy the latency, not yet the synthetic-prior generalization.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
The paper introduces an adaptive regularization framework that estimates batch-level safety risk during fine-tuning with either a judge-based Safety Critic or an activation-based classifier, constrains higher-risk updates to stay close to a safe reference policy, and reports lower attack success rates across multiple model families with no inference-time cost.
#Fine-tuning#Safety#Alignment#Research release
why featured
HKR-K/R pass: the mechanism is concrete and targets safety loss during fine-tuning. HKR-H is weak, and the item lacks model names, experiment scale, or external replication, so it stays in all rather than featured.
editor take
The paper adapts regularization by batch risk with zero inference cost; ASR deltas aren’t disclosed here, so don’t crown it yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics
MathlibLemma introduces an LLM-based pipeline to mine, formalize, and prove folklore lemmas missing from Mathlib. The paper reports 1,506 Lean-checked proofs that pass a proof-bypass screen and builds a benchmark of 4,028 non-trivial type-checked Lean statements.
#Reasoning#Code#Benchmarking#Mathlib
why featured
HKR-H/K/R pass, with concrete proof and benchmark counts. The Lean/formal-math scope narrows audience fit, so it stays below the 72 featured threshold.
editor take
MathlibLemma reports 1,506 Lean-checked proofs; I care more about the tiny Mathlib merge rate, undisclosed here.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
The paper proposes a hierarchical statistical model for benchmark evaluation that incorporates benchmark characteristics and LLM randomness, uses multiple generations to improve score estimation accuracy and reduce variance, and defines a prompt-level difficulty score via correct ratios.
#Benchmarking#Research release#Benchmark
why featured
HKR-K and HKR-R pass: the paper gives a concrete variance-handling mechanism and speaks to benchmark trust. HKR-H is weak, and this is a single arXiv item without a tool, dataset, or visible industry uptake, so it stays in 60–71.
editor take
The paper estimates benchmark variance via multiple generations; single-sample leaderboards look clean and stay statistically dirty.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
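The correct-ratio idea in the multiple-generations card above can be sketched with plain per-prompt averages. This is not the paper's hierarchical model: the variance estimate below is a simple Bernoulli plug-in that treats generations as independent draws, and all function names and data are hypothetical.

```python
from statistics import mean


def correct_ratio(outcomes):
    """Share of correct generations for one prompt (a lower ratio means a harder prompt)."""
    return sum(outcomes) / len(outcomes)


def benchmark_score(results):
    """Benchmark score as the mean of per-prompt correct ratios."""
    return mean(correct_ratio(o) for o in results)


def sampling_variance(results):
    """Plug-in variance of the score that comes from generation randomness alone.

    Each prompt with ratio p and k generations contributes p * (1 - p) / k,
    and the score averages over n prompts, hence the division by n squared.
    """
    n = len(results)
    total = 0.0
    for outcomes in results:
        p = correct_ratio(outcomes)
        total += p * (1.0 - p) / len(outcomes)
    return total / (n * n)


# Hypothetical results: 3 prompts, 4 generations each (1 = correct, 0 = incorrect).
results = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 1]]
print([correct_ratio(o) for o in results])
print(benchmark_score(results))
print(sampling_variance(results))
```

With a single generation per prompt, the per-prompt ratio is 0 or 1 and this plug-in variance collapses to zero, which illustrates how a single-sample leaderboard hides the noise rather than removing it.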
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
The paper tests function vectors across 12 tasks, 6 models, and 4,032 directed cross-template pairs, finding that FV steering often succeeds when the logit lens cannot decode the correct answer at any intermediate layer.
#Interpretability#Safety#Reasoning#Mistral
why featured
HKR-H and HKR-K pass: the title has a counterintuitive hook and the experiment scale is concrete. The topic remains niche mechanistic interpretability, with no product or safety-event resonance, so it stays in the 60–71 band.
editor take
FV steering works across 4,032 pairs while logit lens stays blind; Llama/Gemma safety monitors built on projection will miss interventions.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
No Mean Feat: Simple, Strong Baselines for Context Compression
The paper introduces BenchPress, a reproducible context-compression benchmark suite covering model scales, datasets, compression ratios, and contexts from under 1K to under 8K tokens.
#RAG#Inference-opt#Benchmarking#Research release
why featured
HKR-H/K/R pass, but the feed only gives BenchPress coverage, not baseline results, model names, or reproducible setup details. Useful research-benchmark signal, below the featured bar.
editor take
BenchPress spans <1K to <8K tokens; mean pooling beats causal compression tokens, which is awkward for flashy soft-compression papers.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
FactoryNet Industrial Time-Series Foundation Model Dataset Released
FactoryNet introduces 51 million industrial time-series datapoints across 23,000 task executions, six embodiments, and 27 annotated anomaly types, using an S-E-F-C schema for zero-shot cross-embodiment transfer and parameter-efficient anomaly detection.
#Robotics#Benchmarking#FactoryNet#Research release
why featured
HKR-K is strong: the paper gives reusable industrial time-series scale and anomaly labels. HKR-R is moderate for factory-AI data bottlenecks, but HKR-H is weak and this is an arXiv dataset paper, so it stays below featured.
editor take
FactoryNet ships 51M points across 6 embodiments; without raw sampling rates and license details, industrial time-series reuse stays shaky.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
LLMSYS-HPOBench introduces a live HPO benchmark for real-world LLM systems, covering 364,450 configurations, 12–23 hyperparameter dimensions, 932 fidelity settings, 3–9 inference objective metrics, and 2–10 cost metrics with generated measurement logs.
#Benchmarking#Inference-opt#LLMSYS-HPOBench#AutoML
why featured
HKR-K/R pass: the benchmark adds concrete scale and inference-cost logs for LLM systems optimization. HKR-H is weak and the AutoML/HPO angle is narrow, so it stays in the 60–71 band.
editor take
LLMSYS-HPOBench ships 364,450 configs; inference tuning gets a serious target, but live benchmarks die fast without disciplined maintenance.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Hierarchical Mixture-of-Experts with Two-Stage Optimization
Hi-MoE splits MoE routing into inter-group balancing and intra-group specialization, and in 58B-token large-scale pre-training, Hi-MoE-7B reduces perplexity by 5.6% and improves expert balance by 40% over OLMoE-7B across diverse evaluation domains.
#Inference-opt#Benchmarking#Hi-MoE#OLMoE
why featured
HKR-K is strong and HKR-R applies to training-efficiency readers. This is still a specialist MoE architecture paper, with no major-lab release, open framework, or production-replacement claim, so it fits the 60–71 band.
editor take
Hi-MoE-7B cuts perplexity 5.6% over OLMoE-7B on 58B tokens; the routing idea works, but training cost is undisclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Exploring and Exploiting Stability in Latent Flow Matching
The paper reports that LFM models remain stable under data reduction and capacity shrinkage, then uses three sample-scoring criteria and a two-model coarse-to-fine trajectory design to save data and achieve more than 2x inference speedup while producing comparable outputs.
#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the paper offers sample-scoring mechanisms and a >2x inference-speed claim tied to cost. HKR-H is weak, and a single technical arXiv paper stays below featured.
editor take
LFM stays stable under identical noise seeds and claims a >2x inference speedup; I want dataset sizes before buying “comparable outputs.”
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Can Revealed Preferences Clarify LLM Alignment and Steering?
The paper proposes fitting a discrete choice model to infer an LLM’s cost function from observed decisions, then evaluates preference coherence, objective self-reporting, and prompt-based steering across four medical diagnosis domains and multiple frontier and open-source models.
#Alignment#Safety#Reasoning#Research release
why featured
HKR-H/K/R all pass, but this is a single arXiv methods paper. The text gives the mechanism and four-domain evaluation, not adoption or a field-moving result, so it stays in the 60–71 band.
editor take
The paper infers LLM cost functions across 4 diagnosis domains; I like the lens, but model names and error sizes are undisclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces
LBI reduces backpropagation depth from O(K) to O(log K) by limiting inter-region communication to r-dimensional latent interfaces, replacing full d×d Jacobian combines at O(d^3) with r×r combines at O(r^3), and reports r=16 preserving training quality within 0.16–0.35 cross entropy across four 47–61M-parameter architectures.
#Fine-tuning#Inference-opt#arXiv#Mamba-2
why featured
HKR-K is strong thanks to concrete complexity and experiment numbers, and HKR-R hits training cost. HKR-H is weak; the backprop parallelization topic has a technical-accessibility drag, so it stays in the 60–71 band.
editor take
LBI cuts backward depth to O(log K), with r=16 losing 0.16–0.35 CE; I buy the shape, not the 61M-scale victory lap.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
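To put the LBI complexity claims in rough numbers, here is a back-of-envelope sketch. Only r=16 comes from the summary; the hidden width d=512 and region count K=1024 are hypothetical values chosen for illustration, and the depth formula assumes a balanced tree-style scan rather than the paper's exact schedule.

```python
import math

def scan_depth(k: int) -> int:
    """Levels in a balanced tree combine over k elements (order log2 k)."""
    return math.ceil(math.log2(k)) if k > 1 else 0

def combine_cost_ratio(d: int, r: int) -> float:
    """Ratio of a dense d x d combine (~d^3) to an r x r combine (~r^3)."""
    return (d ** 3) / (r ** 3)

# Hypothetical sizes: K regions and hidden width d are NOT from the paper.
K, d, r = 1024, 512, 16
print("sequential depth:", K)
print("scan depth:", scan_depth(K))
print("per-combine cost ratio:", combine_cost_ratio(d, r))
```

The ratio grows with the cube of d/r, so the saving depends heavily on how small r can go before training quality degrades.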
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Prediction Bottlenecks Don't Discover Causal Structure, But Here's What They Actually Do
The paper retests a Mamba prediction bottleneck with VAR, Lorenz, CauseMe-style generators and 3 intervention semantics, finding about 60% of the reported intervention gain comes from a sample-size confound.
#Benchmarking#Reasoning#Mamba#Research release
why featured
HKR-H/K/R pass: the paper debunks a causal-discovery claim and gives a 60% confounding estimate. The niche causal-eval and Mamba setup keeps it in 60–71, not featured.
editor take
The retest pins ~60% of the Mamba bottleneck's intervention gain on a sample-size confound; I don't buy “prediction learns causality” when Lasso and linear baselines pierce it.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG
CDS4RAG separates retriever and generator hyperparameters and optimizes them cyclically; across four benchmarks and two backbone LLMs, it improves vanilla algorithms in 21 of 24 cases and reports up to 1.54x higher generation quality than state-of-the-art methods.
#RAG#Inference-opt#Benchmarking#CDS4RAG
why featured
HKR-K and HKR-R pass: the paper gives concrete experiment counts and addresses RAG tuning practice. HKR-H is weak, and as a single arXiv methods paper without an artifact or wider debate, it stays in 60–71.
editor take
CDS4RAG wins 21/24 across 4 benchmarks and 2 LLMs; I buy split tuning, but eval cost is underdisclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
The paper analyzes eight Qwen2.5 and OLMo2 models, using representation lenses to track residual-stream readout subspaces and identify three geometric phases: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence.
#Interpretability#Reasoning#Qwen2.5#OLMo2
why featured
HKR-H and HKR-K pass: the paper offers 8 models and a three-phase mechanism. HKR-R is weak, and the representation-lens/residual-stream framing is specialist, so it lands in the 60–71 band.
editor take
Eight Qwen2.5/OLMo2 models tested; framing depth as candidate disambiguation beats another logit-lens heatmap.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip compresses BF16 KV tensors at 613.3 GB/s and decompresses them at 2181.8 GB/s. In disaggregated LLM serving experiments, it preserves KV tensors bitwise, raises end-to-end KV transfer speed by up to 1.32×, cuts TTFT by 1.30×, and increases request throughput by 1.23×.
#Inference-opt#SplitZip#arXiv#Research release
why featured
HKR-K/R pass: the paper gives concrete throughput and TTFT numbers for KV transfer in disaggregated serving. HKR-H is weak, and the infra-specialist scope keeps it below featured.
editor take
SplitZip speeds up BF16 KV transfer by up to 1.32×; 613.3 GB/s compression is strong, but network and serving overhead eat the win.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
arXiv:2605.01345v3 frames high-resolution VLM reasoning as sequential Bayesian optimal experimental design and introduces FOVEA, a training-free crop-proposal probing procedure; experiments report consistent gains over direct and ReAct-style baselines, but the RSS snippet does not disclose exact improvement numbers.
#Vision#Reasoning#Benchmarking#Research release
why featured
HKR-H/K/R pass, but the body gives no gain numbers, model list, or reproducible setup. This is useful VLM research signal, not a same-day featured item, so it stays in the 60–71 all band.
editor take
FOVEA probes crops without training for high-res VLMs; gains are undisclosed, so the framing lands better than the evidence.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Zero-shot Imitation Learning by Latent Topology Mapping
ZALT achieves 55% zero-shot success on unseen tasks in a complex 3D maze, versus 6% for the strongest baseline; the method identifies latent hub states, learns hub-to-hub policies and dynamics, and plans over the resulting topology.
#Agent#Reasoning#ZALT#Research release
why featured
HKR-H and HKR-K pass: the paper gives a 55% vs 6% result and a hub-to-hub mechanism. HKR-R is weak because it remains a 3D-maze research result with no agent-product or cost impact.
editor take
ZALT hits 55% on unseen 3D-maze tasks. The 6% baseline gap is huge; I’d audit demo coverage and hub leakage first.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
The SDiaReward team released an end-to-end multi-turn speech reward model, SDiaReward-Dataset, and ESDR-Bench, using pairwise preference supervision to evaluate prosody, emotion, and colloquialness across full spoken dialogue episodes.
#Audio#Benchmarking#Multimodal#SDiaReward
why featured
HKR-K and HKR-R pass: it offers a speech-dialogue reward model, dataset, and benchmark for voice-agent evaluation. HKR-H is weak, and no major lab or headline metric lifts it above the interesting-research band.
editor take
SDiaReward scores full multi-turn speech episodes; sample size is undisclosed, so hold the SOTA claim, but speech rewards need this target.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
HyperTransport: Amortized Conditioning of T2I Generative Models
HyperTransport maps CLIP embeddings through a hypernetwork to intervention parameters, validates on 167 held-out concepts, and produces each new intervention in one forward pass, 3,600–7,000× faster than per-concept fitting.
#Vision#Multimodal#Fine-tuning#CLIP
why featured
HKR-H and HKR-K pass: the paper gives a concrete mechanism, 167 unseen concepts, and a 3,600–7,000× speed claim. HKR-R is weak because this is a single arXiv T2I conditioning paper with no disclosed product or open-source path.
editor take
HyperTransport is 3,600–7,000× faster on 167 held-out concepts; I buy the speed, but CLIP/VLM judging still favors nameable concepts.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping
The paper introduces the GONE benchmark and NEDS framework for knowledge-graph unlearning, evaluating LLaMA-3-8B and Mistral-7B across multiple editing and unlearning methods, with NEDS scoring 1.000 on unlearning efficacy and 0.839 on locality.
#Reasoning#Fine-tuning#Benchmarking#LLaMA
why featured
HKR-K and HKR-R pass: the paper adds a benchmark, method, and concrete metrics tied to unlearning and compliance. HKR-H is weak, and this is a single arXiv paper, so it stays below the 72 featured bar.
editor take
GONE tests KG unlearning on LLaMA-3-8B and Mistral-7B; NEDS hits 1.000 efficacy, 0.839 locality—multi-hop leakage gets a real target.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Can Muon Fine-tune Adam-Pretrained Models?
The paper studies optimizer mismatch when Muon fine-tunes Adam-pretrained models through controlled experiments, finding that performance degradation scales with update strength and that LoRA narrows the full fine-tuning performance gap between Adam and Muon across language and vision tasks.
#Fine-tuning#Vision#Muon#Adam
why featured
HKR-K and HKR-R pass: the paper gives a testable optimizer-mismatch finding and affects LoRA fine-tuning choices. The topic is narrow training optimization, with no broader product or platform impact, so it sits in 60–71.
editor take
Muon fine-tuning Adam-pretrained models degrades with update strength; LoRA narrows the gap, but Adam dependency is still the tax.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Test-Time Training for Visual Foresight Vision-Language-Action Models
The paper proposes T³VF for Visual Foresight VLA models, using predicted future images and later observations as a supervision pair during test time under OOD conditions; the RSS snippet says it adds adaptive update filtering and modest inference cost, but does not disclose benchmark scores.
#Vision#Robotics#Fine-tuning#Research release
why featured
HKR-H and HKR-K pass: T³VF has a concrete test-time self-training mechanism for VF-VLA models. No benchmark scores are disclosed, and the robotics niche limits HKR-R, so it stays below featured.
editor take
T³VF trains on later observations at test time; scores are undisclosed, so I buy the mechanism, not the cost claim.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control
TetraJet-v2 applies NVFP4 to activations, weights, and gradients in all linear layers, and in pre-training runs up to 370M parameters and 212B tokens it reduces the average gap to BF16 by 51.3% while reporting a 1.67x end-to-end speedup over FP8.
#Fine-tuning#Inference-opt#TetraJet-v2#THU ML
why featured
HKR-K/R pass: the paper gives a concrete NVFP4 training path and a 51.3% BF16-gap reduction, tied to training cost. HKR-H is weak, and evidence tops out at 370M params, so it stays in all.
editor take
TetraJet-v2 cuts the BF16 gap 51.3% at 370M/212B tokens; solid 4-bit training mechanics, but not billion-scale yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
What Should Post-Training Optimize? A Test-Time Scaling Law Perspective
The paper studies post-training when training has only m≪N rollouts per prompt but deployment uses best-of-N selection. It derives Tail-Extrapolated estimators, including TEA and Prefix-TEA, to approximate best-of-N policy gradients from small rollout groups, and reports gains across instruction-following models, reward models, datasets, and budget settings.
#Reasoning#Alignment#Inference-opt#Research release
why featured
HKR-H/K/R all pass, but the item only exposes abstract-level facts and no gains, model scale, or reproducible results. This is useful post-training research, not a same-day must-write.
editor take
TEA estimates best-of-N gradients with m≪N rollouts; I buy the setup, but the tail assumptions carry the risk.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
The paper proposes Gen-LRA, a no-box membership inference attack that audits synthetic tabular data leakage without model knowledge or access by estimating a local likelihood ratio with a surrogate model.
#Safety#Benchmarking#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: Gen-LRA gives a no-box membership-inference mechanism for synthetic data auditing. With only arXiv-summary facts and no results or wider uptake, it stays in the 60–71 band.
editor take
Gen-LRA attacks membership from synthetic tables alone; gains at low FPR lack numbers, but no-box auditing is the useful bite.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
RAM applies KL-regularized reward optimization to diffusion and flow-matching post-training, using clean endpoints sampled from the current model, reward evaluation, pretraining-style noising, and regression; on Stable Diffusion 3.5 Medium (SD 3.5M), it reaches Flow-GRPO’s peak reward in up to 50× fewer training steps without SDE rollouts, backward adjoint sweeps, or reward gradients.
#Fine-tuning#Multimodal#Alignment#Stable Diffusion
why featured
HKR-H/K/R pass via the 50x training-step claim and concrete RAM mechanism, but this is a niche diffusion/flow-matching post-training paper with no code, author signal, or independent replication disclosed.
editor take
RAM matches Flow-GRPO on SD 3.5M with 50× fewer steps; image RL as regression is the right engineering smell.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Post-hoc Selective Classification for Reliable Synthetic Image Detection
ReSIDe estimates confidence for an existing synthetic image detector without retraining. Under common covariate shifts, it aggregates layer-level scores and cuts AURC by up to 69.55%.
#Vision#Safety#Benchmarking#ReSIDe
why featured
HKR-K passes with a concrete mechanism and 69.55% AURC figure; HKR-R passes via synthetic-media safety and moderation reliability. HKR-H is weak, and this is a single arXiv methods paper below featured threshold.
editor take
ReSIDe cuts AURC by up to 69.55% without SID retraining; abstention beats another brittle fake-image verdict.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
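For readers unfamiliar with the AURC metric cited in the ReSIDe item, a minimal sketch of the standard area under the risk-coverage curve follows. This is the generic definition, not ReSIDe's code; the confidence values and correctness flags are made-up toy data.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve (lower is better).

    Predictions are ranked from most to least confident; at each coverage
    level the selective risk is the error rate among the kept predictions.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty sequences")
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for covered, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / covered)  # selective risk at this coverage
    return sum(risks) / len(risks)

# Toy data (hypothetical): same detector outputs, two confidence rankings.
conf = [0.9, 0.8, 0.6, 0.4]
good_ranking = aurc(conf, [True, True, False, True])
bad_ranking = aurc(conf, [False, True, True, True])
print(good_ranking, bad_ranking)
```

A confidence estimator that ranks the wrong predictions last lowers AURC without touching the underlying detector, which is why a post-hoc method can move this metric.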
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi frames embodied policy improvement as verifier-gated edits to a typed executable knowledge base, then runs accepted policies with a deterministic symbolic executor at inference with zero LLM calls.
#Agent#Robotics#Tools#Kintsugi
why featured
HKR-H/K/R pass, but the body discloses mechanism only, with no task count, success rate, or benchmark delta. A single arXiv paper fits the 60–71 band, below featured.
editor take
Kintsugi uses zero LLM calls at inference; I buy the verifier-gated KB patching, not the white-box branding.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
The Two Clocks and the Innovation Window: When and How Generative Models Learn Rules
The paper defines two training timescales on rule-valid synthetic tasks: τ_rule marks the first rule-valid generations, while τ_mem marks reproduction of training samples; τ_rule increases with rule complexity and decreases with model capacity, while τ_mem is approximately rule-invariant and scales nearly linearly with dataset size N.
#Reasoning#Benchmarking#arXiv#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv paper on synthetic tasks with no disclosed real-model or production result, so it stays in the 60–71 band.
editor take
The paper separates τ_rule from τ_mem: memorization time grows nearly linearly with N, while rule complexity shrinks the innovation window.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Task-Aware Calibration: Provably Optimal Decoding in LLMs
The paper introduces task calibration for LLM decoding, calibrating predictive distributions in task-induced latent spaces such as labels, integers, or sets, and proves that MBR decoding on the calibrated latent distribution is optimal under latent model beliefs.
#Inference-opt#Reasoning#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: the hook is “provably optimal decoding,” and the summary gives the task-calibration plus MBR mechanism. With only abstract-level detail and no metrics or product impact, it stays in the 60–71 band.
editor take
Task calibration is proved for labels, integers, and sets; I buy it there, not for open-ended generation.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
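The MBR step in the task-calibration item can be illustrated with a small sketch. It assumes a calibrated belief over an integer-valued latent is already available; the calibration procedure itself is not shown, and the belief values and loss functions below are hypothetical.

```python
def mbr_decode(belief, loss, candidates=None):
    """Minimum Bayes risk decoding: pick the action with lowest expected loss.

    belief maps each latent value to its probability; loss(a, y) scores
    answering a when the latent value is y.
    """
    candidates = list(belief) if candidates is None else list(candidates)

    def expected_loss(action):
        return sum(p * loss(action, y) for y, p in belief.items())

    return min(candidates, key=expected_loss)

# Hypothetical calibrated belief over an integer answer.
belief = {3: 0.4, 7: 0.35, 8: 0.25}

def zero_one(a, y):
    return 0.0 if a == y else 1.0

def abs_err(a, y):
    return abs(a - y)

print(mbr_decode(belief, zero_one))  # exact-match loss favors the mode
print(mbr_decode(belief, abs_err))   # absolute error favors a central value
```

The same belief yields different answers under different task losses, which is the reason to decode in the task-induced space instead of over raw token strings.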
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD replaces static length rewards with online adaptive mechanisms for efficient CoT reasoning; it calibrates the correctness-efficiency trade-off at each step and estimates a per-problem target length from the model’s own correct rollouts, with evaluation on five mathematical reasoning benchmarks against RL-trained efficient-reasoning methods.
#Reasoning#Inference-opt#Benchmarking#OpenAI
why featured
HKR-K and HKR-R pass: the paper proposes online adaptive rewards for shorter CoT and evaluates on five math benchmarks. It stays in the 60–71 band because this is a single arXiv method paper with no disclosed artifact or production proof.
editor take
LEAD tests on 5 math benchmarks; per-problem length targets are sane, but the snippet hides actual token savings.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Entropy-informed Decoding: Adaptive Information-Driven Branching
EDEN adjusts the branching factor at each generation step using token-distribution entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions; experiments on math reasoning, code generation, and scientific questions report better accuracy-expansion trade-offs than fixed-width beam search.
#Inference-opt#Reasoning#Code#Research release
why featured
HKR-K and HKR-R pass: EDEN describes a concrete entropy-based branching rule and claims gains over fixed-width beam search on math, code, and science QA. The summary lacks effect sizes, model scale, and reproducibility details, so it stays in the mid-range.
editor take
EDEN branches by per-step entropy, but models, datasets, and deltas aren’t disclosed; I’d file this under decoding compute-savers to reproduce.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
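Since the EDEN summary does not disclose the exact branching rule, the following is only a sketch of the general idea: widen the search where the next-token distribution is high-entropy, stay greedy where it is peaked. The linear entropy-to-width mapping and the `max_width` parameter are assumptions, not the paper's method.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branching_factor(probs, max_width=4):
    """Map normalized entropy in [0, 1] to a width in [1, max_width].

    Hypothetical rule: linear interpolation, rounded to the nearest integer.
    """
    if len(probs) < 2:
        return 1
    normalized = min(1.0, entropy(probs) / math.log(len(probs)))
    return 1 + round(normalized * (max_width - 1))

peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain
flat = [0.25, 0.25, 0.25, 0.25]     # model is maximally uncertain
print(branching_factor(peaked), branching_factor(flat))
```

Fixed-width beam search spends the same budget at every step; a rule of this shape spends it only where the distribution says candidates are genuinely contested.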
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec uses idle windows on faster ranks to pre-generate later rollout drafts while preserving strict synchronous RL exactness; evaluations report 50% fewer decoding steps and up to 1.8x higher rollout throughput.
#Reasoning#Inference-opt#BubbleSpec#Research release
why featured
HKR-H/K/R pass, but this is a niche synchronous-RL systems paper. The 50% decoding-step and 1.8x throughput claims are useful, yet no code, replication, or major deployment is disclosed, so it stays in the 60–71 band.
editor take
BubbleSpec turns fast-rank idle bubbles into drafts and cuts decoding steps 50%; synchronous RL speedups needn't sacrifice exactness.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Sparse Layers Are Critical to Scaling Looped Language Models
The paper compares standard and MoE Transformers with and without looped layers, finding that Looped-MoE scales better through routing divergence across repeated passes and offers better compute-quality trade-offs when early exits occur at loop boundaries.
#Inference-opt#Reasoning#Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete Looped-MoE scaling mechanism and early-exit condition tied to inference cost. HKR-H is weak, and a single arXiv abstract keeps it in the 60–71 band.
editor take
Looped-MoE wins via cross-pass routing divergence; scale details aren't disclosed, so don't extrapolate to frontier LMs yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications
SmartEval introduces a 9,000-contract Solidity benchmark with a five-dimensional rubric, validated through three empirical studies including ablations, expert review, and Slither-based security analysis.
#Code#Benchmarking#Columbia University#Slither
why featured
HKR-K and HKR-R pass with a concrete benchmark size and safety-relevant coding use case. HKR-H is weak, and the Solidity-evaluation niche keeps it in the 60–71 band.
editor take
SmartEval ships 9,000 Solidity contracts; the +8.29 over human ground truth is the spicy claim—check FSMSCG quality first.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Security Enhancement Methods for Adversarially Robust LLM Agents in Medical Decision-Making
ARSM-Agent uses a six-stage security pipeline and a 0.3/0.3/0.2/0.2 weighted joint objective; under semantic perturbation, prompt injection, drug-name confusion, and false-evidence attacks, it reduces overall attack success to 8.7% and reaches a 0.91 knowledge consistency score.
#Agent#RAG#Safety#ARSM-Agent
why featured
HKR-K and HKR-R pass: the item gives concrete defenses and attack-success numbers, and medical-agent safety has real deployment stakes. Single arXiv paper, dry framing, and limited reproducibility detail keep it in all.
editor take
ARSM-Agent reports 8.7% attack success; with only four in-paper baselines, don’t trust the medical-agent safety claim yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
LLiMba adapts Qwen2.5-3B-Instruct into a Sardinian-ready 3B model using CPT and SFT on one 24 GB consumer GPU, with 11.5 million Sardinian tokens and 2.4 million Romance replay tokens; rsLoRA r256 reaches 28.5 BLEU for English-to-Sardinian, versus 17.3 after CPT and 21.0 with full fine-tuning.
#Fine-tuning#Benchmarking#Qwen#Research release
why featured
HKR-H/K/R pass, but the scope is niche low-resource-language fine-tuning rather than a broad model or tool release. Concrete setup and BLEU make it useful signal, but importance stays below featured.
editor take
LLiMba gets 28.5 BLEU from 11.5M Sardinian tokens; for low-resource languages, r256 adapters beat full fine-tuning.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling
TileQ compresses MoE expert parameters with fine-tuning-free PTQ, shares low-rank factors across input and output dimensions via 2D tiling, and reports up to 10× lower extra memory usage with inference latency reduced to about 5%.
#Fine-tuning#Inference-opt#TileQ#Research release
why featured
HKR-K/R pass: the paper gives a concrete MoE PTQ mechanism and a 10x extra-memory claim tied to serving cost. Single arXiv paper, dense title, no code or adoption disclosed, so it stays in 60–71.
editor take
TileQ claims 10× lower MoE PTQ extra memory; I want code and expert-scale tables before trusting the 5% latency number.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
ActivationReasoning: Logical Reasoning in Latent Activation Spaces
ActivationReasoning embeds explicit logical reasoning into LLM latent spaces through three stages: identifying concept representations, activating propositions at inference time, and applying logical rules, with evaluation on PrOntoQA, Rail2Country, ProverQA, and BeaverTails.
#Reasoning#Interpretability#Safety#Research release
why featured
HKR-H/K pass: the latent-space reasoning angle is novel, and the summary gives a 3-stage method plus four benchmarks. No gains, code, or deployment context are disclosed, so it stays in the 60–71 research-paper band.
editor take
ActivationReasoning uses 4 benchmarks, but no models or scores in the snippet; SAE features as rules look neat, not proven.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
BRIDGE: Building Representations for Domain-Guided Program Synthesis
BRIDGE was evaluated on 178 algorithmic problems and five LLMs, using Code, Specification, and Theorem/Proof domains to improve Lean executable correctness by nearly 1.5x over direct prompting.
#Code#Reasoning#Fine-tuning#BRIDGE
why featured
HKR-H/K/R pass via the near-1.5x Lean gain, 178 tasks/5 LLMs, and code-correctness pressure. It stays in 60–71 because formal-verification scope is narrow and no product adoption or major lab signal is disclosed.
editor take
BRIDGE gets nearly 1.5x Lean executable correctness across 178 tasks and 5 LLMs; specs and proof traces are training signal, not garnish.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
FreeMOCA: Memory-Free Continual Learning Framework for Malicious Code Analysis
FreeMOCA preserves prior malware knowledge through adaptive layer-wise interpolation between consecutive task updates, without replay memory. On EMBER and AZ benchmarks, it beats 11 baselines in Class-IL and raises accuracy by up to 42% and 37%, while reporting best retention across compared methods.
#Memory#Fine-tuning#Benchmarking#IQSeC-Lab
why featured
HKR-K is strong and HKR-H has a clear “memory-free retention” hook, but the paper sits in niche security ML with no product or agent impact. Defaulting to the lower 40–59 band.
editor take
FreeMOCA beats 11 baselines by up to 42%/37% on EMBER/AZ; replay-free forgetting control is nice, security needs replication.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM builds ProcCorpus-60M from 30 embodied datasets with 60 million annotated frames. It trains a procedure-grounded vision-language reward model for dense progress estimation, with action segmentation and future planning in ProcVQA pretraining.
#Robotics#Vision#Reasoning#ProcVLM
why featured
HKR-K is strong with 30 datasets and 60M annotated frames; HKR-R is mostly for robotics reward-learning practitioners. The technical title and non-flagship source keep it below featured.
editor take
ProcVLM trains on 60M annotated frames from 30 datasets. Good strike against time-proxy rewards; downstream policy gains are undisclosed.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward
VIGOR uses the policy model’s own gradient norms as RL rewards; on Qwen2.5-7B-Base post-trained on MATH, it improves average math accuracy by 3.31% and average code accuracy by 1.91% over the RLIF baseline.
#Reasoning#Code#Fine-tuning#Qwen
why featured
HKR-H/K/R all pass, but this is a single arXiv post-training paper on Qwen2.5-7B-Base with +3.31%/+1.91% gains. Useful research signal, not same-day must-write.
editor take
VIGOR beats RLIF by 3.31% on math with Qwen2.5-7B-Base. Verifier-free RL looks useful, but gradient-norm reward smells self-reinforcing.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
The Differences Between Direct Alignment Algorithms Are a Blur
The paper compares direct alignment algorithms under a unified two-stage framework and finds that the pairwise versus pointwise ranking objective is the main driver of alignment quality, while the scalar score, such as policy-reference ratio versus odds ratio, is secondary across instruction-following and math-reasoning benchmarks.
#Alignment#Reasoning#Benchmarking#arXiv
why featured
HKR-K is solid: the paper separates four DAA differences and says objective form drives alignment quality. HKR-H has a contrarian hook, but HKR-R is weak without model names, scale, or deployment stakes.
editor take
This pins DAA variance to 4 axes: pairwise vs pointwise drives quality, so ORPO-style scalar-score worship needs a cooldown.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Parameter-Efficient Neuroevolution for Diverse LLM Generation
QD-LLM evolves about 32K-parameter prompt embeddings for frozen 70B+ LLMs and reports 46.4% higher coverage than QDAIF on HumanEval, MBPP, and creative writing benchmarks under 30 runs with p<0.001.
#Fine-tuning#Benchmarking#Llama#Mistral
why featured
HKR-K is strong: 32K evolved prompt-embedding parameters on frozen 70B+ LLMs with a 46.4% coverage gain. HKR-H lands on the mechanism, but HKR-R is weak, so this stays in the 60–71 research-interest band.
editor take
QD-LLM moves only ~32K prompt-embedding params on frozen 70B+ LLMs; the 34% edge-case gain beats the writing-diversity score.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
The paper introduces f-GRPO and f-HAL, extending f-divergence estimation to RLVR and hybrid alignment, and proves expected reward improvement after alignment.
#Alignment#Reasoning#Safety#Research release
why featured
HKR-K is clear via f-GRPO/f-HAL and the f-divergence mechanism; HKR-R applies for post-training and safety practitioners. HKR-H is weak, and the arXiv-style theoretical framing keeps it in the lower band.
editor take
f-GRPO beats GRPO on math RLVR, but no margin is disclosed; the reward-hacking claim needs numbers before adoption.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Understanding Asynchronous Inference Methods for Vision-Language-Action Models
The paper compares four asynchronous inference methods for VLA models under controlled codebases, benchmarking Kinetix and LIBERO with inference delays up to 20 control steps; A2C2 keeps above a 90% solve rate on Kinetix through an 8-step delay and leads LIBERO from delay 4 onward.
#Robotics#Vision#Inference-opt#arXiv
why featured
HKR-K is strong: 4 methods, 2 benchmarks, 20-step delay, and A2C2 above 90% at 8-step delay. HKR-H is weak, and async VLA inference is narrow, so this fits all rather than featured.
editor take
A2C2 stays above 90% on Kinetix at 8-step delay. For async VLA, residual correction beats bigger-model theater.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
The Safety-Aware Denoiser for Text Diffusion Models
The paper proposes Safety-Aware Denoiser, an inference-time framework that modifies iterative denoising in text diffusion models and evaluates safety across three risk categories: hazard taxonomy, memorization, and jailbreak.
#Safety#Alignment#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the item gives a concrete inference-time mechanism and three safety-risk tests. HKR-H is weak, and text diffusion safety is still niche, so it stays in the 60–71 band.
editor take
SAD changes denoising at inference; no risk-reduction numbers disclosed, so I’d file it as a safety-interface experiment.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec compresses the drafter LM-head’s internal representation with a low-rank parameterization, preserves full vocabulary support, and delivers 4–5× acceleration over the standard LM-head on EAGLE-3 across three target models.
#Inference-opt#SlimSpec#EAGLE-3#Research release
why featured
HKR-K/R pass: the 4–5x LM-head speedup and EAGLE-3 setup add concrete value and touch inference cost. HKR-H is weak, and the low-level serving angle keeps it in the 60–71 band.
editor take
SlimSpec makes EAGLE-3’s draft LM-head 4–5× faster; low-rank internals look cleaner than brittle vocab truncation tricks.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Geometric 4D Stitching for Grounded 4D Generation
The paper proposes Geometric 4D Stitching, which identifies missing geometric regions and completes them with grounded 4D stitches, constructing 4D scene representations in under 10 minutes per one-step scene expansion on a single NVIDIA RTX 5090 GPU.
#Vision#Multimodal#arXiv#NVIDIA
why featured
HKR-H/K pass: the 4D scene-expansion hook and RTX 5090 under-10-minute condition add signal. HKR-R is weak; this remains specialist vision-generation research, so it stays in 60–71.
editor take
Geometric 4D Stitching runs one expansion under 10 minutes; I want the geometry metrics, and the snippet gives none.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Locking Pretrained Weights via Deep Low-Rank Residual Distillation
The paper proposes DLR-Lock, replacing each pretrained MLP with a comparable-parameter DLR-Net so backpropagation activation memory grows linearly with depth, and tests resistance to standard fine-tuning under adaptive attackers with full knowledge of the defense.
#Fine-tuning#Safety#Research release#Safety/alignment
why featured
HKR-H/K/R pass: the anti-fine-tuning hook is novel, with DLR-Net and omniscient-attacker details. Still an arXiv technical paper without success rates, code, or independent uptake, so it stays in 60–71.
editor take
DLR-Lock replaces every pretrained MLP, making activation memory grow linearly with depth; I don’t buy “weight locking” without scale or overhead numbers.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
MURPHY extends GRPO to multi-turn code generation by building feedback-conditioned rollout trees and propagating rewards backward; across HumanEval, MBPP, and LiveCodeBench-v6, it raises pass@1 by up to 6% absolute on Qwen3-1.7B/4B and OLMo-2-7B.
#Agent#Code#Fine-tuning#Qwen
why featured
HKR-K and HKR-R pass: MURPHY claims up to +6% pass@1 across HumanEval, MBPP, and LiveCodeBench-v6 for Qwen3/OLMo-2. HKR-H is weak; no code release, training cost, or production result is disclosed.
editor take
MURPHY adds up to 6% pass@1 on three code benchmarks; multi-turn code RL finally credits failed attempts that teach.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Learning When to Trust LLM Priors: A Validated Framework for Semantic Prior Integration
Statsformer maps LLM-derived feature scores into linear and nonlinear predictors, then uses out-of-fold validation to calibrate each prior-informed learner’s weight before semantic priors affect the final predictor.
#RAG#Reasoning#Benchmarking#Statsformer
why featured
HKR-H and HKR-K pass: the title targets the practical problem of trusting LLM priors, and the summary gives an out-of-fold calibration mechanism. No results, benchmark numbers, or deployment setting keeps it mid-band.
editor take
Statsformer calibrates LLM-prior weights via out-of-fold validation; I like the move: semantic knowledge gets demoted to testable features.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA
The paper proposes Residual Feature Alignment Unlearning, using LoRA to decompose intermediate features and train zero residuals on retained data and shifted residuals on the unlearning set.
#Fine-tuning#Alignment#Research release
why featured
HKR-K is present via the LoRA residual-alignment mechanism, and HKR-R via unlearning and compliance concerns. No benchmark, dataset, code, or surprising result is disclosed, so it stays in the 60–71 research-signal band.
editor take
RFAU uses LoRA on intermediate residuals; no experiment numbers disclosed, so treat the unlearning claim as unpriced.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
The paper defines causal dimensionality κ(L,M,T) and estimates it with SAE width sweeps plus attribution patching; on Gemma-2-2B layer 12 across seven SAE widths, representational capacity grows 15.6× while causal capacity grows 4.35×.
#Interpretability#Benchmarking#Gemma#Research release
why featured
HKR-H and HKR-K pass: the paper adds κ, SAE width scans, and a Gemma-2-2B layer-12 15.6x/4.35x contrast. HKR-R is weak because this is specialist interpretability, so it stays in all.
editor take
Gemma-2-2B layer 12 gets 15.6× representation growth but 4.35× causal growth; wider SAEs look less magical.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation
AdaPreLoRA applies an Adafactor diagonal Kronecker preconditioner to LoRA updates. It derives a closed-form factor-space solve using O((m+n)r) memory, selects the update minimizing an H_t-weighted factor imbalance, and reports competitive or better results on GPT-2 E2E, Mistral-7B, Qwen2-7B GLUE, ARC, GSM8K, and diffusion personalization tasks.
#Fine-tuning#Inference-opt#Benchmarking#AdaPreLoRA
why featured
HKR-K/R pass: the paper gives a concrete mechanism, memory bound, and model test set, with relevance to LoRA fine-tuning cost. HKR-H is weak, and the optimizer detail keeps it in the 60–71 research band.
editor take
AdaPreLoRA solves preconditioned LoRA updates in O((m+n)r) memory; I’d check ablations before trusting “competitive” benchmarks.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces GDN’s learned write coefficient with β_t = η_t / (||k_t||² + ε), keeps the recurrent state and chunkwise parallel algorithm unchanged, and at 0.4B scale with a 1B-token budget reports 8.09 validation perplexity versus GDN’s 8.50, with stability up to 65K tokens.
#Reasoning#Inference-opt#Benchmarking#Gated DeltaNet
why featured
HKR-K and HKR-R pass: the post gives a concrete mechanism and benchmark numbers, with relevance to long-context stability. HKR-H is weak, and the paper is technical, so it stays in all.
editor take
KLA reports 8.09 perplexity at 0.4B/1B tokens; a one-scalar GDN tweak doing this much deserves replication.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
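The step-size rule quoted in the Kaczmarz Linear Attention item can be illustrated with a minimal sketch. It assumes a plain, ungated delta-rule write S ← S + β_t (v_t − S k_t) k_tᵀ; GDN's gating and the chunkwise parallel form are omitted, and the function names `kaczmarz_write` and `read` are hypothetical, not taken from the paper.

```python
def read(S, k):
    """Read-out S k for a state S (list of rows) and key k."""
    return [sum(row[j] * k[j] for j in range(len(k))) for row in S]


def kaczmarz_write(S, k, v, eta=1.0, eps=1e-6):
    """One delta-rule write with the normalized step beta = eta / (||k||^2 + eps).

    Assumption: ungated update S <- S + beta * (v - S k) k^T.
    """
    beta = eta / (sum(x * x for x in k) + eps)
    pred = read(S, k)  # what the state currently returns for key k
    return [
        [row[j] + beta * (v[i] - pred[i]) * k[j] for j in range(len(k))]
        for i, row in enumerate(S)
    ]


# Usage: with eta = 1 and eps = 0 the write is an exact Kaczmarz projection,
# so reading back with the same key recovers the value regardless of key norm.
S0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
key, value = [1.0, 2.0, 2.0], [3.0, -1.0]
S1 = kaczmarz_write(S0, key, value, eta=1.0, eps=0.0)
recovered = read(S1, key)
```

Dividing by the squared key norm makes the correction independent of key scale, which is one plausible reading of why the item pairs the change with a long-context stability claim.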
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models
The paper reports exploration collapse in post-trained LRMs and proposes Latent Exploration Decoding, which sums intermediate posteriors and selects maximum-entropy depth configurations without extra training or parameters, improving pass@1 by 0.61 points and pass@16 by 1.03 points across multiple reasoning benchmarks and models.
#Reasoning#Inference-opt#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: exploration collapse plus LED’s entropy-depth decoding gives a testable mechanism and +0.61/+1.03 pp results. HKR-R is weak; gains are small and implementation impact is not disclosed.
editor take
LED lifts pass@16 by 1.03 points; temperature sampling is failing RL post-training, and layer-aware decoding is the cleaner fix.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Why Is Prompting Hard? Understanding Prompts on Binary Sequence Predictors
The paper frames prompting as searching for the best conditioning sequence on a near-optimal sequence predictor. Across multiple controlled experiments, even exhaustive search fails to reliably identify optimal prompts for practical neural predictors, and task demonstrations can be suboptimal.
#Reasoning#Benchmarking#arXiv#Research release
why featured
HKR-H/K/R pass, but this is a single theoretical arXiv paper. The post gives controlled experiments and an exhaustive-search failure claim, without tooling, benchmark impact, or product implications.
editor take
Binary predictors make prompting look less mystical: exhaustive search still misses optima, so few-shot demos deserve less worship.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Crowding Out the Noise: Algorithmic Collective Action Under Differential Privacy
The paper analyzes how DP-SGD affects algorithmic collective action, derives lower bounds on success as a function of collective size and privacy parameters, and validates the trends by simulating deep neural network classifier training. The snippet does not disclose the exact number of datasets.
#Fine-tuning#Safety#Research release#Safety/alignment
why featured
HKR-K is concrete via a formal bound, and HKR-R connects to privacy and data leverage. The arXiv item is theoretical and lacks dataset counts or reproducible experiment details, so it stays in the mid-interest band.
editor take
The paper bounds success by collective size and DP parameters; dataset count is undisclosed. Privacy training doubles as a moat against data protests.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
The paper evaluates gradient attribution on 2 algorithmic tasks with up to 10 random seeds, finding that rank correlation drops to ρ=0.27 on sequence sorting and reaches ρ=-0.18 in individual seeds.
#Interpretability#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv interpretability paper with algorithmic tasks and limited seeds. Industry impact stays narrow, so it lands in the 60–71 band.
editor take
This hits gradient attribution on 2 toy tasks: sorting ρ=0.27, one seed ρ=-0.18; useful warning, not LLM evidence.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Voice Biomarkers for Depression and Anxiety
The paper trains a deep learning model on about 65,000 utterances from over 23,000 U.S. subjects, evaluates it on about 5,000 unique subjects, and reports 71% sensitivity and specificity for depression and anxiety detection from speech.
#Audio#Fine-tuning#Benchmarking#HuggingFace
why featured
HKR-H/K/R pass, but this is a medical voice-classification paper without product rollout, open artifact detail, or clinical deployment mechanics. It stays in the interesting research band, below featured.
editor take
The model hits 71% sensitivity and specificity on 5,000 subjects; not clinical-ready, but HuggingFace weights invite real generalization tests.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
The paper proposes the cancellation hypothesis for critic-free RL: coupled gradients cancel opposing signals on tokens shared by positive and negative rollouts, and two batching interventions, query-preserved mini-batching and reward-balanced batching, improve RLVR training across multiple model scales.
#Reasoning#Fine-tuning#Alignment#Research release
why featured
HKR-K/R pass: the paper offers a token-signal cancellation mechanism and 2 batching interventions. It is relevant to post-training, but limited source detail and dense RL framing keep it in the 60–71 band.
editor take
This paper moves critic-free RL to token-level credit: 2 batching tricks help, but model scales are undisclosed. I buy half the story.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
A PyTorch Library of Turing-Complete Neural Networks
arXiv 2605.08150 presents a PyTorch package that compiles neural networks and weights from Turing machine descriptions, with each forward pass simulating one machine step without training, and implements two architectures: a transformer construction and a recurrent network using Cantor-set stack encoding.
#Code#Tools#Reasoning#PyTorch
why featured
HKR-H and HKR-K pass: the no-training weight-compilation angle is novel, and the mechanism is concrete. HKR-R is weak because the paper is theory/tooling-heavy with limited industry impact.
editor take
This PyTorch library compiles Turing machines into weights; don’t sell it as intelligence, use it as a runnable construction benchmark.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders
The paper tests 12 Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs and finds that single-thread rankings do not predict PyTorch DataLoader throughput across worker counts {0,2,4,8}.
#Benchmarking#arXiv#Google Cloud#PyTorch
why featured
HKR-H/K/R pass, but this is a narrow ML-systems benchmark rather than a model or mainstream tooling update. No hard exclusion applies; the reproducible setup keeps it in all.
editor take
12 JPEG paths across five 16-vCPU CPUs expose bad loader benchmarks: single-thread winners fail PyTorch DataLoader reality.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Complete Evidence Extraction with Model Ensembles: A Case Study on Medical Coding
The paper defines complete evidence extraction as a task and tests Rashomon-style ensembles on a medical coding dataset with human-annotated evidence; ensembles of three equally performing language models beat the best single model on evidence recall while adding only a small token overhead.
#Interpretability#Benchmarking#Research release
why featured
A single arXiv paper with a concrete ensemble mechanism, but the use case is narrow medical coding rather than a broad model or product release. HKR-K/R pass, HKR-H misses, so it stays in all.
editor take
Three peer models raise evidence recall; in medical coding compliance, small token overhead beats single-model missed evidence.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
AllocMV models music video synthesis as a Multiple-Choice Knapsack Problem and uses dynamic programming to allocate resources across three branches: High-Gen, Mid-Gen, and Reuse.
#Multimodal#Inference-opt#AllocMV#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete allocation mechanism and targets video-generation cost. The post gives no metrics, baselines, or artifact, so it stays in the mid “all” band.
editor take
AllocMV casts MV generation as MCKP with DP; CQR numbers are undisclosed, so the engineering story outruns reproducibility.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
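The Multiple-Choice Knapsack framing in the AllocMV item can be made concrete with a small dynamic program. This is a sketch under stated assumptions, not the paper's solver: each segment must pick exactly one branch, costs are integers, and the per-branch (cost, quality) numbers below are invented for illustration. The name `solve_mckp` is hypothetical.

```python
def solve_mckp(groups, budget):
    """Pick exactly one (cost, value) option per group, maximizing total value
    with total cost <= budget. Returns (best_value, picks) or None if infeasible.
    """
    NEG = float("-inf")
    dp = [0.0] + [NEG] * budget  # dp[b]: best value at exact cost b
    back = []                    # back[g][b]: option index chosen for group g
    for options in groups:
        new = [NEG] * (budget + 1)
        arg = [None] * (budget + 1)
        for b in range(budget + 1):
            for idx, (cost, value) in enumerate(options):
                if cost <= b and dp[b - cost] != NEG:
                    cand = dp[b - cost] + value
                    if cand > new[b]:
                        new[b], arg[b] = cand, idx
        dp = new
        back.append(arg)
    best_b = max(range(budget + 1), key=lambda b: dp[b])
    if dp[best_b] == NEG:
        return None
    picks, b = [], best_b
    for options, arg in zip(reversed(groups), reversed(back)):
        idx = arg[b]
        picks.append(idx)
        b -= options[idx][0]
    picks.reverse()
    return dp[best_b], picks


# Illustrative only: option order per segment is (High-Gen, Mid-Gen, Reuse).
segments = [
    [(5, 10.0), (3, 6.0), (1, 2.0)],
    [(5, 7.0), (3, 6.0), (1, 1.0)],
    [(5, 8.0), (3, 3.0), (1, 2.5)],
]
result = solve_mckp(segments, budget=9)
```

With these made-up numbers the solver spends the expensive branch on the first segment and falls back to cheaper branches elsewhere, which is the kind of trade-off the allocation framing is meant to expose.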
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors
The paper proposes Cosine-Aware Adaptive EWC for text-to-image backdoors, using cosine-based semantic utility and adaptive scheduling to tune EWC regularization; the abstract does not disclose specific ASR, fidelity, or OOD dataset numbers.
#Safety#Fine-tuning#Research release#Safety/alignment
why featured
HKR-H/K/R all pass lightly: the security angle is real and the mechanism is specific. But metrics are not disclosed, and the technical barrier keeps it in the 60–71 research-interest band.
editor take
Cosine-Aware Adaptive EWC tunes EWC regularization; no ASR, FID, or OOD numbers disclosed, so treat it as attack tuning.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
DARE reuses attention activations in diffusion language models via DARE-KV and DARE-O, delivering up to a 1.20x per-layer speedup, reusing up to 87% of attention activations, and reporting average drops of 2.0% and 1.2% for DARE-KV and DARE-O on reasoning and code-generation benchmarks.
#Inference-opt#Reasoning#Code#arXiv
why featured
HKR-K is clear via mechanism and numbers, and HKR-R hits inference cost. The diffusion-LM inference angle is narrow and acronym-heavy, so this stays interesting but not featured.
editor take
DARE reuses up to 87% attention activations; 1.20x per-layer gain is modest, but dLLM inference gets a stackable cache primitive.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Clin-JEPA: Multi-Phase Co-Training Framework for EHR Patient Trajectory Prediction
Clin-JEPA co-trains a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor with a five-phase curriculum; on MIMIC-IV ICU data, its 48-hour rollout drift drops 15.7%, and it reaches mean AUROC 0.883 on 8 binary risk tasks.
#Embedding#Fine-tuning#Benchmarking#Qwen
why featured
HKR-K passes because the abstract gives concrete training phases, model sizes, and MIMIC-IV metrics. HKR-H/R are weak: this is a vertical clinical-ML paper, not a general model, product, or open-source framework release.
editor take
Clin-JEPA co-trains a Qwen3-8B encoder and 92M predictor in 5 phases; AUROC hits 0.883, but one ICU dataset is thin.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning
CERSA uses SVD to retain principal components holding 90% to 95% of spectral energy, then fine-tunes low-rank representations to reduce memory use for large pretrained models. Evaluations cover image recognition, text-to-image generation, and natural language understanding; the abstract does not disclose exact memory numbers or a release date.
#Fine-tuning#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the paper states a concrete SVD energy-retention method and targets fine-tuning memory. HKR-H fails, and no headline result, code, or cost delta is disclosed.
editor take
CERSA keeps 90–95% spectral energy; exact memory cuts are undisclosed, so don’t bury LoRA on abstract claims.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
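The 90% to 95% energy-retention rule in the CERSA item implies a rank-selection step, sketched below. It assumes "spectral energy" means the sum of squared singular values, which the abstract does not confirm, and it takes the singular values as given rather than computing an SVD. The name `rank_for_energy` is hypothetical.

```python
def rank_for_energy(singular_values, threshold=0.90):
    """Smallest rank whose leading singular values retain `threshold` of the energy.

    Assumption: energy is the sum of squared singular values.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for rank, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= threshold:
            return rank
    return len(energies)


# Usage with made-up singular values: a fast-decaying spectrum needs few components.
example_rank = rank_for_energy([10, 5, 2, 1], threshold=0.95)
```

How much memory this saves depends on how quickly the spectrum decays for a given weight matrix, which is exactly the number the abstract leaves out.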
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
The paper tests LLM in-context learning with a two-graph random-walk task, and PCA, residual-stream patching, and linear steering show that structure inference and induction circuits operate in parallel.
#Reasoning#Interpretability#Research release
why featured
HKR-H and HKR-K pass: the title poses a mechanism puzzle, and the summary gives dual-graph random walks plus three causal probes. HKR-R is weak because the impact stays inside interpretability research.
editor take
arXiv 2605.08405 uses two-graph random walks with causal interventions; the steering controls sell it, not the “belief” framing.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Towards Customized Multimodal Role-Play
The paper introduces Customized Multimodal Role-Play and the RoleScape-20 dataset with 20 characters, and trains UniCharacter with 10 images plus interaction examples per character within about 100 GPU hours to align persona, dialogue style, and visual identity across generated text and images.
#Multimodal#Fine-tuning#Agent#arXiv
why featured
HKR-H comes from the few-shot multimodal character-customization hook, and HKR-K has a new task, dataset, and compute condition. The audience fit is narrow, so it stays below featured.
editor take
UniCharacter needs 10 images and ~100 GPU hours per character; RoleScape-20 is too small to sell immersive agents.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
PRISM: Fast Online LLM Serving via Scheduling-Memory Co-design
PRISM combines a query-aware scheduler, QAS, with a demand-aware radix tree, DART, and reduces average per-QPS P99 TTFT by 23.3% and 37.1% on 4B and 13B models versus the strongest baseline.
#RAG#Agent#Inference-opt#PRISM
why featured
HKR-K and HKR-R pass: the paper gives a concrete mechanism and P99 TTFT gains tied to serving cost. HKR-H is weak, and a single technical arXiv systems paper stays in the 60–71 band.
editor take
PRISM cuts P99 TTFT 23.3%/37.1% on 4B/13B; RAG hot-prefix reuse finally gets scheduler-level treatment.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Selective Neuron Amplification in Transformer Language Models
The paper proposes Selective Neuron Amplification, an inference-time method that increases task-relevant neuron influence without changing model parameters; its experiments report gains mainly when the model is uncertain, with low effect when confidence is already high.
#Inference-opt#Interpretability#Reasoning#Research release
why featured
HKR-H and HKR-K pass: the paper offers a clear inference-time mechanism and a testable no-parameter-change claim. With no model names, metrics, or artifact details in the feed, it stays in the interesting research band.
editor take
SNA amplifies task-relevant neurons at inference without weight updates; smells like an activation-routing patch, with model sizes and benchmarks undisclosed.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
The paper proposes an LLM evaluation framework that combines multi-armed bandits with low-rank score predictions, using doubly robust estimators to build finite-sample confidence intervals under adaptive model selection and sampling without replacement; the abstract does not disclose the exact evaluation savings.
#Benchmarking#Research release#Benchmark
why featured
HKR-K/R pass: the method targets LLM eval sample cost and valid best-model identification. HKR-H is weak, and the post does not disclose savings ratio or experiment scale, so it stays in all.
editor take
MAB plus low-rank prediction targets LLM eval cost, but savings are undisclosed; buy the confidence intervals, not the cost story yet.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
PMCTS: Particle Monte Carlo Tree Search for Principled Parallelized Inference Time Scaling
The paper introduces Particle MCTS, a particle-based parallel MCTS algorithm for neural network evaluations, and claims it preserves formal policy improvement guarantees while outperforming heuristic baselines across domains.
#Reasoning#Inference-opt#Research release
why featured
HKR-K is concrete via particleized parallel MCTS, and HKR-R fits inference-time scaling cost/latency. HKR-H is weak and no experiment numbers, model scale, or artifact are disclosed, so this stays in 60–71.
editor take
PMCTS parallelizes MCTS, but the snippet gives no benchmark numbers; if the guarantee holds, inference scaling gets less hacky.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
LeapTS: Rethinking Time Series Forecasting as Adaptive Multi-Horizon Scheduling
LeapTS reframes time series forecasting as a dynamic scheduling process over the prediction horizon, using hierarchical control and neural controlled differential equations to improve forecasting performance by at least 7.4% and run 2.6x to 5.3x faster than representative Transformer-based models on real-world and synthetic datasets.
#Reasoning#Inference-opt#LeapTS#Research release
why featured
HKR-H and HKR-K pass: the scheduling reframing is a hook, and the abstract gives testable 7.4% and 2.6–5.3x claims. Scope is vertical forecasting research, so it stays in the 60–71 signal band.
editor take
LeapTS claims ≥7.4% accuracy gains and 2.6–5.3x faster inference; I want baselines and datasets before buying the scheduling story.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses known token-to-expert routing from RL rollout-training workflows to balance MoE training loads at micro-batch granularity, improving throughput by up to 1.6x over Megatron-LM and up to 1.2x over EPLB, while staying within 6%-10% of an idealized balanced baseline.
#Reasoning#Inference-opt#ReLibra#Megatron-LM
why featured
HKR-K and HKR-R pass: the mechanism and 1.6x throughput claim are concrete, with real MoE/RL training-efficiency value. The topic is still niche training infrastructure, so it stays in mid-band all.
editor take
ReLibra gets 1.6x over Megatron-LM by replaying known MoE routes; I buy it, RL training has unused systems slack.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Transformation-Augmented GRPO for Enhancing Exploration in Reasoning of Large Language Models
TA-GRPO expands GRPO training with meaning-preserving question rephrasings. Across four LLMs, Qwen3-1.7B gains 4.97 average pass@32 points, and Qwen3-4B gains 4.34 points on listed competition and out-of-distribution benchmarks.
#Reasoning#Fine-tuning#Benchmarking#Qwen
why featured
HKR-K is clear: TA-GRPO expands GRPO training samples via problem rewriting and reports four-LLM results, including +4.97 pass@32 on Qwen3-1.7B. HKR-R is narrow to reasoning trainers; HKR-H is weak, so it stays all.
editor take
TA-GRPO gives Qwen3-1.7B +4.97 pass@32; question rephrasing is plain, but it hits GRPO’s zero-gradient failure cleanly.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
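The editor take on TA-GRPO mentions GRPO's zero-gradient failure. A minimal sketch of that failure, assuming the commonly described group-normalized advantage (reward minus group mean, divided by group standard deviation); the use of population standard deviation and the `eps` value are assumptions, and nothing here is taken from the TA-GRPO paper itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give a usable learning signal.
mixed = group_relative_advantages([1, 0, 0, 1])

# If every rollout for a question gets the same reward, all advantages are zero
# and the question contributes no gradient.
collapsed = group_relative_advantages([1, 1, 1, 1])
```

Under this reading, adding meaning-preserving rephrasings of a question is one way to get more varied outcomes per group, so fewer groups collapse to all-equal rewards.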
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents
The thesis proposes trustworthy ML algorithms covering multiaccuracy, predictive multiplicity, LLM watermarking, and agent evaluation, with a fully LLM-driven supply-chain simulator where LLM agents outperform human teams and reduce costs by up to 67%.
#Agent#Alignment#Safety#Research release
why featured
HKR-K has concrete mechanisms and a 67% supply-chain simulation cost cut; HKR-R hits trustworthy agents and accountability. HKR-H is weak, and a single arXiv paper lacks lab authority or reproducible detail, so it stays all.
editor take
LLM supply-chain agents cut costs up to 67%, with costly tail events; skip the watermark glow, agent evaluation is the hard ledger.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding
The authors introduce HMAGAT, a directed-hypergraph attention architecture for MAPF group coordination; with 1M parameters and 100× less training data, it outperforms the current 85M-parameter learning-based SoTA model.
#Agent#Reasoning#Benchmarking#HMAGAT
why featured
HKR-H and HKR-K pass: the small-model, low-data claim is concrete. HKR-R is weak because MAPF remains a specialist path-planning topic, so it stays in all.
editor take
HMAGAT beats an 85M MAPF model with 1M parameters; hypergraph bias beats pairwise GNN scaling here.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Make Each Token Count: Improving Long-Context Performance with KV Cache Eviction
The paper introduces a global retention-based KV cache eviction method that scores cached entries with lightweight gates under one memory budget, targeting long-context language, vision-language reasoning, and multi-turn dialogue benchmarks; the RSS snippet does not disclose exact memory savings.
#Inference-opt#Reasoning#Multimodal#Research release
why featured
HKR-K and HKR-R pass: the mechanism is concrete and KV memory is a real deployment pain. No benchmark numbers, model scale, or released artifact are disclosed, so it stays in the mid all band.
editor take
Global gated KV eviction claims to beat full-cache inference, but the RSS gives no savings; I’d withhold trust until code and curves land.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
CORP: Closed-Form One-Shot Representation-Preserving Structured Pruning for Transformers
CORP prunes Transformer MLP dimensions and attention substructures in one shot using unlabeled calibration data, without gradients or fine-tuning; on DeiT-Huge, it keeps 83.27% Top-1 accuracy after pruning 50% of both MLP and attention structures.
#Inference-opt#CORP#DeiT#Research release
why featured
HKR-K and HKR-R pass: the post gives a concrete pruning setup and DeiT-Huge result, tied to inference cost. HKR-H is weak, and as a single technical arXiv compression paper it stays in 60–71.
editor take
CORP keeps DeiT-Huge at 83.27% Top-1 after 50% MLP+attention pruning; I’d test calibration-domain drift first.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Negative Ontology of True Target for Machine Learning: Evaluation and Learning under Democratic Supervision
The arXiv 2604.24824v2 paper proposes Democratic Supervision and Multiple Inaccurate True Targets for machine-learning predictive modeling, derives EL-MIATTs for evaluation and learning under the assumption that a true target does not objectively exist, and describes one real-world application in education and professional development; the post does not disclose benchmark scores or dataset sizes.
#Benchmarking#Alignment#Research release
why featured
HKR-K/R pass: the paper introduces named supervision/evaluation mechanisms and touches alignment governance. HKR-H fails, and no benchmark numbers or reproducible conditions are disclosed, so it stays in the 60–71 band.
editor take
arXiv 2604.24824v2 proposes MIATTs with no benchmark scores; I don’t buy ontology as a substitute for reproducible evals.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
From Pre-training to Downstream Performance: Does Domain-specific Pre-training Make Sense?
The paper compares CNNs and transformers across supervised and self-supervised pre-training, different initializations, and natural images, chest X-rays, chest CT, and retina OCT; it finds that downstream medical-imaging performance improves significantly only when pre-training data closely matches the target modality.
#Vision#Fine-tuning#Benchmarking#Research release
why featured
HKR-K and HKR-R pass: the paper offers a testable rule for when domain pretraining helps. It is still a single arXiv medical-imaging benchmark with limited industry spillover, so it stays in the 60–71 band.
editor take
The paper compares CNNs and transformers across pretraining setups; for medical imaging, generic backbones don’t pay unless modality matches.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception
Urban-ImageNet releases a dataset with over 2 million public Weibo image-text pairs from 61 urban sites in 24 Chinese cities across 2019-2025, plus 1K, 10K, and 100K benchmark subsets and three tasks for classification, cross-modal retrieval, and instance segmentation.
#Multimodal#Vision#Benchmarking#Urban-ImageNet
why featured
HKR-H/K pass on a concrete 2M-post dataset and reproducible benchmark tasks. HKR-R is weak because the impact stays inside urban vision research, with no model, product, or platform-competition spillover.
editor take
Urban-ImageNet ships 2M Weibo image-text pairs; China urban perception gets a benchmark, with social-media bias baked in.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
AdaPaD: Adaptive Parallel Deflation for PEFT with Self-Correcting Rank Discovery
AdaPaD trains all rank-1 components simultaneously and uses self-correcting deflation so errors converge toward zero across rounds; on Qwen3-0.6B SQuAD and SQuAD v2, it matches fixed-rank LoRA while deploying an adapter that is 30.7% smaller on average.
#Fine-tuning#Inference-opt#Benchmarking#Qwen
why featured
HKR-K/R pass: the paper states a concrete mechanism and a 30.7% adapter-size result, and it hits PEFT cost concerns. As a single arXiv methods paper with no disclosed implementation or production replacement, it stays in 60–71.
editor take
AdaPaD cuts Qwen3-0.6B SQuAD adapters by 30.7%; I buy rank discovery, pending replicated training-cost numbers.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
PoDAR uses randomized power augmentation and a latent consistency objective to separate signal power from semantic content, giving an F5-TTS generator on LibriSpeech-PC about 2x faster convergence to baseline performance, plus 0.055 higher speaker similarity and 0.22 higher UTMOS.
#Audio#Fine-tuning#PoDAR#Stable Audio
why featured
HKR-H/K pass: PoDAR gives a concrete method and testable LibriSpeech-PC gains. HKR-R is weak because the impact is confined to TTS/audio representation researchers, below featured threshold.
editor take
PoDAR gives F5-TTS ~2x faster convergence on LibriSpeech-PC; I buy the bet: audio latents need modelability, not just codec fidelity.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari
The paper studies Transformer world-model scaling on Atari 100k with fixed offline datasets from an expert policy; joint training across 26 environments stabilizes scaling with monotonic gains, and policies trained entirely inside simulated dynamics reach a 0.770 median expert-random-normalized score.
#Agent#Reasoning#Benchmarking#Research release
why featured
HKR-K passes via concrete benchmark facts: fixed offline data, 26 Atari environments, and a 0.770 score. HKR-H and HKR-R are weak, so this stays as a useful but non-featured research item.
editor take
Joint training across 26 Atari games gives monotonic scaling; a 0.770 median score says world models can cash in on fixed offline data.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R0
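The 0.770 figure in the Atari item above is a median over per-game normalized scores. A minimal sketch of that bookkeeping, assuming the usual normalization (agent minus random) divided by (expert minus random); the game names and raw returns are made up for illustration and do not come from the paper.

```python
from statistics import median


def normalized_score(agent: float, random_: float, expert: float) -> float:
    """Rescale a raw return so random play maps to 0.0 and the expert to 1.0."""
    if expert == random_:
        raise ValueError("expert and random returns must differ")
    return (agent - random_) / (expert - random_)


# Hypothetical raw returns per game: (agent, random, expert).
raw_returns = {
    "game_a": (60.0, 10.0, 110.0),
    "game_b": (8.0, 0.0, 10.0),
    "game_c": (300.0, 100.0, 900.0),
}

scores = {name: normalized_score(*triple) for name, triple in raw_returns.items()}
median_score = median(scores.values())
```

The median keeps one runaway game from dominating the aggregate, which is presumably why the summary reports it rather than a mean.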
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
The paper proposes Autonomous Preference Optimization, treating reasoning drift across multiple MLLMs as negative constraints, and releases CXR-MAX with 170,982 reasoning trajectories from seven MLLMs for chest X-ray reasoning alignment under non-stationary conditions.
#Reasoning#Alignment#Multimodal#arXiv
why featured
HKR-K is clear: APO plus 170,982 trajectories across 7 MLLMs is testable new material; HKR-R is present for alignment and evaluation teams. HKR-H is weak, and a single arXiv paper lacks product or top-lab reach, so it stays in 60–71.
editor take
APO uses 170,982 CXR traces to suppress drift; chest-X-ray wins over proprietary sources need outside replication first.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Diversity in Large Language Models under Supervised Fine-Tuning
The paper attributes reduced generation diversity after SFT to neglected low-frequency patterns and forgetting of preexisting knowledge, and proposes Tempered Focal loss; the abstract says evaluations span multiple models and benchmarks, but the RSS snippet does not disclose specific models, benchmark names, or metric values.
#Fine-tuning#Alignment#Benchmarking#Research release
why featured
HKR-K/R pass: the mechanisms and new loss are useful for SFT practitioners and speak to output collapse after tuning. Specific models, benchmarks, and metric gains are not disclosed, so it stays in the 60–71 research band.
editor take
SFT narrows diversity; the Tempered Focal loss (TOFU) targets rare patterns. RSS gives no models or metrics, so I don't buy “preserves quality” yet.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
DynaMiCS: Fine-tuning LLMs with Performance Constraints using Dynamic Mixtures
DynaMiCS formulates multi-domain fine-tuning as constrained optimization, estimates a local cross-domain slope matrix through short probing runs at each update, and solves mixture weights on the probability simplex without reference models, per-example scoring, or manually tuned weights.
#Fine-tuning#Safety#Benchmarking#DynaMiCS
why featured
HKR-K and HKR-R pass: the post gives a testable dynamic-mixture mechanism and targets regression control in multi-domain fine-tuning. No metrics, authorship signal, or product impact, so it stays in the 60–71 band.
editor take
DynaMiCS probes cross-domain slopes each step before mixing; I buy the idea, but model size and cost multiplier are undisclosed.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery
SAGE proposes three mechanisms for pathology image biomarker discovery: knowledge-graph-anchored hypothesis generation, debate-based multi-agent novelty assessment, and an automated validation pipeline. The arXiv abstract says the pipeline translates hypotheses into executable analyses on multimodal pathology datasets, but does not disclose benchmark results or clinical deployment data.
#Agent#Reasoning#Interpretability#Research release
why featured
HKR-H and HKR-K pass: SAGE applies an agent pipeline to pathology biomarker discovery and names 3 mechanisms. The medical pathology domain limits accessibility, so it lands in the 60–71 band.
editor take
SAGE offers 3 mechanisms but no results disclosed; don’t buy “clinically translatable” until benchmarks and deployment data appear.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Upholding Epistemic Agency: A Brouwerian Assertibility Constraint for Responsible AI
The paper proposes a three-status interface semantics: in high-stakes domains, AI systems assert or deny claims only with a publicly inspectable certificate, otherwise they return Undetermined.
#Alignment#Safety#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: the paper offers a verifiable-certificate constraint plus an Undetermined state for high-risk AI. HKR-H is weak, and the available facts stay at abstract level, so it fits the 60–71 band.
editor take
The paper requires Undetermined without public certificates in high-stakes AI; I like the hard gate, but deployment costs stay unspecified.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
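The three-status interface in the item above can be pictured as a small gate. This is a sketch only: the `Certificate` fields and the `answer` function are hypothetical names chosen for illustration, and the paper's actual semantics may be richer than a claim-string match.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    ASSERTED = "asserted"
    DENIED = "denied"
    UNDETERMINED = "undetermined"


@dataclass(frozen=True)
class Certificate:
    """Hypothetical stand-in for a publicly inspectable certificate."""
    claim: str
    supports: bool  # True: evidence for the claim; False: evidence against it
    reference: str  # where a reader could inspect the evidence


def answer(claim: str, certificate: Optional[Certificate]) -> Status:
    """Assert or deny only when a certificate for this exact claim is present."""
    if certificate is None or certificate.claim != claim:
        return Status.UNDETERMINED
    return Status.ASSERTED if certificate.supports else Status.DENIED
```

The point of the design is that "no certificate" never collapses into a denial: the default branch is Undetermined.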
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO partitions response tokens by prompt difficulty, answer correctness, and token entropy, then assigns group-specific objectives, outperforming the DAPO baseline by 8.6% on AIME'24 and 6.7% on AIME'25.
#Reasoning#Alignment#Benchmarking#HTPO
why featured
HKR-K is strong: the method and AIME gains are concrete. HKR-R is moderate for reasoning post-training practitioners, but HKR-H is weak and the paper is technical, so it stays in the 60–71 band.
editor take
HTPO beats DAPO by 8.6/6.7 on AIME’24/’25; token-level RLVR smells useful, but wait for code and non-math evals.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity
The paper compares a 2.7M-parameter TextCNN with 66M-parameter DistilBERT+LoRA on federated text classification and finds that under label skew alpha=0.1, DistilBERT+LoRA reaches a 50.1% worst-client accuracy gap, 56% higher than TextCNN’s 32.2%, while alpha>=0.5 reverses the pattern.
#Fine-tuning#Benchmarking#Alignment#arXiv
why featured
HKR-H/K/R pass, but this is a niche federated-learning paper rather than a broad product or model release. No deployable artifact or production replacement claim is disclosed, so it stays in 60–71.
editor take
DistilBERT+LoRA hits a 50.1% worst-client gap at alpha=0.1; FM priors can punish weak clients under extreme Non-IID.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
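The "56% higher" in the federated item above is a relative figure, not a percentage-point difference. A quick check of the arithmetic, using only the two gap values quoted in the summary; the helper name is illustrative.

```python
def relative_increase(new: float, base: float) -> float:
    """Fractional increase of `new` over `base`."""
    return (new - base) / base


distilbert_gap = 50.1  # worst-client accuracy gap in %, alpha=0.1 (from the summary)
textcnn_gap = 32.2     # same metric for TextCNN (from the summary)

absolute_diff = distilbert_gap - textcnn_gap                    # in percentage points
relative_diff = relative_increase(distilbert_gap, textcnn_gap)  # as a fraction
```

So the gap widens by about 17.9 percentage points, which is roughly 56% of TextCNN's 32.2.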
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Supervised Mixture-of-Experts for Surgical Grasping and Retraction
The paper presents a supervised MoE layer for surgical manipulation policies, where ACT learns bowel grasping and retraction from fewer than 150 demonstrations using only stereo endoscopic images.
#Robotics#Vision#Fine-tuning#arXiv
why featured
HKR-H and HKR-K pass: the surgical-robotics angle is unusual, with testable details around under 150 demos, stereo endoscopy, and ACT/MoE. HKR-R is weak because this is a vertical medical-robotics paper, not a broad AI tooling or platform story.
editor take
Supervised MoE lets ACT learn bowel retraction from under 150 demos; VLA fails even in-distribution, so surgical robotics should stop worshipping generalists.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation
PC3D trains decentralized multi-agent reinforcement learning policies for episodic roster variation, where homogeneous agents face changing team sizes across episodes and act only from local histories; across three cooperative MARL benchmarks, it reports higher returns than evaluated baselines on seen and unseen roster sizes, with ablations attributing gains to context distillation and adaptive context use.
#Agent#Reasoning#PC3D#Research release
why featured
HKR-H/K pass: the paper gives concrete variable-roster conditions and runtime constraints. HKR-R is weak; arXiv MARL is specialized, so it stays in the 60–71 band.
editor take
PC3D improves returns on 3 MARL benchmarks; clean no-comms execution, but task scale and variance are undisclosed.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
The paper compares dense FFNs, GLUs, MoE, and MoE-GLUs in one-layer Transformers trained on carry addition, modular arithmetic, and histogram counting, finding that sparse MoE routing shifts computation from FFNs to attention, with the strongest ablation-visible effect on carry-based addition.
#Interpretability#Reasoning#Research release
why featured
HKR-H/K pass: the claim is counterintuitive and the architecture comparison is testable. The evidence is still one-layer Transformers on arithmetic/counting tasks, so practical reach stays in the 60–71 band.
editor take
One-layer Transformers show random MoE routing nearly matches learned routing; park the expert story, sparsity is moving work into attention.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Alignment as Jurisprudence
The paper compares alignment with jurisprudence through Constitutional AI, case-based reasoning, Dworkin’s interpretivism, and Sunstein’s analogical legal positivism, arguing that rule interpretation and case reasoning share a structure across AI alignment and judicial decision-making.
#Alignment#Reasoning#Fine-tuning#Dworkin
why featured
HKR-H/K/R pass, but this is a conceptual alignment paper with no experiment, model release, or reproducible artifact. It fits the commentary-style safety band, so it scores 66 and stays in all.
editor take
2605.08416 frames alignment as jurisprudence; no experiments disclosed, and the legal analogy still has to survive measurement.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Laplacian Heads Improve Transformers by Smoothing Token Representations
The paper replaces a subset of attention matrices P with I-P in Transformer heads, tests the change on supervised learning, language modeling, and self-supervised tasks, and reports improved performance plus faster-decaying representation spectra that indicate stronger token smoothing.
#Reasoning#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: the I-P attention variant is a concrete mechanism across multiple tasks. No effect sizes, model scale, or reproducibility details are disclosed, so HKR-R is weak and the item stays in the mid-interest band.
editor take
Laplacian Heads swap some P for I-P and improve three task families; no gains disclosed, so treat it as a cheap architecture patch.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
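The P to I-P swap in the Laplacian Heads item above is small enough to show directly. A minimal single-head sketch in plain Python, assuming scaled dot-product self-attention with no masking or projections; which heads get swapped and how they are trained is not in the summary and is not modeled here.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(Q, K):
    """Row-stochastic matrix P = softmax(Q K^T / sqrt(d))."""
    d = len(Q[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def head_output(Q, K, V, laplacian=False):
    """Return P @ V for a standard head, or (I - P) @ V for a Laplacian head.

    Assumes self-attention, so P is square (len(Q) == len(K) == len(V)).
    """
    P = attention_matrix(Q, K)
    n = len(P)
    if laplacian:
        M = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(n)] for i in range(n)]
    else:
        M = P
    return [[sum(M[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
            for i in range(n)]


Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
standard = head_output(Q, K, V)
laplacian = head_output(Q, K, V, laplacian=True)
```

Because each row of P sums to 1, each row of I-P sums to 0, and the two outputs add back up to V: the Laplacian head returns each token minus its attention-weighted average.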
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate
ConfSMoE adds a two-stage missing-modality imputation module and a confidence-guided gate to sparse MoE, then evaluates resistance to missing modalities on four real-world datasets under three experiment settings.
#Multimodal#Inference-opt#Benchmarking#ConfSMoE
why featured
HKR-K and HKR-R pass: the mechanism and evaluation setup are concrete, and missing-modality robustness matters. As a single arXiv architecture paper with no product, code, or broad debate hook, it stays in the 60–71 band.
editor take
ConfSMoE tests 4 datasets across 3 settings; confidence gating without load-balance loss is the reusable bit here.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Sequential Membership Inference Attacks
arXiv:2602.16596v2 proposes Sequential Membership Inference attacks that insert a target canary at a controlled step and audit the full model sequence, with white-box gradient access or black-box loss access against models trained with (DP-)SGD; the post reports higher power than snapshot-independent baselines but does not disclose dataset counts in the RSS snippet.
#Safety#Benchmarking#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: the paper offers a new attack mechanism and targets model privacy risk. HKR-H is weak, and dataset counts, success rates, and model scope are not disclosed, keeping it in the 60–71 band.
editor take
SeMI audits full model sequences via controlled canaries; dataset counts are undisclosed, but final-snapshot privacy checks look stale.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve mean probing F1 by 10–30 points across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, cut reconstruction MSE by 25–50%, and recover 3–13× more semantically coherent latents than standard crosscoders under an LLM-as-a-judge evaluation.
#Interpretability#Benchmarking#GPT2-Small#Pythia
why featured
HKR-K is strong and HKR-R is moderate: the paper gives testable cross-layer feature-discovery gains. HKR-H is weak, and the method is too technical without product or agent impact, so it stays in all.
editor take
fmxcoders add 10–30 probing F1 points on four small LLMs; standard crosscoders look brittle for cross-layer features.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics
The paper pretrains 14M to 1B parameter models for 300B tokens and compares three curricula against random ordering, finding that curricula mainly change time spent in shared latent phases while smaller models show more stable gradients.
#Reasoning#Benchmarking#arXiv#Research release
why featured
HKR-K and HKR-R pass: the scale and setup are concrete, and the claim targets curriculum learning’s value for pretraining efficiency. HKR-H is weak, so this stays in the 60–71 band.
editor take
14M–1B models ran 300B tokens; curricula changed phase timing, not phases. Don’t oversell small-model stability as a pretraining law.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
BEACON releases about 430 GB of synchronized multimodal data from 79 Valorant sessions across 28 players, totaling 102.51 hours of active gameplay, and provides the dataset and code on Hugging Face and GitHub for continuous authentication and behavioral fingerprinting benchmarks.
#Multimodal#Benchmarking#BEACON#Valorant
why featured
HKR-H and HKR-K pass: BEACON provides an open dataset, code, and concrete scale numbers. The impact stays research-dataset narrow, so it sits below the 72 featured threshold.
editor take
BEACON ships 102.51 hours from 28 Valorant players; useful as an auth benchmark, thin for broad behavioral claims.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations
LaWM replaces unconstrained transition predictors with a learned Lagrangian action functional, using a latent variational integrator over consecutive visual latent states to produce long-horizon rollouts under a discrete variational principle.
#Robotics#Vision#Reasoning#LaWM
why featured
HKR-H and HKR-K pass: the item has a concrete least-action world-model mechanism. No benchmark numbers, code, or product path are disclosed, and the technical bar keeps it in the 60–71 band.
editor take
LaWM advances visual latents with a variational integrator; no metrics disclosed, but physics priors are creeping back into world models.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens for DeepSeek-OCR-Large with 84.25% token retention and reports 99.47% accuracy plus 1.23× faster prefill on OmniDocBench using a two-stage high-norm selection and optimal-transport merging method.
#Vision#Inference-opt#Benchmarking#DeepSeek
why featured
HKR-K/R pass: the paper gives concrete metrics and targets OCR inference cost. HKR-H is weak, and the work is a niche inference-optimization paper rather than a product or industry-level update.
editor take
RTPrune keeps 84.25% tokens for 1.23× prefill; OCR pruning finally gets a DeepSeek-OCR-specific recipe.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Optimal Attention Temperature Improves ICL Robustness under High-Dimensional Distribution Shift
The paper derives a closed-form ICL generalization error for high-dimensional linear regression under distribution shift and gives an explicit optimal attention temperature, then validates gains on GPT-2 and Llama2-7B question-answering benchmarks with noisy in-context demonstrations.
#Reasoning#Inference-opt#Benchmarking#GPT-2
why featured
HKR-K/R pass: the paper offers a closed-form error, a temperature mechanism, and GPT-2/Llama2-7B checks, but no effect size or easy reproduction is disclosed; theory density keeps it in all.
editor take
The paper derives closed-form ICL error and optimal temperature; I buy the theory, but GPT-2/Llama2-7B gains are undisclosed.
HKR breakdown
hook knowledge resonance
open source
65
SCORE
H0·K1·R1
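The summary above does not give the closed-form optimal temperature, so the sketch below only shows what the knob does: it assumes the common convention of dividing attention scores by a temperature before the softmax, and the scores are made-up numbers.

```python
import math


def attention_weights(scores, temperature):
    """Softmax over scores / temperature (assumed convention, not the paper's formula)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(weights):
    """Shannon entropy in nats; higher means attention is spread more evenly."""
    return -sum(w * math.log(w) for w in weights if w > 0)


scores = [2.0, 1.0, 0.1]  # hypothetical query-key scores for three demonstrations
sharp = attention_weights(scores, temperature=0.25)
flat = attention_weights(scores, temperature=4.0)
```

A low temperature concentrates weight on the top-scoring demonstration, while a high one averages across them, which is the trade-off a noisy-demonstration setting would have to balance.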
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SACHI: Structured Agent Coordination via Holistic Information Integration in Multi-Agent Reinforcement Learning
SACHI uses graph transformer convolutions over an inter-agent coordination graph before action selection, and the paper evaluates it on 5 cooperative tasks against 12 baselines, reporting that it matches or outperforms the best baseline on every task.
#Agent#Reasoning#Benchmarking#SACHI
why featured
HKR-K is solid via the mechanism and 5-task/12-baseline evaluation; HKR-R fits multi-agent reliability concerns. HKR-H is weak, and the MARL paper lacks product or open-source traction, so it stays in 60–71.
editor take
SACHI beats 12 baselines on 5 cooperative tasks; I’d check code first, since MARL papers often win inside their own task zoo.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
COSAC: Counterfactual Credit Assignment in Sequential Cooperative Teams
COSAC uses one ridge regression to decompose team rewards and policy forward passes for counterfactual advantages, reporting lower advantage MSE on sequential bandits up to K=16 and faster convergence than critic-free baselines on ARC with four Qwen3-0.6B agents.
#Agent#Reasoning#Robotics#Qwen
why featured
HKR-K/R pass: the mechanism and test settings are concrete, and the topic maps to multi-agent credit-assignment pain. HKR-H is weak; this is a niche arXiv method paper without product or open-source impact.
editor take
COSAC wins on K=16 bandits and four Qwen3-0.6B ARC agents; I haven’t seen large-team LLM evidence, so don’t oversell it.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias
R2PO uses a two-stage Search-LLM and Critic-LLM policy search loop with trajectory-level rollout evidence, and across 10 environments a 20B open-weight model achieves the highest mean best reward while reaching near-maximum CartPole reward within about 500 episodes.
#Agent#Reasoning#Benchmarking#R2PO
why featured
HKR-K passes because the mechanism and experiment numbers are concrete for agent/RL readers. HKR-H and HKR-R are weak, and a single arXiv paper without broad pickup stays in the lower all band.
editor take
R2PO tops mean best reward across 10 environments with a 20B open model; the useful bit is 76.6% CartPole regressions traced to critic salience bias.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Communicating Sound Through Natural Language
The paper introduces lexical acoustic coding, where pre-trained LLM sender and receiver agents transmit short sounds using only one English lexical sentence, a shared vocabulary, and optional symbolic music structure under fixed system prompts.
#Audio#Agent#Research release
why featured
HKR-H/K pass: the title has a counterintuitive experiment hook, and the summary gives the lexical acoustic coding setup. HKR-R fails; no product, benchmark, or artifact is disclosed, so it sits in the 60–71 research band.
editor take
LAC sends short audio through one English sentence; I don’t buy the romance until rate and fidelity ceilings are shown.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Mixture of Layers with Hybrid Attention
The paper introduces Mixture of Layers, replacing full-width Transformer blocks with K parallel thin blocks, using top-k block routing and hybrid attention to address token coverage when sparse routing scales to many blocks.
#Reasoning#Inference-opt#Research release
why featured
HKR-K and HKR-R pass: the mechanism is concrete and targets Transformer compute cost. With only abstract-level detail and no benchmarks, code, or production claim, it stays in the 60–71 research-signal band.
editor take
MoL swaps full-width layers for K thin routed blocks; shared softmax plus DeltaNet is the bet, not MoE magic.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Tabular Foundation Model for Generative Modelling
TabFORGE uses a causality-aware feature encoder and a two-stage diffusion design to generate tabular data, and the paper evaluates it against 22 benchmark methods on 45 real-world datasets.
#Fine-tuning#Benchmarking#TabFORGE#arXiv
why featured
HKR-H and HKR-K pass, but this is a narrow arXiv tabular-generation paper. The post gives mechanisms and benchmark scale, not open-source release, production replacement, or adoption evidence, so it stays in the 60–71 band.
editor take
TabFORGE reports 22 baselines across 45 datasets; I’d check privacy leakage and small-table performance before buying structural fidelity.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning
FLAME proposes a fixed-capacity MoE framework for continual multimodal multi-task learning, using modality-specific routers and low-rank memory subspaces to handle sequential tasks, with validation on multiple healthcare multimodal benchmarks.
#Multimodal#Fine-tuning#Memory#FLAME
why featured
HKR-K passes: the post names fixed-capacity MoE, routing, memory mechanisms, and medical multimodal benchmarks. HKR-H/R are weak, so this stays in the 60–71 research band.
editor take
FLAME keeps MoE capacity fixed and only expands routers; healthcare-only validation makes the open-domain claim hard to trust.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
TeleResilienceBench: Quantifying Resilience for LLM Reasoning in Telecommunications
TeleResilienceBench tests error-recovery reasoning across seven telecom sub-domains and eight models, using midpoint-truncated flawed traces from a weak generator; the strongest model reaches only 29.1% macro-average CFR, while Nemotron-3-nano 4b leads the auxiliary TeleMath numerical evaluation at 23.4% CR%.
#Reasoning#Benchmarking#GSMA#Qwen
why featured
HKR-K is solid with a new benchmark and concrete results, and HKR-R ties to vertical-domain reliability. HKR-H is weak, and the telecom scope keeps it in the 60–71 research-benchmark band.
editor take
TeleResilienceBench tests 8 models; top CFR is 29.1%. In telco agent chains, recovery beats raw accuracy as the failure signal.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings
MapFormer updates positional encodings with input-dependent matrices and was tested on gating, 2D navigation, and Dyck language tasks; the paper reports near-perfect OOD generalization where standard models fail, plus perplexity gains on naturalistic data.
#Reasoning#Memory#Benchmarking#MapFormer
why featured
MapFormer hits HKR-H/K with an input-dependent positional-embedding mechanism and near-perfect OOD-generalization claim, but evidence is limited to gating, 2D navigation, and Dyck language tasks; no major lab, artifact, or product path is disclosed.
editor take
MapFormer updates positional encodings with input-dependent matrices; near-perfect OOD is a big claim, but baselines, scale, and ablations are undisclosed.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Black-Box Detection of LLM-Generated Text Using Generalized Jensen-Shannon Divergence
The paper proposes SurpMark, a black-box detector that uses token-surprisal state transitions and a generalized Jensen-Shannon gap to distinguish human from machine text; the RSS abstract says it matches or exceeds baselines across datasets and generators, but does not disclose dataset counts or metric values.
#Benchmarking#Safety#SurpMark#Research release
why featured
HKR-K/R pass: SurpMark offers a concrete black-box detection mechanism and targets AI-text authenticity. Kept in 60–71 because dataset counts, metrics, and comparisons are not disclosed.
editor take
SurpMark uses surprisal-transition matrices for black-box detection; dataset counts and metrics are undisclosed, so robustness stays unproven.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
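For readers unfamiliar with the divergence named in the SurpMark item above, here is the standard weighted generalization of Jensen-Shannon divergence: the entropy of the mixture minus the weighted mean of the entropies. Whether SurpMark uses exactly this form, and how it builds distributions from surprisal-state transitions, is not stated in the summary; the inputs below are toy distributions.

```python
import math


def shannon_entropy(p):
    """Entropy in nats of a discrete distribution given as a list of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)


def generalized_jsd(distributions, weights):
    """JS_w(P_1..P_n) = H(sum_i w_i P_i) - sum_i w_i H(P_i), with weights summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    size = len(distributions[0])
    mixture = [
        sum(w * dist[i] for w, dist in zip(weights, distributions))
        for i in range(size)
    ]
    weighted_entropy = sum(w * shannon_entropy(d) for w, d in zip(weights, distributions))
    return shannon_entropy(mixture) - weighted_entropy


same = generalized_jsd([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
disjoint = generalized_jsd([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Identical distributions give zero and fully disjoint ones give ln 2 for equal weights, so a "gap" between two such values is bounded and comparable across texts.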
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Hierarchical Reinforced Trader (HRT): A Bi-Level Approach for Optimizing Stock Selection and Execution
HRT splits equity trading into an HLC for sparse asset directions and an LLC for risk-aware weight adjustments, testing on 89 Nasdaq stocks with 2013–2018 training, 2019 validation, and 2020–2023 out-of-sample data; Sharpe rises from 1.06 for HRT-Base to 1.24, while daily turnover falls from 0.112 to 0.090.
#Agent#Reasoning#Benchmarking#arXiv
why featured
HKR-H and HKR-K pass: the AI-trader angle is clickable and the post gives mechanism plus backtest numbers. Scope stays in quant-finance research, with no code artifact, production claim, or major lab tie, so it remains all-tier.
editor take
HRT lifts Sharpe from 1.06 to 1.24 on 89 Nasdaq stocks; I’m not sold, since one 2020–2023 slice is fragile.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
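To put the HRT numbers above in proportion, the sketch below computes the relative changes from the quoted figures and shows one common way a Sharpe ratio is obtained from daily returns. The zero risk-free rate, the 252-day annualization, and the sample returns are assumptions; the summary does not say how the paper computes Sharpe.

```python
import math
from statistics import mean, stdev


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample std of daily returns, scaled by sqrt(periods); risk-free rate taken as 0."""
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns")
    spread = stdev(daily_returns)
    if spread == 0:
        raise ValueError("returns have zero variance")
    return mean(daily_returns) / spread * math.sqrt(periods_per_year)


def relative_change(new, base):
    return (new - base) / base


sharpe_change = relative_change(1.24, 1.06)      # figures quoted in the summary
turnover_change = relative_change(0.090, 0.112)  # figures quoted in the summary

toy_returns = [0.01, -0.005, 0.02, 0.0]  # hypothetical daily returns
toy_sharpe = annualized_sharpe(toy_returns)
```

That is roughly a 17% Sharpe gain alongside a 20% turnover cut, both measured on a single out-of-sample window.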
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Concordia: Self-Improving Synthetic Tables for Federated LLMs
Concordia proposes a tri-level optimization framework for federated LLM adaptation on tabular tasks: clients train LoRA adapters on synthetic tables, learn utility scorers from private validation feedback, and update local generators with GRPO without sharing raw records or validation data.
#Fine-tuning#Agent#Safety#Concordia
why featured
HKR-K and HKR-R pass: the method is specific and relevant to private-data adaptation. No metrics, artifact, or major-lab signal are disclosed, and the topic stays narrow, so this remains all.
editor take
Concordia stacks LoRA, private scorers, and GRPO for federated tables; no gains disclosed, so I’d treat it as mechanism-first.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
TongjiFinLab proposes FinTSB, a financial time series forecasting benchmark that covers 4 stock movement pattern categories, standardizes metrics across 3 evaluation dimensions, and tests models under regulatory constraints including transaction fees.
#Benchmarking#TongjiFinLab#FinTSB#Research release
why featured
HKR-K passes: FinTSB adds concrete financial time-series evaluation dimensions and trading-fee constraints. HKR-H and HKR-R are weak; this is a vertical research benchmark, not a broad model or toolchain update, so it sits in the 60–71 band.
editor take
FinTSB covers 4 pattern classes and 3 metric dimensions; adding fees makes finance forecasting less toy-benchmark cosplay.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
A Cross-Layered Multi-Drone Coordination for Medical Supply Delivery during Disaster Response Management
The paper presents CEDA, a CTDE Deep Q-Network algorithm for cooperative multi-drone medical delivery under hazards, energy limits, and triage deadlines; in grid simulation it reaches over 85% delivery completion, cuts obstacle collisions by more than 90% during training, averages 6 patients per episode, and is validated in PX4 SITL with two X500 quadrotors.
#Robotics#Agent#Reasoning#Research release
why featured
HKR-H and HKR-K pass: the scenario is concrete and includes completion, collision, and SITL details. The audience fit stays narrow, so it lands in all rather than featured.
editor take
CEDA tops 85% completion in simulation, but PX4 tests only two X500s; disaster medicine claims outrun the scale evidence.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
A Real-Calibrated Synthetic-First Data Engine
The paper presents a synthetic-first data engine that combines controllable diffusion generation, multi-stage filtering, optional uncertainty-driven selection, and human verification, with evaluation centered on human pose estimation; the abstract says synthetic augmentation improves a real-data baseline with real anchors, but it does not disclose dataset sizes.
#Vision#Research release
why featured
HKR-K lands via the synthetic-data pipeline mechanics, and HKR-R lands on vision data costs. HKR-H is weak, with no disclosed dataset size or standout metric, so this stays in the 60–71 band.
editor take
Human pose is the testbed; dataset sizes aren’t disclosed. The useful bit is admitting synthetic-only still trails real-only.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
DP-LAC estimates the initial clipping threshold with private histogram estimation, then adapts it during training without extra privacy budget or new hyperparameters, reporting a 6.6% average accuracy gain over state-of-the-art adaptive clipping methods and vanilla DP-SGD.
#Fine-tuning#Safety#Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete mechanism and a 6.6% gain, tied to private fine-tuning tradeoffs. HKR-H is weak, and a single technical arXiv method sits in the 60–71 interesting band.
editor take
DP-LAC reports +6.6% accuracy with no extra privacy budget; I want epsilon, task mix, and model scale before buying it.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
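For context on what a clipping threshold does in differentially private training, here is a minimal sketch of the vanilla DP-SGD step that DP-LAC is compared against: clip each per-sample gradient to a fixed norm, sum, add Gaussian noise scaled by that norm, and average. It does not implement DP-LAC's histogram-based initial estimate or its adaptation rule, which the summary does not detail. The function names and toy gradients are hypothetical.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a gradient down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Vanilla DP-SGD aggregation with a fixed clipping threshold."""
    clipped = [clip_gradient(g, clip_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


# Toy gradients: the first exceeds the threshold, the second does not.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
noiseless = private_mean_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = private_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(noiseless, noisy)
```

A threshold set too low throws away gradient signal, while one set too high inflates the noise, which is why the choice of threshold matters for accuracy.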
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
The paper proposes validation-induced flow for targeted data selection, scoring candidates after a short capacity-limited warmup with normalized endpoint loss drop and requiring no candidate gradients or Hessian approximations.
#Fine-tuning#Benchmarking#Research release
why featured
HKR-K lands on a concrete mechanism; HKR-R is weaker but relevant to fine-tuning data cost. With no reported gains, code, or major-lab signal, this stays in the 60–71 single-paper band.
editor take
TAP scores samples via short validation warmup; zero-order selection skips candidate gradients, and reusable trajectories are the sell.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
The paper introduces SPACE, a closed-form concept erasure method that iteratively modifies cross-attention parameters in text-to-image diffusion models, reaches 80%-90% cross-attention sparsity, and reduces storage for modified parameters by 70%.
#Vision#Safety#Inference-opt#Stable Diffusion 1.5
why featured
HKR-K and HKR-R pass: the paper gives a concrete mechanism and numbers, tied to diffusion-model safety control. Single arXiv paper, narrow hook, and limited product impact keep it in 60–71.
editor take
SPACE hits 80%-90% cross-attention sparsity on SDXL; concept erasure is starting to look like patch distribution, not retraining.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
VC-Soup filters low-consistency preference pairs using cosine similarity between each reward-gap vector and an all-ones vector, then linearly combines policy models and applies Pareto filtering across values; the arXiv abstract claims experiments and theory show better multi-value alignment than reward reweighting, prompt-based SFT, and model merging, but the snippet does not disclose datasets or model sizes.
#Alignment#Fine-tuning#Research release#Safety/alignment
why featured
HKR-K/R pass: the mechanism is specific and alignment is relevant. HKR-H fails, and the post gives no metrics, model scale, or reproducible results, so this sits in the 60–71 band.
editor take
VC-Soup filters preference pairs by cosine consistency; datasets and model sizes are missing, so treat it as a cheap multi-value DPO recipe.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
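The consistency filter described for VC-Soup above can be sketched as follows. This is a minimal illustration, assuming each preference pair carries one reward gap per value dimension; the `gaps` field, the 0.5 threshold, and the toy pairs are hypothetical, and the policy-merging and Pareto-filtering stages are not shown.

```python
import math


def value_consistency(reward_gaps):
    """Cosine similarity between a reward-gap vector and the all-ones vector."""
    norm = math.sqrt(sum(g * g for g in reward_gaps))
    if norm == 0.0:
        return 0.0
    ones_norm = math.sqrt(len(reward_gaps))
    # The dot product with the all-ones vector is just the sum of the gaps.
    return sum(reward_gaps) / (norm * ones_norm)


def filter_pairs(pairs, threshold=0.5):
    """Keep pairs whose gaps point the same way across values (threshold is hypothetical)."""
    return [p for p in pairs if value_consistency(p["gaps"]) >= threshold]


# Toy pairs: "b" is preferred under one value and dispreferred under another.
pairs = [
    {"id": "a", "gaps": [1.0, 1.0, 1.0]},
    {"id": "b", "gaps": [1.0, -1.0, 0.0]},
    {"id": "c", "gaps": [2.0, 0.5, 1.0]},
]
kept = filter_pairs(pairs)
print([p["id"] for p in kept])
```

A score near 1 means every value agrees on which response is better by a similar margin; a score near 0 or below means the values conflict.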
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LILO: Bayesian Optimization with Natural Language Feedback
LILO translates a decision maker’s free-form language feedback into structured preferences and feeds them into a Gaussian-process proxy model for Bayesian optimization; across synthetic and real-world benchmarks, the paper reports stronger results than conventional preference-based BO methods and LLM-only optimizers, especially when feedback is limited.
#Reasoning#Tools#Benchmarking#LILO
why featured
HKR-H and HKR-K pass: the hook is natural-language feedback for BO, and the summary gives a GP-surrogate mechanism plus benchmark wins. It stays niche research with limited disclosed detail, so it remains all.
editor take
LILO routes free-text feedback into GP-based BO. In low-feedback regimes, the paper reports that it beats preference BO and LLM-only search.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models
The paper derives an entropy-minimization objective for test-time adaptation in autoregressive models and evaluates it on Whisper ASR across more than 20 domains, including acoustic noise, accents, and multilingual settings.
#Audio#Fine-tuning#Reasoning#Whisper
why featured
HKR-K is solid: a new TTA objective plus Whisper tests across 20+ noisy, accented, multilingual domains. HKR-R is narrow to ASR robustness teams, and the technical framing keeps it in 60–71.
editor take
They test Whisper across 20+ domains; the useful bit is turning TTA from heuristic patches into a derivable objective.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
FairHealth: An Open-Source Python Library for Trustworthy Healthcare AI in Low-Resource Settings
FairHealth publishes an open-source Python library for healthcare AI in low-resource settings, with 6 modules covering federated learning, intersectional fairness metrics, explainability, dengue triage, disaster aid allocation, and public dataset loaders.
#Fine-tuning#Alignment#Interpretability#FairHealth
why featured
HKR-K is solid: 6 modules and low-resource healthcare use cases are explicit. HKR-H comes from the dengue/disaster mix, but no benchmarks, adopters, or production claims keep it in all.
editor take
FairHealth ships 6 modules; I worry this pip package turns fairness, FL, and triage into a demo menu.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Kinetic Theory for Transformers and the Lost-in-the-Middle Phenomenon
The paper studies causal self-attention as a toy decoder Transformer model, proves a quantitative mean-field limit, and derives a U-shaped token retrieval profile under iid uniformly distributed tokens and an explicit smallness condition.
#Reasoning#Interpretability#Research release
why featured
HKR-H/K/R all pass, but this is a theory-heavy arXiv paper built on mean-field analysis and a toy causal self-attention model. Technical-accessibility limits it to the 60–71 band.
editor take
The paper proves U-shaped retrieval for toy causal attention under iid uniform tokens and a smallness condition; don’t extrapolate to GPT-5-class models.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Fairness of Explanations in AI: A Unifying Framework, Axioms, and Future Direction
The arXiv paper proposes a conditional invariance framework for explanation fairness in AI, mapping a blind spot where fair outputs still rely on unfair reasoning, and provides a 7-dimensional taxonomy, 3 mechanisms of explanation inequity, and a 6-step workflow for explanation fairness audits.
#Interpretability#Alignment#Safety#Research release
why featured
A single arXiv framework paper clears HKR-K/R with concrete taxonomy and audit mechanics, but misses HKR-H and lacks experiments, tooling, or industry uptake; it fits the 60–71 research-signal band.
editor take
This pins explanation fairness to conditional invariance: 7 axes, 3 mechanisms, 6 audit steps; I buy the problem, not post-hoc certification.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Max-pooling Network for Semantic Probability Analysis in Multiple Instance Learning Hallucination Detection
The paper analyzes HaMI through decision margins and proposes max pooling over token-level internal features with a lightweight MLP, removing repeated sampling and semantic similarity computation; the abstract does not disclose specific datasets, latency figures, or accuracy numbers.
#Reasoning#Benchmarking#HaMI#Research release
why featured
HKR-K is present via the max-pooling mechanism, and HKR-R via hallucination reliability. HKR-H is weak, and the abstract lacks datasets, latency, or accuracy numbers, so this stays in all.
editor take
Max pooling replaces HaMI semantic consistency; datasets and latency are undisclosed, so I’d file this as compute-saving until numbers land.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Large Language Models over Networks: Collaborative Intelligence under Resource Constraints
The paper proposes task-level collaboration among distributed LLMs across devices and cloud endpoints under compute, memory, communication, and cost constraints. It defines two composable dimensions—vertical device-cloud collaboration and horizontal multi-agent collaboration—and lists open problems in routing-policy training, cooperative capabilities, resource-heterogeneous scaling, and trustworthy collaborative intelligence.
#Agent#Inference-opt#Tools#Research release
why featured
HKR-K/R pass, but the post only gives a framework and open problems, with no metrics, code, or reproducible system. It belongs in all, below featured.
editor take
arXiv 2605.08626 folds device-cloud and multi-agent collaboration together; no experiments disclosed, so this reads like a routing agenda.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Yiding Song and Hanming Ye study grokking on modular arithmetic in a 23-page arXiv paper. They model capacity effects with two measured timescales, memorisation speed T_mem(P) and generalisation speed T_gen(P), and report grokking near the parameter scale where the two timescales intersect.
#Reasoning#Benchmarking#Interpretability#Yiding Song
why featured
HKR-H and HKR-K pass: the hook is capacity controlling grokking, with T_mem(P)/T_gen(P) as the mechanism in a 23-page paper. The modular-arithmetic setting limits practitioner impact, so it stays in the 60–71 band.
editor take
Song and Ye reduce grokking to 2 timescales; clean on modular arithmetic, thin until it survives real-task extrapolation.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Weakly Supervised Concept Learning for Object-centric Visual Reasoning
The paper introduces a weakly supervised perception scheme that combines a slot-based architecture with a VAE, translates predictions into symbolic background knowledge, and reports outperforming state-of-the-art foundation-model baselines in domain generalization with 1% label supervision.
#Reasoning#Vision#Research release#Benchmark
why featured
HKR-K passes with a testable claim: slot+VAE with 1% label supervision beats foundation-model baselines. HKR-H and HKR-R are weak, so this stays in the 60–71 research-interest band.
editor take
Slot+VAE hits symbolic reasoning with 1% labels; I’d audit dataset difficulty before calling this a vision-reasoning win.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation
Hun Chang and coauthors propose HAE, combining Directional Feature Alignment, Hierarchical Convolutional Patch Embedding, and Riemannian Flow Matching to train a DiT on a spherical latent manifold, reporting gFID 1.96, rFID 0.78, and PSNR 25.2 dB.
#Vision#Multimodal#Benchmarking#Hun Chang
why featured
HKR-K passes with concrete HAE mechanisms plus gFID 1.96, rFID 0.78, and PSNR 25.2 dB. HKR-H/R are weak; this is a single vision-architecture paper, useful but below featured.
editor take
HAE reports gFID 1.96 and rFID 0.78; spherical latents look clean, but convergence claims need code-backed replication.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Bilinear autoencoders find interpretable manifolds
The paper implements quadratic latents with bilinear autoencoders, decomposes activations into low-rank quadratic forms, and reports systematic reconstruction-error improvements in language models under the tested settings.
#Interpretability#Qwen#Research release
why featured
HKR-K passes because the mechanism is concrete: bilinear autoencoders with quadratic latents. HKR-H/R are weak, and the article lacks model list, experiment scale, and error numbers, so it stays in all.
editor take
Bilinear autoencoders cut reconstruction error on Qwen 3.5; I buy low-rank quadratics, not the linear-hypothesis takedown.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
The paper proposes ACE, a training-free decoding framework for MLLMs. It perturbs visual context with counter-commonsense patches, suppresses perturbation-sensitive linguistic priors, and compensates stable visual signals; the abstract claims negligible inference overhead but does not disclose benchmark names or numeric gains.
#Multimodal#Vision#Inference-opt#Research release
why featured
HKR-H/K/R pass, but the evidence is thin: no benchmark numbers are disclosed and the impact remains research-facing, so it stays in the 60–71 interesting-but-not-featured band.
editor take
ACE adds training-free counter-commonsense patch decoding; benchmarks and gains are undisclosed, so I file it with VCD-style tricks.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels
KernelBenchX evaluates LLM-generated Triton kernels across 176 tasks in 15 categories, finding that task category explains 9.4% of correctness deviance versus 3.3% for method choice, while quantization remains unsolved with 0/30 successful cases.
#Code#Benchmarking#Inference-opt#KernelBenchX
why featured
HKR-K/R pass: the paper gives concrete benchmark numbers and reliability limits for LLM-generated Triton kernels. Technical-accessibility penalty applies because GPU-kernel evaluation is narrow, so this stays in all.
editor take
KernelBenchX tests 176 Triton tasks; 46.6% of correct kernels are slower than PyTorch eager, so compile-rate bragging is noise.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers
LLM-FE formulates tabular feature engineering as program search, where LLMs iteratively propose feature transformation programs and data-driven validation feedback guides evolutionary search across classification and regression benchmarks.
#Reasoning#Code#LLM-FE#Research release
why featured
HKR-H and HKR-K pass: the angle and mechanism are concrete. The post gives no benchmark gains, dataset count, or artifact details, and it is not a major-lab release, so it stays in the 60–71 band.
editor take
LLM-FE frames feature engineering as program search; benchmark count and lift are undisclosed, so don’t crown LLM+evolution yet.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SEMASIA: A Large-Scale Dataset of Semantically Structured Latent Representations
SEMASIA collects latent representations from about 1,700 pretrained vision models across eight image-classification benchmarks. The dataset pairs embeddings with structured metadata on architectures, training regimes, pretraining sources, and model scale. The paper uses it to study latent geometry, supervised alignment mappings, and regression links between training factors and embedding properties.
#Vision#Embedding#Interpretability#SEMASIA
why featured
HKR-K passes because SEMASIA discloses concrete dataset scale and metadata. HKR-H/R are weak: the angle is academic, with little product impact or practitioner identity tension.
editor take
SEMASIA ships embeddings from ~1,700 vision models; metadata quality decides whether this is science or an embedding zoo.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning
The paper tests depth pruning across three LLM families, two calibration objectives, and seven search algorithms, finding that calibration objectives shape redundant-layer choices more than the specific search algorithm under fixed objectives.
#Inference-opt#Reasoning#Benchmarking#Research release
why featured
HKR-K is solid: the paper gives a testable depth-pruning setup across model families, objectives, and search methods. HKR-R is moderate via inference cost, but HKR-H is weak, so it stays in all.
editor take
The paper tests 3 LLM families and 7 searches: pruning choices follow calibration goals, not universal layer-importance lore.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
DeepLog: A Software Framework for Modular Neurosymbolic AI
DeepLog unifies logic and deep learning inside standard PyTorch workflows, compiling diverse neurosymbolic languages into optimized arithmetic circuits; the arXiv abstract says the code is available on GitHub, but it does not disclose benchmarks or performance numbers.
#Reasoning#Tools#Code#DeepLog
why featured
HKR-K passes via a concrete compiler mechanism, PyTorch integration, and open code. HKR-H/R are weak; neurosymbolic arithmetic-circuit tooling is niche, so this sits in the 60–71 band.
editor take
DeepLog plugs into PyTorch and ships code; no benchmarks disclosed, so treat “universal backend” as a claim to test.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
NoiseRater assigns importance scores to individual noise samples with bilevel optimization, reweights diffusion training on FFHQ and ImageNet, and releases anonymous code; the abstract does not disclose exact metric gains or compute cost.
#Fine-tuning#Inference-opt#NoiseRater#FFHQ
why featured
HKR-H and HKR-K pass: the mechanism is concrete, with FFHQ/ImageNet and anonymous code. HKR-R is weak because this is specialized diffusion-training research, so it stays in all.
editor take
NoiseRater reweights noise on FFHQ and ImageNet; no gains or compute disclosed, so don’t treat bilevel as free lunch.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Decoding Islamophobic Discourse: Using LLMs to Identify Tropes and Semi-Coded Hate Speech
The paper analyzes five semi-coded anti-Muslim terms from 4Chan, Gab, Telegram, and similar platforms, using LLMs, Google Perspective API, and BERT topic modeling to test semantic understanding, toxicity scoring, and topic distribution.
#Safety#Benchmarking#Google#4Chan
why featured
HKR-H/K/R pass at a modest level: the coded-hate angle, named platforms, and safety relevance give signal. No result numbers or reproducible details are disclosed, so it stays in the lower interesting band.
editor take
The paper tests only five coded Islamophobic terms; I don’t buy “LLMs understand OOV slurs” without disclosed models, prompts, and labels.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Adaptive Action Chunking via Multi-Chunk Q Value Estimation
ACH estimates Q-values for all candidate action chunk lengths in one Transformer forward pass, then selects the chunk length by state during training and inference; the paper evaluates it on 34 tasks against fixed-length baselines.
#Robotics#Reasoning#Benchmarking#Research release
why featured
HKR-K passes through a concrete mechanism and 34-task evaluation; HKR-H and HKR-R are weak. This is useful robotics research, but specialized, so it stays in the 60–71 band.
editor take
ACH picks action-chunk length in one forward pass across 34 tasks; I buy the setup, but no gain numbers are disclosed.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
ReplaySCM introduces a 1,300-item benchmark where systems output causal mechanism maps in a restricted Boolean DSL, and scoring checks replay behavior on training and held-out intervention worlds rather than matching formula strings.
#Reasoning#Benchmarking#ReplaySCM#Research release
why featured
HKR-K passes: 1,300 binary-world tasks and a Boolean DSL give reproducible evaluation details. HKR-H and HKR-R are weak because causal mechanism induction is narrow, so this fits all rather than featured.
editor take
ReplaySCM tests Boolean causal replay on 1,300 tasks; hidden order tanks frontier LLMs, a harsher failure than local causal QA.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
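The replay-based scoring idea described for ReplaySCM above can be illustrated with a small sketch: two mechanism maps count as equivalent if they produce the same variable values in every intervention world, regardless of how the formulas are written. Everything here is hypothetical (the variable names, the Python lambdas standing in for the restricted Boolean DSL, the world enumeration), and the evaluation order is assumed known, which the benchmark may not grant.

```python
from itertools import product


def replay(mechanisms, order, exogenous, intervention):
    """Evaluate variables in order, overriding any that are intervened on."""
    values = dict(exogenous)
    for var in order:
        if var in intervention:
            values[var] = intervention[var]
        else:
            values[var] = mechanisms[var](values)
    return values


def replay_accuracy(candidate, truth, order, worlds):
    """Fraction of worlds where the candidate reproduces the true values."""
    matches = sum(
        replay(candidate, order, exo, do) == replay(truth, order, exo, do)
        for exo, do in worlds
    )
    return matches / len(worlds)


order = ["c", "d"]
truth = {"c": lambda v: v["a"] and v["b"], "d": lambda v: not v["c"]}
# Same behavior as truth, written differently (De Morgan).
rewritten = {"c": lambda v: not ((not v["a"]) or (not v["b"])), "d": lambda v: not v["c"]}
# Different behavior: OR instead of AND.
wrong = {"c": lambda v: v["a"] or v["b"], "d": lambda v: not v["c"]}

interventions = [{}, {"c": True}, {"c": False}]
worlds = [
    ({"a": a, "b": b}, do)
    for a, b in product((False, True), repeat=2)
    for do in interventions
]

score_rewritten = replay_accuracy(rewritten, truth, order, worlds)
score_wrong = replay_accuracy(wrong, truth, order, worlds)
print(score_rewritten, score_wrong)
```

String matching would mark the rewritten formula as wrong even though it behaves identically, which is the failure mode replay-based scoring avoids.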
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
EchoAlign: Bridging Generative and Discriminative Learning under Noisy Labels
EchoAlign modifies instance features with EchoMod and filters original samples with EchoSelect, outperforming state-of-the-art methods on three benchmark datasets in most evaluated settings; under 30% instance-dependent noise, EchoSelect retains nearly twice as many correctly labeled samples as competing methods while maintaining 99% selection accuracy.
#Fine-tuning#Benchmarking#EchoAlign#Research release
why featured
HKR-K is strong and HKR-R is moderate: EchoSelect keeps nearly 2x correct-label samples at 30% instance-dependent noise with 99% selection accuracy. The work is niche noisy-label research, with no product or major-model impact, so it stays all.
editor take
EchoAlign wins most settings on 3 benchmarks; editing samples toward noisy labels works, but I’d audit generator leakage first.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
HoReN: Normalized Hopfield Retrieval for Large-Scale Sequential Model Editing
HoReN wraps a single MLP layer with a discrete key-value codebook for parameter-preserving model editing, and on ZsRE it scales to 50K sequential edits while keeping overall performance above 0.9.
#Memory#Fine-tuning#RAG#HoReN
why featured
HKR-K passes with a testable mechanism and a 50K sequential-edit claim. HKR-H and HKR-R are weak because this is a niche arXiv model-editing paper, so it fits all rather than featured.
editor take
HoReN hits 50K ZsRE edits above 0.9; I'd reproduce routing false positives before buying the long-term memory claim.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche trains an input-space residual adapter through a frozen tabular foundation model, then uses an identity guard to skip harmful adaptation; on 51 TabArena-Lite datasets, TabICLv2-Retouche raises aggregate Elo by 56 over frozen TabICLv2.
#Fine-tuning#Benchmarking#TFM-Retouche#TabICLv2
why featured
HKR-K passes via a concrete adapter mechanism and 51-dataset Elo result. HKR-H and HKR-R are weak because the work is niche tabular-ML research, so it stays in the 60–71 band.
editor take
TFM-Retouche gives TabICLv2 +56 Elo on 51 TabArena-Lite datasets; for tabular models, input residuals look cheaper than LoRA plumbing.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
RelBench v2: A Large-Scale Benchmark and Repository for Relational Data
RelBench v2 expands the RDL benchmark to 11 datasets with over 22 million rows across 29 tables, adding autocomplete tasks that require models to infer missing table attributes under temporal constraints.
#Benchmarking#RelBench#Temporal Graph Benchmark#ReDeLEx
why featured
HKR-K passes with concrete benchmark scale and task conditions. HKR-H/R are weak: this is a niche research benchmark update, with no hard-exclusion trigger.
editor take
RelBench v2 hits 11 datasets and 22M rows. Temporal autocomplete makes it a less toy-ish test than CSV-style tabular benchmarks.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Fitting Multilinear Polynomials for Logic Gate Networks
The paper maps each 2-input Boolean gate to a 4-coefficient multilinear polynomial, reducing each neuron from 16 parameters to 4; across seven datasets, at least one 4-parameter method matches or exceeds Soft-Mix on every dataset.
#Reasoning#Inference-opt#Benchmarking#arXiv
why featured
HKR-K is strong and HKR-R is moderate: the 16-to-4 parameter cut and 7-dataset result are testable and cost-relevant. HKR-H is weak, and the niche research angle keeps it below featured.
editor take
CovJac drops 0.5pp at 12 layers on CIFAR-10; Soft-Mix drops 37.3pp. This smells like parameterization failure, not capacity.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
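The gate-to-polynomial mapping mentioned above rests on a standard identity: any 2-input Boolean function equals a multilinear polynomial c0 + c1·a + c2·b + c3·a·b on {0, 1} inputs. The sketch below derives the four coefficients from a truth table and checks all 16 gates. The function names are hypothetical, and how the paper actually fits or trains these coefficients is not shown.

```python
from itertools import product


def gate_coefficients(truth_table):
    """Coefficients (c0, c1, c2, c3) for a truth table (f00, f01, f10, f11)."""
    f00, f01, f10, f11 = truth_table
    return (f00, f10 - f00, f01 - f00, f11 - f10 - f01 + f00)


def evaluate(coeffs, a, b):
    """Evaluate c0 + c1*a + c2*b + c3*a*b; also defined for soft inputs in [0, 1]."""
    c0, c1, c2, c3 = coeffs
    return c0 + c1 * a + c2 * b + c3 * a * b


# Check the identity on all 16 two-input gates.
all_gates = list(product((0, 1), repeat=4))
for table in all_gates:
    coeffs = gate_coefficients(table)
    for idx, (a, b) in enumerate(product((0, 1), repeat=2)):
        assert evaluate(coeffs, a, b) == table[idx]

print(gate_coefficients((0, 0, 0, 1)))  # AND → (0, 0, 0, 1)
print(gate_coefficients((0, 1, 1, 0)))  # XOR → (0, 1, 1, -2)
```

Four coefficients per neuron are enough to represent any of the 16 gates exactly, which is where the 16-to-4 parameter reduction comes from.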
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding
The authors built a RAG pipeline with Qwen3-Embedding-8B, a fine-tuned Qwen3-Reranker-8B, and Qwen3-32B for Ukrainian multi-domain PDF QA, raising Recall@1 from 0.6957 to 0.7935 with reranking and reaching 0.9598 on the private leaderboard.
#RAG#Embedding#Fine-tuning#Qwen
why featured
HKR-H/K/R pass, but this is a single arXiv benchmark-style RAG setup with narrow multilingual retrieval impact. No hard exclusion; it fits the 60–71 interesting-but-not-featured band.
editor take
Qwen3-Reranker-8B lifts Recall@1 from 0.6957 to 0.7935; for Ukrainian PDF QA, fancy post-processing loses to reranking.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R1
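The Recall@1 comparison quoted above (before and after reranking) can be sketched like this. It is a toy illustration, assuming one gold document per query; the document IDs and reranker scores are made up, and the dictionary lookup stands in for a real reranker model.

```python
def recall_at_k(ranked_lists, gold, k=1):
    """Share of queries whose gold document appears in the top k results."""
    hits = sum(1 for qid, docs in ranked_lists.items() if gold[qid] in docs[:k])
    return hits / len(ranked_lists)


def rerank(ranked_lists, scores):
    """Reorder each query's candidates by a (query, doc) relevance score."""
    return {
        qid: sorted(docs, key=lambda d: scores[(qid, d)], reverse=True)
        for qid, docs in ranked_lists.items()
    }


# First-stage retrieval order per query (hypothetical data).
first_stage = {"q1": ["d1", "d2"], "q2": ["d4", "d3"], "q3": ["d6", "d5"]}
gold = {"q1": "d1", "q2": "d3", "q3": "d5"}
# Hypothetical reranker scores; it fixes q2 but not q3.
scores = {
    ("q1", "d1"): 0.9, ("q1", "d2"): 0.2,
    ("q2", "d3"): 0.8, ("q2", "d4"): 0.3,
    ("q3", "d5"): 0.4, ("q3", "d6"): 0.6,
}

before = recall_at_k(first_stage, gold)
after = recall_at_k(rerank(first_stage, scores), gold)
print(before, after)
```

Reranking only reorders what the first stage retrieved, so it can raise Recall@1 only for queries where the gold document is already in the candidate list.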
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
BaLoRA: Bayesian Low-Rank Adaptation of Large Scale Models
BaLoRA changes LoRA matrices to an input-adaptive Bayesian parameterization with minimal added parameters and compute, and the paper reports improved accuracy plus calibrated uncertainty estimates across natural language reasoning, vision tasks, and metal-organic framework band gap prediction.
#Fine-tuning#Reasoning#Vision#BaLoRA
why featured
HKR-K and HKR-R pass: the paper offers a concrete LoRA parameterization and cross-task tests. No improvement numbers, product path, or open-source artifact are disclosed, so it stays in the mid-low research band.
editor take
BaLoRA adds input-adaptive Bayesian LoRA matrices; no benchmark numbers disclosed, but PEFT finally gets a serious uncertainty story.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
The paper introduces a fixed-contract diagnostic for KV cache compression selectors; on LongBench across three models and two budgets, its value-ranking probe is positive in 72.6% of positive-margin cells and 32.4% of nonpositive-margin cells.
#Inference-opt#Benchmarking#arXiv#LongBench
why featured
HKR-K is present via the diagnostic method and 72.6% result; HKR-R is present through inference cost. HKR-H is weak, and the infra-research angle is useful but too niche for featured.
editor take
Fixed-contract diagnostics cover 264 cells, and the value-ranking probe is positive in 72.6% of positive-margin cells; KV compression papers need failure localization, not LongBench score theater.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
The paper compares neural CQA models with a training-free query relaxation strategy across multiple datasets and query structures, and finds no neural model consistently outperforms the relaxation baseline.
#Reasoning#RAG#Benchmarking#Research release
why featured
HKR-H/K/R pass, but the item is a narrow neural complex-query-answering paper with only the high-level benchmark claim disclosed. Limited product or agent/RAG implications keep it in the 60–71 band.
editor take
Neural CQA fails to beat a training-free relaxation baseline consistently. KG reasoning papers without strong symbolic baselines now smell under-benchmarked.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
MESD: A Risk-Sensitive Metric for Explanation Fairness Across Intersectional Subgroups
The paper introduces MESD, a procedural fairness metric for explanation quality across intersectional subgroups. MESD combines label-aware aggregation, empirical-Bayes shrinkage, and CVaR weighting, then integrates with UEF and NSGA-II to optimize utility, outcome fairness, and procedural fairness. It is evaluated on three benchmark datasets against four state-of-the-art methods.
#Interpretability#Safety#Benchmarking#Research release
why featured
HKR-K passes with a named metric, component count, benchmark count, and optimization setup. HKR-R is modest because fairness links to bias governance, but the academic framing keeps it in the 60-71 research-signal band.
editor take
MESD scores explanation gaps across intersectional groups with 3 components; I buy the problem, not the compliance leap.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reasoning emerges from constrained inference manifolds in large language models
The paper studies LLM inference-time representation dynamics and proposes a three-condition structural regime plus a label-free diagnostic computed from internal dynamics; the abstract does not disclose the model list, datasets, or quantitative results.
#Reasoning#Interpretability#Benchmarking#Research release
why featured
HKR-K passes: the paper proposes a mechanism for reasoning representations and an unlabeled diagnostic. HKR-H/R are weak because the abstract gives no models, datasets, or quantitative results.
editor take
The abstract gives a three-condition regime and a label-free diagnostic, but no models or datasets; label-free reasoning metrics tempt, but geometry stories need evidence.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation
The paper tests raw CSD cosine on a 1,799-artwork, 91-artist corpus and finds negative pairwise discrimination gaps for 23/91 artists; CSLS on the frozen backbone cuts aggregated negative gaps to 4/91 and raises AUC from 0.883 to 0.905 with 336-pixel positional interpolation.
#Vision#Benchmarking#CSD#CLIP
why featured
HKR-H and HKR-K pass: the title has a metric-failure hook and the summary gives testable sample counts plus a CSLS improvement. The topic is narrow vision evaluation, so HKR-R misses and the score stays in the 60-71 band.
editor take
Raw CSD cosine fails on 23/91 artists; CSLS cuts it to 4, so absolute style scores are shaky for shared traditions.
HKR breakdown
hook knowledge resonance
open source
63
SCORE
H1·K1·R0
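A minimal sketch of the CSLS correction mentioned in the item above, assuming the standard definition (twice the cosine minus each item's mean cosine to its k nearest neighbours) applied within a single embedding set. The function names, the choice of k, and the toy vectors are illustrative assumptions, not the paper's setup.

```python
import math

def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def csls_matrix(embs, k=1):
    """CSLS(i, j) = 2*cos(i, j) - r(i) - r(j), where r(i) is the mean
    cosine of item i to its k nearest neighbours (self excluded).
    Hypothetical helper: the neighbourhood is the same set on both sides."""
    n = len(embs)
    cos = [[cosine(embs[i], embs[j]) for j in range(n)] for i in range(n)]
    r = []
    for i in range(n):
        nearest = sorted((cos[i][j] for j in range(n) if j != i), reverse=True)[:k]
        r.append(sum(nearest) / len(nearest))
    return [[2 * cos[i][j] - r[i] - r[j] for j in range(n)] for i in range(n)]

# Toy example: items 0 and 2 are identical, item 1 is orthogonal to both.
scores = csls_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], k=1)
```

The subtraction penalizes items that sit close to many others, which is why a raw cosine of 1.0 between items 0 and 2 is discounted once their dense neighbourhood is accounted for.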
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
AlphaExploitem: Going Beyond the Nash Equilibrium in Poker by Learning to Exploit Suboptimal Play
AlphaExploitem extends AlphaHoldem with a hierarchical transformer encoder for previously played hands and trains against a diverse pool of exploitable opponents. It is evaluated on two imperfect-information game benchmarks; the abstract reports exploitation of weak in-distribution and out-of-distribution play without loss against Nash-equilibrium opponents.
#Agent#Reasoning#Benchmarking#AlphaExploitem
why featured
HKR-H comes from exploiting suboptimal poker play beyond Nash equilibrium, and HKR-K has a concrete mechanism: hierarchical Transformer history encoding on 2 benchmarks. HKR-R is weak because product impact is not shown.
editor take
AlphaExploitem tests exploitation on 2 imperfect-information benchmarks; I buy the direction, but no win rates means no Poker AlphaZero victory lap.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Dynamic Linear Coregionalization for Realistic Synthetic Multivariate Time Series
DynLMC adds time-varying, regime-switching correlations and cross-channel lag structures to synthetic multivariate time series generation, and fine-tuning three time-series foundation models on DynLMC data improves zero-shot forecasting across nine benchmarks.
#Fine-tuning#Benchmarking#DynLMC#arXiv
why featured
HKR-K passes via concrete mechanisms and a 9-benchmark setup. HKR-H/R are weak because this is specialized time-series synthetic-data research, useful but not broad enough for featured.
editor take
Fine-tuning 3 time-series foundation models on DynLMC data improves zero-shot forecasting on 9 benchmarks; I buy the dynamic-correlation bet, but effect sizes are undisclosed.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer research into 7 groups, spanning AdamW, memory-efficient variants, curvature-aware methods, low-rank approaches, and matrix-based optimizers such as Muon, and it argues that benchmarks should report convergence, stability, memory overhead, wall-clock efficiency, token efficiency, and implementation complexity together.
#Fine-tuning#Inference-opt#Benchmarking#Research release
why featured
HKR-K is solid: the 7 optimizer classes and six benchmark dimensions add usable structure. HKR-R is narrow to training-infra readers, and the numerical-optimization topic keeps it in the 60-71 band rather than featured.
editor take
This survey splits LLM optimizers into 7 buckets; AdamW-to-Muon claims now need memory, stability, and wall-clock receipts.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation
The paper introduces Causal Parametric Drift Simulation, using Structural Causal Models as digital twins of data-generating processes, and tests classifier robustness under drift on the OSMH dataset while preserving structural dependencies.
#Benchmarking#Safety#OSMH#Research release
why featured
HKR-K and HKR-R pass: the method targets classifier drift robustness with production relevance. The post gives only the framework, OSMH dataset, and stress-test setup, with no result numbers or artifact details, so it stays in the low 60s.
editor take
Causal Parametric Drift Simulation is tested only on OSMH; the idea is right, but robustness claims need cross-domain replication.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features
The paper tests nine pruning feature classes across four sparsity levels on ViT-Small/CIFAR-10 and proposes SICS: κ=0 suffices for S<0.65, κ=1 dominates near S≈0.7, and κ=2 is required for S>0.75.
#Inference-opt#Benchmarking#Research release#Benchmark
why featured
HKR-K is strong and HKR-R is moderate: SICS gives testable sparsity thresholds, but validation is limited to ViT-Small/CIFAR-10. No hard exclusion; this fits the 60–71 band.
editor take
On ViT-Small/CIFAR-10, non-monotone features gain 6.6% at S=0.7; one model/dataset cannot carry a pruning law.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Distributionally Robust Token Optimization in RLHF
The paper proposes DRTO, combining token-level RLHF with DRO by building f-divergence ambiguity sets over span-level actor losses, and reports gains over standard RTO of 4.4 percentage points on MATH-500 and 2.7 percentage points on LiveCodeBench.
#Reasoning#Alignment#Benchmarking#Research release
why featured
HKR-K passes with a concrete method and MATH-500 gain. HKR-H and HKR-R are weak, and the RLHF/DRO focus is too specialized for featured.
editor take
DRTO beats RTO by 4.4 points on MATH-500; I buy DRO on span loss, not the prompt-robustness claim yet.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Generating Synthetic EHR Data Using Agent-Based Models to Evaluate ML Robustness Under Mass Casualty Incidents
The authors use an emergency-department agent-based model to generate synthetic EHR data under mass-casualty-incident conditions. Length-of-stay prediction models show consistent recall declines versus baseline conditions, increasing missed patients with prolonged stays; the abstract does not disclose dataset size, model classes, or recall values.
#Agent#Benchmarking#Research release
why featured
HKR-H and HKR-K pass: the mass-casualty setting is a concrete hook, and the abstract gives a testable ABM synthetic-EHR setup with recall degradation. HKR-R is weak because this is vertical healthcare ML research without product or agent implications.
editor take
ABM generates synthetic MCI EHRs; recall drops versus baseline. No sample size or values disclosed, so trust the stress test, not deployment claims.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
MIDUS: Memory-Infused Depth Up-Scaling
MIDUS replaces duplicated FFN branches in Depth Up-Scaling with memory layers. HML assigns each attention head a distinct key space, and HIVE derives head-specific values from a shared latent bank; the RSS abstract does not disclose model sizes, benchmark names, or numeric results.
#Memory#Inference-opt#Reasoning#Research release
why featured
HKR-K passes via concrete mechanisms: memory layers replacing DUS-copied FFNs and per-head key spaces. HKR-H/R miss because no benchmark, scale, or practitioner-facing impact is disclosed, so it sits in the low-60 research-release band.
editor take
MIDUS swaps duplicated DUS FFNs for HML memory layers, with no numbers disclosed; I’d treat it as a structural bet.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
The paper proposes an adaptive DNN partitioning framework that profiles models at startup, measures network links, and re-evaluates partitions periodically; on a Raspberry Pi, laptop, and desktop PC testbed with VGG16, AlexNet, and MobileNetV2, it reduces energy by 27.09–35.82% and end-to-end latency by 6.34–22.92% versus static partitioning.
#Inference-opt#arXiv#Raspberry Pi#Research release
why featured
HKR-K is supported by concrete testbeds and energy/latency numbers; HKR-R comes from edge-inference cost pressure. The systems-optimization scope is niche, with no product or major-model impact.
editor take
Adaptive partitioning cut energy 27.09–35.82% on a 3-node testbed. Nice systems result; CNN-only eval limits LLM relevance.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
The paper introduces MS-FLOW, a sparse-bottleneck framework that replaces fully connected cross-variable communication with selective sparse routing under a strict communication budget, and reports state-of-the-art multivariate forecasting accuracy on 12 real-world benchmarks while producing fewer dependency paths.
#Benchmarking#MS-FLOW#arXiv#Research release
why featured
HKR-K passes via a concrete mechanism and 12-benchmark claim. HKR-H and HKR-R are weak because this is a niche forecasting paper with limited product or industry spillover.
editor take
MS-FLOW reports SOTA on 12 real benchmarks; sparse routing for spurious-correlation control is plausible, but budget and ablations are undisclosed.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
PGID: Progressive Guided Inversion and Denoising for Robust Watermark Detection
The paper proposes PGID, a training-free noise extraction framework that uses progressive inversion-denoising cycles to project perturbed latents back to their original regions, defending semantic watermark detection against both watermark removal and forgery attacks.
#Vision#Safety#PGID#arXiv
why featured
HKR-K and HKR-R pass: the mechanism is concrete and watermark attacks matter. Metrics, datasets, and attack conditions are not disclosed, and HKR-H fails due to a specialist paper title.
editor take
PGID claims training-free defense against removal and forgery; no metrics disclosed, so treat it as a patch for inversion-based watermarking.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Efficient Estimation of Kernel Surrogate Models for Task Attribution
The paper introduces kernel surrogate models for task attribution and estimates them with a gradient-based procedure using a first-order approximation of pretrained models, avoiding repeated retraining; experiments on transformer math reasoning, in-context learning, and multi-objective reinforcement learning report under 2% relative error, 25% higher correlation with leave-one-out ground truth than linear surrogates and influence-function baselines, and 40% improvement in downstream data selection.
#Agent#Reasoning#Benchmarking#Research release
why featured
HKR-K passes with a concrete mechanism and error/correlation claims. HKR-H and HKR-R are weak: the title is academic, and task attribution is too narrow for broad practitioner resonance, so this stays in the 60-71 all band.
editor take
Kernel attribution reports under 2% error; I buy the nonlinear-interaction angle, but the 40% data-selection gain needs code.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
The paper proposes Group Cognition Learning, a two-stage agent collaboration protocol after modality-specific encoding, and reports state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec across regression and classification benchmarks.
#Agent#Multimodal#Benchmarking#Research release
why featured
HKR-K passes with a named mechanism and three benchmarks. HKR-H is weakened by slogan-like framing, and HKR-R stays narrow to affect/intent benchmarks, so this fits the 60–71 research-release band.
editor take
GCL reports SOTA on 3 multimodal benchmarks; I have doubts, since the RSS gives no gains, variance, or ablations.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
The paper proposes MiTA, using a small set of landmark queries to collect top-k key-value pairs as reusable routed experts. The abstract reports vision-task experiments and open-source code, but the post does not disclose concrete speedup ratios, model sizes, or benchmark numbers.
#Inference-opt#Vision#Research release#Open source
why featured
HKR-K passes with a testable attention mechanism and open code. HKR-H/R are weak because speedups, scale, and production impact are not disclosed, so this stays in the lower research-release band.
editor take
MiTA reuses top-k KV routes via landmark queries; no speedup ratios or model sizes are disclosed, so I buy the method, not the efficiency claim.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning
The paper evaluates pixel-based deep reinforcement learning on Procgen-HD. Impoola replaces Impala’s spatial flattening with global average pooling, decoupling parameter count from input resolution, and the paper reports a 28% performance gain over Impala under each model’s best conditions.
#Vision#Robotics#Benchmarking#arXiv
why featured
The story earns HKR-K via Procgen-HD, Impoola's global-average-pooling swap, and a 28% gain over Impala. HKR-H/R stay weak because this is a niche deep-RL paper, so it remains in all.
editor take
Impoola beats Impala by 28% on Procgen-HD best settings; low-res pixel RL now looks like inherited laziness.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
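A rough illustration of why swapping spatial flattening for global average pooling decouples parameter count from input resolution, as the item above describes. The channel count, hidden width, and feature-map sizes are hypothetical and are not taken from the Impala or Impoola architectures.

```python
def global_average_pool(fmap):
    """Reduce a feature map indexed [channel][row][col] to one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def head_params_flatten(channels, height, width, hidden):
    """Weights plus biases of a dense layer fed by a flattened C*H*W map."""
    return channels * height * width * hidden + hidden

def head_params_gap(channels, hidden):
    """Weights plus biases of a dense layer fed by C pooled channel means."""
    return channels * hidden + hidden

# Hypothetical sizes: 32 channels, 256 hidden units.
# Doubling the feature-map side quadruples the flatten head's weights,
# while the pooled head stays the same size.
low_res = head_params_flatten(32, 8, 8, 256)
high_res = head_params_flatten(32, 16, 16, 256)
pooled = head_params_gap(32, 256)
```

Under these assumed sizes the flatten head grows from 524,544 to 2,097,408 parameters when the feature map goes from 8x8 to 16x16, while the pooled head stays at 8,448.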
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Privacy-Preserving Distributed Learning in IoT Systems: A Unified Threat Model and Evaluation Framework
The paper introduces a unified threat model for IoT distributed learning covering four attack types, then compares five privacy-preserving method families by privacy robustness, computation, memory, and communication overhead.
#Fine-tuning#Safety#Research release
why featured
HKR-K has concrete framework numbers and HKR-R touches edge/IoT privacy risk, but HKR-H is weak. This is useful academic synthesis, not a model, product, or reproducible tool release, so it sits in the 60–71 all band.
editor take
The paper covers 4 IoT distributed-learning attacks; I don't buy the unified-framework novelty, but Bloom Filter overhead is actionable.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models
Bi-CoG assigns pseudo-labels using inter-model and intra-model consistency, plus an error-aware dynamic strategy; the paper reports consistent gains for semi-supervised fine-tuning across 14 datasets.
#Vision#Multimodal#Fine-tuning#Research release
why featured
HKR-K is clear: bi-consistency pseudo-labeling plus results on 14 datasets. HKR-R applies for VLM fine-tuning cost, but the arXiv method paper is incremental and technical, so it stays in 60–71.
editor take
Bi-CoG reports gains on 14 VLM semi-supervised datasets; no effect sizes in the snippet, so treat it as pseudo-label threshold removal.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
APEX: Audio Prototype Explanations for Classification Tasks
APEX interprets pre-trained audio classifiers without fine-tuning the original backbone, preserving output invariance and separating explanations into four prototype views: square-based, time-based, frequency-based, and time-frequency-based.
#Audio#Interpretability#APEX#Research release
why featured
HKR-K passes: APEX proposes audio classifier explanations without backbone fine-tuning and uses four prototype types. HKR-H and HKR-R are weak, so this is a niche research item for all, not featured.
editor take
APEX keeps audio classifier outputs invariant with 4 prototype views; no benchmark numbers disclosed, so I don’t buy the gradient-method claim yet.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
RelFlexformer applies arbitrary integrable modulation functions to universal 3D relative positional encodings, giving L-length sequence attention O(L log L) complexity and extending efficient RPE attention from homogeneous grids to arbitrarily distributed 3D tokens, including point clouds.
#Vision#Inference-opt#RelFlexformer#Research release
why featured
HKR-K passes on the O(L log L) attention mechanism, but HKR-H and HKR-R are weak because the story is niche and jargon-heavy. No product, code, or broad deployment hook is disclosed.
editor take
RelFlexformer claims O(L log L) 3D RPE attention; the missing piece is benchmark scale versus sparse attention on nonuniform point clouds.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Breaking the Grid: Distance-Guided Reinforcement Learning in Large Discrete Action Spaces
The paper proposes DGRL for discrete action spaces with up to 10^20 actions, gives local value improvement guarantees on structured tasks, and reports gains of up to 66% over state-of-the-art baselines across regular and irregular environments.
#Reasoning#Benchmarking#Research release#Benchmark
why featured
HKR-K is solid: 10^20 actions, a local value-improvement guarantee, and a benchmark lift of up to 66% are testable claims. HKR-H/R are weak; this is a niche RL paper, so it stays in all rather than featured.
editor take
DGRL claims 10^20 discrete actions; I want reproducible tasks before trusting the 66% SOTA gain.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reward-Conditioned Reinforcement Learning
RCRL collects experience under one nominal objective, recomputes counterfactual rewards from shared replay data, and trains agents across multiple reward parameterizations without extra environment interaction; the abstract reports gains on single-task, multi-task, and vision-based benchmarks, but does not disclose numeric scores or benchmark names.
#Agent#Reasoning#Vision#Research release
why featured
HKR-K passes: RCRL recomputes counterfactual rewards from shared replay for multiple objectives. HKR-H and HKR-R are weak; no scale, benchmark gain, or code is disclosed, so this stays in all.
editor take
RCRL reuses one-objective trajectories via counterfactual rewards; no scores or benchmark names, so I file it as replay reuse.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LEPO: Latent Reasoning Policy Optimization for Large Language Models
LEPO injects controllable stochasticity into latent reasoning with Gumbel-Softmax, keeps stochastic sampling during rollouts, and estimates unified gradients for continuous latent representations and discrete tokens; the abstract says experiments outperform existing discrete and latent RL methods, but it does not disclose benchmark names or scores.
#Reasoning#Fine-tuning#Research release
why featured
HKR-K passes because the summary gives LEPO’s latent-reasoning optimization mechanism. HKR-H and HKR-R are weak, and no benchmark numbers, model scale, or results are disclosed, so it fits the 60–71 research band.
editor take
LEPO keeps multi-trajectory rollouts via Gumbel-Softmax; benchmarks and scores are undisclosed, so latent RL is not a proven shortcut yet.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
The paper introduces Sparse MIXture of Autoencoders, which directly learns known manifold structures and finds new structures inside open-source Gemma 2 2B and 9B models.
#Interpretability#Gemma#Research release
why featured
HKR-K passes via SMIXAE plus Gemma 2 2B/9B experiments. HKR-H and HKR-R are weak, and the technical paper angle keeps it in all below the featured threshold.
editor take
SMIXAE finds manifolds in Gemma 2 2B/9B; SAE direction-tiling is a weak stopping point for interpretability.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Study Diagnoses and Mitigates Domain Shift in Permission-Based Android Malware Detection
The paper tests permission-based Android malware detection with PerMalDroid, NATICUSdroid, and five ensemble classifiers. In-domain accuracy exceeds 92%, NATICUSdroid-to-PerMalDroid transfer drops to 73%, and hybrid training reaches 88% on PerMalDroid while keeping 97% on NATICUSdroid.
#Benchmarking#Interpretability#PerMalDroid#NATICUSdroid
why featured
HKR-K and HKR-R pass: the paper quantifies Android malware domain shift across PerMalDroid and NATICUSdroid, dropping from >92% to 73%. It is useful but niche security benchmarking, below featured threshold.
editor take
NATICUSdroid-to-PerMalDroid falls to 73%; permission malware detection is losing to dataset artifacts, not feature scarcity.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
The Procrustean Bed of Time Series: The Optimization Bias in Point-wise Loss Functions
The paper defines EOB as a KL divergence for point-wise loss bias, derives Gaussian and mixture lower bounds, and reports 5.2%/5.0% average MSE/MAE reductions on iTransformer forecasting across 11 datasets.
#Benchmarking#arXiv#iTransformer#Research release
why featured
HKR-K passes via the EOB metric and 5.2%/5.0% gains across 11 datasets. HKR-H/R are weak because time-series loss optimization is narrow, so this fits the 60-71 band.
editor take
EOB cuts iTransformer forecasting MSE by 5.2% across 11 datasets. I buy the framing, but one backbone is thin proof.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Finding Connections: Membership Inference Attacks for the Multi-Table Synthetic Data Setting
The paper proposes MT-MIA, a No-Box membership inference attack that uses heterogeneous graph neural networks to target user-level representations in multi-table synthetic relational data; the post does not disclose the number of datasets or leakage metrics.
#Safety#Benchmarking#arXiv#Research release
why featured
HKR-K has a concrete mechanism and HKR-R hits synthetic-data privacy risk. The post gives the method and threat model only, with no datasets or leakage results, so it stays in the lower research-update band.
editor take
MT-MIA attacks multi-table synthetic data with hetero-GNNs; no leakage metrics disclosed, so the privacy claim still sits at abstract level.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Assessing Trustworthiness of AI Training Dataset Using Subjective Logic: A Bias Use Case
The paper introduces a formal Subjective Logic framework for assessing AI training dataset trustworthiness and evaluates bias on a traffic sign recognition dataset, testing class imbalance under centralized and federated conditions while quantifying uncertainty when evidence is incomplete, distributed, or conflicting.
#Alignment#Safety#Benchmarking#Research release
why featured
HKR-K/R pass: the paper offers a concrete mechanism and bias use case tied to data governance. HKR-H is weak, and as a single arXiv methods paper without tooling or production proof, it stays in the 60-71 band.
editor take
Subjective Logic scores dataset bias, but traffic-sign validation is narrow; real dirty-data governance will stress this framework harder.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Predicting 3D Structure by Latent Posterior Sampling
The paper proposes a 3D reconstruction method that combines NeRF scene representations with diffusion-model posterior sampling, uses a two-stage training process, and evaluates inputs including single-view, multi-view, noisy images, sparse pixels, and sparse depth data.
#Vision#Multimodal#Reasoning#Research release
why featured
HKR-K passes with a clear mechanism and test conditions; HKR-H/R are weak, and results, baselines, and code are not disclosed. This is useful vision research, not a featured AI-industry story.
editor take
NeRF latents plus diffusion posterior sampling for 3D reconstruction; metrics and datasets aren’t disclosed, so don’t read uncertainty modeling as SOTA.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
The paper evaluates 10 dimensionality-reduction strategies on frozen features from six vision backbones across CIFAR-100, Tiny ImageNet, and CUB-200-2011. LDA improves accuracy in 11 of 12 coarse-grained configurations, reaches gains up to 4.5 percentage points, and cuts feature dimensionality by 48-87%, while hurting all six CUB-200 fine-grained setups.
#Vision#Benchmarking#Research release#Benchmark
why featured
HKR-K passes with concrete experiment scope and measurable gains. HKR-H and HKR-R are weak because this is a niche CV dimensionality-reduction paper, so it stays in the lower all band.
editor take
LDA wins 11/12 coarse setups by up to 4.5 points; before distillation, try the old blade on frozen features.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization
MASS-DPO selects compact negative subsets with a PL-specific Fisher-information objective, and matches or exceeds existing methods across 4 benchmarks and 3 model families.
#Alignment#Fine-tuning#Benchmarking#MASS-DPO
why featured
HKR-K passes via a concrete mechanism and evaluation scope. HKR-H/R are weak, and the DPO-training focus is too niche for featured placement.
editor take
MASS-DPO matches or beats baselines on 4 benchmarks and 3 model families; if negatives are costly, Fisher selection beats brute-force pooling.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Elucidating Representation Degradation Problem in Diffusion Model Training
The paper defines Representation Degradation as an optimization bottleneck in diffusion training and proposes ERD; the abstract says ERD reallocates optimization effort by effective recoverability, but the RSS snippet does not disclose benchmark numbers or datasets.
#Multimodal#Benchmarking#arXiv#Research release
why featured
HKR-K passes because ERD gives a concrete recoverability-based optimization mechanism. HKR-H/R are weak, and no experiment numbers are disclosed, so this stays in the normal research band.
editor take
ERD reallocates optimization by recoverability; RSS gives no datasets or numbers, so treat this as a diffusion-training diagnosis paper.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity
The paper proposes raFLoRA, a rank-partitioned aggregation method for heterogeneous FedLoRA; experiments across vision, language, and reasoning tasks show it prevents rank collapse versus FedLoRA baselines, while the RSS snippet does not disclose dataset names or numerical gains.
#Fine-tuning#Reasoning#Vision#Research release
why featured
HKR-K passes: raFLoRA gives a concrete rank-partitioned aggregation mechanism across vision, language, and reasoning tasks. HKR-H/R are weak because the topic is a narrow federated-tuning research item, so it stays in the low research-release band.
editor take
raFLoRA aggregates local updates by rank partitions; gains and datasets are undisclosed, so I buy the mechanism, not the claims.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Complex-Valued Phase-Coherent Transformer
The Phase-Coherent Transformer replaces softmax token competition with a smooth, element-independent gate over L2-normalized complex query-key similarities. The paper reports parameter-fair gains over a standard Transformer and a direct complex-valued counterpart across mid-scale benchmarks covering long-range memory, hierarchical reasoning, positional retrieval, phase memory, superposition, and image classification.
#Reasoning#Memory#Vision#Research release
why featured
HKR-K passes with a concrete mechanism and benchmark claims. HKR-H is weak paper framing, and HKR-R only lightly touches long-memory pain; no hard exclusion, but niche research keeps it in all.
editor take
PCT drops softmax token competition, but only mid-scale wins are disclosed; I’d wait for large-model and long-context replication.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR improves stem retrieval accuracy by up to roughly 70% relative to the prior state of the art. It uses under 50% of the parameters and trains 7x faster, with learned spectral pooling and a complex-valued head enforcing pitch- and phase-equivariant biases across MoisesDB, Slakh, and ChocoChorales.
#Audio#Embedding#Benchmarking#PHALAR
why featured
HKR-K passes with concrete retrieval and efficiency numbers for PHALAR. HKR-H and HKR-R are weak because the title is niche and the impact is narrow, so it fits all rather than featured.
editor take
PHALAR claims up to about 70% relative gains in stem retrieval across three music sets; phase-equivariant bias still beats pure black-box audio embeddings.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
The paper proposes RaPO, an RFT method that uses retention rewards and Cross-Task Advantage Normalization to address trajectory-level drift in visual continual learning, and evaluates it across five settings where it reduces catastrophic forgetting while preserving plasticity.
#Fine-tuning#Vision#Multimodal#Research release
why featured
HKR-K passes with RaPO, retention rewards, CTAN, and 5 settings; HKR-R is modest because forgetting in fine-tuning matters to practitioners. No hard exclusion, but it is a niche arXiv paper without artifact, benchmark detail, or major-lab signal.
editor take
RaPO cuts forgetting across 5 visual continual-learning settings; I buy trajectory-level KL as the bug, not another generic regularizer.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
NEO performs hyperparameter-free test-time adaptation by re-centering target embeddings at the origin, raising ViT-Base accuracy on ImageNet-C from 55.6% to 59.2% after one 64-sample batch.
#Vision#Inference-opt#Benchmarking#NEO
why featured
HKR-H and HKR-K pass: the “no-optimization TTA” hook is clear, and ImageNet-C improves from 55.6% to 59.2%. The audience is narrow, so it stays in the 60–71 research band.
editor take
NEO lifts ViT-Base ImageNet-C to 59.2% with 64 samples; I buy TTA more as an inference patch than a training script.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H1·K1·R0
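As a rough illustration of the re-centering idea in the NEO card above, this minimal sketch subtracts the batch mean from a set of target embeddings so the batch is centered at the origin. The name `recenter`, the list-of-lists format, and the toy values are assumptions for illustration; the summary does not say where in the network the shift is applied or how the mean is accumulated.

```python
from typing import List


def recenter(embeddings: List[List[float]]) -> List[List[float]]:
    """Shift a batch of embeddings so that its mean sits at the origin."""
    n = len(embeddings)
    dim = len(embeddings[0])
    # Per-dimension mean over the batch (hypothetical stand-in for the target-domain center).
    mean = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    return [[row[j] - mean[j] for j in range(dim)] for row in embeddings]


# Toy batch of three 2-D embeddings (the card mentions a 64-sample batch).
batch = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
centered = recenter(batch)
```

No gradient step or tunable hyperparameter appears in the sketch, which is the property the card highlights.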
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis
DataArc-SynData-Toolkit provides an open-source synthetic data framework with a configuration-driven pipeline, visual interface, simplified CLI, and modular architecture for multimodal, multilingual, and multi-task adaptation; the abstract does not disclose the code repository, benchmark scores, or measured training gains.
#Multimodal#Fine-tuning#Tools#DataArc-SynData-Toolkit
why featured
HKR-K and HKR-R pass, but no repo, benchmark score, or training gain is disclosed, so this stays below featured. No hard exclusion applies; it fits a modest arXiv tooling release in the low 60s.
editor take
DataArc-SynData-Toolkit claims an open-source closed-loop synth-data framework; no repo, benchmarks, or training gains are disclosed, so treat it as a tooling shell.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation
The paper shows that class-level machine unlearning can suppress forgotten classes by lowering final classification-head bias terms, then evaluates BiasShift, TS-BGRM, LB-HR, and three bias metrics on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
#Safety#Interpretability#Benchmarking#Research release
why featured
HKR-K is strong: new methods, metrics, and CIFAR-10/CIFAR-100/Tiny-ImageNet conditions. HKR-R is moderate via privacy and unlearning evaluation, but the topic is narrow and lacks model or product impact.
editor take
BiasShift passes standard unlearning metrics by tweaking classifier-head bias; CIFAR-10/100 and Tiny-ImageNet make that benchmark weakness hard to ignore.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
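The head-bias shortcut in the unlearning card above can be illustrated with a toy softmax: lowering only the bias term of one class drives its probability toward zero while the feature-dependent logits stay untouched. All values and names here are hypothetical; this is not the paper's BiasShift procedure, only a sketch of why bias changes alone can pass a "class is forgotten" check.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical feature-dependent logits (W·x) for three classes; class 0 is the "forgotten" one.
feature_logits = [2.0, 1.0, 0.5]

original_bias = [0.0, 0.0, 0.0]
shifted_bias = [-10.0, 0.0, 0.0]  # only the head bias of class 0 is lowered

before = softmax([z + b for z, b in zip(feature_logits, original_bias)])
after = softmax([z + b for z, b in zip(feature_logits, shifted_bias)])
```

Because `feature_logits` is identical in both cases, whatever the backbone encodes about class 0 is still present after the shift.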
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Additive Atomic Forests for Symbolic Function and Antiderivative Discovery
The paper introduces additive atomic forests to recover a function and its antiderivative from data; in reported runs on 17 classification benchmarks, sparse atom combinations match or exceed XGBoost on 13 datasets while producing interpretable formulas.
#Benchmarking#XGBoost#Research release#Benchmark
why featured
HKR-K passes via a named method and a concrete 13/17 XGBoost comparison. HKR-H and HKR-R are weak: this is a specialist ML paper, not a product, agent, or frontier-model competition story, so it fits the 60–71 band.
editor take
Additive atomic forests match or beat XGBoost on 13 of 17 classification benchmarks; I’d check dataset size first—symbolic regression loves small tables.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Exploration-Driven Optimization for Test-Time Large Language Model Reasoning
Changhao Li and six coauthors propose EDO for test-time LLM reasoning. It integrates with iDPO and GRPO, improving three in-distribution reasoning benchmarks by 1.0-1.3% and adding a 1.5% average gain on five out-of-distribution tasks.
#Reasoning#Fine-tuning#Inference-opt#Changhao Li
why featured
HKR-K passes: EDO adds exploration to iDPO/GRPO and reports small reasoning gains. HKR-H/R miss: the title is academic and the impact is incremental, with no code or production replacement claim.
editor take
EDO adds only 1.0-1.3% on three in-distribution benchmarks. I’d inspect entropy curves before treating it as a GRPO default.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
The paper proposes Queryable LoRA, a parameter-efficient fine-tuning method that replaces purely layer-local adapters with shared low-rank update atoms. Each layer block forms a query from the current low-rank state and prior block summary, routes updates via attention, and is tested on noisy nonlinear regression and LLM fine-tuning.
#Fine-tuning#Memory#Tools#Research release
why featured
HKR-K passes on the adapter-routing mechanism, but HKR-H and HKR-R fail: no result number, cost claim, artifact, or practitioner controversy. This stays in the interesting research band.
editor take
Queryable LoRA routes shared low-rank atoms via attention; no parameter counts or benchmarks, so treat it as dynamic LoRA stability work.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Federated Concept-Based Models: Interpretable Models with Distributed Supervision
The paper proposes Federated Concept-based Models, which aggregate concept-level information across institutions and adapt model architecture as concept supervision changes while preserving privacy in federated learning settings.
#Interpretability#Fine-tuning#Research release
why featured
HKR-K/R pass: the paper offers a federated concept-supervision mechanism with privacy and interpretability relevance. No experiment numbers, artifact, or product path are disclosed, keeping it in the 60–71 band.
editor take
F-CMs federate distributed concept labels; the abstract omits clients and datasets, so I’d treat it as concept-model label completion.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Likelihood Scoring for Continuations of Mathematical Text: A Self-Supervised Benchmark with Tests for Shortcut Vulnerabilities
The paper introduces a label-free continuation benchmark using 1,363 equation suffixes from 138 physics and mathematics papers, where GPT-5.5 forecasts improve clipped likelihood under Qwen3-8B and Kimi K2.6 scorers and still beat a context-only fine-tuned control, while GPT-5.4 nano does not.
#Reasoning#Benchmarking#Fine-tuning#OpenAI
why featured
HKR-K passes with a new benchmark, sample count, and controls. HKR-H/R are weak because the angle is narrow mathematical-text evaluation, with no hard-exclusion trigger, so it sits in all.
editor take
GPT-5.5 beats a fine-tuned control on 1,363 equation continuations; I like the label-free setup, but scorer bias survives.
HKR breakdown
hook knowledge resonance
open source
62
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Shapley Regression for Rare Disease Diagnosis Support: A Case Study on APDS
The paper proposes Shapley regression for APDS diagnosis support and evaluates a 2-additive model with l2 regularization on eight public biomedical datasets and a real-world cohort of 222 patients.
#Interpretability#Reasoning#arXiv#Research release
why featured
HKR-K passes with concrete dataset and cohort details. HKR-H/R are weak because this is a niche medical ML paper, so it fits all rather than featured.
editor take
Shapley regression ran on 222 APDS patients; I buy the interpretability, not “accurately distinguished” without metrics.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
The paper compares five temporal encoding strategies for LLM event-sequence modeling and fine-tunes models on real-world datasets, finding that prediction performance depends on matching the tokenizer to data distributions ranging from smooth log-normal to discrete spiky patterns.
#Fine-tuning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes: 5 temporal encodings and tokenizer-data fit offer useful signal. HKR-H/R are weak, and the topic is specialized research, so it sits in the low-60s all tier.
editor take
The paper tests five time encodings; I buy it—don’t default to calendar strings before checking distribution spikiness.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reinforcement Learning with Action Chunking
The paper presents Q-chunking, which runs TD-based reinforcement learning in a chunked action space and uses unbiased n-step backups to improve offline-to-online sample efficiency on long-horizon, sparse-reward manipulation tasks.
#Agent#Robotics#Reasoning#Research release
why featured
HKR-K and HKR-R pass: the mechanism is concrete and the problem matters for robotics/agent training. No metrics, code, or reproducible setup are disclosed here, and HKR-H is weak, so this stays in all.
editor take
Q-chunking runs TD RL in chunked action space; no numbers disclosed, but this beats another reward-hacking patch for sparse robotics.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R1
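The chunked backup described in the Q-chunking card above can be sketched as follows: rewards collected while one action chunk executes are summed with discounting, and the critic bootstraps once at the end of the chunk. The function name, discount, and toy numbers are hypothetical; the summary does not give the paper's exact update rule.

```python
from typing import List


def chunk_td_target(rewards: List[float], bootstrap_q: float, gamma: float) -> float:
    """n-step target for a chunk of n actions executed as one unit.

    rewards:     rewards observed during the chunk, in order
    bootstrap_q: critic estimate for the state reached after the chunk
    gamma:       per-step discount factor
    """
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    # Bootstrap once, discounted by the chunk length.
    return target + (gamma ** len(rewards)) * bootstrap_q


# Sparse-reward toy case: the reward only arrives on the last step of a 3-step chunk.
target = chunk_td_target([0.0, 0.0, 1.0], bootstrap_q=4.0, gamma=0.5)
```

With a sparse reward, a single backup over the whole chunk moves the reward signal three steps back at once, which is the sample-efficiency argument in the card.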
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimates and policy with self-normalized importance sampling, uses symmetric Beta sampling and tiered training, and reports gains in training efficiency, final effectiveness, and inference efficiency across multiple models and domains, while the abstract does not disclose exact benchmark scores.
#Reasoning#Fine-tuning#Inference-opt#DARE
why featured
HKR-K passes because the post states concrete RL training mechanisms. HKR-H/R are weak: no benchmark numbers, code link, model scale, or production impact are disclosed, so this stays in the ordinary research-release band.
editor take
DARE updates difficulty and policy with SNIS. Scores are undisclosed, so I’d file it as a practical rollout-saving patch.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Causal Discovery Should Embrace the Wisdom of the Crowd
The paper proposes a crowd-based causal learning framework that integrates partial and noisy knowledge from many contributors into a global causal structure through elicitation, modeling, aggregation, and optimization.
#Reasoning#Research release#Commentary
why featured
HKR-H and HKR-K pass: the angle is novel and the post gives a four-step mechanism. No metrics, artifact, or product implication; causal discovery is specialized, so this sits in the 60–71 research-signal band.
editor take
arXiv 2603.02678v3 offers a four-step framework; no benchmarks disclosed, so I don’t buy the crowd-wisdom prior.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Efficient Ensemble Selection from Binary and Pairwise Feedback
The paper models ensemble selection as multiwinner voting over an unknown task distribution, gives a failure-conditioned greedy algorithm with a 1-1/e guarantee under binary feedback, and reports small-scale LLM experiments on query savings and complementarity.
#Benchmarking#Inference-opt#Research release
why featured
HKR-K passes for a concrete approximation guarantee and LLM experiment. HKR-H/R are weak, and the multiwinner voting framing is academic, so this stays in all below featured.
editor take
The paper keeps a 1-1/e guarantee for binary-feedback ensemble selection; small LLM tests make this theory, not a router recipe.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
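For readers unfamiliar with where a 1-1/e guarantee comes from, here is a minimal sketch of standard greedy selection on a coverage objective with binary feedback (a model either solves a task or it does not). This is the textbook greedy rule for monotone submodular coverage, not the paper's failure-conditioned algorithm, whose details the summary does not give; the model names and task ids are made up.

```python
from typing import Dict, List, Set, Tuple


def greedy_ensemble(solves: Dict[str, Set[int]], k: int) -> Tuple[List[str], Set[int]]:
    """Pick up to k models, each time adding the one that solves the most still-unsolved tasks."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(k):
        candidates = [m for m in solves if m not in chosen]
        if not candidates:
            break
        # Marginal gain = number of tasks the current ensemble still fails that this model solves.
        best = max(candidates, key=lambda m: len(solves[m] - covered))
        if not solves[best] - covered:
            break  # no remaining model adds coverage
        chosen.append(best)
        covered |= solves[best]
    return chosen, covered


# Hypothetical binary feedback: model -> set of task ids it solved.
feedback = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
chosen, covered = greedy_ensemble(feedback, k=2)
```

The second pick is scored only on tasks the first pick failed, which is why complementary models are preferred over individually strong but redundant ones.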
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Mitigating Membership Inference in Intermediate Representations with Differentially Private Training
The paper introduces LM-DP-SGD, which trains a shadow model on a public shadow dataset, fits layer-specific MIA adversaries, and reweights each layer’s contribution to the globally clipped gradient under a fixed noise magnitude.
#Embedding#Fine-tuning#Safety#Research release
why featured
HKR-K passes via LM-DP-SGD: public shadow data, layer-wise MIA, and clipping reweighting. HKR-H fails and HKR-R is narrow; a single arXiv paper with no reported numbers stays in all.
editor take
LM-DP-SGD tunes gradients by layer-wise shadow MIAs; no metrics disclosed, but EaaI privacy finally targets IR leakage.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure
The paper introduces C2LT-3D, which factorizes 3D representation into canonical local geometry, partition-conditioned context, and relational seam variables, then trains on single-object CAD models and evaluates zero-shot on open-world multi-component assets without a separate post-hoc structure recovery module.
#Multimodal#Reasoning#arXiv#C2LT-3D
why featured
HKR-K passes: the item gives C2LT-3D's representation mechanism and zero-shot setup. HKR-H/R are weak; without product impact, open artifacts, or benchmark numbers, it fits the 60–71 band.
editor take
C2LT-3D trains on single-object CAD and tests zero-shot on multi-component assets; no metrics shown, so treat this as a tokenizer bet.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection
The paper uses an LLM to generate 730 synthetic regression datasets, augmenting 42 UCI datasets for meta-learning-based algorithm selection. Uniform sampling beats margin-based sampling, reducing Hamming loss by 17.47%, improving subset accuracy by 100.41%, and adding 6.09% pooled out-of-fold R² under the reported setup.
#Reasoning#Benchmarking#arXiv#UCI
why featured
HKR-K passes on concrete generation and evaluation numbers; HKR-H and HKR-R are weak because the angle is academic and niche. No hard exclusion triggered, so it lands in the 60–71 interesting-but-not-featured band.
editor take
730 LLM-made regression datasets augment 42 UCI sets; I buy uniform coverage beating margin sampling, not the “performance manifold” story yet.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Synergistic Simplex: Cooperative Runtime Assurance for Safety-Critical Autonomous Systems
The paper proposes the Synergistic Simplex architecture for AV obstacle detection, allowing safety monitors to use ML outputs while formally deriving the conditions under which runtime assurance safety guarantees are preserved.
#Robotics#Safety#Alignment#Research release
why featured
HKR-K and HKR-R pass: the paper offers a concrete mechanism and formal safety conditions for AV detection. HKR-H is weak, and runtime assurance is specialized, so it stays in the 60–71 research band.
editor take
Synergistic Simplex lets monitors consume ML outputs; no benchmark numbers disclosed, so the formal conditions carry the claim.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
TopoGeoScore scores OOD checkpoints using only source-domain embeddings, with no target samples or labels. It combines three geometric and topological signals, learns non-negative linear weights through self-supervision, and is evaluated on CIFAR corruption and shift benchmarks, ImageNet-C, MNLI→HANS transfer, and OGBN-Arxiv.
#Benchmarking#Research release#Benchmark
why featured
HKR-K passes: the paper offers a concrete source-only OOD checkpoint selection mechanism and evaluations. HKR-H is weak and HKR-R is narrow, so this stays in all rather than featured.
editor take
TopoGeoScore selects OOD checkpoints from source embeddings only; the no-target-samples constraint is strict, but cross-architecture stability is undisclosed.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning
The paper diagnoses collapsed adaptive gating in frozen few-shot prompt learning with CLIP-style backbones, using controlled experiments across datasets and multiple prompt-learning architectures, and identifies two recurring failure modes: gradient magnitude imbalance and gate degradation.
#Vision#Multimodal#Fine-tuning#CLIP
why featured
HKR-K passes because the paper offers testable failure mechanisms and controlled experiments. HKR-H/R are weak: CLIP-style few-shot prompt learning is useful but narrow, so this stays in all.
editor take
The paper tests multiple datasets; CLIP few-shot gates often collapse to constants. I buy the diagnosis: many adaptive prompts are parameter noise.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Ister: Linear Transformer for Efficient Multivariate Time Series Forecasting
Ister replaces multi-head self-attention with Dot-attention, a linear-complexity element-wise dot-product mechanism for MTSF, and adds inverted seasonal-trend decomposition to isolate periodic components; the arXiv abstract reports state-of-the-art results across several real-world benchmarks and provides code on GitHub.
#Reasoning#Inference-opt#Benchmarking#Ister
why featured
HKR-K passes on a concrete mechanism, efficiency claim, benchmarks, and code. HKR-H and HKR-R are weak because the angle is niche MTSF research, so it fits all rather than featured.
editor take
Ister makes MTSF attention linear; SOTA is abstract-only here, with no tables or ablations, so don’t ditch PatchTST yet.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Multimodal Representation Learning Conditioned on Semantic Relations
The paper proposes RCML, a framework that conditions multimodal embeddings on natural-language relation descriptions; experiments cover multiple datasets and zero-shot, fine-tuned, and out-of-domain settings, but the post does not disclose exact metrics.
#Multimodal#Embedding#Benchmarking#Research release
why featured
HKR-K passes: RCML’s relation-conditioned embedding mechanism is informative, but the post gives no concrete metrics and HKR-H/R are weak. No hard exclusion; it sits in the 60–71 research-paper band.
editor take
RCML conditions multimodal embeddings on natural-language relations; no metrics disclosed, so I read it as a clean shot at CLIP’s single-embedding assumption.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Emergent Semantic Role Understanding in Language Models
The paper freezes decoder-only transformers and trains linear probes for semantic roles, finding that pretrained representations contain substantial role information, while probe performance still does not fully match task-specific fine-tuned models.
#Interpretability#Reasoning#Benchmarking#Research release
why featured
HKR-K passes because the paper gives a testable probing setup and finding. HKR-H and HKR-R are weak; this is a narrow representation-analysis paper with limited product or industry impact.
editor take
Frozen decoder-only Transformers expose semantic roles via linear probes; the useful move is turning “emergence” into a measurable residual.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Portable Active Learning for Object Detection
PAL selects annotation samples using only detector inference outputs and combines class-wise instance uncertainty with image-level diversity; experiments on COCO, PASCAL VOC, and BDD100K report better label efficiency and detection accuracy than active learning baselines.
#Vision#Benchmarking#PAL#COCO
why featured
HKR-K passes with a clear mechanism and three datasets, but the body gives no gain size. HKR-H and HKR-R are weak, so this stays in all rather than featured.
editor take
PAL selects samples from detector outputs across COCO, VOC, and BDD100K; gains are undisclosed, so portability is the claim to test.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
The paper trains tensor product representation probes on an Othello model with linearly decodable board states. The probes factor the representation into square embeddings, color embeddings, and a binding matrix, and the authors report that linear probes can be recovered directly from TPR probe parameters.
#Interpretability#Research release
why featured
HKR-K passes because the article gives a concrete TPR probing mechanism. HKR-H and HKR-R are weak, and an Othello interpretability paper sits far from product or industry decisions.
editor take
TPR probes work on one Othello model; factoring directions into square, color, and binding terms is neat, but LLM transfer is unproven.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
When Normality Shifts: Risk-Aware Test-Time Adaptation for Unsupervised Tabular Anomaly Detection
The paper proposes RTTAD for unsupervised tabular anomaly detection, using dual-task training and risk-aware test-time contrastive learning, and reports state-of-the-art overall detection performance across 15 tabular datasets.
#Fine-tuning#Embedding#Benchmarking#RTTAD
why featured
HKR-K passes: RTTAD adds a two-stage risk-aware TTA method and reports SOTA on 15 tabular datasets. Niche tabular anomaly detection lacks HKR-H/HKR-R pull, so it stays in all.
editor take
RTTAD reports SOTA on 15 tabular datasets; I want contamination rates and pseudo-normal thresholds, not abstract-level confidence.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
GraphBench: Graph Learning Benchmarking
GraphBench introduces a graph learning benchmark suite spanning node-level, edge-level, graph-level, and generative tasks across real-world domains. The paper specifies standardized dataset splits, metrics for selected out-of-distribution generalization tasks, and a unified hyperparameter-tuning framework, then evaluates message-passing neural networks and graph transformers as baselines.
#Benchmarking#GraphBench#Benchmark#Research release
why featured
HKR-K passes: GraphBench offers a unified evaluation setup across graph-learning tasks. HKR-H and HKR-R are weak, and the niche research scope keeps it in all rather than featured.
editor take
GraphBench spans 4 graph task types with OOD metrics; graph foundation models still lack a clean evaluation floor.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
LAGCD embeds a residual linear adapter into each ViT block for generalized category discovery, adds an auxiliary distribution alignment loss to reduce biased predictions between seen and novel categories, and reports consistent gains over multiple baselines on generic and fine-grained datasets; the RSS snippet of the arXiv abstract does not disclose exact accuracy numbers.
#Vision#Fine-tuning#Benchmarking#Research release
why featured
HKR-H/K pass via the counterintuitive title and concrete LAGCD mechanism. HKR-R fails: this is a niche vision/GCD paper with no product or industry impact disclosed, so it stays in the lower research-release band.
editor take
LAGCD puts linear adapters in every ViT block, but RSS gives no accuracy; its jab at nonlinear adapters is the useful claim.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H1·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction
GLiNER-Relex uses one model for joint named entity recognition and relation extraction, evaluates on four benchmarks—CoNLL04, DocRED, FewRel, and CrossRE—and releases an open-source Python package with a simple inference API.
#RAG#Embedding#Tools#GLiNER-Relex
why featured
HKR-K passes with a concrete joint extraction setup, four benchmarks, and an open Python package. HKR-H/R are weak, so this is a niche applied-NLP research release kept in all.
editor take
GLiNER-Relex was evaluated on 4 RE benchmarks, but scores aren’t disclosed here; I buy one-call triples, not the “competitive” handwave.
HKR breakdown
hook knowledge resonance
open source
60
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
CoreQ applies a closed-form coefficient to correct layerwise mismatch in PTQ and solves the induced triangular least-squares objective with successive rounding; the paper reports improved perplexity and downstream accuracy across multiple LLM families, model scales, bit-widths, and quantization settings.
#Inference-opt#Benchmarking#Research release#Benchmark
why featured
HKR-K passes: CoreQ describes PTQ mismatch correction and successive rounding, with claimed gains across models and bit widths. HKR-H is weak, HKR-R is thin; the arXiv-only technical angle lacks concrete gain numbers, so it stays in all.
editor take
CoreQ uses closed-form PTQ mismatch correction. No model table or numbers are disclosed; treat “broad gains” as unverified.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
The paper proposes OPL and G-OPL for video anomaly detection, where G-OPL uses weak supervision from face-presence signals to suppress facial attributes without identity labels or adversarial training.
#Vision#Safety#Research release
why featured
HKR-K passes via the OPL/G-OPL mechanism; HKR-R comes from surveillance privacy risk. No benchmark numbers or product path are disclosed, so it stays in the lower research-signal band.
editor take
G-OPL suppresses facial attributes via face-presence signals; datasets and gains are undisclosed, but auditable projection beats adversarial privacy theater.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness
FragileFlow uses a calibrated margin buffer to identify correct-but-fragile predictions and organizes off-class probability mass into a class-wise vulnerable-risk matrix for LLM and VLM adaptation; the arXiv abstract reports a PAC-Bayes upper bound plus experiments on multiple-choice LLM benchmarks and few-shot CLIP adaptation, but does not disclose dataset names or numeric gains.
#Reasoning#Vision#Fine-tuning#FragileFlow
why featured
HKR-K passes for the margin-buffer and vulnerable-risk-matrix mechanism. HKR-H and HKR-R are weak, and the post discloses no experiment numbers, artifact, or production impact.
editor take
FragileFlow gives a mechanism but discloses no gains; I buy margin-flow diagnostics, not “most settings improve” without numbers.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Auction-Based Online Policy Adaptation for Evolving Objectives
The paper proposes an auction-based multi-objective reinforcement learning framework where local policies bid for action control as objectives appear or disappear at runtime, and it evaluates the PPO-trained implementation on two Atari games and one gridworld path-planning task with dynamic targets.
#Agent#Reasoning#Research release
why featured
HKR-K passes because the mechanism and evaluation setup are concrete; HKR-H and HKR-R are weak. This is a niche RL paper for researchers, not a broad practitioner story, with no hard-exclusion trigger.
editor take
Auction policies ran on 2 Atari games and 1 gridworld task; I’d hold the runtime-adaptation claim until heterogeneous objectives survive.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
ERIS: Enhancing Privacy and Scalability in Federated Learning via Federated Shard Aggregation
ERIS introduces Federated Shard Aggregation, which partitions each client update into non-overlapping shards and distributes aggregation across multiple client-side aggregators, preserving the centralized FL update after reassembly and reaching FedAvg-level utility in image, text, and large language model experiments without heavy cryptography or utility-degrading perturbations.
#Fine-tuning#Inference-opt#Safety#ERIS
why featured
HKR-K passes: the post gives a concrete aggregation mechanism and claims FedAvg-level utility across image, text, and LLM tests. HKR-H/R are weak, with no metrics, artifact, or production impact disclosed.
editor take
ERIS shards client updates across aggregators and keeps FedAvg utility; I buy the mechanism, not the scale claim without LLM size or overheads.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
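The claim in the ERIS card above that sharded aggregation preserves the centralized update after reassembly can be checked on a toy example. In this minimal sketch each client update is cut into contiguous, non-overlapping shards, each shard is averaged separately (standing in for a separate aggregator), and the results are concatenated. The function names, the contiguous split, and the unweighted mean are assumptions; the summary does not describe how ERIS assigns shards or weights clients.

```python
from typing import List, Tuple


def fedavg(updates: List[List[float]]) -> List[float]:
    """Unweighted coordinate-wise mean of client updates (centralized baseline)."""
    n = len(updates)
    return [sum(u[i] for u in updates) / n for i in range(len(updates[0]))]


def shard_bounds(dim: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split range(dim) into contiguous, non-overlapping [start, end) ranges."""
    base, extra = divmod(dim, num_shards)
    bounds, start = [], 0
    for s in range(num_shards):
        end = start + base + (1 if s < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def sharded_average(updates: List[List[float]], num_shards: int) -> List[float]:
    """Average each shard independently, then reassemble in shard order."""
    result: List[float] = []
    for lo, hi in shard_bounds(len(updates[0]), num_shards):
        # Each aggregator only ever sees coordinates lo..hi-1 of every client update.
        result.extend(fedavg([u[lo:hi] for u in updates]))
    return result


client_updates = [[1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 2.0, 1.0, 0.0, -1.0]]
central = fedavg(client_updates)
reassembled = sharded_average(client_updates, num_shards=3)
```

Averaging is coordinate-wise, so splitting coordinates across aggregators cannot change the result; the privacy argument rests on no single aggregator seeing a full update.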
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer uses an object-centric Transformer to learn mesh-free rigid-body dynamics from point inputs, advances objects through compact anchors, projects updates with differentiable Kabsch alignment, and scales to more than 200 objects on standard benchmarks while matching or outperforming mesh-based baselines.
#Robotics#Reasoning#RigidFormer#Research release
why featured
HKR-K passes via a concrete mechanism and 200+ object scale; HKR-H/R are weak, and rigid-dynamics research has a high access bar for general AI readers. No hard exclusion, so it lands in all as a research signal.
editor take
RigidFormer scales rigid simulation to 200+ objects; I’d stress-test long-horizon contact error before buying the robotics angle.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG· atomEN04:00 · 05·12
Reasoning-Aware Training for Time Series Forecasting
STRIDE distills LLM reasoning traces into a continuous prior for TSFMs, reaching 0.674 MASE and 0.454 CRPS on GIFT-Eval while improving Chronos-2 and Timer-S1 as a plug-and-play module.
#Reasoning#Embedding#Benchmarking#STRIDE
why featured
HKR-K passes because the mechanism and GIFT-Eval numbers are concrete; HKR-H and HKR-R are weak. This is useful time-series ML research, but it lacks product impact or industry-event tension, so it stays in the lower band.
editor take
STRIDE hits 0.674 MASE on GIFT-Eval; distilling reasoning into embeddings beats forcing LLMs to tokenize time series.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
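For readers unfamiliar with the MASE figure quoted above, here is a minimal sketch of the standard per-series definition: forecast MAE divided by the in-sample MAE of a (seasonal) naive forecast. The function name and toy numbers are illustrative assumptions; how GIFT-Eval aggregates scores across datasets is not covered here.

```python
def mase(train, actual, forecast, season=1):
    """Mean Absolute Scaled Error for a single series.

    Scale = in-sample MAE of the naive forecast y[t] = y[t - season].
    Values below 1 mean the forecast beats that naive baseline on average.
    """
    naive_errors = [
        abs(train[t] - train[t - season]) for t in range(season, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


train = [10.0, 12.0, 11.0, 13.0]   # history used only to compute the scale
actual = [14.0, 15.0]              # held-out values
forecast = [13.0, 16.0]            # model predictions
score = mase(train, actual, forecast)
print(round(score, 3))
```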
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Benchmarking Sensor-Fault Robustness in Forecasting
The paper introduces SensorFault-Bench, a CPS sensor-fault stress-test protocol that evaluates forecasting models across four real-world datasets and eight scored scenarios, reporting clean MSE, worst-scenario degradation, and worst-scenario fault-time MSE under a standardized severity model.
#Benchmarking#SensorFault-Bench#Chronos-2#Research release
why featured
HKR-K passes with a named benchmark, dataset count, and evaluation setup. HKR-H is weak, and HKR-R is narrow to time-series/CPS practitioners, so it stays in the lower research-news band.
editor take
SensorFault-Bench tests 8 scenarios on 4 datasets; Chronos-2 losing to last-value is a clean-MSE reality check.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Statistical Inference and Quality Measures of KV Cache Quantisations Inspired by TurboQuant
The paper compares KV, KQV, and QKQV KV-cache quantization under a fair bit budget; at n=4, KQV wins on KL divergence, geometric K error, and 6D distance across all tested distributions and ranks.
#Inference-opt#Benchmarking#TurboQuant#Research release
why featured
HKR-K passes with a concrete KV/KQV/QKQV comparison and n=4 result. HKR-H/R are weak: the post stays at paper metrics and does not tie them to real inference cost or reproducible deployment conditions.
editor take
KQV beats QKQV at n=4 on every metric; I’d trust that negative result before adding QJL to K near softmax.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Spectrally-Guided Diffusion Noise Schedules
The paper proposes per-instance noise schedules for pixel diffusion based on image spectral properties, derives bounds for minimum and maximum noise levels, and removes redundant sampling steps; experiments report better single-stage pixel diffusion quality under low-step inference, while the snippet does not disclose model names, datasets, or exact metrics.
#Vision#Inference-opt#Research release
why featured
HKR-K passes because the mechanism is a specific diffusion inference optimization. HKR-H and HKR-R are weak: no speedup, FID, or cost numbers are disclosed, and the title is technical, so this stays in all.
editor take
The spectrally guided schedules trim sampling steps using image spectra; models, datasets, and metrics are undisclosed, so don’t call it a generic accelerator yet.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Generative Cross-Entropy: A Strictly Proper Loss for Data-Efficient Classification
The paper proposes Generative Cross-Entropy as a drop-in CE replacement, proves strict propriety under a mild completeness condition, and reports better results than CE across 3 datasets, 2 architectures, and both balanced small-data and class-imbalanced settings.
#Fine-tuning#Benchmarking#Research release#Benchmark
why featured
HKR-K passes: the paper gives a testable loss, experiment settings, and a strict-propriety proof. HKR-H/R are weak; this is a niche method paper without product or agent impact.
editor take
GenCE beats CE on 3 datasets and 2 architectures; I’d wait for large fine-tuning replications, since small-data gains often hide in splits.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Robust Spectral Watermark for Synthetic Tabular Data
The paper proposes TAB-DRW, a post-editing watermark for synthetic tabular data that uses Yeo-Johnson normalization, DFT, and rank-based pseudorandom bits; experiments on five benchmark tabular datasets test detectability, robustness against post-processing and adaptive attacks, and mixed-type feature support.
#Safety#Benchmarking#Research release#Safety/alignment
why featured
HKR-K passes because the method and 5 benchmark datasets add concrete information. HKR-H and HKR-R are weak; the topic is academic and far from mainstream model, agent, or product shifts, so it sits near the top of the 40–59 band.
editor take
TAB-DRW tests watermark robustness on five tabular datasets. DFT imaginary edits are neat; source code and false-positive rates are undisclosed.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns each ingested token a learnable fractional order α_i∈[δ,1], approximates the non-Markovian power-law memory kernel with an SOE decomposition, and reports advantages on two synthetic tasks: Zipf-distributed retrieval and uniform-lag entity label copying.
#Memory#Reasoning#Benchmarking#VORT
why featured
HKR-K has a concrete mechanism and test setup; HKR-R connects to long-context memory pain. But this is a narrow arXiv paper with only two synthetic experiments and no product or open-source impact.
editor take
VORT learns α_i∈[δ,1] per token; with only two synthetic tasks, I don’t buy the long-context win yet.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Reliable LLM-Based Edge-Cloud-Expert Cascades for Telecom Knowledge Systems
The paper proposes a three-stage edge-cloud-expert LLM QA cascade for telecom knowledge systems, using multiple hypothesis testing to select thresholds and bound misalignment risk with finite-sample guarantees on the TeleQnA benchmark.
#RAG#Inference-opt#Benchmarking#TeleQnA
why featured
HKR-K passes with a concrete cascade mechanism, threshold method, and TeleQnA condition. HKR-H/R are weak because the angle is narrow telecom QA, so this stays in all, below featured.
editor take
MHT sets thresholds for an edge-cloud-expert cascade; TeleQnA numbers are missing, so field value hinges on ticket-distribution drift.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal applies a three-stage pipeline to privacy-aware multi-camera domain-adaptive object detection, using one target-domain image for diffusion-based synthetic augmentation and server-side fusion across heterogeneous architectures without raw data access; experiments report 33.4% mAP, 2.1 points above prior privacy-preserving methods.
#Vision#Fine-tuning#Safety#HeroCrystal
why featured
HKR-K passes with testable mAP and method details; HKR-H/R are weak because the title is technical and the use case is narrow. No hard exclusion, but this is a niche vision paper, so it stays below featured.
editor take
HeroCrystal reports 33.4% mAP, +2.1 points; one-image synthesis is neat, but surveillance deployment validation is undisclosed.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES moves input dependence from the discretization step to the diagonal state matrix, preserving Δ as physical time while keeping per-token expressivity. The paper reports top average rank on UEA time-series classification and Physiome-ODE regression, and releases code on GitHub.
#Benchmarking#Mamba#S5#TIDES
why featured
HKR-K passes: TIDES gives a concrete architecture change and claims top average rank on UEA and Physiome-ODE. HKR-H/R are weak; this is a narrow technical paper, so it stays in all.
editor take
TIDES moves input-dependence to the diagonal state matrix and ranks first on UEA plus Physiome-ODE; for irregular time series, Mamba needed this fix.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems
The paper models binary prediction-based decisions as multi-objective optimization and shows that the Pareto frontier consists of deterministic group-specific threshold rules over individual success probabilities.
#Alignment#Research release
why featured
HKR-K lands with a concrete theorem on group-specific threshold rules. HKR-H/R are weak: the piece is theoretical fairness optimization with no experiments, product path, or deployment impact disclosed.
editor take
The paper pins the Pareto frontier to group thresholds; the sharp bit is upper-bound rules favoring lower-success individuals.
HKR breakdown
hook knowledge resonance
open source
58
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning
The paper proposes Trajectory Regularized Merging, using three objectives in the merge phase to reduce storage dependency for prior knowledge in continual learning, while the RSS snippet does not disclose benchmark names, dataset counts, or numerical gains.
#Fine-tuning#Research release
why featured
HKR-K passes because the post names a concrete mechanism and three merge-stage objectives; HKR-H and HKR-R are weak. This is a routine ML paper with no disclosed benchmark, code, or production replacement claim.
editor take
TRM adds 3 merge objectives; benchmarks and gains are undisclosed, so the storage-dependency claim still feels underproven.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation
The paper introduces FeDa4Fair, a benchmarking framework with three components for evaluating client-level heterogeneous bias in federated learning, covering attribute-bias and value-bias conditions where server-level average fairness can hide persistent client discrimination.
#Benchmarking#FeDa4Fair#Research release#Benchmark
why featured
HKR-K passes with a named benchmark, 3 components, and two bias-conflict settings; HKR-R is limited to fairness-eval specialists. No product impact, model release, or deployment claim, so it stays in the 40–59 band.
editor take
FeDa4Fair adds 3 pieces for client-level bias; FL fairness papers reporting only server averages now deserve a haircut.
HKR breakdown
hook knowledge resonance
open source
56
SCORE
H0·K1·R1
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
EventTSF: Event-Aware Non-Stationary Time Series Forecasting
EventTSF integrates historical time series and textual events with an autoregressive diffusion framework, outperforming 12 non-stationary forecasting baselines on 7 synthetic and real-world datasets with average gains of 41.3% in probabilistic forecasting and 27.5% in deterministic forecasting.
#Multimodal#Reasoning#Benchmarking#EventTSF
why featured
HKR-K passes on the mechanism and 7-dataset/12-baseline claim. HKR-H and HKR-R are weak; this is a niche academic forecasting paper with no product, open-source, or adoption detail.
editor take
EventTSF beats 12 baselines on 7 datasets; 41.3%/27.5% gains pop, but event-label cost is undisclosed.
HKR breakdown
hook knowledge resonance
open source
55
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
The paper fine-tunes pretrained LLMs on offline, oracle-labeled trajectories for few-shot sequential decision-making, then evaluates them in synthetic MDP, POMDP, and APOMDP settings; it reports smaller optimality gaps than in-context-only and random baselines, and derives a suboptimality bound for linear MDPs that separates in-context estimation error from training-length bias.
#Fine-tuning#Reasoning#Benchmarking#Research release
why featured
HKR-K passes via a concrete SFT mechanism and test settings; HKR-H and HKR-R are weak because the angle is academic and not tied to deployed agents. Kept as low-value research signal, not featured.
editor take
Oracle-labeled SFT beats ICL on synthetic MDP/POMDP/APOMDP; models and numbers are undisclosed, so don’t sell this as healthcare-ready.
HKR breakdown
hook knowledge resonance
open source
55
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management
The paper proposes an RL framework for chronic disease management using tiered TTC rewards, execution intensity ε, and clinician capability κ; in synthetic hypertension and type 2 diabetes simulations, capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D time-to-control.
#Agent#Reasoning#Alignment#CMS
why featured
HKR-K passes via concrete mechanisms and a 15-point simulation result. HKR-H/R are weak; the medical RL angle is specialist and lacks product or deployment evidence, so it stays in the lower research-signal band.
editor take
T2D synthetic sims show +15 points; I don’t buy clinical extrapolation until κ-weighting survives real EHR data.
HKR breakdown
hook knowledge resonance
open source
54
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation
The paper proposes Compressed Video Aggregator, a lightweight micro-video recommendation module that aggregates frozen VFM embeddings and uses CLIP to reselect key frames from titles, reporting consistent gains on MicroLens and Short-Video with orders-of-magnitude lower training time and GPU memory; the snippet does not disclose exact metrics.
#Embedding#Inference-opt#Compressed Video Aggregator#CLIP
why featured
HKR-K passes on the named mechanism and benchmarks. HKR-H/R are weak, and the post gives no exact gains, latency, or compute-cost figures, so it stays in the lower research-signal band.
editor take
CVA reports gains on MicroLens and Short-Video, but no metrics; CLIP title-based frame picking is useful but prone to dataset bias.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
The paper proposes GCLIP for training-free open-vocabulary semantic segmentation, reshaping last-block attention and Value embeddings to use CLIP global context, and reports stronger results than prior TF-OVSS methods on five standard benchmarks.
#Vision#CLIP#GCLIP#Research release
why featured
HKR-K passes with a concrete mechanism and 5-benchmark claim; HKR-H and HKR-R are weak. This is narrow vision research, not hard-excluded, but sparse abstract-level detail keeps it in the 40–59 band.
editor take
GCLIP beats prior TF-OVSS on five benchmarks; I care more about failure classes and CLIP-backbone ablations.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding
ShifaMind matches LAAT on MIMIC-IV top-50 ICD-10 coding across F1, AUC, and ranking metrics, while using a learned multiplicative gate over concept-grounded representations; the abstract does not disclose exact F1, AUC, or ranking scores.
#Interpretability#Benchmarking#ShifaMind#LAAT
why featured
HKR-K passes via a testable mechanism and MIMIC-IV top-50 setup, but F1/AUC are not disclosed. HKR-H/R are weak because clinical coding interpretability is narrow, so it stays in the 40–59 band.
editor take
ShifaMind only claims LAAT-level MIMIC-IV top-50 results, with no F1/AUC disclosed; the multiplicative gate is plausible, but performance claims stay soft.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference
Dystruct proposes a training-free Bayesian structured decoding framework for diffusion language models, jointly computing expansion length, block boundaries, and decoding schedule at each window expansion step; the abstract cites multiple benchmarks but does not disclose exact scores.
#Inference-opt#Reasoning#Dystruct#Research release
why featured
HKR-K passes because Dystruct adds a concrete training-free Bayesian decoding mechanism. HKR-H/R are weak, and the feed gives no speed, quality, or reproducible benchmark numbers, keeping it in the low-value research band.
editor take
Dystruct jointly computes 3 decoding decisions per window step; no scores in the abstract, so “significant gains” stays a placeholder.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Text-Guided Multi-Scale Frequency Representation Adaptation
The paper proposes FreqAdapter, a text-guided multi-scale frequency-domain adapter, and reports experiments on CLIP and LLaVA where it improves performance and converges within one epoch.
#Fine-tuning#Multimodal#Vision#CLIP
why featured
HKR-K passes via a concrete mechanism and 1-epoch convergence claim. HKR-H is weak, and HKR-R is limited because the post lacks deployment impact or adoption details, so it stays in the lower research-release band.
editor take
FreqAdapter converges within 1 epoch on CLIP and LLaVA; no parameter count or baselines in the snippet, so don’t crown frequency adapters yet.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
The Truth Lies Somewhere in the Middle (of the Generated Tokens)
arXiv:2605.09969 finds that mean pooling hidden states across generated tokens gives stronger semantic representations than any single token, quantified by kernel alignment against reference spaces in language, vision, and protein domains.
#Interpretability#Reasoning#Multimodal#arXiv
why featured
HKR-K passes: the paper gives a testable representation-extraction claim, but only title and summary are available, and kernel alignment is niche. This is research signal, not a product, model, or safety event.
editor take
Mean-pooled generated states beat single tokens; without model list or effect sizes, I’m not buying the generality yet.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
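The two extraction strategies being compared can be sketched in a few lines, assuming the hidden states for the generated tokens are already available as a list of equal-length float vectors. The function names and toy values are hypothetical; the paper's models, layers, and kernel-alignment evaluation are not reproduced here.

```python
def mean_pool(hidden_states):
    """Average the per-token hidden states into one representation."""
    n = len(hidden_states)
    return [sum(dim) / n for dim in zip(*hidden_states)]


def last_token(hidden_states):
    """Single-token alternative: use only the final generated token."""
    return list(hidden_states[-1])


# Toy hidden states: 3 generated tokens, 2 dimensions each.
states = [
    [1.0, 0.0],
    [3.0, 2.0],
    [2.0, 4.0],
]
pooled = mean_pool(states)   # uses every generated token
single = last_token(states)  # uses only the last one
print(pooled, single)
```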
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression
The paper proposes a GRPO-based distribution-aware reinforcement learning framework for MLLM regression. It uses a Concordance Correlation Coefficient reward for batch-level comparison supervision, requires no architectural changes, and improves over SFT and existing MLLM regression methods on long-tailed regression benchmarks.
#Multimodal#Fine-tuning#Benchmarking#Research release
why featured
HKR-K passes via a concrete training mechanism and benchmark claim. HKR-H/R are weak, and the niche regression focus sits far from products or major model competition, so it stays in the 40–59 band.
editor take
GRPO plus CCC targets long-tail regression; gains aren’t disclosed, so don’t treat this as a general MLLM regression fix.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
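The Concordance Correlation Coefficient named in the summary has a standard closed form (Lin's CCC), sketched below with population variances. How the paper turns it into a GRPO reward is not disclosed in the snippet, so the sketch covers only the metric itself; the function name and toy data are illustrative assumptions.

```python
def ccc(pred, target):
    """Lin's Concordance Correlation Coefficient.

    Unlike Pearson correlation, CCC also penalizes shifts in mean and scale.
    Undefined (division by zero) if both inputs are the same constant.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


target = [1.0, 2.0, 3.0, 4.0]
perfect = ccc(target, target)                # identical predictions
shifted = ccc([2.0, 3.0, 4.0, 5.0], target)  # same ordering, constant offset
print(perfect, round(shifted, 4))
```

A constant offset leaves Pearson correlation at 1 but lowers CCC, which is presumably why it suits batch-level supervision on long-tailed targets.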
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
TrajDLM models GPS trajectories as discrete road-segment sequences and reports up to 2.8x faster generation than prior work across three city-scale datasets, with code released on GitHub.
#Reasoning#TrajDLM#Cruise Research Group#arXiv
why featured
HKR-K passes with a concrete mechanism, 3 datasets, a 2.8x speed figure, and open code. HKR-H/R are weak because this is a niche mobility-generation paper with limited impact on mainstream AI practice.
editor take
TrajDLM reports 2.8x speedups on three city datasets; topology-constrained sampling makes it feel closer to simulation tooling than LLM hype.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
STILL DEVELOPING · 27d · arXiv · cs.LG · atom · EN · 04:00 · 05·12
MDL-GBG: Interpretable Clustering Method Using Minimum Description Length Principle
The paper proposes MDL-GBG for clustering, selecting among three local granular-ball explanations under the Minimum Description Length principle; experiments on 20 UCI datasets report that MDL-GBG+AC achieves the best overall average ranks in ARI, ACC, and NMI among compared methods.
#Interpretability#Benchmarking#MDL-GBG#UCI
why featured
HKR-K passes on a concrete mechanism and benchmark count, while HKR-H and HKR-R fail. This is a traditional ML clustering paper with little agent, product, or frontier-model relevance, so it sits in the low-value band.
editor take
MDL-GBG beats clustering baselines on 20 UCI datasets; I buy the three-way MDL choice more than the interpretability label.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Task Complexity Shapes Internal Representations and Robustness in Neural Networks
The paper tests MLPs on MNIST and Fashion-MNIST with five data-agnostic probes, showing that weight binarization drops hard-task accuracy to chance while easy-task models remain robust.
#Interpretability#Benchmarking#Inference-opt#arXiv
why featured
HKR-K is clear, and HKR-H comes from the binarization contrast. The work stays on MNIST/Fashion-MNIST MLPs, so practical transfer is weak and the score stays in the low-mid research band.
editor take
The paper tests 5 probes on 2 datasets; I don’t buy “model-agnostic” from MLPs on MNIST/Fashion-MNIST.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H1·K1·R0
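The summary does not say which binarization scheme the paper uses. As an assumption for illustration only, the sketch below shows one common variant: replace each weight with its sign times the mean absolute value of the matrix. The function name and toy matrix are hypothetical.

```python
def binarize(weights):
    """Sign binarization with a single per-matrix scale (assumed scheme).

    Every weight becomes +scale or -scale, where scale is the mean |w|.
    Zero weights are mapped to +scale in this sketch.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)
    return [[scale if w >= 0 else -scale for w in row] for row in weights]


layer = [
    [0.5, -1.5],
    [2.0, -1.0],
]
binary_layer = binarize(layer)
print(binary_layer)
```

All magnitude information is discarded, so a task that depends on fine-grained weight differences would be expected to suffer more than one that only needs coarse sign structure.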
04:00
28d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·12
Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models
The paper proposes Relative Kinetic Utility for structural pruning in LLMs and tests it on Qwen-2.5-7B and LLaMA-3-8B, reporting 13.34% GSM8K accuracy at 40% sparsity and better preservation of reasoning representations under out-of-distribution evaluation.
#Reasoning#Inference-opt#Benchmarking#Qwen
why featured
HKR-K passes with a new pruning method and concrete benchmark details. HKR-H and HKR-R are weak: the title is technical, and the post does not disclose cost reduction or production deployment impact.
editor take
RKU gets 13.34% GSM8K at 40% sparsity. I don’t buy the reasoning-preservation story without latency and perplexity curves.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
