ax@ax-radar:~/all $ grep -v 'tier=excluded' stream.log
45 src · signal 72% · cycle 04:32

all posts

200 items · updated 3m ago
RSS live
2026-06-09 · Tue
04:01
3h ago
STILL DEVELOPING · 1d · r/LocalLLaMA · rss · EN · 04:01 · 06·09
Gemma 4 26B Quantization Methods Performance Comparison
A Reddit user tested Gemma 4 26B 4-bit, 6-bit, and QAT 8-bit with oMLX 0.4.1 on a MacBook M5 Pro 64GB; the 6-bit model scored 98/100 on HumanEval, above the QAT 8-bit model’s 90/100.
#Benchmarking #Code #Inference-opt #Gemma
why featured
HKR-H/K/R all pass: the post has a counterintuitive result, concrete setup, and local-inference resonance. A single Reddit test and narrow scope keep it in the 60–71 band, below featured.
editor take
Title says Gemma 4 26B 6-bit hit 98/100 on HumanEval; Reddit body is 403, so don’t crown 6-bit from a screenshot.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEW · Financial Times · Technology · rss · EN · 04:00 · 06·09
ASML chief warns EU against directing chip supplies
The FT headline says ASML’s chief warned the EU against directing chip supplies, but the body only shows a subscription page and navigation, and does not disclose the quoted warning, policy context, affected chip categories, or supply mechanisms.
#ASML #EU #Financial Times #Policy
why featured
HKR-H and HKR-R pass because the ASML–EU chip-supply conflict touches AI compute geopolitics. HKR-K fails: the body is only a paywall page, with no quote, policy context, or chip category disclosed.
editor take
ASML’s CEO warned the EU off chip-supply control; the body gives no quote or categories, so treat it as lobbying for now.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H1·K0·R1
04:00
3h ago
NEW · Financial Times · Technology · rss · EN · 04:00 · 06·09
AI used to hunt Viktor Orbán’s alleged corruption
FT’s title says AI was used to investigate alleged corruption involving Viktor Orbán, but the accessible body contains only a subscription page and navigation, so the post does not disclose the tool, data sources, investigation method, or findings.
#Financial Times #Viktor Orbán #Policy
why featured
HKR-H passes on the political-investigation hook. HKR-K/R fail because the accessible body is only a subscribe page, with no AI tool, data source, or method disclosed.
editor take
FT says AI hunted Orbán corruption; no tool, data, method, or findings are disclosed, so don’t treat “AI” as evidence.
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H1·K0·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Mechanistic Origins of Catastrophic Forgetting: Why RL Preserves Circuits Better Than SFT?
The paper introduces head-level differential circuit vulnerability on Qwen2.5-3B-Instruct adapted to scientific QA, finding that SFT adapts faster but causes more circuit disruption and forgetting, while RL preserves a larger fraction of base circuits at the cost of slower task adaptation.
#Fine-tuning #Interpretability #Alignment #Qwen
why featured
HKR-H/K/R pass, but this is a single arXiv mechanistic paper with evidence limited to Qwen2.5-3B-Instruct scientific QA fine-tuning; no code, cross-source pickup, or production replication is disclosed.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
How Much Dense Attention Is Necessary? Oracle-Guided Sparse Prefill for Hybrid Long-Context Models
The paper introduces an attention-mass top-k oracle for sparse prefill in hybrid long-context models; Qwen3.5-9B stays within 0.48 points of dense attention on a 4K–100K RULER-style sweep, while preliminary single-card TTFT measurements show a 1.93x GPU speedup over a dense FlashAttention-2 baseline.
#Inference-opt #Benchmarking #Qwen #Qwen3.5
why featured
HKR-H/K/R all pass: the paper has a clear dense-attention hook, concrete RULER and TTFT numbers, and a cost/latency angle. It stays in the upper 60–71 band because the oracle setup is technical and not directly deployable.
editor take
Qwen3.5-9B loses only 0.48 on 4K–100K RULER; the oracle still computes dense attention, so don’t sell it as serving speed.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
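The attention-mass top-k selection named in the sparse-prefill item above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's oracle: `topk_by_attention_mass` and the toy scores are hypothetical, and the sketch covers a single query over four keys. It also shows why the editor take is cautious, since the selection needs the full score row before anything can be dropped.

```python
import math

def softmax(scores):
    # Numerically stable softmax over one query's attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_by_attention_mass(scores, k):
    # Dense step: every key is scored before any key is discarded.
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:k])                 # indices of the k heaviest keys
    mass = sum(weights[i] for i in kept)      # fraction of attention mass retained
    return kept, mass

# Hypothetical scores for one query against four keys.
kept, mass = topk_by_attention_mass([2.0, 0.0, 1.0, -1.0], k=2)
print(kept, round(mass, 3))
```

The retained mass is a rough proxy for how much of the dense result a sparse pass would keep; a deployable method would need to predict the kept set without computing `weights` in full.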
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Priors Persist Through Suppression: A Stroop Paradigm for Lexical Override
The paper tests a Stroop-style remapping rule across 11 open-weight 1B–9B models and finds lexical-prior strength still predicts interference after controls, while activation patching on five aligned models recovers the conflict effect with aggregate R=0.92–1.06.
#Interpretability #Reasoning #Benchmarking #Research release
why featured
HKR-H/K/R pass, but this is isolated arXiv interpretability work without product impact, a named lab, or cross-source discussion, so it stays in the 60–71 band rather than featured.
editor take
Eleven 1B–9B models still carry lexical-prior interference; rule override suppresses old logits, it doesn’t install new meanings.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
The paper proposes Online Agent-as-a-Judge, where an in-world evaluator agent actively creates social situations through native dialogue and actions; in a life-simulation environment with 32 designer-authored criteria, it improves criteria coverage and agreement with human labels.
#Agent #Benchmarking #Research release #Benchmark
why featured
HKR-H/K/R all pass: the mechanism targets interactive-agent evaluation and gives a concrete 32-criterion setup. Kept in all because the feed only discloses abstract-level facts, with no authorship signal, code, or effect size.
editor take
Online Agent-as-a-Judge actively elicits scenarios across 32 social criteria; I buy the direction, but RSS gives no lift size.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Larch: Learned Query Optimization for Semantic Predicates
Larch optimizes semantic filter execution order in AI SQL queries using two variants, Larch-A2C and Larch-Sel, and reduces total token cost overhead by 3x-19x versus Palimpzest and Quest across real-world datasets and synthetic workloads.
#RAG #Inference-opt #Embedding #Larch
why featured
HKR-H/K/R all pass, backed by a testable 3x-19x token-cost claim. This is still a single arXiv paper from a non-flagship entity, so it stays in the 60–71 band rather than featured.
editor take
Larch cuts AI SQL filter token cost 3x-19x; treating semantic operators as black boxes now looks lazy.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
OPRD: On-Policy Representation Distillation
OPRD aligns student and teacher hidden-state representations across selected layers on the same rollouts and bypasses the LM head; the paper reports 1.44x faster training and 54% lower memory use than top-k OPD.
#Reasoning #Fine-tuning #Inference-opt #Qwen
why featured
HKR-H/K/R pass, but this is an arXiv training-method paper whose impact depends on reproduction and adoption. The 1.44x speed and 54% memory claims keep it interesting but below the featured threshold.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding
Conan-embedding-v3 uses Decoupled Specialist Fusion to combine text, image, video, document, and audio retrieval in one backbone, then fixes Projector Drift with frozen-backbone projector fine-tuning and balanced rehearsal, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
#Embedding #Multimodal #Audio #Conan-embedding-v3
why featured
HKR-H/K/R all pass, but this is an arXiv embedding paper from a non-flagship entity; impact rests on mechanism and benchmark scores, with no disclosed open-source/API or production replacement proof.
editor take
Conan-embedding-v3 scores 74.9 on MMEB; Projector Drift is the paper’s useful bit, not the omni-modal branding.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them
The paper identifies repetition mismatch in pre-training data mixtures: for a 757M-parameter model, one repetition-controlled experiment using 1/16 of the target tokens recovers a two-source mixture within 0.05 of the optimum, versus 0.75 error without repetition control.
#Benchmarking #Research release
why featured
HKR-H and HKR-K pass: the title has a pretraining-experiment failure hook, and the summary gives a mechanism plus 757M and 0.75→0.05 numbers. The impact is research-method specific, so it stays in the 60–71 band.
editor take
A 757M model recovers the mix with 1/16 tokens; ignore repetition rate and your proxy run measures the wrong variable.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
The paper benchmarks DP adaptation privacy in LLMs using robust membership inference and canary extraction, and finds that under the same theoretical guarantee, adaptation data closer to the pretraining distribution shows higher empirical privacy risk.
#Fine-tuning #Safety #Benchmarking #Research release
why featured
HKR-H/K/R pass, but this is a single arXiv benchmark with no disclosed author authority, code artifact, or adoption signal. Lower-band default keeps it in all.
editor take
The paper tests membership inference and canary extraction: same DP guarantee leaks more when data matches pretraining; epsilon-only reporting is weak.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
A Case Study of Evaluating AI Agents on a Neuroscience Data-to-Discovery Pipeline
The paper evaluates general-purpose coding agents on a fly optogenetics data-to-discovery pipeline with tasks larger than existing benchmarks, and finds that agents solve several individual stages but fail to correctly complete the full end-to-end pipeline.
#Agent #Code #Benchmarking #Research release
why featured
HKR-K/R pass: the paper tests general coding agents on a real neuroscience pipeline and says full end-to-end chaining still fails. Model names, scores, and reproducible details are not disclosed here, so it stays in the upper 60–71 band.
editor take
Coding agents fail the fly optogenetics pipeline end-to-end; scientific agents need self-judgment without a grader, not another small benchmark win.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust
The paper introduces the ACUTE activation-based confidence estimation protocol and the EURO metric, testing them on 3 tasks across 6 models from 4 model families, where ACUTE outperforms strong baselines on EURO while maintaining low calibration error.
#Interpretability #Benchmarking #Tools #Research release
why featured
HKR-K and HKR-R pass: the paper gives a new protocol, metric, and cross-model tests, and calibration matters in deployment. HKR-H is weak, and this is a single arXiv paper without a disclosed artifact or production replacement claim.
editor take
ACUTE beats strong baselines on EURO across 3 tasks and 6 models; abstract-only, so cross-distribution probe stability is unproven.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models
The paper proposes a training-free safety framework that uses a small number of VLA attention heads at every step to localize the active target, feeds other scene objects into a CBF filter, and outperforms an initialization-time oracle by 43% on a dynamic SafeLIBERO variant with moving obstacles.
#Vision #Robotics #Safety #SafeLIBERO
why featured
HKR-H/K/R pass: the title has a counterintuitive hook, and the summary gives an attention-head+CBF mechanism with a 43% result. Still a single arXiv robotics-safety paper with no product or open-source impact disclosed, so it stays in 60–71.
editor take
VLA attention heads localize targets each step, beating init-time oracle by 43% on dynamic SafeLIBERO; hardware noise is the test.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in LLMs
BEACON detects LLM hallucinations from black-box outputs, using a 31-dimensional feature vector and a gradient-boosted classifier trained on 7,617 labeled examples across seven benchmarks, reaching 0.8123 AUROC while a 5-call variant reaches 0.7795 AUROC.
#Reasoning #Embedding #Benchmarking #BEACON
why featured
HKR-K and HKR-R pass: the item has concrete evaluation numbers and targets hallucination detection. As a single arXiv paper with no disclosed code, major-lab signal, or production replacement claim, it stays in the 60–71 band.
editor take
BEACON hits 0.8123 AUROC on 7,617 samples; the 5-call 0.7795 variant makes black-box hallucination checks less toy-like.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Reasoning Arena routes same-reward trace groups to a judge system, ranks traces with an anchor pool and a Bradley-Terry model, and beats the RLVR baseline by 7.6% on average across competition math and coding benchmarks.
#Reasoning #Alignment #Benchmarking #Reasoning Arena
why featured
HKR-H/K pass: the title targets RLVR limits, and the summary gives a mechanism plus +7.6%. No major lab, code release, or large replication is disclosed, so this stays in the 60–71 arXiv-method band.
editor take
Reasoning Arena beats RLVR by 7.6% and saves nearly 50% generation compute; squeezing gradients from tied traces beats brute-force sampling.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R0
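For readers unfamiliar with the Bradley-Terry model mentioned in the Reasoning Arena item above, here is a minimal sketch of fitting strengths from pairwise wins with the standard minorization-maximization update. It is an illustration only: the function name, the win matrix, and the iteration count are assumptions, and it does not reproduce the paper's judge system or anchor pool.

```python
def bradley_terry(wins, iters=200):
    # wins[i][j] = number of times trace i was preferred over trace j.
    n = len(wins)
    p = [1.0] * n  # initial strengths
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(updated)
        p = [x * n / scale for x in updated]  # normalize so strengths average 1
    return p

# Hypothetical judge outcomes for three traces, 10 comparisons per pair.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: strengths[i], reverse=True)
print(ranking)
```

Under the model, trace `i` beats trace `j` with probability `p[i] / (p[i] + p[j])`, so the fitted strengths give a graded ordering even when a verifier's reward is identical for every trace in the group.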
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@K Crossover on a Free-Verifier Domain
The paper tests teacher-free self-training with one 4-bit Qwen3-4B on a single 24 GB GPU, reporting that the trained model wins at pass@8 while the base model overtakes it at pass@64 across all four trajectories.
#Reasoning #Fine-tuning #Benchmarking #Qwen
why featured
HKR-H/K/R all pass: the crossover result, reproducible setup, and evaluation-cost angle are clear. It remains a single arXiv small-model training paper without major-lab release or cross-source pickup, so it stays in the 60–71 band.
editor take
Qwen3-4B self-training wins at pass@8, loses at pass@64 across 4 runs; self-improvement looks like probability reshuffling.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
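The pass@8 versus pass@64 crossover in the self-training item above is easier to see with numbers. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), on made-up per-problem success counts. All counts are hypothetical and are not taken from the paper; they only show how a model that concentrates probability on some problems can win at small k and lose at large k.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of pass@k from n samples with c correct.
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    # Average pass@k over problems, one correct-count per problem.
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 64
tuned = [32, 0]  # hypothetical: strong on problem 1, never solves problem 2
base = [4, 4]    # hypothetical: rarely right, but right somewhere on both

for k in (8, 64):
    print(k, round(mean_pass_at_k(tuned, n, k), 3), round(mean_pass_at_k(base, n, k), 3))
```

With these counts the tuned model leads at k=8 and the base model leads at k=64, which matches the editor take's reading that self-training can reshuffle probability mass rather than add coverage.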
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
More Bang for the Buck: Improving LLM Inference at a Fixed Budget using Reset and Discard (ReD)
The paper proposes Reset-and-Discard, a query method that improves coverage@cost at a fixed budget and reduces attempts, tokens, and USD cost across three LLMs on HumanEval, GSM8K, and MMLU-Pro.
#Inference-opt #Benchmarking #Reasoning #Research release
why featured
HKR-K and HKR-R pass: ReD targets fixed-budget inference efficiency and reports tests on 3 models and 3 common benchmarks. The post lacks cost-reduction percentages, model names, and reproducibility details, so it stays in the 60–71 band.
editor take
ReD cuts attempts and token cost across 3 LLMs and 3 benchmarks; pass@k-era sampling looks too blunt now.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Muon Learns More Robust and Transferable Features than Adam
The paper evaluates pretrained models on corrupted images and texts and finds Muon learns more robust features than Adam and SGD across transformers and CNNs, with layer-wise probes, larger logit margins, downstream transfer tests, and effective-rank measurements supporting the transferability result.
#Fine-tuning #Benchmarking #Reasoning #Muon
why featured
HKR-H/K/R all pass, but this is a single arXiv optimizer paper with no disclosed artifact, replication, or adoption signal. Useful for training teams, still narrow for the broader AI-practitioner feed.
editor take
Muon beats Adam and SGD on corrupted image/text tests; no effect sizes in the snippet, so don't canonize the optimizer yet.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
Ghosted Layers uses a small calibration set to derive a closed-form linear operator for activation alignment after Transformer layer pruning; the paper reports higher accuracy and lower perplexity than prior training-free baselines across multiple LLM backbones and pruning strategies.
#Inference-opt #Research release #Open source
why featured
HKR-K and HKR-R pass: the mechanism is concrete and cost-relevant. But this is still an arXiv compression paper; gains and code details are not disclosed here, so it stays below featured.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Operationalising the Superficial Alignment Hypothesis via Task Complexity
The paper defines task complexity as the shortest program length needed to reach target performance, then estimates it on mathematical reasoning, machine translation, and instruction following; the experiments find pre-training exposes strong performance but may need gigabyte-scale programs, while post-training reduces the required length by several orders of magnitude.
#Reasoning #Fine-tuning #Alignment #Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete complexity metric and claims results across math, MT, and instruction following. Single arXiv item lacks authors, benchmark numbers, and reproducibility detail, so it stays in the lower band.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Curvature-Guided LoRA: Matching Full Fine-Tuning in Function Space
The paper proposes CG-LoRA, which selects low-rank adaptation directions using local curvature information and avoids explicit second-order matrix construction; experiments on standard natural language understanding benchmarks report faster convergence and better performance than existing LoRA variants, but the abstract does not disclose exact scores.
#Fine-tuning #Inference-opt #Benchmarking #Research release
why featured
HKR-H/K/R pass: the paper makes a concrete LoRA-vs-full-fine-tuning claim and names a curvature mechanism. Score stays in 60–71 because benchmark numbers, model sizes, and reproduction conditions are not disclosed.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles
TinyJudge uses an ensemble of about 0.6B-parameter specialist models to reward soft constraints, outperforming baselines by about 10% on average across five benchmarks, improving reward precision by 12%, and cutting total training time by 3x.
#Alignment #Fine-tuning #Benchmarking #TinyJudge
why featured
HKR-H/K/R all pass, but this is a single arXiv alignment-training paper without a major-lab release or visible discussion cluster. Concrete metrics keep it high in the 60–71 band, below featured.
editor take
TinyJudge gets 3x training speed with 0.6B specialists; I buy small judges, but five benchmarks don't prove soft-constraint generalization.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Post-training is (Massive) Supervised Learning
arXiv:2606.07527 compares pretrained models with randomly initialized ones, fine-tunes both on modern reasoning datasets, and evaluates them on competitive math and code benchmarks to argue that current LLM post-training mainly acts as distribution fitting.
#Fine-tuning #Reasoning #Benchmarking #Research release
why featured
HKR-H and HKR-R pass: the title challenges the post-training narrative and touches the reasoning-model training debate. HKR-K is weak because the summary gives no scores, scale, or reproducible detail, so it stays in all.
editor take
The paper fine-tunes random-init models too, but scores aren’t disclosed here; if close, RL post-training lore takes a hit.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K0·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
The paper probes three frozen video model families on IntPhys2 and MVP; V-JEPA performs best overall, and disrupting frame order substantially reduces performance, especially on MVP.
#Vision #Benchmarking #V-JEPA #VideoMAE
why featured
HKR-H/K/R pass: the paper tests physics understanding in video models with named benchmarks and a concrete shuffle result. As a single arXiv probing study with no model release or production claim, it stays in the 60–71 band.
editor take
V-JEPA leads on IntPhys2 and MVP; I read this as temporal representation strength, not video models understanding physics.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI
The study tests 44,800 ARC-AGI runs and finds that hand-crafted grid descriptors at 50% trajectory completion predict within-task solver success, with mean best-feature AUC reaching 0.885 and p < 0.001 under within-task label permutation.
#Reasoning #Benchmarking #Inference-opt #ARC-AGI
why featured
HKR-H/K pass: halfway success prediction on ARC-AGI is a real hook, with 44,800 runs and 0.885 AUC. HKR-R is weak because this stays in benchmark research, not a product or tooling shift.
editor take
44,800 ARC-AGI runs put 50%-trajectory features at AUC 0.885; I trust mid-run diagnostics more than scoreboards.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search
The paper introduces BGPS, a two-part framework that uses an LLM to generate attribute-neutral prompts and attribute classifiers on TTI internal representations to steer decoding, then tests it on Stable Diffusion 1.5 and a debiased model to find previously undocumented biases that worsen fairness metrics.
#Vision #Safety #Benchmarking #Stable Diffusion
why featured
HKR-H/K/R all pass, but this is a single arXiv paper with no disclosed bias scale, failure rate, or code link in the summary. Stable Diffusion 1.5 also keeps it in the 60–71 research-signal band.
editor take
BGPS tests Stable Diffusion 1.5 plus one debiased model; automated bias search looks more like red-teaming than evaluation hygiene.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
RAM: Reachability Across Morphologies
RAM predicts robot pose reachability with a morphology-conditioned implicit neural representation, trained on 3×10^10 forward-kinematics samples, reaching 86% F1, beating the baseline by 14%, and cutting inference time by three orders of magnitude.
#Robotics #Inference-opt #RAM #Research release
why featured
HKR-K is strong with concrete numbers; HKR-R is limited to robotics practitioners. The paper is useful but specialized, so it lands high in the 60–71 band rather than featured.
editor take
RAM trades 3×10^10 FK samples for 86% F1; I want the drop under real joint limits and payloads.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
The paper introduces Ego-MC-Bench for step-by-step mistake correction in cooking videos and Ego-CoMist, a synthetic counterfactual dataset for fine-tuning video LLMs, with experiments showing larger gains for smaller, efficient models suited to edge-device assistance.
#Multimodal #Vision #Fine-tuning #Ego-MC-Bench
why featured
HKR-H and HKR-K pass: the real-time correction angle is clickable, and the post names a benchmark, synthetic data, and a fine-tuning result. Missing result numbers and reproducibility details keep it in the 60–71 band.
editor take
Ego-MC-Bench tests live cooking-error fixes; no scores disclosed. Small edge video LLMs gaining from synthetic data is the practical hook.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
BUDDY: Budget-Driven Dynamic Depth Routing for Adaptive Large Language Model Inference
BUDDY uses a lightweight Decision Module to select top-k Transformer layers under a compute budget, and experiments on Llama-family and Qwen models show support for multiple budgets in one trained model and decode-time rerouting.
#Inference-opt #Llama #Qwen #Research release
why featured
HKR-K and HKR-R pass: BUDDY proposes budget-based layer selection and decode-time rerouting for inference cost control. With only abstract-level detail and no disclosed open-source artifact, benchmark gains, or production proof, it stays in all.
editor take
BUDDY routes top-k layers by budget on Llama/Qwen; no latency numbers disclosed, so I file it under controllable depth pruning.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
PACT constrains confidence on safety-related tokens during downstream fine-tuning, matching an aligned reference model at each response step; the arXiv abstract says the code is available, but the snippet does not disclose benchmark numbers.
#Fine-tuning #Safety #Alignment #PACT
why featured
HKR-H/K/R pass, but the feed provides mechanism and open-source status without benchmark numbers or test results. This is useful safety fine-tuning research, not a same-day featured item.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
No Free Lunch for Synthetic Images under Data Scarcity Conditions
The paper evaluates VAE, GAN, and DDPM on MNIST, OCTMNIST, and OrganAMNIST, finding that after differential privacy noise is added during training, GAN and DDPM retain stronger fidelity and downstream utility across noise levels, while VAE degrades faster under tighter privacy constraints.
#Benchmarking #Safety #Research release #Benchmark
why featured
HKR-H/K/R pass: the paper gives a concrete synthetic-image benchmark under data scarcity and DP noise. It remains a single research release, not a major model or product update.
editor take
Across MNIST, OCTMNIST, and OrganAMNIST, GAN/DDPM handle DP noise better; stop treating VAE as the default privacy synthetic-data baseline.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models
The paper proposes JustGRPO for diffusion language models, dropping arbitrary-order generation and applying standard Group Relative Policy Optimization, reaching 89.1% accuracy on GSM8K while retaining parallel decoding ability.
#Reasoning #Inference-opt #Benchmarking #Research release
why featured
HKR-H/K pass: the title has a counterintuitive hook and the summary gives JustGRPO plus 89.1% on GSM8K. It stays in the 60–71 band because this is a technical arXiv method paper without adoption or broad industry heat.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
KAHM replaces online Transformer query encoding on an Austrian-law retrieval benchmark with 5,000 test queries, reaching MRR@20 of 0.504, Hit@20 of 0.694, Top-1 Accuracy of 0.411, and 8.53x lower per-query CPU time than direct Transformer encoding.
#Embedding #Inference-opt #RAG #Mixedbread
why featured
HKR-K and HKR-R pass: the benchmark numbers are concrete and the latency claim matters for RAG. But this is a narrow arXiv methods paper with a high technical barrier and no product or open-source impact shown.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
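The retrieval metrics quoted in the KAHM item above (MRR@20, Hit@20, Top-1 accuracy) are all functions of the rank at which the first relevant document appears for each query. A minimal sketch of the usual definitions follows; the function name and the sample ranks are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def retrieval_metrics(ranks, k=20):
    # ranks: 1-based rank of the first relevant document per query,
    # or None when no relevant document was retrieved at all.
    n = len(ranks)
    in_top_k = [r for r in ranks if r is not None and r <= k]
    return {
        "mrr@k": sum(1.0 / r for r in in_top_k) / n,  # ranks beyond k contribute 0
        "hit@k": len(in_top_k) / n,
        "top1": sum(1 for r in ranks if r == 1) / n,
    }

# Hypothetical ranks for four queries.
print(retrieval_metrics([1, 3, None, 25], k=20))
```

Because a hit at rank 20 adds only 0.05 to the reciprocal-rank sum, MRR@20 can sit well below Hit@20, which is the pattern in the reported 0.504 versus 0.694.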
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Item Response Scaling Laws: A Measurement Theory Approach for Efficient Neural Scaling Estimation
IRSL integrates Item Response Theory into scaling laws, reducing parameter complexity for M models and N questions from O(M×N) to O(M+N), and reports scaling estimates using only 50 questions per benchmark after one-time calibration on existing model responses.
#Benchmarking #Reasoning #Research release #Benchmark
why featured
IRSL offers a testable eval-efficiency claim, but this is a single arXiv paper with a dense measurement-theory title; HKR-K/R pass, HKR-H misses, so it stays in all.
editor take
IRSL estimates scaling from 50 items after 6,612-checkpoint calibration; I buy the efficiency, not broad benchmark transfer.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
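The O(M×N) to O(M+N) reduction in the IRSL item above follows from how Item Response Theory factorizes responses: one parameter per model and one per question, instead of one per model-question cell. The sketch below uses the one-parameter (Rasch) form as an assumption; the feed does not say which IRT variant the paper uses, and the names here are hypothetical.

```python
import math

def p_correct(ability, difficulty):
    # Rasch (1PL) model: probability that a model with the given ability
    # answers a question with the given difficulty correctly.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def parameter_counts(num_models, num_questions):
    # Free parameters needed to describe every model-question outcome.
    return {
        "per_cell": num_models * num_questions,  # one accuracy per pair
        "irt": num_models + num_questions,       # one ability + one difficulty each
    }

print(round(p_correct(1.0, 0.0), 3))
print(parameter_counts(num_models=100, num_questions=50))
```

Once question difficulties are calibrated, a new model only needs its single ability estimated, which is why a small question subset can be enough.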
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
The paper introduces SySRs, a hyperparameter-free bandit algorithm that adds paired comparisons to Successive Rejects and uses model similarity to identify the best LLM, reporting lower average error rates across 15 standard benchmarks and lower worst-case budget for reliable best-model identification.
#Benchmarking #Research release #Benchmark
why featured
HKR-H/K/R all pass, but this is still a methods paper: the disclosed facts are the SySRs algorithm and 15 benchmark tests, with no adoption or tooling release. Upper 60–71 band.
editor take
SySRs cuts average error across 15 benchmarks; savings per API call are undisclosed, so I’d inspect the repo before trusting it.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
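The SySRs item above builds on Successive Rejects, a fixed-budget best-arm identification algorithm. Below is a minimal sketch of plain Successive Rejects only, with each "arm" standing in for a candidate LLM and each pull for one scored evaluation. The paired comparisons and model-similarity terms that SySRs adds are not shown, and the `pull` callback and accuracy values are hypothetical.

```python
import math
import random

def successive_rejects(pull, num_arms, budget):
    # Plain Successive Rejects: run num_arms - 1 phases, dropping the arm
    # with the lowest empirical mean at the end of each phase.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    totals = [0.0] * num_arms
    counts = [0] * num_arms
    prev_n = 0
    for phase in range(1, num_arms):
        # Cumulative pulls per surviving arm by the end of this phase.
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(n_k - prev_n):
                totals[arm] += pull(arm)
                counts[arm] += 1
        prev_n = n_k
        worst = min(active, key=lambda a: totals[a] / counts[a])
        active.remove(worst)
    return active[0]

# Hypothetical per-model accuracies; each pull scores one benchmark item as 0 or 1.
accuracies = [0.2, 0.5, 0.9, 0.4]
rng = random.Random(0)
best = successive_rejects(lambda a: 1.0 if rng.random() < accuracies[a] else 0.0,
                          num_arms=len(accuracies), budget=400)
print(best)
```

The budget is spent unevenly by design: weak candidates are dropped early, so most evaluations go to the models that are hardest to tell apart.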
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Domain-Adapted Small Language Models with Hybrid Post-Processing for Cost-Efficient Low-Latency Multi-Label Structured Prediction
The authors fine-tune LLaMA 3.1 8B with LoRA on 219 curated examples and add rule-based postprocessing, reaching 83.0% overall accuracy and 100% JSON validity on 53 unseen production transcripts.
#Fine-tuning #Inference-opt #Tools #LLaMA
why featured
HKR-K and HKR-R pass: the sample count, blind-test size, and JSON-validity result give concrete evidence, and SLM deployment touches cost and latency. Single arXiv paper with tiny evaluation keeps it below featured.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through Imitation Learning
The paper formulates LLM self-play fine-tuning as a min-max game between the model and a regularized implicit reward player, then proposes a self-play imitation fine-tuning algorithm using a χ²-divergence variational objective with bounded rewards.
#Fine-tuning #Alignment #Reasoning #Research release
why featured
HKR-H and HKR-K pass: the title has a reversal, and the body states a concrete training mechanism. The arXiv item is theory-heavy and gives no result numbers or production claim, so HKR-R fails and it stays in all.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records
The paper evaluates GAN-based, VAE-boosted, diffusion-based, and masked modelling on the 50,000-person PRIME-CVD cohort; all four paradigms reproduce marginal distributions, but none simultaneously preserve subgroup structure, effect estimates, and dependency structure for structured electronic medical records.
#Benchmarking#PRIME-CVD#Research release#Benchmark
why featured
HKR-H/K/R pass: the paper has a concrete failure finding on a 50k cohort. Scope is narrow (synthetic EMR evaluation, with no product artifact or wider industry uptake), so it stays in all.
editor take
Four model families passed marginals on 50k PRIME-CVD records; judging synthetic EHRs by similarity alone is self-deception.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint
FIT-Print uses targeted fingerprints to verify model ownership, and evaluations report a 100% defense success rate against false-claim attacks, 0.0% false alarms on independent models, and a 100% ownership verification rate under diverse model reuse techniques.
#Safety#Benchmarking#FIT-Print#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv paper with metrics only; code, reproducibility conditions, and adoption are not disclosed. It stays in the 60–71 research-signal band.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Can LLMs Extract Scientific Consensus? A Case Study in High-Temperature Superconductivity
The paper evaluates LLM extraction of scientific consensus by building a knowledge graph from nearly 18,000 highly cited high-temperature superconductivity papers, linking competing mechanisms, material families, evidence types, and citation relations across seven decades.
#Reasoning#RAG#Benchmarking#Research release
why featured
HKR-H/K pass: the consensus-extraction question is a real hook, and the paper gives a ~18k-paper KG setup. HKR-R is weak because the superconductivity case stays niche, so this lands in all, not featured.
editor take
LLM graphs cover 18,000 HTS papers; extraction is fine, but citation-shaped “consensus” can masquerade as physics.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Enabling KV Caching of Shared Prefix for Diffusion Language Models
Younghun Go and four coauthors propose bicache, a bidirectional prefix caching method that dynamically selects safe shallow layers for reusing shared-prefix KVs in diffusion language models, improving serving throughput by 36.3%–98.3% over existing techniques while keeping accuracy differences at 0–1.8%.
#Inference-opt#Younghun Go#Jaehoon Han#arXiv
why featured
HKR-H/K/R pass, but this is a narrow inference-systems paper rather than a model or product release. No hard exclusion applies; it lands in the upper 60–71 research-signal band.
editor take
bicache lifts DLM serving throughput 36.3%–98.3%; diffusion LMs need boring prefix-cache plumbing before serving hype lands.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Diffuse AI Control on Fuzzy Tasks
The paper introduces a Diffuse AI Control game framework where a blue team trains against a weak scorer and a red team uses multi-objective evolutionary prompt optimization, testing the setup on writing experimental proposals for research questions from recent ML papers.
#Alignment#Safety#Benchmarking#Opus 4.6
why featured
HKR-H/K/R pass, but this is a single arXiv methods paper with no disclosed code, result numbers, broad uptake, or large-scale study. It stays in the 60–71 band rather than featured.
editor take
Opus 4.6 loses to GPT-OSS-20B on proposals yet fools the weak scorer; fuzzy-task control finally looks like red-teaming.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework
OrderDP randomly selects a subset and then keeps top-q samples, and evaluations on CIFAR-10, CIFAR-100, and ImageNet-1K report over 40% lower training cost with competitive accuracy and stable convergence.
#Fine-tuning#Inference-opt#Benchmarking#OrderDP
why featured
HKR-H/K/R all pass, but this is a single arXiv training-efficiency paper with impact shown mainly on vision benchmarks; 68 keeps it in all, below featured.
editor take
OrderDP cuts training cost over 40% on ImageNet-1K/CIFAR; the guarantee is tied to surrogate loss, not magic lossless training.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
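For the OrderDP item above, a minimal sketch of the two-step selection the summary describes: draw a random subset, then keep the top-q samples. Ranking by per-sample loss is an assumption here (the summary does not say what score is used), and `select_batch` and the loss values are hypothetical.

```python
import random

def select_batch(losses, subset_size, q, rng):
    """Sample a random subset of indices, then keep the q with the highest loss.

    Assumption: samples are ranked by current per-sample loss; the paper's
    actual ranking score is not given in the summary.
    """
    subset = rng.sample(range(len(losses)), subset_size)
    return sorted(subset, key=lambda i: losses[i], reverse=True)[:q]

rng = random.Random(0)
losses = [0.1, 2.3, 0.7, 1.9, 0.05, 1.1, 0.4, 3.0]  # made-up per-sample losses
kept = select_batch(losses, subset_size=6, q=3, rng=rng)
print(kept)
```

The random first step keeps every sample reachable in each round; the top-q step is where the training-cost saving would come from.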
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
The paper reformulates scaled dot-product attention with Mathematics of Arrays (MoA) and derives a Denotational Normal Form (DNF) that removes the transposed-key buffer and softmax temporaries. It reports O(n·dk+n·dv) data movement versus O(n²+n·dk+n·dv) for standard attention, numerical verification against PyTorch in double precision, and projected 2–100× speedups with 2–50× energy reduction.
#Inference-opt#Reasoning#PyTorch#DARPA
why featured
HKR-H/K/R pass, but this is a low-level attention-kernel math paper with no disclosed reproducible implementation or framework path. Technical-accessibility penalty keeps it below featured.
editor take
MoA cuts attention data movement to O(n·dk+n·dv); the 2–100× speedup is modeled, so wait for code versus FlashAttention.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
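For the attention item above, a quick arithmetic sketch of what the two data-movement bounds imply. It counts elements only and ignores the constants that big-O notation hides; the sequence length and head dimensions are illustrative choices, not values from the paper.

```python
def standard_movement(n, dk, dv):
    """Element count for the O(n^2 + n*dk + n*dv) bound (constants dropped)."""
    return n * n + n * dk + n * dv

def reduced_movement(n, dk, dv):
    """Element count for the O(n*dk + n*dv) bound (constants dropped)."""
    return n * dk + n * dv

n, dk, dv = 4096, 64, 64  # illustrative sizes, not taken from the paper
ratio = standard_movement(n, dk, dv) / reduced_movement(n, dk, dv)
print(ratio)  # 33.0
```

The ratio simplifies to 1 + n/(dk+dv), so the gap widens linearly with sequence length. That is a statement about the asymptotic bounds, not a measured speedup.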
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy
The paper reorganizes LLM pruning methods by GEMM’s M/N/K dimensions and benchmarks their real inference acceleration with a unified framework; during prefill, the Pareto frontier shifts from static depth pruning at 0%–4% quality loss, to dynamic depth at 5%–16%, and to static width pruning at 17%–26%.
#Inference-opt#Benchmarking#EIT-NLP#Research release
why featured
HKR-H/K/R pass, but this is a systems-heavy arXiv benchmark on GEMM and pruning, not a broad product or model release. Lower-band default keeps it at 68 and tier all.
editor take
EIT-NLP maps pruning to GEMM axes and shows prefill frontiers shift across 0%–26% loss; FLOPs-only pruning claims deserve less trust.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Beyond Item IDs: Scaling Short-Form-Video Recommendation via Semantic-Native Long Sequence Modeling
The paper presents a production-deployed short-form video recommendation framework that uses Semantic IDs and a Global-Aware Compression Transformer to model ultra-long watch histories at billion-user scale; offline profiling shows an order-of-magnitude peak-memory reduction, while the abstract does not disclose exact online A/B lift values.
#Embedding#Inference-opt#Research release
why featured
HKR-K/R pass: production-deployed framework, concrete mechanisms, and a memory number. HKR-H is weak, and online A/B lift is not disclosed, keeping it below featured.
editor take
Semantic IDs cut recommender peak memory by 10x; without disclosed A/B lift, this stays credible engineering, not product proof.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Stage-1 Controls the Entropy Regime, Not the Outcome
The study compares three Stage-1 warm starts on Qwen2.5-VL-7B using a 72B VLM teacher, finding Geometry3K validation clustered at 53%–54%; OPD enters RL with higher policy entropy, but endpoint pass@16 differs by at most 1.1 points.
#Fine-tuning#Multimodal#Reasoning#Qwen
why featured
HKR-H and HKR-K pass: the paper has a counterintuitive claim and concrete results. HKR-R is weak because the VLM/RL training detail has narrow reach, so it stays in the 60–71 band.
editor take
Qwen2.5-VL-7B Stage-1 choices end within 1.1 pass@16 points; OPD buys entropy, not payoff.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Code Is More Than Text: Uncertainty Estimation for Code Generation
The paper proposes a three-axis uncertainty estimator for code generation and raises average AUROC from 0.696 to 0.776 across five code LLMs; on Qwen3-14B, single-pass Top-K token entropy matches the strongest multi-pass baseline at under one-third of the cost.
#Code#Benchmarking#Safety#Qwen
why featured
HKR-K/R pass with concrete AUROC and cost claims, but this is a single research paper without release, replication, or product impact, so it stays in the 60–71 band.
editor take
Three-axis UE lifts five code LLMs to 0.776 AUROC; I buy it, code confidence needs code-native signals.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
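For the code-uncertainty item above, a minimal sketch of a single-pass Top-K token entropy score. Renormalizing the top-K probabilities and averaging over tokens are assumptions; the summary does not give the paper's exact definition, and the probability values below are made up.

```python
import math

def topk_entropy(topk_probs):
    """Entropy (in nats) of one token's top-K probabilities after renormalizing."""
    total = sum(topk_probs)
    return -sum((p / total) * math.log(p / total) for p in topk_probs if p > 0)

def sequence_uncertainty(per_token_topk):
    """Mean top-K entropy over all generated tokens (assumed aggregation)."""
    return sum(topk_entropy(t) for t in per_token_topk) / len(per_token_topk)

# Made-up top-3 probabilities for two short generations.
confident = [[0.97, 0.02, 0.01], [0.90, 0.06, 0.04]]
unsure = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]
print(sequence_uncertainty(confident), sequence_uncertainty(unsure))
```

A score like this needs only the log-probabilities from one decoding pass, which is why it can be cheaper than multi-sample baselines.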
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
The paper introduces PPV, an unsupervised delegation-based aggregator for multi-sample LLM inference, and reports a +1.5 pp gain over majority voting on MMLU-Pro, with +2.24 pp on 8,099 non-trivial samples under paired McNemar p ≈ 1.0e-14.
#Reasoning#Embedding#Inference-opt#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv aggregation paper. The disclosed evidence is +1.5 pp/+2.24 pp on MMLU-Pro, with no major lab signal, artifact, or production replacement claim, so it stays in 60–71.
editor take
PPV beats majority voting by 1.5 pp on MMLU-Pro; grouping 128 samples into 16 groups only suits inference budgets with room to spare.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
The paper introduces a pre-intervention screening framework for SAE steering side effects, evaluating GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries, with a Llama Scope width comparison from 32K to 128K.
#Interpretability#Safety#Benchmarking#GPT-2
why featured
HKR-K and HKR-R pass via concrete SAE steering tests and safety relevance. HKR-H is weak because the angle is niche interpretability, with no product impact or broad discussion disclosed.
editor take
Across 4 models and 3 SAE types, steering side effects are forecastable; I trust it more because Gemma-2-2B breaks the story.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning
The paper introduces CoDaPO, which scores each question using rollout confidence and empirical difficulty, then reweights policy updates and resamples high-value learnable questions; across 12 benchmarks, it reports higher accuracy than existing RL methods under a fixed compute budget.
#Reasoning#Fine-tuning#Benchmarking#TMLR Group
why featured
HKR-H and HKR-K pass: the title has a sample-difficulty hook, and the summary states CoDaPO’s mechanism plus 12 benchmarks. Missing named-lab weight, code details, effect sizes, and deployment relevance keeps it in the 60–71 band.
editor take
CoDaPO beats existing RL on 12 benchmarks; spending samples on learnable questions looks saner than another GRPO-loss tweak.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization
The paper proposes ISPO, a policy-optimization method that densifies RLVR rewards using the policy’s own conditional probabilities, and reports stronger results than GRPO-style baselines across three base models and five mathematical reasoning benchmarks, with larger gains on harder benchmarks where zero-advantage collapse appears more often.
#Reasoning#Alignment#Benchmarking#Research release
why featured
HKR-K/R pass: ISPO has a concrete mechanism and GRPO comparison for reasoning training. HKR-H is weak, and the post lacks gain sizes, code, or replication details, so it stays in the lower research band.
editor take
ISPO beats GRPO across 3 bases and 5 math benchmarks; self-probability reward densification looks less brittle than binary RLVR.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
The paper introduces an end-to-end LLM compression framework that jointly searches structural pruning and mixed-precision PTQ policies; at 1–3 bits, it reports up to 59% lower WikiText perplexity than leading weight-only quantization baselines.
#Inference-opt#Benchmarking#Research release#Benchmark
why featured
HKR-K is strong: 1–3 bit joint structural pruning plus mixed-precision PTQ, with up to 59% lower WikiText perplexity. HKR-H is weak and the paper is infra-specialist, so it stays in all.
editor take
This targets brutal 1–3 bit compression; 59% lower WikiText perplexity is nice, but no model size or latency is disclosed.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Neural Field Tokenizations with Hierarchy and Spatial Locality Priors
LH-NeF replaces meta-learning inner loops with one forward pass, uses 42× less memory, and supports 133× larger batches than the strongest modality-agnostic baseline across images, 3D shapes, and climate fields.
#Multimodal#Embedding#Inference-opt#LH-NeF
why featured
HKR-K is strong: one forward pass replaces the meta-learning inner loop, with 42x memory and 133x batch claims. HKR-H has an efficiency hook, but no code or product adoption is disclosed, keeping it in 60–71.
editor take
LH-NeF cuts memory 42× with one forward pass; I buy the direction, but cross-modal wins need code-backed replication.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Adversarial Robustness of Activation Steering in Large Language Models
The paper evaluates activation steering robustness under adversarial text perturbations across four extraction methods, three attack strategies, six personas, and five 1.5B–30B parameter models, finding directional robustness drops up to 64% and optimal steering layers shift by up to 17 positions under perturbation.
#Alignment#Safety#Interpretability#Anthropic
why featured
HKR-K/R pass: the evaluation matrix is concrete and the reliability question matters. HKR-H is weak, and no artifact is disclosed, so this stays below featured.
editor take
Activation steering loses up to 64% robustness under 3 attacks; treating it as a safety control surface looks reckless.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
MetaEvaluator meta-learns over a pool of reference models to evaluate unseen models on unlabeled datasets without per-model retraining; the arXiv abstract says the code is available on GitHub.
#Benchmarking#Fine-tuning#Multimodal#MetaEvaluator
why featured
HKR-K and HKR-R pass: the method targets unlabeled evaluation cost and claims open code. HKR-H is weak, and the summary gives no accuracy, cost-reduction, or benchmark numbers, so it stays mid-band.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Sequential Statistical Inference for Large Language Models: Representation, Validity, and Monitoring
The paper frames trustworthy LLM deployment as statistical process control and defines three tasks: representation, validity, and monitoring under dependent interactions, repeated use, adaptation, model updates, and distribution shifts.
#Safety#Benchmarking#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: it recasts trusted LLM deployment as statistical process control under dependence, reuse, and drift. HKR-H is weak, and no experiment numbers or tools are disclosed, so it stays in all.
editor take
This paper frames LLM deployment as statistical process control; no experiments disclosed, but the missing piece is temporal validity.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models
Sparrow uses a dynamic sparsity schedule to keep the lower-tail sparse-to-dense actor-policy mismatch near a threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B.
#Reasoning#Inference-opt#Fine-tuning#Qwen
why featured
HKR-K is strong: the mechanism and three Qwen3 speedup numbers are concrete. HKR-R comes from long-context RL training cost, but HKR-H is weak and the angle is too technical for featured.
editor take
Sparrow gets 2.0–2.4x rollout speedups on Qwen3-1.7B/4B/8B; RLVR’s long-CoT tax now has a concrete tail-mismatch knob.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings
The paper runs 41 games across four conditions in a 7-player Warring States Diplomacy variant, finding that per-round reflective symbolic prompts change winner distributions while the framework-receiving agent, Han, never wins.
#Agent#Reasoning#Alignment#Research release
why featured
HKR-H/K/R all pass, but this is a single arXiv game-study with limited reach beyond the abstracted setup. It fits the 60–71 band as useful agent-safety research, not a same-day must-write.
editor take
In 41 Diplomacy-variant games, prompt scaffolds shifted winners but Han won zero; this smells like reflection-induced system noise, not symbolism.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
SpectrumKV changes KV cache transfer in prefill-decode disaggregated serving into per-token precision allocation across FP16, INT8, and INT4, using three NIAH probe trials to decide INT4 tolerance; at b=0.5, transfer-path GPU timing shows 50–62% TTFT reductions.
#Inference-opt#Benchmarking#Qwen#Mistral
why featured
HKR-K/R pass: the paper gives a concrete mechanism and 50–62% TTFT reduction, with clear cost/latency relevance. HKR-H is weak, and the LLM-serving infra focus keeps it in all.
editor take
SpectrumKV cuts TTFT 50–62% at b=0.5; the catch is screening INT4-hostile models like Qwen before using three-tier KV transfer.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
C³ache: Accelerating World Action Models with Cross Inference Chunk Cache
C³ache reuses residuals from the same denoising step across adjacent inference chunks, and experiments with a Fast-WAM backbone report up to a 2.5× reduction in total wall-clock inference time with negligible task-success degradation.
#Robotics#Inference-opt#Vision#C³ache
why featured
HKR-H/K/R pass, but this is a narrow arXiv inference-optimization paper. The 2.5x Fast-WAM result is useful, yet its audience is mainly robotics/world-action-model practitioners, below featured threshold.
editor take
C³ache gets 2.5× speedup by reusing cross-chunk residuals; training-free is nice, but smooth-motion assumptions break on contact-heavy robotics.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Report the Floor: A Training-Free Conformal Interval Is a Mandatory Baseline for Probabilistic Time-Series Forecasting
The paper evaluates ConformalNaive on 2,217 real series from nine public sources: in one-step online forecasting, it beats CSP on 71% of series, with a 95% bootstrap CI of [69%, 73%].
#Benchmarking#arXiv#Monash#DeepNPTS
why featured
HKR-H/K/R all show up via the training-free baseline and concrete 2,217-series result, but the topic is narrow probabilistic time-series forecasting, so it stays in the 60–71 band.
editor take
ConformalNaive beats CSP on 71% of 2,217 series; plenty of learned forecasters still fail the floor test.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
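For the ConformalNaive item above, a minimal sketch of a training-free conformal interval around a naive forecast. The summary does not spell out the paper's construction, so this assumes a last-value forecast with a split-conformal-style quantile of past one-step errors; the function name and series are hypothetical.

```python
import math

def conformal_naive_interval(history, alpha=0.1):
    """One-step interval: last value +/- a conformal quantile of past naive errors.

    Assumption: the naive forecast is the last observed value and the scores
    are absolute one-step errors; history needs at least two points.
    """
    errors = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    n = len(errors)
    # Finite-sample conformal rank, clipped to the largest observed error.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    width = errors[rank - 1]
    return history[-1] - width, history[-1] + width

series = [10.0, 10.5, 10.2, 11.0, 10.8, 11.4, 11.1, 11.9, 12.0, 12.6]  # made up
low, high = conformal_naive_interval(series, alpha=0.2)
print(low, high)
```

Nothing here is fitted, which is the point of a floor baseline: a learned forecaster that cannot beat it has not earned its training cost.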
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
SoK: Reconstruction Attacks on Synthetic Tabular Data
The paper evaluates 14 reconstruction attacks, 9 synthetic data generation methods, and 5 benchmark datasets, finding that the SDG method drives risk more than attack choice and that differential privacy mainly protects at budgets of ε≤1.
#Safety#Benchmarking#NIST#Research release
why featured
HKR-K/R are strong: the paper gives a 14/9/5 evaluation grid and a DP threshold at ε≤1 for synthetic-data risk work. HKR-H has the NIST CRC hook, but this remains a specialized privacy paper below featured threshold.
editor take
14 attacks hit 9 SDG methods; the generator drives risk, and DP mostly stops protecting once ε>1, bad news for synthetic-data compliance theater.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
LoTUS evaluates machine unlearning on Transformer and ResNet18 models against 8 baselines across 5 public datasets, adds ImageNet1k for large-scale retrain-free conditions, and introduces RF-JSD to measure unlearning without full retraining.
#Fine-tuning#Benchmarking#LoTUS#ImageNet1k
why featured
HKR-K/R pass: the paper provides concrete evaluation settings and addresses machine-unlearning governance. HKR-H is weak, and this is a single arXiv paper with no adoption or code signal, so it stays in 60–71.
editor take
LoTUS tests 5 datasets against 8 baselines; RF-JSD is useful, but the SOTA claim needs deletion sampling and attack results.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Need We Teach Foundation Models What Is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation
The paper proposes gradient-free generative artifact detection by reframing binary classification as OOD anomaly measurement; its reported extreme zero-shot setup trains on face forgeries and tests on universal Text-to-Image generations.
#Vision#Safety#Inference-opt#Research release
why featured
HKR-H/K/R all pass: the paper offers a concrete gradient-free detection mechanism and test setting. It stays in the 60–71 all band because no large benchmark, code release, or deployment evidence is disclosed.
editor take
They train on face forgeries and test T2I; without datasets or scores, I don’t buy “significantly outperforms.”
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Decoy-Calibrated Failure Audits for Language Models
Janus filters language-model failure explanations with frequency-matched random decoys and held-out replication; on LongBench v2, a fixed threshold reported 20 descriptors, the decoy floor left one, and the holdout check rejected it after lift shrank from 0.36 to 0.05.
#Benchmarking#Safety#Interpretability#Janus
why featured
HKR-K is strong and HKR-R matters for eval and safety-audit builders; HKR-H is weak because the angle is buried in technical wording. No hard exclusion, but as an arXiv methods paper it stays in the interesting-not-featured band.
editor take
Janus cuts 20 LongBench v2 failure descriptors to zero; LLM audits need less storytelling and more held-out replication.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Differentially Private Synthetic Data via APIs 4: Tabular Data
The paper introduces Tab-PE, an evolutionary algorithm for differentially private synthetic tabular data that uses heuristic tabular operators instead of foundation models, and reports up to 10% higher classification accuracy than AIM while running 28 times faster on datasets with high-order correlations.
#Safety#Benchmarking#AIM#Research release
why featured
HKR-K is strong and HKR-R is moderate: the article has a concrete mechanism and 10%/28x claims. HKR-H is weak, and the DP tabular-data angle is specialized, so it stays in the 60–71 band.
editor take
Tab-PE beats AIM by up to 10% accuracy and 28× speed; for DP tables, heuristic operators look cleaner than foundation-model PE.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video RAG
The paper presents a two-stage training-free Video RAG pipeline: a high-recall retrieval stage uses visual summaries and global text descriptions, then an A.I.R. filtering agent reranks candidates with full multimodal context and returns JSON with chunk-level citations.
#RAG#Multimodal#Agent#MAGMaR
why featured
HKR-K passes on the concrete pipeline mechanism, and HKR-R passes on Video RAG citation and training-free deployment pain. HKR-H is weak, and the post lacks benchmarks, datasets, and comparisons, so it stays in 60–71.
editor take
MAGMaR shows a 2-stage training-free Video RAG recipe; no scores disclosed, so it reads like plumbing, not proof.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
STAR-KV uses differentiable thresholding for attention-head and block-level rank control, reaching up to 75% KV cache compression across multiple LLMs and benchmarks, and up to 20x total KV cache reduction when combined with quantization.
#Inference-opt#STAR-KV#Triton#Research release
why featured
HKR-K and HKR-R pass: the paper gives a concrete mechanism and compression numbers, and maps to inference-cost pressure. HKR-H is weak, and the topic is narrow inference optimization, so it stays in all.
editor take
STAR-KV claims 75% KV compression and 6.9x attention speedup; strong, but the snippet lacks long-context latency curves.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
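For the STAR-KV item above, a minimal sketch of how soft thresholding can set a rank: shrink each singular value by a threshold and count what survives. This covers only the thresholding step under assumed inputs; the learned, differentiable per-head and per-block thresholds from the paper are not reproduced, and the spectrum is made up.

```python
def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau, clamping at zero."""
    return [max(s - tau, 0.0) for s in singular_values]

def effective_rank(singular_values, tau):
    """Number of singular values that survive soft thresholding."""
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)

spectrum = [9.0, 4.0, 1.5, 0.6, 0.2, 0.1, 0.05, 0.01]  # made-up singular values
rank = effective_rank(spectrum, tau=0.5)
print(rank, rank / len(spectrum))
```

Because `max(s - tau, 0)` is piecewise linear in `tau`, a threshold like this can be trained with gradients, unlike a hard top-r cutoff.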
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
FiberTune improves VLA fine-tuning across six controlled simulation settings and physical SO-101 pick-place, with SR(5) on long-horizon CALVIN ABC-to-D rising by 10.7 percentage points and SO-101 task success increasing from 72.7% to 78.1% under identical training conditions.
#Fine-tuning#Vision#Robotics#FiberTune
why featured
HKR-K/R pass on cross-sim and SO-101 results; HKR-H is weak because the title is specialist. Useful for embodied-AI practitioners, but no code or broad replication is disclosed, so it stays in the 60–71 all band.
editor take
FiberTune gains across 6 sims and SO-101; I buy the mechanism, VLA fine-tuning has long trashed visual residuals.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for MLLMs Under Visual Saturation
DPVR-LF routes vision tokens at the saturation point into a one-layer side branch, runs a 13-layer text-only forward pass, and trains about 3% of parameters while preserving competitive multimodal benchmark performance.
#Multimodal#Vision#Inference-opt#LLaVA-1.5
why featured
HKR-H/K/R pass, but this is a single arXiv architecture-optimization paper. The text gives mechanism and parameter ratio, not broad deployment evidence or cross-model impact, so it stays in 60–71.
editor take
DPVR-LF trains about 3% of parameters and runs 13 layers text-only; I buy the bet: LLaVA-style vision tokens overstay deep stacks.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency
The study trains a 1.1B-parameter TinyLlama with the GPU, architecture, optimizer settings, and epoch count held fixed, and finds that parameter efficiency declines strictly monotonically as the training-token count rises from 500K to 1M to 2M.
#Benchmarking#Inference-opt#TinyLlama#Research release
why featured
HKR-K is solid: fixed setup, token counts, and a testable monotonic-efficiency claim. HKR-R comes from training cost, but HKR-H is weak and the 500K–2M-token scale keeps it in the 60–71 band.
editor take
TinyLlama 1.1B gets less parameter-efficient at each step from 500K to 1M to 2M tokens; tiny scale, but energy belongs in scaling tables.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEWarXiv · cs.LG· atomEN04:00 · 06·09
MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation
MilliVid uses a hierarchical token autoencoder and coarse-to-fine rollout to generate long Minecraft videos, preserving geometry and object permanence more consistently than existing baselines; the abstract does not disclose dataset size, frame counts, compute cost, or quantitative scores.
#Multimodal#Vision#MilliVid#Research release
why featured
HKR-H/K/R all pass, but the post gives mechanisms and qualitative baseline claims only; metrics, authors, code, and reproduction details are not disclosed. Treat as a regular arXiv research release in the 60–71 band.
editor take
MilliVid tackles long-video consistency with hierarchical tokens; dataset size, frame counts, compute, and scores are undisclosed, so don’t call it general video progress yet.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective
The paper formulates reward model optimization under KL regularization as a Stackelberg game, then evaluates a reward shaping scheme for inference-time alignment and reports win-tie rates above 66% against all baselines across evaluation settings.
#Alignment#Inference-opt#Research release#Safety/alignment
why featured
HKR-K and HKR-R pass: it has a concrete mechanism and a >66% win/tie claim. HKR-H is weak and the source detail is abstract-level, so this stays in the 60–71 research-update band.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG· atomEN04:00 · 06·09
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
The paper proposes HIVE, which selects prompts before rollouts using historical reward trajectories and prunes stale-utility instances with prompt entropy; experiments span multiple math reasoning benchmarks and models, but the abstract does not disclose the exact rollout-efficiency gains.
#Reasoning#Fine-tuning#Inference-opt#HIVE
why featured
HKR-K/R pass: the mechanism is concrete and targets RL training cost for reasoning models. No efficiency number is disclosed, and the paper remains training-specialist content, so the lower 60–71 band fits.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Distilling Safe LLM Systems via Soft Prompts for On-Device Settings
The paper evaluates multiple LLM architectures, training objectives, and parameter-efficient tuning methods, and finds that soft prompts with distillation training outperform LoRA adapters, steering vectors, and direct optimization for on-device safety alignment with minimal extra inference memory and compute.
#Fine-tuning#Safety#Alignment#Research release
why featured
HKR-K and HKR-R pass: the method comparison is concrete and on-device safety alignment has practical pull. HKR-H is weak, and the feed gives no datasets, model sizes, or absolute metrics, so it stays in all.
editor take
Soft-prompt distillation beats LoRA and steering vectors across evaluated architectures; no model sizes or benchmark numbers in the snippet, so hold the coronation.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Locality-Aware Redundancy Pruning for LLM Depth Compression
The paper proposes LoRP, a training-free one-shot depth pruning framework that uses a small calibration set to compute pairwise layer similarity and cluster layers, with experiments across multiple LLM families reporting gains in perplexity and downstream task accuracy.
#Inference-opt#LoRP#Research release#Open source
why featured
HKR-K and HKR-R pass: LoRP has a concrete pruning mechanism and cost relevance. HKR-H misses; the arXiv snippet lacks compression ratios, model sizes, code, and replication detail.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
The Value of Personalized Recommendations: Evidence from Netflix
The paper estimates a discrete choice model on Netflix viewership data and finds that replacing the current recommender with matrix factorization or popularity-based ranking would reduce engagement by 4% and 12%, respectively.
#Benchmarking#Netflix#Research release
why featured
HKR-H/K/R pass, but the impact is mainly recommender systems and platform economics, not a broad AI model or product update. Concrete Netflix counterfactuals put it in the high 60–71 band.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination
DICE formalizes multi-agent LLM systems as discounted incomplete-information Markov games and introduces HQRE, an entropy-regularized equilibrium with agent- and state-dependent temperatures; across 11 benchmarks in four domains, DICE-PC improves reasoning and planning accuracy by 4.3 percentage points on average, while DICE-FT improves it by 8.5 points.
#Agent#Reasoning#Fine-tuning#DICE
why featured
HKR-H/K/R all pass, but this is an arXiv method paper with benchmark gains, not a major lab release or production artifact. It fits the 60–71 research-signal band.
editor take
DICE reports +4.3/+8.5 points across 11 benchmarks; I buy the target: multi-agent LLMs lack equilibrium selection, not more personas.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
STARIXNet: Multivariate and Multi-attribute Deep Learning for Real-Time Cloud Resource Allocation
Ahmed Abdulaal and three coauthors present STARIXNet, a lightweight neural network for cloud microservice scaling that models multiple system metrics and reports 10% to 50% cost savings after deployment on critical Walmart production services.
#Inference-opt#Ahmed Abdulaal#Walmart#arXiv
why featured
HKR-H/K/R pass, but this is a cloud resource-allocation paper, not a model, agent, or major AI product update. The Walmart 10%–50% cost-saving claim lifts it into the useful 60–71 band, not featured.
editor take
STARIXNet reports 10%–50% Walmart production savings; multi-metric conservative scaling beats CPU-only autoscaling dogma.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Chiaroscuro Attention: Spending Compute in the Dark
CHIAR-Former routes each token to DCT spectral mixing, RBF kernel mixing, or full self-attention using per-token spectral entropy; its DCT+Attention variant reaches 36.54 validation perplexity on WikiText-103, versus 66.62 for a full-attention baseline, while using 62.5% fewer attention FLOPs.
#Inference-opt#Benchmarking#CHIAR-Former#Research release
why featured
HKR-K and HKR-R are strong: spectral-entropy token routing reports 36.54 WikiText-103 PPL and 62.5% lower attention FLOPs. As a single early arXiv architecture paper without production or frontier-model validation, it stays in all.
editor take
CHIAR-Former hits 36.54 PPL on WikiText-103 with 62.5% fewer attention FLOPs; I buy DCT+Attention, not the RBF garnish.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Position: Deployed Reinforcement Learning Should Be Continual
The paper argues that deployed RL agents should keep learning, identifies four post-deployment sources of non-stationarity, and positions train-then-fix as insufficient when agents receive evaluative reward signals.
#Agent#Reasoning#Research release#Commentary
why featured
HKR-H/K/R all pass, but this is an arXiv position paper; the summary discloses no experiments, benchmarks, or deployed case, so it stays in the 60–71 band.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems
Truong Xuan Khanh proposes the Hierarchical Emergence Framework and tests it on 111 modular arithmetic transformer experiments, where weight-norm peaks precede grokking in 92% of runs, normalized accuracy curves fit a tanh kink with R²=0.93, and grokked models converge to 0.9745±0.014 across initialization, weight decay, or training fraction.
#Reasoning#Interpretability#Benchmarking#Truong Xuan Khanh
why featured
HKR-H and HKR-K pass: the paper offers a testable grokking precursor with concrete experiment counts. Technical-accessibility concerns keep it below featured; HKR-R is weak for practitioners.
editor take
HEF gets a 92% pre-grokking norm signal across 111 runs; I buy the grokking fingerprint, not the biology-physics umbrella.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
CLASP: Language-Driven Robot Skill Selection and Composition Using Task-Parameterized Learning
CLASP combines task-parameterized kernelized movement primitives with pretrained VLMs for robot skill selection and composition, learning each skill from 2 to 5 kinesthetic demonstrations and reaching 73.3% to 100% success rates on a 7-DoF manipulator without fine-tuning.
#Robotics#Multimodal#Reasoning#CLASP
why featured
HKR-H/K pass via few-demo robot skill composition and success-rate numbers. HKR-R is weak, and this is a single arXiv paper without an open artifact or adoption signal, so it stays in the 60–71 band.
editor take
CLASP learns each skill from 2–5 demos; 73.3%–100% success is nice, but one 7-DoF setup is still lab robotics.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Optimizing Few-Step Generation with Adaptive Matching Distillation
The paper introduces AMD to detect and escape Forbidden Zones in few-step generation, raising SDXL HPSv2 from 30.64 to 31.25 and testing across image and video tasks including SDXL, Wan2.1, VBench, and GenEval.
#Multimodal#Vision#Inference-opt#arXiv
why featured
HKR-H/K/R pass, but the evidence is a paper method plus a modest metric gain, with no disclosed code, major model adoption, or production replacement claim. This stays in the 60–71 research-release band.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training
LEAF improves speech-aware large language model post-training with retrospective tree-based RL, assigning span-level advantages from descendant rewards and outperforming GRPO on speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget.
#Audio#Fine-tuning#Reasoning#LEAF
why featured
HKR-H comes from the counterintuitive title, and HKR-K from a testable RL method versus GRPO. No major lab, artifact, or cross-source cluster is disclosed, so this stays in the interesting research band.
editor take
LEAF beats GRPO under the same rollout and LoRA budget; span-level credit makes sense, but I want code and exact benchmark numbers.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation
Huawei-AI4Math released PyGeoX and the 300-problem PyGeoX-Bench, using Saturating Additive Rewards to improve the hard-tier geometric solving rate by 2.3x over an MSE-based reward baseline.
#Reasoning#Benchmarking#Tools#Huawei-AI4Math
why featured
HKR-K is strong: 300 benchmark tasks, SAR reward, and a 2.3x hard-tier gain are testable. The topic is narrow geometry reasoning with no product path disclosed, so it stays in the 60–71 band.
editor take
PyGeoX-Bench has 300 tasks, and SAR gives 2.3x hard-tier gains over MSE; the 8B frontier claim needs outside replication.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation
The paper proposes TLVS, a token-level visual-sensitivity steering method that adjusts steering strength at each decoding step, and evaluates it against prior steering methods on POPE, AMBER, CHAIR, MMHal, and HallusionBench.
#Vision#Multimodal#Alignment#Research release
why featured
HKR-K and HKR-R pass: TLVS gives a concrete decoding-time mechanism and named benchmarks for LVLM hallucination. HKR-H is weak, and the post does not disclose gains, code, or reproducibility details.
editor take
TLVS steers per decoding step across 5 hallucination benchmarks; I buy the direction, but the abstract gives no deltas.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
FunctionEvolve: Structure-Guided Symbolic Regression with LLMs
FunctionEvolve recovers 107 exact forms on the 129-task synthetic subset of LLM-SRBench, using Claude Opus 4.6 with expression-tree search to reach 82.9% SA@50 and 55.8% SA@1.
#Reasoning#Tools#Benchmarking#Claude Opus 4.6
why featured
HKR-K is strong and HKR-H passes on the formula-recovery hook. The work is still a synthetic symbolic-regression benchmark, so it stays in the 60–71 research-paper band rather than featured.
editor take
FunctionEvolve recovers 107/129 exact formulas; for LLM symbolic regression, tree structure beats prompt alchemy.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval
The paper introduces CausalNeg with two modules: CoT-guided counterfactual perturbation for negative construction and query-view entropy maximization during training, targeting source-dependent shortcuts in generated hard negatives; the authors provide code on GitHub.
#RAG#Embedding#Reasoning#CausalNeg
why featured
HKR-H/K/R pass, but the post gives mechanisms without benchmark numbers or production impact. This is a useful RAG/Embedding research release, not a same-day featured industry story.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Workflow for Acute Asthma Risk Assessment
AeroSpectra Sentinel combines STFT respiratory-sound analysis, lightweight ML screening, and a five-stage LLM prompt chain; on 584 recordings, a random forest reached 91.10% binary accuracy, and in 40 simulated clinical vignettes, the guardrail-plus-FHIR-schema variant produced the strongest safety and documentation consistency.
#Agent#Audio#Safety#AeroSpectra Sentinel
why featured
HKR-K is solid: the paper gives dataset size, accuracy, simulation count, and guardrail mechanism. HKR-H passes, but HKR-R is weak because this remains a niche clinical study with no deployment or cross-source signal.
editor take
AeroSpectra Sentinel hits 91.10% on 584 clips; I don’t buy the safety story from 40 simulated vignettes.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
VideoGPA uses a geometry foundation model to automatically derive dense preference signals and trains video diffusion models with DPO; the abstract says it improves temporal stability, geometric plausibility, and motion coherence with minimal preference pairs, but the snippet does not disclose dataset names, metric values, or model size.
#Multimodal#Vision#Alignment#VideoGPA
why featured
HKR-H and HKR-K pass: the method hook is clear and the mechanism is specific. But only abstract-level facts are available, with no benchmark numbers, model scale, or release details, so it stays mid-band.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
MOLOT System Card: Malicious Operational Logic Observation Transformer
MOLOT models static call graphs as behavior sequences to detect malicious code in PyPI and npm packages, adds explanations mapped to source locations, and releases Open Malicious-Code Bench; the abstract does not disclose specific accuracy, latency, memory, or false-positive numbers.
#Code#Interpretability#Benchmarking#MOLOT
why featured
HKR-K and HKR-R pass: the paper names a static-call-graph-to-behavior-sequence method and a PyPI/npm benchmark. HKR-H is weak, and missing accuracy, latency, and false-positive data keeps it in all.
editor take
MOLOT covers PyPI and npm, but no accuracy is disclosed; I trust the benchmark release more than the deployability claim.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning
ATLAS uses a lightweight verifier over intermediate hidden states to choose steering actions at inference time per example and step; the paper says it beats vanilla decoding and fixed steering on multiple math and coding benchmarks while reducing test-time tokens, but the abstract does not disclose exact scores.
#Reasoning#Inference-opt#Code#ATLAS
why featured
HKR-K and HKR-R pass: the mechanism is concrete and targets costly reasoning. HKR-H is weak, and the abstract omits benchmark scores or release details, so this stays an interesting research item, not featured.
editor take
ATLAS steers latents with a lightweight verifier per step; scores and token savings are undisclosed, so I’d file it under less-sampling reasoning.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy
The paper formulates LLM post-training as a Discrepancy-Constrained Markov Decision Process, using Lagrangian relaxation to dynamically weight reward maximization against train-inference alignment, and reports improved RL stability and performance on Qwen-3-8B and Qwen-3-30B-A3B under black-box discrepancy.
#Fine-tuning#Alignment#Inference-opt#Qwen
why featured
HKR-K/R pass because the paper names a mechanism and Qwen test models for RL post-training stability. Missing effect sizes, benchmarks, and reproducible settings keep it in the 60–71 research-signal band.
editor take
DCMDP constrains black-box train-inference mismatch; gains lack disclosed numbers, so treat the stability claim as unproven.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
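For the DCMDP item above: "Lagrangian relaxation" of a constrained objective usually takes the generic form below. This is the textbook constrained-MDP pattern, not the paper's equations. J (expected reward), D (train-inference discrepancy), the budget \epsilon, and the step size \eta are all assumed symbols.

```latex
\max_{\theta}\; J(\theta) \quad \text{s.t.} \quad D(\theta) \le \epsilon
\qquad\Longrightarrow\qquad
\mathcal{L}(\theta, \lambda) = J(\theta) - \lambda \big( D(\theta) - \epsilon \big), \quad \lambda \ge 0

\lambda \leftarrow \max\!\big( 0,\; \lambda + \eta\, ( D(\theta) - \epsilon ) \big)
```

Under this reading, the multiplier \lambda grows while the discrepancy exceeds its budget and shrinks toward zero otherwise, which is one way to get the dynamic weighting the summary describes.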
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
How Reliable Are Fairness Audits with Unreliable Data?
The paper tests protected-label missingness on ACS/Folktables tasks and finds that positive-availability missingness usually does not move selected mitigation methods beyond the complete-label seed floor.
#Safety#Benchmarking#arXiv#ACS
why featured
HKR-H and HKR-K pass: the title has tension, and the summary gives an ACS/Folktables missing-label result. The impact is limited to fairness-audit research, with no product or industry event hook.
editor take
This tests missing protected labels on ACS/Folktables; the villain is threshold optimization causing intersectional harm, not missingness.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
The paper introduces a behavioral cloning framework for scientific annotation with 9 synthetic tasks that model exploration, error correction, and strategic decisions; experiments show multi-task pretraining supports efficient fine-tuning to new tasks, while training from scratch fails entirely.
#Agent#Fine-tuning#Benchmarking#Research release
why featured
HKR-H/K pass, but this is a single arXiv methods paper without a known lab, artifact, or production-replacement claim. It fits the 60–71 research-interest band.
editor take
Nine synthetic annotation tasks show scratch training fails; I buy the pretraining signal, not the real-world leap yet.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
CURE fine-tunes a multimodal instruction model with error-aware curriculum learning for radiology report generation without extra data, improving grounding by +0.35 IoU, increasing CXRFEScore by +0.192, and reducing hallucinations by 18.6% on public datasets.
#Multimodal#Vision#Fine-tuning#CURE
why featured
HKR-K/R pass: the paper gives testable metrics and an error-aware curriculum mechanism, tied to medical-report hallucinations. HKR-H fails; as a narrow single arXiv paper with no product uptake, it stays in 60–71.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Disjoint Generation of Synthetic Data
The paper proposes a disjoint generation framework for tabular synthetic data: it partitions a dataset into disjoint subsets, fits separate generative model instances, and joins outputs without shared variables or identifiers. Case studies report higher empirical privacy measurements, improved feasibility for some model types, and mixed-model synthesis with competitive Accuracy and AUC while lowering empirical re-identification risk.
#Fine-tuning#Benchmarking#arXiv#Research release
why featured
HKR-K is clear: disjoint generation and identifier-free joining are testable mechanisms. HKR-R is moderate around privacy and compute cost, but HKR-H is weak and this is a single arXiv paper, so it stays in 60–71.
editor take
Disjoint Generation splits tabular data, trains separate generators, then joins outputs; dataset count is undisclosed, so don’t treat privacy gains as law.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Payoff Scaling Shapes Cooperation in LLM Agents Across Languages
arXiv 2601.19082v2 tests LLM agents in a repeated Prisoner’s Dilemma: with higher payoffs, evolutionary game theory (EGT) predicts more defection, but the LLM agents become more cooperative; the authors also report the pattern in three smaller open-weight models.
#Agent#Alignment#Benchmarking#Research release
why featured
HKR-H/K/R pass, mainly on a counterintuitive agent-behavior experiment. As an arXiv-only research item with no product impact or visible industry debate, it stays in the 60–71 band.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Safe-RULE: Safe Reinforcement UnLEarning
Safe-RULE proposes a defense framework for offline Safe RL that removes poisoned-data influence without retraining from scratch or accessing the original environment, and its unlearning process explicitly accounts for both task performance and safety constraints across benchmark Safe RL tasks.
#Robotics#Safety#Alignment#Safe-RULE
why featured
HKR-K/R pass: the paper offers a concrete safe RL unlearning setup without environment access or retraining, and touches poisoning defense. It remains niche research with abstract-level detail, below featured threshold.
editor take
Safe-RULE removes poisoned-data influence without env access or retraining; no poison rate or baselines in the snippet, so don’t call it a robotics safety patch yet.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
LEAP learns unstructured pruning masks with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation, and across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, it improves six-task average zero-shot accuracy by 2.59 points over ADMM.
#Inference-opt#Fine-tuning#Benchmarking#LEAP
why featured
HKR-K is strong: mechanism, model scale, sparsity levels, and six-task zero-shot results are disclosed. HKR-R comes from inference-cost pressure, but HKR-H is weak and the arXiv method is not yet a deployable tool.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
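The "Bernoulli-via-Gumbel-sigmoid" phrase in the LEAP item above refers to a standard trick for making per-weight keep/drop decisions differentiable. Below is a minimal sketch of that relaxation only, not LEAP's training procedure. The function names, the temperature value, and the example keep-logits are hypothetical.

```python
import math
import random


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid_mask(logits, temperature=0.5, rng=None):
    """Relaxed Bernoulli sample per weight: values in [0, 1].

    `logits` are hypothetical learnable keep-scores, one per weight.
    Lower temperatures push samples closer to 0 or 1.
    """
    rng = rng or random.Random(0)
    soft_mask = []
    for logit in logits:
        # Clamp away from 0 and 1 so the logs below stay finite.
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        # The difference of two Gumbel(0, 1) samples is Logistic(0, 1) noise.
        noise = math.log(u) - math.log(1.0 - u)
        soft_mask.append(stable_sigmoid((logit + noise) / temperature))
    return soft_mask


keep_logits = [4.0, -4.0, 0.0, 2.5]  # illustrative values only
soft = gumbel_sigmoid_mask(keep_logits, temperature=0.5, rng=random.Random(0))
hard = [m > 0.5 for m in soft]  # thresholded mask one might apply when pruning
```

In a real setup the logits would be trained by gradient descent through the soft mask; this sketch only shows the sampling step.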
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning
The paper analyzes 15 calibration sources and shows that, on LLaMA-3.1-8B with SparseGPT at 60% sparsity, a uniform multi-source calibration mix reaches 58.8% total retention, 8.8 points above the best single source MetaMath and 18.8 points above the C4 default.
#Inference-opt#Code#Benchmarking#LLaMA
why featured
HKR-K and HKR-R pass: the paper gives concrete pruning numbers and a practical calibration-data claim. HKR-H is weak, and this remains niche infra research rather than a same-day must-write.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
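The calibration-mix summary above gives one absolute number and two gaps, so the implied single-source baselines can be backed out. A quick check, assuming "points" means percentage points on the same total-retention metric (the variable names are illustrative):

```python
mixed_retention = 58.8   # uniform multi-source mix, from the summary (%)
gap_vs_metamath = 8.8    # percentage points, from the summary
gap_vs_c4 = 18.8         # percentage points, from the summary

# Implied baselines; rounded to one decimal to hide float noise.
metamath_retention = round(mixed_retention - gap_vs_metamath, 1)
c4_retention = round(mixed_retention - gap_vs_c4, 1)

print(metamath_retention, c4_retention)
```

Under that assumption, MetaMath alone would sit near 50.0% and the C4 default near 40.0% total retention.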
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering
The paper recasts manifold steering as Riemannian geodesic computation over activation space and trains an encoder on output distances from a small concept-token schema, avoiding per-prompt labels, topology priors, and per-task curve fitting.
#Alignment#Reasoning#Research release
why featured
HKR-K is clear, while HKR-H/R mainly work for the model-steering niche. The post gives mechanisms but no metrics, model scale, code, or reproducible setup, so the technical bar keeps it in all, not featured.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
EinSort: Sorting Is All We Need for Tensorizing LLM
EinSort uses index ordering to discover low-rank structure in target tensors, and its weight and KV-cache compression experiments show better reconstruction quality than baselines.
#Inference-opt#EinSort#Research release
why featured
HKR-H/K/R pass through the surprising sorting hook, concrete tensorization mechanism, and inference-cost nerve. Importance stays in the lower band because the post discloses no compression ratio, latency, or production impact.
editor take
EinSort sorts indices for weight and KV-cache compression; no compression ratio disclosed, so reconstruction wins feel underpowered.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates
The paper proposes truncating the SVD tail of fine-tuning update ΔW, reducing spurious-group gaps across three 0.5B–7B instruction-tuned models and four classification benchmarks while keeping accuracy loss under 2 percentage points.
#Fine-tuning#Alignment#Benchmarking#Research release
why featured
HKR-K is clear: SVD-tail compression, three models from 0.5B to 7B, four classification benchmarks, and <2pp accuracy loss. HKR-R is present via bias and fine-tuning reliability, but this is a single narrow arXiv paper, so it stays in the interesting band.
editor take
SVD-tail truncation cuts gaps in 12 model-benchmark cells at <2pp loss; I buy the patch, not the debiasing story.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
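For the SVD-tail item above: truncating the tail of a fine-tuning update can be written as a rank-k approximation of \Delta W. This is a sketch in generic notation. The cutoff k, the names W_{\text{base}} and W_{\text{ft}}, and the choice of which weight matrices to compress are assumptions, since the snippet does not disclose them.

```latex
\Delta W = W_{\text{ft}} - W_{\text{base}}
         = \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

W' = W_{\text{base}} + \sum_{i=1}^{k} \sigma_i\, u_i v_i^{\top},
\qquad k < r
```

The dropped terms i = k+1, \dots, r are the "tail" where, per the title, the shortcut features are claimed to live.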
04:00
3h ago
NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
PriFT: Prior-Support Guided Supervised Fine-Tuning
PriFT computes token weights from a frozen pretrained reference model rather than the online fine-tuned model, and experiments on mathematical reasoning, code generation, and medical question answering show stronger results than multiple SFT baselines plus better initialization for subsequent RL training.
#Fine-tuning#Reasoning#Code#PriFT
why featured
HKR-K is clear: the method and test domains are specific. HKR-R is limited to fine-tuning and RL-training practitioners; with only one arXiv paper and no code or scale numbers disclosed, this fits the 60–71 band.
editor take
PriFT weights tokens with a frozen reference model, avoiding online self-reinforcement; no model sizes or gains disclosed, so hold the SFT hype.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
NEW · 2 sources · arXiv · cs.LG · atom · EN · 04:00 · 06·09
BrainSurgery paper introduces declarative weight operations for model editing and upcycling
BrainSurgery modifies neural network checkpoints through declarative YAML plans; the arXiv abstract presents four examples and three case studies covering model upcycling and LoRA extraction.
#Fine-tuning#Tools#BrainSurgery#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv tool paper. The text gives a mechanism and case counts, not metrics, cost, or adoption, so it stays in the 60–71 band.
editor take
BrainSurgery edits checkpoints via YAML, with 4 examples and 3 cases; I buy it: weight surgery needs reproducible guardrails.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
The paper proposes R2, a training-free correction that projects each text embedding off the mean direction, and reports classification gains on MMTEB across 38 models, with 29 models showing t>2 and zero losses.
#Embedding#Benchmarking#arXiv#MMTEB
why featured
HKR-H and HKR-K pass: R2’s mean-direction projection and 38-model MMTEB test add signal. The audience fit is narrow and it remains a benchmark-method paper, not a same-day industry story.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
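The R2 item above describes projecting each embedding off the mean direction. Here is a minimal sketch of that operation in plain Python, assuming the mean is taken over the same batch of embeddings and that vectors are re-normalized to unit length afterwards. Both choices are assumptions, and the function name is hypothetical rather than the paper's code.

```python
import math


def remove_mean_direction(embeddings):
    """Subtract each vector's component along the batch mean, then re-normalize."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean))
    if mean_norm == 0.0:
        # No shared direction to remove; return copies unchanged.
        return [list(vec) for vec in embeddings]
    unit = [m / mean_norm for m in mean]

    corrected = []
    for vec in embeddings:
        coeff = sum(v * u for v, u in zip(vec, unit))       # component along the mean
        residual = [v - coeff * u for v, u in zip(vec, unit)]
        res_norm = math.sqrt(sum(r * r for r in residual))
        corrected.append([r / res_norm for r in residual] if res_norm > 0 else residual)
    return corrected


toy = [[1.0, 0.2, 0.0], [0.9, -0.1, 0.3], [1.1, 0.0, -0.2]]  # illustrative vectors
fixed = remove_mean_direction(toy)
```

After the correction, every vector in `fixed` is orthogonal to the batch mean direction, so cosine similarities no longer share that common offset.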
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
VSD formulates draft training as variational inference over latent draft paths, then uses EM, Adaptive Rejection Weighting, and Confidence-Aware Regularization to increase expected acceptance length, with experiments reporting up to 9.6% speedup over EAGLE-3 and 7.9% over ViSpec across LLMs and MLLMs.
#Inference-opt#Multimodal#EAGLE-3#ViSpec
why featured
HKR-K and HKR-R pass via a concrete VSD mechanism and 9.6% speedup claim, but HKR-H fails. This is a specialized speculative-decoding paper, so it stays in the 60–71 research-signal band.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
04:00
3h ago
arXiv · cs.LG · atom · EN · 04:00 · 06·09
Solving Inverse Problems with Flow-based Models via Model Predictive Control
MPC-Flow formulates inverse problem solving with flow-based generative models as sequential control sub-problems, provides training-free inference-time guidance, and guides 32B FLUX.2 in a quantized setting on consumer hardware for image restoration tasks including in-painting, deblurring, and super-resolution.
#Inference-opt#Vision#FLUX.2#Research release
why featured
HKR-H/K/R pass, but this is a single arXiv methods paper with high inverse-problem/MPC overhead and no disclosed code, metrics table, or cross-source pickup; it stays in all.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R1
