papers · 2026-06-05

7 papers


2026-06-05 · Fri

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 06·05

→How Reliable Are LLMs When Playing Dice?

The study tests 8 state-of-the-art models on two discrete-probability datasets, with and without Chain-of-Thought prompting; average accuracy reaches 0.96 on standard problems, falls to 0.59 on counterintuitive ones, and drops by up to 34% when prompts include misleading suggestions.

#Reasoning · #Benchmarking · #Research release · #Benchmark

why featured

HKR-H/K/R all pass: the dice setup is novel, the accuracy drops are concrete, and the topic hits reasoning robustness. The task is narrow and only arXiv-summary depth is available, so it stays near the featured threshold.

editor take

Dice problems puncture the math-reasoning story: 8 models hit 0.96 on standard items, then fall to 0.59 on counterintuitive ones.

sharp

The uncomfortable part is not that probability is hard. It is that surface familiarity still does too much work. Across 8 state-of-the-art models, average accuracy is 0.96 on standard discrete-probability exercises, then drops to 0.59 on counterintuitive ones. Disguising canonical formulations cuts performance by over 20%. Adding misleading suggestions cuts it by up to 34%. That is a bad look for the Chain-of-Thought story. The paper tests each model with and without CoT, but the abstract does not disclose model-level CoT gains. From the numbers given, CoT looks closer to fluent template execution than robust probability-space checking. Set beside high scores on GSM8K or MATH-style benchmarks, simple dice problems become the cleaner autopsy tool.
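To make "probability-space checking" concrete, here is a minimal sketch that settles a dice question by enumerating the full outcome space instead of pattern-matching on a familiar template. The two questions and the helper `conditional_probability` are illustrative assumptions, not items or code from the paper's datasets.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, faces=6, dice=2):
    """P(event | given) by exhaustive enumeration of fair dice rolls."""
    outcomes = list(product(range(1, faces + 1), repeat=dice))
    conditioned = [roll for roll in outcomes if given(roll)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favourable = [roll for roll in conditioned if event(roll)]
    return Fraction(len(favourable), len(conditioned))


def both_six(roll):
    return roll == (6, 6)


def at_least_one_six(roll):
    return 6 in roll


def first_is_six(roll):
    return roll[0] == 6


# Hypothetical "counterintuitive" phrasing: at least one die shows a six.
p_counterintuitive = conditional_probability(both_six, at_least_one_six)
# Hypothetical "standard" phrasing: the first die shows a six.
p_standard = conditional_probability(both_six, first_is_six)

print(p_counterintuitive, p_standard)  # 1/11 1/6
```

The two phrasings look nearly identical on the surface, yet the answers differ (1/11 versus 1/6). That is the kind of gap a template-driven solver would be expected to miss.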

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 06·05

→MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

MemDreamer turns long-video understanding into an agentic exploration process, reaches SOTA on four mainstream benchmarks, and limits the reasoning context window to 2% of full-context ingestion while reporting a 12.5-point absolute accuracy gain.

#Agent · #Multimodal · #Memory · #MemDreamer

why featured

HKR-H/K/R all pass: 2% context is a strong hook, graph memory plus agentic retrieval gives a testable mechanism, and video-context cost has practitioner pull. Single arXiv paper keeps it at 78.

editor take

MemDreamer makes long video look like graph retrieval, not context stuffing; 2% reasoning context plus 12.5 points is a direct hit on the long-context myth.

sharp

MemDreamer lands a clean hit on the brute-force long-context story: hours-long video should be indexed first, then reasoned over. The concrete hook is strong: a three-tier Hierarchical Graph Memory, Observation-Reason-Action retrieval over nodes and edges, only 2% of full ingestion used as reasoning context, and a reported 12.5-point absolute accuracy gain, with SOTA results on four mainstream benchmarks. I don’t fully buy the “agentic capability scaling” framing yet. Long-video QA benchmarks often reward retrieval structure, especially when the questions reduce to localization plus causal links. Against Gemini-style or GPT-4o-style multimodal context expansion, MemDreamer looks like the more deployable token-budget answer. The unresolved risk is memory poisoning: once the streaming perception layer writes a bad node or relation, the agent can reason very confidently over the wrong graph.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

arXiv · cs.AI · atom · EN · 17:53 · 06·05

→Sparse Subspace-to-Expert Sharing Method Addresses Catastrophic Forgetting in Continual Learning

SETA addresses catastrophic forgetting in LLM continual learning by decomposing sparse subspaces into task-specific and shared experts, with experiments on LLaMA-2 7B and Qwen3-4B; the RSS snippet does not disclose the number of benchmarks or exact scores.

#Fine-tuning · #Inference-opt · #Memory · #LLaMA

why featured

HKR-K/R pass: the mechanism and tested models are concrete, and forgetting matters to fine-tuning workflows. HKR-H fails; no metrics or benchmark count are disclosed, so this stays a routine research release.

editor take

SETA tests LLaMA-2 7B and Qwen3-4B; scores and benchmark count are missing, so don't buy MoE-as-forgetting-cure yet.

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✓

SCORE

H0·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:45 · 06·05

→How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

The study compares similar tasks in Perplexity Search and Computer production data: Computer performs 26 minutes of autonomous work per session and cuts matched-task completion time from 269 minutes to 36 minutes.

#Agent · #Tools · #Benchmarking · #Perplexity

why featured

HKR-H/K/R all pass: production data, measured time delta, and knowledge-work automation stakes. It stays in 78–84 because this is an arXiv paper, not a major model or product launch.

editor take

Perplexity finally puts agent claims on production logs: 36 minutes vs 269 is loud, but vendor data is not an industry benchmark.

sharp

Perplexity’s strongest move is dragging agent value out of demos and into production logs, but I would not treat it as a neutral benchmark. Computer does 26 minutes of autonomous work per session, versus 33 seconds for Search. On matched tasks, completion time drops from 269 minutes to 36 minutes, with estimated cost down 94%. That is much closer to workflow evidence than another browser-chat A/B. I still have doubts about the 87% efficiency claim. The data comes from Perplexity Search and Computer, with near-identical initial query pairs inside one product ecosystem. That leaves selection bias alive: users who choose Computer are already handing off composite tasks. The 55% lower dissatisfaction rate is the best quality hook, but the paper still needs external task sets and a failure taxonomy.
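A quick arithmetic check on the matched-task numbers, as a sketch. It assumes the 87% efficiency figure refers to the drop in completion time, which the summary does not confirm; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of `before` removed when the value falls to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Matched-task completion times quoted above, in minutes.
search_minutes = 269
computer_minutes = 36

time_cut = relative_reduction(search_minutes, computer_minutes)
speedup = search_minutes / computer_minutes

print(f"{time_cut:.1%} less time, {speedup:.1f}x faster")
# 86.6% less time, 7.5x faster
```

Under that assumption the quoted times round to the 87% figure, so the headline number is at least internally consistent. The selection-bias concern is untouched by this check.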

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:26 · 06·05

→Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

The paper detects Whisper hallucinations using encoder activations and SAE latents, then applies SAE steering that cuts hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3 on a full non-speech test set.

#Audio · #Interpretability · #Safety · #Whisper

why featured

HKR-H/K/R all pass: the paper ties SAE steering to concrete Whisper hallucination drops across two model sizes. It is still a single arXiv research release, below a major model or product launch.

editor take

Whisper’s non-speech hallucination bug now has a plug-in scalpel: SAE steering cuts small from 72.63% to 14.11%, without pretending retraining is the only fix.

sharp

Whisper’s failure mode here is not bad transcription; it is confident text from non-speech audio. This paper attacks the encoder representation directly, which is a cleaner lever than bolting on a VAD gate. The concrete result is strong: SAE latent steering cuts hallucinations from 72.63% to 14.11% on Whisper small, and from 86.88% to 27.33% on large-v3, with the signal concentrated in sparse features in deeper encoder layers. I buy the engineering value, but not the broad “near fine-tuning” framing yet. The snippet says speech WER degradation is small, but gives no WER number, language mix, noise setting, or latency cost. ASR safety patches fail in production when they suppress accented speech, low-resource languages, or messy call-center audio along with the hallucinations.
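The two headline drops are easier to compare once absolute and relative reductions are separated. This is a minimal sketch over the four rates quoted above; the `summarize` helper and the dictionary layout are illustrative assumptions, not anything from the paper.

```python
# Hallucination rates (%) on the non-speech test set, before and after SAE steering.
rates = {
    "small": (72.63, 14.11),
    "large-v3": (86.88, 27.33),
}


def summarize(before: float, after: float) -> dict:
    """Absolute drop in percentage points and relative drop as a fraction."""
    return {
        "absolute_points": round(before - after, 2),
        "relative": round((before - after) / before, 3),
    }


summary = {model: summarize(*pair) for model, pair in rates.items()}

for model, stats in summary.items():
    print(model, stats)
# small {'absolute_points': 58.52, 'relative': 0.806}
# large-v3 {'absolute_points': 59.55, 'relative': 0.685}
```

The absolute drops are nearly identical, but large-v3 keeps a residual rate roughly twice that of small (27.33% versus 14.11%), so the relative gain is weaker on the larger model.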

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1

arXiv · cs.AI · atom · EN · 17:16 · 06·05

→Planning-aligned Token Compression for Long-Context Autonomous Driving

COMPACT-VA compresses long-context autonomous-driving memory with a conditional VQ-VAE, conditions compression on trajectory history and learned planning intent, reaches a 68.3% success rate with over 6% gain under comparable token budgets, and reports 3.3× speedup plus 2.7× memory reduction in closed-loop evaluation.

#Robotics · #Memory · #Agent · #COMPACT-VA

why featured

HKR-K is strong with concrete mechanism and closed-loop metrics; HKR-R lands on inference cost and latency. HKR-H is weak because this is a niche autonomous-driving paper, so it stays below featured.

editor take

COMPACT-VA hits 68.3% success at matched token budgets; I buy the direction, but dynamic-scenario filtering flatters the gain.

HKR breakdown

hook ✗ · knowledge ✓ · resonance ✓

SCORE

H0·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:13 · 06·05

→Act As a Real Researcher: Benchmarks for Frontier LLMs and Agentic Harnesses in Research Lifecycle

The AARR team introduced AARRI-Bench to evaluate research-intern-level agents, and the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reached a 68.3% success rate.

#Agent · #Reasoning · #Benchmarking · #AARR

why featured

HKR-H/K/R all pass: the research-intern framing is clickable, AARRI-Bench adds a concrete 68.3% result, and the topic hits research automation anxiety. Single arXiv benchmark, so below must-write.

editor take

68.3% is the useful number here: AARRI-Bench pokes the agent-demo bubble where task completion gets mistaken for research judgment.

sharp

AARRI-Bench hits the lazy part of agent evaluation: finishing long tasks gets treated as doing research. The best setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3%, and the reported misses are not tool wiring or raw execution. They are field sensitivity, research ethics, and subtle judgment: the stuff a human intern gets corrected on every week. I like this benchmark because it drags the SWE-bench mindset into messier territory. Coding agents get clean tests; research work rarely gives you an oracle. The paper snippet does not disclose a human-intern baseline, so 68.3% has no clean anchor yet. Still, this is a healthier direction than another autonomous-experiment demo with a cherry-picked success trace.

HKR breakdown

hook ✓ · knowledge ✓ · resonance ✓

SCORE

H1·K1·R1
