papers · 2026-05-22

▸ 23 papers


2026-05-22 · Fri

FEATURED · arXiv · cs.CL · atom · EN · 17:59 · 05·22

→SkillOpt: Executive Strategy for Self-Evolving Agent Skills

SkillOpt uses a separate optimizer model to convert scored rollouts into bounded text edits, accepts only validation-improving changes, ranks best or tied across 52 evaluated cells, and raises GPT-5.5 no-skill accuracy by 23.5 points in direct chat, 24.8 in Codex, and 19.1 in Claude Code.

#Agent#Fine-tuning#Benchmarking#SkillOpt

why featured

HKR-H/K/R all pass: the hook is self-evolving skills, the paper gives a 23.5-point gain and 52 eval units, and agent builders care about automated skill updates. It stays in 78–84 because this is a single arXiv paper, not a major model or product release.

editor take

SkillOpt turns prompt tinkering into validated training; 52 cells is strong, but I want leakage checks and baseline implementations first.

sharp

SkillOpt’s sharp move is treating an agent skill as trainable external state, not dressing up reflection loops again. A separate optimizer model converts scored rollouts into bounded add/delete/replace edits, then accepts only edits that improve held-out validation. Deployment adds zero extra inference calls. That is much closer to an optimizer than GEPA, TextGrad, or EvoSkill-style text search. The numbers are hard to ignore: six benchmarks, seven target models, three harnesses, and best-or-tied results across all 52 evaluated cells. On GPT-5.5, it reports +23.5 points in direct chat, +24.8 inside Codex, and +19.1 inside Claude Code. My first checks would be validation/test separation and whether the human, Trace2Skill, GEPA, and EvoSkill baselines were implemented strongly. If those survive, this is a credible path to agent gains without touching weights.
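The accept-only-on-validation-gain loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SkillOpt's implementation: `Edit`, `apply_edit`, `optimize_skill`, the `propose` and `validate` callables, and the `max_edits` cap are all hypothetical names, and the skill is modeled as a plain list of text lines.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line content for add/replace


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a copy of the skill with one bounded edit applied."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str]], List[Edit]],  # stand-in for the optimizer model
    validate: Callable[[List[str]], float],      # stand-in for held-out validation
    steps: int,
    max_edits: int = 3,                          # assumed bound on edits per step
) -> Tuple[List[str], float]:
    """Keep a candidate only when it strictly improves the validation score."""
    best, best_score = list(skill), validate(skill)
    for _ in range(steps):
        candidate = best
        for edit in propose(best)[:max_edits]:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Because the skill text is the only thing that changes, the deployed agent runs with the final `best` lines and no extra inference calls, which is the property the paragraph above highlights.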

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

80 · H1·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 05·22

→Research Proposes Shannon Scaling Law Analyzing LLM Capacity and Performance

The paper proposes the Shannon Scaling Law, mapping parameters to bandwidth and tokens to signal power; fitted on ≤6.9B Pythia models and ≤180B tokens, it predicts an unseen 12B model up to 307B tokens with pooled R²=0.847.

#Reasoning#Fine-tuning#Benchmarking#Pythia

why featured

HKR-H/K/R pass, but this is an arXiv scaling-law modeling paper, not a model or product release. Evidence is limited to Pythia fits up to 6.9B/180B tokens and 12B/307B-token prediction, so it lands in the 72–77 featured band.

editor take

The bandwidth/signal framing is neat, but a 6.9B-to-12B extrapolation does not dethrone Chinchilla-style scaling practice.

sharp

The useful part is not the Shannon metaphor; it is the attempt to fit overtraining and quantization damage with one curve. The paper fits on Pythia up to 6.9B parameters and 180B tokens, then predicts an unseen 12B run up to 307B tokens with pooled R²=0.847. That is a better story for late-training failure than monotonic power laws. I still do not buy “capacity law” yet. A 12B extrapolation sits inside the small-model regime, and Pythia / OLMo2 are not frontier training stacks. Chinchilla mattered because its compute-data tradeoff held across budget scales; this paper is stronger as a perturbation model. Before anyone uses it to plan 70B, MoE, or long-context continued pretraining, it needs cross-architecture evidence.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

73 · H1·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 05·22

→From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

The paper evaluates the full lifecycle of model-generated agent skills across five agentic task domains, finding average gains but non-trivial negative transfer; skill utility is independent of model scale or baseline task strength, and the proposed meta-skill improves skill quality across domains.

#Agent#Benchmarking#Research release#Benchmark

why featured

Single arXiv paper, so it stays below 78. HKR-H/K/R pass because the 5-domain study, negative transfer claim, and model-size independence are testable and relevant to agent builders.

editor take

Agent skill reuse is not a bigger-models-win story; splitting extractor from consumer exposes why skill libraries keep producing negative transfer.

sharp

Agent skill libraries fail at the consumption step, not only at extraction. The paper spans five agentic task domains and lands on an ugly result: model-generated skills help on average, but negative transfer is non-trivial, and utility is independent of model scale or baseline strength. That cuts against the optimistic Voyager / Reflexion line: compress experience into reusable procedures, then agents get better. Here, a strong extractor can be a weak consumer, and the reverse also holds. The proposed meta-skill improves skill quality across domains and reduces negative transfer, but the snippet gives no benchmark numbers. Without those, I don’t buy this as a general agent-memory recipe yet; I buy it as a useful demolition of the naive skill-library story.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

74 · H1·K1·R1

arXiv · cs.AI · atom · EN · 17:58 · 05·22

→SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

SpaceNum evaluates VLMs with two bidirectional tasks, Num2Space and Space2Num, and finds that current models often perform close to random guessing across dynamic transitions and static layouts.

#Vision#Multimodal#Benchmarking#Research release

why featured

HKR-H/K/R all pass, but the arXiv summary lacks model names, score tables, and error conditions. This is a useful VLM benchmark item, not a featured-level story.

editor take

SpaceNum tests Num2Space and Space2Num; VLMs hover near random guessing, so CoT is no coordinate system.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

70 · H1·K1·R1

arXiv · cs.CL · atom · EN · 17:58 · 05·22

→ETCHR: Editing To Clarify and Harness Reasoning

ETCHR decouples a reasoning-aware image editor from downstream multimodal models and trains it in two stages; across five task families, it raises average Pass@1 from 55.95 to 60.77 on Qwen3-VL-8B, from 65.08 to 70.55 on Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 on Kimi K2.5.

#Reasoning#Multimodal#Vision#Qwen

why featured

HKR-H and HKR-K pass: the image-editing-before-reasoning angle is testable and the Pass@1 gain is concrete. Impact remains an arXiv multimodal reasoning method without open-source, product adoption, or cross-source discussion, so it sits in 60–71.

editor take

ETCHR lifts Kimi K2.5 Pass@1 by 4.61; I buy the decoupled editor, but VLM-reward eval leakage needs checking.
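The per-model gains behind that take are easy to recompute from the summary's figures. The snippet below is only an arithmetic check; `PASS_AT_1` and `pass_at_1_gains` are illustrative names, and the numbers are copied from the summary above rather than from the paper's tables.

```python
from decimal import Decimal

# Average Pass@1 (without ETCHR, with ETCHR), as quoted in the summary above.
PASS_AT_1 = {
    "Qwen3-VL-8B": ("55.95", "60.77"),
    "Gemini-3.1-Flash-Lite": ("65.08", "70.55"),
    "Kimi K2.5": ("76.55", "81.16"),
}


def pass_at_1_gains(results):
    """Return the absolute Pass@1 gain in points for each model."""
    return {
        model: Decimal(after) - Decimal(before)
        for model, (before, after) in results.items()
    }


for model, gain in pass_at_1_gains(PASS_AT_1).items():
    print(f"{model}: +{gain}")
# Qwen3-VL-8B: +4.82
# Gemini-3.1-Flash-Lite: +5.47
# Kimi K2.5: +4.61
```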

HKR breakdown

hook ✓knowledge ✓resonance —

SCORE

66 · H1·K1·R0

arXiv · cs.AI · atom · EN · 17:55 · 05·22

→Good Token Hunting: Token Selection Method for Visual Geometry Transformers

Good Token Hunting restricts key/value tokens per query in global attention with inter-frame and intra-frame selection, accelerating visual geometry transformers by over 85% on scenes with 500 images while maintaining or improving baseline performance.

#Vision#Inference-opt#Research release

why featured

HKR-K and HKR-R pass: the 85% speedup and two-stage KV token selection are concrete. The scope is niche visual-geometry research, not a broad product or model release, so it stays in the 60–71 band.

editor take

Good Token Hunting reports 85%+ speedup on 500-image scenes; sparse attention keeps eating 3D reconstruction latency.

HKR breakdown

hook —knowledge ✓resonance ✓

SCORE

66 · H0·K1·R1

FEATURED · arXiv · cs.AI · atom · EN · 17:47 · 05·22

→CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS addresses temporal knowledge-graph data marketplaces with a three-layer architecture, reporting 0.937 Recall@10, 2.74 QPS, 161 ms latency, and total ε=4.25 at δ=1e-6 across four benchmarks.

#Agent#RAG#Benchmarking#CHRONOS

why featured

HKR-K passes on concrete architecture and metrics; HKR-H and HKR-R are weak because the topic is a niche temporal-KG marketplace paper. No hard exclusion, but accessibility and practical pull keep it in the 40–59 band.

editor take

CHRONOS stitches temporal decay, Shapley pricing, and DP scheduling well; ε=4.25 still leaves valuations noise-dominated, so the marketplace story needs a discount.

sharp

All 3 sources use the same title and sit on the arXiv / HF paper-distribution chain; this is synchronized indexing of one v1 paper, not independent market validation. CHRONOS has a clean technical hook: 500 sellers, Recall@10 of 0.937, 161 ms latency, 2.74 QPS, and ε=4.25 at δ=1e-6, tying temporal KG routing, Shapley valuation, and DP budget control into one architecture. I buy the problem framing, but I don’t buy the “data marketplace” strength yet. The paper itself says released valuations remain noise-dominated at that privacy level, and utility mainly comes from public index routing plus adaptive scheduling. For agentic data infra builders, this reads more like a privacy-constrained routing and scheduling prototype than a settlement-ready Shapley marketplace.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

75 · H0·K1·R0

arXiv · cs.CL · atom · EN · 17:45 · 05·22

→Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

LINK swaps randomly selected words in part of the English pretraining corpus with word-level target-language translations at a chosen replacement ratio, using only a bilingual vocabulary and no extra training stage; evaluation across eight languages and five model sizes reports target-language downstream gains and up to 2x faster training to reach equivalent performance.

#Fine-tuning#Benchmarking#LINK#Research release

why featured

HKR-H/K/R all pass: the mechanism and numbers are concrete, and the cost angle is real. Kept below featured because it is a single arXiv paper with limited body detail and no disclosed open-source or industry replication.

editor take

LINK swaps English words via bilingual vocabularies and reports 2x speedups across 8 languages; I buy it, but word-noise costs are undisclosed.
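The word-swap step is simple enough to sketch. The version below is an assumption-laden illustration, not the authors' code: `link_swap`, the `lexicon` dict, and the independent per-word coin flip at `ratio` are hypothetical choices, and the paper may select words differently.

```python
import random
from typing import Dict, List


def link_swap(
    tokens: List[str],
    lexicon: Dict[str, str],  # English word -> target-language word (bilingual vocabulary)
    ratio: float,             # chance of replacing each word that has a translation
    rng: random.Random,
) -> List[str]:
    """Replace a random subset of translatable words with their translations."""
    swapped = []
    for token in tokens:
        if token in lexicon and rng.random() < ratio:
            swapped.append(lexicon[token])
        else:
            swapped.append(token)
    return swapped


# Toy English -> German entries, for illustration only.
lexicon = {"cat": "Katze", "house": "Haus"}
sentence = "the cat sat in the house".split()
print(link_swap(sentence, lexicon, ratio=1.0, rng=random.Random(0)))
# ['the', 'Katze', 'sat', 'in', 'the', 'Haus']
```

Words missing from the vocabulary pass through unchanged, so the cost the editor take worries about would show up as mixed-language noise in otherwise English sentences.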

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

71 · H1·K1·R1

arXiv · cs.AI · atom · EN · 17:45 · 05·22

→PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

PGT overlays unambiguous geometric primitives on images to generate dense supervision, improving MLLM visual grounding by up to 20% on What'sUp and 13.3% on CV-Bench-2D under LLaVA-v1.5-Instruct instruction tuning.

#Multimodal#Vision#Fine-tuning#LLaVA

why featured

HKR-H and HKR-K pass: the paper gives a concrete procedural-supervision mechanism and two benchmark gains. Reach stays research-heavy, with no major lab, released artifact, or production-replacement claim, so it fits the 60–71 band.

editor take

PGT adds geometric overlays to LLaVA-v1.5 and gains 20% on What’sUp; test real occlusion before calling it spatial reasoning progress.

HKR breakdown

hook ✓knowledge ✓resonance —

SCORE

67 · H1·K1·R0

arXiv · cs.AI · atom · EN · 17:25 · 05·22

→Human Decision-Making with Persuasive and Narrative LLM Explanations

The study evaluates LLM-generated narrative explanations in a large-scale human behavioral experiment and finds persuasiveness did not meaningfully improve classification decision accuracy over a simple AI prediction alone.

#Reasoning#Interpretability#Research release

why featured

HKR-H/K/R all pass, but the post gives only the finding; sample size, task setup, and effect sizes are not disclosed. As an arXiv XAI study it is useful, not featured-level signal.

editor take

Large behavioral study: LLM narratives did not improve classification accuracy; they increased reliance even when the AI was wrong.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

68 · H1·K1·R1

arXiv · cs.AI · atom · EN · 17:20 · 05·22

→Research proposes framework for causal generative modeling with foundation models

The paper introduces FM-CGM, a three-component framework for visual causal reasoning that combines a large reasoning model with a text-to-image diffusion model; the RSS snippet does not disclose datasets, metric values, or baseline results.

#Reasoning#Vision#Multimodal#Research release

why featured

HKR-K passes for the three-part FM-CGM mechanism, but the post gives no datasets, metrics, or baselines. HKR-H and HKR-R are weak, so this lands as a low-end research signal in the all tier.

editor take

FM-CGM chains 3 modules, but RSS gives no datasets, metrics, or baselines; I don’t buy “faithful” yet.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

52 · H0·K1·R0

arXiv · cs.CL · atom · EN · 16:29 · 05·22

→Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

ToolMerge uses an LLM planner to decompose long-video queries into tool calls and merge per-tool rankings with boolean operators; on the M2M benchmark, it is competitive across QA, question retrieval, and caption retrieval, with a 5% gain over other methods on caption retrieval.

#Agent#Vision#Tools#ToolMerge

why featured

HKR-K passes via the tool-call decomposition mechanism and 5% benchmark gain. HKR-H and HKR-R are weak, so this fits the 60–71 band as useful but niche research.

editor take

ToolMerge gains 5% on M2M caption retrieval; boolean rank merging is plain, but more debuggable than end-to-end video retrieval.
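Part of why boolean rank merging is easy to debug is that every intermediate score can be inspected. The sketch below shows one way such a merge could work; it is not ToolMerge's implementation. The nested-tuple plan format, the fuzzy-logic choice of `min` for AND, `max` for OR, and `1 - score` for NOT, and the tool names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

Scores = Dict[int, float]     # frame index -> relevance in [0, 1]
Plan = Union[str, Tuple]      # a tool name, or ("and" | "or" | "not", sub-plans...)


def evaluate(plan: Plan, tool_scores: Dict[str, Scores]) -> Scores:
    """Recursively merge per-tool frame scores according to a boolean plan."""
    if isinstance(plan, str):
        return tool_scores[plan]
    op, *args = plan
    parts = [evaluate(arg, tool_scores) for arg in args]
    if op == "not":
        if len(parts) != 1:
            raise ValueError("'not' takes exactly one sub-plan")
        return {frame: 1.0 - score for frame, score in parts[0].items()}
    if op not in ("and", "or") or not parts:
        raise ValueError(f"bad plan node: {plan!r}")
    combine = min if op == "and" else max
    # Only frames scored by every sub-plan can be merged.
    frames = set(parts[0]).intersection(*parts[1:])
    return {frame: combine(part[frame] for part in parts) for frame in frames}


def rank(plan: Plan, tool_scores: Dict[str, Scores]) -> List[int]:
    """Return frame indices sorted by merged score, best first."""
    merged = evaluate(plan, tool_scores)
    return sorted(merged, key=lambda frame: (-merged[frame], frame))


tool_scores = {
    "object": {0: 0.9, 1: 0.2, 2: 0.8},  # hypothetical object-detector scores
    "ocr": {0: 0.1, 1: 0.7, 2: 0.6},     # hypothetical on-screen-text scores
}
print(rank(("and", "object", "ocr"), tool_scores))  # [2, 1, 0]
print(rank(("or", "object", "ocr"), tool_scores))   # [0, 2, 1]
```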

HKR breakdown

hook —knowledge ✓resonance —

SCORE

64 · H0·K1·R0

arXiv · cs.CL · atom · EN · 16:24 · 05·22

→Research Shows Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

The paper starts from a WordNet hypernym-graph co-occurrence assumption and proves that leading eigenvectors of the word2vec embedding Gram matrix separate taxonomic branches from coarse to fine, then validates the signature across sampled WordNet subtrees and Gemma 2B unembeddings.

#Embedding#Interpretability#WordNet#Gemma

why featured

HKR-H and HKR-K pass: the hook is concept hierarchy from word co-occurrence, with a word2vec Gram-matrix mechanism and Gemma 2B unembedding checks. It stays theoretical, so it lands in the all tier.

editor take

The authors prove word2vec Gram eigenvectors split WordNet branches; Gemma 2B matching it makes hierarchy look statistical, not mystical.

HKR breakdown

hook ✓knowledge ✓resonance —

SCORE

66 · H1·K1·R0

arXiv · cs.CL · atom · EN · 14:57 · 05·22

→NLG Evaluation: Past, Present, Future

The paper reviews NLG evaluation from 1990 to 2026, contrasting an early period of little formal experimental evaluation with today’s ML-linked evaluation norms, and names LLM-as-Judge plus impact, qualitative, and safety evaluation as future priorities.

#Benchmarking#Safety#Research release#Safety/alignment

why featured

HKR-K and HKR-R pass: it offers a 1990-2026 NLG evaluation map and a clear LLM-as-Judge placement. HKR-H is weak, and as a survey rather than a product or model release, it stays in the 60-71 band.

editor take

It spans 1990–2026 NLG evaluation; LLM-as-Judge is now canon, and safety/impact evals stop being appendix work.

HKR breakdown

hook —knowledge ✓resonance ✓

SCORE

68 · H0·K1·R1

arXiv · cs.CL · atom · EN · 14:52 · 05·22

→Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

The paper proposes a weak-label benchmark audit using MPDS, ΔEvi, and reader-strength calibration together. Synthetic HotpotQA gives a counterexample: MPDS is 0.643, yet ΔEvi is zero, so metadata-only screening does not test evidence dependence.

#Benchmarking#Reasoning#HotpotQA#SNLI

why featured

HKR-H/K/R all pass, but this is an academic benchmark-audit method with limited immediate product impact. No model release, shipped tool, or cross-source cluster, so it stays in the 60–71 band.

editor take

Synthetic HotpotQA has MPDS 0.643 and ΔEvi 0; metadata shortcut checks alone miss evidence-blind benchmarks.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

68 · H1·K1·R1

arXiv · cs.CL · atom · EN · 14:49 · 05·22

→ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

ChartFI-Bench evaluates MLLM chart descriptions with 896 complex chart-description pairs and four metrics: Faithfulness, Coverage, Informativeness, and Acuity.

#Multimodal#Vision#Benchmarking#ChartFI-Bench

why featured

HKR-K is supported by the 896-pair dataset and four-metric design; HKR-R comes from chart-reading reliability concerns. No model ranking, artifact detail, or surprising result is disclosed, so this stays in all.

editor take

ChartFI-Bench has 896 complex chart pairs; small set, but it pushes chart eval beyond value recitation into insight quality.

HKR breakdown

hook —knowledge ✓resonance ✓

SCORE

64 · H0·K1·R1

HuggingFace Papers (takara mirror) · rss · EN · 13:12 · 05·22

→Preisach Attention: Hysteretic Sequential Memory Mechanism

The paper introduces the Preisach Attention Layer, which replaces softmax attention with a binary relay operator using learned activation and deactivation thresholds; a single-layer PAL-Transformer is Turing-complete under arbitrary precision arithmetic, and its total inference cost is O(n log n) versus O(n^2) for standard attention.

#Memory#Reasoning#Inference-opt#Research release

why featured

HKR-H/K/R all pass, but the post gives mechanism and theoretical cost only; no benchmarks, artifact, or major-lab backing is disclosed. This fits the 60–71 research-interest band, not featured.

editor take

PAL swaps softmax for binary relays at O(n log n); Turing-complete sounds flashy, but it gives up random access.
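For readers unfamiliar with the relay behind the name, the classical Preisach hysteron is a two-threshold switch with memory: it turns on at an activation threshold, turns off at a lower deactivation threshold, and holds its state in between. The sketch below shows that textbook operator only, not the paper's attention layer. `relay`, `alpha`, and `beta` are illustrative names, and the thresholds are fixed here, whereas the paper learns them.

```python
from typing import Iterable, List


def relay(signal: Iterable[float], alpha: float, beta: float, state: int = 0) -> List[int]:
    """Classical two-threshold relay (hysteron), with beta < alpha.

    Switches to 1 when the input reaches alpha, to 0 when it falls to beta,
    and keeps its previous state for inputs strictly between the thresholds.
    """
    if beta >= alpha:
        raise ValueError("beta must be below alpha")
    outputs = []
    for x in signal:
        if x >= alpha:
            state = 1
        elif x <= beta:
            state = 0
        outputs.append(state)
    return outputs


# The input 0.4 appears twice but yields different outputs depending on history.
print(relay([0.0, 0.6, 0.4, 0.2, 0.4, 0.6], alpha=0.5, beta=0.3))
# [0, 1, 1, 0, 0, 1]
```

The state is a running summary of past inputs rather than a lookup into them, which fits the take's point that this kind of memory gives up random access.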

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

69 · H1·K1·R1

HuggingFace Papers (takara mirror) · rss · EN · 05:59 · 05·22

→Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

IDEA converts VLN test-time adaptation into an asset library using soft prompts, Fisher-guided weighting, and domain coordinates, then projects a target domain onto the convex hull of historical knowledge for training-free cross-domain bridging on REVERIE, R2R, and R2R-CE.

#Agent#Vision#Multimodal#Research release

why featured

HKR-K passes: the post gives IDEA’s training-free cross-domain mechanism and three VLN benchmarks. HKR-H/R are weak because this is a niche embodied-navigation method, below featured threshold.

editor take

IDEA tests training-free bridging on REVERIE, R2R, and R2R-CE; gains are undisclosed, so I read it as a VLN prompt bank.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

66 · H0·K1·R0

HuggingFace Papers (takara mirror) · rss · EN · 04:19 · 05·22

→FastKernels GPU Kernel Generation Benchmarking Framework Released

FastKernels uses 46 representative architectures across 8 categories to cover 96.2% of HuggingFace Transformers architectures, and the strongest evaluated kernel agent reaches only 0.94× aggregate speedup over production baselines.

#Agent#Code#Benchmarking#Snowflake AI Research

why featured

HKR-H/K/R pass: the 0.94× result is a useful counter-hook for agent hype. GPU-kernel generation is narrow and high-accessibility, so the score stays in all rather than featured.

editor take

FastKernels covers 409/425 HF Transformers architectures; the best agent hits 0.94×, so autonomous kernel generation still trails production baselines.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

70 · H1·K1·R1

● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·22

→Insights Generator: Corpus-Level Trace Diagnostics for LLM Agents

Insights Generator diagnoses LLM agent execution traces with a multi-agent scout-investigator architecture, and human experts implementing its reports improved scaffold performance by 30.4 percentage points over the unmodified baseline.

#Agent#Tools#Benchmarking#Akshay Manglik

why featured

HKR-H/K/R all pass: the agent-debugging hook is clear, the paper gives a 30.4-point result and a concrete corpus-level diagnostic mechanism. It is a strong research-tool signal, not an 85+ same-day must-write release.

editor take

Three feeds point to one arXiv paper, not three validations. The 30.4pp gain is tempting; deployment lives or dies on messy production traces.

sharp

All 3 sources point to the same arXiv 2605.21347 record, with identical framing; this is distribution, not independent validation. Insights Generator frames agent debugging as corpus-level hypothesis generation and testing, and the hard number is a 30.4pp scaffold-performance gain after human experts used IG reports. I buy the problem more than the claimed lift. Agent teams have spent the last year staring at single-run success rates, while production failures hide in repeated tool misuse, premature stopping, and bad recovery across tens of thousands of trace tokens. IG’s scout-investigator setup fits that pain. But the abstract does not disclose benchmark names, corpus size, or compute cost here. Without those, 30.4pp reads like a strong demo result, not yet an operations tool.

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

92 · H1·K1·R1

arXiv · cs.LG · atom · EN · 04:00 · 05·22

→GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

GROW trains open-world VLM agents by decomposing full trajectories into state-action samples and computing advantages across samples; the paper reports state-of-the-art performance on more than 800 Minecraft tasks, while the abstract does not disclose exact scores or baselines.

#Agent#Multimodal#Reasoning#Xiongbin Wu

why featured

HKR-H/K/R all pass, but this is a single arXiv paper with method and 800+ Minecraft SOTA only; no code, major lab backing, or production replacement claim. Score stays in the 60–71 band.

editor take

GROW trains on 800+ Minecraft tasks; exact scores and baselines are undisclosed, so treat its SOTA claim as discounted.
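Since the abstract gives no formulas, the following is only a guess at the general shape of "decompose trajectories into state-action samples, then compute advantages across samples". It uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation). The assumptions that every step inherits its trajectory's return and that the group is the pooled set of samples are mine, and `decompose` and `group_advantages` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Any, List, Tuple

Step = Tuple[Any, Any]                # (state, action)
Trajectory = Tuple[List[Step], float] # (steps, trajectory-level return)
Sample = Tuple[Any, Any, float]       # (state, action, reward)


def decompose(trajectories: List[Trajectory]) -> List[Sample]:
    """Flatten trajectories into state-action samples.

    Assumption: each step is credited with its trajectory's return.
    """
    return [(s, a, ret) for steps, ret in trajectories for s, a in steps]


def group_advantages(samples: List[Sample], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages, normalized across the pooled samples."""
    rewards = [reward for _, _, reward in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


trajectories = [
    ([("s0", "jump"), ("s1", "mine")], 1.0),  # toy successful episode
    ([("s0", "walk"), ("s1", "wait")], 0.0),  # toy failed episode
]
samples = decompose(trajectories)
advantages = group_advantages(samples)  # roughly [1.0, 1.0, -1.0, -1.0]
```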

HKR breakdown

hook ✓knowledge ✓resonance ✓

SCORE

70 · H1·K1·R1

arXiv · cs.LG · atom · EN · 04:00 · 05·22

→Research on behavior consistency improvement in deep reinforcement learning

Marcel Hussing and four coauthors propose QED, a state-dependent temperature schedule using double-critic disagreement, and report two orders of magnitude lower across-run policy divergence across 18 continuous-control tasks without sacrificing performance.

#Reasoning#Marcel Hussing#Benjamin Eysenbach#Eric Eaton

why featured

HKR-K passes via a concrete method, 18 tasks, and a two-order divergence claim. HKR-H/R are weak because this is specialized deep-RL research without product or frontier-model pull, so it stays in all.

editor take

QED cuts policy divergence by 100x across 18 control tasks; I buy the target, since RL deployment often dies on seed lottery.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

64 · H0·K1·R0

arXiv · cs.LG · atom · EN · 04:00 · 05·22

→Path-based Adaptive Weighting Improves Random Forest Classification

Youngjoon Park proposes path-based adaptive weighting for random forest classification, using root-to-leaf label-flip patterns as tree reliability signals; across 30 binary classification benchmarks with 30 repeats, it improves accuracy over standard random forests by 0.0011 with Wilcoxon p=0.007, while weighted RF and KNORA variants do not reach significance.

#Benchmarking#Youngjoon Park#arXiv#Research release

why featured

HKR-K passes on a testable weighting method and reported statistics; HKR-H/R fail because this is a niche classic-ML tweak with little industry pull. Low-value but not hard-excluded.

editor take

Youngjoon Park gets +0.0011 accuracy across 30 binary sets; p=0.007 is tidy, but don’t swap RF voting yet.

HKR breakdown

hook —knowledge ✓resonance —

SCORE

43 · H0·K1·R0
