
papers · 2026-05-22

23 papers · updated 3m ago
2026-05-22 · Fri
17:58
17d ago
arXiv · cs.AI · atom · EN · 17:58 · 05·22
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
SpaceNum evaluates VLMs with two bidirectional tasks, Num2Space and Space2Num, and finds that current models often perform close to random guessing across dynamic transitions and static layouts.
#Vision #Multimodal #Benchmarking #Research release
why featured
HKR-H/K/R all pass, but the arXiv summary lacks model names, score tables, and error conditions. This is a useful VLM benchmark item, not a featured-level story.
editor take
SpaceNum tests Num2Space and Space2Num; VLMs hover near random guessing, so CoT is no coordinate system.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
17:58
17d ago
arXiv · cs.CL · atom · EN · 17:58 · 05·22
ETCHR: Editing To Clarify and Harness Reasoning
ETCHR decouples a reasoning-aware image editor from downstream multimodal models and trains it in two stages; across five task families, it raises average Pass@1 from 55.95 to 60.77 on Qwen3-VL-8B, from 65.08 to 70.55 on Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 on Kimi K2.5.
#Reasoning #Multimodal #Vision #Qwen
why featured
HKR-H and HKR-K pass: the image-editing-before-reasoning angle is testable and the Pass@1 gain is concrete. Impact remains that of an arXiv multimodal reasoning method without open-source release, product adoption, or cross-source discussion, so it sits in the 60–71 band.
editor take
ETCHR lifts Kimi K2.5 Pass@1 by 4.61 points; I buy the decoupled editor, but VLM-reward eval leakage needs checking.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
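The gains behind the ETCHR numbers can be recomputed from the before/after averages quoted in the summary. A quick sketch: the figures are the ones reported above, while the dictionary layout and the names `pass_at_1` and `gains` are only illustrative.

```python
# Before/after average Pass@1 figures as quoted in the summary above.
pass_at_1 = {
    "Qwen3-VL-8B": (55.95, 60.77),
    "Gemini-3.1-Flash-Lite": (65.08, 70.55),
    "Kimi K2.5": (76.55, 81.16),
}

def gains(table):
    """Return the absolute gain in points per model, rounded to 2 decimals."""
    return {model: round(after - before, 2) for model, (before, after) in table.items()}

for model, delta in gains(pass_at_1).items():
    print(f"{model}: +{delta:.2f} points")
```

The three gains land within about one point of each other, so the 4.61 quoted for Kimi K2.5 is the smallest of the three, not the headline.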
17:55
17d ago
arXiv · cs.AI · atom · EN · 17:55 · 05·22
Good Token Hunting: Token Selection Method for Visual Geometry Transformers
Good Token Hunting restricts key/value tokens per query in global attention with inter-frame and intra-frame selection, accelerating visual geometry transformers by over 85% on scenes with 500 images while maintaining or improving baseline performance.
#Vision #Inference-opt #Research release
why featured
HKR-K and HKR-R pass: the 85% speedup and two-stage KV token selection are concrete. The scope is niche visual-geometry research, not a broad product or model release, so it stays in the 60–71 band.
editor take
Good Token Hunting reports 85%+ speedup on 500-image scenes; sparse attention keeps eating 3D reconstruction latency.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R1
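To make "restricts key/value tokens per query" concrete, here is a generic top-k sparse attention sketch for a single query. It is a stand-in, not the paper's method: the summary does not describe the inter-frame and intra-frame selection criteria, so ranking keys by dot-product score is an assumption, and `topk_attention` is a hypothetical name.

```python
import math

def topk_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys.

    Assumption: keys are ranked by plain dot-product score. The paper's
    actual inter-frame / intra-frame selection rules are not given above.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    # Indices of the k best keys; every other key/value pair is skipped.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)
    # Softmax over the kept keys only (shifted by the max for stability).
    weights = [math.exp(scores[i] - top) for i in keep]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / norm
            for d in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
print(topk_attention([1.0, 0.0], keys, values, k=2))
```

The softmax cost per query scales with `k` rather than with the number of tokens, which is where any latency saving would come from. This toy still scores every key to choose the top `k`, so a real speedup depends on making the selection itself cheap.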
17:45
17d ago
arXiv · cs.CL · atom · EN · 17:45 · 05·22
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
LINK swaps randomly selected words in part of the English pretraining corpus with word-level target-language translations at a chosen replacement ratio, using only a bilingual vocabulary and no extra training stage; evaluation across eight languages and five model sizes reports target-language downstream gains and up to 2x faster training to reach equivalent performance.
#Fine-tuning #Benchmarking #LINK #Research release
why featured
HKR-H/K/R all pass: the mechanism and numbers are concrete, and the cost angle is real. Kept below featured because it is a single arXiv paper with limited body detail and no disclosed open-source or industry replication.
editor take
LINK swaps English words via bilingual vocabularies and reports up to 2x faster training to equivalent performance across 8 languages; I buy it, but word-noise costs are undisclosed.
HKR breakdown
hook knowledge resonance
open source
71
SCORE
H1·K1·R1
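The LINK intervention described above is simple enough to sketch. This is a minimal illustration under stated assumptions: each translatable word is swapped independently with probability `ratio` (the summary does not give the paper's exact sampling scheme), whitespace tokenization stands in for real preprocessing, and `lexical_swap` and `toy_vocab` are hypothetical names with a made-up English-to-German vocabulary.

```python
import random

def lexical_swap(text, bilingual_vocab, ratio, seed=0):
    """Swap roughly `ratio` of the translatable words for target-language words.

    Assumption: an independent coin flip per word; the paper's sampling
    details are not given in the summary.
    """
    rng = random.Random(seed)
    out = []
    for word in text.split():
        translation = bilingual_vocab.get(word.lower())
        if translation is not None and rng.random() < ratio:
            out.append(translation)   # word-level replacement, no context used
        else:
            out.append(word)          # untranslatable or not selected: keep as-is
    return " ".join(out)

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"cat": "Katze", "house": "Haus", "water": "Wasser"}
print(lexical_swap("the cat drinks water in the house", toy_vocab, ratio=0.5))
```

Because replacement is word-level and context-free, inflection and word order stay English. That is the "word noise" the editor take asks about.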
17:45
17d ago
arXiv · cs.AI · atom · EN · 17:45 · 05·22
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
PGT overlays unambiguous geometric primitives on images to generate dense supervision, improving MLLM visual grounding by up to 20% on What'sUp and 13.3% on CV-Bench-2D under LLaVA-v1.5-Instruct instruction tuning.
#Multimodal #Vision #Fine-tuning #LLaVA
why featured
HKR-H and HKR-K pass: the paper gives a concrete procedural-supervision mechanism and two benchmark gains. Reach stays research-heavy, with no major lab, released artifact, or production-replacement claim, so it fits the 60–71 band.
editor take
PGT adds geometric overlays to LLaVA-v1.5 and gains up to 20% on What’sUp; test real occlusion before calling it spatial reasoning progress.
HKR breakdown
hook knowledge resonance
open source
67
SCORE
H1·K1·R0
17:25
17d ago
arXiv · cs.AI · atom · EN · 17:25 · 05·22
Human Decision-Making with Persuasive and Narrative LLM Explanations
The study evaluates LLM-generated narrative explanations in a large-scale human behavioral experiment and finds persuasiveness did not meaningfully improve classification decision accuracy over a simple AI prediction alone.
#Reasoning #Interpretability #Research release
why featured
HKR-H/K/R all pass, but the post gives only the finding; sample size, task setup, and effect sizes are not disclosed. As an arXiv XAI study it is useful, not featured-level signal.
editor take
Large behavioral study: LLM narratives did not improve classification accuracy; they increased reliance even when the AI was wrong.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
17:20
17d ago
arXiv · cs.AI · atom · EN · 17:20 · 05·22
Research proposes framework for causal generative modeling with foundation models
The paper introduces FM-CGM, a three-component framework for visual causal reasoning that combines a large reasoning model with a text-to-image diffusion model; the RSS snippet does not disclose datasets, metric values, or baseline results.
#Reasoning #Vision #Multimodal #Research release
why featured
HKR-K passes for the three-part FM-CGM mechanism, but the post gives no datasets, metrics, or baselines. HKR-H and HKR-R are weak, so this lands as a low-end research signal in the all tier.
editor take
FM-CGM chains 3 modules, but RSS gives no datasets, metrics, or baselines; I don’t buy “faithful” yet.
HKR breakdown
hook knowledge resonance
open source
52
SCORE
H0·K1·R0
16:29
17d ago
arXiv · cs.CL · atom · EN · 16:29 · 05·22
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
ToolMerge uses an LLM planner to decompose long-video queries into tool calls and merge per-tool rankings with boolean operators; on the M2M benchmark, it is competitive across QA, question retrieval, and caption retrieval, with a 5% gain over other methods on caption retrieval.
#Agent #Vision #Tools #ToolMerge
why featured
HKR-K passes via the tool-call decomposition mechanism and 5% benchmark gain. HKR-H and HKR-R are weak, so this fits the 60–71 band as useful but niche research.
editor take
ToolMerge gains 5% on M2M caption retrieval; boolean rank merging is plain, but more debuggable than end-to-end video retrieval.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
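Boolean merging of per-tool rankings can be sketched in a few lines. This is one plausible reading, not ToolMerge's actual operators: the summary does not define how AND and OR act on rankings, so the tie-breaking rules below are assumptions, and the tool outputs are hypothetical frame IDs.

```python
def merge_and(a, b):
    """Frames ranked by both tools, ordered by the sum of their rank positions."""
    pos_b = {frame: i for i, frame in enumerate(b)}
    both = [(i + pos_b[frame], frame) for i, frame in enumerate(a) if frame in pos_b]
    return [frame for _, frame in sorted(both)]

def merge_or(a, b):
    """Frames ranked by either tool, ordered by their best rank position."""
    best = {}
    for ranking in (a, b):
        for i, frame in enumerate(ranking):
            best[frame] = min(i, best.get(frame, i))
    return [frame for _, frame in sorted((i, frame) for frame, i in best.items())]

object_hits = [12, 40, 7, 93]   # hypothetical output of an object-detection tool
text_hits = [40, 55, 12]        # hypothetical output of an OCR/caption tool
print(merge_and(object_hits, text_hits))  # [40, 12]
print(merge_or(object_hits, text_hits))   # [12, 40, 55, 7, 93]
```

Each intermediate list is inspectable, which is the debuggability point in the editor take: a wrong keyframe can be traced to the tool or the operator that introduced it.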
16:24
17d ago
arXiv · cs.CL · atom · EN · 16:24 · 05·22
Research Shows Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
The paper starts from a WordNet hypernym-graph co-occurrence assumption and proves that leading eigenvectors of the word2vec embedding Gram matrix separate taxonomic branches from coarse to fine, then validates the signature across sampled WordNet subtrees and Gemma 2B unembeddings.
#Embedding #Interpretability #WordNet #Gemma
why featured
HKR-H and HKR-K pass: the hook is concept hierarchy from word co-occurrence, with a word2vec Gram-matrix mechanism and Gemma 2B unembedding checks. It stays theoretical, so it lands in the all tier.
editor take
The authors prove word2vec Gram eigenvectors split WordNet branches; Gemma 2B matching it makes hierarchy look statistical, not mystical.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H1·K1·R0
14:57
17d ago
arXiv · cs.CL · atom · EN · 14:57 · 05·22
NLG Evaluation: Past, Present, Future
The paper reviews NLG evaluation from 1990 to 2026, contrasting the early years' scarcity of formal experimental evaluation with today’s ML-linked evaluation norms, and names LLM-as-Judge plus impact, qualitative, and safety evaluation as future priorities.
#Benchmarking #Safety #Research release #Safety/alignment
why featured
HKR-K and HKR-R pass: it offers a 1990–2026 NLG evaluation map and a clear LLM-as-Judge placement. HKR-H is weak, and as a survey rather than a product or model release, it stays in the 60–71 band.
editor take
It spans 1990–2026 NLG evaluation; LLM-as-Judge is now canon, and safety/impact evals stop being appendix work.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H0·K1·R1
14:52
17d ago
arXiv · cs.CL · atom · EN · 14:52 · 05·22
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks
The paper proposes a weak-label benchmark audit using MPDS, ΔEvi, and reader-strength calibration together. Synthetic HotpotQA gives a counterexample: MPDS is 0.643, yet ΔEvi is zero, so metadata-only screening does not test evidence dependence.
#Benchmarking #Reasoning #HotpotQA #SNLI
why featured
HKR-H/K/R all pass, but this is an academic benchmark-audit method with limited immediate product impact. No model release, shipped tool, or cross-source cluster, so it stays in the 60–71 band.
editor take
Synthetic HotpotQA has MPDS 0.643 and ΔEvi 0; metadata shortcut checks alone miss evidence-blind benchmarks.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K1·R1
14:49
17d ago
arXiv · cs.CL · atom · EN · 14:49 · 05·22
ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
ChartFI-Bench evaluates MLLM chart descriptions with 896 complex chart-description pairs and four metrics: Faithfulness, Coverage, Informativeness, and Acuity.
#Multimodal #Vision #Benchmarking #ChartFI-Bench
why featured
HKR-K is supported by the 896-pair dataset and four-metric design; HKR-R comes from chart-reading reliability concerns. No model ranking, artifact detail, or surprising result is disclosed, so this stays in all.
editor take
ChartFI-Bench has 896 complex chart-description pairs; small set, but it pushes chart eval beyond value recitation into insight quality.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R1
13:12
17d ago
HuggingFace Papers (takara mirror) · rss · EN · 13:12 · 05·22
Preisach Attention: Hysteretic Sequential Memory Mechanism
The paper introduces the Preisach Attention Layer, which replaces softmax attention with a binary relay operator using learned activation and deactivation thresholds; a single-layer PAL-Transformer is Turing-complete under arbitrary precision arithmetic, and its total inference cost is O(n log n) versus O(n^2) for standard attention.
#Memory #Reasoning #Inference-opt #Research release
why featured
HKR-H/K/R all pass, but the post gives mechanism and theoretical cost only; no benchmarks, artifact, or major-lab backing is disclosed. This fits the 60–71 research-interest band, not featured.
editor take
PAL swaps softmax for binary relays at O(n log n); Turing-complete sounds flashy, but it gives up random access.
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H1·K1·R1
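The "binary relay with activation and deactivation thresholds" is the classic hysteresis relay that Preisach models are built from. A minimal sketch of one relay follows. The thresholds are fixed constants here rather than learned, and how the paper composes relays into an attention layer is not described in the summary, so treat this as background rather than the PAL layer itself.

```python
def relay(inputs, on_threshold, off_threshold, state=0):
    """Run one hysteretic relay over a sequence and return its 0/1 outputs."""
    if off_threshold > on_threshold:
        raise ValueError("off_threshold must not exceed on_threshold")
    outputs = []
    for x in inputs:
        if x >= on_threshold:
            state = 1            # input crossed the activation threshold
        elif x <= off_threshold:
            state = 0            # input fell to the deactivation threshold
        # Between the thresholds the relay holds its previous state.
        outputs.append(state)
    return outputs

signal = [0.1, 0.6, 0.9, 0.6, 0.4, 0.2, 0.6]
print(relay(signal, on_threshold=0.8, off_threshold=0.3))  # [0, 0, 1, 1, 1, 0, 0]
```

The input 0.6 appears three times and yields 0, 1, and 0: the output depends on history, not only on the current value. That is the sequential-memory property the title refers to.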
05:59
18d ago
HuggingFace Papers (takara mirror) · rss · EN · 05:59 · 05·22
Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation
IDEA converts VLN test-time adaptation into an asset library using soft prompts, Fisher-guided weighting, and domain coordinates, then projects a target domain onto the convex hull of historical knowledge for training-free cross-domain bridging on REVERIE, R2R, and R2R-CE.
#Agent #Vision #Multimodal #Research release
why featured
HKR-K passes: the post gives IDEA’s training-free cross-domain mechanism and three VLN benchmarks. HKR-H/R are weak because this is a niche embodied-navigation method, below featured threshold.
editor take
IDEA tests training-free bridging on REVERIE, R2R, and R2R-CE; gains are undisclosed, so I read it as a VLN prompt bank.
HKR breakdown
hook knowledge resonance
open source
66
SCORE
H0·K1·R0
04:19
18d ago
HuggingFace Papers (takara mirror) · rss · EN · 04:19 · 05·22
FastKernels GPU Kernel Generation Benchmarking Framework Released
FastKernels uses 46 representative architectures across 8 categories to cover 96.2% of HuggingFace Transformers architectures, and the strongest evaluated kernel agent reaches only 0.94× aggregate speedup over production baselines.
#Agent #Code #Benchmarking #Snowflake AI Research
why featured
HKR-H/K/R all pass: the 0.94× result is a useful counter-hook for agent hype. GPU-kernel generation is a narrow topic with a high barrier to entry, so the score stays in the all tier rather than featured.
editor take
FastKernels covers 409/425 HF Transformers architectures; the best agent hits 0.94×, so autonomous kernel generation still trails production baselines.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
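Both FastKernels figures are easy to sanity-check. The sketch below uses the 409/425 counts and the 0.94× figure quoted above. Reading "speedup" as baseline time divided by agent time is an assumption about the benchmark's definition.

```python
covered, total = 409, 425            # architecture counts quoted in the editor take
coverage = covered / total
print(f"coverage: {coverage:.1%}")   # coverage: 96.2%

speedup = 0.94                       # aggregate speedup vs production baselines
# Assumption: speedup = baseline_time / agent_time, so a value below 1 is a slowdown.
extra_runtime = 1 / speedup - 1
print(f"agent kernels run about {extra_runtime:.1%} longer than baseline")
```

Under that reading, the best agent's kernels take roughly 6.4% longer than the production baselines in aggregate.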
04:00
18d ago
● P1 · arXiv · cs.LG · atom · EN · 04:00 · 05·22
Insights Generator: Corpus-Level Trace Diagnostics for LLM Agents
Insights Generator diagnoses LLM agent execution traces with a multi-agent scout-investigator architecture, and human experts implementing its reports improved scaffold performance by 30.4 percentage points over the unmodified baseline.
#Agent #Tools #Benchmarking #Akshay Manglik
why featured
HKR-H/K/R all pass: the agent-debugging hook is clear, the paper gives a 30.4-point result and a concrete corpus-level diagnostic mechanism. It is a strong research-tool signal, not an 85+ same-day must-write release.
editor take
Three feeds point to one arXiv paper, not three validations. The 30.4pp gain is tempting; deployment lives or dies on messy production traces.
sharp
All 3 sources point to the same arXiv 2605.21347 record, with identical framing; this is distribution, not independent validation. Insights Generator frames agent debugging as corpus-level hypothesis generation and testing, and the hard number is a 30.4pp scaffold-performance gain after human experts used IG reports. I buy the problem more than the claimed lift. Agent teams have spent the last year staring at single-run success rates, while production failures hide in repeated tool misuse, premature stopping, and bad recovery across tens of thousands of trace tokens. IG’s scout-investigator setup fits that pain. But the abstract does not disclose benchmark names, corpus size, or compute cost here. Without those, 30.4pp reads like a strong demo result, not yet an operations tool.
HKR breakdown
hook knowledge resonance
open source
92
SCORE
H1·K1·R1
04:00
18d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·22
GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
GROW trains open-world VLM agents by decomposing full trajectories into state-action samples and computing advantages across samples; the paper reports state-of-the-art performance on more than 800 Minecraft tasks, while the abstract does not disclose exact scores or baselines.
#Agent #Multimodal #Reasoning #Xiongbin Wu
why featured
HKR-H/K/R all pass, but this is a single arXiv paper with method and 800+ Minecraft SOTA only; no code, major lab backing, or production replacement claim. Score stays in the 60–71 band.
editor take
GROW claims SOTA on 800+ Minecraft tasks; exact scores and baselines are undisclosed, so discount the claim.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K1·R1
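For readers unfamiliar with GRPO-style advantages, here is one plausible reading of "decomposing full trajectories into state-action samples and computing advantages across samples". It is a sketch under loud assumptions: each sample inherits its trajectory's return, and advantages are normalized across all samples with the usual group mean and standard deviation. The abstract does not confirm either choice, and `sample_level_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev

def sample_level_advantages(trajectories, eps=1e-8):
    """Flatten trajectories into (state, action) samples and normalize returns.

    Assumption: every step inherits the return of its trajectory, and the
    advantage is (return - mean) / (std + eps) over all samples.
    """
    samples = [
        (state, action, traj["ret"])
        for traj in trajectories
        for state, action in traj["steps"]
    ]
    returns = [r for _, _, r in samples]
    mu, sigma = mean(returns), pstdev(returns)
    return [(s, a, (r - mu) / (sigma + eps)) for s, a, r in samples]

# Hypothetical toy rollouts: one successful, one failed.
rollouts = [
    {"ret": 1.0, "steps": [("s0", "chop"), ("s1", "craft")]},
    {"ret": 0.0, "steps": [("s0", "walk"), ("s2", "dig")]},
]
for state, action, adv in sample_level_advantages(rollouts):
    print(state, action, round(adv, 3))
```

Steps from the successful rollout get positive advantages and steps from the failed one get negative ones. Whether GROW assigns credit this coarsely is exactly what the undisclosed details would settle.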
04:00
18d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·22
Research on behavior consistency improvement in deep reinforcement learning
Marcel Hussing and four coauthors propose QED, a state-dependent temperature schedule using double-critic disagreement, and report two orders of magnitude lower across-run policy divergence across 18 continuous-control tasks without sacrificing performance.
#Reasoning #Marcel Hussing #Benjamin Eysenbach #Eric Eaton
why featured
HKR-K passes via a concrete method, 18 tasks, and a two-order divergence claim. HKR-H/R are weak because this is specialized deep-RL research without product or frontier-model pull, so it stays in all.
editor take
QED cuts policy divergence by 100x across 18 control tasks; I buy the target, since RL deployment often dies on seed lottery.
HKR breakdown
hook knowledge resonance
open source
64
SCORE
H0·K1·R0
04:00
18d ago
arXiv · cs.LG · atom · EN · 04:00 · 05·22
Path-based Adaptive Weighting Improves Random Forest Classification
Youngjoon Park proposes path-based adaptive weighting for random forest classification, using root-to-leaf label-flip patterns as tree reliability signals; across 30 binary classification benchmarks with 30 repeats, it improves accuracy over standard random forests by 0.0011 with Wilcoxon p=0.007, while weighted RF and KNORA variants do not reach significance.
#Benchmarking #Youngjoon Park #arXiv #Research release
why featured
HKR-K passes on a testable weighting method and reported statistics; HKR-H/R fail because this is a niche classic-ML tweak with little industry pull. Low-value but not hard-excluded.
editor take
Youngjoon Park gets +0.0011 accuracy across 30 binary sets; p=0.007 is tidy, but don’t swap RF voting yet.
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
