ax@ax-radar:~/all $ grep -v 'tier=excluded' stream.log
45 src · signal 72% · cycle 04:32

posts · 2026-04-11

42 items · updated 3m ago
RSS live
2026-04-11 · Sat
18:47
58d ago
arXiv · cs.CL · atom · EN · 18:47 · 04·11
Comparative Analysis of Large Language Models in Healthcare
This study evaluates 5 model families on 2 medical tasks, covering ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor. It uses 3 open datasets (MedMCQA, PubMedQA, and Asclepius); the snippet says ChatDoctor is stronger on contextual reliability, while Grok and LLaMA score higher on structured QA accuracy. The key point is the task split: which model looks stronger depends on the task type. The post does not disclose exact scores, model versions, or statistical significance.
#Benchmarking #Reasoning #OpenAI #Meta
why featured
The piece discloses only a healthcare benchmark setup—5 model families, 2 task types, 3 open datasets. With no concrete scores, model versions, or statistical significance, HKR-H/K/R all fail for this audience, so it lands as excluded.
HKR breakdown
hook knowledge resonance
open source
44
SCORE
H0·K0·R0
15:58
58d ago
● P1 · arXiv · cs.CL · atom · EN · 15:58 · 04·11
The Amazing Agent Race: Strong Tool Users, Weak Navigators
A University of Minnesota team released AAR, a 1,400-instance DAG benchmark, and the best agent reached only 37.2% accuracy. It includes 800 sequential and 600 compositional tasks; navigation errors account for 27% to 52% of trials, while tool-use errors stay below 17%. The key signal is navigation failure, which linear benchmarks miss.
#Agent #Tools #Benchmarking #University of Minnesota
why featured
HKR-H/K/R all pass: the contrast is clickable, the benchmark adds concrete numbers, and the result matters to agent builders. Featured, not P1, because this is a strong research/benchmark release rather than a top-tier model launch or industry-shaking product event.
editor take
Minnesota ran agents on 1,400 DAG tasks and the best hit 37.2%; that punctures the idea that good tool calls equal capable agents.
sharp
The Minnesota team put agents through 1,400 DAG-style tool tasks, and the best system only reached 37.2% accuracy; that strongly suggests today’s agent ceiling is navigation, not tool invocation. Their breakdown is the useful part: navigation errors account for 27% to 52% of trials, while tool-use errors stay under 17%. That gap is wide enough that you can’t keep blaming failures on function-calling syntax or flaky APIs.
I think this paper matters because it changes the task geometry. A lot of tool-use benchmarks are still basically straight lines: search, call tool, extract result, answer. The paper says six existing benchmarks contain 55% to 100% simple chains of 2 to 5 steps. In that setup, agents can look competent because local correctness is enough. AAR forces fork-merge behavior. The agent has to choose branches, visit the right pages, combine intermediate results, and avoid wandering. That is much closer to real agent work than the clean linear scripts that dominate demos.
This also lines up with a broader pattern from the last year. On benchmarks like GAIA, WebArena, and several coding-agent evaluations, single-step model quality improved faster than end-to-end task completion. I’m not pulling exact figures from memory here, so I won’t fake precision, but the directional pattern has been consistent: better models do not automatically become good navigators. AAR sharpens that diagnosis. The bottleneck is partly state tracking and next-hop selection, not just context length or raw reasoning. Anyone who has inspected production traces has seen this: tool calls are valid, arguments are formatted correctly, and the agent still drifts off the task.
I do have one pushback. Wikipedia is a smart choice for verifiable, procedurally generated evaluation, but it also biases the failure mode toward hyperlink navigation and public knowledge lookup. Enterprise agents usually operate across Jira, Slack, Notion, SQL, and internal APIs. Navigation failures there often come from permissions, naming ambiguity, stale memory, and role-dependent visibility, not just page selection. So AAR illuminates a real pathology, but it is not the whole clinical picture. The body also doesn’t disclose enough detail on which loop policies were used, how often replanning happened, or how performance varies by difficulty tier. I’d want the full paper before over-generalizing.
One more signal stands out. Claude Code roughly matches Codex CLI at about 37% while using 6x fewer tokens. For practitioners, that is more important than who tops the leaderboard. It says agent architecture has not been flattened by model scale. Search policy, memory compression, rollback behavior, and replanning triggers still matter a lot. If your product plan is still “swap in a bigger model and add more tools,” this benchmark is a pretty direct warning.
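To make the "straight line versus fork-merge" distinction above concrete, here is a minimal sketch that labels a task graph by its shape. The edge-list format, the step names, and the `classify_task_graph` helper are all hypothetical illustrations; nothing here reflects AAR's actual task format, which the post does not disclose.

```python
from collections import defaultdict


def classify_task_graph(edges):
    """Label a task graph as 'linear' or 'fork-merge' from its edge list.

    edges: iterable of (source_step, destination_step) pairs.
    A step with more than one outgoing edge is a fork; a step with
    more than one incoming edge is a merge.
    """
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1

    nodes = set(out_degree) | set(in_degree)
    forks = sorted(n for n in nodes if out_degree[n] > 1)
    merges = sorted(n for n in nodes if in_degree[n] > 1)
    shape = "fork-merge" if forks or merges else "linear"
    return shape, forks, merges


# Hypothetical linear task: search -> call tool -> extract -> answer.
linear_task = [
    ("search", "call_tool"),
    ("call_tool", "extract"),
    ("extract", "answer"),
]

# Hypothetical fork-merge task: two pages must be visited and combined.
dag_task = [
    ("start", "page_a"),
    ("start", "page_b"),
    ("page_a", "combine"),
    ("page_b", "combine"),
    ("combine", "answer"),
]

if __name__ == "__main__":
    print(classify_task_graph(linear_task))
    print(classify_task_graph(dag_task))
```

In the linear case every step has exactly one successor, so a locally correct agent never has to choose a branch; the fork-merge case is where next-hop selection and state tracking start to matter.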
HKR breakdown
hook knowledge resonance
open source
88
SCORE
H1·K1·R1
14:38
58d ago
● P1 · arXiv · cs.CL · atom · EN · 14:38 · 04·11
CodeComp: Structural KV Cache Compression for Agentic Coding
CodeComp adds static program analysis to KV-cache compression for long-repository bug localization and patch generation. It uses Code Property Graph priors from Joern to preserve structurally critical tokens; the post does not disclose benchmark names, compression ratios, or absolute scores. The part to watch is practical: it is training-free, model-agnostic, and claims direct integration with SGLang agentic coding pipelines.
#Code #Inference-opt #Agent #Joern
why featured
The paper clears HKR-H/K/R: the static-analysis + KV-compression combo is novel, the mechanism is concrete, and coding-agent users care about long-repo context cost. Held to 76 because the post does not disclose benchmark names, compression ratio, or absolute scores.
editor take
CodeComp brings KV compression back to code structure, not attention worship. If Joern overhead stays sane, coding agents should care.
sharp
Two sources carry the same paper framing, and the alignment looks like an arXiv-to-TLDR chain, not independent validation. CodeComp’s claim is clean: coding-agent KV compression breaks when attention scores alone decide eviction, because call sites, branch conditions, and assignments are structural anchors, not just high-attention tokens. The concrete hook is Joern-derived Code Property Graph priors, training-free integration, SGLang compatibility, and no model modification. The body says CodeComp beats attention-only baselines under equal memory budgets and matches full-context patch-generation quality. I’ll be real: the missing numbers matter here. No compression ratio, latency tax, or VRAM delta is disclosed in the provided body. Compared with broader KV papers like RLKV or KVCOMM, CodeComp reads less like a universal cache theory and more like a workload-specific fix for agentic coding.
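The contrast between attention-only eviction and structure-aware eviction can be sketched in a few lines. This is not CodeComp's algorithm: the post does not disclose how the Code Property Graph priors are combined with attention scores, so the additive rule, the `prior_weight` parameter, and the hand-written `structural_flags` below are assumptions made purely for illustration. No Joern call is involved.

```python
def select_tokens_to_keep(attention_scores, structural_flags, budget, prior_weight=1.0):
    """Return sorted indices of tokens to keep under a fixed cache budget.

    attention_scores: one float per token (higher = more attended).
    structural_flags: one bool per token, True if the token is assumed to be
        a structural anchor (call site, branch condition, assignment).
    budget: number of tokens the cache may retain.
    prior_weight: hypothetical bonus added to flagged tokens; 0.0 reduces
        this to attention-only selection.
    """
    if len(attention_scores) != len(structural_flags):
        raise ValueError("scores and flags must have the same length")
    if budget < 0:
        raise ValueError("budget must be non-negative")

    combined = [
        score + (prior_weight if flagged else 0.0)
        for score, flagged in zip(attention_scores, structural_flags)
    ]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: tokens 1 and 3 get little attention but are flagged as anchors.
scores = [0.9, 0.1, 0.8, 0.05, 0.7]
flags = [False, True, False, True, False]

attention_only = select_tokens_to_keep(scores, flags, budget=3, prior_weight=0.0)
structure_aware = select_tokens_to_keep(scores, flags, budget=3, prior_weight=1.0)

if __name__ == "__main__":
    print(attention_only)   # [0, 2, 4]
    print(structure_aware)  # [0, 1, 3]
```

Under the same budget of three tokens, attention-only selection drops both flagged anchors, while the biased score keeps them. Whether the real method behaves this way depends on numbers the post does not give.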
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
13:43
58d ago
arXiv · cs.CL · atom · EN · 13:43 · 04·11
Relational Probing: Adapting Language Models to Graphs for Financial Prediction
The paper proposes Relational Probing, replacing the LM head with a relation head that induces graphs from hidden states and is trained jointly for stock-trend prediction. Experiments use Qwen3 0.6B, 1.7B, and 4B; the authors define SLMs as models fine-tunable end to end on one 24GB GPU under stated batch and sequence settings. The snippet says it beats a co-occurrence baseline, but does not disclose exact metrics.
#Reasoning #Fine-tuning #Benchmarking #Qwen3
why featured
The paper sits in a narrow financial-prediction niche and the summary does not disclose key result numbers. It fits hard-exclusion-technical-accessibility fail for this audience, so importance stays below 40 and the tier is excluded.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
13:16
58d ago
HuggingFace Papers (takara mirror) · rss · EN · 13:16 · 04·11
Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks
The paper derives a closed-form upper bound for the largest Hessian eigenvalue of cross-entropy loss in smooth nonlinear multilayer neural networks. The bound depends on affine parameters, hidden-layer dimensions, and training-sample orthogonality; the post does not disclose theorem conditions, experiment scale, or approximation error. The key point is replacing numerical eigenspectrum estimation with direct sharpness analysis.
#Interpretability #Research release
why featured
HKR-K passes on one concrete claim: a closed-form upper bound for the top Hessian eigenvalue. But this is a specialist curvature result with no on-ramp, and the summary omits theorem conditions, error bounds, and experiment scale, so hard-exclusion-technical-accessibility-fail applies.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
11:11
58d ago
arXiv · cs.CL · atom · EN · 11:11 · 04·11
ODUTQA-MDC: Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification
The paper introduces the ODUTQA-MDC task and a first benchmark for open-domain underspecified tabular QA, covering 209 tables and 25,105 QA pairs for multi-turn clarification. The benchmark adds fine-grained labels and a dynamic clarification interface to simulate user feedback; the abstract also presents MAIC-TQA, but does not disclose model scale or benchmark scores. What matters for practitioners is the shift from single-turn answer accuracy to evaluating clarification before answering.
#Agent #Benchmarking #Reasoning #arXiv
why featured
HKR-K lands because the paper turns clarification-before-answer into a measurable benchmark with 209 tables and 25,105 QA pairs. HKR-H/R miss: the niche ODUTQA framing is academic, and the summary does not disclose baseline scores, model scale, or product relevance.
editor take
The benchmark gets the problem framing right with 209 tables and multi-turn clarification; I’m not buying the “open-domain” claim yet.
sharp
I like the framing here more than I trust the headline. ODUTQA-MDC takes a very real failure mode in table QA (users ask underspecified questions) and turns it into an explicit task with 209 tables and 25,105 QA pairs. That is directionally correct. In production data assistants, the miss is often not retrieval or arithmetic; it’s that the user asked “Which product sold best last year?” without specifying region, channel, or even whether “best” means units or revenue. A benchmark that scores clarification before answering is closer to reality than another round of single-turn exact match.
That said, I’m not ready to grant the “open-domain” label on the abstract alone. Two hundred nine tables is enough to define a task and study error modes. It is not enough, by itself, to convince me this captures open-domain variability. Older table benchmarks like WikiTableQuestions, TabFact, HybridQA, and FeTaQA already exposed how messy table reasoning becomes once schema variation, lexical mismatch, and outside knowledge show up together. ODUTQA-MDC’s novelty is the underspecification-plus-dialogue setup, and that part is useful. The “open-domain” claim still feels stretched unless the full paper shows broad source diversity and strong transfer beyond its own collection.
My bigger pushback is the dynamic clarification interface. “Simulates user feedback” is the key phrase, and simulated users are where many interactive benchmarks get overly clean. They answer cooperatively. They stay on the annotation path. They do not contradict themselves or shift intent mid-dialogue. Real users do all three. If the paper does not disclose the simulator policy, ambiguity taxonomy, stopping criteria, and the cost model for extra turns, then any MAIC-TQA result is hard to price in. The abstract also does not disclose model scale, baseline scores, or whether the multi-agent setup beats a strong single-agent prompt/tool pipeline by a meaningful margin.
The broader context is important. For about a year, frontier-model system behavior has been moving toward “ask when uncertain,” but public evals still reward premature answers. That gap shows up in agent work, browser tasks, and spreadsheet copilots: models often fail because they act too early, not because they cannot reason. If ODUTQA-MDC evaluates ambiguity detection, clarification quality, and final-answer improvement separately, it fills a hole that a lot of current benchmarks leave open.
So my read is: good correction to the field’s incentives, not yet a benchmark I’d anchor on. I want to see three things in the full paper before taking it seriously: how the simulated user is built, how much net gain clarification adds versus turn cost, and whether performance transfers beyond these 209 tables. Without that, this is a strong task proposal with a useful instinct, not a settled reference point.
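The "net gain versus turn cost" question above can be written down as simple arithmetic. The sketch below is one way to do that, under loudly stated assumptions: the episode fields, the linear per-turn penalty, and the `turn_cost` value are all hypothetical, since the abstract does not disclose the benchmark's cost model or scoring.

```python
def net_clarification_gain(episodes, turn_cost=0.1):
    """Compare answering directly with answering after clarification.

    episodes: list of dicts with hypothetical keys
        'correct_direct'    (bool) answer was right without clarifying
        'correct_clarified' (bool) answer was right after clarifying
        'turns'             (int)  clarification turns spent
    turn_cost: assumed accuracy-equivalent penalty per clarification turn.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")

    acc_direct = sum(e["correct_direct"] for e in episodes) / n
    acc_clarified = sum(e["correct_clarified"] for e in episodes) / n
    avg_turns = sum(e["turns"] for e in episodes) / n

    return {
        "acc_direct": acc_direct,
        "acc_clarified": acc_clarified,
        "avg_turns": avg_turns,
        # Raw accuracy lift minus the assumed cost of the extra turns.
        "net_gain": (acc_clarified - acc_direct) - turn_cost * avg_turns,
    }


# Toy data, not from the paper.
toy_episodes = [
    {"correct_direct": True, "correct_clarified": True, "turns": 0},
    {"correct_direct": False, "correct_clarified": True, "turns": 1},
    {"correct_direct": False, "correct_clarified": True, "turns": 2},
    {"correct_direct": False, "correct_clarified": False, "turns": 1},
]

if __name__ == "__main__":
    print(net_clarification_gain(toy_episodes))
```

The point of separating the terms is that a large raw lift can still be a poor trade if it takes many turns to get there, which is exactly the detail a cooperative simulated user tends to hide.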
HKR breakdown
hook knowledge resonance
open source
69
SCORE
H0·K1·R0
10:33
58d ago
HuggingFace Papers (takara mirror) · rss · EN · 10:33 · 04·11
MOSAIC: Multi-Domain Orthogonal Session Adaptive Intent Capture for Prescient Recommendations
MOSAIC uses a triple-encoder design to split multi-domain session preferences into 3 parts: domain-specific, domain-common, and cross-sequence-exclusive, for recommendation. It combines domain masking, gradient reversal, alignment, independence constraints, and dynamic gating; the post says it beats prior methods on 2 real-world benchmarks, but does not disclose exact metrics.
#Research release #Benchmark
why featured
HKR-K passes because the post names a 3-encoder architecture, domain masking, GRL, and dynamic gating. It still triggers hard-exclusion-technical-accessibility: this is a specialized recommender paper, and the article gives no benchmark deltas for a broader AI audience.
HKR breakdown
hook knowledge resonance
open source
42
SCORE
H0·K1·R0
10:00
58d ago
● P1 · arXiv · cs.CL · atom · EN · 10:00 · 04·11
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model Capabilities
The paper inserts delimiters at sentence boundaries and tests the method on models from 7B to 600B, reporting gains up to 7.7% on GSM8K and 12.5% on DROP. It covers both in-context learning and supervised fine-tuning; the snippet says fine-tuned models show sentence awareness in internal representations, but the post does not disclose the exact evaluation setup. The key point is the mechanism is lightweight: it makes sentence structure explicit in context rather than adding new modules.
#Reasoning #Fine-tuning #Interpretability #DeepSeek
why featured
HKR-H/K/R all pass: the hook is a minimal intervention—insert sentence boundaries—and the feed reports 7B–600B tests, GSM8K +7.7%, DROP +12.5%, across ICL and SFT. It stays below P1 because the article summary does not disclose the full eval setup or replication details.
editor take
This paper lifts GSM8K by 7.7% and DROP by 12.5% with sentence delimiters. I don’t read that as a trick; I read it as evidence many LLMs still lack a stable sentence-level compute prior.
sharp
The paper reports +7.7% on GSM8K and +12.5% on DROP by inserting delimiters at sentence boundaries, under a setup that explicitly segments the input into sentences. My read is simple: if that lightweight change helps from 7B models up to 600B DeepSeek-V3, the interesting signal is not that prompting still has tricks left. The signal is that many LLMs still do not treat the sentence as a stable unit of computation.
That matters more than the headline gains. For the last year, the field has spent a lot of energy on test-time scaling, chain-of-thought scaffolds, step markers, XML wrappers, and dummy tokens. The implicit assumption behind a lot of that work is that the model will infer a useful processing granularity on its own. I’ve never fully bought that. Pretraining data contains punctuation and line breaks, yes, but tokenization plus next-token loss does not force a model to represent sentence boundaries as hard organizational cues. A transformer sees token streams, not syntax trees. If you give it an explicit delimiter, you are injecting a strong structural prior: “compress here, separate here, retrieve across here.” That can change attention routing and memory packing without adding any new module. Honestly, that is more interesting than many papers that bolt on extra components and call it progress.
There is also decent outside context for why this is plausible. A lot of 2024–2025 structured prompting practice worked for basically this reason. XML tags, bulletized decomposition, “Step 1 / Step 2,” and clearly separated instruction-context-example blocks often improved reliability across models. OpenAI and Anthropic both pushed prompt hygiene that relies on explicit segmentation. The difference here is that this paper isolates sentence boundaries as the structural signal, instead of treating all delimiters as equivalent prompt-engineering clutter. If that distinction holds, it tightens a messy body of empirical lore into a sharper claim: language models are highly sensitive to explicit linguistic boundaries, and scaling alone has not erased that dependence.
I still have real reservations about the evidence disclosed so far. The body here is only an RSS snippet. It gives peak gains, but not the baselines, variance, delimiter format, prompt template, token overhead, task mix, or how the effect scales with model size. A 7B model getting most of the lift and a 600B model getting most of the lift tell very different stories. A 7.7% GSM8K gain also means very different things depending on whether the baseline is 80% or 20%. Same for DROP: exact match vs F1, few-shot vs fine-tuned, single run vs averaged runs all matter. Right now the title gives the claim, but the snippet does not disclose the evaluation setup that determines whether this is robust or brittle.
There is another pushback I care about: is this a sentence-boundary effect, or just an extra-token effect? A lot of “reasoning improvements” collapse under ablation because the model benefited from added anchors or extra compute budget, not from the proposed mechanism. If you insert delimiters, you alter sequence statistics, salience, and attention landmarks. That alone can help. To convince me this is specifically about sentence-level processing, the paper needs clean controls: random delimiter placement, semantically wrong boundaries, matched token-budget baselines, and maybe alternative chunking schemes like clause or paragraph separators. Without that, I would not jump from “helpful formatting trick” to “cognitive-inspired mechanism.”
The interpretability claim also needs more than the snippet gives. It says fine-tuned models develop “sentence awareness” in internal representations. That is plausible, but representational claims are easy to overstate. If training consistently injects boundary markers, seeing clustering or boundary-sensitive activations around those positions is not surprising. That is still a long way from showing the model has learned sentence-by-sentence reasoning in a durable sense. I’d want transfer tests, adversarial rewrites, degradation curves when delimiters are removed, or reproducible attention/residual-stream evidence at boundaries. I couldn’t find that here.
If the full paper backs up the setup, this has two practical implications. First, it is cheap enough to test everywhere: SFT data formatting, RAG chunk construction, evaluation prompts, agent plans, even synthetic data pipelines. Second, it pushes against a lazy belief the field keeps carrying around: that scale automatically absorbs every useful linguistic structure. This result points the other way. Some structures still need to be made explicit, even at very large scale.
I would not oversell it as a new paradigm. I also would not dismiss it as prompt cosmetics. It looks more like a reminder that current LLMs remain more format-dependent, and less sentence-native, than a lot of the public narrative admits.
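Since the commentary above calls this cheap to test and asks for a random-placement control, here is a minimal sketch of both. The `<sent>` delimiter string, the regex-based sentence splitter, and both helper names are assumptions; the snippet does not disclose the paper's delimiter format or segmentation method. The splitter is deliberately naive and will mis-split abbreviations such as "Dr. Smith".

```python
import random
import re

DELIM = "<sent>"  # hypothetical delimiter; the paper's actual format is not disclosed


def mark_sentences(text, delim=DELIM):
    """Append a delimiter after every sentence found by a naive splitter.

    Splits on whitespace that follows '.', '!' or '?'.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return f" {delim} ".join(sentences) + f" {delim}"


def mark_random(text, n_delims, delim=DELIM, seed=0):
    """Control condition: the same number of delimiters at random word gaps.

    Keeps the token overhead matched so that any gain from mark_sentences
    can be compared against a pure extra-token effect.
    """
    words = text.split()
    rng = random.Random(seed)
    k = min(n_delims, len(words))
    positions = set(rng.sample(range(1, len(words) + 1), k))
    out = []
    for i, word in enumerate(words, start=1):
        out.append(word)
        if i in positions:
            out.append(delim)
    return " ".join(out)


question = "Tom has 3 apples. He buys 2 more. How many now?"
marked = mark_sentences(question)
control = mark_random(question, n_delims=marked.count(DELIM))

if __name__ == "__main__":
    print(marked)
    print(control)
```

Running the same eval on the raw prompt, `marked`, and `control` gives a three-way comparison: if `control` recovers most of the lift, the effect is probably about extra anchors rather than sentence structure.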
HKR breakdown
hook knowledge resonance
open source
85
SCORE
H1·K1·R1
09:38
58d ago
arXiv · cs.CL · atom · EN · 09:38 · 04·11
Training-Free Cross-Lingual Dysarthria Severity Assessment via Phonological Subspace Analysis in Self-Supervised Speech Representations
The paper assesses dysarthria severity with a 12D phonological profile from frozen HuBERT features across 5 languages, 10 corpora, and 890 speakers, without training any supervised severity model. It derives feature directions only from healthy control speech via Montreal Forced Aligner; five consonant features correlate with clinical severity at rho=-0.50 to -0.56 in meta-analysis, p<2e-4. The key constraint is explicit: it applies only where an MFA acoustic model exists, which the paper says is 29 languages, and the authors release pipelines for six languages.
#Audio #Benchmarking #Tools #HuBERT
why featured
HKR-K passes on concrete scale, stats, and a reproducible setup. But this is a clinical dysarthria-assessment paper with no agent, model-product, or deployment implication for the core audience, so it hits hard-exclusion-traditional science + AI crossover and stays capped below 4
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
09:00
58d ago
最佳拍档 (BestPartners) · atom · ZH · 09:00 · 04·11
AI Is Accelerating: Greg Brockman on 70% AGI, Spud, Sora, and the Super App
According to the video’s retelling, Greg Brockman said OpenAI sees the path to AGI as 70% to 80% complete, and the new pretrained base model Spud has finished pretraining. The post also says OpenAI is pausing broad Sora expansion because of compute limits and is prioritizing GPT reasoning models, a super app, and an automated AI researcher targeted for this fall; it frames a $110B infrastructure buildout as a revenue center. The post does not disclose the original interview date, Spud specs, benchmark results, or release timing.
#Reasoning #Code #Agent #OpenAI
why featured
HKR-H and HKR-R pass: the title is clicky and the claimed OpenAI roadmap shift has industry resonance. HKR-K fails because this is a secondary video retelling with no primary interview timing, Spud specs, benchmarks, or release date, so it stays in all.
editor take
If OpenAI is sidelining Sora for GPT, that is not retreat. It is a hard compute-and-product consolidation bet.
sharp
OpenAI ties a reported $110B infrastructure buildout to the GPT line, while Sora gets slowed by compute limits. My read is simple: the useful signal here is not the “70% to 80% to AGI” claim. It is the resource allocation logic. OpenAI appears to be prioritizing products that monetize fast, retain daily users, and compound usage inside one interface.
I do not buy the “AGI is 70% to 80% complete” line as an external metric. The retelling gives no original interview date, no task suite, no failure boundary, and no cost threshold. The article defines AGI as human-like competence at operating computers for knowledge work. Fine. By that definition, the field has moved a lot over the last year. Anthropic pushed coding and agents, Google kept folding Gemini into tool use and multimodal workflows, and OpenAI has been turning coding ability into a broader assistant product. But turning that into a percentage is internal morale language, not a reproducible benchmark.
I do find the Sora deprioritization plausible. Video generation burns training and inference compute, while user value per unit of compute is still less obvious than coding, office tasks, search-like assistance, and enterprise workflows. If OpenAI has a stronger base model in the pipeline and still needs RL, post-training, deployment, and ChatGPT capacity at scale, compute will flow to the main line first. That is not unusual. Across the last year, major labs kept moving flashy demos behind tools that fit into recurring workflows and recurring revenue.
The “unified GPT architecture” claim needs pushback. The article says text, voice, and image all sit under one GPT-style core, and even image generation is framed as part of that line rather than a separate diffusion-first stack. I believe half of that. Product unification is real across the industry. Users increasingly interact with one system, not a visible bundle of models. But product unification is not the same as training unification. The body gives no architecture details, no loss design, no routing, no benchmarks, and no cost data. Without that, nobody outside the company can tell whether this is one base model or several specialized subsystems wrapped into one GPT experience.
Spud is still mostly a placeholder. The article only says pretraining is done and that Spud is a new foundation model for later RL and post-training. That description is generic and believable. It also tells us almost nothing. No parameter scale is disclosed. No token count is disclosed. No context window, benchmark, release timing, or relation to existing model families is disclosed. So the key question stays open: is Spud a genuine generational jump, or a fresh inventory layer for products and internal distillation? The title gives a name. The body does not give a role.
The “super app” part is the most credible strategic piece here. ChatGPT stopped being a pure chatbot business a while ago. The market has been teaching the same lesson for two years: users do not pay for “a bit smarter” by itself. They pay when AI removes steps, reduces tool switching, and takes ownership of workflow fragments. Anthropic pushed Claude into coding and enterprise use. Microsoft kept embedding Copilot into Office. Google keeps using Search and Workspace as distribution. If OpenAI is trying to combine memory, browsing, coding, spreadsheet work, and delegated action into one front end, that is not a novel idea. It is still the clearest path to retention and higher revenue per user. The hard part is not the model. It is permissions, reliability, rollback, auditability, and interface design.
The automated AI researcher claim deserves caution. AI systems already help with literature review, experiment drafting, and result analysis. Calling that an end-to-end researcher targeted for this fall is a stronger statement. I would discount it until we see scope and evaluation. Over the last year, many “AI scientist” systems looked impressive on constrained benchmarks, then weakened on messy data, failed experiments, open-ended hypotheses, and interpretation under uncertainty. Treat it like a high-throughput research intern and the claim sounds reasonable. Treat it like an autonomous scientist and the article does not provide enough evidence.
The safety section also pulls in two directions. It stresses prompt injection and alignment work, then leans on openness and resilience as governance language. I have doubts there. OpenAI’s actual product posture over the last two years has not been especially open at the frontier-weight level. “Broad participation” works as a governance value statement. It does not map cleanly onto current practice. The article provides no new evals, no red-team numbers, and no misuse interception rates, so I would not treat this as evidence of safety progress.
My bottom-line read is narrow. Three things are believable: OpenAI still has severe compute scarcity, GPT remains the internal priority, and product usability has become a first-order concern. Three things should not be accepted at face value: the AGI percentage, Spud’s significance, and the automated researcher timeline. Without the original interview, benchmarks, or release details, those claims are still narrative, not proof.
HKR breakdown
hook knowledge resonance
open source
70
SCORE
H1·K0·R1
08:23
59d ago
arXiv · cs.CL · atom · EN · 08:23 · 04·11
SEPTQ: A Simple and Effective Post-Training Quantization Paradigm for Large Language Models
The paper presents SEPTQ, a post-training quantization method for LLMs, and says its two-step pipeline beats strong baselines under low-bit settings. It scores each weight element, picks quantization locations with a static global rule, then updates masked weights column by column. The post does not disclose model names, bit widths, datasets, or gain sizes; the key point is that it reduces PTQ to two steps.
#Inference-opt #Benchmarking #Research release
why featured
HKR-K passes on the disclosed 2-step PTQ method, but HKR-H and HKR-R are weak because the feed omits models, bit-widths, datasets, and gains. It also triggers hard-exclusion-technical-accessibility-fail: low-level quantization research lacks a generalist on-ramp.
HKR breakdown
hook knowledge resonance
open source
45
SCORE
H0·K1·R0
08:09
59d ago
X · @op7418 · x-api · ZH · 08:09 · 04·11
Hermes Agent now natively supports WeChat connection, but not via an official WeChat plugin
Hermes Agent now natively supports connecting to WeChat, but it uses a reverse-engineered integration rather than an official WeChat plugin. The post does not disclose the mechanism, rollout scope, account risk, or release timing; the key issue is stability and ban risk under reverse integration.
#Agent #Tools #Hermes Agent #WeChat
why featured
HKR-H lands on the 'native WeChat via reverse engineering' twist, and HKR-R lands because Chinese builders care about WeChat automation and ban risk. HKR-K fails: the post gives no mechanism, scope, timing, or risk details, so this stays a low-60s all item.
editor take
Hermes Agent says it natively connects to WeChat through reverse engineering. That is less a product feature than a survival test.
sharp
Hermes Agent says it natively connects to WeChat, but the condition is blunt: this is reverse-engineered, not an official integration. The title gives the route; the body does not disclose the protocol method, login flow, sync latency, rollout scope, or ban boundary. My read is simple: do not file this under product capability first. File it under gray infrastructure.
I’ve always thought any serious agent product aimed at China eventually hits this wall. Enterprise WeChat has APIs. Personal WeChat effectively does not. So teams get pushed into the same bucket of workarounds: reverse protocol access, desktop automation, app hooks, or some RPA layer. The pattern over the last year has been very consistent. The demo looks great. Persistent operation is where things break. Login state drifts, device fingerprints change, messages drop, and platform risk teams tighten the screws. Since this post gives zero stability numbers, I don’t buy the phrase “native support” at face value. With no official API, “native” often just means the fragility is packaged more neatly.
The bigger issue is account risk, and product teams often understate that on purpose. Once you connect a personal WeChat account to an agent, the problem is not just send/receive. It becomes contact graph exposure, reply cadence, automation patterns, session persistence, and abnormal login signatures. Platform enforcement looks at behavior, not your marketing label. If Hermes is using a common reverse stack, it is exposed to protocol changes and enforcement cycles by design. I haven’t verified which stack they use, so I can’t tell whether this is a patch-every-week situation or a one-change-and-it-dies setup. The article simply doesn’t say.
The outside comparison is useful here. When agents connect to Gmail, Slack, or Notion, the debate is usually about permission scope and execution reliability because official APIs exist. WeChat personal accounts are a different category. This looks closer to the old unofficial WhatsApp client pattern: you can get traction, but the platform controls your lifespan. If Hermes later shows hard boundaries (test accounts only, single device only, low-frequency messaging only), then this becomes a narrower and more honest feature. Right now, only the headline is disclosed, and the missing conditions matter more than the launch itself.
HKR breakdown
hook knowledge resonance
open source
68
SCORE
H1·K0·R1
07:55
59d ago
● P1 · arXiv · cs.CL · atom · EN · 07:55 · 04·11
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
The paper studies “incomplete learning” in SFT: even after convergence, LLMs still fail to reproduce a subset of supervised training samples. The abstract reports this across Qwen, LLaMA, and OLMo2, and attributes it to 5 recurrent causes; aggregate gains can hide persistent unlearned subsets.
#Fine-tuning #Benchmarking #Interpretability #Qwen
why featured
HKR-H/K/R all pass: the claim is counterintuitive, the abstract adds cross-model scope plus five source categories, and it challenges how practitioners trust fine-tuning metrics. I keep it at 80/featured because the provided text omits failure rates, setup details, and clear repl
editor take
This paper pins down a familiar SFT annoyance as a measurable failure mode: the model converges, yet still cannot reproduce part of its own training set.
sharp
The paper names a very real SFT failure mode: after convergence, the model still cannot reproduce a subset of its own supervised samples, and the authors split that into five recurring causes. I buy the framing. It targets a problem practitioners actually hit, not a benchmark optics problem. Anyone who has run instruction tuning has seen some version of this: eval goes up, training loss goes down, then you spot-check awkward edge cases from the training set and the model still misses them. Teams usually shrug and call it noise, bad data, or seed variance. This paper is saying the issue is systematic enough to deserve its own diagnosis layer. That cuts against a lot of the past year's fine-tuning narrative. Open-source post-training, from Llama to Qwen to OLMo-style recipes, has leaned on a familiar loop: curate better SFT data, add preference optimization, report aggregate wins, move on. Production teams do the same with pass@k, win rate, average exact match, or task-level composites as stopping criteria. The problem is that aggregate metrics hide tail failures by design. Rare formats, long dependency chains, samples with missing prerequisite knowledge, and internally inconsistent supervision all get averaged away. If this paper is right, “converged” often means “most high-frequency patterns settled,” not “the supervision was fully internalized.” That is a much less flattering picture of what SFT is doing. Of the five causes in the abstract, two matter most for real pipelines. First, conflict between pretraining knowledge and SFT supervision. That shows up constantly in code style, math procedures, refusal behavior, and domain-specific policy text. The pretrained prior is strong, the SFT correction signal is sparse, and the model only half-flips. It looks compliant in demos, then snaps back to the old distribution on slightly perturbed prompts. Second, left-side forgetting in sequential fine-tuning. 
That matches a lot of practical experience: train format, then domain, then safety, then a small patch set right before launch, and early capabilities get overwritten by late-stage objectives. The abstract does not disclose the share of failures each cause explains, the exact detection signals, or the size of the mitigation gains, so I would not overclaim beyond that. There is also useful outside context here. A lot of teams have quietly learned that SFT transfers style more reliably than it transfers underlying competence. You can teach a model to emit the right JSON shell faster than you can teach it when to call the tool, which arguments matter, or where the edge conditions are. LoRA and QLoRA made adaptation cheap and fast, but they also tend to spend limited optimization capacity on dominant modes first. Rare, brittle, or compositional samples are the ones that get left behind. If this incomplete-learning pattern (ILP) is stable across Qwen, LLaMA, and OLMo2, then this is not one bad tokenizer choice or one bad learning-rate schedule. It points to something rougher in the SFT objective itself. I do have a pushback. The title says “Why SFT Fails to Learn,” which is stronger than the abstract supports. Failing to reproduce a training sample is not automatically the same as failing to learn. For many instruction datasets, exact reproduction is the wrong target. Some tasks are inherently multi-answer. Some labels are compressed paraphrases. Some datasets contain annotation conflicts or policy drift. The abstract says the authors use a diagnostic-first framework and map unlearned samples to causes using training and inference signals. Good. But the snippet does not tell us the criterion: exact match, semantic equivalence, verifier-based correctness, or task-specific scoring. That detail changes the whole claim. Without it, ILP can be either a sharp diagnosis or a bucket for every awkward miss. 
Another reality check: very few frontier teams now treat pure SFT as the final performance engine. Public materials from OpenAI, Anthropic, and Google over the last two years have steadily shifted emphasis toward preference optimization, online RL, tool-use training, and inference-time scaffolding. That is partly because SFT is excellent at writing in-distribution behavior into the model, but much less reliable at robust planning, reward shaping, and hard generalization. So I do not read this paper as “everyone used the wrong method.” I read it as a reminder that SFT is a high-bandwidth writer, not a trustworthy complete memory system. What would decide how important this paper becomes is not the diagnosis label. It is the intervention evidence. If the full paper shows that each ILP subtype has observable signals and targeted fixes, I want two numbers. First, does fixing one unlearned subset actually reduce that subset, or does the system just rotate failure onto a different subset? Second, what is the cost to out-of-distribution behavior, refusal consistency, or calibration? In practice, stronger memorization of supervised instances often trades against robustness. The abstract does not disclose those tradeoffs. My take is still positive. The paper does not introduce a new training paradigm, but it drags an under-measured loss term into the open. For people building fine-tuning platforms, data curation stacks, curriculum schedulers, or post-training eval suites, that is more useful than one more aggregate leaderboard win. If the methods section is solid, this looks less like a niche interpretability paper and more like a corrective to how the field reports SFT success.
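The rotation question above can be checked from per-sample pass/fail records alone. A minimal sketch, assuming a hypothetical record format (sample ID mapped to whether the model reproduced it); the pass/fail criterion is left to the caller because the abstract does not disclose it, and the IDs and outcomes below are made up for illustration:

```python
def unlearned(results):
    """Return the set of training-sample IDs the model failed to reproduce."""
    return {sid for sid, ok in results.items() if not ok}


def rotation_report(before, after):
    """Compare unlearned subsets before and after a targeted fix.

    `before` and `after` map sample ID -> bool (True = reproduced).
    Both dicts are hypothetical inputs, not the paper's data format.
    """
    u_before, u_after = unlearned(before), unlearned(after)
    return {
        "fixed": sorted(u_before - u_after),       # failures that went away
        "persistent": sorted(u_before & u_after),  # failures that survived the fix
        "new": sorted(u_after - u_before),         # failures that rotated in
        "aggregate_before": 1 - len(u_before) / len(before),
        "aggregate_after": 1 - len(u_after) / len(after),
    }


# Illustrative data only: ten samples, two unlearned before the fix.
before = {f"s{i}": True for i in range(10)}
before["s3"] = before["s7"] = False
after = dict(before)
after["s3"] = True    # the targeted fix worked on s3
after["s5"] = False   # but s5 regressed

report = rotation_report(before, after)
print(report)
```

In this made-up case the aggregate score is identical before and after, even though one failure was fixed and a different one appeared. That is the rotation scenario an aggregate metric cannot show.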
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
05:14
59d ago
arXiv · cs.CL · atom · EN · 05:14 · 04·11
Computational Implementation of a Model of Category-Theoretic Metaphor Comprehension
The paper implements a computational model of metaphor comprehension based on Fuyama et al.'s TINT theory and reports better results than prior algorithms on 3 measures: data fitting, systematicity, and novelty. The snippet says the authors simplified the algorithms to align more closely with the original theory; the post does not disclose sample size, number of baselines, or exact scores. The notable part is that it turns metaphor comprehension into a simulatable and comparable program, not just a theoretical account.
#Reasoning #Benchmarking #Interpretability #Fuyama
why featured
There is some HKR-K: the paper turns TINT into executable code and makes a testable improvement claim. Tier stays excluded because it lacks agent or product implications, and the category-theory framing triggers hard-exclusion-1 technical-accessibility fail.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H0·K1·R0
05:04
59d ago
arXiv · cs.CL · atom · EN · 05:04 · 04·11
CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models
The paper introduces CoSToM, which combines causal tracing and activation steering to intervene in ToM-critical LLM layers for better social reasoning and dialogue quality. The post discloses the mechanism—locate internal ToM feature distributions, then apply lightweight targeted steering—but does not disclose model names, benchmarks, or gain sizes. The real point is shifting from prompt-level performance to aligned internal representations.
#Reasoning #Alignment #Interpretability #Research release
why featured
HKR-K passes on mechanism: causal tracing identifies ToM-relevant layers, then activation steering nudges them. But the post omits model, benchmark, and gains, and the value sits mostly in specialized internal-representation work, so hard-exclusion-technical-accessibility fail is
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H0·K1·R0
04:33
59d ago
X · @op7418 · x-api · ZH · 04:33 · 04·11
Claude Code's generated code quality improved noticeably, and the earlier lazy behavior is gone
User op7418 says Claude Code now produces noticeably better code and no longer shows the earlier “lazy” behavior in their usage. The post discloses no model version, update timing, task type, comparison samples, or reproducible setup. This is not an official update, but an anecdotal signal worth tracking.
#Code #Anthropic #op7418 #Commentary
why featured
This is a user-side signal, not a product update. No model version, update date, task type, before/after example, or repro setup is disclosed; HKR-H and HKR-R are weakly present, HKR-K fails, so hard-exclusion-6 caps it below 40.
HKR breakdown
hook knowledge resonance
open source
41
SCORE
H1·K0·R1
04:16
59d ago
AI Era (新智元) · WeChat · rss · ZH · 04:16 · 04·11
The End of AI Is Theology: A 60-Year-Old Former Silicon Valley Executive-Priest Rewrites Claude's Soul, Rejects Pentagon Use
The headline says a 60-year-old former Silicon Valley executive turned priest rewrote Claude’s “soul” and rejected Pentagon military use. The body is empty, so the post does not disclose the person’s name, the Claude version, the mechanism behind “rewriting,” or whether the military refusal is a personal stance or Anthropic policy. This is a claim-heavy headline, not a fact-rich post.
#Anthropic #Pentagon #Commentary #Safety/alignment
why featured
HKR-H passes on the priest + Claude + Pentagon hook, and HKR-R hits the defense/alignment nerve. HKR-K fails because the body discloses no name, model version, mechanism, or policy source; hard-exclusion-zero-sourcing caps it below 40.
HKR breakdown
hook knowledge resonance
open source
41
SCORE
H1·K0·R1
03:05
59d ago
X · @op7418 · x-api · ZH · 03:05 · 04·11
Lobsters author Peter's Claude account was banned in the morning, then restored by Anthropic after he posted
Peter said his Claude account was banned this morning, and Anthropic restored it after he posted. The post confirms only the sequence of events; it does not disclose the ban reason, appeal path, or resolution time. The key missing detail is what triggered human review.
#Peter #Anthropic #Incident #Commentary
why featured
This is a single-case Claude account incident with a visible reversal, so HKR-H and HKR-R pass. HKR-K fails because the post gives no cause, appeal mechanics, or handling time, so it stays low-band all.
editor take
Anthropic restored Peter’s Claude account after he posted publicly, and that’s a bad look. If public pressure speeds reversals, the appeals path or risk controls are not holding up.
sharp
Peter’s Claude account was banned this morning, and Anthropic restored it after he posted publicly. That sequence is the only solid fact here; the body does not disclose the ban reason, the appeal route, the review time, or whether this was automated enforcement or a human mistake. My read is simple: a single false positive is normal; a public post triggering a reversal is the problem. Every major platform tolerates some error rate in trust-and-safety systems. OpenAI, Google, Meta, all of them have had mistaken suspensions or overbroad enforcement at one point or another. That part is not interesting. The bad signal is when the formal appeals path appears weaker than social-media escalation. Once users learn that posting on X gets attention faster than the in-product process, “policy enforcement” starts looking like ad hoc reputation management. This hits Anthropic harder than it would hit some peers because Claude is sold on reliability as much as model quality. Anthropic has spent the last year leaning into the idea that it is the careful lab, the enterprise-safe choice, the one with tighter controls. I do not have numbers here, so I am not claiming a systemic failure from one anecdote. Still, enterprise buyers will read this and ask two immediate questions: are account-level controls tied to the same risk systems that govern API usage, and is there any real review SLA after a false positive? The title gives a strong hint that something failed; the article gives none of the operational details needed to judge how bad it is. There is also a broader product context that is missing from the snippet. Over the last year, frontier labs have shifted from pure output moderation toward account and workflow enforcement, because agents changed the threat model. Tool use, persistent sessions, long-running tasks, and bulk automation create abuse patterns that a simple response filter will not catch. 
Once you widen enforcement from “block this answer” to “freeze this account,” the blast radius gets much larger. A mistaken refusal is annoying; a mistaken suspension breaks trust fast. If Anthropic has recently tightened abuse detection around agentic use, then more edge-case suspensions would not surprise me. What does bother me is the apparent speed of the reversal after public attention. That suggests the system may not be separating legitimate high-value usage from risky behavior very well, or at least the review path is not credible without external pressure. I should be careful here: this is thin material. I have not verified what Peter was doing before the ban, and I have not seen any official explanation from Anthropic. So the strong claim is not “Anthropic has a widespread suspension problem.” The fairer claim is narrower: Anthropic now has a transparency problem around enforcement. If the company wants Claude to be trusted inside real workflows, it needs to publish clearer suspension categories, review channels, and expected turnaround. Without that, the safety story starts to depend on brand goodwill alone, and that erodes quickly once people see reversals happen in public.
HKR breakdown
hook knowledge resonance
open source
61
SCORE
H1·K0·R1
01:49
59d ago
X · @op7418 · x-api · ZH · 01:49 · 04·11
A new real-time interactive world model, Waypoint-1.5
Waypoint-1.5 is described as a new real-time interactive world model. The RSS snippet confirms two facts: character motion looks smooth, and it can interact with weapons. The key missing part is the realtime metric; the post does not disclose the developer, latency, frame rate, resolution, or interaction mechanism.
#Multimodal #Vision #Product update
why featured
HKR-H passes on the real-time interactive world-model hook. HKR-K and HKR-R miss because the post gives no latency, FPS, resolution, interaction method, developer, or reproducible test, so it stays in all rather than featured.
editor take
The post shows two things: smooth motion and weapon interaction. Without latency, FPS, or resolution, I won’t call this a realtime world model yet.
sharp
The post gives only two facts: Waypoint-1.5 shows smooth character motion and weapon interaction. It does not disclose the developer, end-to-end latency, FPS, resolution, clip length, or interaction mechanism. Without those, “realtime interactive world model” is still a marketing label, not a technical category. I’m cautious with demos like this for a reason. In the past year, a lot of “world model” clips have hidden the hard part. One pattern is a short autoregressive rollout that looks responsive because the dead time is edited out. Another is interaction built as a narrow state machine: the character can grab or swing a weapon, but the environment is not being modeled with stable, persistent state. The title claims interactivity; the body does not explain whether the system maintains world state, predicts action-conditioned futures, or just triggers predefined behaviors. The comparison set is obvious. When people discussed DeepMind’s Genie 2 or Decart-style realtime generated environments, the first technical questions were always latency, controllable duration, and consistency under repeated actions. NVIDIA’s Cosmos pushed the “world foundation model” framing, but that line still sits far from player-grade closed-loop realtime interaction. I haven’t found any hard numbers for Waypoint-1.5, so I can’t place it against those systems in a serious way. My pushback is simple: AI Twitter keeps labeling “interactive-looking video” as a world model too quickly. To earn that term, a team should at least publish three things: action-to-photon latency, stability over sustained interaction, and consistency tests for object manipulation. Right now we have only a title and a short snippet. That makes this a promising demo direction, not evidence that a new realtime world-model bar has been cleared.
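The first two of those three numbers are cheap to compute once a team logs input and frame timestamps. A rough sketch, assuming a hypothetical log of (time the action was sent, time the first frame reflecting it appeared) pairs; nothing here comes from Waypoint-1.5, and the session data is invented:

```python
import math


def action_to_photon_ms(events):
    """Convert (action_ts, frame_ts) pairs in seconds to latencies in ms."""
    return [(frame_ts - action_ts) * 1000.0 for action_ts, frame_ts in events]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def stability_drift_ms(latencies):
    """Median latency of the second half of a session minus the first half."""
    half = len(latencies) // 2
    return percentile(latencies[half:], 50) - percentile(latencies[:half], 50)


# Made-up session log, in seconds.
events = [(0.0, 0.040), (1.0, 1.045), (2.0, 2.050),
          (3.0, 3.080), (4.0, 4.120), (5.0, 5.150)]
latencies = action_to_photon_ms(events)
print("p50 ms:", round(percentile(latencies, 50), 1))
print("p95 ms:", round(percentile(latencies, 95), 1))
print("drift ms:", round(stability_drift_ms(latencies), 1))
```

A positive drift means latency grows as the session continues, which is the sustained-interaction failure a short clip would hide. Object-manipulation consistency needs a task-specific check and is not covered by this sketch.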
HKR breakdown
hook knowledge resonance
open source
59
SCORE
H1·K0·R0
01:14
59d ago
Synced (机器之心) · WeChat · rss · ZH · 01:14 · 04·11
CVPR Highlight | NUDT proposes a new method for UAV self-navigation and target lock-on
A CVPR Highlight paper from NUDT proposes a UAV method aimed at self-navigation and target lock-on; only these two tasks are confirmed from the title. The RSS snippet is empty, and the post does not disclose the model design, training data, benchmarks, success rate, or latency. The key point is whether one method closes the loop across navigation and target lock, rather than improving a single perception step.
#Robotics #Vision #NUDT #CVPR
why featured
There is a click hook, so HKR-H passes, but HKR-K and HKR-R fail because the post discloses only the paper label and task names, with no model, dataset, benchmark, success rate, or latency. The story also fits hard-exclusion-technical-accessibility fail for this audience, so it’s
HKR breakdown
hook knowledge resonance
open source
43
SCORE
H1·K0·R0
01:14
59d ago
Synced (机器之心) · WeChat · rss · ZH · 01:14 · 04·11
With 100,000 hours of human data and no alignment, Lingchu Intelligence's Psi-R2 tops MolmoSpaces
The title says Lingchu Intelligence trained Psi-R2 on 100,000 hours of human data, skipped alignment, and topped MolmoSpaces. The body is empty, so model size, benchmark score, and the MolmoSpaces task setup are not disclosed. The key missing piece is reproducible detail; only the title is available.
#Benchmarking #Lingchu Intelligence (灵初智能) #Benchmark
why featured
HKR-H and HKR-R pass because the title combines 100k human hours, a no-alignment claim, and a leaderboard result. HKR-K fails: the body is empty, with no params, scores, task setup, or reproduction details, so hard-exclusion-zero-sourcing caps it below 40.
HKR breakdown
hook knowledge resonance
open source
40
SCORE
H1·K0·R1
01:05
59d ago
● P1 · QbitAI (量子位) · WeChat · rss · ZH · 01:05 · 04·11
Liu Zhuang and Danqi Chen team open-source Vero, a general visual reasoning RL framework, reaching SOTA with zero thinking data
Princeton researchers including Liu Zhuang and Danqi Chen open-sourced Vero, an RL framework for visual reasoning, and report beating Qwen3-VL-8B-Thinking on 23 of 30 benchmarks. The post says Vero uses 600K samples filtered from 59 datasets, task-routed rewards, and single-stage RL across six task groups. The key point is the mechanism mix: no private thinking data, but the post does not disclose training cost or base model configuration.
#Reasoning #Vision #Alignment #Princeton University
why featured
Featured on HKR-H/K/R: the zero-thinking-data claim is a strong hook, and the post includes concrete benchmark and method details. I keep it in the low 80s because training cost, base model choice, and full reproduction conditions are not disclosed.
editor take
Vero beats Qwen3-VL-8B-Thinking on 23 of 30 benchmarks with 600K samples, but I wouldn’t call this an open-source Gemini moment. It looks more like disciplined systems work finally catching up to a wу
sharp
Vero’s strongest signal is not the “zero thinking data” line. It is that the team connected three pieces that open visual RL has kept treating separately: 600K filtered samples, task-routed rewards, and a single-stage RL recipe. Beating Qwen3-VL-8B-Thinking on 23 of 30 benchmarks says that combination works, at least in the 8B class. My read is simple: visual reasoning is less bottlenecked by some secret proprietary reasoning sauce than people like to claim. A lot of the gap still sits in data distribution and reward engineering. That matters because open visual RL has had the same failure mode for a year. It can get good on one narrow slice — math diagrams, charts, OCR-heavy QA — then fall apart on grounding, spatial search, counting, or open-ended visual instruction following. The reason is not mysterious. These tasks have very different reward surfaces. Multiple choice cares about exact final answers. Grounding cares about spatial alignment. Open description needs a judge model. If you mix them naively, you do not get generalization; you get interference. Vero at least acknowledges that directly and builds the reward stack around it. Task-routed rewards sound mundane, but this is exactly the sort of systems detail many papers hand-wave away. I do have some pushback on the headline framing. “Zero thinking data” is catchy, but the article does not disclose the key ingredients needed to judge how much credit belongs to Vero itself. We do not get the base model configuration. We do not get training duration, rollout budget, sampling settings, or the cost profile of the verifier stack. We do not know how much of the lift came from the RL framework and how much came from choosing a strong initialization. Without that, the result is directionally impressive but still hard to place. “No private thinking data” is not the same claim as “closed labs’ post-training stacks no longer matter.” I don’t buy the stronger version. That distinction is important. 
OpenAI, Google, and Anthropic did not get visual reasoning by adding chain-of-thought traces alone. Their gains have also come from tool use, output filtering, refusal policy tuning, evaluator design, and a lot of dataset curation. Vero shows that you can get strong visual reasoning gains without proprietary thought traces. It does not show that the rest of the closed-model playbook has become irrelevant. The competitive context makes the result more credible, though. Qwen’s visual line has already pushed down the barrier for open multimodal post-training, especially on chart, OCR, and STEM mixtures. I have not verified the full Qwen3-VL-8B-Thinking release details while writing this, but based on the article, Vero is beating a model that was already optimized for reasoning rather than a plain untuned base. That is much more meaningful than beating a raw checkpoint. There is also a broader pattern here: a lot of visual RL work from the last year relied on single-domain datasets and simple format-based rewards, then looked great on in-domain benchmarks and weak across tasks. Vero’s “59 datasets filtered into 600K samples” is a reminder that scale alone is not the point. Filtered and balanced scale is the point. Text-model post-training went through the same lesson. I’m especially interested in the claim that broad data coverage is the main driver. That sounds plausible, but I still want to see stronger ablations. Did broad coverage teach transferable strategies, or did it mainly reduce overfitting to a few verifier types? Those are very different outcomes. If it is the former, Vero has found a durable recipe for general visual reasoning. If it is the latter, then this is more about training stability and benchmark hygiene than about a real jump in reasoning ability. The article snippet is not enough to settle that. There is also a very practical concern: task-routed rewards are elegant on paper and expensive in practice. 
Open-ended tasks require an external LLM judge. Math and grounding need their own validators. In many RL pipelines, the evaluation chain becomes harder to operate than the model forward pass itself. Open-sourcing the code is excellent, but practitioners will immediately ask different questions: what is reward cost per sample, what throughput did they achieve, and how sensitive is the setup to judge drift? The article does not say. Still, I think Vero marks a real shift in research posture. Visual reasoning has often been framed as something that will just emerge from bigger multimodal bases. Vero argues for a more engineering-heavy route: stop mythologizing the base model, and get serious about coverage, filtering, reward routing, and training design. That is very similar to what happened in text models over the last year, where post-training stopped being the finishing layer and started becoming the capability definition itself. So my stance is positive, with limits. I would not frame this as open source catching closed models in full. The evidence here is not strong enough for that. I would frame it as something more useful: visual RL is starting to look like a reproducible method instead of a bag of isolated tricks. If the project later publishes the missing training details, the base model setup, stronger ablations, and out-of-distribution tests, this stops being a nice research result and turns into a recipe other teams will copy. That is when it will matter much more.
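To make "task-routed rewards" concrete, here is a minimal sketch of the idea as described above: each task type gets its own scoring function, and a router dispatches to it. The task names, function names, and the stub judge are all assumptions for illustration; the post does not describe Vero's actual reward stack at this level.

```python
def exact_match_reward(pred, target):
    """Multiple-choice style reward: 1.0 only if the final answers agree."""
    return 1.0 if pred.strip().lower() == target.strip().lower() else 0.0


def iou_reward(pred_box, target_box):
    """Grounding style reward: intersection-over-union of (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred_box[0], target_box[0]), max(pred_box[1], target_box[1])
    ix2, iy2 = min(pred_box[2], target_box[2]), min(pred_box[3], target_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_t = (target_box[2] - target_box[0]) * (target_box[3] - target_box[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def make_router(judge):
    """Build a reward router. `judge` is any callable (pred, target) -> float,
    standing in for the external LLM judge that open-ended tasks need."""
    routes = {
        "multiple_choice": exact_match_reward,
        "grounding": iou_reward,
        "open_ended": judge,
    }

    def route(task_type, pred, target):
        if task_type not in routes:
            raise KeyError(f"no reward registered for task type {task_type!r}")
        return routes[task_type](pred, target)

    return route


# Hypothetical stub judge so the sketch runs without any model call.
def stub_judge(pred, target):
    return 0.5


reward = make_router(stub_judge)
print(reward("multiple_choice", "B ", "b"))
print(reward("grounding", (0, 0, 2, 2), (1, 1, 3, 3)))
print(reward("open_ended", "a red car", "a car that is red"))
```

The operational cost question raised above lives almost entirely in the `judge` slot: the other two rewards are pure arithmetic, while the judge is a separate model call per sample.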
HKR breakdown
hook knowledge resonance
open source
88
SCORE
H1·K1·R1
01:05
59d ago
● P1 · QbitAI (量子位) · WeChat · rss · ZH · 01:05 · 04·11
OpenClaw-style methods reach multimodal generation, with a 6B model beating Nano Banana 2 on some tasks
A team led by Shanghai AI Laboratory introduced GEMS, adding Agent Loop, Memory, and Skills to multimodal generation, and reports that 6B Z-Image-Turbo beats Nano Banana 2 on some tasks. The post reports +14.22 average gains on 5 mainstream tasks and +8.92 over the best baseline on 4 downstream tasks; the paper and code are public, but the post does not disclose Nano Banana 2's full setup.
#Agent #Multimodal #Memory #Shanghai AI Laboratory
why featured
Strong HKR-H/K/R: the hook is a 6B multimodal model beating Nano Banana 2, and the post includes mechanism plus testable deltas (+14.22 / +8.92) with paper and code. It stays below P1 because the article does not disclose the full Nano Banana 2 comparison setup.
editor take
GEMS pushes a 6B model past some leaderboard slices, but I wouldn't call this a model overtake yet. It looks more like test-time scaffolding wrapped as multimodal progress.
sharp
GEMS reports that 6B Z-Image-Turbo gains +14.22 on average across five mainstream tasks and +8.92 over the best baseline on four downstream tasks; my read is that this validates agent-style orchestration in multimodal generation, not that a 6B base model suddenly jumped a generation. My core take is simple: this looks like inference-time structure beating raw model size. The three pieces here are Agent Loop, compressed Memory, and on-demand Skills. That recipe already worked in coding agents. OpenClaw, Claude Code, and similar systems showed that once a task allows retry, critique, and revision, smaller models can buy a lot of score through process. Moving that pattern into image generation is logical. The easy mistake is to narrate a system win as a model win. Those are different claims. A system win comes from extra rounds, extra tokens, extra routing, and extra selection. A model win means the underlying parameters got stronger. I don't fully buy the “6B beats Nano Banana 2” framing yet because the setup disclosure is thin. The post says the paper and code are public, but the article body does not disclose Nano Banana 2's full configuration. On GenEval2, was the comparison single-turn or multi-turn? How many image samples were allowed? Did both sides get memory accumulation? How long were the skill prompts? Was there any reranking or human filtering? None of that is in the article. In multimodal generation, sample budget and reranking can swing scores hard. Give the same base model four tries instead of one and you can get a very different headline. The post says there is a tradeoff between average generation rounds and performance, but it does not give the round distribution. That omission matters. The broader context is familiar. A lot of the strongest agent progress over the last year came from inference-time scaling, not from pretraining suddenly teaching a model entirely new skills. 
OpenHands, OpenClaw, and coding agents in general got mileage from loops, tools, and memory compression. Multimodal generation is heading to the same place. Once the task becomes “draft image, inspect image, rewrite prompt, regenerate” rather than “one shot output,” system design starts to matter more than base model size. I buy that direction because it maps to real workflows. I do not buy the smoother story that therefore a 6B open model has overtaken a closed model in any broad sense. Show the total cost: rounds, latency, token load, and calls. The Memory piece is the most durable part here in my view. Keeping factual constraints while compressing chain-of-thought into experience is not a cosmetic choice; it is a cost and stability choice. Multi-turn generation breaks when context grows into noise. If hierarchical compression actually preserves the right constraints over long loops, that is more valuable than one benchmark bump. This also lines up with what agent builders learned elsewhere: summary memory often helps more than raw transcript retention. My pushback is that the article gives no failure cases. How much useful detail gets lost in compression? Does the memory transfer across tasks, or only within a narrow prompt family? The post doesn't say. I also only half-buy the Skills story as presented. On-demand expert instructions can absolutely make outputs look smarter. A well-written aesthetic or creative skill library can improve composition, lighting, and scene intent fast. But example images are the easiest thing to cherry-pick in this category. Without blind human eval, trigger precision, or error rates for bad skill routing, this section reads more like a good demo than a settled result. So my practical takeaway is this: GEMS is a sign that multimodal generation is entering its agent phase, where the unit of competition shifts from single-pass image quality to total closed-loop task completion cost. That is important. 
A lot of open image systems will soon compete less on parameter count and more on who can wire critic, memory, skills, and tooling together. But if the paper's public story stops at average gains and does not show the compute bill behind them, it is still one step short of an engineering decision. I haven't checked the appendix myself. Based on the article alone, the evidence is not enough for me to accept the “6B overtake” headline at face value.
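The sample-budget point is easy to quantify. A back-of-envelope sketch, assuming independent attempts and a selector that always picks a passing sample when one exists (both are idealizations, and the 30% single-shot rate is an invented number, not anything GEMS reports):

```python
def best_of_n_success(p_single, n):
    """Probability that at least one of n independent attempts passes,
    given per-attempt pass probability p_single and a perfect selector."""
    return 1.0 - (1.0 - p_single) ** n


# Illustrative single-shot pass rate only.
p = 0.30
for n in (1, 2, 4, 8):
    print(n, round(best_of_n_success(p, n), 4))
```

Under these assumptions, four tries take a 30% single-shot system to roughly 76% without touching the model. Real selectors are imperfect and attempts are correlated, so the true lift is smaller, but this is why the round distribution and reranking setup need to be disclosed alongside the headline score.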
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
01:05
59d ago
● P1 · QbitAI (量子位) · WeChat · rss · ZH · 01:05 · 04·11
A Chinese embodied model reached global No.1 as a 100,000-hour human dataset for robots was released
Psibot says it released a 100,889-hour human-plus-robot manipulation dataset, and that Psi-R2 ranked first on AllenAI’s MolmoSpace benchmark. The post lists 95,472 hours of human data, 5,417 hours of robot data, 1,000 open-sourced hours, 294 scenes, 4,821 tasks, and 1,382 objects; Psi-W0 adds 30% failure samples, and Psi-R2 latency drops from 2.2s to under 100ms. The key point is the data loop and benchmark framing: the post claims nearly 10x higher success, but does not disclose task setup, full baselines, or statistics.
#Robotics #Multimodal #Benchmarking #Psibot
why featured
HKR-H/K/R all pass: the data scale, failure-sample mix, and latency cut are concrete and discussable. I keep it at 80 because the No.1 ranking and near-10x success claim lack task setup, full baselines, and statistical detail in the body.
editor take
Psibot put 100,889 hours on the table, and I only buy half the pitch. The data scale is real; the “world No.1” and “10x success” framing is not proven yet.
sharp
Psibot released a 100,889-hour manipulation dataset and says Psi-R2 ranked first on MolmoSpace. My read is pretty simple: the important part is not the No.1 claim, but that someone is finally pushing embodied pretraining data toward a scale that starts to matter. The shaky part is the “nearly 10x higher success rate” line. The article does not disclose task splits, full baselines, variance, or whether the comparison used the same robot, control loop, camera setup, and recovery rules.
Here is the part I do buy. A mix of 95,472 hours of human data and 5,417 hours of robot data is an aggressive ratio, and it points at the right bottleneck. Embodied AI has not been blocked by a lack of model branding. It has been blocked by a lack of dense, diverse, messy data that still maps back into control. Most reusable manipulation datasets over the past year have been in the hundreds to low thousands of hours. Once you get into five digits, you are playing a different game. The comparison to Nvidia’s EgoScale at 20,000 hours is a fair directional marker, even if the modalities are not identical.
I also like that they trained Psi-W0 with 30% failure samples. That is more grounded than the usual “world model” pitch. Robots do not fail because they never saw success. They fail because they never learned what slip, jam, missed contact, or partial grasp looks like in the action loop. A policy trained only on clean demonstrations often learns a narrow trajectory, not recovery behavior. A lot of manipulation demos from the last year looked great in videos and broke fast in deployment for exactly that reason.
Still, I have two serious reservations. First, what exactly did MolmoSpace measure here? The article says Psi-R2 beat PI and DreamZero and posted nearly 10x higher success, but it gives no task list, no episode length, no success definition, no repeat count, no significance statistics. AllenAI benchmarks are useful, and I am not dismissing them. But robotics leaderboards have the same problem language model leaderboards do: benchmark framing can quietly do a lot of work. Change the object set, camera pose, replanning allowance, or controller frequency, and rankings stop being directly comparable. Without the full table, “world first” is marketing, not evidence.
Second, the latency claim needs conditions. The article says inference dropped from 2.2 seconds to under 100 milliseconds through DiT caching, Torch compilation, and quantization. I believe that kind of engineering gain is possible. What I do not know is what that 100 ms actually includes. Resolution, hardware, action horizon, and whether this is model-forward latency or end-to-end system latency are all undisclosed. In robotics, those are not footnotes. Reused visual embeddings, low-level closed-loop control, and collision checking can completely change the practical result. Too many teams report “model latency” as if it were “robot latency.” I do not buy that shortcut.
Put this in industry context and the strategy looks familiar. Figure, Physical Intelligence, and Skild have all spent the last year pushing some version of the same thesis: broad, heterogeneous action data matters more than elegant small-data pipelines. Psibot’s framing here is closest to the early Physical Intelligence pitch as I remember it: use large, mixed pretraining to learn wide representations, then compress human behavior into something the robot body can execute. The article says fewer than 100 real robot trajectories are enough for finetuning. If they can show that on public tasks, that will matter more than the leaderboard placement. Deployment cost is the real metric. Factory buyers do not care whether you are first on a benchmark. They care whether changing a gripper, a box SKU, or a station requires 20 trajectories or 500.
I also think the article oversells the open-source angle. Only 1,000 hours are open-sourced so far. In embodied AI that is not trivial; it is actually generous by current standards. But it is still two orders of magnitude smaller than the full 100,889-hour claim. If the company wants an ecosystem to extend the data flywheel, the release has to include more than video. The hard part of open embodied data is not uploading files. It is standardizing collection protocols, sensor sync, action formats, and quality-control tooling so outside teams can plug into the same pipeline. Without that, “open source” is a signal, not an infrastructure layer.
One more piece of context outside the article: the field has gotten very comfortable with using video prediction as a proxy for physical understanding. I have never fully bought that. Strong future-frame generation does not guarantee stable control. Predicting a plausible rollout does not mean you can do insertion, compliant contact, or long-horizon recovery. Psibot at least seems aware of this gap, because it is not only talking about video generation. It is bringing in tactile data, 3D hand pose, and explicit failure examples. That pushes the work closer to executable behavior rather than pretty rollouts.
So my verdict is split. The data-scale move is real and deserves attention. The article’s “global first” and “instant fame” framing does not. What Psibot needs next is boring evidence: full benchmark tables, reproducible evaluation scripts, more open hours, and deployment curves across changing scenes and hardware. If those show up, this starts to look like a serious embodied-data infrastructure play. If they do not, then this was a strong PR package attached to a promising but still unproven system.
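For anyone who wants to sanity-check the headline numbers, here is a minimal Python sketch of the arithmetic. It uses only figures quoted in the article; the variable names are mine, and "under 100 milliseconds" is treated as an upper bound, so the speedup it prints is a lower bound on model-side latency only, not a system-level measurement.

```python
import math

# Hours of data as quoted in the article.
HUMAN_HOURS = 95_472
ROBOT_HOURS = 5_417
OPEN_HOURS = 1_000

# Latency as quoted in the article, in milliseconds.
# "Under 100 ms" is an upper bound, so the speedup below is a lower bound.
LATENCY_BEFORE_MS = 2_200
LATENCY_AFTER_MS_MAX = 100

total_hours = HUMAN_HOURS + ROBOT_HOURS
human_share = HUMAN_HOURS / total_hours          # fraction of data from humans
human_to_robot = HUMAN_HOURS / ROBOT_HOURS       # human hours per robot hour
open_share = OPEN_HOURS / total_hours            # fraction that is open-sourced
orders_of_magnitude = math.log10(total_hours / OPEN_HOURS)
min_speedup = LATENCY_BEFORE_MS / LATENCY_AFTER_MS_MAX

print(f"total hours:           {total_hours:,}")
print(f"human share:           {human_share:.1%}")
print(f"human hours per robot: {human_to_robot:.1f}")
print(f"open-sourced share:    {open_share:.2%}")
print(f"gap, orders of mag.:   {orders_of_magnitude:.2f}")
print(f"minimum speedup:       {min_speedup:.0f}x")
```

The human and robot hours do sum to the headline 100,889. Human data is about 94.6% of the mix, roughly 17.6 hours per robot hour, and the 1,000 open hours come to about 1% of the total, which is where the "two orders of magnitude" gap comes from.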
HKR breakdown
hook knowledge resonance
open source
86
SCORE
H1·K1·R1
