
hot events · 2026-05-02

12 signals · updated 3m ago
live · 217 today · policy v2
LATENT SPACE · Anthropic pulls Fable and Mythos after US e… · 96
LATENT SPACE · Anthropic launches Claude Fable 5, its firs… · 88
HACKER NEWS FRONTPAG · Did Anthropic ask for its own export contro… · 82
HACKER NEWS FRONTPAG · Anthropic flies senior technical staff to D… · 82
AI HOT (CURATED POOL · WSJ: OpenAI weighs steep price cuts and pla… · 82
HACKER NEWS FRONTPAG · Bram Cohen: Claude is turning into an assho… · 78
R/LOCALLLAMA · Xiaomi serves MiMo V2.5 at 1000–3000 tps wi… · 78
IMPORT AI (JACK CLAR · AI learns to game society's rules, and Anth… · 78
MIT TECHNOLOGY REVIE · Google DeepMind is worried about what happe… · 78
DWARKESH PATEL · The sample efficiency black hole: AI models… · 78
LATENT SPACE · Cognition launches FrontierCode: a coding b… · 78
HACKER NEWS FRONTPAG · Gabriel Weinberg argues with data that “eve… · 78
RSS live
2026-05-02 · Sat
● P1 · r/LocalLLaMA · rss · EN · 08:12 · 05·02
Qwen3.6-27B achieves 72 tokens per second on RTX 3090 with vLLM
Reddit user One_Slip1455 released a native Windows vLLM launcher for Qwen3.6-27B, reaching 72 tok/s on an RTX 3090. It reports 64.5 tok/s at ~25k tokens, 53.4 tok/s at 127k ctx on one GPU, and 160k ctx with PP=2 on 2×3090. The key detail is no WSL or Docker, an OpenAI-compatible endpoint, and an INT4 quant path.
#Inference-opt #Tools #Qwen #vLLM
why featured
HKR: H, K, and R all pass. Native Windows on an RTX 3090 is the hook, the post gives tok/s and ctx figures, and it speaks to local-inference cost concerns. Being a single-source Reddit post limits it to the lower featured band.
editor take
Two Reddit headlines claim fast single-GPU Qwen3.6-27B inference, but the post body returns only a Reddit 403 block; treat this as an engineering lead, not a benchmark.
sharp
Two LocalLLaMA headlines point to fast single-GPU Qwen3.6-27B inference, but the readable article body is only a Reddit 403 block. I would not treat this as a release, a benchmark, or independent validation. I’d treat it as an early community engineering signal.

One headline claims Qwen3.6-27B reaches 72 tok/s on an RTX 3090 using native Windows vLLM, with no WSL and no Docker, plus a portable launcher and installer. The other claims Qwen3.6 27B FP8 runs at 80 TPS on a single RTX 5000 PRO 48GB with 200k tokens of BF16 KV cache. Both come from reddit-localllama, so the member count is 2, but the source base is not two independent outlets.

The two angles are different enough to matter. The RTX 3090 post is about deployment friction: native Windows vLLM, no WSL, no Docker, and a packaged launcher. That targets a very specific pain point for local AI users. The RTX 5000 PRO post is about long-context feasibility: FP8 weights, 48GB VRAM, 200k BF16 KV cache, and 80 TPS. One says “more people can run this.” The other says “a workstation card can hold a serious context window.” Together, they show the local-inference conversation moving from “can a 27B model run locally” to “can it run comfortably on common desktop and workstation setups.”

I buy that shift. I do not buy the numbers yet. The body does not disclose the command, batch size, prompt length, generation length, quantization recipe, vLLM version, CUDA version, driver version, attention backend, chunked prefill settings, or whether the reported speed is decode-only. “72 tok/s” and “80 TPS” can mean very different things in local inference. A single-user decode test, batched throughput, a short-output average, and a warm-cache demo can all be written as tokens per second. Without reproducible conditions, the numbers are headline claims, not usable benchmarks.

The 200k BF16 KV cache claim needs extra care. The headline gives the context size and cache precision, but not the throughput curve across context length.
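To see why a "200k BF16 KV cache" headline deserves a memory check, here is a rough sizing sketch. It assumes a standard full-attention transformer with grouped-query attention, which may not match the actual model (hybrid or sliding-window designs would need a different formula). The layer count, KV head count, and head dimension below are hypothetical placeholders, not Qwen3.6-27B's published configuration.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_element: int = 2) -> int:
    """Approximate KV cache size for one sequence, in bytes.

    The leading factor of 2 covers the separate key and value tensors.
    bytes_per_element=2 corresponds to BF16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_element)


# HYPOTHETICAL architecture values, chosen only to illustrate the arithmetic.
size = kv_cache_bytes(num_layers=48, num_kv_heads=4, head_dim=128,
                      context_tokens=200_000)
print(f"{size / 2**30:.1f} GiB for the KV cache alone")
```

Swapping in the real config values, once someone publishes them alongside the run, would show how much of the 48GB is left for weights and activations.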
Long-context inference is not a binary property. A model can accept a large context and still become unpleasant once prefill, attention, memory fragmentation, or cache pressure shows up. The RTX 3090 headline also does not state context length. A 24GB card running a 27B-class model has tight memory economics, especially if the claim involves FP8 or lower precision. The 72 tok/s figure is very unlikely to describe the same condition as the 200k-token RTX 5000 PRO result.

The Windows-native vLLM angle is the part I take most seriously. vLLM’s center of gravity has long been Linux server setups. Local users have leaned on WSL2, Docker, llama.cpp, Ollama, LM Studio, TensorRT-LLM variants, and community launchers. If native Windows vLLM is stable enough for a portable installer, that matters more than a speed screenshot. Many corporate desktops block Docker. Some IT policies make WSL painful. A packaged Windows path can expand the test surface for internal assistants, document QA, log analysis, and coding tools where one decent local GPU beats API procurement friction.

The obvious pushback: LocalLLaMA has a habit of turning “it runs on my box” into a performance story. That community is useful because people actually test hardware, but titles often omit the exact conditions that determine whether a number generalizes. Different prompts, sampling settings, context lengths, and warm-up behavior can move token rates a lot. I would not put 72 tok/s into a buying memo. I would not use 80 TPS for capacity planning. I would not compare either number against hosted APIs without a reproduction script.

The practical read for AI teams is narrower and still useful. Qwen’s 20B-30B class appears to be entering a zone where single-card local use is no longer a hobby-only story. The useful workloads are low-concurrency and privacy-sensitive: internal code help, ticket triage, document search augmentation, local data exploration, and offline evaluation.
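The point that one run can yield several different "tokens per second" figures can be made concrete with a small sketch. The field names and the timing values are hypothetical and do not come from either Reddit post. Time to first token is used here as a rough stand-in for prefill time, which is itself an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    prompt_tokens: int
    output_tokens: int
    time_to_first_token_s: float  # rough proxy for prefill time
    total_time_s: float           # request start to last token


def decode_tps(t: RunTiming) -> float:
    """Tokens generated after the first one, per second of decode time."""
    return (t.output_tokens - 1) / (t.total_time_s - t.time_to_first_token_s)


def end_to_end_tps(t: RunTiming) -> float:
    """Output tokens divided by the full wall-clock time, prefill included."""
    return t.output_tokens / t.total_time_s


def prefill_tps(t: RunTiming) -> float:
    """Prompt tokens processed per second before the first output token."""
    return t.prompt_tokens / t.time_to_first_token_s


# Made-up timings for a single long-prompt request.
run = RunTiming(prompt_tokens=25_000, output_tokens=513,
                time_to_first_token_s=20.0, total_time_s=28.0)
print(f"decode:     {decode_tps(run):.1f} tok/s")
print(f"end-to-end: {end_to_end_tps(run):.1f} tok/s")
print(f"prefill:    {prefill_tps(run):.1f} tok/s")
```

With these invented timings the decode rate is several times the end-to-end rate, which is why a headline figure is hard to use without knowing which one it reports.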
The missing items are the ones that decide whether this becomes operational: GitHub repo, installer hash, pinned dependencies, bench command, model file, quantization path, driver matrix, and third-party reruns. Until those exist, this is a radar ping, not a benchmark.
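The checklist above can be sketched as a small manifest that a submitter would fill in before a number counts as more than a lead. The class name, field names, and the two verdict labels are illustrative assumptions, not an existing schema or tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BenchManifest:
    """Hypothetical record of the items the write-up lists as missing."""
    repo_url: Optional[str] = None
    installer_sha256: Optional[str] = None
    pinned_dependencies: Optional[str] = None
    bench_command: Optional[str] = None
    model_file: Optional[str] = None
    quantization_path: Optional[str] = None
    driver_matrix: Optional[str] = None
    third_party_reruns: Optional[str] = None


def missing_items(manifest: BenchManifest) -> List[str]:
    """Names of fields that are still unset or empty."""
    return [f.name for f in fields(manifest)
            if getattr(manifest, f.name) in (None, "")]


def verdict(manifest: BenchManifest) -> str:
    """'radar ping' until every item is supplied."""
    return "radar ping" if missing_items(manifest) else "benchmark candidate"


# With nothing disclosed, every field is reported as missing.
claim = BenchManifest()
print(verdict(claim), missing_items(claim))
```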
HKR breakdown
hook · knowledge · resonance
open source
SCORE 87 · H1·K1·R1