FEATURED · arXiv · cs.AI · atom · EN · 17:59 · 06·05
→How Reliable Are LLMs When Playing Dice?
The study tests 8 state-of-the-art models on two discrete-probability datasets, with and without Chain-of-Thought prompting; average accuracy reaches 0.96 on standard problems, falls to 0.59 on counterintuitive ones, and drops by up to 34% when prompts include misleading suggestions.
#Reasoning · #Benchmarking · #Research release · #Benchmark
why featured
All three HKR checks (hook, knowledge, resonance) pass: the dice setup is novel, the accuracy drops are concrete, and the topic bears directly on reasoning robustness. The task is narrow and only the arXiv abstract is available, so the item sits near the featured threshold.
editor take
Dice problems puncture the math-reasoning story: 8 models hit 0.96 on standard items, then fall to 0.59 on counterintuitive ones.
sharp
The uncomfortable part is not that probability is hard. It is that surface familiarity still does too much work. Across 8 state-of-the-art models, average accuracy is 0.96 on standard discrete-probability exercises, then drops to 0.59 on counterintuitive ones. Disguising canonical formulations cuts performance by over 20%. Adding misleading suggestions cuts it by up to 34%.
That is a bad look for the Chain-of-Thought story. The paper tests each model with and without CoT, but the abstract does not disclose model-level CoT gains. From the numbers given, CoT looks closer to fluent template execution than robust probability-space checking. Set beside high scores on GSM8K or MATH-style benchmarks, simple dice problems become the cleaner autopsy tool.
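For a sense of what "probability-space checking" means in practice, here is a minimal sketch that enumerates every outcome of a dice roll instead of pattern-matching on the wording. The problem (two fair dice, condition on "at least one six" versus "the first die is a six") is a textbook illustration chosen for this note, not an item from the paper's datasets, and the helper name `conditional_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, sides=6, dice=2):
    """Exact P(event | given), found by enumerating all equally likely rolls."""
    outcomes = list(product(range(1, sides + 1), repeat=dice))
    conditioned = [roll for roll in outcomes if given(roll)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favourable = [roll for roll in conditioned if event(roll)]
    return Fraction(len(favourable), len(conditioned))


def both_sixes(roll):
    return all(die == 6 for die in roll)


def at_least_one_six(roll):
    return 6 in roll


def first_is_six(roll):
    return roll[0] == 6


# Two near-identical wordings, two different sample spaces.
p_at_least_one = conditional_probability(both_sixes, at_least_one_six)
p_first = conditional_probability(both_sixes, first_is_six)

print(p_at_least_one)  # 1/11
print(p_first)         # 1/6
```

The two conditions read almost the same, yet one leaves 11 possible rolls and the other leaves 6. A solver that counts outcomes gets both right; one that reaches for the familiar 1/6 template gets only the second.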
HKR breakdown
hook ✓ knowledge ✓ resonance ✓