## 🍓 Leaderboard
| # | Model | Strategy | Overall ↑ | Easy | Medium | Hard | Sent. | Para. | Names | Foreign | 🍓 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Loading… | | | | | | | | | | | |
## 🍓 About StrawberryBench
### The Problem
In 2023, a viral moment exposed a curious weakness in ChatGPT: asked how many times the letter "r" appears in "strawberry", it could not produce the correct answer, three. This failure is not a quirk; it stems from how LLMs tokenize text. BPE tokenizers decompose words into sub-word units, so a model never directly observes individual characters, which makes character-level counting genuinely non-trivial.
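The sketch below shows the issue concretely, assuming the `tiktoken` package (not part of the benchmark itself) is available:

```python
import tiktoken

# cl100k_base is the BPE vocabulary used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

# The model receives opaque sub-word chunks, not individual letters,
# so the three r's are never visible as separate symbols.
print([enc.decode([t]) for t in enc.encode("strawberry")])

# At the character level, the count is trivial:
print("strawberry".count("r"))  # 3
```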
### The Benchmark
StrawberryBench contains 847 questions across seven tiers: a five-stage difficulty progression (easy, medium, hard, sentence, and paragraph) and two domain checks (names and foreign). Each question asks a model to count the occurrences of a specific letter in a given piece of text. Roughly 25% of questions have a zero count (the target letter is absent).
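For concreteness, a question record might look like the sketch below; every field name here is a hypothetical illustration, not the benchmark's published schema:

```python
# Hypothetical record layout; the keys are assumptions for illustration only.
question = {
    "id": "easy-0042",      # hypothetical identifier
    "tier": "easy",         # easy | medium | hard | sentence | paragraph | names | foreign
    "letter": "r",          # the letter to count
    "text": "strawberry",   # the text to count it in
    "answer": 3,            # gold integer count (zero in ~25% of records)
}
```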
### Evaluation
Models are evaluated under three prompt strategies: zero-shot, few-shot (3-shot), and chain-of-thought. Scoring is exact match on the integer count, with fuzzy parsing that also accepts written-out number words (e.g. "three"). All models are called through OpenRouter for consistency.
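The scorer below is a minimal sketch of that evaluation step, assuming "fuzzy parsing" means accepting either a bare integer or a spelled-out number word; the regex and word list are illustrative, not the benchmark's actual code:

```python
import re

# Illustrative assumption: fuzzy parsing covers digits and small number words.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def parse_count(answer: str) -> int | None:
    """Extract an integer count from a model's free-text answer."""
    text = answer.strip().lower()
    digits = re.findall(r"\d+", text)
    if digits:
        # Take the last number so chain-of-thought reasoning text is tolerated.
        return int(digits[-1])
    for word, value in NUMBER_WORDS.items():   # fall back to number words
        if re.search(rf"\b{word}\b", text):
            return value
    return None                                # unparseable: scored as wrong

def is_correct(answer: str, gold: int) -> bool:
    """Exact match on the parsed integer count."""
    return parse_count(answer) == gold
```

Since OpenRouter exposes an OpenAI-compatible API, the model call itself can reuse a standard OpenAI client pointed at `https://openrouter.ai/api/v1`.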