StrawberryBench

The definitive benchmark for evaluating letter-counting ability in Large Language Models.

πŸ€— Dataset ⭐ GitHub

πŸ† Leaderboard

The interactive leaderboard reports exact-match accuracy for every model and prompt strategy, filterable by provider and strategy and sorted by overall accuracy on the full test set. Columns cover Overall plus per-tier scores: Easy (3–6 chars), Medium (7–12 chars), Hard (13+ chars), Sentence, Paragraph, Names, and Foreign.

πŸ“– About StrawberryBench

πŸ”€

The Problem

In 2023, a viral moment exposed a curious weakness in ChatGPT: it could not correctly count the letter r in strawberry (the correct answer is 3). This failure is not a quirk; it stems from how LLMs tokenize text. BPE tokenizers decompose words into sub-word units, so a model reasons over opaque token IDs rather than individual characters, which makes character-level counting genuinely non-trivial.
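The mismatch can be seen with a toy segmentation. The split below is illustrative only; real BPE merges vary by tokenizer:

```python
# Hypothetical sub-word segmentation of "strawberry" (illustrative only;
# actual BPE tokenizers may split the word differently).
tokens = ["str", "aw", "berry"]

# A model operating on token IDs never sees the letters directly;
# the character-level ground truth lives in the joined string.
word = "".join(tokens)
assert word == "strawberry"

# Character-level count: the answer the benchmark expects.
print(word.count("r"))                    # 3

# Per-token counts show the target letter scattered across units.
print([t.count("r") for t in tokens])     # [1, 0, 2]
```

Because the r's are split across three different sub-word units, a model has no single token whose identity encodes "this word contains three r's".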

πŸ“Š

The Benchmark

StrawberryBench contains 847 questions across seven tiers: a five-stage progression (easy, medium, hard, sentence, and paragraph) and two domain checks (names and foreign). Each question asks a model to count the occurrences of a specific letter. ~25% of questions have a zero count (letter absent).
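Ground truth for each question reduces to a single string operation. A minimal sketch (the official scoring code may differ, e.g. in case handling):

```python
def letter_count(text: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a text.

    Sketch of the benchmark's ground truth; zero is a valid answer,
    since roughly a quarter of questions ask about an absent letter.
    """
    return text.lower().count(letter.lower())


print(letter_count("strawberry", "r"))    # 3
print(letter_count("Mississippi", "s"))   # 4
print(letter_count("banana", "z"))        # 0  (zero-count case)
```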

πŸ§ͺ

Evaluation

Models are evaluated with three prompt strategies: zero-shot, few-shot (3-shot), and chain-of-thought. Scoring is exact-match on the integer count, with fuzzy parsing for written-out number words. All models are called via OpenRouter for consistency.