StrawberryBench

The definitive benchmark for evaluating letter-counting ability in Large Language Models.

πŸ€— Dataset ⭐ GitHub

πŸ† Leaderboard

The interactive leaderboard reports exact-match accuracy for every model and prompt strategy, filterable by provider and strategy and sorted by overall accuracy on the full test set. Columns cover Overall plus per-tier scores: Easy (3–6 chars), Medium (7–12 chars), Hard (13+ chars), Sentence, Paragraph, Names, and Foreign.

πŸ“– About StrawberryBench

πŸ”€

The Problem

In 2023, a viral moment exposed a curious weakness in ChatGPT: it could not correctly count the letter r in strawberry (the correct answer is 3). This failure is not a quirk; it stems from how LLMs tokenize text. BPE tokenizers decompose words into sub-word units, so a model reasons over opaque token IDs rather than individual characters, which makes character-level counting genuinely non-trivial.
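The mismatch can be seen with a toy segmentation. The split below is illustrative only; real BPE merges vary by tokenizer:

```python
# Hypothetical sub-word segmentation of "strawberry" (illustrative only;
# actual BPE tokenizers may split the word differently).
tokens = ["str", "aw", "berry"]

# A model operating on token IDs never sees the letters directly;
# the character-level ground truth lives in the joined string.
word = "".join(tokens)
assert word == "strawberry"

# Character-level count: the answer the benchmark expects.
print(word.count("r"))                    # 3

# Per-token counts show the target letter scattered across units.
print([t.count("r") for t in tokens])     # [1, 0, 2]
```

Because the r's are split across three different sub-word units, a model has no single token whose identity encodes "this word contains three r's".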

πŸ“Š

The Benchmark

StrawberryBench contains 847 questions across seven tiers: a five-stage progression (easy, medium, hard, sentence, and paragraph) and two domain checks (names and foreign). Each question asks a model to count the occurrences of a specific letter. ~25% of questions have a zero count (letter absent).
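Ground truth for each question reduces to a single string operation. A minimal sketch (the official scoring code may differ, e.g. in case handling):

```python
def letter_count(text: str, letter: str) -> int:
    """Case-insensitive count of a single letter in a text.

    Sketch of the benchmark's ground truth; zero is a valid answer,
    since roughly a quarter of questions ask about an absent letter.
    """
    return text.lower().count(letter.lower())


print(letter_count("strawberry", "r"))    # 3
print(letter_count("Mississippi", "s"))   # 4
print(letter_count("banana", "z"))        # 0  (zero-count case)
```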

πŸ§ͺ

Evaluation

Models are evaluated with three prompt strategies: zero-shot, few-shot (3-shot), and chain-of-thought. Scoring is exact-match on the integer count, with fuzzy parsing for written-out number words. All models are called via OpenRouter for consistency.