ObviousBench

Where obvious tasks still break language models

Adam Allcock

Draft: This website and result table are placeholders until the paper-sweep artifacts are frozen.

Introduction

ObviousBench measures how language models perform on short, objective prompts that a careful human can solve directly but that models can still answer incorrectly.

The headline score is answer correctness. Format and strict-compliance scores are reported separately, so a correct answer wrapped in extra prose does not count as a failure on the headline score.

The public site will link only to frozen, release-safe artifacts, and only after the report, dataset, and code paths are intentionally published.

Leaderboard

Draft placeholder layout. Replace with frozen paper-sweep results before public launch.

Rank	Model	Correct	95% CI	Strict	Cost
1	Example model	--	--	--	--
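
The draft does not say how the 95% CI column will be computed. As one plausible option, a Wilson score interval over per-item correctness could be sketched as follows. The function name `wilson_interval` and the choice of the Wilson method are assumptions for illustration, not the benchmark's stated procedure.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: items are scored independently as correct/incorrect.
    z = 1.96 corresponds to an approximately 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    low, high = wilson_interval(50, 100)
    print(f"{low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when a model scores near 0% or 100%.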

Rank	Model	Strict	Correct	Notes
1	Example model	--	--	Draft placeholder

Benchmark

Current task families include character counting, spelling transforms, arithmetic, word counting, ordering, format compliance, negation, and constraint awareness.

Deterministic scorers are used for the release-safe benchmark artifacts.
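
To illustrate what a deterministic scorer might look like for one task family, here is a minimal sketch for character counting. The gold answer is computed by code from the item itself, so the same output always receives the same score. All names (`gold_char_count`, `extract_last_integer`, `score_char_count`) and the "last integer in the output" extraction rule are hypothetical and not taken from the benchmark code.

```python
import re
from typing import Optional


def gold_char_count(word: str, letter: str) -> int:
    """Compute the gold answer directly from the item (case-insensitive)."""
    return word.lower().count(letter.lower())


def extract_last_integer(output: str) -> Optional[int]:
    """Return the last integer in the model output, or None if there is none.

    Assumption: the model's final numeric token is its answer.
    """
    matches = re.findall(r"-?\d+", output)
    return int(matches[-1]) if matches else None


def score_char_count(word: str, letter: str, output: str) -> bool:
    """True when the extracted answer equals the computed gold count."""
    return extract_last_integer(output) == gold_char_count(word, letter)


if __name__ == "__main__":
    print(score_char_count("strawberry", "r", "There are 3 r's."))
    print(score_char_count("strawberry", "r", "I count 2."))
```

Because no model or human judgment is involved in scoring, results can be reproduced exactly from the stored outputs.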

Evaluation split: prompt to model answer, scored separately for answer correctness and strict compliance.
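
The split between the two scores can be sketched as below. This is an illustrative assumption rather than the benchmark's actual rule: strict compliance is taken to mean that the output is exactly the expected answer, and answer correctness additionally accepts outputs whose final token is the expected answer. `ItemScore` and `score_output` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemScore:
    correct: bool  # headline score: did the model land on the expected answer?
    strict: bool   # reported separately: is the output the answer and nothing else?


def score_output(output: str, expected: str) -> ItemScore:
    """Score one model output against one expected answer string."""
    normalized = output.strip()
    strict = normalized == expected
    # Lenient rule (assumption): compare the last whitespace-separated token,
    # with surrounding punctuation and quotes removed.
    tokens = normalized.split()
    last = tokens[-1].strip(".,!?\"'") if tokens else ""
    correct = strict or last == expected
    return ItemScore(correct=correct, strict=strict)


if __name__ == "__main__":
    print(score_output("The answer is 7.", "7"))
    print(score_output("7", "7"))
```

Under this sketch, a response such as "The answer is 7." counts toward the headline score but not toward strict compliance.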

Try Yourself

A lightweight demo can be added after the public prompt examples are selected. Until then, this page intentionally avoids exposing unpublished benchmark items.

Demo pending

Report

The report link will be added after the arXiv-ready PDF and citation metadata are frozen.

Report pending

Artifacts

Public Dataset

Release link pending. No unpublished item cards or private human-baseline data are committed here.

Code

Release link pending. The public code path will be linked only after the publication repo or branch is approved.

Citation

BibTeX pending until arXiv metadata exists.