ObviousBench

Where obvious tasks still break language models

Draft: This website and result table are placeholders until the paper-sweep artifacts are frozen.

Introduction

ObviousBench measures how often language models fail on short, objective prompts that a careful human can solve directly.

The headline score is answer correctness. Format and strict-compliance scores are reported separately, so a correct answer with extra prose is not treated as the main benchmark failure.

The public site will link only to frozen, release-safe artifacts, and only after the report, dataset, and code paths have been intentionally published.

Leaderboard

Draft placeholder layout. Replace with frozen paper-sweep results before public launch.

Rank | Model         | Correct | 95% CI | Strict | Cost
1    | Example model | --      | --     | --     | --
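
This draft does not say how the 95% CI column will be computed. As an illustration only, the sketch below uses the Wilson score interval, a common choice for a binomial proportion such as answer correctness. The function name `wilson_interval` and its parameters are hypothetical and not taken from the ObviousBench code.

```python
import math

# Two-sided 95% normal quantile. Assumption: the leaderboard interval is
# a plain binomial interval over independent items.
Z_95 = 1.959963984540054


def wilson_interval(num_correct: int, num_items: int, z: float = Z_95):
    """Return (low, high) Wilson score bounds for a correctness rate."""
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must be between 0 and num_items")

    p = num_correct / num_items
    z2 = z * z
    denom = 1.0 + z2 / num_items
    center = (p + z2 / (2 * num_items)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / num_items + z2 / (4 * num_items**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(50, 100)
    print(f"{low:.3f} to {high:.3f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] even when a model gets nearly every item right or wrong, which is likely on a benchmark of obvious tasks.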

Benchmark

Current task families include character counting, spelling transforms, arithmetic, word counting, ordering, format compliance, negation, and constraint awareness.

Deterministic scorers are used for the release-safe benchmark artifacts.
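
As a rough illustration of what a deterministic scorer could look like for the character-counting family, the sketch below computes the ground truth in code and compares it with the last integer in the model response. Everything here is an assumption made for illustration: the function names, the case-insensitive counting rule, and the "last integer wins" extraction heuristic are not taken from the ObviousBench release.

```python
import re
from typing import Optional


def expected_char_count(text: str, char: str) -> int:
    """Ground truth: occurrences of one character, ignoring case (assumed rule)."""
    if len(char) != 1:
        raise ValueError("char must be a single character")
    return text.lower().count(char.lower())


def extract_last_integer(response: str) -> Optional[int]:
    """Return the last integer in the response, or None if there is none."""
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None


def score_char_count(text: str, char: str, response: str) -> bool:
    """True when the extracted answer equals the computed ground truth."""
    return extract_last_integer(response) == expected_char_count(text, char)


if __name__ == "__main__":
    print(score_char_count("strawberry", "r", "There are 3 r's in strawberry."))
```

Because the expected answer is computed rather than judged by another model, the same response always receives the same score.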

Evaluation split
Each prompt produces one model answer, which is scored separately for answer correctness and strict compliance.
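
The two-score evaluation described above can be sketched as follows. This is a minimal sketch under stated assumptions: `ItemScore`, `score_item`, and `summarize` are hypothetical names; correctness is approximated by checking whether the gold answer appears as a standalone token in the response; and strict compliance is approximated by requiring the response to be exactly the gold answer. A real scorer would need a more careful answer extractor, since token matching also accepts responses such as "not 7".

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ItemScore:
    correct: bool  # headline score: the right answer is present
    strict: bool   # the response is exactly the answer, with no extra prose


def _normalize(text: str) -> str:
    return text.strip().lower()


def score_item(response: str, gold: str) -> ItemScore:
    """Score one response on both axes (illustrative matching rules)."""
    gold_n = _normalize(gold)
    resp_n = _normalize(response)
    # Lenient check: gold answer appears as a standalone token.
    pattern = r"(?<!\w)" + re.escape(gold_n) + r"(?!\w)"
    correct = re.search(pattern, resp_n) is not None
    # Strict check: nothing but the answer itself.
    strict = resp_n == gold_n
    return ItemScore(correct=correct, strict=strict)


def summarize(scores: Iterable[ItemScore]) -> dict:
    """Aggregate per-item scores into the two reported rates."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores to summarize")
    n = len(scores)
    return {
        "correct": sum(s.correct for s in scores) / n,
        "strict": sum(s.strict for s in scores) / n,
    }


if __name__ == "__main__":
    results = [score_item("The answer is 7.", "7"), score_item("7", "7")]
    print(summarize(results))
```

Keeping the two fields separate means a response like "The answer is 7." counts toward the headline correctness rate while still lowering the strict-compliance rate.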

Try Yourself

A lightweight demo can be added after the public prompt examples are selected. Until then, this page intentionally avoids exposing unpublished benchmark items.

Report

The report link will be added after the arXiv-ready PDF and citation metadata are frozen.

Artifacts

Public Dataset

Release link pending. No unpublished item cards or private human-baseline data are committed here.

Code

Release link pending. The public code path will be linked only after the publication repo or branch is approved.

Citation

BibTeX pending until arXiv metadata exists.