Skip to content

Benchmarks

Kumiki's claims — "fewer tokens than React" and "an LLM can learn it from the spec alone" — are measured, not asserted. Two suites live in packages/benchmarks; every number on this page comes from them and can be reproduced with the commands at the bottom.

  • Size comparison — how compact is a Kumiki app, and a Kumiki edit, versus the equivalent React? Deterministic; re-measured 2026-06-11.
  • Learning cost — given only the spec, can an LLM write a program that parses, typechecks, and builds in a single pass? Cross-vendor (Claude / Codex / Gemini), re-scored 2026-06 against the current compiler.

Size comparison (Kumiki vs React)

Baselines: the same TodoMVC written twice — 02-todomvc/app.kumiki and a plain React App.tsx. Tokenized with gpt-tokenizer.

Whole file

charsLOC (code)cl100k tokenso200k tokens
Kumiki4,7101161,3601,364
React7,3572361,8871,926
React ÷ Kumiki1.56×2.03×1.39×1.41×

The same app costs an LLM ~1.4× fewer tokens and ~2× fewer lines in Kumiki.

Edit scenarios

Whole-file size matters once; edit size matters every iteration. Four realistic feature changes were applied to both implementations, and the resulting unified diffs measured:

ScenarioImpl+lines−linescharscl100k
01 add priority fieldKumiki44568175
React83483150
02 strict validationKumiki74432112
React4232288
03 add archived stateKumiki19111,540442
React2791,475417
04 dark themeKumiki2561,315401
React48122,153638
TotalKumiki55253,8551,130
React87264,4331,293

Totals favor Kumiki — 1.58× fewer added lines, 1.14× fewer tokens — but not uniformly: small, localized React edits (scenarios 01–02) are cheaper because JSX changes one attribute in place, while Kumiki replaces whole definitions. Kumiki wins where a change crosses state + UI + logic (03, 04), which is where edits get risky in React.

Edit representation (full file vs patch vs op stream)

Kumiki's AI-editing verbs (add / replace / remove) send whole definitions. Measured across the same four scenarios, the op stream costs 26% of the tokens of rewriting the full file (1,544 vs 5,896 cl100k) — but a plain unified text patch is smaller still (1,130). The op stream's value is not minimal bytes; it's that each op is independently checkable and never produces a syntactically broken file.

Learning cost (writing Kumiki from the spec alone)

Each task gives a model only docs/spec/ + a task spec and asks for a single-pass .kumiki program — no example apps, no compiler in the loop, no retries. The harness then scores parse / typecheck / build. Protocol details and fairness notes (including why an earlier 4/4 Claude run was discarded) are in learning-cost/summary.md.

TaskVendorLOCcl100kparsetypecheckbuild
v1 Pomodoro (~60 LOC)Claude59384
v2 Kanban (~200 LOC)Claude1781,421
Codex2431,881
Gemini1521,314
v3 Issue Tracker (~600 LOC)Claude6295,325
Codex6746,417
Gemini4404,995
v4 Project Mgmt (~900 LOC)Claude1,0299,552
Codex8778,703
Gemini2944,397

What the table says:

  • Mid-size apps build from the spec alone, in one pass. Every vendor builds v2; two of three build the ~600-LOC v3.
  • Codex builds everything it attempted, including the ~880-LOC v4 — the only vendor to survive the largest task. Claude holds through v3, then trips on an unsupported match pattern at v4 scale; Gemini degrades earliest.
  • The benchmark is a compiler test too. The runs surfaced two real defects — built-in tiles that crashed at build (#61) and rules the spec stated only by example (#62). Both are fixed, and the table above is scored against the patched compiler. The three remaining ❌ are genuine authoring errors the toolchain correctly rejects.

Reproducing

sh
pnpm --filter @kumikijs/benchmarks measure            # whole-file sizes + ratios
pnpm --filter @kumikijs/benchmarks measure:scenarios  # per-scenario patch sizes
pnpm --filter @kumikijs/benchmarks measure:ops        # full-file vs patch vs op stream
pnpm --filter @kumikijs/benchmarks eval <output.kumiki>  # score a learning-cost output

To refresh a learning-cost vendor column, run the model on the committed prompt (vN-*/codex-prompt.txt / gemini-prompt.txt), save its output under results/<Vendor>/output.kumiki, and re-run eval.