# AI evidence: one-shot transcript portfolio
This page is the operational layer of kerf’s AI evidence: a portfolio of one-shot transcripts that re-derive the complete example apps from a one-paragraph prompt and nothing else. The model is pointed at `llms.txt` and the AI usage guide — no kerf code in its context window beforehand. The transcripts are the experiment; the running apps at the bottom of each page are that exact code.
The precedent is Built by an AI · Pomodoro — the original proof that the docs kerf ships are sufficient for an off-the-shelf model to produce a working app. This portfolio extends that to five more shapes covering the harder cases: drag-and-drop interaction, focus survival across re-renders, streaming list updates, caret preservation during `contenteditable` diff, and a multi-region store-backed UI.
## The transcripts
- Kanban — three columns, ten cards, drag across columns. Exercises pointer events + `delegate()` + `data-morph-skip` on the dragging card.
- TodoMVC — the canonical reactive-UI test bed. Exercises `defineStore` + `each` + filter/clear actions + focus survival on insert/delete (see the keyed-list sketch after this list).
- Dashboard — a live-updating ticker grid with one tick per second. Exercises `signal` + `computed` + `effect` + `batch` against a deterministic update loop (see the reactive sketch after this list).
- Markdown editor — split-pane editor with live preview. Exercises `raw()` + sanitization + caret survival in a `contenteditable`-adjacent re-render.
- Streaming chat — token-by-token bot reply with composer caret preservation. Exercises `each()` keyed-list reconcile during a stream + the textarea caret-survival contract.
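Two sketches to ground the API names above. First, the reactive loop the Dashboard prompt exercises: a minimal sketch assuming a callable-signal shape (call to read, call with an argument to write). The primitive names come from kerf's docs; the exact signatures here are assumptions, not kerf's published API.

```ts
import { signal, computed, effect, batch } from "kerf";

// Assumed signal shape: price() reads, price(v) writes.
const price = signal(100);
const delta = signal(0);
const display = computed(() => (price() + delta()).toFixed(2));

effect(() => {
  // Re-runs when `display` changes; with batching, once per tick.
  document.querySelector("#ticker")!.textContent = display();
});

// One deterministic tick per second. Both writes land inside one
// batch, so `display` recomputes once per tick, not once per write.
let tick = 0;
setInterval(() => {
  tick += 1;
  batch(() => {
    delta(Math.sin(tick) * 0.5); // deterministic, bounded movement
    price(price() + delta());
  });
}, 1000);
```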
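Second, the keyed-list contract that TodoMVC and Streaming chat both hinge on: `each` keyed by a stable id moves existing DOM nodes instead of recreating them, which is what lets a focused input or the composer caret survive inserts mid-stream. The three-argument shape (items, key, render) and the mountable return value are assumptions; kerf's real `each`/`defineStore` signatures are the ones its docs publish.

```ts
import { defineStore, each } from "kerf";

type Message = { id: number; text: string };

// Assumed store shape: a reactive object whose reads are tracked.
const chat = defineStore({ messages: [] as Message[] });

// Keyed by `id`: when a new message is appended mid-stream, reconcile
// reuses the existing <li> nodes for messages already rendered, so
// focus elsewhere on the page is untouched.
const log = each(
  () => chat.messages,
  (m: Message) => m.id,
  (m: Message) => {
    const li = document.createElement("li");
    li.textContent = m.text;
    return li;
  },
);

// Assuming `each` returns a mountable node/fragment.
document.querySelector("#log")!.append(log);
```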
## Honest accounting
Each page follows the same rule the Pomodoro precedent set: show what the model produced, as it produced it. When the produced code has bugs or needs cleanup, the diff is documented verbatim. The page does not pretend the model is infallible — it documents what an off-the-shelf model with kerf’s published docs can do, and what it can’t.
## How a transcript is produced
The reproducibility recipe is identical for every page:
- Pin the kerf version. Recorded in the page frontmatter so future readers know which `llms.txt` revision the model saw (the pinned fields are sketched after this list).
- Pin the model. Currently Claude Opus 4.7, with the 1M-context variant where it matters. Future runs against new models append rather than replace, so the portfolio doubles as a regression suite when a model gets better or worse at kerf.
- Pin the prompt. A single paragraph that names the app, the interaction, and the doc URLs the model should fetch. No kerf-specific identifiers in the prompt itself — the prompt could be given to any reactive framework’s docs.
- Run the prompt in a clean session. No prior kerf context in the model’s conversation. The model has only what it learned at pretraining time plus whatever fetching the cited URLs returns.
- Capture the raw output verbatim. If the model produced something that doesn’t compile or doesn’t run, the page documents that and shows the minimum diff that made it run.
- Stand up the running app at the bottom of the page from the produced code.
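The first two pins are plain metadata. A sketch of the shape a page records, with illustrative field names rather than kerf's actual frontmatter schema:

```ts
// Hypothetical pin record for one transcript page. Field names are
// illustrative; they are not kerf's actual frontmatter schema.
interface TranscriptPins {
  kerfVersion: string;   // which `llms.txt` revision the model saw
  model: string;         // e.g. "claude-opus-4.7"; 1M-context variant noted where used
  runDate: string;       // ISO date of the one-shot run
  promptUrls: string[];  // doc URLs the prompt names for the model to fetch
}
```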
## Status
| App | Prompt drafted | Transcript captured | Page published |
|---|---|---|---|
| Kanban | ✅ | ⏳ | scaffold |
| TodoMVC | ✅ | ⏳ | scaffold |
| Dashboard | ✅ | ⏳ | scaffold |
| Markdown editor | ✅ | ⏳ | scaffold |
| Streaming chat | ✅ | ⏳ | scaffold |
The five prompts are drafted in their respective pages; the transcript capture is the remaining work. Tracked in KF-170 — the one-shot transcript portfolio sub-ticket of the AI-evidence epic (Hot Sheet is local-only, so the ticket reference is for provenance only).
## Caveats
- Model behavior drifts. Each transcript records its model + run date so the reader can interpret the result in context. When a new frontier model lands, the experiment gets re-run; older runs are kept in a history section per page rather than overwritten.
- The prompts are not blind. They name the app shape and the doc URLs. They do not name the framework as “kerf” beyond the URL — but a clever model could infer the framework from the URLs themselves. The blind-prompt question is the harder operational measurement, tracked separately under the empirical-benchmark sub-ticket (the krausest-style cross-framework grid).
- One run per cell ≠ statistical significance. A single one-shot is an existence proof, not a sampling distribution. The portfolio shows the model can do this; the empirical benchmark establishes how often it does.