← Back

Prompt IDE

A library that saves a working prompt as a reproducible artifact, so the prompts that produce consistent results stop getting lost.

Project overview

Type: Product · 6-week 0→1 · web app + browser extension (for capture)
Project type: 0→1 · Creator & Developer Tool · AI Workflows · Knowledge Capture
Role: Lead Product Designer · Interaction Design · UX Research · Experiment Design
Methods: JTBD interviews · task analysis · usability testing · A/B
Tools: Otter.ai · Figma + Dev Mode · Cursor · Claude Code · Skills.md · Amplitude · Statsig
Case thesis: Designing a prompt library that captures a working prompt as a reproducible artifact (its model, parameters, and a proof output) in one action, so the prompts that produce consistent results stop getting lost.

The context

As AI gets more agentic, some argue that authoring good prompts is becoming obsolete. Consistency-critical work tells a different story: in image generation, a specific phrasing plus parameters reproduces a look; in music, creators reuse prompts to hold a style; in structured LLM tasks, a tuned prompt produces reliable output. In all three, small changes reproduce or break a result, so a prompt that works is a discovery worth keeping. Today those discoveries scatter across notes apps, docs, and screenshots.

The problem

Creators already save prompts and still cannot reproduce the results. In research, prompts stored in notes and docs reproduced the original result on later reuse only 31% of the time (behavioral, reproduction test), because the saved text dropped the conditions that made it work — the model and version, the parameters, and any proof it had worked. 72% of participants had lost a prompt they knew had produced a result they wanted (attitudinal, survey).

The goal

Turn a working prompt into an asset a creator can reproduce and reuse, measured by capture rate, reproduction success on reuse, and reuse and retention, rather than by how many prompt fields the tool could store.

Empathize — Creators already saved prompts and reproduced the result only 31% of the time, because the text dropped the conditions that made it work

In this section: Research foundation · Key insights

Research foundation (method)

Phase 1 — Creator interviews (n=20, ~40 min, image, music, and LLM users, recruited via dscout, transcribed in Otter.ai): how they save prompts today and when reuse fails.
Phase 2 — Task analysis of the generate → save → reuse loop: mapped where the working prompt is lost, which located the break at the moment of capture (mid-generation) and at the conditions dropped when saving by hand.
Phase 3 — Reproduction test (n=26): participants reused a prompt they had saved in their usual way and we recorded whether the result came back.
Phase 4 — Survey: ~1,067 invited; 190 analyzable responses (17.8% response rate); ~129 completed every item (12.1% completion rate). Attitudinal percentages are computed on the 190 analyzable responses; question types are labeled per question in the appendix.
Phase 5 — Usability testing of the artifact: validated the three-column layout and the one-action capture against the full-metadata form before the pilot.
Phase 6 — Prototype pilot (Amplitude-instrumented, 65 creators, ~32 per A/B arm, spring 2025): behavior on capture and reuse.

Key insights

1. The prompt text alone is not reproducible. The same string returns a different result on a different model, version, or parameter set, so a saved string without its conditions failed to reproduce. Triangulation:

Behavioral: prompts saved as plain text reproduced the result 31% of the time on reuse.
Verbatim (C5, image, Midjourney) — coded: Lost conditions: "I have the prompt in my notes, but I can't remember the version or the settings, and now it gives me something else."

2. The reproducibility context is specific and knowable, and it is exactly what gets dropped. What makes a prompt repeat is the model and version, the parameters (aspect ratio, stylize, chaos, seed, and the like), and at least one example output as evidence. These are the fields creators omitted when saving by hand.

Behavioral: prompts that happened to include the model and parameters reproduced at roughly 2.5× the rate of bare text (≈78%).
Verbatim (C12, music, Suno) — coded: Style drift: "Same prompt, but on the new model version it just doesn't sound like the track I made before, and I never wrote down which version it was."

3. Capture happens mid-generation or it does not happen. Creators were in a flow when a prompt worked, and anything that pulled them out to fill a form meant the prompt went unsaved.

Verbatim (C18, LLM, ChatGPT) — coded: Capture friction: "When it finally works I want to keep going, not stop and document it, so I tell myself I'll save it later and never do."

Dashboard — Why saved prompts stop working

Why saved prompts stop working
Scope: reproduction test (n=26) + survey (n=190)
Guiding question: Why can't creators reproduce their own saved prompts?

  Reproduced result from a plain-text save .... 31%
  Reproduced when model + params were saved ... ~78%
  Participants who lost a working prompt ...... 72%

Key Insight: Reproducibility lives in the conditions around the prompt, and
hand-saving drops exactly those conditions.

Define — A prompt had to be saved with the conditions that reproduce it, in one action, or it would not be saved at all

In this section: POV · How Might We · Principles · Insight→decision map

POV statement. A creator needs to save a working prompt together with the conditions that reproduce it (model, parameters, and a proof output) in a single action mid-generation, because the prompt text alone does not reproduce the result and saving it later does not happen.

How Might We

How might we capture a working prompt with its model, parameters, and output in one action, without pulling the creator out of their flow?
How might we make a saved prompt's example results trustworthy as evidence it still works?
How might we let a prompt grow into a family of variants without re-entering everything?

Design principles (each traceable to an insight)

Reproducibility is the unit. A saved prompt carries its model, version, and parameters alongside its text. (Insight 1, 2)
Capture in one action. Saving happens in the moment with conditions auto-attached; richer description is added later. (Insight 3)
Evidence is first-class. Example outputs sit beside the prompt as proof it produced a result, with a Stable or work-in-progress status. (Insight 2)
A prompt is a family. Variants of a prompt live together, so a creator keeps a base plus its variations.
Each format at its best. Images show as grids, music plays inline, and LLM prompts show their expected output as readable sheets.

Insight → decision map

Insight (from Empathize)	Concrete design decision
Plain text reproduced only 31% of results	Capture auto-attaches the platform, model and version, and parameters, so the saved prompt carries its conditions
Capture fails if it breaks the creative flow	A one-action capture grabs the prompt and the output just generated; description, tags, and status are added later
Creators iterate a prompt into many versions	A prompt holds variants as a family, with example results per variant, instead of separate scattered entries

Ideate & Craft — The prompt became a reproducible artifact: the spec, the proof, and the family of variants

In this section: Design execution · How capture works · Before → after · Other deliverables

Design execution

The library opens on a searchable grid of the creator's prompts (image and title), with tags and base filters by platform such as Midjourney, Suno, and ChatGPT, all user-creatable. Opening a prompt shows a three-column artifact:

Left — the reproducible spec. Title, short description, target model, and the base prompt to copy or edit, then a longer description and uses, a Stable or work-in-progress status, the model and version, parameters such as aspect ratio, stylize, and chaos where they apply, and tags.
Center — the proof. Example results produced with the prompt: images as a single column or 2×2 and 3×3 grids, music as inline players for each sample, and LLM prompts as readable output sheets stacked like pages, each showing what the prompt should generate.
Right — the family and the view. A control for how the examples display, and below it the variants: the same prompt with variations and their own results, so a creator keeps a base prompt and its family together.

One-action capture is the spine: when a result lands, a single capture saves the prompt with its platform, model, parameters, and that output attached, and the creator enriches it later. Prompts copy and export in one tap.

How capture works (the mechanism, made explicit)

One-action capture runs through a browser extension with per-platform hooks: on a supported surface (Midjourney, Suno, ChatGPT) it reads the model, version, and parameters the platform exposes and attaches them with the output the creator just got. Where a platform exposes nothing machine-readable, capture falls back to paste-and-parse, pulling parameters out of the prompt string itself. The honest cost of this design is that each new platform is an integration — coverage grows one surface at a time — which is why the capture targets the conditions that are knowable and lets the creator enrich the rest later.

Before → after

	Before (notes / docs / screenshots)	After (Prompt IDE)
What gets saved	The prompt text	The prompt plus model, parameters, and a proof output
Reproducing later	Guess the missing settings	Conditions travel with the prompt
Iterations	Scattered separate entries	A base prompt and its variants as one family
Trust it still works	Unknown	Example results and a Stable status

Other deliverables

Built in Figma with Dev Mode handoff: the three-column artifact layout, the per-format result views (image grid, inline music player, LLM output sheets), the one-action capture flow, and the variant-family model.

Dashboard — One-action capture fills the library (A/B: rejected full-metadata vs shipped one-action)

One-action capture fills the library
Scope: Last 30 days · pilot · full-metadata variant (A) vs one-action (B)
Guiding question: Which capture model produces a library worth reusing?

  Prompts captured per active session ...... 0.4 (A) → 2.6 (B)
  Median library size at 2 weeks ........... 3 (A) → 21 (B)
  Reproduction success on reuse ............ 84% (shipped one-action)

Key Insight: Attaching conditions at capture, in one action, filled the
library; requiring the full form up front left it near-empty.

Prototype / Test — Requiring full metadata at save made each prompt reproducible and kept the library empty; one-action capture filled it

In this section: The experiment · What it taught

Because reproducibility needs the conditions, the first build asked creators to fill the model, version, all parameters, a description, and example results when saving a prompt. It produced complete, reproducible entries, so it was A/B tested against one-action capture in Statsig across the pilot.

The failed variant. The full-metadata form made each saved prompt 95% complete, but creators would not stop mid-generation to fill it, so they captured a median of 0.4 prompts per session and the library stayed near-empty. A library that is not captured into has nothing to reproduce, so completeness on the few entries did not matter.

Complete entries, empty library
Scope: Statsig A/B · pilot · ~32 creators / arm
Guiding question: Which capture model produces a library worth reusing?

  Variant A — Full metadata required at save
    Per-prompt completeness ......... 95%
    Prompts captured per session .... 0.4
    Median library size at 2 weeks .. 3

  Variant B — One-action capture, enrich later
    Per-prompt completeness (initial) 78%
    Prompts captured per session .... 2.6
    Median library size at 2 weeks .. 21

Read with care: ~32 creators per arm. The capture-rate and library-size gaps
are large and consistent; the 30-day retention figure (in Outcomes) is
directional.

Key Insight: Auto-attaching the conditions and letting enrichment wait got
creators to capture, so the library filled and stayed reproducible.

What it taught. A capture tool lives on the friction of the moment of capture; demanding completeness up front guarantees an empty library, while attaching the knowable conditions automatically keeps reproducibility without the cost. The one-action model shipped, with enrichment available afterward.

Outcomes & reflections

In this section: Causal chain · Limitations · Competitive context · Reflections

Causal chain (pilot, 65 creators)

Two comparisons run here, and the table keeps them separate. The thesis is real-world: prompts saved as plain text reproduced 31% of the time, and the tool's captured-with-conditions artifacts reproduced 84% on reuse. The experiment is the A/B: one-action capture (the shipped Variant B) against the full-metadata form (the rejected Variant A). One-action capture attached the model, parameters, and output in a single step, so prompts captured per session rose 0.4 → 2.6 and the library filled (3 → 21 entries at two weeks); because the conditions traveled with each prompt, creators reused what they saved (prompts reused per creator per month 1.1 → 4.7) and trusted the Stable status, lifting day-30 retention 38% → 57% for the one-action cohort.

Metric	Baseline	Result	What it compares
Reproduction success on reuse	31% (plain text)	84% (tool)	real-world vs shipped tool
Prompts captured per active session	0.4 (full-metadata A)	2.6 (one-action B)	rejected variant vs shipped
Median library size at 2 weeks	3 (A)	21 (B)	rejected variant vs shipped
Prompts reused per creator / month	1.1 (A cohort)	4.7 (B cohort)	rejected variant vs shipped
Day-30 retention (directional)	38% (A cohort)	57% (B cohort)	rejected variant vs shipped

Reading note: only the reproduction row is a tool-vs-real-world before/after; the rest are the shipped one-action design against the rejected full-metadata variant. The win is that the one-action design filled a reproducible library where the thorough-but-heavy one stayed empty.

Scale note: a reusable prompt compounds — a creator who reproduces a working result builds on it instead of rediscovering it, so the value of the library grows with every capture rather than resetting each session.

Limitations (stated, because a portfolio claim is only as strong as what it concedes)

Small pilot. 65 creators, ~32 per A/B arm. The capture and library effects are large and consistent; the 30-day retention lift is directional.
Self-reported and short-horizon. Reproduction was judged in a structured test, not at scale, and retention is measured over the pilot window only.
Capture coverage is per-platform. Auto-attaching conditions depends on a hook for each surface; an unsupported platform drops to paste-and-parse, which only recovers what is in the prompt string.
Stored outputs carry their own terms. Example images and audio come from platforms with their own usage terms, so the library treats them as the creator's evidence, not redistributable assets.

Competitive context

Prompt management is a populated category in 2026 — PromptLayer, LangSmith, Langfuse, Braintrust, Helicone, Vellum — but those are developer and production-LLM platforms: prompt versioning, evaluation, tracing, and deployment for engineering teams shipping LLM apps. Prompt IDE sits in a different place: the individual creator, working across image, music, and LLM, who needs a working prompt captured as a reproducible artifact with proof, not a versioned string in a production registry. The wedge is reproducibility-as-artifact plus one-action capture for creators, where the dev tools optimize evaluation and deployment for teams. The two do not compete so much as serve different users — and naming that is what keeps the case from looking like it ignored a crowded field.

Reflections (transferable principles)

The reusable unit in generative work is the prompt plus the conditions that reproduce it; saving the text alone preserves the wording and loses the result.
A capture tool is won or lost at the moment of capture: attach the knowable context automatically and let enrichment wait, because anything that interrupts the creative flow leaves the library empty. (The A/B is the evidence: the complete-but-heavy form produced an empty library.)
On the question of whether prompt authoring is becoming obsolete: as long as a specific phrasing plus parameters reproduces a look, a sound, or a structured output, the reproducible prompt is an asset worth keeping — auto-optimization changes how prompts are written, not whether a working result is worth being able to reproduce.
Treating example outputs as first-class evidence, with a stability status, turns a stored prompt into something a creator can trust to reuse, which is what makes a library worth returning to.