
Proposals with AI

Evidence-led proposal generation for freelancers on Upwork and Freelancer.com.


Project overview

  • Type: Product · 10-week 0→1
  • Project type: 0→1 · Generative AI · Activation & Win-Rate · Marketplace
  • Role: Lead Product Designer · Generative-AI UX · UX Research · Experiment Design
  • Methods: JTBD interviews · corpus analysis · Wizard-of-Oz prototype · A/B
  • Tools: Otter.ai · Figma + Dev Mode · Cursor · Claude Code · Amplitude · Statsig
  • Case thesis: Designing a proposal generator that retrieves a freelancer's relevant past work and cites it against a specific job brief, in minutes, so the speed of AI raises win rate instead of flooding the platform with generic text.

The context

Remote work pushed millions onto freelance marketplaces, and the same wave put generative AI in every applicant's hands. Mass-produced proposals are now trivial to send, which floods job posts and pushes platforms like Upwork to penalize low-effort, templated submissions. A freelancer competing for a post is up against dozens of applicants and a client skimming for a reason to reply within seconds.

The problem

Freelancers spend roughly 25% of their working hours writing proposals (attitudinal, survey self-report), yet the typical reply rate on a sent proposal sits around 6% (behavioral, corpus + public benchmarks ~5–10%). The obvious AI shortcut makes this worse: generating a fluent proposal from the job brief alone produces text a client recognizes as mass-applied, so volume rises while signal drops.

The goal

Raise replies earned per hour spent on proposals, by cutting drafting time and lifting the quality signal that earns a client reply — measured as reply rate and replies-per-hour, not as proposals generated.


Empathize — Proposals that cited a specific relevant past project replied at nearly 3× the rate of generic ones

In this section: Research foundation · Key insights

Research foundation (method)

  • Phase 1 — Freelancer interviews (n=50, ~40 min, recruited via dscout, transcribed in Otter.ai): active freelancers across design, writing, and development, on how they decide what to write and when they give up.
  • Phase 2 — Proposal corpus analysis (n=540 real proposals, anonymized, contributed by 38 of the interviewees): each proposal coded against a written codebook for structure and for whether it cited a specific named past project relevant to the brief, then matched to the outcome the freelancer reported. (Outcomes here are self-reported; the behavioral check came later in the instrumented pilot.)
  • Phase 3 — Survey: 1,220 panelists invited via dscout; 200 analyzable responses (16.4% response rate); 137 completed every item (11.2% completion rate). Attitudinal percentages below are computed on the 200 analyzable responses; items placed late in the instrument are noted where their base is the 137 completers. Question types (select-all vs single-select) are labeled per question in the appendix.
  • Phase 4 — Prototype pilot (Amplitude-instrumented, web, 22 freelancers, Apr–May 2025): funnel data from brief intake through submission and client reply.

Key insights

1. The time sink is repetitive rewriting, and most of it is low-value. Freelancers rebuilt the same introduction, portfolio summary, and availability lines for every post, spending a median 14 minutes per custom proposal. By their own estimate, proposal writing consumed about a quarter of their working hours.

  • Verbatim (P12, UX writer, 4 yrs freelance) — coded: Repetitive effort: "I'm rewriting the same three paragraphs for the hundredth time, just changing two sentences for the actual job."

2. Specific, relevant proof is what earns the reply. In the 540-proposal corpus, proposals that cited a specific named past project relevant to the brief reported a reply rate of 11%, against 4% for proposals that opened with generic capability claims. The signal a client responds to is demonstrated, matched experience.

  • Verbatim (P7, Shopify developer, 6 yrs freelance) — coded: Specific proof: "The replies I get are always when I name the exact thing I've already built for someone else."
  • Raw example: winning proposals opened with lines like "I built the same Shopify-to-QuickBooks sync you're describing for [client] last year," while losing ones opened with "I am a detail-oriented professional with 5+ years of experience."

3. Freelancers want AI help but fear it will sound generic enough to get them flagged. The pull toward AI assistance and the fear of generic output point the same way: they will use AI only if it preserves their voice and specificity. Triangulation with the corpus:

  • Attitudinal (n=200): 70% of respondents stressed personalization as essential; 58% said they worried AI text would read as spam to clients.
  • Verbatim (P31, brand designer, 3 yrs freelance) — coded: AI distrust: "If it sounds like a robot wrote it, I've already lost — and now I'm scared the platform thinks I'm spam."
  • Behavioral: the corpus already showed generic openings earning replies at less than half the rate of specific ones (4% vs 11%), so their fear matched the data.

Dashboard — What separates a reply from silence

Scope: 540 proposals · contributed by 38 freelancers · self-reported outcomes
Guiding question: Which proposals get a client reply?

  Reply rate by proposal opening
    Cites a specific relevant past project ...... 11%
    Generic capability claim .................... 4%
    Overall baseline ............................ 6%

Key Insight: The reply gap tracks specificity of proof, so a generator
that only produces fluent text would push freelancers toward the 4% pattern.

Define — The generator had to retrieve the freelancer's relevant proof and cite it in their own voice

In this section: POV · How Might We · Principles · Insight→decision map

POV statement. Freelancers need to send a proposal that cites their specific relevant experience against this specific brief, in minutes, because reply rate depends on demonstrated proof and marketplaces penalize generic text.

How Might We

  1. How might we let a freelancer produce an evidence-led proposal in under three minutes?
  2. How might we ensure every generated draft cites at least one concrete, relevant past project from that freelancer's own history?
  3. How might we keep the freelancer's voice so the draft reads as theirs and survives client and platform scrutiny?

Design principles (each traceable to an insight)

  • Evidence first. A draft cannot be sent until it cites at least one specific, relevant past project. (Insight 2)
  • Voice preserved. Generation pulls from the freelancer's own phrasing and outcomes, and edits are one tap away. (Insight 3)
  • Measure win rate per hour. Speed is in service of reply rate, and the funnel reports both. (the goal)
  • Transparency. The freelancer always sees which lines were drafted and which were their own. (Insight 3)
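
As a minimal sketch of how the Evidence first and Transparency principles could be enforced together, assume a draft is stored as a list of lines that each record who wrote them and which evidence card, if any, they cite. The names `DraftLine`, `can_send`, and `ai_drafted_lines`, and the card ID format, are hypothetical and are not taken from the shipped product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DraftLine:
    text: str
    ai_drafted: bool                      # True if the generator wrote this line
    cited_card_id: Optional[str] = None   # hypothetical ID of a cited evidence card


def can_send(draft: List[DraftLine]) -> bool:
    """Evidence first: allow sending only if at least one line cites a card."""
    return any(line.cited_card_id is not None for line in draft)


def ai_drafted_lines(draft: List[DraftLine]) -> List[str]:
    """Transparency: the lines the UI would mark as AI-drafted."""
    return [line.text for line in draft if line.ai_drafted]


# Illustrative drafts only.
generic = [DraftLine("I am a detail-oriented professional.", ai_drafted=True)]
cited = generic + [
    DraftLine("I built a similar sync for a retail client.", ai_drafted=True,
              cited_card_id="card-001"),
    DraftLine("I can start on Monday.", ai_drafted=False),
]

print(can_send(generic))  # False
print(can_send(cited))    # True
```

Keeping the gate as a pure function of the draft means the same check could run in the editor and again at submission, although the source does not say where the shipped gate runs.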

Insight → decision map

Insight (from Empathize) | Concrete design decision
25% of hours lost to rewriting the same boilerplate | The profile holds reusable intro, availability, and portfolio blocks the generator assembles, so the freelancer never rewrites them
Specific cited proof replies at 11% vs 4% generic | The generator retrieves relevant past projects from the profile and requires at least one citation before a draft can be sent
70% demand personalization; 58% fear AI-as-spam | Drafts use the freelancer's own outcome phrasing, and an inline marker shows which sentences are AI-drafted versus theirs

Ideate & Craft — The profile became a structured evidence library the generator could pull from

In this section: Concept validation (Wizard-of-Oz) · Design execution · Before → after · Other deliverables

Concept validation before building retrieval (Wizard-of-Oz)

Before engineering any retrieval, I tested whether evidence-led, cited drafts would actually read as authentic to the freelancers themselves. In a Wizard-of-Oz round with 8 freelancers, a researcher manually matched each pasted brief to projects from that freelancer's own history and hand-wrote a cited opening, presented in the prototype as if the system had generated it. Each freelancer saw two openings for the same brief — a cited one and a generic one — without being told which was which.

What it surfaced: freelancers preferred the cited opening in 7 of 8 sessions and repeatedly tagged the generic one as "not how I'd pitch," confirming two things the build depended on — that a matched-citation opening reads as theirs, and that the matching logic (brief requirement → relevant past project) was tractable enough to automate. This de-risked the engineering: I was building retrieval to reproduce a result I had already seen work by hand.

Design execution

The product turns the freelancer's history into structured, retrievable evidence:

  • Evidence library — onboarding captures past projects as cards with client type, deliverable, tools, and a measurable outcome, plus reusable intro and availability blocks.
  • Brief intake — the freelancer pastes the job post; the system extracts requirements and matches them to the most relevant evidence cards.
  • Draft with citations — the generator writes an opening built on a specific matched project, then fills supporting paragraphs, marking AI-drafted lines so the freelancer can revise in their own words.
  • Proposal angles — one tap reframes the same evidence as "budget-friendly," "fast turnaround," or "premium quality" to fit the client's stated priority.
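
The evidence library and brief intake steps above can be sketched as a small data structure plus a ranking function. The source does not describe how matching works, so the token-overlap score here is an assumption chosen for brevity, and `EvidenceCard`, `tokens`, `score`, and `match_cards` are hypothetical names. A production system would more plausibly use requirement extraction and semantic retrieval.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvidenceCard:
    # Fields mirror the card described above; values below are illustrative.
    client_type: str
    deliverable: str
    tools: List[str]
    outcome: str


def tokens(text: str) -> Set[str]:
    """Lowercase word tokens; a stand-in for real requirement extraction."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(card: EvidenceCard, brief: str) -> int:
    """Count brief tokens that also appear in the card's deliverable or tools."""
    card_text = " ".join([card.deliverable, *card.tools])
    return len(tokens(card_text) & tokens(brief))


def match_cards(cards: List[EvidenceCard], brief: str, top_k: int = 1) -> List[EvidenceCard]:
    """Return up to top_k cards with a non-zero overlap, best first."""
    ranked = sorted(cards, key=lambda c: score(c, brief), reverse=True)
    return [c for c in ranked[:top_k] if score(c, brief) > 0]


library = [
    EvidenceCard("e-commerce retailer", "Shopify to QuickBooks sync",
                 ["Shopify", "QuickBooks"], "outcome recorded at onboarding"),
    EvidenceCard("SaaS startup", "Brand identity refresh",
                 ["Figma"], "outcome recorded at onboarding"),
]
brief = "Need a developer to sync Shopify orders into QuickBooks."

best = match_cards(library, brief)
print(best[0].deliverable)  # Shopify to QuickBooks sync
```

Returning an empty list when nothing overlaps matters for the design: it is the case where the citation-required send gate would block the draft instead of letting the generator fall back to a generic opening.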

Before → after

Aspect | Before (manual / generic-AI) | After (Proposals with AI)
Where evidence comes from | Recalled and retyped each time | Retrieved from the freelancer's evidence library
Opening line | Generic capability claim | A specific matched past project, cited
Time per proposal | ~14 min | ~4 min
Voice | All-or-nothing | AI-drafted lines marked for the freelancer to revise

Other deliverables

Built in Figma with Dev Mode handoff: the evidence-card component set, the brief-to-evidence matching state, the citation-required send gate, and the AI-versus-yours line marking.

Dashboard — Drafting time collapses while the citation gate holds

Scope: Last 30 days · web · Proposals with AI pilot (22 freelancers)
Guiding question: Did faster drafting still produce evidence-led proposals?

  Median time per proposal ........ 14 min → 4 min   (−71%)
  Proposals citing a specific project
    Before (freelancer's own) ..... 31%
    After (generator + send gate) . 94%

Key Insight: The send gate raised cited-proof proposals to 94% while
drafting time fell, so speed and the reply-driving signal moved together.

Prototype / Test — Brief-only generation was fast and fluent, and its reply rate stayed flat; evidence-matched drafts moved it

In this section: The experiment · What it taught

The first build generated a full proposal from the job brief alone, with no retrieval from the freelancer's history. It was the fastest path to a finished draft, so it was A/B tested against the evidence-matched generator in Statsig across the pilot.

The failed variant. Brief-only drafts produced a finished proposal in a median of 3 minutes, the fastest of any variant, and freelancers rated them the most fluent. Their reply rate came in at 5%, flat against the 6% manual baseline, and two pilot accounts had proposals hidden by Upwork's low-quality filter. Fluent, brief-derived text reproduced the generic-opening pattern the corpus already linked to a 4% reply rate.

Brief-only is fastest and does not move replies
Scope: Statsig A/B · web · Proposals with AI pilot · 2 variants
Base: ~280 proposals across 22 freelancers (~11 freelancers / arm)
Guiding question: Which variant produces proposals that earn client replies?

  Variant A — Brief-only generation
    Median draft time .............. 3 min
    Reply rate ..................... 5%   (vs 6% manual baseline)
    Flagged by platform filter ..... 2 accounts

  Variant B — Evidence-matched generation
    Median draft time .............. 4 min
    Reply rate ..................... 12%
    Flagged by platform filter ..... 0

Read with care: with ~11 freelancers per arm and replies clustered within
freelancer, this is a DIRECTIONAL result, not a statistically powered one.
The decision to ship Variant B rested on (a) the consistent direction of the
effect, (b) its agreement with the 11% the corpus had already predicted for
cited proof, and (c) the asymmetric downside of Variant A — platform flags
that put a freelancer's account at risk.
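
One way to see why clustering makes this result directional is a cluster bootstrap, which resamples whole freelancers instead of individual proposals. The sketch below is an illustration only: the per-freelancer counts are invented and are not the pilot data, and `reply_rate` and `cluster_bootstrap_ci` are hypothetical helper names.

```python
import random
from typing import List, Tuple

# (proposals sent, replies received) per freelancer.
# ILLUSTRATIVE values only; these are not the pilot's data.
arm: List[Tuple[int, int]] = [
    (12, 1), (15, 3), (10, 0), (14, 2), (13, 1), (11, 4),
    (16, 1), (12, 2), (13, 0), (12, 2), (12, 1),
]


def reply_rate(clusters: List[Tuple[int, int]]) -> float:
    """Pooled reply rate: total replies divided by total proposals sent."""
    sent = sum(s for s, _ in clusters)
    replies = sum(r for _, r in clusters)
    return replies / sent


def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval from resampling whole freelancers with replacement."""
    rng = random.Random(seed)
    rates = sorted(
        reply_rate([rng.choice(clusters) for _ in clusters])
        for _ in range(n_boot)
    )
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


low, high = cluster_bootstrap_ci(arm)
print(f"point estimate {reply_rate(arm):.3f}, 95% interval {low:.3f} to {high:.3f}")
```

With only about eleven clusters per arm, an interval built this way tends to be wide, which is the practical meaning of "directional" here.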

Reconciling the numbers. The self-reported corpus predicted ~11% for cited-proof proposals; the instrumented pilot measured 12% for the evidence-matched variant. The behavioral result landing on top of the earlier self-report is what gives the finding its weight — the prediction and the measurement agree across two independent methods.

What it taught. In a generative product, fluency is cheap and easy to mistake for quality; the metric that mattered was whether the output earned a client reply, and only retrieved, specific evidence did. The evidence-matched generator shipped.


Outcomes & reflections

In this section: Causal chain · Limitations · Reflections

Causal chain (pilot, 22 freelancers, web, Apr–May 2025)

Drafting time fell (median 14 min → 4 min, −71%), and because each draft led with cited proof, reply rate per proposal roughly doubled (6% → 12%) — matching both the corpus prediction and the A/B result. Replies convert downstream at a stable rate, so interviews per 100 proposals rose 4 → 8 and hires per 100 proposals rose ~2 → ~4 (hire counts are small-n extrapolations; treat as directional).

Normalizing to effort, replies earned per hour of proposal work rose roughly 7×, the product of a 3.5× speed gain (14→4 min) and a 2× reply-rate gain (6%→12%). I report this decomposition rather than a single headline multiple on purpose: the speed term only creates value when it is paired with the reply-rate term. The brief-only variant showed the opposite case: speed alone, with a flat reply rate, produced no gain in replies per proposal and drew platform flags. (Assumption: replies-per-hour normalizes outcomes to a fixed hour of proposal effort; the multiple holds whether a freelancer reinvests freed time into more proposals or sends the same number at higher quality.)
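
The decomposition can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures reported above; `replies_per_hour` is a hypothetical helper, and it assumes every minute of proposal work is drafting time.

```python
def replies_per_hour(minutes_per_proposal: float, reply_rate: float) -> float:
    """Replies earned per hour spent drafting proposals."""
    proposals_per_hour = 60 / minutes_per_proposal
    return proposals_per_hour * reply_rate


before = replies_per_hour(14, 0.06)  # manual baseline
after = replies_per_hour(4, 0.12)    # evidence-matched generator

speed_gain = 14 / 4        # 3.5x faster drafting
signal_gain = 0.12 / 0.06  # 2x reply rate

print(round(before, 2), round(after, 2))   # 0.26 1.8
print(round(after / before, 1))            # 7.0
print(round(speed_gain * signal_gain, 1))  # 7.0
```

The ratio of the two rates and the product of the two gains agree because replies per hour is itself a product of proposals per hour and reply rate.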

Among the 22 pilot freelancers, Week-4 retention was 18 of 22, and the median user reported submitting proposals to better-fit posts because screening took less effort.

Metric | Before | After | Δ
Median drafting time per proposal | 14 min | 4 min | −71%
Reply rate per proposal | 6% | 12% | +6 pts (~2×)
Interviews per 100 proposals | 4 | 8 | +4 (~2×)
Hires per 100 proposals (small-n) | ~2 | ~4 | ~2×
Replies earned per hour on proposals | baseline | ~7× | speed (3.5×) × signal (2×)

Funnel note: reply → interview converts at ~67% and interview → hire at ~50% in both periods, so the gains come from the top of the funnel (more replies) and better-fit targeting, not from inflated conversion rates.
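
The per-100 figures follow from applying the two stable conversion rates to each reply rate. A short sketch, assuming the approximate 67% and 50% rates quoted in the funnel note; `funnel_per_100` is a hypothetical helper name.

```python
def funnel_per_100(reply_rate: float,
                   reply_to_interview: float = 0.67,
                   interview_to_hire: float = 0.50):
    """Expected replies, interviews, and hires per 100 proposals."""
    replies = 100 * reply_rate
    interviews = replies * reply_to_interview
    hires = interviews * interview_to_hire
    return replies, interviews, hires


for label, rate in [("before", 0.06), ("after", 0.12)]:
    replies, interviews, hires = funnel_per_100(rate)
    print(label, round(replies), round(interviews), round(hires))
# before 6 4 2
# after 12 8 4
```

Because the downstream rates are held constant in both periods, doubling the reply rate doubles every later stage, which is why the table shows ~2× throughout.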

Scale note: at a freelancer's volume of dozens of proposals a month, moving reply rate from 6% to 12% while cutting the hours spent is the difference between a thin pipeline and a full one.

Limitations (stated, because a portfolio claim is only as strong as what it concedes)

  • Underpowered pilot. n=22, with the A/B at ~11 freelancers per arm and replies clustered within freelancer. Results are directional. The ship decision rested on direction + cross-method agreement + asymmetric platform-flag risk, not on statistical significance.
  • Self-report in two phases. Corpus outcomes and survey attitudes are self-reported. The instrumented pilot funnel is the behavioral check, and it agreed with the self-reported corpus — but the corpus on its own would not carry the claim.
  • Cold start. The product's value depends on a populated evidence library, so a brand-new freelancer with no history gets the least benefit — exactly the segment the 6% baseline hurts most. Mitigation explored in onboarding: seeding evidence from non-marketplace work (employment, coursework, personal projects). This remains the primary open problem for the next phase.
  • Confidentiality. Evidence cards can draw on client work under NDA, so onboarding captures client type rather than client name and strips identifying detail. This is partially solved and needs hardening before scale.

Reflections (transferable principles)

  • In a generative product, fluency is the cheapest output and a poor proxy for value; the design has to optimize the downstream outcome the user wants, which here was a client reply earned by specific proof.
  • Personalization in an AI tool is a retrieval problem before it is a writing problem: the system can only cite the freelancer's relevant proof if the product first captured that proof as structured, matchable evidence.
  • When a tool writes into a marketplace with its own quality enforcement, the freelancer's win rate and the platform's anti-spam signals push in the same direction, so designing for genuine specificity satisfies both at once.
  • A small pilot is worth more when its limits are stated than when they are hidden: the directional result held up precisely because it agreed with an independent, earlier prediction from the corpus.