simmer

v3.0.0

Iterative artifact refinement with investigation-first judge board - constructs problem-specific judges that read the code, understand the problem, and propose evidence-based improvements

Install Command /plugin install 2389-research/simmer

Simmer

You wrote a prompt. It works. But is it good? Simmer runs your artifact through multiple rounds of criteria-driven refinement: each round, a panel of judges reads your code, understands the problem, and proposes specific improvements.

Read the story behind Simmer →

Iterative artifact refinement: take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.

Installation

/plugin marketplace add 2389-research/claude-plugins
/plugin install simmer@2389-research

What This Plugin Provides

One skill (simmer) with four subskills that run the refinement loop:

  • Setup: identify the artifact (file or workspace), elicit 2-3 quality criteria, determine the evaluation method
  • Judge: score the candidate 1-10 per criterion and produce the ASI (Actionable Side Information, the single highest-leverage fix)
  • Generator: produce an improved version guided by the judge's ASI
  • Reflect: record the trajectory and track the best candidate across iterations

Trigger Phrases

Say any of these to start a simmer loop:

"simmer this", "refine this", "hone this", "iterate on this", "make this better", "improve this over a few rounds", "polish this", "tighten this up", "optimize this pipeline", "find the best model for this task", "tune this configuration"

Any request to iteratively improve an artifact triggers simmer.

When to Use Simmer vs. Cookoff vs. Omakase-Off

| Situation | Use | Why |
|-----------|-----|-----|
| You have an artifact and want it better | simmer | Iterative refinement against criteria |
| You want parallel code implementations competing | cookoff | Parallel generation + fixed criteria + pick winner |
| You don't know what you want yet | omakase-off | Parallel designs + react + discover preference |

Rule of thumb: simmer = serial iteration; cookoff and omakase-off = parallel exploration.

Quick Example: Single File

User: "Simmer this" [pastes a pitch email]

Claude: I'm using the simmer skill to set up iterative refinement.

What does "better" mean for this? I'd suggest:
1. Value prop clarity
2. Tone match
3. Call to action strength

User: "Those work"

[Iteration 0: Judge scores seed → 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]

Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0    | 4         | 5    | 3   | 4.0       | seed       |
| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |
| 2    | 7         | 6    | 6   | 6.3       | low-friction CTA |
| 3    | 7         | 7    | 8   | 7.3       | peer-sharing tone |

Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
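The composite column in trajectory tables like the one above can be reproduced as a plain mean of the per-criterion scores. This is an assumption consistent with the numbers shown, not a statement about simmer's internals:

```python
def composite(scores):
    """Average per-criterion 1-10 scores into a single composite,
    rounded to one decimal place (assumed aggregation rule)."""
    return round(sum(scores.values()) / len(scores), 1)

seed = {"value_prop": 4, "tone": 5, "cta": 3}
final = {"value_prop": 7, "tone": 7, "cta": 8}
print(composite(seed))   # 4.0
print(composite(final))  # 7.3
```

Any weighted average would also fit the workflow; the mean is just the simplest rule that matches the example trajectory.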

Quick Example: Workspace with Evaluator

User: "Simmer this pipeline: find the best setup for this task"

Claude inspects the workspace, finds evaluate.sh and config.json,
and proposes criteria + evaluation method:

  Evaluator: ./evaluate.sh
  Criteria: accuracy, cost efficiency, latency
  Search space: models, prompt text, pipeline topology

User: "Looks good, accuracy is the priority. 5 iterations."

[Iteration 0: Run evaluator on seed, judge scores → 3.7/10]
[Iteration 1: Generator swaps to cheaper model → 5.3/10]
[Iteration 2: Generator splits into 2-step chain → 7.0/10]
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...

Best candidate: iteration 4 (8.1/10)

Works On Anything

| Artifact type | Suggested criteria |
|---------------|--------------------|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |

Evaluation Modes

| Mode | When to use |
|------|-------------|
| Judge-only (default) | Text artifacts: judge scores against criteria |
| Runnable | Code/pipelines: judge interprets script output |
| Hybrid | Both: run script AND judge results against criteria |

There is no format contract on evaluator output. The judge reads whatever your script produces: test results, metrics, error logs, anything.
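As a sketch of what a runnable evaluator might look like (hypothetical script, pipeline, and metric names; nothing here is part of the plugin), an evaluate.py can simply print whatever the judge should read:

```python
def run_pipeline(config):
    """Stand-in for the real pipeline under refinement.
    The returned metrics are illustrative placeholders."""
    return {"accuracy": 0.82, "latency_ms": 140, "cost_usd": 0.003}

metrics = run_pipeline({"model": "small", "steps": 1})

# Free-form output: no schema required, the judge interprets this text directly.
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Test output, error logs, or a single number would work equally well; the point is that the judge, not a parser, consumes the output.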

Judge Board

Simmer auto-selects between a single judge and a multi-judge board based on complexity:

  • Simple (short email, tweet, ≤2 criteria) → single judge, fast
  • Complex (3 criteria, long artifact, code, pipelines) → judge board with deliberation

The board constructs three judges tailored to your specific problem: not from a fixed menu, but by reading your artifact, criteria, and constraints and designing judges with diverse perspectives. An extraction prompt gets different judges than a D&D adventure hook.

Judges investigate before scoring: they read the evaluator script, ground truth, prior candidates, and config files to understand the problem deeply. A judge who reads the evaluator discovers scoring mechanics on iteration 0 instead of learning them through 3 iterations of trial and error.

If a single-judge run hits a plateau (3 iterations without improvement), simmer offers to upgrade to the board mid-run with 2 extra iterations.

Defaults and Safety

Default iteration count: 3 rounds per batch. After each batch, simmer asks whether to continue. You can request a specific count ("simmer this for 10 rounds") or stop early at any prompt.

Regression safety: The reflect subskill tracks the best candidate seen so far. If a new iteration scores lower than the current best, the best-so-far is preserved; the loop never loses progress. At the end, result.md always contains the highest-scoring candidate, not just the latest one.
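The regression-safety behavior described above amounts to a simple best-so-far fold over the trajectory. A minimal sketch, assuming the trajectory is a list of (iteration, composite score) pairs (this is illustrative logic, not the plugin's actual implementation):

```python
def track_best(iterations):
    """Return the (iteration, score) pair with the highest score.
    A later regression never displaces an earlier best."""
    best = iterations[0]
    for entry in iterations[1:]:
        if entry[1] > best[1]:
            best = entry
    return best

history = [(0, 4.0), (1, 5.3), (2, 6.3), (3, 5.9)]  # iteration 3 regressed
print(track_best(history))  # (2, 6.3)
```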

Advanced Features

| Feature | When you need it |
|---------|------------------|
| Workspace targets | Refining a multi-file directory: iterations are tracked as git commits so you can diff any two rounds |
| Runnable evaluators | Your artifact has a test script: point simmer at it (python evaluate.py) and the judge interprets the output |
| Background constraints | The generator needs to know what's available (models, budget, latency targets) to make realistic choices |
| Output contracts | Valid output has a defined shape (e.g., JSON schema): violations score 1/10, forcing format fixes first |
| Validation commands | A cheap pre-check (./validate.sh) catches broken pipelines in seconds, before the full evaluator runs |
| Search space tracking | Explicit bounds on what to explore: reflect tracks tried vs. untried regions so the judge steers toward gaps |

See the v2 design spec for full details on each feature.
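An output contract can be as small as a required-keys check. A hedged sketch (the 1/10-on-violation convention follows the table above; the function name and required keys are hypothetical):

```python
import json

REQUIRED_KEYS = {"summary", "entities"}  # hypothetical contract for a JSON artifact

def contract_score(raw_output):
    """Return 1 (floor score) on a contract violation,
    or None to defer to normal criterion-based judging."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 1  # not valid JSON at all
    if not REQUIRED_KEYS.issubset(data):
        return 1  # parses, but missing required fields
    return None

print(contract_score("not json"))                          # 1
print(contract_score('{"summary": "ok", "entities": []}'))  # None
```

Because a violation floors the score, the generator is forced to fix format problems before chasing gains on any other criterion.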

Output Directory Structure

Single-file mode (default output dir: docs/simmer):

docs/simmer/
  iteration-0-candidate.md     # Seed (original artifact)
  iteration-1-candidate.md     # Each improved candidate
  iteration-2-candidate.md
  iteration-3-candidate.md
  trajectory.md                # Running score table
  result.md                    # Final best candidate (highest score, not necessarily latest)

Workspace mode:

./pipeline/                    # Target directory (modified in place)
  [project files]              # Tracked via git commits per iteration

docs/simmer/                   # Tracking files (separate from workspace)
  trajectory.md                # Running score table

Workspace iterations are tracked as git commits rather than separate files.

How It Works

  • Focused improvement: each iteration targets one direction (the ASI), not everything at once. Compound gains beat scattered edits.
  • Context isolation: the generator doesn't see scores, and the judge doesn't see previous scores. Each role gets only the context it needs, which avoids bias.
  • The generator is the search strategy: in workspace mode, the generator decides what to change (swap a model, restructure a pipeline, tune a prompt). The ASI guides direction; the generator executes.

See the design spec for the full architecture.

Related Skills

Part of the test-kitchen family, but independently installable:

  • test-kitchen:omakase-off - parallel design exploration
  • test-kitchen:cookoff - parallel implementation competition
  • simmer - iterative refinement


---

If Simmer helped you ship something better than your first draft, a ⭐ helps us know it's landing.

Built by 2389 · Part of the Claude Code plugin marketplace
