Iterative artifact refinement with an investigation-first judge board: constructs problem-specific judges that read the code, understand the problem, and propose evidence-based improvements
/plugin install 2389-research/simmer
Full plugin documentation and usage guide
You wrote a prompt. It works. But is it good? Simmer runs your artifact through multiple rounds of criteria-driven refinement: each round, a panel of judges reads your code, understands the problem, and proposes specific improvements.
Read the story behind Simmer →
Iterative artifact refinement: take any artifact or workspace and hone it over multiple rounds using criteria-driven feedback.
/plugin marketplace add 2389-research/claude-plugins
/plugin install simmer@2389-research
One skill (simmer) with four subskills that run the refinement loop:
Say any of these to start a simmer loop:
"simmer this", "refine this", "hone this", "iterate on this", "make this better", "improve this over a few rounds", "polish this", "tighten this up", "optimize this pipeline", "find the best model for this task", "tune this configuration"
Any request to iteratively improve an artifact triggers simmer.
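Under the hood, each batch alternates the generator and judge subskills while reflect tracks the best candidate. A minimal sketch of that loop shape, in Python for illustration only (the function names, signatures, and 0-10 scale are assumptions, not the plugin's actual API):

```python
# Hypothetical sketch of a simmer-style refinement loop.
# `judge` and `generate` stand in for the judge/generator subskills.

def simmer(seed, criteria, judge, generate, rounds=3):
    best = (judge(seed, criteria), seed)     # iteration 0: score the seed
    history = [best]
    candidate = seed
    for _ in range(rounds):
        # Generator proposes an improved candidate, informed by history.
        candidate = generate(candidate, criteria, history)
        score = judge(candidate, criteria)
        history.append((score, candidate))
        if score > best[0]:                  # keep best-so-far; never regress
            best = (score, candidate)
    return best, history
```

The key property is the last comparison: a low-scoring round updates the history but never displaces the best candidate.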
| Situation | Use | Why |
|---|---|---|
| You have an artifact and want it better | simmer | Iterative refinement against criteria |
| You want parallel code implementations competing | cookoff | Parallel generation + fixed criteria + pick winner |
| You don't know what you want yet | omakase-off | Parallel designs + react + discover preference |
User: "Simmer this" [pastes a pitch email]
Claude: I'm using the simmer skill to set up iterative refinement.
What does "better" mean for this? I'd suggest:
1. Value prop clarity
2. Tone match
3. Call to action strength
User: "Those work"
[Iteration 0: Judge scores seed → 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]
Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | low-friction CTA |
| 3 | 7 | 7 | 8 | 7.3 | peer-sharing tone |
Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
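The composite column in the trajectory above is the unweighted mean of the criterion scores, shown to one decimal. A one-line sketch (equal weights are an assumption that happens to match the table; a run could weight criteria differently):

```python
# Composite score = mean of per-criterion scores, to one decimal.
def composite(scores):
    return round(sum(scores) / len(scores), 1)

composite([4, 5, 3])  # iteration 0 -> 4.0
composite([7, 7, 8])  # iteration 3 -> 7.3
```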
User: "Simmer this pipeline โ find the best setup for this task"
Claude inspects the workspace, finds evaluate.sh and config.json,
and proposes criteria + evaluation method:
Evaluator: ./evaluate.sh
Criteria: accuracy, cost efficiency, latency
Search space: models, prompt text, pipeline topology
User: "Looks good, coverage is the priority. 5 iterations."
[Iteration 0: Run evaluator on seed, judge scores → 3.7/10]
[Iteration 1: Generator swaps to cheaper model → 5.3/10]
[Iteration 2: Generator splits into 2-step chain → 7.0/10]
[Iteration 3: Generator adds few-shot examples → 7.7/10]
...
Best candidate: iteration 4 (8.1/10)
| Artifact type | Suggested criteria |
|---|---|
| Document / spec | clarity, completeness, actionability |
| Creative writing | narrative tension, specificity, voice consistency |
| Email / comms | value prop clarity, tone match, call to action strength |
| Prompt / instructions | instruction precision, output predictability, edge case coverage |
| API design | contract completeness, developer ergonomics, consistency |
| Pipeline / workflow | coverage, efficiency, noise |
| Configuration / infra | correctness, resource efficiency, maintainability |
| Mode | When to use |
|---|---|
| Judge-only (default) | Text artifacts: judge scores against criteria |
| Runnable | Code/pipelines: judge interprets script output |
| Hybrid | Both: run script AND judge results against criteria |
Simmer auto-selects between a single judge and a multi-judge board based on complexity:
The board constructs three judges tailored to your specific problem. Rather than picking from a fixed menu, it reads your artifact, criteria, and constraints and designs judges with diverse perspectives: an extraction prompt gets different judges than a D&D adventure hook.
Judges investigate before scoring: they read the evaluator script, ground truth, prior candidates, and config files to understand the problem deeply. A judge who reads the evaluator discovers scoring mechanics on iteration 0 instead of learning them through 3 iterations of trial and error.
If a single-judge run hits a plateau (3 iterations without improvement), simmer offers to upgrade to the board mid-run with 2 extra iterations.
Default iteration count: 3 rounds per batch. After each batch, simmer asks whether to continue. You can request a specific count ("simmer this for 10 rounds") or stop early at any prompt.
Regression safety: The reflect subskill tracks the best candidate seen so far. If a new iteration scores lower than the current best, the best-so-far is preserved; the loop never loses progress. At the end, result.md always contains the highest-scoring candidate, not just the latest one.
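The plateau check that triggers the mid-run board upgrade can be sketched as a small helper (hypothetical code, assuming a list of composite scores in iteration order and a window of 3):

```python
def plateaued(scores, window=3):
    """True if the last `window` iterations failed to beat the
    best score seen before them (i.e., no improvement in 3 rounds)."""
    if len(scores) <= window:
        return False            # not enough history to call a plateau
    best_before = max(scores[:-window])
    return max(scores[-window:]) <= best_before
```

On a trajectory like 4.0, 5.3, 5.2, 5.1, 5.0 this reports a plateau, since the last three rounds never beat the earlier 5.3.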
| Feature | When you need it |
|---|---|
| Workspace targets | Refining a multi-file directory: iterations tracked as git commits so you can diff any two rounds |
| Runnable evaluators | Your artifact has a test script: point simmer at it (python evaluate.py) and the judge interprets output |
| Background constraints | The generator needs to know what's available (models, budget, latency targets) to make realistic choices |
| Output contracts | Valid output has a defined shape (e.g., JSON schema): violations score 1/10, forcing format fixes first |
| Validation commands | A cheap pre-check (./validate.sh) catches broken pipelines in seconds before the full evaluator runs |
| Search space tracking | Explicit bounds on what to explore: reflect tracks tried vs. untried regions so the judge steers toward gaps |
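An output contract acts as a hard gate before normal judging. A sketch of the mechanism in Python (the required keys and the 1/10 floor are illustrative assumptions, not the plugin's implementation):

```python
import json

REQUIRED_KEYS = {"summary", "entities"}  # hypothetical contract

def contract_score(raw_output):
    """Return 1 (of 10) when the contract is violated, so format
    fixes happen first; return None when the output is well-formed
    and the judge should score it normally."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 1                 # not valid JSON at all
    if not REQUIRED_KEYS.issubset(data):
        return 1                 # missing required fields
    return None                  # contract holds; defer to the judge
```

Because a violation pins the score at the floor, the generator's cheapest path to a higher composite is always to fix the format before polishing content.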
Single-file mode (default output dir: docs/simmer):
```
docs/simmer/
  iteration-0-candidate.md   # Seed (original artifact)
  iteration-1-candidate.md   # Each improved candidate
  iteration-2-candidate.md
  iteration-3-candidate.md
  trajectory.md              # Running score table
  result.md                  # Final best candidate (highest score, not necessarily latest)
```
Workspace mode:
```
./pipeline/          # Target directory (modified in place)
  [project files]    # Tracked via git commits per iteration
docs/simmer/         # Tracking files (separate from workspace)
  trajectory.md      # Running score table
```
Workspace iterations are tracked as git commits rather than separate files.
See the design spec for the full architecture.
Part of the test-kitchen family, but independently installable:
- test-kitchen:omakase-off → parallel design exploration
- test-kitchen:cookoff → parallel implementation competition
- simmer → iterative refinement

---
If Simmer helped you ship something better than your first draft, a ⭐ helps us know it's landing.
Built by 2389 · Part of the Claude Code plugin marketplace
Get started in seconds
/plugin marketplace add 2389-research/claude-plugins
/plugin install 2389-research/simmer
Skills auto-trigger when relevant