Who's talking · 25 min · Prompt Night n24
Carlos Libardo
Shippit × Voltbras

Carlos Libardo.

AI engineer · founder

  • Voltbras · founder & CTO · 2018 LatAm's first EV charging platform. 7 countries. 15M BRL raised. Led 15+ engineers.
  • Shippit · founder · today AI-native consultancy. Production agent systems.
  • Gallium · founding AI engineer Multi-agent brain behind on-brand content generation. Automated quality scoring.
  • Way2 Technology · AI consultant Agent orchestration inside one of Brazil's largest energy-sector platforms.

Agent orchestration · RAG · LLM-as-judge · Durable workflows · DDD · Clean Architecture · TDD

Tonight SDDTDDSkillsMulti-agent orchestrationLive bake-off
  • Spec & failing test — the guardrail AI can't fake
  • Subagents + worktrees — 4 min, not 16
  • Live bake-off — 4 frameworks, 1 prompt, 1 merges
  • Skills + MCP — brain & hands, same loop any IDE
  • Receipts — tx hash, not a demo
Prompt Night n24 · Founder Haus · Floripa

My phone shipped a smart contract while I was asleep..

Base mainnet. 04:17 BRT. Receipt in eight minutes.

Multi-agent workflows · any platform

One prompt in.
A team of agents out..

AI took over the code. Next: agents take over the prompting.

Carlos Libardo  ·  Shippit  ·  25 minutes  ·  works with any AI coding CLI.

SDD · 01  ·  live bake-off vs the rest

The spec is the guardrail..

Same prompt. Three modes. Three worktrees. Models fill ambiguity with confident guesses — the spec removes it, the failing test pins it.

→ The prompt (identical for all three)

add reply-rate by hour of day to campaign analytics

Execute mode

Agent codes immediately. No plan, no spec.

What you get

A chart on the page. Looks right. Metric is silently wrong — numerator and denominator count different events.

134 lines · 0 tests · 18 min

Gain: fastest to something on screen.
Cost: slowest to correct. No test catches the bug. No reviewer catches it. Ships Monday, dies Wednesday.

Plan mode

Agent proposes a plan in chat. You approve, then it codes.

What you get

Happy path is correct. Timezone chain handled. Edge cases you didn't think to name fall back to defaults — and you'll never know.

426 lines · 2 tests · 6 min

Gain: right thinking, right shape, atomic commits.
Cost: plan is prose, not an executable contract. No failing test pins behavior. Save-to-disk is opt-in — default lives in chat.

SDD + superpowers

Brainstorm → spec → plan → failing tests → green. On disk.

What you get

A spec that names the bug from card 1 — in plain English — before any code runs. Edge cases enumerated. Commits link back to acceptance criteria.

779 lines · 10 tests · 8 min

Gain: a durable contract the next agent or human can ship against in 6 months.
Cost: 6× the artifact. Overkill if you're throwing it away.

8 min vs 18. Discipline shipped faster than execute. Execute pays its tax mid-flight — in debug loops.

TDD · 02

Failing test first.
Always..

// step 1 — agent commits a failing test
test("rejects expired session cookie", async () => {
  const res = await app.request("/api/me", {
    headers: { cookie: `sid=${expiredSid}` },
  });
  expect(res.status).toBe(401);
});
// → FAIL: received 200

// step 2 — agent fixes implementation
- const session = await db.get("sessions", sid);
+ const session = await db.get("sessions", sid);
+ if (!session || session.expiresAt < Date.now()) return c.json({}, 401);

// step 3 — green. atomic commit.
$ git log -1 --oneline
3f9a1c2  fix(auth): reject expired session cookie

Red proves the prompt was understood. Green proves it's done.

Skills · 03

Skills are the new function calls..

---
name: brainstorming
description: Use before any creative work —
  features, components, functionality.
  Explores intent and design first.
---

# Brainstorming Ideas Into Designs

1. Explore project context
2. Ask clarifying questions
3. Propose 2-3 approaches w/ trade-offs
4. Present design, get approval
5. Write spec → invoke writing-plans

Lazy-loaded catalog

Agent indexes descriptions. Pulls bodies only on match. 200+ workflows. Zero context bloat.

Source: superpowers:brainstorming — scoped this deck.

Anatomy

  • Frontmatter = always-loaded one-liner. Routing layer.
  • Body = procedure, fetched on match.
  • references/ = even lazier — loaded only if body asks.
Orchestration · 04  ·  Builders: repo link in the chat.

4 minutes, not 16..

Map a 200k-line repo. One conductor fans out 4 specialists. Synthesize once.

Sequential
16 min
Parallel
4 min

codebase map · deps scan · history read · extern search  →  RESEARCH.md

Pay for the slowest agent, not the sum.

Orchestration · 04a  ·  two ways to parallelize

Subagents.
Or agent teams..

Subagents · Task tool

Main agent spawns N helpers. Each gets a prompt. Each runs in isolation. Each reports back. No cross-talk.

Best for: focused lookups, code search, structured reports. Lower token cost.

Agent Teams · Claude Code v2.1.32+

Independent Claude instances. Shared task list. Teammates message each other directly. Lead synthesizes.

Best for: parallel review, competing hypotheses, cross-layer refactors. Higher token cost.

# Subagents — main agent calls Task tool, gets summaries back.
# Agent Teams — flip the flag, then ask the lead to spawn a team.
$ export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
$ claude
> Create a team of 3 reviewers for PR #142: security,
  performance, test coverage. Have them debate findings.

Subagents = MapReduce. Agent Teams = a small crew that argues. Same physics, different tokens. Docs: code.claude.com/docs/en/agent-teams.

Orchestration · 04a.5  ·  the filesystem problem

One repo.
Three agents. Boom..

Without worktrees

agent-a agent-b agent-c │ │ │ └──────────┼──────────┘ ▼ ┌──────────────┐ │ one repo │ │ one branch │ │ one index │ └──────────────┘ │ ▼ package.json conflict dirty index · branch thrash 💥

With worktrees

agent-a agent-b agent-c │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌────────┐ │worktree│ │worktree│ │worktree│ │ a │ │ b │ │ c │ └────┬───┘ └────┬───┘ └────┬───┘ └──────────┼──────────┘ ▼ ┌──────────┐ │ .git │ one history └──────────┘

Same physics as threads sharing memory. Need isolation.

git worktree add ../agent-a feature/sdd-mode
git worktree add ../agent-b feature/tdd-mode
git worktree add ../agent-c feature/yolo-mode

Same repo, N filesystems. Merge later, in one place. No worktree, no parallel.

E2E · 05

The browser is the final gate..

$ bun test:e2e tests/grant-submission.spec.ts

  ✓ submits proposal and routes to /pending          1.8s
  ✓ judge agent scores within 90s                    62.3s
  ✓ score persists to base-sepolia                    9.1s
  ✓ score reads back from on-chain                    3.4s

  4 passed (76.6s)

# agent reads the green output, marks task complete,
# opens PR. no human in the verify loop.

Playwright

playwright.dev — scripted, deterministic E2E. You author selectors, actions, assertions; CI runs the suite headless. The bun test:e2e harness above is Playwright. The contract the agent must satisfy before merge.

agent-browser

agent-browser.dev — Vercel Labs CLI that hands the live browser to an agent. No script — the agent reasons step-by-step. Use to validate flows that don't have a spec yet, or to author the Playwright test from a recorded session.

Playwright = the gate. agent-browser = how the agent walks up to it. Type-checks lie. Unit tests lie. On-chain assertions don't.

Pick the dialect  ·  four guardrail frameworks

Four frameworks.
One physics..

FrameworkPick when…Skip if…Killer move
Superpowers You want the agent to run itself — brainstorm → design → TDD → ship — gated. One-shot edits or exploratory hacking. Hard gate: no code until design is approved, then subagents execute the plan autonomously.
GSD Long-running AI work where context rot would kill quality. Throwaway prototypes with no follow-up. Fresh-context subagents per role — main window stays clean.
OpenSpec You want AI to commit to a spec before guessing at code. You don't want spec artifacts in your repo. Per-change folder — proposal + spec + tasks — archived into history.
Spec Kit You want spec-as-contract portable across any AI CLI. You want speed over ceremony. constitution → specify → clarify → plan → tasks → implement.
Claude Code Codex Cursor Windsurf OpenCode Gemini CLI

= my personal pick after months of daily use. Don't take my word for it — clone one tonight, run it on a real ticket, feel the difference. The patterns travel; the CLI doesn't matter. Your repo, your call.

Receipt · ipe-city · bake-off

Same product. Four runs.
One merged to main..

Branch Framework Commits Files LOC Tests Coverage Outcome
speckit Spec Kit 35 254 29K ~280 (78 E2E) n/a Deployed · closed
full-vision-roadmap GSD 97 1,098 181K ~1,500 n/a Deployed · closed
superpower Superpowers 66 987 197K ~1,440 n/a Deployed · closed
ralph-unified Ralph Loop 17 117 13.8K 570 93.66% Merged to main

Read the numbers

GSD & Superpowers generated 13×–14× the code of the winner (181K & 197K LOC). Both deployed; neither merged. Spec Kit stayed compact (29K) but lighter on tests. Ralph-unified shipped 13.8K LOC, 570 tests, and the only measured coverage (93.66%) — merged to main.

Ralph Loop

Huntley pattern: while :; do claude -p "@PROMPT.md"; done. Same prompt re-fed until criteria met. Used here to review the three branches and compile the best of each. Honest receipt: crashed mid-loop many times, needed manual restarts — but the only branch I trusted on main.

Receipt · ipe-city audit loop

The agent audited its own contracts..

solidity-security gas-optimization secrets-scanner dependency-supply-chain

Same agent. Same loop. 4 audit skills, run on every PR.

6 contracts shipped to Base. Identical bytecode on Sepolia. Audit findings fixed before merge — Solidity 0.8.24, Foundry.

Live · ipe-city · base mainnet

Every evaluation → a tx hash users can click..

ipe-city home — Submit, View, Operator

Three doors

Submit · View · Operator. Same shape across all 3 framework branches.

ipe-city evaluation — real AI scores across 4 dimensions plus confirmed Base Sepolia tx hash

The receipt

AI scores → tx hash → Basescan. Clickable on the card.

ipe-city operator dashboard — Operational, Base Sepolia, 6 contracts deployed

The proof

Operational · Base Sepolia · 6 contracts. All on Basescan.

agent-reviewer-ralph.run.app — open mid-demo, click any tx.

Receipt · OutreachAI

One weekend.
V0 in production..

Stack — boring on purpose

Bun monorepo · Convex · Expo web
Mastra agents · gpt-5

Team LinkedIn outreach. Each teammate's account, their voice — agents research leads, write the message, time the follow-up.

Spec first. Failing test second. Agent third. Sleep fourth.

26
specs in specs/
2.2k
typescript files
~$40
total spend, weekend
OutreachAI · live · outreach-ai.shippit.app

Every teammate, their own agent..

OutreachAI channels — Instagram, LinkedIn, Email, WhatsApp, X, Reddit with health and daily cap

Channels · health + caps

6 channels, one cockpit. Instagram, LinkedIn, Email, WhatsApp, X, Reddit. Per-account daily cap, warmup curve, healthy / near-cap / issues at a glance.

Each provider (Unipile, Voyager, Gmail, X OAuth) abstracted away. The rep sees one safety surface, not six dashboards.

OutreachAI campaign detail — sequence progress, safety ramp, warmup day

Campaign · sequence + ramp

AI Advisory for CTO Leaders. 46 of 1070 leads through the sequence. Safety ramp, warmup day 14, daily limit curve — every send auditable.

A campaign = strategy (ICP × agent × sequence × channel) replayed against a lead list. Pausable mid-flight without breaking the others.

OutreachAI agents — voice, playbooks, published versions per teammate

Agents · voice + playbooks

Carlos — Founder Pitch. Gabriel — Shippit Discovery. Bruno — AI Consultancy CTO. Each teammate has a versioned agent: voice, facts, playbook.

The agent doesn't replace the rep — it is the rep, in delegated form. Voice survives. Authenticity isn't lost in templating.

Design has a spec too

Brand as code.
Three ways in..

/gsp · get shit pretty

Brand contract in markdown — tokens, voice, banned words, layout grammar. Same agent that ships code ships on-brand UI.

Claude Design · Anthropic Labs

Iterative design agent with a few tricks up its sleeve. Reach for it if you like the Claude app aesthetic.

  • How: Tailwind v4 tokens + dark-first React components, agent iterates on the surface
  • Production-tested by the Claude Code team
  • Fork, repaint with your own token layer (e.g. /gsp)
  • Tell: strong embedded DNA — after a few runs you'll see the Claude-app fingerprint in the initial outputs.
  • anthropic.com/news/claude-design

Figma MCP · designs as context

MCP server bridges Figma into the agent. Pull frames, variables, components directly into code-gen prompts.

  • How: agent calls get_code / get_image on a selection
  • Variables → CSS tokens, auto-layout → flex / grid
  • Pixel-true handoff without screenshot prompts
  • figma.com/dev-mode-mcp
ToolFormOwnsReach for when…
/gspMarkdown contract + skillsBrand identity, voice, generative outputYou want one source of truth across deck, app, docs.
Claude DesignIterative design agent + React componentsAI-native UI primitivesYou like the Claude app look — heads-up, its DNA bleeds into the initial outputs.
Figma MCPMCP serverPixel-true handoffDesigners already finalized screens in Figma.
Two surfaces · always-on vs on-demand

Skills load lazy.
CLAUDE.md is always there..

Both are just markdown. Difference is when they load.

Static prompt — CLAUDE.md / AGENTS.md

Loaded once at session start. Always in context. Costs tokens every turn. Use for rules that apply always: stack, conventions, hard constraints, identity.

  • Project root → every session reads it
  • Bloat = context tax on every reply
  • Keep tight: who you are, what not to do

Dynamic capability — Skills

Frontmatter (name + description) sits in a one-line index. Body only loads when the description matches the task — or you type /skill-name.

  • 200+ skills can sit dormant, ~0 cost
  • Agent reads YAML → decides → loads SKILL.md
  • references/ = even lazier (loaded only if body says so)
  • Docs: code.claude.com/docs/en/skills

Progressive disclosure. Static prompt = identity. Skills = procedures the agent fetches when it needs them.

Self-learning · the recursion

Correct it once.
It stops happening..

The loop

A self-improvement skill watches for corrections. Classifies them. Writes the lesson to the right file — silently. Next session reads it back.

  • Reactive — user corrects → log lesson
  • Proactive — session start → read lessons → apply
  • Analytical — "what did you learn" → review + table

Triage — where the lesson lands

FileFor
.self-improvement/lessons.mdproject-specific mistakes
KNOWLEDGE.mdcross-project platform gotchas
CLAUDE.mdpermanent rules / conventions

Real example: Oracle on Apple Silicon needs Colima, not Docker Desktop. Captured once in KNOWLEDGE.md → every future session avoids the 2-hour rabbit hole.

Skills that update memory. Memory that feeds skills. Agent gets sharper per session — not per training run.

Same markdown · every IDE · every SaaS

Skills travel.
And they plug in..

Travel — same SKILL.md, different IDE

Frontmatter (name + description) is the contract. The body is markdown. One file ships unchanged on Claude Code · Cursor · Replit · Lovable.

Plug in — MCP · API key · /chrome

Skills are the brain. MCP is the hands. Three rungs to reach anything.

  • MCP · modelcontextprotocol/servers — Gmail, Slack, Linear, Stripe, Postgres as typed actions
  • API key in .env — skill calls REST when no MCP yet
  • /chrome MCP — drives a real browser when there's no API at all

Three rungs, zero dead ends. Native MCP first, REST with a key second, browser-driving third. If a human can reach it, the agent can.

Drop-in skills · web3 / solidity

Your agent stops
hallucinating ABIs..

ResourceWhat it gives you
ethskills.com 24 markdown skills · corrects stale ETH training data
OpenZeppelin/openzeppelin-skills Contract patterns from the OG · battle-tested
OpenZeppelin/openzeppelin-mcp Natural language → contract scaffold via MCP
coinbase/agentkit Agent → onchain actions · any framework
trailofbits/skills Trail of Bits audit shelf · vulnerability detection + workflows
auditmos/skills audit-signature · audit-liquidation-dos · audit-external-call · audit-arithmetic
The 4 we ran on ipê-city
solidity-security Reentrancy · access control · overflow · delegatecall — block merge on finding
solidity-gas-optimization 80+ RareSkills patterns · storage packing · calldata · short-circuit
secrets-scanner Leaked keys / tokens in .env, fixtures, deploy scripts · pre-commit gate
dependency-supply-chain-security npm audit · vulnerable / typosquatted packages · Foundry lib pinning

Same shape as the shelf we used to audit ipê-city. Drop the folder in, the agent reads the skills, ABI hallucinations stop.

Swap the dialect · keep the pattern

Same loop.
Different paint..

FrameworkSweet spotTradeoff
Claude Agent SDK · direct Thin layer over the model · max control · ships with Claude Code primitives You wire orchestration · no batteries
LangChain / LangGraph Biggest ecosystem · graph-based control flow · every provider Heavy abstraction · churny API
mastra.ai TS-first · agents + workflows + evals + memory in one stack · Vercel-native Young · Node-only
CrewAI Role-based crews · fastest path to multi-agent demo · Python Less control · opinionated
Agno Python · multimodal-first · perf-tuned · lean runtime Smaller community

Orchestration is a slot. Pick the dialect that matches your runtime — the loop underneath is the one we walked through tonight.

The dials · 04b  ·  cost as a knob

Three dials.
Or your $40 becomes $400..

Model split

Opus plans & reviews. Sonnet executes. ~5× cost gap, ~0× quality gap if the plan is good.

Context budget

/compact on long sessions. CLAUDE.md for state worth keeping. New chat when the cache window slips past 5 minutes — token cost cliff, not a curve.

Token compression

caveman hook drops articles + filler. rtk trims tool schemas. ~30–60% off long sessions.

Cost is a knob, not a verdict. Tune the split, isolate the checkouts, compress the chatter — same output, fraction of the spend.

Receipts · pillars + all the links

Steal everything.
Then ask AI for more..

The pillars · disciplines underneath the dialects

SDD · spec first. The contract is the guardrail.
TDD · failing test second. Red proves the prompt was understood.
Clean Arch · boundaries the agent can't accidentally cross.
SOLID · small surfaces, swap-friendly. AI edits one thing at a time.
Framework / skillWhere to get it
Spec Kit github.com/github/spec-kit
Superpowers github.com/obra/superpowers · npx superpowers init
GSD gsd-build/get-shit-done · gsd-2 · gsd-1 · npx gsd-pi
OpenSpec github.com/Fission-AI/OpenSpec
/gsp — get shit pretty jubscodes/get-shit-pretty · gsp.jubs.studio · npx get-shit-pretty
Simply Shippit shippit.app · coming soon (internal use only) — keep in touch

This is a startkit.

Anything not on this slide — your agent can find a fitter alternative in one prompt. The shelf is infinite. Ask for the skill that matches your stack.

Real-world orchestration · 01  ·  work in progress

Remove me as the
bottleneck..

One person can only sit in front of one Claude Code session at a time. That's the ceiling.

vlad-cockpit is the orchestration layer I'm building on top of Claude Code — so many agents can work in parallel, in different repos, while I review instead of type.

Still early. Rough edges everywhere. Sharing the direction, not the polish.

The shift

  • From: me typing prompts, one at a time
  • To: me reviewing work from a fleet of agents
  • Trigger: a schedule, a GitHub issue, a webhook — not me
  • I unblock when something needs judgment. The rest runs.
Real-world orchestration · 02  ·  what it gives you today

Four wins, even half-built..

One screen, every agent

Every session in flight, on one page. Status, what it's working on, when it last moved. No tab-juggling, no "where did that one go".

Cost X-ray

Spend per project, per model, per session. Find the leak before it bleeds. Decide which agents earn their keep.

Scheduled agents

"Audit dependencies every Monday at 7am." Set it, walk away. Work happens whether I'm at the keyboard or not.

Phone-first access

Same cockpit on phone, laptop, tablet. Approve, take over, or walk away — from anywhere. Push notifications when an agent needs a human.

github.com/carloslibardo/vlad

vlad-cockpit · live · still WIP

A dozen agents. One pane of glass..

vlad-cockpit live sessions

Every agent at a glance

A dozen agents working in parallel — across different projects. One card each: what it's doing, how long it's been at it, how much it's spent.

Triage takes 5 seconds, not 5 minutes. Tap a card to drop in.

vlad-cockpit cost breakdown

Where the money went

Spend broken down by project, model, and session — over the last 7 days. The big consumer, the cheap experiment, the runaway.

No more "I think it cost about…". You see it. You decide.

vlad-cockpit cron scheduler

Agents that fire themselves

Pick a project, a budget, a prompt, and a cadence — hourly, daily, weekly. The agent runs without me remembering to start it.

Recurring work stops depending on my memory.

Coming soon · simply shippit  ·  think · build · check · ship

Four verbs. GitHub keeps score..

What the other SDD frameworks don't solve

  • Team coordination. Most SDD assumes one dev. Simply Shippit treats milestone = feature, issue = AC, PR closes the loop.
  • AC traceability. Every test, every review, every commit links back to the issue's acceptance criteria. No drift.
  • The CHECK layer. "Tasks done" ≠ "goals met." /shippit:check runs verify-before-shipping with evidence.
  • Session rituals. /shippit:gm & /shippit:gn sync state with GitHub each day, across the team.

Pipeline

/shippit:plan  →  scan · research · propose
/shippit:build  →  implement · test
/shippit:check  →  review · verify
/shippit:ship  →  verify · test · commit · pr

22 skills · 7 agents · GitHub-native state. Plugin for Claude Code. M1 dropping soon.

Tonight · Builder track · back of the room

Open the laptop.
Ship tonight..

You're one prompt away from whatever you want to build. The tools exist. The patterns exist. The only variable left is whether you open the laptop tonight.

$npx superpowers init
Claude Code Codex Cursor OpenCode
Thanks · scan to open the deck

Ship the receipt,
not the demo..

QR code — open the deck

Scan or visit
techlibs.github.io/
prompt-night/deck/

Carlos Libardo · Shippit
shippit.app · luma.com/eoums7t5

Questions in the open. Builder track at the back.