<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://bluntmachetti.github.io/aftershock/feed.xml" rel="self" type="application/atom+xml" /><link href="https://bluntmachetti.github.io/aftershock/" rel="alternate" type="text/html" /><updated>2026-06-16T08:21:36+00:00</updated><id>https://bluntmachetti.github.io/aftershock/feed.xml</id><title type="html">Aftershock — Field Log</title><subtitle>The build journey of Aftershock — a disaster-response society of Qwen agents for the Qwen Cloud Global AI Hackathon. Progress, architecture decisions, and what the numbers (including the negative results) said.</subtitle><author><name>Kenny Ademolu</name></author><entry><title type="html">The fix that would have only fooled the scoreboard — and the one tuning that actually paid.</title><link href="https://bluntmachetti.github.io/aftershock/2026/06/16/the-fix-that-would-have-only-fooled-the-scoreboard.html" rel="alternate" type="text/html" title="The fix that would have only fooled the scoreboard — and the one tuning that actually paid." /><published>2026-06-16T21:00:00+00:00</published><updated>2026-06-16T21:00:00+00:00</updated><id>https://bluntmachetti.github.io/aftershock/2026/06/16/the-fix-that-would-have-only-fooled-the-scoreboard</id><content type="html" xml:base="https://bluntmachetti.github.io/aftershock/2026/06/16/the-fix-that-would-have-only-fooled-the-scoreboard.html"><![CDATA[<p>In the <a href="/aftershock/2026/06/15/build-the-ruler-first.html">last post</a> I built a <em>ruler</em> for
<a href="https://github.com/bluntmachetti/aftershock"><strong>Aftershock</strong></a> — paired ablations, a sign test, a
power curve — before tuning the Qwen agent society at all. This post is what happened when I finally
used it: a session of four levers, each gated by the ruler.</p>

<p>I expected the ruler to catch <em>noise</em> — a win that’s really a coin flip. It did that. But it also kept
catching something subtler and, honestly, more embarrassing: levers that would <strong>work</strong> — move the
exact number I aimed them at — and <strong>change nothing that matters.</strong> I came within one code edit of
shipping a “fix” whose entire effect would have been to make a metric look better while the system
behaved identically.</p>

<p>This is a post about <em>outcome-neutral</em> metrics, and the discipline of asking, before you optimize a
number, whether the number does anything.</p>

<h2 id="the-setup-the-societys-real-story-is-cost-and-discipline-not-lives">The setup: the society’s real story is cost and discipline, not lives</h2>

<p>A quick recap of where the measurement left me. On Aftershock’s task, <strong>well-tuned scripted bots
match every LLM arm on lives saved</strong> — the negotiation <em>protocol</em>, not model IQ, carries the result.
And under genuine triage there’s <a href="/aftershock/2026/06/15/build-the-ruler-first.html">no detectable lives edge</a>
of the society over an uncoordinated swarm. So the society’s honest, defensible value isn’t <em>more
lives</em> — it’s <strong>cost-efficiency</strong> (small models, coordinated, at a fraction of a big model’s price)
and <strong>conformance</strong> (a deterministic measure of whether agents actually follow their doctrine).</p>

<p>That reframing matters for everything below: it means a “win” on lives is usually noise, and a win
worth shipping is one that moves <strong>cost</strong> or <strong>conformance</strong> — <em>without quietly costing the other.</em></p>

<h2 id="lever-1--does-discipline-cost-lives-an-old-scare-resolved">Lever 1 — does discipline cost lives? (an old scare, resolved)</h2>

<p>Early on I’d given every agent a two-tier playbook — a “doctrine” — and a checker that scores how
well they follow it. Writing the doctrine clearly <em>raised</em> conformance. But a single-seed run had a
worrying side effect: the doctrine run saved <strong>fewer</strong> lives (96 vs 113). Discipline isn’t free —
every instruction you add to a prompt is more to read and obey. I’d left it flagged as <em>outcomes
TBD.</em></p>

<p>n=1 is not a result, so I built a same-arm ablation: run the society <strong>with and without</strong> doctrine on
the same five seeds, and — because conformance is the deterministic, low-variance signal — make the
report <strong>lead with conformance</strong> and treat lives as the noisy secondary.</p>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Doctrine on vs off — paired, 5 seeds</div>
  <table class="rt">
    <thead>
      <tr><th>Signal</th><th>Off</th><th>On</th><th>Δ</th></tr>
    </thead>
    <tbody>
      <tr class="win"><td>team conformance</td><td>0.696</td><td>0.852</td><td><strong>+0.156</strong> &nbsp;(all 5 seeds)</td></tr>
      <tr><td>lives saved</td><td>100.4</td><td>104.2</td><td>+3.8 &nbsp;(within noise)</td></tr>
    </tbody>
  </table>
</div>

<p>The scare was a small-sample artifact. Doctrine raises conformance on <strong>every</strong> seed, and lives are
flat-to-slightly-<em>up</em> — the n=1 “−17 lives” was noise. Discipline here is not paid for in lives. That
clears doctrine as a real, free conformance lever and tells me where the rest of the session should
aim: at the one role that still won’t follow it.</p>

<h2 id="lever-2--the-agent-that-wont-follow-the-rules-and-the-guard-i-almost-built-to-cheat">Lever 2 — the agent that won’t follow the rules, and the guard I almost built to cheat</h2>

<p>That role is the infrastructure agent. Even with doctrine, it’s the one specialist that stays below
0.70 conformance. The diagnostic showed <em>why</em> — three distinct failures, not one:</p>

<ul>
  <li><strong>Urgency inflation:</strong> it marks every request urgency 9, even low-severity ones with a distant
deadline.</li>
  <li><strong>Resubmitting rejected work:</strong> it re-issues a decision the engine just rejected.</li>
  <li><strong>Impossible repairs:</strong> it calls <code class="language-plaintext highlighter-rouge">repair_road</code> on districts that aren’t blocked, or when there’s no
repair crew free.</li>
</ul>

<p>First I did the obvious thing: rewrote its prompt to tie each rule to the exact fields it already sees
in its observation. That <strong>cleanly fixed urgency</strong> (0.35 → 1.00 conformance on that rule) — a numeric
threshold the model can apply. But the <em>impossible-repairs</em> rule barely moved. Telling the flash model
“only repair a district on the BLOCKED line, and only if crew ≥ 1” just… didn’t take. Precondition
<em>gating</em> is harder to prompt than scalar <em>calibration.</em></p>

<p>So I reached for the backlog idea: a <strong>deterministic guard</strong> — enforce the precondition in code so an
invalid repair can never happen. And then, tracing the code to place it, I stopped.</p>

<p><strong>The engine already rejects those repairs.</strong> An invalid <code class="language-plaintext highlighter-rouge">repair_road</code> is validated and declined
<em>before</em> it consumes anything — zero resources, no effect on the world. And the conformance metric
counts the agent’s <em>attempt</em>. So a guard that intercepted the attempt would do exactly one thing:
make the conformance number go up by <strong>hiding the agent’s behavior from the record.</strong> The world would
play out identically — same lives, same cost — and the scoreboard would look better.</p>
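<p>A toy model makes the point concrete. This is a minimal sketch, not Aftershock’s engine: <code class="language-plaintext highlighter-rouge">World</code>, <code class="language-plaintext highlighter-rouge">engine_repair_road</code>, <code class="language-plaintext highlighter-rouge">run</code>, and the district names are hypothetical stand-ins. It assumes only what the paragraph above states: an invalid repair is declined before it consumes anything, and conformance is scored on recorded attempts.</p>

```python
from dataclasses import dataclass, field


@dataclass
class World:
    blocked: set                                  # districts with a blocked road
    crews_free: int                               # repair crews available
    attempts: list = field(default_factory=list)  # what the conformance checker sees


def repair_is_valid(world, district):
    """Precondition from the doctrine: district is blocked and a crew is free."""
    return district in world.blocked and world.crews_free >= 1


def engine_repair_road(world, district):
    """Validate first: a declined repair consumes nothing and changes nothing."""
    if not repair_is_valid(world, district):
        return False
    world.blocked.discard(district)
    world.crews_free -= 1
    return True


def run(decisions, guard):
    """Replay the same agent decisions with or without a pre-engine guard."""
    world = World(blocked={"D3"}, crews_free=1)
    for district in decisions:
        if guard and not repair_is_valid(world, district):
            continue  # the guard drops the attempt before it is ever recorded
        world.attempts.append(engine_repair_road(world, district))
    score = sum(world.attempts) / len(world.attempts) if world.attempts else 1.0
    return score, (frozenset(world.blocked), world.crews_free)


decisions = ["D1", "D3", "D2"]  # two impossible repairs around one valid one
score_plain, world_plain = run(decisions, guard=False)
score_guard, world_guard = run(decisions, guard=True)
print(score_plain, score_guard, world_plain == world_guard)
```

<p>In this sketch the guarded run scores perfect conformance and the unguarded run does not, yet both end in the same world state. The guard moves the score and nothing else.</p>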

<p>That is the cleanest example I’ve hit of an <strong>outcome-neutral metric</strong>: a number you can move without
changing anything real. Building the guard would have been <em>reward-hacking my own benchmark.</em> I didn’t
build it. The honest version of the question isn’t “how do I make the number go up” — it’s “is the
behavior actually fixable, and does fixing it matter?”</p>

<p>So I tested that instead: I gave just the infra agent a <strong>stronger model</strong> (<code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> instead of
<code class="language-plaintext highlighter-rouge">flash</code>) and measured, changing nothing else.</p>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Infra agent: flash vs plus (model isolated, 5 seeds)</div>
  <table class="rt">
    <thead>
      <tr><th>Signal</th><th>flash</th><th>plus</th><th>Δ</th></tr>
    </thead>
    <tbody>
      <tr class="win"><td>impossible-repair rule</td><td>0.560</td><td>0.957</td><td><strong>+0.40</strong></td></tr>
      <tr class="win"><td>infra conformance</td><td>0.863</td><td>0.986</td><td>+0.123 &nbsp;(all 5 seeds)</td></tr>
      <tr><td>lives saved</td><td>103.6</td><td>103.8</td><td>flat &nbsp;(p = 1.0)</td></tr>
      <tr class="lose"><td>cost / run</td><td>$0.041</td><td>$0.054</td><td><strong>+33%</strong></td></tr>
    </tbody>
  </table>
</div>

<p>So the stickiness was a <strong>model-capability floor</strong>, not a prompt bug: the bigger model gates the
preconditions the smaller one ignores. A real finding — but look at the trade. It buys near-perfect
conformance on an <em>outcome-neutral</em> rule (the bad repairs cost no lives to begin with) for <strong>+33%
cost.</strong> Paying a third more to perfect a metric that doesn’t move lives, on an arm whose whole pitch
is cost-efficiency, is the wrong default.</p>

<p>So it ships as an <strong>opt-in operating mode</strong> — one flag, <code class="language-plaintext highlighter-rouge">--role-model infrastructure=qwen3.5-plus</code> —
not the published default. Flip it when discipline matters more than dollars; leave it off for the
lives-per-dollar story. The <em>finding</em> (it’s a capability floor) is the durable result, independent of
which default you choose.</p>

<h2 id="lever-3--the-one-that-actually-paid-stop-re-sending-the-prompt">Lever 3 — the one that actually paid: stop re-sending the prompt</h2>

<p>With lives ruled out and conformance either free (doctrine) or expensive (infra model), I turned to
the axis the society can actually win on: <strong>cost.</strong> And here the harness did its best work as a
<em>profiler</em> before it did anything as a judge.</p>

<p>Where does the money go? The commander sends <strong>31,900 prompt tokens per run but only 2,300 completion
tokens.</strong> Decomposing one call: a <strong>941-token system prompt re-sent on every single tick</strong>, plus
only ~160 tokens of actual world observation. Across all six agents, <strong>~85% of a run’s prompt tokens
are the same static prefixes, re-sent every tick</strong> — roughly <strong>60% of total cost</strong> is the model
re-reading instructions it already read last tick.</p>
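<p>The decomposition is simple enough to check by hand. The sketch below uses the commander’s numbers quoted above. The function names are illustrative, and <code class="language-plaintext highlighter-rouge">output_price_ratio</code> is an assumption, since the per-token rates are not given here.</p>

```python
def static_share_of_prompt(system_tokens, observation_tokens):
    """Fraction of one call's prompt that is the unchanged system prefix."""
    return system_tokens / (system_tokens + observation_tokens)


def static_share_of_cost(prompt_tokens, completion_tokens, prefix_share,
                         output_price_ratio):
    """Share of a run's bill spent re-reading the static prefix.

    output_price_ratio is (price per completion token) / (price per prompt
    token). It is an assumed parameter, not a published rate.
    """
    prefix_cost = prompt_tokens * prefix_share
    total_cost = prompt_tokens + completion_tokens * output_price_ratio
    return prefix_cost / total_cost


# Commander figures from the text: 941-token prefix, ~160 tokens of observation.
commander_share = static_share_of_prompt(941, 160)

# Hypothetical 4x price ratio, only to show how the cost share is derived.
cost_share_at_4x = static_share_of_cost(31_900, 2_300, commander_share,
                                        output_price_ratio=4.0)
print(round(commander_share, 3), round(cost_share_at_4x, 3))
```

<p>For the commander alone, 941 / (941 + 160) comes out near 0.85, in line with the ~85% figure across agents. The cost share is always lower than the token share, and it drops further as completion tokens get more expensive relative to prompt tokens.</p>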

<p>The obvious fix is <em>caching</em> — most providers bill a repeated prefix at a discount. So I probed it: a
~1,600-token stable prefix, sent three times in a row, on both Qwen models. The response told me
plainly:</p>

<blockquote>
  <p><code class="language-plaintext highlighter-rouge">prompt_tokens_details: { text_tokens: 1589 }</code> — <strong>no <code class="language-plaintext highlighter-rouge">cached_tokens</code> field, no discount, full price
every call.</strong></p>
</blockquote>

<p>DashScope’s international compatible-mode endpoint doesn’t cache our prompts. That’s a <em>clean negative
result</em> worth knowing: it means our cost ledger is <strong>accurate, not pessimistic</strong> — there’s no free
accounting win hiding in cache hits. The only way to cut a re-sent prompt is to <strong>make it shorter.</strong></p>
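<p>The check itself is small. A sketch, assuming the usage block arrives as a plain dict shaped like the snippet quoted above, with <code class="language-plaintext highlighter-rouge">cached_tokens</code> nested under <code class="language-plaintext highlighter-rouge">prompt_tokens_details</code> when a provider does report cache hits. The helper names are hypothetical, and the sample values mirror the probe rather than reproduce a full response.</p>

```python
def cached_prompt_tokens(usage):
    """Return the cached-token count from a usage dict, or 0 if none is reported."""
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


def cache_discount_seen(usages):
    """True if any call after the first reports cached prompt tokens.

    The first call can only warm a cache, so it is skipped.
    """
    return any(cached_prompt_tokens(u) > 0 for u in usages[1:])


# Three identical calls, each reporting only text_tokens, as in the probe.
probe = [{"prompt_tokens_details": {"text_tokens": 1589}} for _ in range(3)]
print(cache_discount_seen(probe))
```

<p>A missing field and a zero count are treated the same way here, which is the conservative reading: no evidence of a discount means the ledger bills every token at full price.</p>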

<p>Doctrine is off-limits (it’s the conformance lever from Lever 1) and the role instructions are
behavioral, so the target was the <strong>output contract</strong> — the JSON-format boilerplate every agent
carries. I compacted the multi-line schema to one line, deduplicated rules that were stated twice, and
tightened the proposal descriptions — keeping every capability and field name, just terser. Commander
prefix: <strong>941 → 835 tokens.</strong> Then the paired A/B, against an identical baseline where <em>only the
contract differs:</em></p>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Trimmed contract vs full — paired, 5 seeds</div>
  <table class="rt">
    <thead>
      <tr><th>Signal</th><th>full</th><th>trimmed</th><th>Δ</th></tr>
    </thead>
    <tbody>
      <tr class="win"><td>cost / run</td><td>$0.0411</td><td>$0.0353</td><td><strong>−14.0%</strong></td></tr>
      <tr class="win"><td>lives per dollar</td><td>2,522</td><td>3,046</td><td><strong>+21%</strong></td></tr>
      <tr><td>lives saved</td><td>103.6</td><td>107.6</td><td>+4.0 &nbsp;(up, not significant)</td></tr>
      <tr><td>conformance</td><td>0.916</td><td>0.879</td><td>−0.037 &nbsp;(p = 0.375, n.s.)</td></tr>
    </tbody>
  </table>
</div>

<p>A real, largely <em>deterministic</em> <strong>14% cost cut</strong> — the token reduction is fixed; only the small
completion varies — for a <strong>21% gain in lives-per-dollar</strong>, the headline efficiency number. Lives even
drifted up. The one thing to respect: conformance ticked down 0.037, but the sign test says that’s not
a credible effect (p=0.375), and it sits inside the run-to-run band conformance has always wandered in
(0.85–0.92). By the same bar I used to <em>reject</em> the +16-life ghost last post, I can’t <em>claim</em> this
conformance dip either — so I kept the trim and logged the dip as a watch-item, not a regression.</p>
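<p>The two headline rows can be re-derived from the table’s own means. This is only arithmetic on the rounded values shown above, so the lives-per-dollar figures land a few units away from the published 2,522 and 3,046, which presumably come from unrounded costs. The function names are illustrative.</p>

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


def lives_per_dollar(lives, cost_usd):
    """Efficiency metric: mean lives saved per dollar of model spend."""
    return lives / cost_usd


cost_delta = pct_change(0.0411, 0.0353)        # cost / run, full vs trimmed
lpd_full = lives_per_dollar(103.6, 0.0411)     # full contract
lpd_trimmed = lives_per_dollar(107.6, 0.0353)  # trimmed contract
lpd_delta = pct_change(lpd_full, lpd_trimmed)

print(round(cost_delta, 1), round(lpd_full), round(lpd_trimmed), round(lpd_delta, 1))
```

<p>Because lives per dollar divides a noisy numerator by a nearly fixed denominator, most of the +21% comes from the cost cut and the rest from the non-significant drift in lives.</p>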

<p>The deeper lesson hid in the disappointment: the pure-redundancy trim was tiny (~16 tokens). Most of
that 941-token prefix is <strong>irreducible</strong> — doctrine the agents need, decision vocabulary they act on,
a schema that keeps the JSON valid. The society’s ~$0.04/run is <em>mostly structural.</em> −14% is close to
the safe ceiling without touching doctrine or model tier. Worth banking; not a bottomless well.</p>

<h2 id="the-pattern-four-levers-one-bar">The pattern: four levers, one bar</h2>

<p>Before chasing each lever I had a workflow of read-only agents scope it against one question: <em>does
moving this metric change a real outcome?</em> It skipped four backlog ideas outright (among them a
deadline-sort for a failure mode that’s already near zero; reducing “redundant” bids the auction already
ignores at no cost; and a temperature sweep premised on a cost number that turned out to be wrong). Each was
an outcome-neutral mirage, the same pattern as the infra repair guard: a number you can move that moves
nothing else.</p>

<p>What <em>survived</em> the bar is exactly the society’s honest story:</p>

<ol>
  <li><strong>Doctrine</strong> raises conformance on every seed, free of lives cost.</li>
  <li><strong>Conformance has a price ceiling</strong> — the last stubborn role needs a bigger model, which costs
more than the (outcome-neutral) metric is worth as a default, so it’s a switch.</li>
  <li><strong>Cost is real and movable</strong> — −14% by not re-sending what the model already read, the cap set by
how much of the prompt is genuinely load-bearing.</li>
</ol>

<p>No grand lives breakthrough. But three claims I can defend in front of a judge, and a fistful of
features I <em>didn’t</em> ship because the measurement said they’d only flatter the scoreboard.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>Last post’s lesson was <em>a single measurement is a coin flip wearing a lab coat</em> — pair it, test it,
know when to stop. This session added the sharper one:</p>

<p><strong>Before you optimize a metric, prove the metric does something.</strong> The most seductive failure isn’t a
noisy win — it’s a <em>clean</em> win on a number that’s decoupled from reality. I could have made my
conformance score jump by intercepting decisions the engine already discards; it would have demoed
beautifully and meant nothing. The guard against that isn’t more statistics. It’s tracing the metric
back to an outcome — lives, dollars, missions — and refusing to celebrate a move that doesn’t reach
one.</p>

<p>Make it measurable. Believe only what survives the measurement. And check that the thing you’re
measuring is connected to the thing you actually care about.</p>

<p><strong>Try it live:</strong> <a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a> · <strong>Read the code:</strong> <a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a>
(the levers are in
<a href="https://github.com/bluntmachetti/aftershock/blob/main/docs/FIELD-NOTES.md"><code class="language-plaintext highlighter-rouge">docs/FIELD-NOTES.md</code></a>
§18–21; the opt-in infra model is <code class="language-plaintext highlighter-rouge">--role-model</code>, and the cost trim is the trimmed output contract)</p>

<p><em>Built with Qwen Cloud (<code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> / <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> / <code class="language-plaintext highlighter-rouge">qwen3-max</code> via DashScope) and Alibaba
Cloud ECS, for the Qwen Cloud Global AI Hackathon.</em></p>]]></content><author><name>Kenny Ademolu</name></author><summary type="html"><![CDATA[With the measurement harness built, I finally tuned Aftershock's Qwen agent society. The ruler kept saying a new kind of no: not 'that's noise' but 'that would work and still change nothing real.' I almost shipped a fix whose only effect was to make a conformance metric look better. A field report on outcome-neutral metrics — and the one lever (a −14% cost trim) that survived the bar.]]></summary></entry><entry><title type="html">Build the ruler first. It killed our biggest feature — and a +16-life win that wasn’t real.</title><link href="https://bluntmachetti.github.io/aftershock/2026/06/15/build-the-ruler-first.html" rel="alternate" type="text/html" title="Build the ruler first. It killed our biggest feature — and a +16-life win that wasn’t real." /><published>2026-06-15T22:00:00+00:00</published><updated>2026-06-15T22:00:00+00:00</updated><id>https://bluntmachetti.github.io/aftershock/2026/06/15/build-the-ruler-first</id><content type="html" xml:base="https://bluntmachetti.github.io/aftershock/2026/06/15/build-the-ruler-first.html"><![CDATA[<p>My backlog for <a href="https://github.com/bluntmachetti/aftershock"><strong>Aftershock</strong></a> had a clear headliner.
Of ~44 ideas for making the Qwen agent society save more lives, one was marked <em>the single biggest
lever</em>: a smarter resource auction (partial grants instead of all-or-nothing), hypothesized at
<strong>+10–15 lives a run</strong>. The obvious move was to build it.</p>

<p>I didn’t build it. I built a <em>ruler</em> first — the boring statistical plumbing that tells you whether
a change actually did anything. And the ruler earned its keep three times over in one sitting: it
<strong>killed the headliner</strong> before I wrote a line of it, it <strong>caught a +16-life win that wasn’t real</strong>,
and it made me <strong>put a caveat on my own flagship number</strong>.</p>

<p>This is a post about the least glamorous part of the project — measurement — and why it turned out
to be the most valuable.</p>

<h2 id="why-a-ruler-before-any-tuning">Why a ruler, before any tuning</h2>

<p>Aftershock’s engine is byte-deterministic: same seed, same disaster, same outcome. But the <em>agents</em>
are not. The published society result — <code class="language-plaintext highlighter-rouge">103.2 ± 23.6</code> lives — has a <code class="language-plaintext highlighter-rouge">±</code> for a reason, and the first
thing the ruler did was tell me exactly what that <code class="language-plaintext highlighter-rouge">±</code> is made of.</p>

<p>Two cheap experiments:</p>

<ul>
  <li><strong>Does Qwen Cloud honor a sampling <code class="language-plaintext highlighter-rouge">seed</code>?</strong> I threaded a deterministic per-call seed through every
request and ran the same world twice. The two runs diverged on the first tick, exactly like two
un-seeded runs. <strong>DashScope accepts <code class="language-plaintext highlighter-rouge">seed</code> but doesn’t make sampling reproducible</strong> — so the LLM
arms are irreducibly stochastic, and that <code class="language-plaintext highlighter-rouge">±</code> really is sampling noise, not measurement slop.</li>
  <li><strong>How much of the noise is the <em>model</em>, and how much is the <em>world</em>?</strong> Running each seed three
times and decomposing the variance: <strong>79% of it is which disaster you drew</strong> (between-seed), only
<strong>21% is the model’s run-to-run wobble</strong> (within-seed).</li>
</ul>

<p>That second number is the whole game. If most of your variance is the scenario, then comparing a
<em>mean</em> against a <em>mean</em> is mostly comparing coin flips — unless you <strong>pair</strong>: run the control and the
treatment on the <em>same</em> seeds and difference out the world. Concretely, to detect a +10-life effect
at decent power you need <strong>~10 paired seeds, versus ~45 unpaired</strong>. Same power, under a quarter of
the runs, because pairing cancels the 79% of the variance that comes from the scenario.</p>
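<p>Both numbers can be sketched in a few lines. This is not the harness’s code: it is a plain sum-of-squares split on made-up runs, plus one reading of why ~45 unpaired seeds shrink to ~10 paired ones. That reading assumes the seed effect is additive and shared by both arms, so a per-seed difference keeps only the within-seed variance and the seeds needed scale by the within-seed share. The function names and the toy data are hypothetical.</p>

```python
import math
from statistics import mean


def variance_shares(runs_by_seed):
    """Split total variance into between-seed and within-seed shares.

    runs_by_seed maps a seed to the lives saved on each repeat of that seed.
    """
    seed_means = {s: mean(v) for s, v in runs_by_seed.items()}
    grand = mean(x for v in runs_by_seed.values() for x in v)
    n_total = sum(len(v) for v in runs_by_seed.values())
    between = sum(len(v) * (seed_means[s] - grand) ** 2
                  for s, v in runs_by_seed.items()) / n_total
    within = sum((x - seed_means[s]) ** 2
                 for s, v in runs_by_seed.items() for x in v) / n_total
    total = between + within
    return between / total, within / total


def paired_seeds_needed(unpaired_seeds, within_share):
    """Seeds needed once pairing removes the between-seed variance (assumed model)."""
    return math.ceil(unpaired_seeds * within_share)


toy = {1: [10, 12, 14], 2: [30, 32, 34]}  # illustrative, not real run data
between_share, within_share = variance_shares(toy)
paired = paired_seeds_needed(45, 0.21)    # 45 * 0.21 = 9.45, rounded up
print(round(between_share, 3), round(within_share, 3), paired)
```

<p>With the 79/21 split above, 45 unpaired seeds times 0.21 rounds up to 10, which agrees with the paired figure quoted in the text.</p>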

<p>So the ruler is: run both arms on identical seeds, take the per-seed difference, and judge it with a
sign test, a bootstrap confidence interval, and a power curve — <code class="language-plaintext highlighter-rouge">aftershock ablation</code>. Nothing fancy.
Just enough to stop me believing a number I shouldn’t.</p>
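<p>For readers who want the shape of it, here is a minimal standard-library sketch of two of those pieces: an exact two-sided sign test and a percentile bootstrap interval on per-seed differences. It is not the <code class="language-plaintext highlighter-rouge">aftershock ablation</code> implementation. The function names, the resampling count, and the example differences are all assumptions.</p>

```python
import random
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences; ties are dropped."""
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(diffs, reps=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(reps))
    lo = means[int(alpha / 2 * reps)]
    hi = means[int((1 - alpha / 2) * reps) - 1]
    return lo, hi


diffs = [5, 3, 8, -2, 4]  # illustrative per-seed (treatment - control) lives
p_value = sign_test_p(diffs)
ci_low, ci_high = bootstrap_ci(diffs)
print(p_value, ci_low, ci_high)
```

<p>Four wins and one loss out of five seeds gives p = 0.375 on the sign test, however large the wins are. That is why it is the conservative check: it ignores magnitude, so one lucky seed cannot carry it.</p>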

<h2 id="episode-1--the-diagnostic-killed-the-headliner">Episode 1 — the diagnostic killed the headliner</h2>

<p>The “biggest lever” exists to fix a specific pathology: a <strong>priority inversion</strong>. Under all-or-nothing
granting, a high-priority incident that needs three ambulances can lose the whole pool to a
later, lower-priority incident that happens to fit the two that are left. Partial grants would fix
that. Worth +10–15 lives — <em>if it happens.</em></p>

<p>So before building the fix, I built a diagnostic that reads the recorded auction outcomes and counts
how often it actually happens. The answer, across nine society runs:</p>

<p><strong>Zero.</strong></p>

<p>Then I made the world harder — I added a knob to tighten the resource pools — figuring scarcity would
surely manufacture the contention. Still zero. I cranked it to the harshest setting the sim allows
(every pool at 2, where a <em>fifth</em> of all incidents fail outright):</p>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Priority inversions found — the pathology the "biggest lever" fixes</div>
  <table class="rt">
    <thead>
      <tr><th>World</th><th>Society runs</th><th>Priority inversions</th></tr>
    </thead>
    <tbody>
      <tr class="win"><td>default</td><td>9</td><td><strong>0</strong> &nbsp;(of 59 contested losses)</td></tr>
      <tr class="win"><td>tight pools</td><td>6</td><td><strong>0</strong> &nbsp;(of 74 contested)</td></tr>
      <tr class="lose"><td>harshest (all pools = 2)</td><td>5</td><td><strong>6</strong> &nbsp;(of 794 losses — &lt;1%)</td></tr>
    </tbody>
  </table>
</div>

<p>Across the default and tight worlds: <strong>133 contested auction losses, every single one legitimate</strong> —
the loser always had lower-or-equal priority than the winner. The auction’s arbitration is just
<em>correct</em>: it serves the highest-priority incident first, and the agents don’t over-request in the
way that would let a low-priority bid steal the remainder. Only at brutal scarcity do a handful of
inversions appear, and even then they’re dwarfed 70-to-1 by plain shortage (the pool is simply empty
— there’s nothing for a partial grant to split).</p>

<p>The verdict wrote itself: <strong>don’t build the lever.</strong> It optimizes a problem this system doesn’t have.
That’s the harness paying for itself before it cost a thing — a feature un-built is a feature you
don’t have to maintain, document, or quietly regret.</p>

<p>(A humbling footnote: the <em>first</em> version of that inversion detector was dead code — by the
resolver’s ordering it could never actually fire — and an adversarial code review caught it. Even the
ruler needed measuring. I rebuilt it to reconstruct winners-versus-losers from the recorded grants,
and <em>then</em> trusted the zeros.)</p>
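<p>The rebuilt check can be sketched as follows. The record schema here (<code class="language-plaintext highlighter-rouge">tick</code>, <code class="language-plaintext highlighter-rouge">resource</code>, <code class="language-plaintext highlighter-rouge">priority</code>, <code class="language-plaintext highlighter-rouge">granted</code>) is hypothetical, as is the convention that a larger number means higher priority and that a loss counts as contested only when the same auction had a winner. The rule itself follows the text: a loss is legitimate when the loser’s priority is lower than or equal to the winner’s.</p>

```python
from collections import defaultdict


def count_priority_inversions(requests):
    """Return (contested_losses, inversions) from recorded auction outcomes.

    An inversion is a denied request whose priority is strictly higher than
    that of some granted request for the same resource in the same tick.
    """
    by_auction = defaultdict(list)
    for r in requests:
        by_auction[(r["tick"], r["resource"])].append(r)

    contested = inversions = 0
    for group in by_auction.values():
        winners = [r for r in group if r["granted"]]
        losers = [r for r in group if not r["granted"]]
        if not winners:
            continue  # plain shortage: nobody won, so nothing was inverted
        lowest_winner = min(w["priority"] for w in winners)
        for loser in losers:
            contested += 1
            if loser["priority"] > lowest_winner:
                inversions += 1
    return contested, inversions


records = [
    {"tick": 1, "resource": "ambulance", "priority": 9, "granted": True},
    {"tick": 1, "resource": "ambulance", "priority": 4, "granted": False},
    {"tick": 2, "resource": "ambulance", "priority": 8, "granted": False},
    {"tick": 2, "resource": "ambulance", "priority": 3, "granted": True},
    {"tick": 3, "resource": "fire_engine", "priority": 7, "granted": False},
]
contested, inversions = count_priority_inversions(records)
print(contested, inversions)
```

<p>Working from recorded grants rather than from the resolver’s internal ordering is the design point: the detector sees only what actually happened, so it cannot be dead code by construction.</p>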

<h2 id="episode-2--the-16-that-wasnt">Episode 2 — the +16 that wasn’t</h2>

<p>Fine: the auction isn’t the bottleneck on the easy task. But the easy task is <em>easy</em>: ~96% of
incidents get resolved, so there’s barely any room for a change to move lives. The honest next
question: under genuine triage, in a world where you <em>can’t</em> save everyone and the order you serve
people in decides who lives, does the society’s coordination beat the uncoordinated swarm
by a wider margin?</p>

<p>I dialed in that world (the harshest setting: lives saved ≈ lives lost) and ran a paired
society-vs-swarm ablation. Five seeds. The result looked <em>great</em>:</p>

<blockquote>
  <p><strong>Δ = +16.2 lives</strong>, 95% CI <strong>[+3.2, +29.6]</strong> — excludes zero. Society stable at 63–72 lives across
every seed; swarm volatile (32–66), collapsing on the two hard draws (39 and 32 lives).</p>
</blockquote>

<p>A clean story practically wrote itself: <em>coordination buys graceful degradation under stress.</em> The
harness’s own auto-verdict even printed <strong>“credible improvement.”</strong> I wanted to believe it.</p>

<p>But the same readout had two warnings I’d built in precisely so I couldn’t ignore them: the <strong>sign
test was non-significant</strong> (p=0.375) and the <strong>statistical power was 0.57</strong>. The confidence interval
said yes; the more conservative tests said <em>not yet.</em> The honest move with a disagreement like that
is not to pick the answer you like — it’s to add seeds. So I added six more (the harness resumes the
runs it already has, so only the new ones cost anything) and re-ran at eleven.</p>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Society vs swarm, harsh world — the same effect at n=5 and n=11</div>
  <table class="rt">
    <thead>
      <tr><th>Seeds</th><th>Δ lives</th><th>95% CI</th><th>Sign test</th><th>Power</th></tr>
    </thead>
    <tbody>
      <tr class="lose"><td>n = 5</td><td><strong>+16.2</strong></td><td>[+3.2, +29.6] — excludes 0</td><td>p = 0.375</td><td>0.57</td></tr>
      <tr class="win"><td>n = 11</td><td><strong>+4.7</strong></td><td>[−4.0, +14.7] — <em>includes 0</em></td><td>p = 1.0</td><td>0.15</td></tr>
    </tbody>
  </table>
</div>

<p>The effect <strong>evaporated.</strong> Four of the six new seeds came back with the <em>swarm beating the society</em>. The
entire +16 had been carried by two lucky draws where the swarm happened to faceplant; the “society is
rock-stable” story was the same small-sample mirage (at eleven seeds the society’s range is 32–80,
not the tidy 63–72 I’d seen at five). There is <strong>no detectable lives advantage</strong> here. The power
curve says confirming even the residual +4.7 would take ~88 seeds — which is the harness’s polite way
of saying <em>stop.</em></p>

<p>This is the single most important thing the ruler did. A +16-life result with a confidence interval
that excludes zero is <em>exactly</em> the kind of finding you screenshot, put in a slide, and ship. It was
noise. I caught it because I’d committed in advance to a test I couldn’t argue with after the fact.</p>

<h2 id="episode-3--turning-the-ruler-on-my-own-headline">Episode 3 — turning the ruler on my own headline</h2>

<p>If a +16 can be a ghost, what about the project’s flagship claim — that the coordination protocol is
worth <strong>+28 lives</strong>? I owed it the same scrutiny, so I ran the <em>already-published</em> numbers through the
same paired test. The result, to my relief, held up where the +16 didn’t — but not cleanly:</p>

<ul>
  <li><strong>Directionally solid:</strong> society ≥ swarm on <strong>all five</strong> seeds (four wins, one tie, zero losses).
Unlike the harsh-world ghost that went both ways, this gap is real in sign. The qualitative claim —
<em>the auction helps</em> — survives.</li>
  <li><strong>Magnitude soft:</strong> the precise “+28” is leaned on hard by a single seed (+88 there; the other four
average +12.5), the sign test is p=0.125, and the power is 0.42. To put a tight interval on the
<em>number</em> you’d want ~25 seeds.</li>
</ul>

<p>So I’m keeping the claim and adding a caveat I can defend: <em>the coordination protocol reliably helps;
the headline figure is an n=5 mean with a wide interval.</em> That’s a weaker sentence than “+28 lives,”
and it’s the true one.</p>

<h2 id="what-the-ruler-cost-and-what-it-bought">What the ruler cost, and what it bought</h2>

<p>The whole measurement campaign — 41 paired runs across every world — cost <strong>$1.41</strong> in Qwen Cloud
tokens, tallied from the run ledgers. For that, I got three results I’d have paid far more for in
wasted effort and overconfidence:</p>

<ol>
  <li><strong>Don’t build the “biggest lever.”</strong> It fixes a pathology that occurs &lt;1% of the time.</li>
  <li><strong>There’s no lives edge under triage.</strong> The +16 was a false positive; the honest story for the
society is cost-efficiency and discipline, not out-saving the swarm under scarcity.</li>
  <li><strong>Caveat the +28.</strong> Real in direction, soft in magnitude.</li>
</ol>

<p>None of those are features. Two are flat “don’ts” and one is a hedge. And they’re the most valuable
output of the session — because the alternative was building a feature against a problem that isn’t
there, and believing a win that wasn’t real, in a slide deck, in front of judges.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>In a stochastic system, <strong>a single measurement is a coin flip wearing a lab coat.</strong> The plumbing
that protects you from it — pair to kill the variance you don’t care about, judge with a test you
picked <em>before</em> you saw the answer, and let a power curve tell you when to stop — isn’t bureaucracy.
It’s the difference between knowing and hoping. Build that ruler before you build the thing you’re
itching to build, because the ruler’s best days are the ones where it tells you <em>no.</em></p>

<p>And hold your own dashboard to the same bar: my ablation tool’s auto-verdict cheerfully called the
+16 “credible,” and the sign test is the only reason I didn’t believe it. The tool that keeps you
honest still needs you to read all of it.</p>

<p>Make it measurable. Then believe only the part that survives the measurement.</p>

<p><strong>Try it live:</strong> <a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a> · <strong>Read the code:</strong> <a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a>
(the harness is <code class="language-plaintext highlighter-rouge">aftershock ablation</code> / <code class="language-plaintext highlighter-rouge">aftershock diagnose</code>; the findings are in
<a href="https://github.com/bluntmachetti/aftershock/blob/main/docs/FIELD-NOTES.md"><code class="language-plaintext highlighter-rouge">docs/FIELD-NOTES.md</code></a> §13–17)</p>

<p><em>Built with Qwen Cloud (<code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> / <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> / <code class="language-plaintext highlighter-rouge">qwen3-max</code> via DashScope) and Alibaba
Cloud ECS, for the Qwen Cloud Global AI Hackathon.</em></p>]]></content><author><name>Kenny Ademolu</name></author><summary type="html"><![CDATA[Before tuning Aftershock's Qwen agent society, I built the measurement: paired ablations, a power curve, and free diagnostics over the run records. Then the measurement did its job — it killed the backlog's biggest planned feature before I wrote a line of it, caught a +16-life 'win' that evaporated when I added seeds, and made me caveat my own headline number. A field report on not fooling yourself.]]></summary></entry><entry><title type="html">We drew the agent auction on the map. A review caught it pointing at the wrong district.</title><link href="https://bluntmachetti.github.io/aftershock/2026/06/15/we-drew-the-auction-on-the-map.html" rel="alternate" type="text/html" title="We drew the agent auction on the map. A review caught it pointing at the wrong district." /><published>2026-06-15T16:00:00+00:00</published><updated>2026-06-15T16:00:00+00:00</updated><id>https://bluntmachetti.github.io/aftershock/2026/06/15/we-drew-the-auction-on-the-map</id><content type="html" xml:base="https://bluntmachetti.github.io/aftershock/2026/06/15/we-drew-the-auction-on-the-map.html"><![CDATA[<p>The headline number in <a href="https://github.com/bluntmachetti/aftershock"><strong>Aftershock</strong></a> is that a
coordination protocol is worth +28 lives a run: the same five <code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> workers save a mean of 103.2
lives instead of 75.6 once you put an auction between them and the shared pool of ambulances, rescue
crews, and fire engines. The protocol-less swarm burns its turns racing for resources that are
already gone; the society resolves that contention <em>before</em> anyone acts.</p>

<p>That sentence is the entire project. And until last week you could only read it — in a scrolling
text feed of <code class="language-plaintext highlighter-rouge">granted</code> / <code class="language-plaintext highlighter-rouge">pool exhausted</code> rulings off to the side of the map. The thing that
matters most was the thing you couldn’t see happen.</p>

<p>So this build was about making the mechanism legible: a Mission Control rebuild of the map that
draws the auction itself, live, on the city. It’s also a small parable about how a <em>picture</em> can
lie exactly as easily as a number — and on a project whose whole pitch is honesty, the picture
doesn’t get a pass.</p>

<h2 id="from-map-to-mission-control">From “map” to mission control</h2>

<p>The old map was honest but generic: district blocks, mission pins, a panic gauge. Fine. But it
read like a dashboard, not an emergency operations center — and it buried the one event you’d want
a commander watching for.</p>

<p>The rebuild is an EOC command view. A condition state up top (RED/AMBER/BLUE/GREEN, driven by the
nearest deadline, worst severity, and public panic), a tile map of the city, and — the point of the
whole exercise — a <strong>contention overlay</strong>.</p>

<p>When two incidents reach for the last unit of a resource in the same tick, the map draws it: a
dashed line from the district that <em>lost</em> to the district that <em>won</em>, labeled with the resource
(<code class="language-plaintext highlighter-rouge">AMB CONTESTED</code>, <code class="language-plaintext highlighter-rouge">RPR CONTESTED</code>), a halo on both incidents. You watch the auction arbitrate the
Qwen agents’ requests in real time. The <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> commander and the kernel’s auction stop
being a feed you skim and become a thing you see resolve, tick by tick.</p>

<p>And it’s derived from data that was already there. Every tick record carries the agents’ typed
<code class="language-plaintext highlighter-rouge">resource_request</code> proposals and the auction’s rulings. The overlay is a pure function of those —
no new engine state, no new field, nothing for the deterministic core to even notice. Same pack,
same seed, byte-identical run, now with the contention drawn on top.</p>

<h2 id="the-part-a-review-caught">The part a review caught</h2>

<p>Here’s where it gets honest. The first version of that overlay was subtly wrong, and it took an
automated code review on the pull request to catch it.</p>

<p>The auction can grant the same resource to several incidents before the pool runs dry. Picture one
tick, three incidents reaching for ambulances against a pool that can only cover two of them:</p>

<table>
  <thead>
    <tr>
      <th>incident</th>
      <th>district</th>
      <th>ambulance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>m1</td>
      <td>Harbor</td>
      <td><strong>granted</strong></td>
    </tr>
    <tr>
      <td>m2</td>
      <td>Hospital</td>
      <td><strong>granted</strong> (the last one)</td>
    </tr>
    <tr>
      <td>m3</td>
      <td>Market</td>
      <td><strong>lost</strong></td>
    </tr>
  </tbody>
</table>

<p>m3 lost. But <em>to whom?</em> My first cut linked every loser to the <strong>first</strong> winner it happened to
process — here, m1 in Harbor. So the map drew Market’s “you lost this” arrow at Harbor, when the
auction’s own ruling says plainly: <code class="language-plaintext highlighter-rouge">pool exhausted: ambulance granted to m2</code> — m3 lost the <em>last</em>
unit to m2, in Hospital. The picture pointed at the wrong district.</p>

<p>The reviewer flagged it as a misattribution, and it wasn’t theoretical — it was live in the demo
data. The very first contested tick of our headline society run drew the contention line at the
wrong district. A plausible-looking arrow, confidently wrong.</p>

<p>The fix was to stop guessing and read what the auction already said: parse the winner named in each
loser’s own ruling, and link to <em>that</em> district. It also made the code simpler — the whole
“figure out who the winner probably was” scaffolding simply got deleted.</p>
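
<p>A minimal sketch of that fix, in Python for brevity rather than the observatory’s actual frontend code. The ruling format is assumed from the single example quoted above, and <code class="language-plaintext highlighter-rouge">contention_links</code>, <code class="language-plaintext highlighter-rouge">rulings</code>, and <code class="language-plaintext highlighter-rouge">district_of</code> are hypothetical names for illustration:</p>

```python
import re
from typing import Dict, List, Tuple

# Assumed ruling shape, based on the one example quoted in the post:
#   "pool exhausted: ambulance granted to m2"
RULING_RE = re.compile(r"pool exhausted: (\w+) granted to (\w+)")


def contention_links(
    rulings: Dict[str, str], district_of: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (loser_district, winner_district, resource) per lost request."""
    links = []
    for incident, ruling in rulings.items():
        match = RULING_RE.search(ruling)
        if match is None:
            continue  # granted, or a ruling shape this sketch does not know
        resource, winner = match.group(1), match.group(2)
        if winner not in district_of:
            continue  # never guess a winner the ruling does not name
        links.append((district_of[incident], district_of[winner], resource))
    return links


# The three-incident tick from the table above.
rulings = {
    "m1": "granted",
    "m2": "granted",
    "m3": "pool exhausted: ambulance granted to m2",
}
district_of = {"m1": "Harbor", "m2": "Hospital", "m3": "Market"}
links = contention_links(rulings, district_of)
print(links)
```

<p>The loser’s own ruling is the only source of truth here: if it names no winner the sketch draws nothing, rather than falling back to the first winner it happened to process.</p>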

<p>I keep a rule for the benchmark: never claim a number you didn’t measure on identical worlds. This
was the same rule, wearing a different hat. <strong>A visualization is an inference too, and it earns the
same scrutiny as a number</strong> — more, even, because a clean line on a map <em>feels</em> like ground truth
in a way a table never does. The map can be confidently, beautifully wrong. On a project that puts
<code class="language-plaintext highlighter-rouge">REAL / MAPPED / INFERRED / SYNTHETIC</code> labels on every field precisely so nothing overclaims, an
overlay that quietly fingers the wrong district is the same sin in a prettier font.</p>

<h2 id="keeping-a-reskin-honest">Keeping a reskin honest</h2>

<p>It’s “just frontend,” but the failure modes of a frontend change on a project like this are
specific, so it got the engine treatment:</p>

<ul>
  <li><strong>The color contract is frozen.</strong> Society stays cyan, the baselines stay amber — the same coding
the side-by-side compare view leans on. The redesign added exactly one new color (a caution
yellow for contention) and reused the existing signals for everything else, so no other view
shifted a pixel.</li>
  <li><strong>The honesty surfaces stayed first-class.</strong> The reality strip: real first-on-scene latency vs.
the agents’ simulated response, the <code class="language-plaintext highlighter-rouge">INFERRED</code> badge on lives-at-risk, the never-fabricated-when-null
rule. All of it survived untouched, including on the live NYC Hurricane Ida pack with its borough
names and provenance.</li>
  <li><strong>The blast radius was provable.</strong> I pulled the rich incident markers into a shared module so the
compare view’s two synced maps render from the <em>exact same code</em> but never inherit the overlay —
the one map that should change changed, and the one that shouldn’t was untouched by construction,
not by hope.</li>
  <li><strong>Determinism never moved.</strong> Frontend-only, so the <code class="language-plaintext highlighter-rouge">aftershock verify</code> digest check couldn’t
regress — and it didn’t.</li>
</ul>

<p>Then it went out the way everything does: local → staging → production, verified at each hop, live
now at the public demo.</p>

<h2 id="the-takeaway">The takeaway</h2>

<p>The benchmark exists to make the <em>result</em> falsifiable: here’s the baseline, here are the identical
worlds, here’s the measured gain. This build was the same instinct aimed at the <em>mechanism</em>: make
the coordination you can’t otherwise see legible enough to watch — and then hold the picture to the
same bar as the numbers, including the part you only find when someone reviews it and asks “wait,
who actually won?”</p>

<p>Make the mechanism visible. Then make sure the visible thing is telling the truth.</p>

<p><strong>Try it live:</strong> <a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a> — load a society run and scrub to a contested
tick · <strong>Read the code:</strong> <a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a></p>

<p><em>Built with Qwen Cloud (<code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> / <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> / <code class="language-plaintext highlighter-rouge">qwen3-max</code> via DashScope) and Alibaba
Cloud ECS, for the Qwen Cloud Global AI Hackathon.</em></p>]]></content><author><name>Kenny Ademolu</name></author><summary type="html"><![CDATA[Aftershock's whole result rests on one mechanic: an auction that resolves resource contention between Qwen agents before they act. We rebuilt the observatory's map into a Mission Control view that draws that auction live — and a code review caught the overlay blaming the wrong winning district. Making the mechanism visible meant holding the picture to the same honesty bar as the numbers.]]></summary></entry><entry><title type="html">We added native function calling. The benchmark told us to turn it off.</title><link href="https://bluntmachetti.github.io/aftershock/2026/06/14/we-added-function-calling-the-benchmark-told-us-to-turn-it-off.html" rel="alternate" type="text/html" title="We added native function calling. The benchmark told us to turn it off." /><published>2026-06-14T16:00:00+00:00</published><updated>2026-06-14T16:00:00+00:00</updated><id>https://bluntmachetti.github.io/aftershock/2026/06/14/we-added-function-calling-the-benchmark-told-us-to-turn-it-off</id><content type="html" xml:base="https://bluntmachetti.github.io/aftershock/2026/06/14/we-added-function-calling-the-benchmark-told-us-to-turn-it-off.html"><![CDATA[<p>A confession about hackathon incentives. The Qwen Cloud judging rubric weights <em>“sophisticated
use of Qwen Cloud APIs”</em> at about 30%, and <a href="https://github.com/bluntmachetti/aftershock"><strong>Aftershock</strong></a>’s
agent society had, until last week, been talking to Qwen the unglamorous way: every agent returns
strict JSON, the engine parses and validates it. Functional, but it doesn’t <em>look</em> like you’re
using the platform’s fanciest toy. So I did the obvious thing and wired up <strong>native function
calling</strong> — per-role <code class="language-plaintext highlighter-rouge">tools</code>, <code class="language-plaintext highlighter-rouge">tool_choice</code>, <code class="language-plaintext highlighter-rouge">parallel_tool_calls</code>, a dedicated <code class="language-plaintext highlighter-rouge">no_op</code> idle tool,
the works.</p>

<p>Then I did the thing I keep telling everyone else to do, and which is the entire point of this
project: I benchmarked it before believing in it.</p>

<p>The benchmark told me to turn it off.</p>

<p>This post is that result — because a negative result you measured honestly is worth more than a
feature you shipped on faith.</p>

<h2 id="what-native-function-calling-replaced">What “native function calling” replaced</h2>

<p>In JSON mode, each agent’s system prompt carries a compact prose contract — <em>here are your
actions, here are the fields, return JSON</em> — and the model returns a JSON object the engine
validates. In tool mode, that same action vocabulary becomes a set of OpenAI-style <code class="language-plaintext highlighter-rouge">tools</code>
definitions sent on every request, and the model emits structured <code class="language-plaintext highlighter-rouge">tool_calls</code> instead. Same
decisions, same auction, same negotiation protocol underneath. The <em>only</em> thing that changed is
how the action space is described to the model and how the model hands its choices back.</p>

<p>I kept the change behind a flag (<code class="language-plaintext highlighter-rouge">--society-tools</code>) and a <code class="language-plaintext highlighter-rouge">force_tools</code> switch, so I could run the
<em>exact</em> same five paired seeds both ways and let the numbers decide. Same disasters, byte for
byte. Only the calling convention differs.</p>

<h2 id="the-numbers">The numbers</h2>

<div class="readout-table">
  <div class="rt-cap"><span class="sq"></span>Society mode · 5 paired seeds</div>
  <table class="rt">
    <thead>
      <tr>
        <th>Society mode</th>
        <th>Lives saved (μ ± σ)</th>
        <th>Missions failed</th>
        <th>Cost / run</th>
        <th>Latency / run</th>
      </tr>
    </thead>
    <tbody>
      <tr class="win">
        <td>JSON contracts <span class="win-badge">default</span></td>
        <td><strong>103.2 ± 23.6</strong></td>
        <td><strong>0.4</strong></td>
        <td><strong>$0.042</strong></td>
        <td>120 s</td>
      </tr>
      <tr class="lose">
        <td>Native function calling</td>
        <td>98.2 ± 23.2</td>
        <td>0.8</td>
        <td>$0.083</td>
        <td>297 s</td>
      </tr>
    </tbody>
  </table>
</div>

<p>Read that top to bottom. Tool calling held lives saved <strong>within the noise</strong> — 98.2 vs 103.2 sits
comfortably inside one standard deviation (±23) — while <strong>cost roughly doubled and latency rose
~2.5×</strong>. The one place it moved measurably in the <em>wrong</em> direction was missions failed: 0.8 vs
0.4, twice as many lapsed deadlines, plausibly downstream of the higher latency (the
<code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> commander even hit a couple of request timeouts under the heavier payloads).</p>

<p>If you’re tempted to read the 5-life dip as a real regression, the paired-seed view talks you out
of it — tool mode was lower on four seeds, higher on one:</p>

<table>
  <thead>
    <tr>
      <th>seed</th>
      <th>11</th>
      <th>23</th>
      <th>37</th>
      <th>42</th>
      <th>57</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>JSON lives</td>
      <td>140</td>
      <td>86</td>
      <td>81</td>
      <td>98</td>
      <td>111</td>
    </tr>
    <tr>
      <td>tool lives</td>
      <td>138</td>
      <td>79</td>
      <td>90</td>
      <td>87</td>
      <td>97</td>
    </tr>
  </tbody>
</table>

<p>That’s noise with a faint downward lean, not a signal: four lower seeds out of five gives a
two-sided sign-test p of 0.375. So the honest summary is brutal in its
simplicity: <strong>native function calling cost twice as much to do the same job, slightly worse.</strong></p>
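
<p>The paired-seed reading can be reproduced from the table alone. This is a sketch using only the Python standard library, not the <code class="language-plaintext highlighter-rouge">aftershock ablation</code> harness itself; the variable names are mine:</p>

```python
from math import comb
from statistics import mean, stdev

# Lives saved per seed, copied from the paired-seed table above.
seeds = [11, 23, 37, 42, 57]
json_lives = [140, 86, 81, 98, 111]
tool_lives = [138, 79, 90, 87, 97]

# Paired differences: tool mode minus JSON mode on the same world.
diffs = [t - j for t, j in zip(tool_lives, json_lives)]
higher = sum(1 for d in diffs if d > 0)
ties = diffs.count(0)
lower = len(diffs) - higher - ties

# Exact two-sided sign test: how surprising is a split this lopsided
# if each seed were a fair coin flip?
n = higher + lower
k = max(higher, lower)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)

print("per-seed diffs:", dict(zip(seeds, diffs)))
print("mean diff:", mean(diffs))
print("JSON mean/sd:", round(mean(json_lives), 1), round(stdev(json_lives), 1))
print("tool mean/sd:", round(mean(tool_lives), 1), round(stdev(tool_lives), 1))
print("sign test p:", p_value)
```

<p>With five seeds, the sign test cannot reach p = 0.05 unless all five agree in direction, which is one reason to treat a 4–1 split as a lean rather than a finding.</p>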

<h2 id="why--and-why-no-amount-of-prompting-saves-it">Why — and why no amount of prompting saves it</h2>

<p>The cause isn’t a bad implementation or an untuned prompt. It’s structural, and it’s worth
internalizing if you build multi-agent systems.</p>

<p>A society run makes roughly <strong>240 model calls</strong> — six agents, dozens of ticks. In JSON mode the
action vocabulary rides along as a ~450-token prose contract. In tool mode the equivalent <code class="language-plaintext highlighter-rouge">tools</code>
schema is <strong>~1,000 tokens</strong>, and — this is the load-bearing sentence — <strong>it is re-sent on every
single one of those 240 calls.</strong> The schema is pure repeated input tokens. The most expensive
seat in the house, the <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> commander, ate 59% of the run’s cost on its own; the five
flash workers split the rest, their prompts dominated by that resent schema.</p>

<p>I genuinely tried to schema-trim my way out of it. I projected every option: strip the
auto-generated pydantic <code class="language-plaintext highlighter-rouge">title</code>/<code class="language-plaintext highlighter-rouge">default</code> noise, compact the descriptions, even gut them to empty
strings. The floor, with descriptions deleted entirely, is about <strong>$0.069/run</strong>. Still above the JSON
path’s $0.042, and still above the single-<code class="language-plaintext highlighter-rouge">qwen3-max</code>-does-everything baseline (~$0.065). You can’t
trim a per-call tax down to nothing when you pay it 240 times.</p>
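
<p>The per-call tax is easy to put in token terms. A back-of-the-envelope sketch using only the approximate figures quoted above (about 240 calls, a ~450-token contract, a ~1,000-token schema). It deliberately stops at tokens, since no per-token prices appear in this post:</p>

```python
CALLS_PER_RUN = 240          # "roughly 240 model calls" per society run
JSON_CONTRACT_TOKENS = 450   # approximate size of the prose contract
TOOL_SCHEMA_TOKENS = 1_000   # approximate size of the tools schema


def repeated_input_tokens(tokens_per_call: int, calls: int = CALLS_PER_RUN) -> int:
    """Input tokens spent re-sending the action vocabulary over a whole run."""
    return tokens_per_call * calls


json_total = repeated_input_tokens(JSON_CONTRACT_TOKENS)
tool_total = repeated_input_tokens(TOOL_SCHEMA_TOKENS)
extra_tokens = tool_total - json_total

print("JSON contract tokens per run:", json_total)
print("tool schema tokens per run:", tool_total)
print("extra input tokens per run:", extra_tokens)
```

<p>Under those assumptions the schema alone adds about 132,000 input tokens per run before a single decision is made, and trimming the schema only shrinks the multiplicand, never the 240.</p>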

<h2 id="the-decision-default-off-available-and-measured">The decision: default off, available and measured</h2>

<p>So Aftershock ships with <strong>JSON contracts as the default</strong> — the cost-optimal path, and the one
the published headline (“a society of small models matches a <code class="language-plaintext highlighter-rouge">qwen3-max</code> solo agent at ~35% lower
cost”) actually reflects. Native function calling stays <strong>implemented, tested, and benchmarked</strong>,
one flag away (<code class="language-plaintext highlighter-rouge">aftershock run --arm society --society-tools</code>), with its full ablation published in
the repo and in <a href="https://github.com/bluntmachetti/aftershock/blob/main/docs/FIELD-NOTES.md"><code class="language-plaintext highlighter-rouge">docs/FIELD-NOTES.md</code></a>.</p>

<p>I’ll defend that as the <em>more</em> sophisticated use of the API, not the less. Shipping the fancy
feature by default because the rubric rewards fancy features is cargo-culting. Wiring it up,
measuring it against a real baseline on identical worlds, discovering it doesn’t pay for itself in
<em>your</em> regime, and making that an informed, reversible default — that’s engineering judgment.
“We used native function calling and measured exactly what it costs” is a stronger sentence than
“we used native function calling.”</p>

<h2 id="the-transferable-lesson-with-the-scope-caveat-that-matters">The transferable lesson (with the scope caveat that matters)</h2>

<p>The general rule: <strong>in a high-frequency multi-agent system, per-call overhead dominates — measure
it before you adopt it.</strong> Anything you attach to <em>every</em> request (tool schemas, verbose system
prompts, retrieved context, elaborate output formats) gets multiplied by your call count, and in a
society that number is large. The fancier API is not free; it’s billed <em>per call</em>, and you make
hundreds of calls.</p>

<p>And the caveat that keeps this honest: this is <strong>not</strong> “function calling is bad.” For a
single-agent assistant, or any app that makes a handful of calls per task, a 1,000-token schema is
a rounding error and the reliability and ergonomics of structured tool calls are well worth it —
the math flips entirely. Function calling earns its keep at low call counts. It just gets taxed to
death at high ones. Know which regime you’re in.</p>

<p>That’s the whole ethos of Aftershock in miniature: don’t ask whether the shiny thing <em>can</em> work —
measure when it <em>actually</em> pays, on worlds identical enough that the answer means something. This
time the shiny thing didn’t pay, and saying so out loud is the result.</p>

<p><strong>Try it live:</strong> <a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a> · <strong>Read the code:</strong> <a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a></p>

<p><em>Built with Qwen Cloud (<code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> / <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> / <code class="language-plaintext highlighter-rouge">qwen3-max</code> via DashScope) and Alibaba
Cloud ECS, for the Qwen Cloud Global AI Hackathon.</em></p>]]></content><author><name>Kenny Ademolu</name></author><summary type="html"><![CDATA[An honest ablation: we added Qwen Cloud native function calling to the Aftershock agent society, benchmarked it on identical seeded worlds, and found it cost ~2× for no lives benefit. Why per-call tool-schema overhead dominates in high-frequency multi-agent systems — and why JSON contracts stayed the default.]]></summary></entry><entry><title type="html">When does a society of small Qwen models beat one big model? Building Aftershock.</title><link href="https://bluntmachetti.github.io/aftershock/2026/06/12/building-an-agent-society-on-qwen-cloud.html" rel="alternate" type="text/html" title="When does a society of small Qwen models beat one big model? Building Aftershock." /><published>2026-06-12T16:00:00+00:00</published><updated>2026-06-12T16:00:00+00:00</updated><id>https://bluntmachetti.github.io/aftershock/2026/06/12/building-an-agent-society-on-qwen-cloud</id><content type="html" xml:base="https://bluntmachetti.github.io/aftershock/2026/06/12/building-an-agent-society-on-qwen-cloud.html"><![CDATA[<p>Most multi-agent demos are theater. Agents take turns talking, politely agree with each
other, duplicate half the work, and somewhere off-camera a single strong model with a decent
prompt quietly does the whole task better and cheaper. I’ve built enough of these to find the
genre uncomfortable. So when I started <a href="https://github.com/bluntmachetti/aftershock"><strong>Aftershock</strong></a>
for the Qwen Cloud hackathon, I didn’t want to ask <em>can</em> agents collaborate. I wanted to ask the
harder, more falsifiable question:</p>

<blockquote>
  <p><strong>When is a society of small models actually better than one big model — and can you prove it?</strong></p>
</blockquote>

<p>This post is the build journey: the bet, why Qwen Cloud’s model line-up made the architecture
possible, what the numbers said (including the parts that didn’t flatter the idea), and what I’d
tell anyone building agent societies next.</p>

<p>Live demo: <strong><a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a></strong> · Code: <strong><a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a></strong></p>

<h2 id="the-bet-disaster-response-scored-on-lives">The bet: disaster response, scored on lives</h2>

<p>To make “better” mean something, you need a task where coordination failures have a cost. So
Aftershock is a disaster-response simulator. An earthquake (or, in real-data mode, a hurricane)
hits a city. Missions appear on a map — a collapsed school with people trapped, a hospital down
to its last hours of generator fuel, flooded neighborhoods. Six agents with distinct roles —
incident commander, medical, fire &amp; rescue, logistics, infrastructure, public comms — have to
divide the work and fight over a fixed pool of ambulances, rescue crews, fire engines, and fuel
before deadlines expire.</p>

<p>Crucially, every run is <strong>scored</strong>: lives saved, response latency, missions failed, and cost.
Not vibes. Numbers.</p>

<h2 id="why-qwen-cloud-made-the-architecture-possible">Why Qwen Cloud made the architecture possible</h2>

<p>The whole thesis depends on small models being cheap enough that you can afford <em>many</em> of them.
Qwen Cloud’s tiered line-up is what made the design economical, so I leaned into a structure I
think of as <strong>cost-tiered cognition</strong>:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">qwen3.5-flash</code></strong> runs the five worker roles — fast, cheap, good enough for a typed role
decision every tick.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">qwen3.5-plus</code></strong> is the incident commander and arbitrator — a little more capable, sitting
at the one position where judgment under contention actually matters.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">qwen3-max</code></strong> plays two parts: the <em>solo baseline</em> (one big model doing everything) and the
<em>after-action analyst</em> that writes the post-run report.</li>
</ul>

<p>Every agent talks to Qwen Cloud through the OpenAI-compatible chat-completions endpoint on
DashScope-International and is required to return strict JSON. The simulator validates every
decision <em>before</em> it touches the world, and rejected decisions are fed back to the agent on its
next turn with a named reason. A per-tick token/cost ledger means Qwen usage is visible in the
results, not hidden behind the demo — you can read the exact dollars-per-run in the benchmark
tables.</p>
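
<p>The validate-then-feed-back loop described above can be sketched in a few lines. Everything named here (<code class="language-plaintext highlighter-rouge">World</code>, <code class="language-plaintext highlighter-rouge">Agent</code>, <code class="language-plaintext highlighter-rouge">validate</code>, the reason strings) is hypothetical and only illustrates the shape of the idea, not Aftershock’s actual engine code:</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class World:
    mission_ids: Set[str]
    pool: Dict[str, int]  # resource name to units remaining


@dataclass
class Agent:
    name: str
    # Rejections from the previous turn, shown in the next observation.
    rejections: List[str] = field(default_factory=list)


def validate(decision: dict, world: World) -> Optional[str]:
    """Return a named rejection reason, or None if the decision is valid."""
    if decision.get("mission_id") not in world.mission_ids:
        return "unknown_mission_id"
    if decision.get("resource") not in world.pool:
        return "unknown_resource"
    if world.pool[decision["resource"]] == 0:
        return "pool_exhausted"
    return None


def apply_decision(agent: Agent, decision: dict, world: World) -> bool:
    """Validate before touching the world; remember a rejection for one turn."""
    reason = validate(decision, world)
    agent.rejections = [] if reason is None else [reason]
    if reason is not None:
        return False
    world.pool[decision["resource"]] -= 1
    return True


world = World(mission_ids={"m1"}, pool={"ambulance": 1})
medic = Agent("medical")
ok = apply_decision(medic, {"mission_id": "m1", "resource": "ambulance"}, world)
bad = apply_decision(medic, {"mission_id": "m9", "resource": "ambulance"}, world)
print(ok, bad, medic.rejections)
```

<p>The point of the design is that an invalid decision never mutates the world, and the agent sees a named reason on its next turn instead of silently repeating the mistake.</p>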

<p>This is the part I want to underline for anyone else building on Qwen Cloud: the price gap
between flash and max is the entire reason a “team of cheap specialists vs. one expensive
generalist” comparison is even interesting. Five flash workers + a plus commander cost about
<strong>$0.042 a run</strong>; one qwen3-max doing everything costs <strong>$0.065</strong>. The architecture question only
exists because the cheap tier is genuinely cheap.</p>

<h2 id="making-it-measurable-not-impressive">Making it measurable, not impressive</h2>

<p>The thing that kills most agent benchmarks is that every run sees a different world, so the
result is just a story. Aftershock is built around three properties that turn anecdotes into
evidence:</p>

<ol>
  <li><strong>Determinism.</strong> All randomness flows from one seeded RNG; IDs come from counters; the engine
never calls <code class="language-plaintext highlighter-rouge">time.now()</code> or <code class="language-plaintext highlighter-rouge">random</code> in the simulation path. Same seed + same agent decisions
= byte-identical run, replayable forever. Every arm faces <em>exactly</em> the same disasters.</li>
  <li><strong>A typed negotiation protocol.</strong> Agents don’t free-text at each other. They emit typed
proposals — resource requests, task handoffs, escalations, information shares — and an auction
resolves contention atomically every tick. Coordination is a <em>mechanic</em>, not a chat transcript.</li>
  <li><strong>Validation with rejection feedback.</strong> Small models love to invent entity IDs and repeat
invalid actions. Engine-side validation plus a short rejection memory in the next observation
fixed far more than extra prompting ever did.</li>
</ol>
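
<p>The determinism property is the easiest of the three to illustrate. A toy sketch, assuming a single seeded RNG, counter-based IDs, and a digest over canonical JSON; the <code class="language-plaintext highlighter-rouge">run</code> function and record fields are invented for the example and are not the real engine or the real <code class="language-plaintext highlighter-rouge">aftershock verify</code> check:</p>

```python
import hashlib
import itertools
import json
import random


def run(seed: int, ticks: int = 20) -> str:
    """Simulate a toy world and return a digest of its run record."""
    rng = random.Random(seed)     # one seeded RNG, never the global module
    next_id = itertools.count(1)  # IDs from a counter, not a clock or uuid
    records = []
    for tick in range(ticks):
        if rng.random() > 0.5:
            records.append({
                "tick": tick,
                "mission": f"m{next(next_id)}",
                "severity": rng.randint(1, 5),
            })
    # Canonical serialization so equal records always hash equally.
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = run(42)
second = run(42)
print(first == second)
```

<p>Anything that reads the wall clock or unseeded randomness inside <code class="language-plaintext highlighter-rouge">run</code> would break the equality, which is what makes a digest comparison a cheap regression check.</p>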

<p>Then I ran the same seeded disasters four ways — scripted heuristics, one qwen3-max solo, a flat
swarm of five flash agents with <em>no</em> protocol, and the structured society — and let the numbers
talk.</p>

<h2 id="the-result-that-mattered">The result that mattered</h2>

<table>
  <thead>
    <tr>
      <th>arm</th>
      <th>models</th>
      <th>lives saved (mean±sd)</th>
      <th>missions failed</th>
      <th>cost/run</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>society</strong></td>
      <td>flash ×5 + plus commander</td>
      <td><strong>103.2 ± 23.6</strong></td>
      <td>0.4</td>
      <td>$0.042</td>
    </tr>
    <tr>
      <td>solo</td>
      <td>qwen3-max</td>
      <td>104.2 ± 13.6</td>
      <td>0.4</td>
      <td>$0.065</td>
    </tr>
    <tr>
      <td>swarm</td>
      <td>flash ×5 (no protocol)</td>
      <td>75.6 ± 15.4</td>
      <td>3.0</td>
      <td>$0.016</td>
    </tr>
    <tr>
      <td>scripted</td>
      <td>heuristics ($0)</td>
      <td>106.8 ± 18.0</td>
      <td>0.2</td>
      <td>$0.00</td>
    </tr>
  </tbody>
</table>

<p>Two findings, both <em>causal</em> because every arm faced byte-identical worlds:</p>

<ul>
  <li><strong>The coordination protocol is worth +28 lives a run.</strong> The same five <code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> models,
with vs. without the negotiation protocol, went from 75.6 to 103.2 lives saved and from 3.0 to
0.4 missions failed. The run records show why: the protocol-less swarm burned ~160 decisions
racing each other for empty resource pools; the society resolved that contention in the auction
<em>before</em> acting.</li>
  <li><strong>The society matches the flagship for a third less money.</strong> A team of cheap <code class="language-plaintext highlighter-rouge">qwen3.5-flash</code>
workers under a <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> commander saved as many lives as one <code class="language-plaintext highlighter-rouge">qwen3-max</code> doing
everything (103.2 vs 104.2 — inside the noise), at ~35% lower cost and over 1.5× faster, because
five small parallel calls beat one big sequential one.</li>
</ul>
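
<p>Both headline figures fall straight out of the table. A quick arithmetic check using only the means and costs printed above:</p>

```python
# Means and costs copied from the results table above.
society_lives, swarm_lives = 103.2, 75.6
society_cost, solo_cost = 0.042, 0.065

# Lives per run attributable to the protocol (same five flash workers).
protocol_gain = society_lives - swarm_lives

# Fraction by which the society is cheaper than the solo flagship.
cost_saving = 1 - society_cost / solo_cost

print(f"protocol gain: {protocol_gain:.1f} lives/run")
print(f"cost saving vs solo: {cost_saving:.0%}")
```

<p>The gain is 27.6 lives, which the post rounds to +28, and the saving is about 35%.</p>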

<p>That’s the headline: <strong>with the right coordination structure, small Qwen models reach big-model
outcomes at lower cost.</strong></p>

<h2 id="the-honest-part-the-bit-im-proudest-of">The honest part (the bit I’m proudest of)</h2>

<p>Here’s what a hype demo would hide and I put on the front page instead: a well-tuned <strong>scripted</strong>
baseline, using the exact same negotiation protocol, stays competitive with every LLM arm. In
other words, <em>the protocol carries more of the result than the LLMs do</em>.</p>

<p>That’s not a failure of the project. It <strong>is</strong> the project. Agent societies don’t need more agents
talking — they need institutions: roles, contracts, arbitration, validation, measurement. The
LLMs are the easy part; the mechanism is the hard part.</p>

<p>I logged every behavioral finding, including the negative ones, in
<a href="https://github.com/bluntmachetti/aftershock/blob/main/docs/FIELD-NOTES.md"><code class="language-plaintext highlighter-rouge">docs/FIELD-NOTES.md</code></a>.
My favorite is the <strong>memory paradox</strong>: my first naive after-action-report memory loop (let the
qwen3-max analyst write free-text “lessons” and feed them into the next briefing) made outcomes
<em>worse</em> on paired controls. Lessons expressed outside the agents’ actual action space are just
noise. Memory has to live in the action space or it hurts.</p>

<h2 id="grounding-it-in-reality">Grounding it in reality</h2>

<p>Synthetic seeds prove the comparison, but I wanted the demo grounded. So Aftershock can compile
scenarios <strong>offline from real open incident data</strong> and show that incident stream’s <em>real</em>
first-on-scene latency on screen as the baseline. The flagship pack is <strong>Hurricane Ida over New
York, the night of 2021-09-01</strong>, built from FDNY EMS and Fire dispatch records via NYC Open Data:
2,003 real EMS incidents, ~16.5% of calls held, a 948-second mean response time vs. ~524 s on a
calm night two weeks earlier.</p>

<p>And it never overclaims. Every field on screen is labeled <strong>REAL / MAPPED / INFERRED /
SYNTHETIC</strong>: the demand arrival and the latency baseline are real; mission severity is mapped;
lives-at-risk is inferred; outcomes are a simulated model. I do not claim agents beat real
responders on lives saved — only that they face the <em>real demand the world actually produced</em>.
The compiler runs offline, so a real scenario is still byte-deterministic.</p>

<h2 id="shipping-it-on-alibaba-cloud">Shipping it on Alibaba Cloud</h2>

<p>The observatory — a React + FastAPI app where you can scrub any run tick-by-tick, inspect each
agent’s decisions, compare arms side by side, and start live runs — is deployed on an Alibaba
Cloud ECS instance with Docker Compose behind a Caddy HTTPS front door. The deployment is
reproducible straight from the public repo: clone, drop in your <code class="language-plaintext highlighter-rouge">DASHSCOPE_API_KEY</code>,
<code class="language-plaintext highlighter-rouge">docker compose up</code>. Run records stream over WebSockets, and an MCP spectator server exposes the
same data to any MCP client.</p>

<h2 id="what-id-tell-the-next-person-building-an-agent-society">What I’d tell the next person building an agent society</h2>

<ol>
  <li><strong>Decide what “better” means before you build, and make it a number.</strong> Lives, latency, cost —
on byte-identical worlds. Without that, you have a story, not a result.</li>
  <li><strong>Spend your model budget where judgment lives.</strong> Cost-tiered cognition (flash workers, a plus
arbitrator, a max analyst) got big-model outcomes at small-model prices on Qwen Cloud.</li>
  <li><strong>Coordination is a mechanic, not a conversation.</strong> A typed protocol + an auction beat more
prompting and more agents, by a wide margin.</li>
  <li><strong>Validate and give feedback.</strong> Reject invalid actions with reasons and remember them for one
tick; small models recover instead of corrupting the run.</li>
  <li><strong>Publish your negative results.</strong> The scripted-is-competitive finding and the memory paradox
are the most credible things in the whole project.</li>
</ol>

<p>The larger goal is to make multi-agent systems <em>falsifiable</em>: not “look how many agents are
talking,” but “here is the coordination mechanism, here is the baseline, and here is the measured
gain.” Qwen Cloud’s cheap-enough small models are what make that experiment affordable to run.</p>

<p><strong>Try it live:</strong> <a href="https://aftershock.redoubtlabs.dev">https://aftershock.redoubtlabs.dev</a> · <strong>Read the code:</strong> <a href="https://github.com/bluntmachetti/aftershock">https://github.com/bluntmachetti/aftershock</a></p>

<p><em>Built with Qwen Cloud (<code class="language-plaintext highlighter-rouge">qwen3.5-flash</code> / <code class="language-plaintext highlighter-rouge">qwen3.5-plus</code> / <code class="language-plaintext highlighter-rouge">qwen3-max</code> via DashScope) and Alibaba
Cloud ECS, for the Qwen Cloud Global AI Hackathon.</em></p>]]></content><author><name>Kenny Ademolu</name></author><summary type="html"><![CDATA[The build journey of Aftershock — cost-tiered cognition on Qwen Cloud, a +28-lives coordination result, the honest negative results, and a real NYC Hurricane Ida scenario.]]></summary></entry></feed>