Aftershock — Field Log

The fix that would have only fooled the scoreboard — and the one tuning that actually paid.

2026-06-16T21:00:00+00:00

In the last post I built a ruler for Aftershock — paired ablations, a sign test, a power curve — before tuning the Qwen agent society at all. This post is what happened when I finally used it: a session of four levers, each gated by the ruler.

I expected the ruler to catch noise — a win that’s really a coin flip. It did that. But it also kept catching something subtler and, honestly, more embarrassing: levers that would work — move the exact number I aimed them at — and change nothing that matters. I came within one code edit of shipping a “fix” whose entire effect would have been to make a metric look better while the system behaved identically.

This is a post about outcome-neutral metrics, and the discipline of asking, before you optimize a number, whether the number does anything.

The setup: the society’s real story is cost and discipline, not lives

A quick recap of where the measurement left me. On Aftershock’s task, well-tuned scripted bots match every LLM arm on lives saved — the negotiation protocol, not model IQ, carries the result. And under genuine triage there’s no detectable lives edge of the society over an uncoordinated swarm. So the society’s honest, defensible value isn’t more lives — it’s cost-efficiency (small models, coordinated, at a fraction of a big model’s price) and conformance (a deterministic measure of whether agents actually follow their doctrine).

That reframing matters for everything below: it means a “win” on lives is usually noise, and a win worth shipping is one that moves cost or conformance — without quietly costing the other.

Lever 1 — does discipline cost lives? (an old scare, resolved)

Early on I’d given every agent a two-tier playbook — a “doctrine” — and a checker that scores how well they follow it. Writing the doctrine clearly raised conformance. But a single-seed run had a worrying side effect: the doctrine run saved fewer lives (96 vs 113). Discipline isn’t free — every instruction you add to a prompt is more to read and obey. I’d left it flagged as outcomes TBD.

n=1 is not a result, so I built a same-arm ablation: run the society with and without doctrine on the same five seeds, and — because conformance is the deterministic, low-variance signal — make the report lead with conformance and treat lives as the noisy secondary.

Doctrine on vs off — paired, 5 seeds

Signal	Off	On	Δ
team conformance	0.696	0.852	+0.156 (all 5 seeds)
lives saved	100.4	104.2	+3.8 (within noise)

The scare was a small-sample artifact. Doctrine raises conformance on every seed, and lives are flat-to-slightly-up — the n=1 “−17 lives” was noise. Discipline here is not paid for in lives. That clears doctrine as a real, free conformance lever and tells me where the rest of the session should aim: at the one role that still won’t follow it.

Lever 2 — the agent that won’t follow the rules, and the guard I almost built to cheat

That role is the infrastructure agent. Even with doctrine, it’s the one specialist that stays below 0.70 conformance. The diagnostic showed why — three distinct failures, not one:

Urgency inflation: it marks every request urgency 9, even low-severity ones with a distant deadline.
Resubmitting rejected work: it re-issues a decision the engine just rejected.
Impossible repairs: it calls repair_road on districts that aren’t blocked, or when there’s no repair crew free.

First I did the obvious thing: rewrote its prompt to tie each rule to the exact fields it already sees in its observation. That cleanly fixed urgency (0.35 → 1.00 conformance on that rule) — a numeric threshold the model can apply. But the impossible-repairs rule barely moved. Telling the flash model “only repair a district on the BLOCKED line, and only if crew ≥ 1” just… didn’t take. Precondition gating is harder to prompt than scalar calibration.

So I reached for the backlog idea: a deterministic guard — enforce the precondition in code so an invalid repair can never happen. And then, tracing the code to place it, I stopped.

The engine already rejects those repairs. An invalid repair_road is validated and declined before it consumes anything — zero resources, no effect on the world. And the conformance metric counts the agent’s attempt. So a guard that intercepted the attempt would do exactly one thing: make the conformance number go up by hiding the agent’s behavior from the record. The world would play out identically — same lives, same cost — and the scoreboard would look better.

That is the cleanest example I’ve hit of an outcome-neutral metric: a number you can move without changing anything real. Building the guard would have been reward-hacking my own benchmark. I didn’t build it. The honest version of the question isn’t “how do I make the number go up” — it’s “is the behavior actually fixable, and does fixing it matter?”

So I tested that instead: I gave just the infra agent a stronger model (qwen3.5-plus instead of flash) and measured, changing nothing else.

Infra agent: flash vs plus (model isolated, 5 seeds)

Signal	flash	plus	Δ
impossible-repair rule	0.560	0.957	+0.40
infra conformance	0.863	0.986	+0.123 (all 5 seeds)
lives saved	103.6	103.8	flat (p = 1.0)
cost / run	$0.041	$0.054	+33%

So the stickiness was a model-capability floor, not a prompt bug: the bigger model gates the preconditions the smaller one ignores. A real finding — but look at the trade. It buys near-perfect conformance on an outcome-neutral rule (the bad repairs cost no lives to begin with) for +33% cost. Paying a third more to perfect a metric that doesn’t move lives, on an arm whose whole pitch is cost-efficiency, is the wrong default.

So it ships as an opt-in operating mode — one flag, --role-model infrastructure=qwen3.5-plus — not the published default. Flip it when discipline matters more than dollars; leave it off for the lives-per-dollar story. The finding (it’s a capability floor) is the durable result, independent of which default you choose.

Lever 3 — the one that actually paid: stop re-sending the prompt

With lives ruled out and conformance either free (doctrine) or expensive (infra model), I turned to the axis the society can actually win on: cost. And here the harness did its best work as a profiler before it did anything as a judge.

Where does the money go? The commander sends 31,900 prompt tokens per run but only 2,300 of completion. Decomposing one call: a 941-token system prompt re-sent on every single tick, plus only ~160 tokens of actual world observation. Across all six agents, ~85% of a run’s prompt tokens are the same static prefixes, re-sent every tick — roughly 60% of total cost is the model re-reading instructions it already read last tick.

The obvious fix is caching — most providers bill a repeated prefix at a discount. So I probed it: a ~1,600-token stable prefix, sent three times in a row, on both Qwen models. The response told me plainly:

prompt_tokens_details: { text_tokens: 1589 } — no cached_tokens field, no discount, full price every call.

DashScope’s international compatible-mode endpoint doesn’t cache our prompts. That’s a clean negative result worth knowing: it means our cost ledger is accurate, not pessimistic — there’s no free accounting win hiding in cache hits. The only way to cut a re-sent prompt is to make it shorter.

Doctrine is off-limits (it’s the conformance lever from Lever 1) and the role instructions are behavioral, so the target was the output contract — the JSON-format boilerplate every agent carries. I compacted the multi-line schema to one line, deduplicated rules that were stated twice, and tightened the proposal descriptions — keeping every capability and field name, just terser. Commander prefix: 941 → 835 tokens. Then the paired A/B, against an identical baseline where only the contract differs:

Trimmed contract vs full — paired, 5 seeds

Signal	full	trimmed	Δ
cost / run	$0.0411	$0.0353	−14.0%
lives per dollar	2,522	3,046	+21%
lives saved	103.6	107.6	+4.0 (up, not significant)
conformance	0.916	0.879	−0.037 (p = 0.375, n.s.)

A real, largely deterministic 14% cost cut — the token reduction is fixed; only the small completion varies — for a 21% gain in lives-per-dollar, the headline efficiency number. Lives even drifted up. The one thing to respect: conformance ticked down 0.037, but the sign test says that’s not a credible effect (p=0.375), and it sits inside the run-to-run band conformance has always wandered in (0.85–0.92). By the same bar I used to reject the +16-life ghost last post, I can’t claim this conformance dip either — so I kept the trim and logged the dip as a watch-item, not a regression.

The deeper lesson hid in the disappointment: the pure-redundancy trim was tiny (~16 tokens). Most of that 941-token prefix is irreducible — doctrine the agents need, decision vocabulary they act on, a schema that keeps the JSON valid. The society’s ~$0.04/run is mostly structural. −14% is close to the safe ceiling without touching doctrine or model tier. Worth banking; not a bottomless well.

The pattern: four levers, one bar

Before chasing each lever I made a workflow of read-only agents scope it against one question — does moving this metric change a real outcome? It skipped four backlog ideas outright (a deadline-sort for a failure mode that’s already near zero; reducing “redundant” bids the auction ignores for free; a temperature sweep premised on a cost number that turned out wrong). Each was an outcome-neutral mirage, the I1-guard pattern again: a number you can move that moves nothing else.

What survived the bar is exactly the society’s honest story:

Doctrine raises conformance on every seed, free of lives cost.
Conformance has a price ceiling — the last stubborn role needs a bigger model, which costs more than the (outcome-neutral) metric is worth as a default, so it’s a switch.
Cost is real and movable — −14% by not re-sending what the model already read, the cap set by how much of the prompt is genuinely load-bearing.

No grand lives breakthrough. But three claims I can defend in front of a judge, and a fistful of features I didn’t ship because the measurement said they’d only flatter the scoreboard.

The takeaway

Last post’s lesson was a single measurement is a coin flip wearing a lab coat — pair it, test it, know when to stop. This session added the sharper one:

Before you optimize a metric, prove the metric does something. The most seductive failure isn’t a noisy win — it’s a clean win on a number that’s decoupled from reality. I could have made my conformance score jump by intercepting decisions the engine already discards; it would have demoed beautifully and meant nothing. The guard against that isn’t more statistics. It’s tracing the metric back to an outcome — lives, dollars, missions — and refusing to celebrate a move that doesn’t reach one.

Make it measurable. Believe only what survives the measurement. And check that the thing you’re measuring is connected to the thing you actually care about.

Try it live: https://aftershock.redoubtlabs.dev · Read the code: https://github.com/bluntmachetti/aftershock (the levers are in docs/FIELD-NOTES.md §18–21; the cost trim is --role-model / the output contract)

Built with Qwen Cloud (qwen3.5-flash / qwen3.5-plus / qwen3-max via DashScope) and Alibaba Cloud ECS, for the Qwen Cloud Global AI Hackathon.

Build the ruler first. It killed our biggest feature — and a +16-life win that wasn’t real.

2026-06-15T22:00:00+00:00

My backlog for Aftershock had a clear headliner. Of ~44 ideas for making the Qwen agent society save more lives, one was marked the single biggest lever: a smarter resource auction (partial grants instead of all-or-nothing), hypothesized at +10–15 lives a run. The obvious move was to build it.

I didn’t build it. I built a ruler first — the boring statistical plumbing that tells you whether a change actually did anything. And the ruler earned its keep three times over in one sitting: it killed the headliner before I wrote a line of it, it caught a +16-life win that wasn’t real, and it made me put a caveat on my own flagship number.

This is a post about the least glamorous part of the project — measurement — and why it turned out to be the most valuable.

Why a ruler, before any tuning

Aftershock’s engine is byte-deterministic: same seed, same disaster, same outcome. But the agents are not. The published society result — 103.2 ± 23.6 lives — has a ± for a reason, and the first thing the ruler did was tell me exactly what that ± is made of.

Two cheap experiments:

Does Qwen Cloud honor a sampling seed? I threaded a deterministic per-call seed through every request and ran the same world twice. The two runs diverged on the first tick, exactly like two un-seeded runs. DashScope accepts seed but doesn’t make sampling reproducible — so the LLM arms are irreducibly stochastic, and that ± really is sampling noise, not measurement slop.
How much of the noise is the model, and how much is the world? Running each seed three times and decomposing the variance: 79% of it is which disaster you drew (between-seed), only 21% is the model’s run-to-run wobble (within-seed).

That second number is the whole game. If most of your variance is the scenario, then comparing a mean against a mean is mostly comparing coin flips — unless you pair: run the control and the treatment on the same seeds and difference out the world. Concretely, to detect a +10-life effect at decent power you need ~10 paired seeds, versus ~45 unpaired. Same data, a quarter of the runs, because pairing deletes the 79%.

So the ruler is: run both arms on identical seeds, take the per-seed difference, and judge it with a sign test, a bootstrap confidence interval, and a power curve — aftershock ablation. Nothing fancy. Just enough to stop me believing a number I shouldn’t.

Episode 1 — the diagnostic killed the headliner

The “biggest lever” exists to fix a specific pathology: a priority inversion. Under all-or-nothing granting, a high-priority incident that needs three ambulances can lose the whole pool to a later, lower-priority incident that happens to fit the two that are left. Partial grants would fix that. Worth +10–15 lives — if it happens.

So before building the fix, I built a diagnostic that reads the recorded auction outcomes and counts how often it actually happens. The answer, across nine society runs:

Zero.

Then I made the world harder — I added a knob to tighten the resource pools — figuring scarcity would surely manufacture the contention. Still zero. I cranked it to the harshest setting the sim allows (every pool at 2, where a fifth of all incidents fail outright):

Priority inversions found — the pathology the "biggest lever" fixes

World	Society runs	Priority inversions
default	9	0 (of 59 contested losses)
tight pools	6	0 (of 74 contested)
harshest (all pools = 2)	5	6 (of 794 losses — <1%)

Across the default and tight worlds: 133 contested auction losses, every single one legitimate — the loser always had lower-or-equal priority than the winner. The auction’s arbitration is just correct: it serves the highest-priority incident first, and the agents don’t over-request in the way that would let a low-priority bid steal the remainder. Only at brutal scarcity do a handful of inversions appear, and even then they’re dwarfed 70-to-1 by plain shortage (the pool is simply empty — there’s nothing for a partial grant to split).

The verdict wrote itself: don’t build the lever. It optimizes a problem this system doesn’t have. That’s the harness paying for itself before it cost a thing — a feature un-built is a feature you don’t have to maintain, document, or quietly regret.

(A humbling footnote: the first version of that inversion detector was dead code — by the resolver’s ordering it could never actually fire — and an adversarial code review caught it. Even the ruler needed measuring. I rebuilt it to reconstruct winners-versus-losers from the recorded grants, and then trusted the zeros.)

Episode 2 — the +16 that wasn’t

Fine: the auction isn’t the bottleneck on the easy task. But the easy task is easy — ~96% of incidents get resolved, so there’s barely any room for any change to move lives. The honest next question: under genuine triage — a world where you can’t save everyone and the order you serve people in decides who lives — does the society’s coordination actually beat the uncoordinated swarm by more?

I dialed in that world (the harshest setting: lives saved ≈ lives lost) and ran a paired society-vs-swarm ablation. Five seeds. The result looked great:

Δ = +16.2 lives, 95% CI [+3.2, +29.6] — excludes zero. Society stable at 63–72 lives across every seed; swarm volatile (32–66), collapsing on the two hard draws (39 and 32 lives).

A clean story practically wrote itself: coordination buys graceful degradation under stress. The harness’s own auto-verdict even printed “credible improvement.” I wanted to believe it.

But the same readout had two warnings I’d built in precisely so I couldn’t ignore them: the sign test was non-significant (p=0.375) and the statistical power was 0.57. The confidence interval said yes; the more conservative tests said not yet. The honest move with a disagreement like that is not to pick the answer you like — it’s to add seeds. So I added six more (the harness resumes the runs it already has, so only the new ones cost anything) and re-ran at eleven.

Society vs swarm, harsh world — the same effect at n=5 and n=11

Seeds	Δ lives	95% CI	Sign test	Power
n = 5	+16.2	[+3.2, +29.6] — excludes 0	p = 0.375	0.57
n = 11	+4.7	[−4.0, +14.7] — includes 0	p = 1.0	0.15

The effect evaporated. Four of the six new seeds came back with the swarm beating the society. The entire +16 had been carried by two lucky draws where the swarm happened to faceplant; the “society is rock-stable” story was the same small-sample mirage (at eleven seeds the society’s range is 32–80, not the tidy 63–72 I’d seen at five). There is no detectable lives advantage here. The power curve says confirming even the residual +4.7 would take ~88 seeds — which is the harness’s polite way of saying stop.

This is the single most important thing the ruler did. A +16-life result with a confidence interval that excludes zero is exactly the kind of finding you screenshot, put in a slide, and ship. It was noise. I caught it because I’d committed in advance to a test I couldn’t argue with after the fact.

Episode 3 — turning the ruler on my own headline

If a +16 can be a ghost, what about the project’s flagship claim — that the coordination protocol is worth +28 lives? I owed it the same scrutiny, so I ran the already-published numbers through the same paired test. The result, to my relief, held up where the +16 didn’t — but not cleanly:

Directionally solid: society ≥ swarm on all five seeds (four wins, one tie, zero losses). Unlike the harsh-world ghost that went both ways, this gap is real in sign. The qualitative claim — the auction helps — survives.
Magnitude soft: the precise “+28” is leaned on hard by a single seed (+88 there; the other four average +12.5), the sign test is p=0.125, and the power is 0.42. To put a tight interval on the number you’d want ~25 seeds.

So I’m keeping the claim and adding a caveat I can defend: the coordination protocol reliably helps; the headline figure is an n=5 mean with a wide interval. That’s a weaker sentence than “+28 lives,” and it’s the true one.

What the ruler cost, and what it bought

The whole measurement campaign — 41 paired runs across every world — cost $1.41 in Qwen Cloud tokens, tallied from the run ledgers. For that, I got three results I’d have paid far more for in wasted effort and overconfidence:

Don’t build the “biggest lever.” It fixes a pathology that occurs <1% of the time.
There’s no lives edge under triage. The +16 was a false positive; the honest story for the society is cost-efficiency and discipline, not out-saving the swarm under scarcity.
Caveat the +28. Real in direction, soft in magnitude.

None of those are features. Two are flat “don’ts” and one is a hedge. And they’re the most valuable output of the session — because the alternative was building a feature against a problem that isn’t there, and believing a win that wasn’t real, in a slide deck, in front of judges.

The takeaway

In a stochastic system, a single measurement is a coin flip wearing a lab coat. The plumbing that protects you from it — pair to kill the variance you don’t care about, judge with a test you picked before you saw the answer, and let a power curve tell you when to stop — isn’t bureaucracy. It’s the difference between knowing and hoping. Build that ruler before you build the thing you’re itching to build, because the ruler’s best days are the ones where it tells you no.

And hold your own dashboard to the same bar: my ablation tool’s auto-verdict cheerfully called the +16 “credible,” and the sign test is the only reason I didn’t believe it. The tool that keeps you honest still needs you to read all of it.

Make it measurable. Then believe only the part that survives the measurement.

Try it live: https://aftershock.redoubtlabs.dev · Read the code: https://github.com/bluntmachetti/aftershock (the harness is aftershock ablation / aftershock diagnose; the findings are in docs/FIELD-NOTES.md §13–17)

Built with Qwen Cloud (qwen3.5-flash / qwen3.5-plus / qwen3-max via DashScope) and Alibaba Cloud ECS, for the Qwen Cloud Global AI Hackathon.

We drew the agent auction on the map. A review caught it pointing at the wrong district.

2026-06-15T16:00:00+00:00

The headline number in Aftershock is that a coordination protocol is worth +28 lives a run — the same five qwen3.5-flash workers save 103 lives instead of 75 once you put an auction between them and the shared pool of ambulances, rescue crews, and fire engines. The protocol-less swarm burns its turns racing for resources that are already gone; the society resolves that contention before anyone acts.

That sentence is the entire project. And until last week you could only read it — in a scrolling text feed of granted / pool exhausted rulings off to the side of the map. The thing that matters most was the thing you couldn’t see happen.

So this build was about making the mechanism legible: a Mission Control rebuild of the map that draws the auction itself, live, on the city. It’s also a small parable about how a picture can lie exactly as easily as a number — and on a project whose whole pitch is honesty, the picture doesn’t get a pass.

From “map” to mission control

The old map was honest but generic: district blocks, mission pins, a panic gauge. Fine. But it read like a dashboard, not an emergency operations center — and it buried the one event you’d want a commander watching for.

The rebuild is an EOC command view. A condition state up top (RED/AMBER/BLUE/GREEN, driven by the nearest deadline, worst severity, and public panic), a tile map of the city, and — the point of the whole exercise — a contention overlay.

When two incidents reach for the last unit of a resource in the same tick, the map draws it: a dashed line from the district that lost to the district that won, labeled with the resource (AMB CONTESTED, RPR CONTESTED), a halo on both incidents. You watch the auction arbitrate the Qwen agents’ requests in real time. The qwen3.5-plus commander and the kernel’s auction stop being a feed you skim and become a thing you see resolve, tick by tick.

And it’s derived from data that was already there. Every tick record carries the agents’ typed resource_request proposals and the auction’s rulings. The overlay is a pure function of those — no new engine state, no new field, nothing for the deterministic core to even notice. Same pack, same seed, byte-identical run, now with the contention drawn on top.

The part a review caught

Here’s where it gets honest. The first version of that overlay was subtly wrong, and it took an automated code review on the pull request to catch it.

The auction can grant the same resource to several incidents before the pool runs dry. Picture one tick, three incidents reaching for ambulances against a pool that can only cover two of them:

incident	district	ambulance
m1	Harbor	granted
m2	Hospital	granted (the last one)
m3	Market	lost

m3 lost. But to whom? My first cut linked every loser to the first winner it happened to process — here, m1 in Harbor. So the map drew Market’s “you lost this” arrow at Harbor, when the auction’s own ruling says plainly: pool exhausted: ambulance granted to m2 — m3 lost the last unit to m2, in Hospital. The picture pointed at the wrong district.

The reviewer flagged it as a misattribution, and it wasn’t theoretical — it was live in the demo data. The very first contested tick of our headline society run drew the contention line at the wrong district. A plausible-looking arrow, confidently wrong.

The fix was to stop guessing and read what the auction already said: parse the winner named in each loser’s own ruling, and link to that district. It also made the code simpler — the whole “figure out who the winner probably was” scaffolding just deleted.

I keep a rule for the benchmark: never claim a number you didn’t measure on identical worlds. This was the same rule, wearing a different hat. A visualization is an inference too, and it earns the same scrutiny as a number — more, even, because a clean line on a map feels like ground truth in a way a table never does. The map can be confidently, beautifully wrong. On a project that puts REAL / MAPPED / INFERRED / SYNTHETIC labels on every field precisely so nothing overclaims, an overlay that quietly fingers the wrong district is the same sin in a prettier font.

Keeping a reskin honest

It’s “just frontend,” but the failure modes of a frontend change on a project like this are specific, so it got the engine treatment:

The color contract is frozen. Society stays cyan, the baselines stay amber — the same coding the side-by-side compare view leans on. The redesign added exactly one new color (a caution yellow for contention) and reused the existing signals for everything else, so no other view shifted a pixel.
The honesty surfaces stayed first-class. The reality strip — real first-on-scene latency vs. the agents’ simulated response, the INFERRED badge on lives-at-risk, the never-fabricated-when- null rule — all survived untouched, including on the live NYC Hurricane Ida pack with its borough names and provenance.
The blast radius was provable. I pulled the rich incident markers into a shared module so the compare view’s two synced maps render from the exact same code but never inherit the overlay — the one map that should change changed, and the one that shouldn’t was untouched by construction, not by hope.
Determinism never moved. Frontend-only, so the aftershock verify digest check couldn’t regress — and it didn’t.

Then it went out the way everything does: local → staging → production, verified at each hop, live now at the public demo.

The takeaway

The benchmark exists to make the result falsifiable: here’s the baseline, here are the identical worlds, here’s the measured gain. This build was the same instinct aimed at the mechanism: make the coordination you can’t otherwise see legible enough to watch — and then hold the picture to the same bar as the numbers, including the part you only find when someone reviews it and asks “wait, who actually won?”

Make the mechanism visible. Then make sure the visible thing is telling the truth.

Try it live: https://aftershock.redoubtlabs.dev — load a society run and scrub to a contested tick · Read the code: https://github.com/bluntmachetti/aftershock

Built with Qwen Cloud (qwen3.5-flash / qwen3.5-plus / qwen3-max via DashScope) and Alibaba Cloud ECS, for the Qwen Cloud Global AI Hackathon.

We added native function calling. The benchmark told us to turn it off.

2026-06-14T16:00:00+00:00

A confession about hackathon incentives. The Qwen Cloud judging rubric weights “sophisticated use of Qwen Cloud APIs” at about 30%, and Aftershock’s agent society had, until last week, been talking to Qwen the unglamorous way: every agent returns strict JSON, the engine parses and validates it. Functional, but it doesn’t look like you’re using the platform’s fanciest toy. So I did the obvious thing and wired up native function calling — per-role tools, tool_choice, parallel_tool_calls, a dedicated no_op idle tool, the works.

Then I did the thing I keep telling everyone else to do, and which is the entire point of this project: I benchmarked it before believing in it.

The benchmark told me to turn it off.

This post is that result — because a negative result you measured honestly is worth more than a feature you shipped on faith.

What “native function calling” replaced

In JSON mode, each agent’s system prompt carries a compact prose contract — here are your actions, here are the fields, return JSON — and the model returns a JSON object the engine validates. In tool mode, that same action vocabulary becomes a set of OpenAI-style tools definitions sent on every request, and the model emits structured tool_calls instead. Same decisions, same auction, same negotiation protocol underneath. The only thing that changed is how the action space is described to the model and how the model hands its choices back.

I kept the change behind a flag (--society-tools) and a force_tools switch, so I could run the exact same five paired seeds both ways and let the numbers decide. Same disasters, byte for byte. Only the calling convention differs.

The numbers

Society mode · 5 paired seeds

Society mode	Lives saved (μ ± σ)	Missions failed	Cost / run	Latency / run
JSON contracts default	103.2 ± 23.6	0.4	$0.042	120 s
Native function calling	98.2 ± 23.2	0.8	$0.083	297 s

Read that top to bottom. Tool calling held lives saved within the noise — 98.2 vs 103.2 sits comfortably inside one standard deviation (±23) — while cost roughly doubled and latency rose ~2.5×. The one place it moved measurably in the wrong direction was missions failed: 0.8 vs 0.4, twice as many lapsed deadlines, plausibly downstream of the higher latency (the qwen3.5-plus commander even hit a couple of request timeouts under the heavier payloads).

If you’re tempted to read the 5-life dip as a real regression, the paired-seed view talks you out of it — tool mode was lower on four seeds, higher on one:

seed	11	23	37	42	57
JSON lives	140	86	81	98	111
tool lives	138	79	90	87	97

That’s noise with a faint downward lean, not a signal. So the honest summary is brutal in its simplicity: native function calling cost twice as much to do the same job, slightly worse.

Why — and why no amount of prompting saves it

The cause isn’t a bad implementation or an untuned prompt. It’s structural, and it’s worth internalizing if you build multi-agent systems.

A society run makes roughly 240 model calls — six agents, dozens of ticks. In JSON mode the action vocabulary rides along as a ~450-token prose contract. In tool mode the equivalent tools schema is ~1,000 tokens, and — this is the load-bearing sentence — it is re-sent on every single one of those 240 calls. The schema is pure repeated input tokens. The most expensive seat in the house, the qwen3.5-plus commander, ate 59% of the run’s cost on its own; the five flash workers split the rest, their prompts dominated by that resent schema.

I genuinely tried to schema-trim my way out of it. I projected every option: strip the auto-generated pydantic title/default noise, compact the descriptions, even gut them to empty strings. The floor — descriptions deleted entirely — is about $0.069/run. Still above the JSON path’s $0.042, and still above the single-qwen3-max-does-everything baseline (~$0.06). You can’t trim a per-call tax down to nothing when you pay it 240 times.

The decision: default off, available and measured

So Aftershock ships with JSON contracts as the default — the cost-optimal path, and the one the published headline (“a society of small models matches a qwen3-max solo agent at ~35% lower cost”) actually reflects. Native function calling stays implemented, tested, and benchmarked, one flag away (aftershock run --arm society --society-tools), with its full ablation published in the repo and in docs/FIELD-NOTES.md.

I’ll defend that as the more sophisticated use of the API, not the less. Shipping the fancy feature by default because the rubric rewards fancy features is cargo-culting. Wiring it up, measuring it against a real baseline on identical worlds, discovering it doesn’t pay for itself in your regime, and making that an informed, reversible default — that’s engineering judgment. “We used native function calling and measured exactly what it costs” is a stronger sentence than “we used native function calling.”

The transferable lesson (with the scope caveat that matters)

The general rule: in a high-frequency multi-agent system, per-call overhead dominates — measure it before you adopt it. Anything you attach to every request (tool schemas, verbose system prompts, retrieved context, elaborate output formats) gets multiplied by your call count, and in a society that number is large. The fancier API is not free; it’s free per call and you make hundreds of calls.

And the caveat that keeps this honest: this is not “function calling is bad.” For a single-agent assistant, or any app that makes a handful of calls per task, a 1,000-token schema is a rounding error and the reliability and ergonomics of structured tool calls are well worth it — the math flips entirely. Function calling earns its keep at low call counts. It just gets taxed to death at high ones. Know which regime you’re in.

That’s the whole ethos of Aftershock in miniature: don’t ask whether the shiny thing can work — measure when it actually pays, on worlds identical enough that the answer means something. This time the shiny thing didn’t pay, and saying so out loud is the result.

Try it live: https://aftershock.redoubtlabs.dev · Read the code: https://github.com/bluntmachetti/aftershock

Built with Qwen Cloud (qwen3.5-flash / qwen3.5-plus / qwen3-max via DashScope) and Alibaba Cloud ECS, for the Qwen Cloud Global AI Hackathon.

When does a society of small Qwen models beat one big model? Building Aftershock.

2026-06-12T16:00:00+00:00

Most multi-agent demos are theater. Agents take turns talking, politely agree with each other, duplicate half the work, and somewhere off-camera a single strong model with a decent prompt quietly does the whole task better and cheaper. I’ve built enough of these to find the genre uncomfortable. So when I started Aftershock for the Qwen Cloud hackathon, I didn’t want to ask can agents collaborate. I wanted to ask the harder, more falsifiable question:

When is a society of small models actually better than one big model — and can you prove it?

This post is the build journey: the bet, why Qwen Cloud’s model line-up made the architecture possible, what the numbers said (including the parts that didn’t flatter the idea), and what I’d tell anyone building agent societies next.

Live demo: https://aftershock.redoubtlabs.dev · Code: https://github.com/bluntmachetti/aftershock

The bet: disaster response, scored on lives

To make “better” mean something, you need a task where coordination failures have a cost. So Aftershock is a disaster-response simulator. An earthquake (or, in real-data mode, a hurricane) hits a city. Missions appear on a map — a collapsed school with people trapped, a hospital down to its last hours of generator fuel, flooded neighborhoods. Six agents with distinct roles — incident commander, medical, fire & rescue, logistics, infrastructure, public comms — have to divide the work and fight over a fixed pool of ambulances, rescue crews, fire engines, and fuel before deadlines expire.

Crucially, every run is scored: lives saved, response latency, missions failed, and cost. Not vibes. Numbers.

Why Qwen Cloud made the architecture possible

The whole thesis depends on small models being cheap enough that you can afford many of them. Qwen Cloud’s tiered line-up is what made the design economical, so I leaned into a structure I think of as cost-tiered cognition:

qwen3.5-flash runs the five worker roles — fast, cheap, good enough for a typed role decision every tick.
qwen3.5-plus is the incident commander and arbitrator — a little more capable, sitting at the one position where judgment under contention actually matters.
qwen3-max plays two parts: the solo baseline (one big model doing everything) and the after-action analyst that writes the post-run report.

Every agent talks to Qwen Cloud through the OpenAI-compatible chat-completions endpoint on DashScope-International and is required to return strict JSON. The simulator validates every decision before it touches the world, and rejected decisions are fed back to the agent on its next turn with a named reason. A per-tick token/cost ledger means Qwen usage is visible in the results, not hidden behind the demo — you can read the exact dollars-per-run in the benchmark tables.

This is the part I want to underline for anyone else building on Qwen Cloud: the price gap between flash and max is the entire reason a “team of cheap specialists vs. one expensive generalist” comparison is even interesting. Five flash workers + a plus commander cost about $0.042 a run; one qwen3-max doing everything costs $0.065. The architecture question only exists because the cheap tier is genuinely cheap.

Making it measurable, not impressive

The thing that kills most agent benchmarks is that every run sees a different world, so the result is just a story. Aftershock is built around three properties that turn anecdotes into evidence:

Determinism. All randomness flows from one seeded RNG; IDs come from counters; the engine never calls time.now() or random in the simulation path. Same seed + same agent decisions = byte-identical run, replayable forever. Every arm faces exactly the same disasters.
A typed negotiation protocol. Agents don’t free-text at each other. They emit typed proposals — resource requests, task handoffs, escalations, information shares — and an auction resolves contention atomically every tick. Coordination is a mechanic, not a chat transcript.
Validation with rejection feedback. Small models love to invent entity IDs and repeat invalid actions. Engine-side validation plus a short rejection memory in the next observation fixed far more than extra prompting ever did.

Then I ran the same seeded disasters four ways — scripted heuristics, one qwen3-max solo, a flat swarm of five flash agents with no protocol, and the structured society — and let the numbers talk.

The result that mattered

arm	models	lives saved (mean±sd)	missions failed	cost/run
society	flash ×5 + plus commander	103.2 ± 23.6	0.4	$0.042
solo	qwen3-max	104.2 ± 13.6	0.4	$0.065
swarm	flash ×5 (no protocol)	75.6 ± 15.4	3.0	$0.016
scripted	heuristics ($0)	106.8 ± 18.0	0.2	$0.00

Two findings, both causal because every arm faced byte-identical worlds:

The coordination protocol is worth +28 lives a run. The same five qwen3.5-flash models, with vs. without the negotiation protocol, went from 75.6 to 103.2 lives saved and from 3.0 to 0.4 missions failed. The run records show why: the protocol-less swarm burned ~160 decisions racing each other for empty resource pools; the society resolved that contention in the auction before acting.
The society matches the flagship for a third less money. A team of cheap qwen3.5-flash workers under a qwen3.5-plus commander saved as many lives as one qwen3-max doing everything (103.2 vs 104.2 — inside the noise), at ~35% lower cost and over 1.5× faster, because five small parallel calls beat one big sequential one.

That’s the headline: with the right coordination structure, small Qwen models reach big-model outcomes at lower cost.

The honest part (the bit I’m proudest of)

Here’s what a hype demo would hide and I put on the front page instead: a well-tuned scripted baseline, using the exact same negotiation protocol, stays competitive with every LLM arm. In other words, the protocol carries more of the result than the LLMs do.

That’s not a failure of the project. It is the project. Agent societies don’t need more agents talking — they need institutions: roles, contracts, arbitration, validation, measurement. The LLMs are the easy part; the mechanism is the hard part.

I logged every behavioral finding, including the negative ones, in docs/FIELD-NOTES.md. My favorite is the memory paradox: my first naive after-action-report memory loop (let the qwen3-max analyst write free-text “lessons” and feed them into the next briefing) made outcomes worse on paired controls. Lessons expressed outside the agents’ actual action space are just noise. Memory has to live in the action space or it hurts.

Grounding it in reality

Synthetic seeds prove the comparison, but I wanted the demo grounded. So Aftershock can compile scenarios offline from real open incident data and show that incident stream’s real first-on-scene latency on screen as the baseline. The flagship pack is Hurricane Ida over New York, the night of 2021-09-01, built from FDNY EMS and Fire dispatch records via NYC Open Data: 2,003 real EMS incidents, ~16.5% of calls held, a 948-second mean response time vs. ~524 s on a calm night two weeks earlier.

And it never overclaims. Every field on screen is labeled REAL / MAPPED / INFERRED / SYNTHETIC: the demand arrival and the latency baseline are real; mission severity is mapped; lives-at-risk is inferred; outcomes are a simulated model. I do not claim agents beat real responders on lives saved — only that they face the real demand the world actually produced. The compiler runs offline, so a real scenario is still byte-deterministic.

Shipping it on Alibaba Cloud

The observatory — a React + FastAPI app where you can scrub any run tick-by-tick, inspect each agent’s decisions, compare arms side by side, and start live runs — is deployed on an Alibaba Cloud ECS instance with Docker Compose behind a Caddy HTTPS front door. The deployment is reproducible straight from the public repo: clone, drop in your DASHSCOPE_API_KEY, docker compose up. Run records stream over WebSockets, and an MCP spectator server exposes the same data to any MCP client.

What I’d tell the next person building an agent society

Decide what “better” means before you build, and make it a number. Lives, latency, cost — on byte-identical worlds. Without that, you have a story, not a result.
Spend your model budget where judgment lives. Cost-tiered cognition (flash workers, a plus arbitrator, a max analyst) got big-model outcomes at small-model prices on Qwen Cloud.
Coordination is a mechanic, not a conversation. A typed protocol + an auction beat more prompting and more agents, by a wide margin.
Validate and give feedback. Reject invalid actions with reasons and remember them for one tick; small models recover instead of corrupting the run.
Publish your negative results. The scripted-is-competitive finding and the memory paradox are the most credible things in the whole project.

The larger goal is to make multi-agent systems falsifiable: not “look how many agents are talking,” but “here is the coordination mechanism, here is the baseline, and here is the measured gain.” Qwen Cloud’s cheap-enough small models are what make that experiment affordable to run.

Try it live: https://aftershock.redoubtlabs.dev · Read the code: https://github.com/bluntmachetti/aftershock

Built with Qwen Cloud (qwen3.5-flash / qwen3.5-plus / qwen3-max via DashScope) and Alibaba Cloud ECS, for the Qwen Cloud Global AI Hackathon.