Cross-country normalization meets the data

The previous post described V1.9.1 as a tactical band-aid, with V1.9.2 as the real structural fix coming in two or three weeks. That V1.9.2 is not going to ship on that timeline. Not because the idea is bad, but because a simulation we ran today found three problems with it that only show up when you look at real data across enough countries. The release is shelved.

This post walks through what we did, what we found, and what it changes.

What the rewrite was supposed to do

V1.9.1 shipped with a diagnostic bug: per-country z-score normalization inflated a few signals that didn't deserve it. Cape Verde's elections-and-voting signal sat 2.78 standard deviations above Cape Verde's own other categories, but globally, Cape Verde's elections activity is ordinary. A small country with one noisy category looked like a crisis; a large country with five elevated categories looked average.

The proposed V1.9.2 rewrite was meant to repair the math. Instead of normalizing within each country, normalize each category across all countries, which we call cross-country normalization. Syria's civil war signal gets compared to every country's civil war signal, not to Syria's own other categories. Under this approach, every category lives on the same standardized scale, and a single threshold separates HIGH from MODERATE. No three-layer patch.
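
To make the difference between the two normalizations concrete, here is a minimal sketch in Python. It is an illustration, not the production pipeline: the function names, the nested-dict layout, and the toy scores are all assumptions made up for this example, and missing categories are assumed to count as zero.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of numbers; return zeros if there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def per_country(scores):
    """Normalize each country's categories against that country's own categories."""
    out = {}
    for country, cats in scores.items():
        names = list(cats)
        out[country] = dict(zip(names, zscores([cats[c] for c in names])))
    return out

def cross_country(scores):
    """Normalize each category across all countries (missing values count as 0)."""
    countries = list(scores)
    categories = sorted({c for cats in scores.values() for c in cats})
    out = {country: {} for country in countries}
    for cat in categories:
        zs = zscores([scores[k].get(cat, 0.0) for k in countries])
        for k, z in zip(countries, zs):
            out[k][cat] = z
    return out

# Toy raw scores, invented for illustration only.
raw = {
    "small": {"elections": 3.0, "civil_war": 0.0, "currency": 0.0},
    "large": {"elections": 6.0, "civil_war": 6.0, "currency": 6.0},
    "other": {"elections": 3.0, "civil_war": 1.0, "currency": 1.0},
}
within = per_country(raw)
across = cross_country(raw)
```

In this toy data, the small country's one noisy category stands out within the country while the uniformly elevated large country scores flat zeros. Across countries the picture reverses: the large country's elections signal is the elevated one.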

The test

Before shipping, we ran the new math against staging data. A first pass used an approximation: the per-country z-scores from V1.9.1 as input. That failed: the output JSON only carries the top ten categories per country, so the full population of low-signal countries wasn't available to compute a meaningful global mean.

A second pass queried Athena directly for raw recipe scores across every country in every category, including zeros. This is what the math actually needs. We applied the cross-country normalization, swept a single HIGH threshold from 0 to 5 in 0.1-point steps, and measured two things: face-validity (do known crisis countries still rank HIGH?) and tier distribution (does HIGH share land in our target 25–45% band?). Total spend: under a dollar in Athena queries. Total time: about thirty minutes.
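
The sweep itself is simple enough to sketch. The version below is an illustration under stated assumptions, not the script we ran: it assumes each country has already been reduced to a single top salience number, that a country is HIGH when that number meets the threshold, and it uses invented country labels and scores. The function name `sweep` and its parameters are hypothetical.

```python
def sweep(salience, known_crisis, lo=0.0, hi=5.0, step=0.1, band=(0.25, 0.45)):
    """Evaluate every threshold from lo to hi; return one summary row per threshold."""
    testable = [c for c in known_crisis if c in salience]
    n_steps = int(round((hi - lo) / step))
    rows = []
    for i in range(n_steps + 1):
        t = round(lo + i * step, 10)  # avoid float drift in the threshold grid
        high = {c for c, s in salience.items() if s >= t}
        share = len(high) / len(salience)
        rows.append({
            "threshold": t,
            "face_hits": sum(1 for c in testable if c in high),
            "face_total": len(testable),
            "high_share": share,
            "in_band": band[0] <= share <= band[1],
        })
    return rows

# Invented salience values: two outliers on top, two crisis countries mid-pack.
salience = {"A": 4.5, "B": 4.2, "C": 3.5, "D": 3.4,
            "E": 1.0, "F": 0.5, "G": 0.2, "H": 0.1}
rows = sweep(salience, known_crisis=["C", "D"])

# Thresholds where every known-crisis country is HIGH and HIGH share is in band.
feasible = [r["threshold"] for r in rows
            if r["in_band"] and r["face_hits"] == r["face_total"]]
```

With these made-up numbers `feasible` comes back empty: any threshold low enough to keep both crisis countries HIGH also puts half the list at HIGH.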

Finding 1: rare-signal countries outranked active war zones

Under the cross-country approach, the top of the global salience distribution looks like this: Bhutan (top signal: economic_liberalization from the 1970s), Argentina and Chile (economic_nationalization), Guatemala (volcanic_eruption), South Korea (historical humanitarian_crisis). Syria, Yemen, Afghanistan, Somalia (all with active armed conflict) ranked below them.

The mechanism: rare categories have a handful of non-zero countries against a floor of zeros. A category where only five countries have any data pulls the global mean close to zero and lets those five countries score many standard deviations above it. Meanwhile, common crisis categories like civil_war_and_insurgency are populated across fifty-plus countries with varying elevations: Syria's civil war signal is elevated, but so are Yemen's and Somalia's. No single country reaches the extreme z-scores that rare-category outliers do.
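
The arithmetic behind this is easy to reproduce. The sketch below uses invented data, not our staging table: 89 countries, one category where 5 countries have the same non-zero raw score, and one where 50 do. Giving every non-zero country the same raw score is a simplification that isolates the effect of rarity.

```python
from statistics import mean, pstdev

def top_z(values):
    """z-score of the largest value, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return (max(values) - mu) / sigma

n = 89
rare = [1.0] * 5 + [0.0] * (n - 5)       # five countries with any signal
common = [1.0] * 50 + [0.0] * (n - 50)   # fifty countries with the same signal

rare_z = top_z(rare)      # roughly 4.1
common_z = top_z(common)  # roughly 0.88
```

Same raw score, same country count, and the rare category lands more than four standard deviations above the mean purely because almost everyone else is at zero.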

Category severity weighting was supposed to correct for this: low-severity categories like economic_liberalization are weighted at 0.4 or 0.6, not 1.0. It didn't correct enough. The numerical floor of each weighted z was still dominant.
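
A back-of-the-envelope sketch shows why a 0.4 weight is not enough in a case like this. It assumes, for illustration, that weighting is a plain multiplication of the z-score by the category weight, that the conflict category carries weight 1.0, and that each category is all-or-nothing (the non-zero countries share one raw value). The counts of 5 and 50 non-zero countries out of 89 are made up, and `top_z_binary` is a hypothetical helper.

```python
from math import sqrt

def top_z_binary(n_total, n_nonzero):
    """Top z-score when n_nonzero countries share one raw value and the rest are 0.

    With p = n_nonzero / n_total, the population z-score reduces to sqrt((1 - p) / p).
    """
    p = n_nonzero / n_total
    return sqrt((1 - p) / p)

# 0.4 is one of the low-severity weights mentioned above; 1.0 is an assumption.
rare_weighted = 0.4 * top_z_binary(89, 5)
common_weighted = 1.0 * top_z_binary(89, 50)

# Weight at which the rare category would merely tie the common one.
break_even = top_z_binary(89, 50) / top_z_binary(89, 5)
```

Under these assumptions the down-weighted rare signal still beats the full-weight conflict signal, and the weight would have to fall to about 0.22 just to tie.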

Finding 2: face validity and the HIGH-share band are incompatible

At threshold 3.2, 11 out of 11 testable face-validity countries rank correctly. But 82% of scored countries land HIGH. At threshold 4.0, HIGH share drops to 44% (inside target band), but Syria, Somalia, Yemen, Afghanistan, Lebanon, Iran, and Ukraine all demote to MODERATE. Face validity falls to 4/11.

There is no threshold in the sweep where both constraints hold. The shape of the distribution (a tail of rare-signal outliers at the top, a dense cluster of active-conflict countries in the middle) forbids it.

Finding 3: the test cases aren't in the data

Cape Verde, Denmark, and Germany, the false-positive test cases that motivated V1.9.1's patch and would be the primary validation for V1.9.2, aren't in the staging recipes table. They filter out because they don't have enough curated events to hit the scoring floor. We cannot validate the fix for the exact problem the fix was designed to solve until event-curation coverage expands.

One prediction that did hold

Venezuela. V1.9.1 scored Venezuela INDICATIVE, which was a known face-validity miss. The cross-country approach flipped Venezuela to HIGH, via currency_crisis at the XL pattern length, with a salience of 4.09. The direction matched our design-doc prediction; the specific driving category was different (we predicted political_repression, not currency_crisis). The simulation picked up the right answer through a slightly different path, which is the kind of thing a simulation is for.

Why we're shelving instead of patching

Three options were on the table after the simulation:

Patch cross-country normalization. Log-transform the raw scores before the z-score calculation (which compresses rare-category outliers), filter out low-severity categories entirely, or score only against a whitelist of risk-relevant categories.
Abandon and escalate to a hierarchical model. A partial pooling approach that shrinks small-country scores toward regional priors. It addresses the Cape Verde symptom more directly, but is a larger engineering project.
Wait for more data, then re-simulate. 89 staging countries is a biased sample. More countries will shift the global distribution shape in ways that may resolve the outlier problem without structural changes.
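
For a sense of what the log-transform patch in the first option would do, here is a small sketch on invented data. It assumes `log1p` as the transform (the options above only say "log-transform", so the exact function is our assumption for this example) and a made-up rare category with five non-zero countries out of 89.

```python
from math import log1p
from statistics import mean, pstdev

def top_z(values):
    """z-score of the largest value, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return (max(values) - mu) / sigma

# Invented raw scores: five non-zero countries against 84 zeros.
raw = [50.0, 40.0, 30.0, 20.0, 10.0] + [0.0] * 84
logged = [log1p(v) for v in raw]  # log1p keeps zeros at zero

raw_top = top_z(raw)
logged_top = top_z(logged)
```

In this toy case the transform pulls the top z-score down, but the country is still several standard deviations out, because the floor of zeros is untouched. Whether that is enough compression on real data is exactly what a re-simulation would have to show.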

We're choosing the third option, with the decision revisited once our curated-event coverage reaches 190+ countries (up from 89 in staging today, ~173 in the curated working set). Here's why:

More countries means more zero-fill for crisis categories, which pushes the global mean down and gives real crisis signals more room to separate from the pack.
The countries we can't currently test (Cape Verde, Denmark, Germany) re-enter the data as curation catches up. Without them, we can't honestly claim the rewrite fixes anything.
Several of the structural patches may turn out to be unnecessary at higher N. Understanding the data shape before committing to a patch avoids re-patching later.

The evidence package from today's simulation is preserved. When data coverage reaches that threshold, an internal methodology review picks up from today's findings; the decision tree doesn't reset.

What's shipping next instead

V1.9.1 stays in production. The tier distribution, drift monitoring, and methodology page are all calibrated against it. We have the numbers we have.

The top priority moves to event curation on Ukraine and Venezuela, the two known V1.9.1 face-validity misses that the rewrite was supposed to absorb. Iran stays deferred. This work isn't glamorous: it means adding verified events per country, one batch at a time. It is also what actually unblocks the next iteration.

Why this matters

Most forecasting products don't publish failed simulations. The incentive is to ship the version number and claim the win. The alternative is what happened this week: test the next release before shipping, find that it doesn't work the way the design doc claimed, decide honestly.

If we hadn't run the simulation, in three weeks we would have shipped V1.9.2 on schedule and quietly hoped nobody noticed that Bhutan was flagged HIGH. Instead we spent less than a dollar on Athena queries, changed our mind, and wrote this post.

That's the process. The version numbers we publish are what we actually believe. When the answer is “we don't know yet, we're waiting for better data,” that's what we say.