Correction · 2026-05-25 · Graduation suspended

A subsequent audit of our evaluation pipeline found that the LOCO-AUC of 0.9147 reported below was computed on only 10 of this recipe's 174 positive events; roughly 94% were silently dropped because several of the selected indicators lack older-year data coverage. On the full historical evidence base the discrimination collapses to ≈0.50, i.e. not yet distinguishable from chance. The graduation is therefore suspended pending a corrected re-walk. The analysis below is retained for transparency but is superseded and should not be cited. We are correcting the pipeline (minimum data-coverage and minimum-evidence gates) and will re-validate honestly before any graduation claim is restored. We publish this correction openly because the integrity of the method is the product.
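The arithmetic behind that finding, and the kind of minimum-evidence gate it motivates, can be sketched as follows. This is illustrative only: the function name evidence_gate and the 0.80 threshold are hypothetical, not the corrected pipeline's actual gate or setting. Only the 10-of-174 counts come from the audit.

```python
def evidence_gate(n_scored, n_total, min_fraction=0.80):
    """Return (passes, dropped_fraction) for a validation run.

    min_fraction is a hypothetical threshold chosen for illustration;
    the real gate value is not stated in this piece.
    """
    if n_total <= 0 or not 0 <= n_scored <= n_total:
        raise ValueError("need 0 <= n_scored <= n_total and n_total > 0")
    kept = n_scored / n_total
    return kept >= min_fraction, 1.0 - kept


# Counts reported by the audit: 10 of 174 positive events were scored.
passes, dropped = evidence_gate(n_scored=10, n_total=174)
print(passes, f"{dropped:.1%}")  # False 94.3%
```

A gate of this shape would refuse to report a discrimination figure at all when most positive events fall out of the evaluation set, rather than reporting one computed on the survivors.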

Methodology · Published 2026-05-23

Foreign Intervention, graduated: what eighteen years of indicator trajectories say

QGI's first ML-validated recipe earns its risk score today. Four indicators, eighteen years, two hard validation gates cleared. The mechanic is similarity to a historical pre-crisis trajectory, and the line between similarity and probability is where this piece spends most of its time.

A recipe is now graduated

For most of QGI's history, the word “recipe” on our methodology page has been doing more work than the code. We've scored countries against patterns we believed were valid; we hadn't formally validated, in held-out historical data, the five components that a recipe must specify. As of today, one has been validated. The recipe is foreign_intervention. Both hard validation gates cleared. Risk scores derived from it are now published for the 2021 bake.

This piece is the long version of that announcement. It explains what graduated, what the underlying mechanic does, and (more importantly) what it cannot do. The validation that earned the graduation tells one specific thing: our scoring mechanic can discriminate pre-crisis trajectories from non-pre-crisis trajectories. It does not tell us, and we do not claim, a calibrated probability of the event itself.

What QGI means by “recipe”

The Methodology Charter defines a recipe as five components. Every recipe must specify all five before it can carry a published score:

Key indicators: the small handful of substrate indicators (out of 104 we track) that carry the strongest discriminative signal for this crisis class.
Canonical length: how many years (3–25y) the predictive signal accumulates over. Some crises develop over three years; some over twenty-five. The recipe says which.
Canonical movement shape: the average trajectory of those indicators across the canonical length, computed from all historical positive instances in our corpus.
Aggregation rule: how the per-indicator measurements combine into a single country-recipe number.
Terminating event: the crisis type that defines the recipe. For foreign_intervention, a curated foreign-intervention event in our corpus.

Validation is the part that didn't exist before. An XGBoost classifier trained on our historical corpus produces the indicator selection and the canonical length. Cosine similarity on the z-scored trajectories produces the aggregation. Held-out historical events test whether the resulting score actually recovers the cases it was supposed to.

What graduated, in numbers

For foreign_intervention, the validated five components look like this:

Indicator	Source	SHAP weight
Regulatory Quality (RQ.EST)	World Bank Governance	51%
Power distribution by gender (v2pepwrgen)	V-Dem	27%
State fiscal capacity (v2stfisccap)	V-Dem	17%
International election monitoring (v2elintmon)	V-Dem	5%

Canonical length: eighteen years. The classifier's discriminative AUC peaked at this window. A country's score is computed from its trajectory across the four indicators over the eighteen years preceding the evaluation date.

Canonical movement shape: the mean z-scored trajectory of each indicator, averaged across all historical foreign-intervention positives in the training set.
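A minimal sketch of how such a canonical shape could be computed for one indicator. It assumes each historical trajectory is z-scored against its own mean and standard deviation before averaging; the piece does not say which normalisation baseline the pipeline uses, so treat that choice, and the helper names z_score and canonical_shape, as assumptions.

```python
from statistics import mean, pstdev


def z_score(series):
    """Standardise one trajectory to zero mean and unit variance.

    Assumption: normalisation is per trajectory. A flat series has no
    shape, so it maps to all zeros instead of dividing by zero.
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def canonical_shape(positive_trajectories):
    """Average the z-scored pre-event trajectories year by year."""
    z = [z_score(t) for t in positive_trajectories]
    length = len(z[0])
    if any(len(t) != length for t in z):
        raise ValueError("all trajectories must share the canonical length")
    return [mean(t[i] for t in z) for i in range(length)]


# Two toy three-year trajectories with the same rising shape.
shape = canonical_shape([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

In the toy example both series rise linearly, so after standardisation they coincide and the averaged shape is that same rising line, regardless of their different scales.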

Aggregation rule: cosine similarity is computed between each country's trajectory and the canonical trajectory, per indicator. The four per-indicator similarities are combined as a weighted average, with the SHAP-derived weights shown above. The result, in [-1, +1], is rescaled to [0, 100] for publication.
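The aggregation rule can be sketched in a few lines. The weights are the SHAP percentages from the table above; the linear rescale 50 × (similarity + 1) is an assumption, chosen because it maps [-1, +1] onto [0, 100] and places an orthogonal trajectory at 50. The function names and the dictionary layout are hypothetical.

```python
import math

# SHAP-derived weights from the indicator table (they sum to 1.0).
WEIGHTS = {
    "RQ.EST": 0.51,
    "v2pepwrgen": 0.27,
    "v2stfisccap": 0.17,
    "v2elintmon": 0.05,
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors, clamped to [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a flat vector has no direction; treat as orthogonal
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def recipe_score(country, canonical, weights=WEIGHTS):
    """Weighted average of per-indicator cosines, rescaled to [0, 100].

    country and canonical map indicator code -> z-scored trajectory.
    """
    total = sum(weights.values())
    sim = sum(w * cosine(country[k], canonical[k]) for k, w in weights.items())
    return 50.0 * (sim / total + 1.0)


canonical = {k: [-1.0, 0.0, 1.0] for k in WEIGHTS}
opposite = {k: [1.0, 0.0, -1.0] for k in WEIGHTS}
orthogonal = {k: [1.0, -2.0, 1.0] for k in WEIGHTS}
```

With these toy inputs, a country identical to the canonical shape scores about 100, its mirror image about 0, and the orthogonal trajectory 50.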

Terminating event: a curated foreign-intervention entry in our corpus, vetted by hand, timestamped to the year the intervention began.

What the risk score actually says

For every country, on every recipe that has graduated, QGI publishes a score in [0, 100]. The score reads:

Over the past eighteen years, how closely does this country's trajectory across the four key indicators align with the average pre-crisis trajectory of historical foreign interventions?

65 or higher is HIGH. Below 40 is LOW. Between is MODERATE. 50 is the point where the country's trajectory is geometrically orthogonal to the canonical shape: neither matching nor opposite.
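The zone cut-offs translate directly into code. A short sketch, with the function name zone as a hypothetical label:

```python
def zone(score):
    """Map a published [0, 100] similarity score to its zone label."""
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    if score >= 65:
        return "HIGH"
    if score < 40:
        return "LOW"
    return "MODERATE"
```

Note the boundary handling implied by the wording: exactly 65 is HIGH, while exactly 40 is MODERATE, because only scores strictly below 40 are LOW.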

In the 2021 bake, sixty-eight countries had complete coverage across the four indicators over the eighteen-year window. The three highest scores:

Iran 98.1 · HIGH (80% band 96.6–98.4)

Indicator trajectory overlay: Iran vs the foreign_intervention canonical pre-crisis profile.

Turkmenistan 97.0 · HIGH (80% band 95.6–97.2)

Indicator trajectory overlay: Turkmenistan vs the foreign_intervention canonical pre-crisis profile.

Equatorial Guinea 96.9 · HIGH (80% band 95.4–97.1)

Indicator trajectory overlay: Equatorial Guinea vs the foreign_intervention canonical pre-crisis profile.

Each chart overlays one country's trajectory across the four key indicators (faint coloured lines) against the canonical pre-crisis trajectory (heavier line). The closer the overlay, the higher the cosine similarity. In Iran's case, three of the four indicators trace the canonical shape closely across the full eighteen-year window. In Equatorial Guinea's case, the alignment is most pronounced on regulatory quality and fiscal capacity. The aggregate is similar; the substrate is country-specific.

The band after each score (e.g. Iran “98.1, 80% band 96.6–98.4”) is a margin of error, not decoration. We compute it by bootstrap-resampling the historical events that define the canonical shape and watching how much each country's score moves. It is honest about a real limit: the canonical shape for this recipe rests on far fewer events than the headline count suggests. Its most heavily weighted indicator, regulatory quality, is anchored on only 46 historical instances. The band makes that uncertainty visible rather than hiding it behind a single confident number. Iran stays firmly in the HIGH zone even at the low end of its band.
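A minimal sketch of a bootstrap band of this kind, reduced to a single indicator for brevity. It assumes a percentile bootstrap (10th and 90th percentiles for an 80% band) and the linear [0, 100] rescale; the piece does not specify the exact procedure, and the names bootstrap_band, n_boot and seed are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity, clamped to [-1, 1]; zero vectors count as orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def mean_shape(events):
    """Year-by-year mean of equal-length event trajectories."""
    n = len(events)
    return [sum(e[i] for e in events) / n for i in range(len(events[0]))]


def bootstrap_band(country, events, n_boot=2000, level=0.80, seed=0):
    """Percentile band for one country's score under event resampling."""
    rng = random.Random(seed)  # fixed seed so the band is reproducible
    scores = []
    for _ in range(n_boot):
        # Resample the historical events with replacement, same sample size.
        sample = [rng.choice(events) for _ in events]
        scores.append(50.0 * (cosine(country, mean_shape(sample)) + 1.0))
    scores.sort()
    tail = (1.0 - level) / 2.0
    low = scores[int(tail * (n_boot - 1))]
    high = scores[int((1.0 - tail) * (n_boot - 1))]
    return low, high


# Toy data: four z-scored event trajectories and one country trajectory.
events = [[-1.0, 0.0, 1.0], [-1.2, 0.1, 1.1], [-0.8, -0.1, 0.9], [-0.5, 0.4, 0.1]]
low, high = bootstrap_band([-1.0, 0.1, 0.9], events, n_boot=500)
```

The fewer events the canonical shape rests on, the more each resample changes it, and the wider the resulting band tends to be.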

The validation that earned the graduation

Graduation is not automatic. Before a recipe's scores are published, four validations run. Two are hard gates: if either fails, the recipe does not graduate. The other two produce evidence but do not block.

Validation	Type	Result
V1: Positive-instance recall. Median similarity at one year before event vs non-event baseline. Foreign intervention: 96.0 vs 70.1, a 25.8 percentile-point separation. Gate: 20pp.	HARD GATE	PASS
V2: Discrimination-consistency correlation. Correlation between similarity score and XGBoost discriminator probability. Pearson 0.26, Spearman 0.50. Moderate; not problematic.	Reporting only	n/a
V3: Per-cohort risk landscape. Median similarity by cohort. Stable-democracy 5.9, mid-tier 66.7, fragile-state 94.8. Cleanly discriminating across three tiers.	Reporting only	n/a
V4: Out-of-time holdout. Held-out historical events must hit HIGH zone (≥65) at event-Y-1. Foreign intervention: 13 of 15 holdout events (86.7%). Gate: 60%.	HARD GATE	PASS

Both hard gates cleared. The recipe graduates.
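The two hard gates reduce to simple threshold checks. The sketch below uses the thresholds stated in the table (20pp separation, 60% of holdout events at 65 or higher); the function names are hypothetical, and the individual holdout scores are placeholders, since only the 13-of-15 split is reported.

```python
def v1_recall_gate(pre_event_median, baseline_median, min_separation_pp=20.0):
    """V1: pre-event median similarity must exceed the baseline by the gate."""
    return (pre_event_median - baseline_median) >= min_separation_pp


def v4_holdout_gate(holdout_scores, high_cutoff=65.0, min_hit_rate=0.60):
    """V4: share of held-out events scoring in the HIGH zone at event-Y-1."""
    if not holdout_scores:
        raise ValueError("need at least one holdout event")
    hits = sum(1 for s in holdout_scores if s >= high_cutoff)
    rate = hits / len(holdout_scores)
    return rate >= min_hit_rate, rate


v1_pass = v1_recall_gate(96.0, 70.1)
# Placeholder scores: 13 events in the HIGH zone, 2 below it.
v4_pass, v4_rate = v4_holdout_gate([70.0] * 13 + [50.0] * 2)
print(v1_pass, v4_pass, round(v4_rate, 3))  # True True 0.867
```

Both checks are only as meaningful as the set of events fed into them, which is why a gate on how many events were actually evaluated matters alongside the thresholds themselves.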

What this score does not mean

This is a similarity score, not a calibrated probability. We do not claim “country X has Y% probability of foreign intervention in N years.” The validation that cleared discrimination (telling pre-crisis from non-pre-crisis trajectories) is a different claim from calibrated probability. Earlier this same week, the same recipe attempted a calibrated probability output via Beta calibration on the same XGBoost backbone; the expected calibration error (ECE) failed the hard threshold across multiple cohort × recipe cells. Rather than ship a probability the methodology does not stand behind, we ship the discrimination layer directly. Discriminative AUC: 0.9147. Calibrated probability: deferred.
A HIGH score is not a forecast. It is interpretable as: “this country's trajectory looks like the historical pre-intervention pattern.” Pattern resemblance is not causal mechanism. The four indicators did not name a trigger; they identify a shape.
A LOW score is not “safe.” A country may face foreign intervention for reasons not captured by the four indicators: idiosyncratic shocks, new actors, regional contagion. Low similarity is “the methodology has not flagged this pattern,” not “the methodology has cleared this country.”
The score is bake-relative. The 2021 evaluation reflects what was in our 2021 indicator corpus. Prospective use beyond 2021 requires further out-of-time validation as data lands.
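For readers unfamiliar with the calibration check mentioned in the first point above, here is a minimal sketch of expected calibration error (ECE). It assumes the common equal-width binning with ten bins; the piece does not state the binning scheme or the hard threshold used, so both are left as assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binned ECE.

    For each bin, compare the mean predicted probability with the observed
    event rate, then average the gaps weighted by bin size.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(observed - confidence)
    return ece


# Toy case: four predictions of 0.9, three of which were events.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the toy case the model says 0.9 but the event rate is 0.75, so the ECE is about 0.15. A score can rank cases well (high discrimination) and still fail a check like this, which is the distinction the first point draws.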

Why we are publishing this

The methodology pause that preceded this graduation was deliberate. For most of QGI's life, the scoring pipeline produced numbers without an ML-validated recipe substrate. The numbers were useful upstream of the validation gates the Methodology Charter demanded (and we said so), but they did not constitute a graduated recipe.

Foreign intervention is the first to clear those gates. Two parallel walks (civil_war, econ_recession) on the same framework returned different verdicts: both demonstrated that the indicator-selection mechanic worked, but neither cleared the additional thresholds the per-recipe pre-registration demanded. Foreign intervention cleared discrimination. The mechanism that earned this graduation (XGBoost-derived indicator selection, SHAP-weighted cosine similarity on z-scored trajectory profiles, held-out out-of-time validation) is now provenance for any subsequent recipe to use.

We expect more graduations in the coming weeks. We do not expect every recipe to clear. The validation framework is designed to fail recipes when the historical signal isn't there. That is the point. Recipes that fail are not published; recipes that pass are.

For the reader who wants more

Methodology page: full derivation of the similarity-as-risk-metric framework, what the score is and is not, and the validation flow before any recipe's scores publish.
Similarity-as-risk-metric long-form (qgi-docs): the methodology document the page above is built from.
Foreign intervention recipe DECISION: the formal eight-of-eight ratified decision document with the validation dossier and the rider package.

Published 2026-05-23 · QG Intelligence · Indicator weights, validation statistics, and country scores cited from the validated 2021 bake of the foreign_intervention recipe. The discrimination layer is shipped directly; calibrated probability is deliberately not claimed. This piece reports the first formal ML recipe graduation and the validation that earned it. It is not a forecast.