ML Calibration Audit
Independent back-test on 169 historical niches · 4-year outcomes (2022 → 2026) · Bootstrap 95% confidence intervals on every metric · published April 2026
What This Measures
Every RIDGE verdict carries a confidence level: "70% GO," "85% NO-GO," "60% HIDDEN GEM." A confidence number is only useful if it is calibrated — that is, if 70% GO predictions turn out to be correct 70% of the time in reality. Uncalibrated models can still rank well, but their probabilities are meaningless for decision-making.
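What "calibrated" means operationally can be shown with a minimal reliability check: bucket the predicted probabilities, then compare each bucket's mean prediction against its empirical hit rate. The data, bin count, and `reliability` helper below are illustrative sketches, not RIDGE's implementation:

```python
import random

random.seed(0)

# Hypothetical (predicted probability, outcome) pairs; outcome = 1 means
# the niche survived. For a calibrated model, the 0.70 bucket should
# contain roughly 70% positives.
data = [(p, 1 if random.random() < p else 0)
        for p in [0.1, 0.3, 0.5, 0.7, 0.9] * 200]

def reliability(pairs, n_bins=5):
    """Group predictions into equal-width probability bins and compare
    mean predicted probability with the empirical positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in pairs:
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = round(sum(p for p, _ in b) / len(b), 2)
            emp = round(sum(y for _, y in b) / len(b), 2)
            rows.append((mean_p, emp, len(b)))
    return rows

for mean_p, emp, n in reliability(data):
    print(f"predicted {mean_p:.2f}  observed {emp:.2f}  (n={n})")
```

A well-calibrated model keeps the two columns close in every bucket; a model that merely ranks well can still show large gaps here.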
The audit on this page tests exactly that: how often the verdict was right when applied to historical niches whose 4-year outcome is already known. The cohort is 169 Amazon FBA niches that entered the RIDGE pipeline in 2022–2023; their 2026 marketplace state is the ground truth. Every published metric is reported with a non-parametric bootstrap 95% confidence interval (2,000 resamples).
Headline Results
| Metric | RIDGE | Baseline / typical competitor |
|---|---|---|
| NO-GO precision (verdict) | 96.2% | 46.2% always-DEAD baseline |
| GO precision (verdict) | 97.8% | 53.8% always-GO baseline |
| HIDDEN GEM in-product signal precision | 41% | 20.1% positive prior |
| Outcome window | 4 years (2022 → 2026) | Not published anywhere else |
| Confidence intervals | Bootstrap, 2,000 resamples, every metric | Point estimates only (when disclosed at all) |
On the trivial baselines: the 169-niche cohort prior is 46.2% DEAD / 53.8% GO, so an always-DEAD classifier would already achieve 46.2% NO-GO precision and an always-GO classifier 53.8% GO precision. That is why each precision is published against its matching trivial baseline, and why we lead with the bootstrap confidence interval rather than a point estimate — the headline 97.8% GO precision is only meaningful alongside its 95% interval and that baseline.
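The percentile bootstrap described above fits in a few lines of standard-library Python. The outcome vector here is synthetic (a hypothetical 90% point-estimate precision), not the audit data:

```python
import random

random.seed(42)

# Synthetic per-niche correctness flags for one verdict class:
# 1 = verdict matched the 2026 ground truth, 0 = it did not.
hits = [1] * 90 + [0] * 10

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval on the mean of a 0/1 sample."""
    stats = sorted(
        sum(random.choices(sample, k=len(sample))) / len(sample)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(hits)
print(f"precision {sum(hits) / len(hits):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because the resampling is non-parametric, the interval widens honestly on small cohorts — which is the point of publishing it instead of a bare number.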
The Methodology, at Category Level
Without exposing proprietary details:
- Output type: calibrated probability that a niche remains viable twelve months forward. A score of 0.40 means roughly 40% of historical niches in that bucket were ALIVE at the outcome date — not a dimensionless 0–100 index.
- Decision head: three-class verdict over DEAD/ALIVE/THRIVING; the shipped P(GO) is P(ALIVE) + P(THRIVING), renormalized to the simplex.
- Validation protocol: stratified k-fold cross-validation with niche-family grouping to prevent leakage between sibling niches, plus the separately held 169-cohort back-test reported on this page. Every metric is reported with a bootstrap 95% confidence interval, never as a bare point estimate.
- Label sourcing: 2,710 ground-truth labels (1,213 ALIVE / 830 THRIVING / 667 DEAD) derived algorithmically from longitudinal Amazon trajectory signals against 2022–2026 marketplace observations — not human annotation. Temporal cutoff enforced to prevent target leakage from outcome-period features.
- Cohort priors disclosed: training cohort 75.4% positive (ALIVE+THRIVING) vs. 169-back-test cohort 53.8% positive — the cohort shift is published, not hidden behind in-distribution-only numbers.
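The decision-head mapping above reduces to simple arithmetic. A hypothetical sketch — the probabilities, the 0.5 cut-off, and the `verdict` helper are made up for illustration, not RIDGE's shipped parameters:

```python
def verdict(p_dead, p_alive, p_thriving, go_threshold=0.5):
    """Collapse a three-class distribution into the shipped P(GO).
    Renormalize defensively in case the inputs don't sum to exactly 1."""
    total = p_dead + p_alive + p_thriving
    p_go = (p_alive + p_thriving) / total
    label = "GO" if p_go >= go_threshold else "NO-GO"
    return label, round(p_go, 3)

# Hypothetical niche: 30% DEAD, 45% ALIVE, 25% THRIVING.
print(verdict(0.30, 0.45, 0.25))  # -> ('GO', 0.7)
```

Collapsing ALIVE and THRIVING into one GO probability is what lets a three-class model ship a single calibrated confidence number.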
Specific feature names, model architecture, and calibration parameters are not disclosed — they are the defining trade secret of RIDGE, and disclosing them would permit adversarial gaming. We publish the validation protocol and the back-test, not the gradient table. This is the standard stance for production ML systems operating at scale.
Why No Competitor Shows This
Publishing a calibration audit requires (a) having a calibrated model, (b) having enough ground-truth labels to audit it, (c) accepting the marketing cost of every confidence interval that does not collapse to a tight point, and (d) committing to the version-over-time discipline that lets a back-test be reproduced rather than re-marketed. Most Amazon-research tools are not built on ML at all — they ship heuristic point estimates. Those that are built on ML do not publish the audit.
RIDGE publishes the niche list, the outcome dates, the bootstrap protocol, and every confidence interval. If a competitor wants to run the same back-test against their own scoring engine, the cohort is open.
Related research
- Methodology — full validation protocol, cohort priors, and the rejected-experiments ledger.
- Public Leaderboard — held-out 169-niche benchmark with nested temporal cross-validation.
- 2026 Back-test Report — full bootstrap distribution on the 169-cohort.
- ML Methodology Overview — how the ML layer fits into the broader fifteen-signal pipeline.