ML Calibration Audit
Independent back-test on 169 historical niches · 4-year outcomes (2022 → 2026) · Bootstrap 95% confidence intervals on every metric · published April 2026
What This Measures
Every RIDGE verdict carries a confidence level: "70% GO," "85% NO-GO," "60% HIDDEN GEM." A confidence number is only useful if it is calibrated — that is, if 70% GO predictions turn out to be correct 70% of the time in reality. Uncalibrated models can still rank well, but their probabilities are meaningless for decision-making.
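What "calibrated" means operationally can be shown with a minimal reliability check: bucket the predicted probabilities, then compare each bucket's mean prediction against its empirical hit rate. The data, bin count, and `reliability` helper below are illustrative sketches, not RIDGE's implementation:

```python
import random

random.seed(0)

# Hypothetical (predicted probability, outcome) pairs; outcome = 1 means
# the niche survived. For a calibrated model, the 0.70 bucket should
# contain roughly 70% positives.
data = [(p, 1 if random.random() < p else 0)
        for p in [0.1, 0.3, 0.5, 0.7, 0.9] * 200]

def reliability(pairs, n_bins=5):
    """Group predictions into equal-width probability bins and compare
    mean predicted probability with the empirical positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in pairs:
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for b in bins:
        if b:
            mean_p = round(sum(p for p, _ in b) / len(b), 2)
            emp = round(sum(y for _, y in b) / len(b), 2)
            rows.append((mean_p, emp, len(b)))
    return rows

for mean_p, emp, n in reliability(data):
    print(f"predicted {mean_p:.2f}  observed {emp:.2f}  (n={n})")
```

A well-calibrated model keeps the two columns close in every bucket; a model that merely ranks well can still show large gaps here.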
The audit on this page tests exactly that: how often the verdict was right when applied to historical niches whose 4-year outcome is already known. The cohort is 169 Amazon FBA niches that entered the RIDGE pipeline in 2022–2023; their 2026 marketplace state is the ground truth. Every published metric is reported with a non-parametric bootstrap 95% confidence interval (2,000 resamples).
Headline Results
| Metric | RIDGE | Baseline / typical competitor |
|---|---|---|
| NO-GO precision (verdict) | 96.2% | 46.2% always-DEAD baseline |
| GO precision (verdict) | 97.8% | 53.8% always-GO baseline |
| HIDDEN GEM in-product signal precision | 41% | 20.1% positive prior |
| Outcome window | 4 years (2022 → 2026) | Not published anywhere else |
| Confidence intervals | Bootstrap, 2,000 resamples, every metric | Point estimates only (when disclosed at all) |
On the trivial baselines: the 169-niche cohort prior is 46.2% DEAD / 53.8% GO, so an always-DEAD classifier would already achieve 46.2% NO-GO precision and an always-GO classifier 53.8% GO precision. That is why each precision is published against its matching trivial baseline, and why we lead with the bootstrap confidence interval rather than a point estimate — the headline 97.8% GO precision is only meaningful alongside its 95% interval and that baseline.
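The percentile bootstrap described above fits in a few lines of standard-library Python. The outcome vector here is synthetic (a hypothetical 90% point-estimate precision), not the audit data:

```python
import random

random.seed(42)

# Synthetic per-niche correctness flags for one verdict class:
# 1 = verdict matched the 2026 ground truth, 0 = it did not.
hits = [1] * 90 + [0] * 10

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap confidence interval on the mean of a 0/1 sample."""
    stats = sorted(
        sum(random.choices(sample, k=len(sample))) / len(sample)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(hits)
print(f"precision {sum(hits) / len(hits):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Because the resampling is non-parametric, the interval widens honestly on small cohorts — which is the point of publishing it instead of a bare number.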
The Methodology, at Category Level
Without exposing proprietary details:
- Output type: calibrated probability that a niche remains viable twelve months forward. A score of 0.40 means roughly 40% of historical niches in that bucket were ALIVE at the outcome date — not a dimensionless 0–100 index.
- Decision head: three-class verdict over DEAD/ALIVE/THRIVING; the shipped P(GO) is P(ALIVE) + P(THRIVING), renormalized to the simplex.
- Validation protocol: stratified k-fold cross-validation with niche-family grouping to prevent leakage between sibling niches, plus the separately held 169-cohort back-test reported on this page. Every metric is reported with a bootstrap 95% confidence interval, never as a bare point estimate.
- Label sourcing: 2,710 ground-truth labels (1,213 ALIVE / 830 THRIVING / 667 DEAD) derived algorithmically from longitudinal Amazon trajectory signals against 2022–2026 marketplace observations — not human annotation. Temporal cutoff enforced to prevent target leakage from outcome-period features.
- Cohort priors disclosed: training cohort 75.4% positive (ALIVE+THRIVING) vs. 169-back-test cohort 53.8% positive — the cohort shift is published, not hidden behind in-distribution-only numbers.
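The decision-head mapping above reduces to simple arithmetic. A hypothetical sketch — the probabilities, the 0.5 cut-off, and the `verdict` helper are made up for illustration, not RIDGE's shipped parameters:

```python
def verdict(p_dead, p_alive, p_thriving, go_threshold=0.5):
    """Collapse a three-class distribution into the shipped P(GO).
    Renormalize defensively in case the inputs don't sum to exactly 1."""
    total = p_dead + p_alive + p_thriving
    p_go = (p_alive + p_thriving) / total
    label = "GO" if p_go >= go_threshold else "NO-GO"
    return label, round(p_go, 3)

# Hypothetical niche: 30% DEAD, 45% ALIVE, 25% THRIVING.
print(verdict(0.30, 0.45, 0.25))  # -> ('GO', 0.7)
```

Collapsing ALIVE and THRIVING into one GO probability is what lets a three-class model ship a single calibrated confidence number.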
Specific feature names, model architecture, and calibration parameters are not disclosed — they are the defining trade secret of RIDGE, and disclosing them would permit adversarial gaming. We publish the validation protocol and the back-test, not the gradient table. This is the standard stance for production ML systems operating at scale.
Why No Competitor Shows This
Publishing a calibration audit requires (a) having a calibrated model, (b) having enough ground-truth labels to audit it, (c) accepting the marketing cost of every confidence interval that does not collapse to a tight point, and (d) committing to the version-over-time discipline that lets a back-test be reproduced rather than re-marketed. Most Amazon-research tools are not built on ML at all — they ship heuristic point estimates. Those that are built on ML do not publish the audit.
RIDGE publishes the niche list, the outcome dates, the bootstrap protocol, and every confidence interval. If a competitor wants to run the same back-test against their own scoring engine, the cohort is open.
Related research
- Methodology — full validation protocol, cohort priors, and the rejected-experiments ledger.
- Public Leaderboard — held-out 169-niche benchmark with nested temporal cross-validation.
- 2026 Back-test Report — full bootstrap distribution on the 169-cohort.
- ML Methodology Overview — how the ML layer fits into the broader fifteen-signal pipeline.