Reference
Reference material for the current data contract, CLI surface, metrics, and result artifacts.
The backtest harness needs enough information to replay the same model surface consistently across folds and across repos.
For a runnable pilot manifest, the repo needs:
If recommendation stability is in scope, the repo also needs:
The formulas in scope include lagged and rolling terms. That means fold inputs must be rebuilt from source data at each cutoff rather than sliced from a full-sample engineered matrix.
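The difference between rebuilding and slicing can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the harness: the function names, the toy series, and in particular the centred rolling window, which is chosen only because it makes look-ahead leakage easy to see.

```python
from statistics import fmean


def engineer(values, lag=1, half_window=1):
    """Build a lagged term and a centred rolling mean from raw values.

    The centred window is an illustrative assumption: it reads one step
    ahead, so it exposes leakage when computed on the full sample.
    """
    rows = []
    for i in range(len(values)):
        lagged = values[i - lag] if i >= lag else None
        window = values[max(0, i - half_window): i + half_window + 1]
        rows.append({"y": values[i], "lag": lagged, "rolling": fmean(window)})
    return rows


def fold_inputs(source, cutoff):
    """Truncate the source data at the cutoff first, then engineer features."""
    return engineer(source[:cutoff])


source = [10.0, 12.0, 11.0, 15.0, 40.0, 42.0]  # toy KPI series
cutoff = 4                                      # first 4 points are in-fold

rebuilt = fold_inputs(source, cutoff)   # rebuilt from source at the cutoff
sliced = engineer(source)[:cutoff]      # sliced from a full-sample matrix

# The last in-fold row differs: the sliced version has seen the value 40.0,
# which lies beyond the cutoff.
print(rebuilt[-1]["rolling"], sliced[-1]["rolling"])
```

In this sketch the lagged term is the same either way; the rolling term is where the full-sample matrix carries post-cutoff information into the fold.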
The repo keeps smaller GitHub-friendly replication packages under ../data/:
_st: active engineering pilot
_ov: reserve candidate
_os: retained stress fixture
Large reviewed bundles under data_review/ are kept local only.
The active engineering pilot is _st.
The active manifest currently lives in the local planning layer at
.planning/research/pilot_manifest.yaml.
Note that .planning/ is local-only in the current repo setup, so colleagues
using GitHub alone should rely on the tracked replication data, README, docs,
and report rather than the local planning spine.
Backtest outputs are written under run-scoped result trees so filtered reruns do not overwrite earlier summaries.
Typical layout:
summary/run_status.csv
Fold-by-fold execution state.
summary/holdout_summary.csv
Repo-level forward holdout comparison.
summary/parameter_stability_summary.csv
Repo-level adjacent-refit parameter drift summary.
summary/recommendation_stability_summary.csv
Repo-level recommendation stability summary on the current provisional shared
recommendation surface.
The active _st engineering batch is:
The holdout and parameter-stability summaries are identical across repos on that example. Recommendation stability is also present, but it should still be treated as provisional.
The main entrypoint is:
validate
Validate a pilot manifest and print the planned run scope.
plan
Build the run matrix for the active manifest.
run
Execute a batch or write a dry-run result tree.
Dry run:
Target one repo:
Target one fold:
--manifest <path>
--repo-target <name>
--fold-id <n>
--dry-run
--results-root <dir>
The CLI is designed around the current M1 single-series parity surface. It is not yet a general hierarchical / panel backtest runner.
RMSE
Root mean squared error on the observed KPI scale.
WMAPE
Weighted mean absolute percentage error on the observed KPI scale.
Mean error
Signed bias on the observed KPI scale.
SMAPE
Secondary holdout metric on the observed KPI scale.
Holdout ELPD / log score
Secondary probabilistic diagnostics when compatible posterior outputs are
available.
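The four point-forecast metrics above can be sketched as follows. This is a minimal illustration using common textbook definitions, not the harness's own code: the function names are hypothetical, the sign convention for mean error (forecast minus actual) is an assumption, and SMAPE has several variants, of which this is one.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error on the observed scale."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, forecast)) / n)


def wmape(actual, forecast):
    """Sum of absolute errors divided by sum of absolute actuals."""
    abs_err = sum(abs(f - a) for a, f in zip(actual, forecast))
    return abs_err / sum(abs(a) for a in actual)


def mean_error(actual, forecast):
    """Signed bias; assumed convention: positive means over-forecasting."""
    return sum(f - a for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE, variant 2|f - a| / (|a| + |f|), averaged over periods.

    Assumes no period has both actual and forecast equal to zero.
    """
    terms = [2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return sum(terms) / len(terms)


actual = [100.0, 200.0, 300.0]     # toy holdout KPI values
forecast = [110.0, 190.0, 330.0]   # toy forecasts

print(rmse(actual, forecast))
print(wmape(actual, forecast))
print(mean_error(actual, forecast))
print(smape(actual, forecast))
```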
standardized_posterior_shift
Adjacent-refit coefficient shift scaled by posterior uncertainty.
allocation_turnover
0.5 * sum(abs(w_t - w_{t-1})) across matched channels.
marginal_response_rank_corr
Spearman correlation of channel marginal-response ranks across adjacent
refits on the shared recommendation surface.
Important note:
marginal_response_rank_corr is not ROI. It is a rank comparison on the
repo-owned recommendation surface. The metric was deliberately renamed from a
previous ROI-style label because the current allocator does not compute true
ROI.
Recommendation stability in the current repo should still be treated as provisional, because the allocator surface is backtest-owned and not yet an owner-approved production policy.
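The two recommendation-stability metrics, allocation_turnover and marginal_response_rank_corr, can be sketched as below. This is an illustrative reading of the definitions given in this section, not the repo's implementation: the function names, the dict-of-channels input shape, and the toy values are all assumptions. "Matched channels" is taken to mean channels present in both adjacent refits.

```python
import math


def allocation_turnover(prev_weights, curr_weights):
    """0.5 * sum(abs(w_t - w_{t-1})) over channels present in both refits."""
    matched = prev_weights.keys() & curr_weights.keys()
    return 0.5 * sum(abs(curr_weights[c] - prev_weights[c]) for c in matched)


def average_ranks(values):
    """Ascending ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_corr(prev_response, curr_response):
    """Spearman correlation: Pearson correlation of ranks on matched channels."""
    channels = sorted(prev_response.keys() & curr_response.keys())
    x = average_ranks([prev_response[c] for c in channels])
    y = average_ranks([curr_response[c] for c in channels])
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy adjacent refits (hypothetical channels and values).
prev_w = {"tv": 0.5, "search": 0.3, "social": 0.2}
curr_w = {"tv": 0.4, "search": 0.4, "social": 0.2}
prev_m = {"tv": 1.2, "search": 2.0, "social": 0.8}
curr_m = {"tv": 1.5, "search": 1.4, "social": 0.7}

print(allocation_turnover(prev_w, curr_w))
print(rank_corr(prev_m, curr_m))
```

Note that rank_corr is undefined when every channel has the same rank in either refit, since the denominator is then zero; a real implementation would need to decide how to report that case.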