Chapter 2

Reference

Reference material for the current data contract, CLI surface, metrics, and result artifacts.

Data Contract

The backtest harness needs enough information to replay the same model surface consistently across folds and across repos.

Minimum Required Inputs

For a runnable pilot manifest, the repo needs:

a weekly source table
a date column
a KPI / response column
the locked model formula
priors
boundaries
repo targets and repo paths
fit settings and seed policy

If recommendation stability is in scope, the repo also needs:

media spend history or equivalent channel-spend history
a declared recommendation contract
a channel map from spend inputs to model terms / allocation variables
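
The two lists above can be read as a presence check on the manifest. The sketch below is only an illustration: the manifest schema is not shown in this chapter, so every key name here is hypothetical and would need to be mapped onto the real fields in pilot_manifest.yaml.

```python
# Hypothetical key names: the real manifest schema is not documented here.
REQUIRED_KEYS = [
    "source_table", "date_column", "kpi_column", "formula",
    "priors", "boundaries", "repo_targets", "repo_paths",
    "fit_settings", "seed_policy",
]
# Only needed when recommendation stability is in scope.
RECOMMENDATION_KEYS = ["spend_history", "recommendation_contract", "channel_map"]


def missing_inputs(manifest, recommendation_in_scope=False):
    """Return the required keys that are absent or empty in a manifest dict."""
    required = list(REQUIRED_KEYS)
    if recommendation_in_scope:
        required += RECOMMENDATION_KEYS
    return [k for k in required if manifest.get(k) in (None, "", [], {})]


manifest = {k: "placeholder" for k in REQUIRED_KEYS}
gaps = missing_inputs(manifest, recommendation_in_scope=True)
```

Here `gaps` lists only the three recommendation keys, since the placeholder manifest covers the base inputs but declares nothing for recommendation stability.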

Why The Source Table Matters

The formulas in scope include lagged and rolling terms. That means fold inputs must be rebuilt from source data at each cutoff rather than sliced from a full-sample engineered matrix.
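
A minimal sketch of that rebuild step, written in Python for illustration rather than taken from the harness: the source rows are truncated at the cutoff first, and the lagged and rolling terms are then computed only from what remains. The row shape, the column names, and the trailing-window definition are assumptions. Truncating before engineering guarantees that no engineered term can depend on post-cutoff observations, which a full-sample matrix cannot guarantee if any term uses centered windows or full-sample scaling.

```python
from datetime import date, timedelta


def build_fold_features(rows, cutoff, lag=1, window=4):
    """Rebuild lagged and trailing-rolling terms from source rows up to a cutoff.

    rows: list of (week_date, spend) tuples sorted by date (assumed shape).
    Terms without enough history are returned as None.
    """
    # Truncate the source BEFORE any feature engineering.
    train = [(d, x) for d, x in rows if d <= cutoff]
    values = [x for _, x in train]
    features = []
    for i, (d, x) in enumerate(train):
        lagged = values[i - lag] if i >= lag else None
        if i + 1 >= window:
            rolling = sum(values[i + 1 - window: i + 1]) / window
        else:
            rolling = None
        features.append(
            {"date": d, "spend": x, "spend_lag": lagged, "spend_roll": rolling}
        )
    return features


start = date(2024, 1, 1)
rows = [(start + timedelta(weeks=i), float(10 * (i + 1))) for i in range(8)]
cutoff = start + timedelta(weeks=5)
fold = build_fold_features(rows, cutoff)
```

With eight weekly rows and a cutoff at week 5, `fold` holds six rows, and the last row's rolling term averages weeks 2 through 5 only.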

Tracked Data Packages

The repo keeps smaller GitHub-friendly replication packages under ../data/:

_st: active engineering pilot
_ov: reserve candidate
_os: retained stress fixture

Large reviewed bundles under data_review/ are kept local only.

Current Active Pilot

The active engineering pilot is _st.

The active manifest currently lives in the local planning layer at .planning/research/pilot_manifest.yaml.

Because .planning/ is local-only in the current repo setup, colleagues working from GitHub alone should rely on the tracked replication data, README, docs, and report instead of the local planning spine.

Results And Artifacts

Backtest outputs are written under run-scoped result trees so filtered reruns do not overwrite earlier summaries.

Result Tree Shape

Typical layout:

results.../
  <dataset_id>/
    <comparison_label>/
      run_id=.../
        experiment_manifest.yaml
        fold_manifest.csv
        summary/
          run_status.csv
          holdout_scores.csv
          holdout_summary.csv
          parameter_stability_summary.csv
          recommendation_stability_summary.csv
        repo_target=<repo>/
          fold_id=01/
            run_manifest.yaml
            run_status.json
            fit_payload.rds
            prediction_payload.rds
            holdout_scores.csv
            recommendations.csv
          stability/
            parameter_drift.csv
            parameter_drift_summary.csv
            recommendation_drift.csv
            recommendation_drift_summary.csv
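
Several directory names in this layout use a key=value form (run_id=, repo_target=, fold_id=), so the scope of any artifact can be recovered from its path alone. The helper below is a hypothetical sketch and is not part of the harness. The example path is assembled from names that appear in this chapter purely for illustration, and it does not assert that this exact file exists.

```python
from pathlib import PurePosixPath


def parse_result_path(path):
    """Collect key=value path segments (run_id, repo_target, fold_id) into a dict."""
    tags = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep:  # only segments that contain "=" carry a tag
            tags[key] = value
    return tags


example = (
    "results_engineering_m1_st_full/_st/engineering_m1_st_scale_false/"
    "run_id=20260407T211743.943118Z__all-repos__all-folds__live/"
    "repo_target=charles_dev/fold_id=01/holdout_scores.csv"
)
tags = parse_result_path(example)
```

The fold identifier comes back as the string "01", so any comparison against a numeric `--fold-id` value would need an explicit conversion.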

Most Important Summary Files

summary/run_status.csv: Fold-by-fold execution state.
summary/holdout_summary.csv: Repo-level forward holdout comparison.
summary/parameter_stability_summary.csv: Repo-level adjacent-refit parameter drift summary.
summary/recommendation_stability_summary.csv: Repo-level recommendation stability summary on the current provisional shared recommendation surface.

Current Worked Example

The active _st engineering batch is:

results_engineering_m1_st_full/_st/engineering_m1_st_scale_false/
run_id=20260407T211743.943118Z__all-repos__all-folds__live/

The holdout and parameter-stability summaries are identical across repos on that example. Recommendation stability is also present, but it should still be treated as provisional.

CLI Reference

The main entrypoint is:

Rscript scripts/dsambayes-backtest.R <command> [options]

`validate`

Validate a pilot manifest and print the planned run scope.

Rscript scripts/dsambayes-backtest.R validate \
  --manifest .planning/research/pilot_manifest.yaml

`plan`

Build the run matrix for the active manifest.

Rscript scripts/dsambayes-backtest.R plan \
  --manifest .planning/research/pilot_manifest.yaml

`run`

Execute a batch or write a dry-run result tree.

Dry run:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --dry-run

Target one repo:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --repo-target charles_dev

Target one fold:

Rscript scripts/dsambayes-backtest.R run \
  --manifest .planning/research/pilot_manifest.yaml \
  --fold-id 1

Common Options

--manifest <path>
--repo-target <name>
--fold-id <n>
--dry-run
--results-root <dir>
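
For scripted reruns, the documented options can be assembled into an argument list before being handed to a process launcher. The wrapper below is a hypothetical convenience and is not shipped with the repo. It uses only the commands and flags listed in this chapter, and it assumes that the filter flags may be combined on one invocation, which this chapter does not confirm.

```python
import shlex


def build_command(command, manifest, repo_target=None, fold_id=None,
                  dry_run=False, results_root=None):
    """Assemble an argv list for scripts/dsambayes-backtest.R."""
    if command not in ("validate", "plan", "run"):
        raise ValueError(f"unknown command: {command}")
    argv = ["Rscript", "scripts/dsambayes-backtest.R", command,
            "--manifest", manifest]
    if repo_target is not None:
        argv += ["--repo-target", repo_target]
    if fold_id is not None:
        argv += ["--fold-id", str(fold_id)]
    if dry_run:
        argv.append("--dry-run")
    if results_root is not None:
        argv += ["--results-root", results_root]
    return argv


argv = build_command("run", ".planning/research/pilot_manifest.yaml",
                     fold_id=1, dry_run=True)
command_line = shlex.join(argv)  # shell-safe string for logging
```

Keeping the arguments as a list rather than a single string avoids quoting problems if a manifest path or results root ever contains spaces.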

Current Limitation

The CLI is designed around the current M1 single-series parity surface. It is not yet a general hierarchical / panel backtest runner.

Metrics Reference

Forward Holdout Metrics

RMSE: Root mean squared error on the observed KPI scale.
WMAPE: Weighted mean absolute percentage error on the observed KPI scale.
Mean error: Signed bias on the observed KPI scale.
SMAPE: Symmetric mean absolute percentage error on the observed KPI scale; a secondary holdout metric.
Holdout ELPD / log score: Expected log pointwise predictive density on the holdout window; secondary probabilistic diagnostics, reported when compatible posterior outputs are available.
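
The point metrics above can be sketched in a few lines. This is an illustration in Python rather than the harness's own code, and it rests on assumptions this chapter does not settle: WMAPE is taken as total absolute error over total absolute observed KPI, mean error is taken as predicted minus observed, and SMAPE uses the 2|e| / (|y| + |yhat|) variant. The example numbers are made up.

```python
import math


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(y, yhat)) / len(y))


def wmape(y, yhat):
    """Sum of absolute errors divided by sum of absolute observed values."""
    return sum(abs(p - o) for o, p in zip(y, yhat)) / sum(abs(o) for o in y)


def mean_error(y, yhat):
    """Signed bias; positive means over-prediction (assumed sign convention)."""
    return sum(p - o for o, p in zip(y, yhat)) / len(y)


def smape(y, yhat):
    """Symmetric MAPE, 2|e| / (|y| + |yhat|) averaged over points (assumed variant)."""
    return sum(
        2 * abs(p - o) / (abs(o) + abs(p)) for o, p in zip(y, yhat)
    ) / len(y)


observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
scores = {
    "rmse": rmse(observed, predicted),
    "wmape": wmape(observed, predicted),
    "mean_error": mean_error(observed, predicted),
    "smape": smape(observed, predicted),
}
```

On this toy series the errors are +10, -10, and +30, so the mean error is 10 and WMAPE is 50 / 600. Note that `smape` is undefined when an observed and predicted value are both zero.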

Stability Metrics

standardized_posterior_shift: Adjacent-refit coefficient shift scaled by posterior uncertainty.
allocation_turnover: 0.5 * sum(abs(w_t - w_{t-1})) across matched channels, where w_t is the recommended allocation share at refit t.
marginal_response_rank_corr: Spearman correlation of channel marginal-response ranks across adjacent refits on the shared recommendation surface.
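
The first two stability metrics can be sketched as follows. The turnover function follows the formula given above and leaves shares un-renormalized after matching, which is an assumption. The scaling in `standardized_posterior_shift` is also an assumption (a pooled posterior standard deviation), because this chapter says only that the shift is scaled by posterior uncertainty. Channel names and numbers are hypothetical.

```python
import math


def allocation_turnover(prev, curr):
    """0.5 * sum(|w_t - w_{t-1}|) over channels present in both refits."""
    matched = sorted(set(prev) & set(curr))
    return 0.5 * sum(abs(curr[c] - prev[c]) for c in matched)


def standardized_posterior_shift(mean_prev, sd_prev, mean_curr, sd_curr):
    """Posterior-mean shift divided by a pooled posterior SD (assumed scaling)."""
    pooled_sd = math.sqrt((sd_prev ** 2 + sd_curr ** 2) / 2)
    return (mean_curr - mean_prev) / pooled_sd


weights_prev = {"tv": 0.5, "search": 0.3, "social": 0.2}
weights_curr = {"tv": 0.4, "search": 0.4, "social": 0.2}
turnover = allocation_turnover(weights_prev, weights_curr)

shift = standardized_posterior_shift(0.30, 0.04, 0.36, 0.04)
```

In the example, ten points of share move from one channel to another, giving a turnover of about 0.1. The factor 0.5 prevents the same reallocation from being counted once as an outflow and again as an inflow.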

Important note:

marginal_response_rank_corr is not ROI. It is a rank comparison on the repo-owned recommendation surface. The metric was deliberately renamed from a previous ROI-style label because the current allocator does not compute true ROI.
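
A small sketch makes the rank-only nature of the metric concrete. The Spearman implementation below is generic (average ranks for ties, then a Pearson correlation of the ranks) and is not taken from the harness; the marginal-response values are made up.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


refit_prev = [0.8, 0.5, 0.2]  # hypothetical marginal responses, three channels
refit_curr = [8.0, 2.0, 5.0]  # different scale, two channels swap order
rho_swapped = spearman(refit_prev, refit_curr)
rho_rescaled = spearman(refit_prev, [10 * v for v in refit_prev])
```

Multiplying every marginal response by ten leaves the correlation at 1.0, while swapping two channels lowers it to 0.5. The metric therefore responds to channel ordering only and carries no information about return magnitudes. The function is undefined when all values in a refit are tied.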

Interpretation Guidance

Holdout metrics address predictive performance.
Parameter stability addresses how much posterior media effects move between adjacent refits.
Recommendation stability addresses how much the recommended allocation surface moves between adjacent refits under one controlled comparison scenario.

Recommendation stability in the current repo should still be treated as provisional, because the allocator surface is backtest-owned and not yet an owner-approved production policy.