
1. Headline coverage

Tick coverage: 2 yrs
MBP-10 snapshots: 24.3M
Live tick capture: 78 days
Instrument: ES (CME GLBX.MDP3)

Coverage window: Apr 2024 to Apr 2026 for ticks; Jan 27 2026 to Apr 15 2026 for MBP-10 order book. All timestamps are exchange time (UTC, CME convention), verified against our own wall-clock capture before being written to the training store.

2. Tick stack

The tick history is layered. Older ticks are synthetic 1-second OHLCV from Databento; recent ticks are real event-level trades and quotes; live ticks come from our own Sierra Chart capture running continuously since 2026-01-28.

| Range | Source | Level | Notes |
|---|---|---|---|
| 2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Used for long-horizon feature windows (VPIN, RV, regime stats) |
| 2026-02 → 2026-04 | Databento MBP-10 | Real ticks (trade + quote) | Training and backtest fuel for short-horizon models |
| 2026-01-28 → live | Sierra Chart capture | Real ticks, native feed | Live capture on our own VM; forms the deterministic replay fold |

Synthetic OHLCV-1s is not a substitute for real ticks. We do not train short-horizon models on it. It exists so that long-horizon features (e.g. 20-day realised volatility, 6-month regime baselines) are well-supported on the day we start a backtest.
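As a minimal sketch of the kind of long-horizon feature the synthetic bars exist to support, here is a plain close-to-close realised-volatility estimator. The function name, window, and annualisation are illustrative assumptions; the production feature definitions are not given in this document.

```python
import math

def realized_vol(closes: list[float], periods_per_year: int = 252) -> float:
    """Annualised close-to-close realised volatility from a series of closes.

    Generic estimator for illustration only; the actual feature set
    (windows, estimators, annualisation) is not specified in the text.
    """
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * periods_per_year)

# 21 hypothetical daily closes -> a 20-day realised-vol figure.
closes = [5000 + 2 * i for i in range(21)]
vol = realized_vol(closes)
```

The point of the two-year synthetic layer is exactly this: on day one of a backtest, a 20-day (or 6-month) window like the one above is already fully populated.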

3. MBP-10 order book

24.3 million MBP-10 snapshots, spanning 2026-01-27 to 2026-04-15, covering ES on CME GLBX.MDP3. Each snapshot is a point-in-time state of ten levels of the bid stack and ten levels of the ask stack, with size and price at each level.
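The shape of one snapshot can be sketched as a small value type: ten price/size levels per side plus an exchange timestamp. Field and class names here are illustrative assumptions, not the on-disk schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BookLevel:
    price: float
    size: int

@dataclass(frozen=True)
class Mbp10Snapshot:
    """Point-in-time state of the top ten levels on each side.

    Names are hypothetical; the actual stored schema is not given
    in the text.
    """
    ts_event_ns: int             # exchange timestamp, UTC (CME convention)
    bids: tuple[BookLevel, ...]  # index 0 = best bid, length 10
    asks: tuple[BookLevel, ...]  # index 0 = best ask, length 10

    def mid(self) -> float:
        return (self.bids[0].price + self.asks[0].price) / 2

# A synthetic ES-like book, 0.25-point ticks on each side.
snap = Mbp10Snapshot(
    ts_event_ns=1_769_558_400_000_000_000,
    bids=tuple(BookLevel(5000.00 - 0.25 * i, 10) for i in range(10)),
    asks=tuple(BookLevel(5000.25 + 0.25 * i, 10) for i in range(10)),
)
```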

Sources are layered and cross-checked rather than taken from a single vendor.

Three-source reconciliation. We run a daily job that compares bid/ask/size at matched timestamps across the three feeds. Discrepancies above a narrow tolerance are flagged before training sees them.
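The core of that daily job can be sketched as a pairwise comparison of (bid, ask, size) across feeds at a matched timestamp. The tolerances and the pairwise scheme below are illustrative assumptions; the production values are not stated in the text.

```python
def reconcile(ts: str, feeds: dict[str, tuple[float, float, int]],
              tol_px: float = 0.25, tol_sz: int = 0) -> list[str]:
    """Flag feed pairs whose (bid, ask, size) at a matched timestamp
    disagree beyond tolerance.

    Hypothetical shape of the daily reconciliation job; tolerances
    and feed names are illustrative, not production values.
    """
    flags = []
    names = sorted(feeds)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (bid_a, ask_a, sz_a) = feeds[a]
            (bid_b, ask_b, sz_b) = feeds[b]
            if (abs(bid_a - bid_b) > tol_px
                    or abs(ask_a - ask_b) > tol_px
                    or abs(sz_a - sz_b) > tol_sz):
                flags.append(f"{ts}: {a} vs {b} disagree")
    return flags
```

Anything returned by a check like this is quarantined before the training store ingests the day.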

4. 5-minute bar history

The XGB 5m head is trained on closed 5-minute OHLCV bars resampled from the tick store, not pulled from a bar vendor. This is deliberate: closing a bar from our own ticks guarantees that backtest bars match live bars exactly. It also forces us to handle the CME maintenance gap in one place, rather than inheriting whatever a vendor happened to do.
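A minimal version of that resampling step looks like the bucketing below: trades fall into 300-second buckets keyed by bar-open time, and a bucket with no trades (e.g. the daily CME maintenance gap) simply never exists, rather than being forward-filled. This is a sketch; the production resampler and its gap policy are not shown in the text.

```python
from collections import OrderedDict

BAR_SEC = 300  # 5-minute bars

def close_5m_bars(ticks: list[tuple[int, float, int]]) -> "OrderedDict[int, dict]":
    """Bucket (epoch_sec, price, size) trades into closed 5-minute OHLCV
    bars keyed by bar-open time.

    Empty buckets (e.g. the CME maintenance gap) never appear in the
    output. Illustrative sketch only.
    """
    bars: "OrderedDict[int, dict]" = OrderedDict()
    for ts, px, sz in sorted(ticks):
        key = ts - ts % BAR_SEC
        if key not in bars:
            bars[key] = {"o": px, "h": px, "l": px, "c": px, "v": 0}
        b = bars[key]
        b["h"] = max(b["h"], px)
        b["l"] = min(b["l"], px)
        b["c"] = px
        b["v"] += sz
    return bars
```

Because the same function closes bars in backtest and live, the two streams cannot diverge on bar boundaries or gap handling.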

The 52-feature set is decomposed into five families: trend, range, relative position, session context, and volatility regime. Feature importance is tracked each retrain; any single feature accounting for more than 35% of gain triggers a manual review before the model ships.
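The 35%-of-gain review trigger is simple enough to state as code. The feature names below are invented for illustration; only the 35% threshold comes from the text.

```python
def needs_review(gain_by_feature: dict[str, float], cap: float = 0.35) -> list[str]:
    """Return features whose share of total gain exceeds the review cap.

    Mirrors the 35% rule described above; feature names in any example
    are hypothetical.
    """
    total = sum(gain_by_feature.values())
    return [f for f, g in gain_by_feature.items() if g / total > cap]
```

A non-empty return blocks the retrained model from shipping until a human has looked at the dominant feature.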

5. Data quality and gaps

6. Training store and retention

The training store is a tiered layout of Parquet shards on the Linux training cluster. Each shard lives at {instrument}/{feed}/{date}/{shard}.parquet and is identified by the hash of its contents. Every F2_dom training run records the exact shard hashes it consumed; reproducing a model card from scratch requires nothing more than the hash list and the training recipe.

We retain every shard the system has ever trained on. Data is never overwritten.
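The hash-pinned reproducibility described above can be sketched as follows, assuming SHA-256 over shard bytes and a JSON manifest; the actual manifest format and hash function are not specified in the text.

```python
import hashlib
import json

def shard_hash(data: bytes) -> str:
    """Content hash of one Parquet shard's bytes (assumed SHA-256)."""
    return hashlib.sha256(data).hexdigest()

def run_manifest(shards: dict[str, bytes], recipe: str) -> str:
    """Record exactly which shard contents a training run consumed.

    Illustrative: keys are shard paths, values are raw shard bytes;
    the manifest pairs each path with its content hash so the run is
    reproducible from the hash list plus the recipe.
    """
    return json.dumps({
        "recipe": recipe,
        "shards": {path: shard_hash(blob) for path, blob in sorted(shards.items())},
    }, indent=2, sort_keys=True)
```

Because shards are never overwritten, a recorded hash always resolves to exactly one immutable file, which is what makes the manifest sufficient for reproduction.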

7. Live tick capture since 2026-01-28

A dedicated Sierra Chart instance on the Azure Windows VM captures every tick we observe into a compressed append-only file. This is the most important fold in the backtester: it is the only set of ticks where we can run a deterministic replay of what the live system saw, alongside the trade journal from IB, and prove 1:1 parity between backtest and live.
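One simple way to assert that kind of parity is to reduce each ordered event stream to a deterministic digest and compare the two. This is a sketch of the idea, not the production check; the event tuple shape is an assumption.

```python
import hashlib

def parity_digest(events: list[tuple[int, str, float, int]]) -> str:
    """Deterministic digest over an ordered event stream of
    (ts, side, price, qty) tuples.

    Running this over the replayed backtest fills and over the live
    trade journal, then comparing digests, is one way to assert 1:1
    parity. Illustrative only.
    """
    h = hashlib.sha256()
    for ts, side, px, qty in events:
        h.update(f"{ts}|{side}|{px:.2f}|{qty}".encode())
    return h.hexdigest()
```

Any divergence between replay and live, however small, changes the digest and fails the comparison loudly.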

Every day closed out of v131 becomes another row of the walk-forward fold. The 78-day figure on the home page is this capture.