
1. Headline coverage

Tick coverage: 2 yrs
MBP-10 snapshots: 24.3M
Live tick capture: 78 days
Instrument: ES (CME GLBX.MDP3)

Coverage window: Apr 2024 to Apr 2026 for ticks; Jan 27 2026 to Apr 15 2026 for MBP-10 order book. All timestamps are exchange time (UTC, CME convention), verified against our own wall-clock capture before being written to the training store.

2. Tick stack

The tick history is layered. Older ticks are synthetic 1-second OHLCV from Databento; recent ticks are real event-level trades and quotes; live ticks come from our own Sierra Chart capture running continuously since 2026-01-28.

| Range | Source | Level | Notes |
|---|---|---|---|
| 2024-04 → 2026-02 | Databento OHLCV-1s | Synthetic 1s bars | Used for long-horizon feature windows (VPIN, RV, regime stats) |
| 2026-02 → 2026-04 | Databento MBP-10 | Real ticks (trade + quote) | Training and backtest fuel for short-horizon models |
| 2026-01-28 → live | Sierra Chart capture | Real ticks, native feed | Live capture on our own VM; forms the deterministic replay fold |

Synthetic OHLCV-1s is not a substitute for real ticks. We do not train short-horizon models on it. It exists so that long-horizon features (e.g. 20-day realised volatility, 6-month regime baselines) are well-supported on the day we start a backtest.
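As a minimal sketch of the kind of long-horizon feature the synthetic bars exist to support, here is a plain close-to-close realised-volatility estimator. The function name, window, and annualisation are illustrative assumptions; the production feature definitions are not given in this document.

```python
import math

def realized_vol(closes: list[float], periods_per_year: int = 252) -> float:
    """Annualised close-to-close realised volatility from a series of closes.

    Generic estimator for illustration only; the actual feature set
    (windows, estimators, annualisation) is not specified in the text.
    """
    rets = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * periods_per_year)

# 21 hypothetical daily closes -> a 20-day realised-vol figure.
closes = [5000 + 2 * i for i in range(21)]
vol = realized_vol(closes)
```

The point of the two-year synthetic layer is exactly this: on day one of a backtest, a 20-day (or 6-month) window like the one above is already fully populated.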

3. MBP-10 order book

24.3 million MBP-10 snapshots, spanning 2026-01-27 to 2026-04-15, covering ES on CME GLBX.MDP3. Each snapshot is a point-in-time state of ten levels of the bid stack and ten levels of the ask stack, with size and price at each level.
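The shape of one snapshot can be sketched as a small value type: ten price/size levels per side plus an exchange timestamp. Field and class names here are illustrative assumptions, not the on-disk schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BookLevel:
    price: float
    size: int

@dataclass(frozen=True)
class Mbp10Snapshot:
    """Point-in-time state of the top ten levels on each side.

    Names are hypothetical; the actual stored schema is not given
    in the text.
    """
    ts_event_ns: int             # exchange timestamp, UTC (CME convention)
    bids: tuple[BookLevel, ...]  # index 0 = best bid, length 10
    asks: tuple[BookLevel, ...]  # index 0 = best ask, length 10

    def mid(self) -> float:
        return (self.bids[0].price + self.asks[0].price) / 2

# A synthetic ES-like book, 0.25-point ticks on each side.
snap = Mbp10Snapshot(
    ts_event_ns=1_769_558_400_000_000_000,
    bids=tuple(BookLevel(5000.00 - 0.25 * i, 10) for i in range(10)),
    asks=tuple(BookLevel(5000.25 + 0.25 * i, 10) for i in range(10)),
)
```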

Sources are layered and cross-checked rather than taken from a single vendor.

Three-source reconciliation. We run a daily job that compares bid/ask/size at matched timestamps across the three feeds. Discrepancies above a narrow tolerance are flagged before training sees them.
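The core of that daily job can be sketched as a pairwise comparison of (bid, ask, size) across feeds at a matched timestamp. The tolerances and the pairwise scheme below are illustrative assumptions; the production values are not stated in the text.

```python
def reconcile(ts: str, feeds: dict[str, tuple[float, float, int]],
              tol_px: float = 0.25, tol_sz: int = 0) -> list[str]:
    """Flag feed pairs whose (bid, ask, size) at a matched timestamp
    disagree beyond tolerance.

    Hypothetical shape of the daily reconciliation job; tolerances
    and feed names are illustrative, not production values.
    """
    flags = []
    names = sorted(feeds)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            (bid_a, ask_a, sz_a) = feeds[a]
            (bid_b, ask_b, sz_b) = feeds[b]
            if (abs(bid_a - bid_b) > tol_px
                    or abs(ask_a - ask_b) > tol_px
                    or abs(sz_a - sz_b) > tol_sz):
                flags.append(f"{ts}: {a} vs {b} disagree")
    return flags
```

Anything returned by a check like this is quarantined before the training store ingests the day.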

4. 5-minute bar history

The XGB 5m head is trained on closed 5-minute OHLCV bars resampled from the tick store, not pulled from a bar vendor. This is deliberate: closing a bar from our own ticks guarantees that backtest bars match live bars exactly. It also forces us to handle the CME maintenance gap in one place, rather than inheriting whatever a vendor happened to do.
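A minimal version of that resampling step looks like the bucketing below: trades fall into 300-second buckets keyed by bar-open time, and a bucket with no trades (e.g. the daily CME maintenance gap) simply never exists, rather than being forward-filled. This is a sketch; the production resampler and its gap policy are not shown in the text.

```python
from collections import OrderedDict

BAR_SEC = 300  # 5-minute bars

def close_5m_bars(ticks: list[tuple[int, float, int]]) -> "OrderedDict[int, dict]":
    """Bucket (epoch_sec, price, size) trades into closed 5-minute OHLCV
    bars keyed by bar-open time.

    Empty buckets (e.g. the CME maintenance gap) never appear in the
    output. Illustrative sketch only.
    """
    bars: "OrderedDict[int, dict]" = OrderedDict()
    for ts, px, sz in sorted(ticks):
        key = ts - ts % BAR_SEC
        if key not in bars:
            bars[key] = {"o": px, "h": px, "l": px, "c": px, "v": 0}
        b = bars[key]
        b["h"] = max(b["h"], px)
        b["l"] = min(b["l"], px)
        b["c"] = px
        b["v"] += sz
    return bars
```

Because the same function closes bars in backtest and live, the two streams cannot diverge on bar boundaries or gap handling.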

The 52-feature set is decomposed into five families: trend, range, relative position, session context, and volatility regime. Feature importance is tracked each retrain; any single feature accounting for more than 35% of gain triggers a manual review before the model ships.
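The 35%-of-gain review trigger is simple enough to state as code. The feature names below are invented for illustration; only the 35% threshold comes from the text.

```python
def needs_review(gain_by_feature: dict[str, float], cap: float = 0.35) -> list[str]:
    """Return features whose share of total gain exceeds the review cap.

    Mirrors the 35% rule described above; feature names in any example
    are hypothetical.
    """
    total = sum(gain_by_feature.values())
    return [f for f, g in gain_by_feature.items() if g / total > cap]
```

A non-empty return blocks the retrained model from shipping until a human has looked at the dominant feature.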

5. Data quality and gaps

6. Training store and retention

The training store is a tiered layout of Parquet shards on the Linux training cluster. Each shard lives at {instrument}/{feed}/{date}/{shard}.parquet and is identified by the hash of its contents. Every F2_dom training run records the exact shard hashes it consumed; reproducing a model card from scratch requires nothing more than the hash list and the training recipe.

We retain every shard the system has ever trained on. Data is never overwritten.
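The hash-pinned reproducibility described above can be sketched as follows, assuming SHA-256 over shard bytes and a JSON manifest; the actual manifest format and hash function are not specified in the text.

```python
import hashlib
import json

def shard_hash(data: bytes) -> str:
    """Content hash of one Parquet shard's bytes (assumed SHA-256)."""
    return hashlib.sha256(data).hexdigest()

def run_manifest(shards: dict[str, bytes], recipe: str) -> str:
    """Record exactly which shard contents a training run consumed.

    Illustrative: keys are shard paths, values are raw shard bytes;
    the manifest pairs each path with its content hash so the run is
    reproducible from the hash list plus the recipe.
    """
    return json.dumps({
        "recipe": recipe,
        "shards": {path: shard_hash(blob) for path, blob in sorted(shards.items())},
    }, indent=2, sort_keys=True)
```

Because shards are never overwritten, a recorded hash always resolves to exactly one immutable file, which is what makes the manifest sufficient for reproduction.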

7. Live tick capture since 2026-01-28

A dedicated Sierra Chart instance on the Azure Windows VM captures every tick we observe into a compressed append-only file. This is the most important fold in the backtester: it is the only set of ticks where we can run a deterministic replay of what the live system saw, alongside the trade journal from IB, and prove 1:1 parity between backtest and live.
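One simple way to assert that kind of parity is to reduce each ordered event stream to a deterministic digest and compare the two. This is a sketch of the idea, not the production check; the event tuple shape is an assumption.

```python
import hashlib

def parity_digest(events: list[tuple[int, str, float, int]]) -> str:
    """Deterministic digest over an ordered event stream of
    (ts, side, price, qty) tuples.

    Running this over the replayed backtest fills and over the live
    trade journal, then comparing digests, is one way to assert 1:1
    parity. Illustrative only.
    """
    h = hashlib.sha256()
    for ts, side, px, qty in events:
        h.update(f"{ts}|{side}|{px:.2f}|{qty}".encode())
    return h.hexdigest()
```

Any divergence between replay and live, however small, changes the digest and fails the comparison loudly.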

Every day closed out of v131 becomes another row of the walk-forward fold. The 78-day figure on the home page is this capture.