# Spike 02 — JS Monte Carlo Baseline: Execution Plan

**Created:** 2026-05-20
**Owner:** Ronan
**Cites:** [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md), [Spike 01 REPORT.md](../01-pyodide-cold-start/REPORT.md)
**Purpose:** Day-by-day plan for the JS Monte Carlo baseline spike. Persistent reference doc — pick up from wherever you left off.

**Current status (2026-05-20):** Scaffolded. Pre-flight files written. Day 1 measurements not yet run.

This document is the operational counterpart to [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md). ADR-0006 specifies *what* the lighthouse workload is and *why*; this doc specifies *how* and *when* to measure the baseline. The two should never disagree — if you find yourself needing to change a measurement target, update ADR-0006 (or its successor) first, then this plan.

---

## Approach summary (settled choices)

| Choice | Decision | Source |
|---|---|---|
| Toolchain | Plain static HTML/JS, no build step | Carried over from Spike 01 |
| Ordering | Bottom-up — trivial JS worker first, real kernel later, `SharedWorker` last | This plan |
| Repo location | `spikes/02-js-monte-carlo-baseline/` per [CLAUDE.md](../../CLAUDE.md) | Planning-repo convention |
| Local server | `python3 serve.py` (Spike 01's server, CSP tightened) | Day 1 |
| Kernel source | Vendored in `worker.js` initially — no CDN. The point is to measure the JS-payload cold-start floor without a network round-trip we don't control. | This plan |
| Reference workload | Pi-by-sampling (Monte Carlo classic) as the harness placeholder. The real reference workload is deferred per [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md) §Open questions. | This plan |
| Runs per condition | 20 (p50 + p95), matching Spike 01 | Spike 01 |
| Cold-cache method | Incognito window per run, with DevTools open + "Disable cache" + throttling configured **before** navigation (per Spike 01 Day 1 finding) | Spike 01 |

---

## Pre-flight (done)

- [x] Spike directory scaffolded (`spikes/02-js-monte-carlo-baseline/`)
- [x] README.md
- [x] PLAN.md (this document)
- [x] REPORT.md skeleton
- [x] `index.html` host page with locked-down CSP
- [x] `worker.js` JS Monte Carlo kernel harness (trivial + pi-by-sampling task)
- [x] `main.js` page wiring
- [x] `instrumentation.js` six-timestamp recorder (lifted from Spike 01)
- [x] `style.css` (lifted from Spike 01)
- [x] `serve.py` with CSP (tighter than Spike 01: no `wasm-unsafe-eval`, no CDN allowance)

---

## Day-by-day plan

### Day 1 — Baseline (plain Web Worker, JS kernel, trivial + pi tasks)

**Goal:** the unoptimised cold-start number for a JS worker under a hardened CSP. This is the moment-of-truth measurement; the whole point of ADR-0006 is that this number is in a different regime than Spike 01's.

**Build / verify:**
- Host page (`index.html`) served with a strict CSP — explicitly **no `unsafe-eval`**, no `wasm-unsafe-eval` (JS-only), no third-party origins.
- Plain Web Worker (`worker.js`) with the kernel inlined. The page dispatches a trivial task (`return 2 + 2`) and a pi-by-sampling task to it.
- Six timestamps logged per Spike 01:
  1. `t0` — embed script eval start
  2. `t1` — Worker created
  3. `t2` — worker module top-level eval start
  4. `t3` — worker reports ready (will be ~equal to `t2` on Day 1 — no async setup)
  5. `t4` — first task dispatched
  6. `t5` — first task result received
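
A minimal sketch of how the six marks above might be recorded and reduced to the headline number. The names `createRecorder`, `coldStartMs`, and `firstTaskMs` are illustrative and are not taken from `instrumentation.js`; the sketch also assumes every mark is read from one shared clock.

```javascript
// Hypothetical recorder for the six timestamps t0..t5 (names are assumptions).
const MARKS = ["t0", "t1", "t2", "t3", "t4", "t5"];

function createRecorder(now = () => performance.now()) {
  const marks = {};
  return {
    // Record one named mark; reject unknown names and double recording.
    mark(name) {
      if (!MARKS.includes(name)) throw new Error(`unknown mark: ${name}`);
      if (name in marks) throw new Error(`mark recorded twice: ${name}`);
      marks[name] = now();
    },
    // Return all marks plus the derived durations, or throw if any is missing.
    summary() {
      const missing = MARKS.filter((m) => !(m in marks));
      if (missing.length > 0) {
        throw new Error(`missing marks: ${missing.join(", ")}`);
      }
      return {
        ...marks,
        coldStartMs: marks.t5 - marks.t0, // end-to-end cold-start
        firstTaskMs: marks.t5 - marks.t4, // first task round-trip
      };
    },
  };
}
```

In a browser, `performance.now()` inside a worker is measured from the worker's own time origin, so worker-side marks (`t2`, `t3`) would likely need `performance.timeOrigin + performance.now()` on both sides before they can be compared with page-side marks.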

**Measure:**
- 20 runs in fresh incognito windows, desktop unthrottled. Record p50 and p95 of `t5 − t0` (end-to-end cold-start).
- Repeat 20 runs with DevTools throttling at "Fast 4G" / 10 Mbps / 50ms RTT (same conditions as Spike 01 for direct comparison).
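
For reducing the 20 runs to p50 and p95, a sketch along these lines may be enough. It assumes the nearest-rank percentile definition, which this plan does not specify; whichever definition is used should match Spike 01 so the numbers stay comparable. The function names are illustrative.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. (Definition is an assumption.)
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Reduce a list of per-run records ({ t0, t5, ... }) to the two headline values.
function summariseColdStart(runs) {
  const coldStarts = runs.map((run) => run.t5 - run.t0);
  return { p50: percentile(coldStarts, 50), p95: percentile(coldStarts, 95) };
}
```

With 20 samples, nearest-rank p50 is the 10th-smallest value and p95 is the 19th-smallest, so a single outlier run does not move p95.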

**Stop and look:**
- Headline number: median end-to-end cold-start, Fast 4G, trivial task.
- Compare against Spike 01's Fast 4G number (6895 ms cold, 5993 ms warm-IDB) to confirm the regime change. If Day 1 unthrottled p50 is >500ms or Fast 4G p50 is >1s, **stop and investigate** — something is wrong with the harness or the CSP, because a KB-scale JS worker should not be near those numbers.
- Steady-state pi-by-sampling latency: report it. This is the floor for the per-task budget the scheduler has to amortise.

**Write to REPORT.md:** "Day 1 — Baseline" section, with numbers, browser version, raw timings.

---

### Day 2 — Real workload shape + cross-browser determinism

**Goal:** measure two things that the [ADR-0006 amendment (2026-05-21)](../../docs/adr/0006-lighthouse-js-monte-carlo.md#changelog) made load-bearing:

1. **Cross-browser bitwise determinism** for a realistic Monte Carlo kernel. §Decision item 7 of ADR-0006 commits to deterministic-seed cross-node re-execution as the verification model. That commitment is conditional on V8 / SpiderMonkey / JavaScriptCore producing bitwise-identical results for the same seed. IEEE 754 requires correctly rounded results for basic arithmetic and square root, but it does not require the same of transcendental functions, and ECMAScript leaves functions such as `Math.exp` and `Math.sin` implementation-approximated. Accumulated results also depend on the order of operations. So the question is empirical, not theoretical.
2. **The postMessage round-trip floor.** The submitter SDK design needs to know "what's the smallest task that's worth dispatching one-at-a-time vs batching?" That's a floor set by postMessage dispatch cost, not by kernel work. Measuring this informs the SDK shape — see ADR-0006 §Open questions on submitter SDK.

The latency curve (task duration vs N samples) is the third measurement, but it is a side-effect of the determinism and round-trip work, not the primary goal.

**Build:**
- The pi-by-sampling task is already in `worker.js` from Day 1 scaffolding. It uses splitmix64 — explicit seed. Confirm it returns `{ estimate, n, seed }` so cross-browser comparison is straightforward.
- Add a second task fixture: a small agent-based step (e.g. 1000 random walkers, 100 steps each, return aggregate or per-walker final position). Still sub-second, still deterministic-given-seed. It exercises `Math.sqrt`, possibly transcendentals such as `Math.exp`, and accumulation patterns that pi-by-sampling doesn't. The point is to stress determinism on operations whose results the specs *don't* fully nail down.
- Add a "round-trip floor" probe: dispatch the trivial task (`return 2 + 2`) in a tight loop of N=1000, batched and unbatched. Measure mean per-task overhead. (No kernel work — pure dispatch cost.)
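
A sketch of what a seeded pi-by-sampling kernel returning `{ estimate, n, seed }` could look like. The splitmix64 constants are the standard published ones; everything else (the `BigInt` representation, taking the top 53 bits for the unit float, the function names) is an assumption and may differ from what `worker.js` actually does.

```javascript
const MASK64 = (1n << 64n) - 1n;

// splitmix64 over BigInt state, wrapped to 64 bits after each step.
function splitmix64(seed) {
  let state = BigInt.asUintN(64, BigInt(seed));
  return function next() {
    state = (state + 0x9e3779b97f4a7c15n) & MASK64;
    let z = state;
    z = ((z ^ (z >> 30n)) * 0xbf58476d1ce4e5b9n) & MASK64;
    z = ((z ^ (z >> 27n)) * 0x94d049bb133111ebn) & MASK64;
    return z ^ (z >> 31n);
  };
}

// Top 53 bits of the 64-bit output, scaled to [0, 1). Both steps are exact.
function nextUnitFloat(next) {
  return Number(next() >> 11n) / 2 ** 53;
}

// Estimate pi from n points in the unit square; no Math.random anywhere.
function piBySampling(n, seed) {
  const next = splitmix64(seed);
  let inside = 0;
  for (let i = 0; i < n; i++) {
    const x = nextUnitFloat(next);
    const y = nextUnitFloat(next);
    if (x * x + y * y < 1) inside++;
  }
  return { estimate: (4 * inside) / n, n, seed };
}
```

In this shape the only floating-point operations are multiplication, addition, comparison, and one final division, all of which are basic IEEE 754 arithmetic. That is one reason the agent-walk fixture is the more interesting determinism probe.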

**Measure:**

*Cross-browser determinism (the load-bearing measurement):*
- Run pi-by-sampling with seed=42, N=1e6 on Chrome stable, Firefox stable, Safari stable. Log the estimate as a full-precision `Number.prototype.toString()` or the underlying Float64 bits (`new Float64Array([estimate]).buffer` as hex). Compare bitwise across browsers.
- Same for seed=42, N=1e7 (longer accumulation — more chance for order-of-operations divergence).
- Same for agent-walk with seed=42, walkers=1000, steps=100. Bitwise comparison of the aggregate.
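- For each task: also confirm within-browser determinism (10 runs, same seed → bitwise-identical results).
- Repeat the cross-browser comparisons with seed=1, so that both seeds named in threshold 7 of the anti-rationalisation table are covered.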
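
One possible way to log the Float64 bits for the bitwise comparison is sketched below. It uses a `DataView` with explicit big-endian byte order so the hex string does not depend on the platform's byte order; the function names are illustrative.

```javascript
// Return the 64 raw bits of a Number as a 16-digit lowercase hex string.
function float64ToHex(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value, false); // false = big-endian
  return view.getBigUint64(0, false).toString(16).padStart(16, "0");
}

// Bitwise equality: stricter than ===, which treats 0 and -0 as equal.
function bitwiseEqual(a, b) {
  return float64ToHex(a) === float64ToHex(b);
}

console.log(float64ToHex(1)); // 3ff0000000000000
console.log(float64ToHex(0.1)); // 3fb999999999999a
console.log(bitwiseEqual(0.1 + 0.2, 0.3)); // false (differs by 1 ulp)
```

Pasting the hex strings from each browser into REPORT.md makes a 1-ulp divergence visible at a glance, which a rounded decimal printout could hide.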

*Latency curve (the diagnostic):*
- pi-by-sampling at N=1e5, 1e6, 1e7. 10 runs each. p50 wall-clock per task. Chrome stable + Safari (the engine extremes from Day 1).
- agent-walk at one size (1000 walkers, 100 steps). 10 runs. p50 wall-clock.

*postMessage round-trip floor:*
- Trivial task in a tight loop of 1000 dispatches, measured two ways:
  - **Unbatched**: dispatch one task per postMessage, wait for result, dispatch next. Total time / 1000 = mean per-task RTT.
  - **Batched**: single postMessage of a batch=1000 spec; worker runs the loop internally; single response. Total time / 1000 = mean per-task cost when amortised over a batch.
- Chrome stable + Safari.
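
The two probes above could be sketched as follows. The message shapes (`{ id, kind }`, `{ id, kind: "batch", count }`) and all function names are hypothetical and are not taken from `worker.js` or `main.js`.

```javascript
// Worker side: answer one trivial task, or run `count` of them in one message.
function handleMessage(msg) {
  if (msg.kind === "trivial") return { id: msg.id, result: 2 + 2 };
  if (msg.kind === "batch") {
    const results = new Array(msg.count);
    for (let i = 0; i < msg.count; i++) results[i] = 2 + 2;
    return { id: msg.id, results };
  }
  throw new Error(`unknown kind: ${msg.kind}`);
}
// In the worker: self.onmessage = (e) => self.postMessage(handleMessage(e.data));

// Page side: post one message and resolve when the matching reply arrives.
function request(worker, msg) {
  return new Promise((resolve) => {
    const onMessage = (event) => {
      if (event.data.id !== msg.id) return;
      worker.removeEventListener("message", onMessage);
      resolve(event.data);
    };
    worker.addEventListener("message", onMessage);
    worker.postMessage(msg);
  });
}

// Unbatched: n sequential round-trips. Returns mean ms per dispatch.
async function measureUnbatched(worker, n) {
  const start = performance.now();
  for (let i = 0; i < n; i++) await request(worker, { id: i, kind: "trivial" });
  return (performance.now() - start) / n;
}

// Batched: one round-trip carrying n tasks. Returns amortised ms per task.
async function measureBatched(worker, n) {
  const start = performance.now();
  await request(worker, { id: 0, kind: "batch", count: n });
  return (performance.now() - start) / n;
}
```

Awaiting each reply before sending the next message is what makes the unbatched figure a round-trip time rather than a throughput figure; posting all 1000 messages up front would measure something different.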

**Stop and look:**

- **Determinism — the load-bearing question.** If bitwise results match across all three browsers for both tasks at both sizes: ADR-0006 §Decision item 7 holds without further constraints. The verification model is viable. If results match within a browser but diverge across browsers (even by 1 ulp): ADR-0006 §Decision item 7 needs an amendment — either an SDK constraint (integer-only kernels, or a pinned deterministic-math library like fdlibm-as-WASM) or a different verification model (statistical sampling, trusted-anchor re-execution). Capture both the finding and the proposed amendment direction in REPORT.md. **This is the most important finding from Day 2; do not skip it.**
- **Round-trip floor.** Unbatched vs batched per-task cost tells the SDK what to do. If unbatched cost is dominated by RTT (>1 ms per dispatch on a sub-millisecond kernel), the SDK should default to batching and expose batch size as a parameter; above the 5 ms fail threshold, defaulting to batching is mandatory. If unbatched RTT is small relative to kernel work (e.g. RTT ~0.1 ms vs pi-N=1e6 ~5 ms), the SDK can dispatch one-at-a-time and stay simple. Either is a real finding for the SDK design.
- **Latency curve.** Should be roughly linear in N. Non-linearity at large N (e.g. GC pauses, BigInt overhead in the PRNG state mixing) is worth investigating; non-linearity at small N is the RTT floor showing through.
- **Math.random vs seeded PRNG.** Out of scope as a measurement (the harness already uses splitmix64) but worth re-confirming: if for any reason `Math.random` shows up in the kernel, that's an immediate non-determinism source and a bug.
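
A cheap way to re-confirm the `Math.random` point is to make any call to it fail loudly while a kernel runs. The wrapper below is a hypothetical test-harness helper, not something the plan or `worker.js` already defines.

```javascript
// Run fn with Math.random temporarily replaced by a throwing stub,
// then restore the original even if fn throws.
function withMathRandomForbidden(fn) {
  const original = Math.random;
  Math.random = () => {
    throw new Error("Math.random called inside a seeded kernel");
  };
  try {
    return fn();
  } finally {
    Math.random = original;
  }
}
```

Wrapping each task fixture once during Day 2, for example `withMathRandomForbidden(() => runTask(spec))` with `runTask` standing in for the real entry point, would turn an accidental `Math.random` dependency into an immediate error rather than a silent cross-run divergence.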

**Findings to push back upstream:**
- ADR-0006 §Decision item 7 — amend if cross-browser bitwise determinism fails.
- ADR-0006 §Open question on submitter SDK shape — fold the RTT-floor finding into the SDK design.
- A new finding line in REPORT.md §Findings to push back upstream for any non-IEEE-spec behaviour observed.

**Write to REPORT.md:** "Day 2 — Workload shape + cross-browser determinism" section.

---

### Day 3 — `SharedWorker` lifecycle

**Goal:** measure subsequent-page connect time when the `SharedWorker` is already running — the cross-page experience that survives the ADR-0005 → ADR-0006 supersession.

This is the measurement that was scheduled for Spike 01 Day 4; that plan was superseded, and the measurement is run here with a runtime that fits.

**Build:**
- Refactor `worker.js` from `Worker` to `SharedWorker`. Communication moves to `port.postMessage` / `port.onmessage`.
- Add a second page (`page2.html`) on the same origin. Identical embed script.
- Add navigation: from `index.html`, link to `page2.html`. Both connect to the same `SharedWorker`.
- Log a per-instance UUID at boot so we can verify the second page sees the same instance.
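
A sketch of the `SharedWorker` side of this refactor, under the assumption that each connecting page gets a ready message carrying the instance ID. `attachPort`, the message shapes, and the `runTask` parameter are illustrative names, not existing code.

```javascript
// Wire one connecting page's MessagePort to the shared kernel.
// Every reply carries instanceId so pages can check they share one worker.
function attachPort(port, instanceId, runTask) {
  // Assigning onmessage also starts the port, so no explicit port.start().
  port.onmessage = (event) => {
    const { id, task } = event.data;
    port.postMessage({ id, instanceId, result: runTask(task) });
  };
  port.postMessage({ ready: true, instanceId });
}

// Only runs inside a SharedWorker global scope; skipped everywhere else.
if (
  typeof SharedWorkerGlobalScope !== "undefined" &&
  self instanceof SharedWorkerGlobalScope
) {
  const instanceId = crypto.randomUUID(); // generated once per worker boot
  const runTask = (task) => (task === "trivial" ? 2 + 2 : null);
  self.onconnect = (event) => attachPort(event.ports[0], instanceId, runTask);
}
```

On the page side, `new SharedWorker("worker.js").port` would take the place of the `Worker` object. If `index.html` and `page2.html` log different `instanceId` values, the worker is not actually being shared, which is the distinction the Day 3 "Stop and look" step needs.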

**Measure:**
- Land on `index.html` cold — full Day 1 path. Record `t0`–`t5`.
- Navigate to `page2.html`. Record time from page parse to first task result. Target <50ms p50.
- 20 runs of the navigation case.

**Cross-browser:**
- Chrome stable, Firefox stable, Safari stable on macOS. Note any failures or quirks.
- Real iOS Safari device if accessible. Specifically test: does the `SharedWorker` survive same-origin navigation? Does it survive a backgrounded tab + return?

**Stop and look:**
- Subsequent-page connect should be near-instant (<50ms p50). If it's slow, dig in: is the connection itself slow, or is the worker not actually shared?
- iOS Safari edge cases are expected (Safari has historically lagged on `SharedWorker`). Document them, don't try to "fix" them in the spike — the fallback ladder is the architectural answer.

**Write to REPORT.md:** "Day 3 — SharedWorker lifecycle" section.

---

### Day 4 (optional, gated) — WASM-compiled submitter sanity check

**Goal:** confirm the WASM-generic runtime contract works for a WASM-compiled kernel and measure the cold-start delta vs the pure-JS path.

**Build only if Days 1–3 land clean.** This is a contract-validation measurement, not a lighthouse-decision input.

**Build:**
- Compile a tiny Rust or AssemblyScript kernel (~1–10 KB WASM) that exposes the same pi-by-sampling interface. Same seeded PRNG.
- Add `wasm-unsafe-eval` to the CSP. Document the directive in `serve.py` exactly as Spike 01 did.
- Worker loads the WASM, runs the same task, returns the same result.
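
To illustrate the loading step only, the sketch below instantiates a hand-assembled stand-in module that exports `add(i32, i32) -> i32`. It is not the Rust or AssemblyScript pi kernel, and `instantiateKernel` is a hypothetical name. The synchronous constructors keep the sketch short; for a module fetched over the network, `WebAssembly.instantiateStreaming` would be the more typical choice.

```javascript
// Stand-in module (assumption): exports add(a, b) = a + b on i32.
const ADD_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body
]);

// Compile and instantiate synchronously; returns the module's exports.
function instantiateKernel(bytes, imports = {}) {
  const module = new WebAssembly.Module(bytes);
  return new WebAssembly.Instance(module, imports).exports;
}

const kernel = instantiateKernel(ADD_WASM);
console.log(kernel.add(2, 2)); // 4
```

Under the Day 1 to Day 3 CSP, which allows neither `unsafe-eval` nor `wasm-unsafe-eval`, the compile step is expected to be blocked in the browser. Confirming that it fails before the directive is added and succeeds after is a quick check that the CSP is actually being applied.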

**Measure:**
- 20 runs of cold-start with the WASM kernel, Fast 4G.
- Per-task latency for the WASM kernel vs the pure-JS kernel.

**Stop and look:**
- If the WASM kernel cold-start is meaningfully worse than pure-JS, the SDK should default to JS and treat WASM as opt-in. If they're comparable, WASM-compiled submitters are a viable first-class path.
- A tiny WASM kernel is *not* expected to bring back the 5.6 MB Pyodide regime. The Pyodide finding was about Pyodide, not about WASM-in-browser in general.

**Write to REPORT.md:** "Day 4 — WASM submitter" section.

---

### Day 5 — Final report

**Goal:** finalise the baseline numbers and write the recommendation.

**Finalise REPORT.md:**
- Fill in the full results table.
- Write the **baseline-vs-targets recommendation** at the top. Two paragraphs:
  - What the numbers say.
  - Recommendation: proceed to architecture work with these numbers as the floor / revise [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md) / flag a structural issue.
- Sign off date and name.

**Update the citing docs:**
- Append findings summary to ADR-0006's §References. If the numbers are clean, this is a contributing input to ADR-0006 → Accepted (the other input being the demand-validation pass).
- If the numbers reveal a structural problem, that's its own ADR or amendment.

---

## How to pick up if you've stopped

The spike folder is structured so you can stop at any natural boundary and resume. To resume:

1. Read this PLAN.md to find where you left off (look for the last day with completed measurements in REPORT.md).
2. Read `REPORT.md` to see what numbers exist.
3. Read the most recent day's notes — they should explain any state (CSP changes, harness changes, browser versions).
4. If anything is unclear about state, run the trivial Day 1 measurement again — it's cheap and confirms the harness still works.

If you've stopped because something broke: the failing condition is itself a finding. Document it before "fixing forward" — sometimes the right move is to escalate to ADR-0006 (or a successor) rather than patch the spike.

---

## Anti-rationalisation rule

Thresholds are written **before** the spike runs. They are deliberately tighter than Spike 01's because the JS payload is two orders of magnitude smaller than Pyodide's.

| # | Measurement | Target (pass) | Fail threshold |
|---|---|---|---|
| 1 | First-page cold-start, cold cache, Fast 4G | <500 ms p50 | >1.0 s p50 |
| 2 | First-page cold-start, warm HTTP cache | <100 ms p50 | >300 ms p50 |
| 3 | Subsequent-page `SharedWorker` connect | <50 ms p50 | >250 ms p50 |
| 4 | Steady-state task latency, trivial | <5 ms p50 | >20 ms p50 |
| 5 | Steady-state task latency, pi-by-sampling N=1e6 | (diagnostic, no fixed target) | flag if >100 ms p50 |
| 6 | Memory footprint | <50 MB RSS | >150 MB (flag, not auto-fail) |
| 7 | **Cross-browser bitwise determinism** (Chrome stable / Firefox / Safari, pi-by-sampling + agent-walk, seeds 42 and 1, N covering accumulation depth) | identical Float64 bits across all three engines for every (task, seed, N) combination | any cross-browser bitwise divergence — gates ADR-0006 §Decision item 7 amendment |
| 8 | Per-task postMessage round-trip floor, unbatched, trivial task | <1 ms mean per dispatch | >5 ms — SDK must default to batching |

If a number lands between target and fail threshold, **write down both** in REPORT.md and the recommendation. Do not adjust the thresholds post-hoc to make a yellow number green. If the threshold turns out to have been wrong, that's a finding too — say so explicitly.
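
The three-way outcome described above can be made mechanical. The sketch below applies to the rows with a numeric "less than" target and "greater than" fail threshold; the function name and the `"between"` label are illustrative, and values exactly on a boundary are treated as between, which is an assumption the table does not settle.

```javascript
// Classify a measured value against a pass target and a fail threshold.
// Both bounds are strict: value < target passes, value > fail fails.
function classify(value, target, fail) {
  if (value < target) return "pass";
  if (value > fail) return "fail";
  return "between";
}

// Row 1: first-page cold-start, cold cache, Fast 4G (ms, p50).
console.log(classify(320, 500, 1000)); // pass
console.log(classify(700, 500, 1000)); // between
console.log(classify(1200, 500, 1000)); // fail
```

Hard-coding the target and fail values in the harness before any measurement is run is one way to keep them from drifting after the fact.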

---

## What's out of scope for this spike

- Validating the lighthouse workload class. Demand-validation conversations do that — see [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md) §Open questions.
- Choosing the public-demo reference workload (Magpie subset, SIR, options pricing, published GP benchmark). Deferred until demand validation clarifies which shape resonates.
- Submitter SDK ergonomics. A separate RFC or design note covers that; this spike measures a hand-rolled harness, not a real SDK surface.
- Production embed script, scheduler protocol, coordinator wire format.
- Cross-origin third-party host testing on real WordPress / production sites. Synthetic locked-down host is sufficient for the baseline.
- Pyodide / Python re-measurement. Held as a future runtime per [ADR-0006](../../docs/adr/0006-lighthouse-js-monte-carlo.md) §Decision item 5.

If you find yourself building any of the above: stop. Either it's a different spike, or it's the next phase entirely.
