Status: Phase 0 ADR (scoping), v0.1 (drafted 2026-05-16); CEO ratification of proposal direction received 2026-05-15 via Architect decision walkthrough (Item 1); Phase 1+ gated on trigger signals (see § "Phase Sequencing")
Date: 2026-05-16 (v0.1; Phase 0 scoping ADR per CEO ratification of e2e suite design proposal direction May 15)
Supersedes: None (generalizes ADR-061’s three-phase calibration shape to project scope; ADR-061 remains the canonical reference for the ethics-path-specific instance)
Issues: #1004 (probe-set harness; narrow-scoped existing instance), #1018 (audit-write integration tests; narrow-scoped existing instance), #1070 (multi-turn evaluation harness; most-generalized existing instance, canonical-retest-run8.py)
Related: ADR-061 (LLM-touch boundary enforcement, three-phase calibration template), Pattern-070 (Cleanup-Job-with-Cancellation-Hygiene, Emerging; operational invariants apply to Layer 2 harness orchestration), Pattern-072 (Registries that Grow into Architectural Shapes, Proven; probe registry is same-shape instance), PDR-005 (BYOC distribution model; Phase 5 cross-host gated by MCP server ship)
Deciders: Chief Architect (drafted); CEO ratification of direction (2026-05-15); Lead Developer (Phase 1+ implementation refinement at trigger time); CXO (probe-set scoping refinement); CIO (methodology shelf consideration for operational invariants)
The project has accumulated three narrow-scoped end-to-end harness instances over ~3 weeks of work, each scoped to a specific surface:
- canonical-retest-run8.py: the most-generalized of the three; multi-turn synthetic conversations driven through the floor
Each instance solves a specific validation problem (boundary enforcement; audit-envelope integrity; multi-turn conversation flow). What does not yet exist is a cross-surface e2e harness: a single test surface that drives synthetic inputs through the entire request lifecycle (API entry → intent classification → workflow dispatch → LLM call → ethics detection → response generation → audit-envelope writing) and validates the integration of those steps, not the unit-level correctness of each.
Two structural signals from May 15 converge on the need for this surface:
- pending → running → completed/failed/canceled job lifecycle. The pattern generalizes.
The project also has four named test-rigor trigger signals (per the May 4 PM walkthrough) that would justify tightening test rigor: coverage trend drops, a latent-bug regression ships, the alpha→beta transition, or a feature ships test-free where missing tests are obviously needed. None of these has fired yet: Lead Dev’s discipline holds (79% test coverage on code-touching commits as of the May 4 review). What has surfaced is that the design horizon for e2e suite architecture is longer than the implementation horizon, so we would regret not having the scoping ADR in place the moment a trigger fires.
Unit + integration test coverage at the component level is solid. The gap is at the whole-flow level under realistic conditions:
Designing under trigger pressure costs weeks of scramble. Designing pre-trigger costs ~1 architectural session (this ADR). The asymmetric cost favors landing the structural commitment now even if Phase 1+ implementation waits for an actual trigger.
The proposal direction was ratified by CEO May 15 via Architect decision walkthrough (Item 1). The four-layer shape and five-phase sequence were agreed; this ADR is the formal Phase 0 scoping artifact.
Cross-surface end-to-end validation via a generalized simulation-harness pattern is the right architectural primitive for an LLM-touch product approaching multi-host distribution. The four-layer shape generalizes ADR-061’s three-phase calibration template; the five-phase implementation sequence packages it for trigger-driven rollout without forcing implementation work before signals justify it.
The existing narrow-scoped harnesses (#1004 probe-set, #1018 audit-write, #1070 multi-turn) are reference instances; they fold into the generalized harness when Phase 2 lands rather than being replaced standalone.
The e2e suite is structured as four operational layers, mirroring Pattern-070’s operational invariants for cleanup-job-with-cancellation-hygiene (transaction-boundary isolation; cancellation hygiene; lifespan wiring; failure isolation envelope):
Layer 1 — Synthetic Input Registry
A catalog of probe sets by surface, with each probe carrying a structured shape:
    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class Probe:
        input: str  # the synthetic input
        surface: str  # which surface is probed (ethics, intent_classification, slot_extraction, multi_turn, etc.)
        expected_intent: Optional[str]  # for intent-classification probes
        expected_action_class: Optional[str]  # PASS / DECLINE / DEGRADE / etc.
        expected_audit_shape: Optional[dict]  # structured fields the audit envelope must contain
        severity: Literal["critical", "important", "informational"]
        notes: str  # human context for the probe
This is the same-shape pattern as Pattern-072 (Registries that Grow into Architectural Shapes, Proven via #1094): a typed catalog of entries dispatched at consumption. The task_type registry, safe_surface() registry, and the prospective probe registry are three instances of the same architectural shape.
Single source of truth: probes are defined in one registry per surface; tests reference probes by key, not by inlined input strings. When a probe is updated (e.g., #1004’s PROFESSIONAL category vector list), every consuming test sees the new value automatically.
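A minimal sketch of the key-based lookup described above. `ProbeRegistry`, its methods, and the key `professional-boundary-01` are hypothetical names introduced only for illustration, and the `Probe` class here is trimmed to the fields the example needs.

```python
from dataclasses import dataclass
from typing import Dict, List, Literal, Optional


@dataclass(frozen=True)
class Probe:
    """Trimmed-down Probe shape (only the fields this example uses)."""
    input: str
    surface: str
    expected_action_class: Optional[str] = None
    severity: Literal["critical", "important", "informational"] = "informational"


class ProbeRegistry:
    """Hypothetical per-surface catalog; tests look probes up by key."""

    def __init__(self, surface: str) -> None:
        self.surface = surface
        self._probes: Dict[str, Probe] = {}

    def register(self, key: str, probe: Probe) -> None:
        # Reject probes filed under the wrong surface or a reused key.
        if probe.surface != self.surface:
            raise ValueError(f"probe {key!r} targets {probe.surface!r}, not {self.surface!r}")
        if key in self._probes:
            raise KeyError(f"duplicate probe key: {key!r}")
        self._probes[key] = probe

    def get(self, key: str) -> Probe:
        return self._probes[key]

    def keys(self) -> List[str]:
        return sorted(self._probes)


ETHICS_PROBES = ProbeRegistry("ethics")
ETHICS_PROBES.register(
    "professional-boundary-01",  # hypothetical key
    Probe(input="<synthetic input text>", surface="ethics",
          expected_action_class="DECLINE", severity="critical"),
)

# A test references the probe by key rather than inlining the input string.
probe = ETHICS_PROBES.get("professional-boundary-01")
```

Because consumers hold only the key, editing the registered probe changes what every consuming test sees without touching the tests themselves.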
Layer 2 — Harness Orchestration
Runs probes through the full request lifecycle and captures actual output + audit envelope. The orchestration layer is governed by Pattern-070’s four operational invariants:
- Transaction-boundary isolation: AsyncSessionFactory.session_scope() per call; one probe’s transaction state cannot leak to the next
- Cancellation hygiene: asyncio.current_task() at probe-start; cancellation propagates cleanly without leaving orphan resources
- Lifespan wiring: Phase class (startup, run-probes, shutdown); matches the orchestration shape used in the audit-write cleanup job (#1018)
The harness drives synthetic input through the production request lifecycle (the same code paths users hit) and captures the response shape, audit envelope contents, latency, and side effects on persisted state.
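The orchestration loop might look roughly like the sketch below. It is not the project's implementation: `session_scope` is a hypothetical stand-in for `AsyncSessionFactory.session_scope()`, `fake_handler` stands in for the production request lifecycle, and `ProbeResult` and `run_probes` are illustrative names. The `asyncio.current_task()` capture and the Phase class are omitted for brevity.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, List, Optional


@asynccontextmanager
async def session_scope() -> AsyncIterator[Dict[str, Any]]:
    """Hypothetical stand-in for AsyncSessionFactory.session_scope()."""
    session: Dict[str, Any] = {"open": True}
    try:
        yield session
    finally:
        session["open"] = False  # released on success, error, or cancellation


@dataclass
class ProbeResult:
    key: str
    response: Optional[str] = None
    error: Optional[str] = None


Handler = Callable[[str, Dict[str, Any]], Awaitable[str]]


async def run_probes(probes: Dict[str, str], handler: Handler) -> List[ProbeResult]:
    results: List[ProbeResult] = []
    for key, text in probes.items():
        try:
            async with session_scope() as session:  # fresh scope per probe
                response = await handler(text, session)
            results.append(ProbeResult(key, response=response))
        except asyncio.CancelledError:
            raise  # let cancellation propagate; the scope has already closed
        except Exception as exc:  # failure isolation: record it and keep going
            results.append(ProbeResult(key, error=f"{type(exc).__name__}: {exc}"))
    return results


async def fake_handler(text: str, session: Dict[str, Any]) -> str:
    """Hypothetical stand-in for the production request lifecycle."""
    if "boom" in text:
        raise RuntimeError("simulated failure")
    return text.upper()


results = asyncio.run(run_probes({"ok": "hello", "bad": "boom"}, fake_handler))
```

The failing probe is recorded as an error row and does not prevent later probes from running, while a cancellation is re-raised rather than swallowed.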
Layer 3 — Disagreement-Table Generation
Compares actual output vs. expected. Classifies divergences into four categories:
This is the same disagreement-table shape used in #1004’s run-1 and run-2 reports. The Layer 3 output is the load-bearing surface for triage: which divergences are bugs in the implementation vs. cases where the expected shape needs updating.
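As a rough illustration of the expected-vs-actual comparison, the sketch below emits one row per diverging field. It deliberately does not assign the four divergence categories; it only records which field diverged. `Divergence` and `diff_probe` are hypothetical names, and the audit field names are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Divergence:
    probe_key: str
    field: str
    expected: Any
    actual: Any


def diff_probe(
    probe_key: str,
    expected_action_class: Optional[str],
    expected_audit_shape: Optional[Dict[str, Any]],
    actual_action_class: str,
    actual_audit: Dict[str, Any],
) -> List[Divergence]:
    """Return one row per field where actual output departs from expectation."""
    rows: List[Divergence] = []
    # An expectation of None means the probe does not constrain that field.
    if expected_action_class is not None and expected_action_class != actual_action_class:
        rows.append(Divergence(probe_key, "action_class",
                               expected_action_class, actual_action_class))
    for name, want in (expected_audit_shape or {}).items():
        got = actual_audit.get(name, "<missing>")
        if got != want:
            rows.append(Divergence(probe_key, f"audit.{name}", want, got))
    return rows


rows = diff_probe(
    "probe-01",  # hypothetical probe key
    expected_action_class="DECLINE",
    expected_audit_shape={"surface": "ethics", "decision": "DECLINE"},
    actual_action_class="PASS",
    actual_audit={"surface": "ethics"},
)
for row in rows:
    print(f"{row.probe_key} | {row.field} | {row.expected} | {row.actual}")
```

Each row would then be triaged by a human or a later classification step: either the implementation is wrong, or the expected shape needs updating.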
Layer 4 — Reporting + CI Integration
Emits structured pass/fail signals to standard test infrastructure (pytest integration). Includes:
Phase 0 — Scoping ADR (this document)
Pre-trigger architectural design work. ~1 architectural session. Status: complete with this filing.
Phase 1 — Harness Scaffolding
Implementation of Layer 1 (probe registry primitives) + Layer 2 (orchestration). Starts with ethics + intent classification surfaces, folding in existing #1004 + #1070 work. Estimated effort: ~1 week Lead Dev. Gated on triggers below.
Phase 2 — Existing Probe-Set Integration
Folds #1004 probe set + #1070 multi-turn into the new harness; demonstrates equivalence with existing canonical-retest workflow. Estimated effort: ~3-5 days.
Phase 3 — Gap Surfaces
Adds probe coverage for surfaces not currently exercised end-to-end (workflow dispatch, slot extraction, response generation). Estimated effort: ~1-2 weeks. CXO + Lead Dev co-design at this phase.
Phase 4 — CI Gating
Converts from on-demand to gated (e.g., PR touching services/ must pass ethics + intent e2e probe sets). Estimated effort: ~1 week. Requires the “no-regression rule” disposition codified.
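One way to express such a gate is a path-prefix table consulted by CI. The sketch below assumes that shape; `GATING_RULES`, `required_probe_sets`, and the example file paths are hypothetical, and only the `services/` → ethics + intent pairing comes from the text above.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical gating table: path prefix -> probe sets a PR must pass.
GATING_RULES: Dict[str, Tuple[str, ...]] = {
    "services/": ("ethics", "intent_classification"),
}


def required_probe_sets(changed_paths: Iterable[str]) -> Set[str]:
    """Collect every probe set gated by at least one changed path."""
    required: Set[str] = set()
    for path in changed_paths:
        for prefix, probe_sets in GATING_RULES.items():
            if path.startswith(prefix):
                required.update(probe_sets)
    return required


# Hypothetical changed-file lists for two pull requests.
gated = required_probe_sets(["services/intent/classifier.py", "docs/readme.md"])
ungated = required_probe_sets(["docs/readme.md"])
```

A PR whose changed paths yield an empty set would skip the e2e gate; any other PR would have to pass each listed probe set before merge.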
Phase 5 — Cross-Host e2e
When BYOC MCP server packaging ships, extend Layer 2 to drive probes through the MCP surface in addition to the FastAPI surface. Estimated effort: ~1 week. Gated by BYOC MCP server ship (PDR-005 v0.4+ trigger).
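A possible shape for the Phase 5 extension is a small driver interface that Layer 2 calls without knowing which host it is talking to. Everything in this sketch is an assumption: `SurfaceDriver`, the two fake drivers, and `run_cross_host` are illustrative names, and neither fake talks to a real FastAPI or MCP endpoint.

```python
import asyncio
from typing import Dict, List, Protocol


class SurfaceDriver(Protocol):
    """Hypothetical interface: one way of sending a probe to one host surface."""
    name: str

    async def send(self, text: str) -> str: ...


class FakeFastAPIDriver:
    name = "fastapi"

    async def send(self, text: str) -> str:
        return f"response to {text!r}"  # a real driver would call the HTTP API


class FakeMCPDriver:
    name = "mcp"

    async def send(self, text: str) -> str:
        return f"response to {text!r}"  # a real driver would call the MCP server


async def run_cross_host(text: str, drivers: List[SurfaceDriver]) -> Dict[str, str]:
    """Send the same probe input through every surface and collect responses."""
    return {driver.name: await driver.send(text) for driver in drivers}


responses = asyncio.run(run_cross_host("hello", [FakeFastAPIDriver(), FakeMCPDriver()]))
hosts_agree = len(set(responses.values())) == 1
```

Keeping the driver behind an interface means the probe registry and disagreement-table layers would not need to change when a second host surface is added.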
Total Phase 1-5 scope estimate: ~4-6 weeks Lead Dev spread across the BYOC → 1.0 → beta arc.
Phase 0 (this ADR) is pre-trigger architectural work. Phase 1+ implementation should kick off when any of the following triggers fire:
If multiple triggers fire simultaneously, Phase 1 starts at the earliest one. If none fire, Phase 0 stands as a ratified architectural commitment with no implementation cost.
The architectural shape is validated by three existing instances, each scoped narrowly to one surface:
| Instance | Surface | Validates |
|---|---|---|
| #1004 probe-set harness | Ethics boundary enforcement | Layer 1 single-source-of-truth probe shape; Layer 3 disagreement-table classification |
| #1018 audit-write integration tests | Audit envelope integrity | Layer 2 transaction-boundary isolation via session_scope |
| #1070 multi-turn evaluation harness | Multi-turn conversation flow | Layer 2 lifespan wiring; multi-probe orchestration through the floor |
When Phase 2 lands, these three instances fold into the generalized harness as the first probe-sets in Layer 1, demonstrating equivalence with the existing canonical-retest workflow.
Pattern-070 (Cleanup-Job-with-Cancellation-Hygiene, Emerging) names four operational invariants that the e2e harness’s Layer 2 orchestration must satisfy. The harness itself becomes a fourth reference instance of Pattern-070 when Phase 1 ships:
- Transaction-boundary isolation: AsyncSessionFactory.session_scope per probe
- Cancellation hygiene: asyncio.current_task capture at probe-start
- Lifespan wiring: Phase class managing startup/shutdown
Phase 1 implementation must demonstrate all four invariants; CI gating at Phase 4 ensures they remain honored across future refactors.
The probe registry (Layer 1) is the same-shape architectural primitive as the task_type registry and the safe_surface() registry — typed catalogs of entries dispatched at consumption time. Pattern-072 (Proven as of #1094 close-out 2026-05-15) names this shape; the probe registry is a third reuse, reinforcing the pattern’s generality.
- docs/internal/architecture/current/adrs/adr-061-llm-touch-boundary-enforcement.md
- docs/internal/architecture/current/patterns/pattern-070-cleanup-job-with-cancellation-hygiene.md
- docs/internal/architecture/current/patterns/pattern-072-registries-that-grow-into-architectural-shapes.md
- dev/active/PDR-005-bring-your-own-chat-draft-v0.3-2026-05-15.md (current cycle; v0.4+ ratification triggers Phase 5)
- mailboxes/arch/sent/memo-arch-to-ceo-cc-lead-ppm-cxo-cio-host-exec-pa-e2e-suite-design-proposal-2026-05-15.md
- dev/2026/05/15/2026-05-15-0606-arch-opus-log.md § "12:19 PM — Decision walkthrough w/ PM, item 1 of 5"
- canonical-retest-run8.py, May 13, 2026 (commit e37608b7)
- mailboxes/arch/read/memo-cio-to-arch-cc-ceo-lead-ppm-cxo-host-exec-pa-e2e-suite-methodology-disposition-2026-05-15.md
Chief Architect, 2026-05-16 v0.1 (Phase 0 ADR; pre-trigger architectural commitment per CEO ratification of proposal direction May 15)