ADR-062: Project-Scope End-to-End Suite — Generalizing ADR-061 Simulation Harness

Status: Phase 0 ADR (scoping), v0.1 (drafted 2026-05-16); CEO ratification of proposal direction received 2026-05-15 via Architect decision walkthrough (Item 1); Phase 1+ gated on trigger signals (see §“Phase Sequencing”)
Date: 2026-05-16 (v0.1; Phase 0 scoping ADR per CEO ratification of e2e suite design proposal direction May 15)
Supersedes: None (generalizes ADR-061’s three-phase calibration shape to project-scope; ADR-061 remains the canonical reference for the ethics-path-specific instance)
Issues: #1004 (probe-set harness; narrow-scoped existing instance), #1018 (audit-write integration tests; narrow-scoped existing instance), #1070 (multi-turn evaluation harness; most-generalized existing instance, canonical-retest-run8.py)
Related: ADR-061 (LLM-touch boundary enforcement, three-phase calibration template), Pattern-070 (Cleanup-Job-with-Cancellation-Hygiene, Emerging; operational invariants apply to Layer 2 harness orchestration), Pattern-072 (Registries that Grow into Architectural Shapes, Proven; probe registry is same-shape instance), PDR-005 (BYOC distribution model; Phase 5 cross-host gated by MCP server ship)
Deciders: Chief Architect (drafted); CEO ratification of direction (2026-05-15); Lead Developer (Phase 1+ implementation refinement at trigger time); CXO (probe-set scoping refinement); CIO (methodology shelf consideration for operational invariants)


Context and Problem Statement

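The project has accumulated three narrow-scoped end-to-end harness instances over ~3 weeks of work, each scoped to a specific surface:

  1. #1004 probe-set harness: ethics boundary enforcement
  2. #1018 audit-write integration tests: audit envelope integrity
  3. #1070 multi-turn evaluation harness (canonical-retest-run8.py): multi-turn conversation flow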

Each instance solves a specific validation problem (boundary enforcement; audit-envelope integrity; multi-turn conversation flow). What does not yet exist is a cross-surface e2e harness — a single test surface that drives synthetic inputs through the entire request lifecycle (API entry → intent classification → workflow dispatch → LLM call → ethics detection → response generation → audit-envelope writing) and validates the integration of those steps, not the unit-level correctness of each.

Two structural signals from May 15 converge on the need for this surface:

  1. BYOC distribution model (PDR-005): the most ambitious BYOC version requires cross-host validation. Once the MCP server packaging ships, there is no harness today that can validate that the same input produces equivalent behavior across Claude Desktop / ChatGPT / Slack / etc. Per-host e2e becomes load-bearing once BYOC packaging lands, and it cannot be reconstructed from unit + integration tests at that point.
  2. Anthropic Dreams architectural review (May 15): the simulation-harness pattern from ADR-061 has a clean borrow-target in Anthropic’s pending → running → completed/failed/canceled job lifecycle. The pattern generalizes.

The project also has four named test-rigor trigger signals (per May 4 PM walkthrough) that would justify tightening test rigor: coverage trend drops, latent-bug regression ships, alpha→beta transition, feature ships test-free where missing tests are obviously needed. None of these have fired yet — Lead Dev’s discipline holds (79% test coverage on code-touching commits as of May 4 review). What has surfaced is that the design horizon for e2e suite architecture is longer than the implementation horizon — we’ll regret not having the scoping ADR in place the moment the trigger fires.

The specific gap

Unit + integration test coverage at the component level is solid. The gap is at the whole-flow level under realistic conditions:

Why pre-trigger design now

Designing under trigger pressure costs weeks of scramble. Designing pre-trigger costs ~1 architectural session (this ADR). The asymmetric cost favors landing the structural commitment now even if Phase 1+ implementation waits for an actual trigger.

The proposal direction was ratified by CEO May 15 via Architect decision walkthrough (Item 1). The four-layer shape and five-phase sequence were agreed; this ADR is the formal Phase 0 scoping artifact.


Decision

Principle

Cross-surface end-to-end validation via a generalized simulation-harness pattern is the right architectural primitive for an LLM-touch product approaching multi-host distribution. The four-layer shape generalizes ADR-061’s three-phase calibration template; the five-phase implementation sequence packages it for trigger-driven rollout without forcing implementation work before signals justify it.

The existing narrow-scoped harnesses (#1004 probe-set, #1018 audit-write, #1070 multi-turn) are reference instances; they fold into the generalized harness when Phase 2 lands rather than being replaced standalone.

Four-Layer Architecture

The e2e suite is structured as four operational layers, mirroring Pattern-070’s operational invariants for cleanup-job-with-cancellation-hygiene (transaction-boundary isolation; cancellation hygiene; lifespan wiring; failure isolation envelope):

Layer 1 — Synthetic Input Registry

A catalog of probe sets by surface, with each probe carrying a structured shape:

from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Probe:
    input: str  # the synthetic input
    surface: str  # which probed surface (ethics, intent_classification, slot_extraction, multi_turn, etc.)
    expected_intent: Optional[str]  # for intent-classification probes
    expected_action_class: Optional[str]  # PASS / DECLINE / DEGRADE / etc.
    expected_audit_shape: Optional[dict]  # structured fields the audit envelope must contain
    severity: Literal["critical", "important", "informational"]
    notes: str  # human context for the probe

This is the same-shape pattern as Pattern-072 (Registries that Grow into Architectural Shapes, Proven via #1094): a typed catalog of entries dispatched at consumption. The task_type registry, safe_surface() registry, and the prospective probe registry are three instances of the same architectural shape.

Single source of truth: probes are defined in one registry per surface; tests reference probes by key, not by inlined input strings. When a probe is updated (e.g., #1004’s PROFESSIONAL category vector list), every consuming test sees the new value automatically.
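A minimal sketch of the lookup-by-key idea, under stated assumptions: `ProbeRegistry`, its methods, and the key `professional/example-1` are hypothetical names introduced for illustration, and the `Probe` here is a trimmed copy of the shape above. It is not the #1004 implementation.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Probe:
    """Trimmed copy of the Probe shape, shortened for this sketch."""
    input: str
    surface: str
    expected_action_class: Optional[str]
    severity: Literal["critical", "important", "informational"]
    notes: str = ""


class ProbeRegistry:
    """Hypothetical per-surface catalog; tests look probes up by key."""

    def __init__(self, surface: str) -> None:
        self.surface = surface
        self._probes: dict[str, Probe] = {}

    def register(self, key: str, probe: Probe) -> None:
        # Reject probes filed under the wrong surface and duplicate keys,
        # so each key has exactly one definition.
        if probe.surface != self.surface:
            raise ValueError(f"{key!r} belongs to surface {probe.surface!r}")
        if key in self._probes:
            raise KeyError(f"duplicate probe key: {key!r}")
        self._probes[key] = probe

    def get(self, key: str) -> Probe:
        return self._probes[key]

    def keys(self) -> list[str]:
        return sorted(self._probes)


ethics = ProbeRegistry("ethics")
ethics.register(
    "professional/example-1",  # illustrative key, not a real #1004 probe
    Probe(
        input="<synthetic input text>",
        surface="ethics",
        expected_action_class="DECLINE",
        severity="critical",
    ),
)
```

A consuming test would call `ethics.get("professional/example-1")` instead of inlining the input string, so an edit to the registry entry reaches every test that references the key.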

Layer 2 — Harness Orchestration

Runs probes through the full request lifecycle and captures actual output + audit envelope. The orchestration layer is governed by Pattern-070’s four operational invariants:

  1. Transaction-boundary isolation: each probe uses AsyncSessionFactory.session_scope() per call; one probe’s transaction state cannot leak to the next
  2. Cancellation hygiene: capture asyncio.current_task() at probe-start; cancellation propagates cleanly without leaving orphan resources
  3. Lifespan wiring: harness lifecycle managed by a Phase class (startup, run-probes, shutdown) — matches the orchestration shape used in the audit-write cleanup job (#1018)
  4. Failure isolation envelope: broad-except no-propagate around each probe so one probe’s failure doesn’t tank the entire suite; failures are captured for Layer 3 reporting

The harness drives synthetic input through the production request lifecycle — same code paths users hit — and captures (response shape, audit envelope contents, latency, side effects on persisted state).
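The per-probe loop might look roughly like the sketch below. Everything in it is an assumption made for illustration: `session_scope` here is a local stand-in (the real `AsyncSessionFactory.session_scope()` signature is not shown in this ADR), `fake_handle` replaces the production request lifecycle, and the sketch only records the captured task's name because the ADR does not say how the task handle is used afterwards.

```python
import asyncio
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional

Handler = Callable[[str, dict], Awaitable[Any]]


@dataclass
class ProbeResult:
    key: str
    response: Any = None
    error: Optional[str] = None
    latency_s: float = 0.0
    task_name: Optional[str] = None


@asynccontextmanager
async def session_scope():
    """Stand-in for a per-call DB session scope (hypothetical)."""
    session: dict = {}
    try:
        yield session
    finally:
        session.clear()  # nothing survives into the next probe


async def run_probe(key: str, text: str, handle: Handler) -> ProbeResult:
    task = asyncio.current_task()  # captured at probe-start
    result = ProbeResult(key=key, task_name=task.get_name() if task else None)
    started = time.perf_counter()
    try:
        async with session_scope() as session:  # one scope per probe
            result.response = await handle(text, session)
    except asyncio.CancelledError:
        raise  # cancellation must propagate, never be recorded as a failure
    except Exception as exc:  # failure isolation: record, do not propagate
        result.error = f"{type(exc).__name__}: {exc}"
    result.latency_s = time.perf_counter() - started
    return result


async def run_all(probes: dict[str, str], handle: Handler) -> list[ProbeResult]:
    return [await run_probe(key, text, handle) for key, text in probes.items()]


async def fake_handle(text: str, session: dict) -> dict:
    """Hypothetical stand-in for the production request lifecycle."""
    if "boom" in text:
        raise RuntimeError("simulated failure")
    return {"action_class": "PASS", "echo": text}


results = asyncio.run(run_all({"ok": "hello", "bad": "boom"}, fake_handle))
```

The second probe raises, yet both results come back: the failure is stored on `results[1].error` for later reporting while the first probe's outcome is unaffected.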

Layer 3 — Disagreement-Table Generation

Compares actual output vs. expected. Classifies divergences into four categories:

This is the same disagreement-table shape used in #1004’s run-1 and run-2 reports. The Layer 3 output is the load-bearing surface for triage: which divergences are bugs in the implementation vs. cases where the expected shape needs updating.
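As a rough illustration of the comparison step only, the sketch below diffs expected fields against actual output and renders the rows as a plain-text table. `Divergence`, `diff_probe`, `render`, and the field names `intent` and `action_class` are hypothetical; the sketch deliberately stops at field-level divergences and does not assign a divergence category.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Divergence:
    probe_key: str
    field: str
    expected: Any
    actual: Any


def diff_probe(probe_key: str, expected: dict, actual: dict) -> list[Divergence]:
    """Return one row per expected field whose actual value differs.

    Fields whose expected value is None are treated as "not asserted".
    """
    rows = []
    for field, want in expected.items():
        if want is None:
            continue
        got = actual.get(field)  # a missing field shows up as None
        if got != want:
            rows.append(Divergence(probe_key, field, want, got))
    return rows


def render(rows: list[Divergence]) -> str:
    lines = ["probe | field | expected | actual"]
    lines += [
        f"{r.probe_key} | {r.field} | {r.expected!r} | {r.actual!r}" for r in rows
    ]
    return "\n".join(lines)


rows = diff_probe(
    "ethics/example-1",  # illustrative key
    expected={"intent": None, "action_class": "DECLINE"},
    actual={"intent": "chat", "action_class": "PASS"},
)
table = render(rows)
```

Here `intent` is skipped because the probe does not assert it, so the table contains a single row for `action_class`.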

Layer 4 — Reporting + CI Integration

Emits structured pass/fail signals to standard test infrastructure (pytest integration). Includes:

Phase Sequencing

Phase 0: Scoping ADR (this document)
Pre-trigger architectural design work. ~1 architectural session. Status: complete with this filing.

Phase 1: Harness Scaffolding
Implementation of Layer 1 (probe registry primitives) + Layer 2 (orchestration). Starts with ethics + intent classification surfaces, folding in existing #1004 + #1070 work. Estimated effort: ~1 week Lead Dev. Gated on triggers below.

Phase 2: Existing Probe-Set Integration
Folds #1004 probe set + #1070 multi-turn into the new harness; demonstrates equivalence with existing canonical-retest workflow. Estimated effort: ~3-5 days.

Phase 3: Gap Surfaces
Adds probe coverage for surfaces not currently exercised end-to-end (workflow dispatch, slot extraction, response generation). Estimated effort: ~1-2 weeks. CXO + Lead Dev co-design at this phase.

Phase 4: CI Gating
Converts from on-demand to gated (e.g., PR touching services/ must pass ethics + intent e2e probe sets). Estimated effort: ~1 week. Requires the “no-regression rule” disposition codified.
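One way to picture the Phase 4 gate is a mapping from changed top-level directories to the probe sets a PR must pass. The sketch below is an assumption-laden illustration: the `GATES` table, the function name, and the example file paths are hypothetical, and only the `services/` rule comes from the phase description.

```python
from pathlib import PurePosixPath

# Hypothetical gate table: top-level directory -> required probe sets.
GATES = {"services": ["ethics", "intent_classification"]}


def required_probe_sets(changed_paths: list[str]) -> list[str]:
    """Return the sorted probe sets that a change set must pass."""
    required: set[str] = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in GATES:
            required.update(GATES[parts[0]])
    return sorted(required)


needed = required_probe_sets(["services/intent/router.py", "docs/adr-062.md"])
```

A docs-only change yields an empty list, so the gate stays out of the way for PRs that do not touch gated code.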

Phase 5: Cross-Host e2e
When BYOC MCP server packaging ships, extend Layer 2 to drive probes through MCP surface in addition to FastAPI surface. Estimated effort: ~1 week. Gated by BYOC MCP server ship (PDR-005 v0.4+ trigger).
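A possible shape for the Phase 5 extension is a small transport interface that Layer 2 drives, with one implementation per surface. This is a sketch under assumptions: `ProbeTransport`, the two fake transports, and `COMPARED_FIELDS` are hypothetical, and the fakes return canned responses instead of calling FastAPI or an MCP server.

```python
from typing import Protocol


class ProbeTransport(Protocol):
    """Hypothetical interface: one implementation per host surface."""
    name: str

    def send(self, text: str) -> dict: ...


class FakeFastAPITransport:
    name = "fastapi"

    def send(self, text: str) -> dict:
        return {"action_class": "PASS", "host_meta": {"via": "http"}}


class FakeMCPTransport:
    name = "mcp"

    def send(self, text: str) -> dict:
        return {"action_class": "PASS", "host_meta": {"via": "mcp"}}


# Only fields expected to be host-independent are compared (assumption).
COMPARED_FIELDS = ("action_class",)


def cross_host_mismatches(text: str, transports: list[ProbeTransport]) -> dict:
    """Return {field: {transport_name: value}} for fields that differ."""
    seen = {t.name: t.send(text) for t in transports}
    mismatches = {}
    for field in COMPARED_FIELDS:
        values = {name: resp.get(field) for name, resp in seen.items()}
        if len({repr(v) for v in values.values()}) > 1:
            mismatches[field] = values
    return mismatches


mismatches = cross_host_mismatches(
    "hello", [FakeFastAPITransport(), FakeMCPTransport()]
)
```

Host-specific metadata such as `host_meta` is left out of the comparison on purpose, since "equivalent behavior" across hosts presumably allows transport details to differ.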

Total Phase 1-5 scope estimate: ~4-6 weeks Lead Dev spread across the BYOC → 1.0 → beta arc.

Trigger Signals for Phase 1+ Kickoff

Phase 0 (this ADR) is pre-trigger architectural work. Phase 1+ implementation should kick off when any of the following triggers fire:

If more than one trigger fires, Phase 1 starts at the earliest. If none fire, Phase 0 stands as a ratified architectural commitment without implementation cost.


Consequences

Positive

Negative / Tradeoffs

Non-Consequences (explicitly out of scope)


Validation

Existing Reference Instances (narrow-scoped)

The architectural shape is validated by three existing instances, each scoped narrowly to one surface:

| Instance | Surface | Validates |
|---|---|---|
| #1004 probe-set harness | Ethics boundary enforcement | Layer 1 single-source-of-truth probe shape; Layer 3 disagreement-table classification |
| #1018 audit-write integration tests | Audit envelope integrity | Layer 2 transaction-boundary isolation via session_scope |
| #1070 multi-turn evaluation harness | Multi-turn conversation flow | Layer 2 lifespan wiring; multi-probe orchestration through the floor |

When Phase 2 lands, these three instances fold into the generalized harness as the first probe-sets in Layer 1, demonstrating equivalence with the existing canonical-retest workflow.

Architectural Invariants Inherited from Pattern-070

Pattern-070 (Cleanup-Job-with-Cancellation-Hygiene, Emerging) names four operational invariants that the e2e harness’s Layer 2 orchestration must satisfy. The harness itself becomes a fourth reference instance of Pattern-070 when Phase 1 ships:

  1. Transaction-boundary isolation via AsyncSessionFactory.session_scope per probe
  2. Cancellation hygiene via asyncio.current_task capture at probe-start
  3. Lifespan wiring via Phase class managing startup/shutdown
  4. Failure isolation envelope via broad-except no-propagate

Phase 1 implementation must demonstrate all four invariants; CI gating at Phase 4 enforces they remain honored across future refactors.
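The lifespan and cancellation invariants can be exercised together in a small check like the one below. It is a sketch, not the Pattern-070 reference code: `HarnessPhase` is a hypothetical name, and the one-hour sleep stands in for a long-running probe that gets cancelled mid-flight.

```python
import asyncio


class HarnessPhase:
    """Hypothetical lifespan wrapper: startup, run probes, shutdown."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def __aenter__(self) -> "HarnessPhase":
        self.events.append("startup")
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.events.append("shutdown")  # runs on success, error, or cancel
        return False  # never swallow CancelledError or other exceptions

    async def run(self, probe_keys: list[str]) -> None:
        for key in probe_keys:
            self.events.append(f"probe:{key}")
            await asyncio.sleep(0)  # yield to the event loop between probes


async def demo() -> list[str]:
    phase = HarnessPhase()

    async def body() -> None:
        async with phase:
            await phase.run(["a"])
            await asyncio.sleep(3600)  # simulated long-running probe

    task = asyncio.create_task(body())
    await asyncio.sleep(0.01)  # let startup and the first probe run
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: cancellation propagated out of the phase
    return phase.events


events = asyncio.run(demo())
```

The recorded events end with `shutdown` even though the task was cancelled, which is the property a Phase 4 CI check would want to pin down.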

Pattern-072 Recognition Trigger

The probe registry (Layer 1) is the same-shape architectural primitive as the task_type registry and the safe_surface() registry — typed catalogs of entries dispatched at consumption time. Pattern-072 (Proven as of #1094 close-out 2026-05-15) names this shape; the probe registry is a third reuse, reinforcing the pattern’s generality.


Cross-references


Open Items (Phase 1+ work, not gated by this ADR)

— Chief Architect, 2026-05-16 v0.1 (Phase 0 ADR; pre-trigger architectural commitment per CEO ratification of proposal direction May 15)