ADR-061: LLM-Touch Boundary Enforcement — Two-Layer Detection with Floor as De-Facto Ethics Layer

Status: Ratified v1.0 (PM verbal ratification 2026-05-03); v1.1 amendment 2026-05-15 (output-side companion shipped per #1017; see §“Amendment 2026-05-15”)
Date: 2026-04-28 (v0.1) → 2026-04-30 (v1.0; Lead Dev fixes + CEO calibration reframe applied) → 2026-05-03 (verbally ratified) → 2026-05-04 (status block updated) → 2026-05-15 (v1.1; output-side companion amendment per #1017)
Supersedes: None (extends ADR-060 with a complementary boundary-enforcement architecture)
Issues: #1002 (the reframe), #1003 (the diagnostic), #1004 (the structural fix), #992 (ETHICS-ACTIVATE Phase A redirect_context), #1016 (LLM-touch boundary principle epic), #1017 (output-side companion; v1.1 amendment)
Related: ADR-060 (Floor-First Routing), Pattern-062 (Assembly Assumption), Pattern-064 (Extension Without Integration; companion), Pattern-071 (Audit Logs as Attack Surface; emerging, sibling of Pattern-064; introduced by #1017 hash-only audit invariant), Pattern-072 (Registries that Grow into Architectural Shapes; emerging; the task_type registry was third-meaningful-reuse trigger via #1017’s profile dispatch)
Deciders: Chief Architect (drafted); Lead Developer + CXO + CIO + PM (review pending)


Context and Problem Statement

The BoundaryEnforcer (#197 Phase 2A, refactored October 2025) was the project’s first ethics-enforcement infrastructure. It was wired at the universal entry point of IntentService._process_intent_internal (services/intent/intent_service.py:627), upstream of the intent classifier. The architecture appeared correct: ethics gate runs before any other dispatch, populates an audit envelope on violation, and routes the request through the conversational floor for voice-appropriate decline (“the enforcer detects, Piper speaks” — #992 Phase A design principle).

In practice, when the gate was activated for testing during #992 Phase E (Apr 25, 2026), the audit envelope was empty for naturally-phrased harassment input. A diagnostic comparison run (#1003, Apr 26) confirmed: ENABLE_ETHICS_ENFORCEMENT=true and =false produced indistinguishable responses on the same input. The flag was observably inert.

The Specific Failure

The BoundaryEnforcer’s harassment detector is a substring matcher against ten literal trigger words ("harass", "harassment", "bully", "bullying", "intimidate", "threaten", "inappropriate", "unwanted", "uncomfortable", "offensive"; see services/ethics/boundary_enforcer_refactored.py:121-132). Naturally-phrased harassment vectors do not contain any of these words. The detector returns confidence: 0.0 and violation_detected: False for input that any reader would recognize as harassment.
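The failure mode can be reproduced with a minimal sketch. The trigger list is the one quoted above; the function name literal_trigger_match, the confidence of 1.0 on a hit, and both sample inputs are hypothetical and are not taken from boundary_enforcer_refactored.py.

```python
# Hypothetical sketch of a substring detector; not the production code.
TRIGGER_WORDS = (
    "harass", "harassment", "bully", "bullying", "intimidate",
    "threaten", "inappropriate", "unwanted", "uncomfortable", "offensive",
)


def literal_trigger_match(message: str) -> dict:
    """Return a detection result based purely on substring presence."""
    lowered = message.lower()
    hits = [word for word in TRIGGER_WORDS if word in lowered]
    return {
        "violation_detected": bool(hits),
        # Assumption: a hit is reported with full confidence.
        "confidence": 1.0 if hits else 0.0,
        "matched": hits,
    }


# Quotes a trigger word, so the detector fires.
quoted = literal_trigger_match("Help me harass my coworker")

# Naturally phrased (illustrative input): no trigger word appears,
# so the detector stays silent even though the intent is the same.
natural = literal_trigger_match(
    "Write messages that will make my coworker dread coming to work"
)
```

The second call is the shape of input the #1003 diagnostic exercised: the detector runs and returns a clean result, which is why the flag looked inert.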

Three additional findings sharpened the picture:

Initial Misframing and Reframe

PPM and Lead Developer initially framed the failure as a routing problem: “pre-classifier keyword-match dispatch shadows ethics floor.” Architectural verification (Apr 26 #1002 scoping) showed the gate was already at the universal entry point; the pre-classifier ran inside classify_multiple, further downstream of the ethics gate, at services/intent/intent_service.py:631. The bypass was not routing-order; it was detection-effectiveness. The substring detector ran but did not detect.

The reframe was load-bearing: a routing fix would have produced no observable behavior change. A detector fix is the actual work.

Root Cause

The BoundaryEnforcer architecture treated literal-pattern matching as the entire detection surface. Anything outside the 10-30 trigger words across categories was invisible to the gate. The LLM — the thing that makes naturally-phrased input legible — was not consulted at the boundary.

This is a specific manifestation of Pattern-064 (Extension Without Integration) at the infrastructure layer: BoundaryEnforcer was extended to a universal entry point in #197 Phase 2D without ever being integrated with realistic input shape. The unit tests passed because they used inputs that quoted trigger words; the activation gate was wired; the audit envelope was structured. None of these elements caught the integration failure with naturally-phrased input.

It is also a specific manifestation of Pattern-045 (Green Tests, Red User) at the infrastructure layer: tests passed, gate activated, audit envelope populated correctly when triggered — and yet user-facing behavior was unchanged because the detector was too narrow to fire on the input shape it was purportedly detecting.


Decision

Principle

At LLM-touch boundaries, four elements must be present at every surface where LLM output is consumed or natural-language input is evaluated:

  1. Permissive input shape — boundary validation does not constrain input to enums or rigid patterns. Natural-language input is naturally fuzzy; rigid validation cannot encode open-domain semantics.
  2. Schema validation at consumption — at the point of consumption, parse and validate against a structured contract. On failure, structured fallback (not silent pass-through).
  3. Safe-fallback path — when validation fails, a known path runs. For natural-language input: the floor LLM’s general competence. For LLM output: redaction, canned response, or retry-with-stricter-prompt.
  4. Audit envelope — every LLM-touch event records (which surface, raw output size, validation result, action taken) for operator legibility.

The substring detector pre-#1004 was the inversion of this principle: rigid pattern matching at the boundary (1), no semantic schema (2), no architected safe-fallback (3; the floor was implicitly doing the work, but the architecture didn’t acknowledge it), and an audit envelope that was empty when the detector failed silently (4).
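As an illustration of elements 2 through 4, the sketch below validates raw LLM text at the point of consumption, falls back to a known path on failure, and records an audit event either way. It uses only the standard library rather than Pydantic; the names AuditEvent, validate_detection, consume, and the "defer_to_floor" action label are assumptions made for this example, not identifiers from the codebase.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AuditEvent:
    # Mirrors the four recorded items named in element 4.
    surface: str
    raw_output_size: int
    validation_result: str
    action_taken: str


def validate_detection(raw: str) -> dict:
    """Parse raw LLM text and check it against a two-field contract."""
    parsed = json.loads(raw)  # raises ValueError on non-JSON text
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    violation = parsed["violation_detected"]  # KeyError if missing
    confidence = parsed["confidence"]
    if not isinstance(violation, bool):
        raise ValueError("violation_detected must be a bool")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    return {"violation_detected": violation, "confidence": float(confidence)}


def consume(surface: str, raw: str) -> Tuple[Optional[dict], AuditEvent]:
    """Validate at consumption; on failure take the safe-fallback path."""
    try:
        value = validate_detection(raw)
        result, action = "valid", "consumed"
    except (ValueError, KeyError) as exc:
        # Structured fallback, not silent pass-through: the caller sees
        # None and the audit event names the failure class.
        value = None
        result, action = f"invalid: {type(exc).__name__}", "defer_to_floor"
    return value, AuditEvent(surface, len(raw), result, action)
```

The input itself is never constrained (element 1); all strictness lives in validate_detection, and a failed validation still leaves a non-empty audit record.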

Architecture: Two-Layer Detector + Floor Backstop

User message
    │
    ▼
┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Literal-trigger fast-path (current substring impl) │
│   - Cheap, deterministic, fast (~10ms when hit)             │
│   - Catches obvious cases that quote literal trigger words  │
│   - audit_data.detector = "literal-trigger"                 │
│   - audit_data.fast_path_hit = True                         │
└────────────┬────────────────────────────────────────────────┘
             │ no fast-path hit
             │ audit_data.fast_path_hit = False (recorded for
             │ calibration-window observability)
             ▼
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Semantic LLM detector (#1004 Fix B)                │
│   - Structured JSON output (Pydantic-validated)             │
│   - confidence-tiered: 0.85+ block / 0.6–0.85 ambiguous /   │
│     <0.6 pass                                               │
│   - LRU cache (1024 entries); audit_data.cache_hit records  │
│   - audit_data.detector = "semantic" (when violation found) │
│                          = "none" (when no violation found) │
└────────────┬────────────────────────────────────────────────┘
             │ violation_detected (either layer)
             ▼ (existing path, unchanged from #992 Phase C)
       Floor LLM (denial_mode=True, redirect_context hint)
       composes decline voice
             │
             │ no violation detected (either layer);
             │ audit_data.detector = "none"
             ▼
       Floor LLM (denial_mode=False, normal context)
       general competence handles the request — including
       implicit ethics work for input shapes the detectors miss
       (FLOOR_IMPLICIT_ETHICS Phase 2 telemetry case)

The floor LLM is the de-facto ethics layer for natural-language input that doesn’t trip either detector. This was already true pre-#1004 (the #1003 evidence showed the floor handling harassment vectors competently). The architecture now acknowledges this rather than treating the floor as an accidental backstop.
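The control flow in the diagram can be sketched as follows. This is a simplified model under stated assumptions: the semantic scorer is injected as a plain function returning a confidence, the trigger list is abbreviated, a literal-trigger hit is recorded with decision_tier "block", and only the block tier is treated as a violation (how the ambiguous tier maps to detector is not specified above). The names make_detector and tier_for are hypothetical.

```python
from functools import lru_cache
from typing import Callable

TRIGGERS = ("harass", "bully", "threaten")  # abbreviated trigger list


def tier_for(confidence: float) -> str:
    """Map a confidence to the tiers named in the diagram."""
    if confidence >= 0.85:
        return "block"
    if confidence >= 0.6:
        return "ambiguous"
    return "pass"


def make_detector(semantic_score: Callable[[str], float]) -> Callable[[str], dict]:
    # Layer 2 results are memoized in a 1024-entry LRU cache.
    cached_score = lru_cache(maxsize=1024)(semantic_score)

    def detect(message: str) -> dict:
        audit = {"fast_path_hit": False, "cache_hit": False,
                 "semantic_confidence": None}
        # Layer 1: literal-trigger fast path.
        if any(t in message.lower() for t in TRIGGERS):
            audit.update(detector="literal-trigger", fast_path_hit=True,
                         decision_tier="block")
            return audit
        # Layer 2: semantic scorer. Comparing hit counts before and after
        # is a single-threaded approximation of "came from the cache".
        hits_before = cached_score.cache_info().hits
        confidence = cached_score(message)
        tier = tier_for(confidence)
        audit.update(
            cache_hit=cached_score.cache_info().hits > hits_before,
            semantic_confidence=confidence,
            decision_tier=tier,
            detector="semantic" if tier == "block" else "none",
        )
        return audit

    return detect
```

A message that reaches the end with detector "none" is the case the floor handles with denial_mode=False.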

Audit Envelope (Fix C1)

BoundaryDecision.audit_data gains six new fields:

audit_data = {
    # ... existing fields ...
    "detector": "literal-trigger" | "semantic" | "none",  # which path fired
    "decision_tier": "block" | "ambiguous" | "pass",
    "semantic_confidence": float | None,  # semantic path only
    "semantic_reasoning": str | None,  # audit-only; never user-routed
    "fast_path_hit": bool,  # whether literal-trigger fast-path matched first
    "cache_hit": bool,  # whether semantic detector result came from LRU cache
    # ... rest of existing fields ...
}

The detector: "none" value is load-bearing: it distinguishes “neither layer fired; floor is handling implicitly” from “Layer 1 fired” and “Layer 2 fired.” This is what makes the FLOOR_IMPLICIT_ETHICS case (Telemetry Phase 2 sibling concern) operator-detectable.
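The field listing above is schematic (the | alternatives are notation, not executable Python). One way to express the same six fields as a checkable type is a TypedDict with Literal values, sketched here. The class name DetectionAuditFields and the helper floor_handled_implicitly are hypothetical; the existing audit_data fields are omitted.

```python
from typing import Literal, Optional, TypedDict


class DetectionAuditFields(TypedDict):
    """The six fields added by Fix C1 (existing fields omitted)."""
    detector: Literal["literal-trigger", "semantic", "none"]
    decision_tier: Literal["block", "ambiguous", "pass"]
    semantic_confidence: Optional[float]  # semantic path only
    semantic_reasoning: Optional[str]     # audit-only; never user-routed
    fast_path_hit: bool
    cache_hit: bool


def floor_handled_implicitly(audit: DetectionAuditFields) -> bool:
    """True when neither layer fired and the floor is doing the work."""
    return audit["detector"] == "none"


# Illustrative record for a message that neither layer flagged.
example: DetectionAuditFields = {
    "detector": "none",
    "decision_tier": "pass",
    "semantic_confidence": 0.12,
    "semantic_reasoning": "benign request",
    "fast_path_hit": False,
    "cache_hit": False,
}
```

A static type checker would reject a detector value outside the three literals, which keeps the "none" case from being silently replaced by a missing or empty field.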

fast_path_hit and cache_hit are operator-distinguishable signals worth documenting separately from detector:

Three operator-distinguishable cases:

  1. BoundaryEnforcer fired (literal-trigger or semantic): detector field is "literal-trigger" or "semantic"; audit envelope present
  2. Floor handled with denial_mode=True: semantic detector caught it, floor performed the decline (case 1 with denial_mode=True downstream)
  3. Floor handled with denial_mode=False but ethics-shaped behavior: detector == "none"; implicit ethics work; the FLOOR_IMPLICIT_ETHICS counter (Telemetry Phase 2) records it via the structural heuristic category=="unknown" AND floor_hit==true
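The three cases can be told apart mechanically from the audit and telemetry fields. The sketch below is one possible reading; the function name classify_operator_case, the case labels other than FLOOR_IMPLICIT_ETHICS, and the catch-all "ordinary_request" label are assumptions for illustration, not the Telemetry Phase 2 implementation.

```python
from collections import Counter


def classify_operator_case(detector: str, denial_mode: bool,
                           category: str, floor_hit: bool,
                           counters: Counter) -> str:
    """Label one request and bump the matching counter."""
    if detector in ("literal-trigger", "semantic"):
        # Cases 1 and 2: the enforcer fired; case 2 is case 1 seen
        # downstream, once the floor has composed the decline.
        case = "enforcer_fired_floor_declined" if denial_mode else "enforcer_fired"
    elif detector == "none" and category == "unknown" and floor_hit:
        # Case 3: structural heuristic for implicit ethics work.
        case = "FLOOR_IMPLICIT_ETHICS"
    else:
        case = "ordinary_request"
    counters[case] += 1
    return case
```

The heuristic in case 3 is structural, so it may over-count requests that were simply unclassifiable; it flags candidates for operator review rather than confirming an ethics event.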

The redirect_context Handoff (#992 Phase A)

The redirect_context field on BoundaryDecision (declared at boundary_enforcer_refactored.py:81-88; computed via _derive_redirect_context() and _compute_redirect_context() helpers; consumed at the floor handoff site) is the canonical reference instance for structured layer-to-layer handoff in this architecture:

This is the model for any future LLM-touch boundary handoff: enforcement and voice are separate concerns with a typed contract between them.

What This ADR Does Not Establish


Consequences

Positive

Negative

Neutral / Open


Implementation Notes

The implementation shipped in #1004 (commit b26d6c85, Apr 27, 2026):

The activation flag (ENABLE_ETHICS_ENFORCEMENT=true in docker-compose.yml) is held pending PM/PA decision per Lead Developer’s recommendation (Apr 27 memo 2322907a). This ADR’s ratification is the documented-coverage prerequisite the team has chosen to land before the flip.


Amendment 2026-05-15 — Output-side companion (#1017 shipped)

ADR-061 v1.0 named “#1017 (post-generation content filter for LLM outputs)” as a sibling concern in §“What This ADR Does Not Establish”, explicitly out-of-scope for the input-side BoundaryEnforcer architecture. #1017 shipped 2026-05-15 as a structural companion to this ADR. This amendment documents the companion architecture without revising the original v1.0 input-side decision.

The four-element principle applied to OUTPUTS

ADR-061 v1.0’s four-element principle was framed for input boundaries. The same four elements apply at output boundaries with one direction-swap:

  1. Permissive output shape — the LLM emits free-form text; we cannot constrain the output at generation time without crippling the model’s usefulness
  2. Schema validation at consumption — at the moment the output is about to reach a user surface, parse and validate against per-task-type expectations. On detector match (PII regex / boundary category), structured fallback (redact-in-place / canned substitute), not silent pass-through
  3. Audit envelope at the boundary — every filter decision writes a typed record (OutputFilterDecision) capturing the action class, severity, matched rules, hashes (never raw content) — see hash-only invariant below
  4. Structured handoff to caller: FilterResult.filtered_content is the minimal caller-facing surface; the decision record stays in audit and never leaks raw PII back through the return path

OutputFilter architecture

services/ethics/output_filter.py lands a decorator chokepoint at LLMClient.complete(). Every LLM call in production flows through it when an OutputFilter is wired (per OutputFilterWiringPhase in web/startup.py). Failure to wire = unfiltered LLM (graceful degradation by design — defense-in-depth layer must not block startup).
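A minimal sketch of the chokepoint idea, assuming a completion function with the simplified signature (task_type, prompt) -> text. The name wire_output_filter and both type aliases are hypothetical; the real LLMClient.complete() signature is not shown in this ADR.

```python
from typing import Callable, Optional

Complete = Callable[[str, str], str]  # (task_type, prompt) -> text
FilterFn = Callable[[str, str], str]  # (task_type, raw_text) -> filtered text


def wire_output_filter(complete: Complete,
                       output_filter: Optional[FilterFn]) -> Complete:
    """Wrap a completion function so every result passes the filter."""
    if output_filter is None:
        # Not wired: degrade gracefully to the unfiltered client
        # instead of blocking startup.
        return complete

    def filtered_complete(task_type: str, prompt: str) -> str:
        raw = complete(task_type, prompt)
        return output_filter(task_type, raw)

    return filtered_complete
```

Because callers only ever hold the wrapped function, there is a single place where filtering happens; the cost of the graceful-degradation choice is that a missed wiring step is silent unless it is monitored separately.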

Profile dispatch via task_type: the existing task_type parameter (already required at every LLMClient.complete() call site) drives filter-profile selection. Ten production task types route to the user_visible profile (full Tier 1 + Tier 2 coverage); one task type (intent_classification) routes to internal (log-only; never echoed verbatim to users). Unknown task types default to user_visible (fail-closed).
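The fail-closed default can be sketched as a dictionary lookup whose fallback is the stricter profile. Only intent_classification and the two profile names come from the text above; the other task-type key ("chat_response") and the function name profile_for are placeholders.

```python
USER_VISIBLE = "user_visible"
INTERNAL = "internal"

# Hypothetical registry: "chat_response" stands in for the ten
# production task types, which are not enumerated here.
PROFILE_BY_TASK_TYPE = {
    "intent_classification": INTERNAL,
    "chat_response": USER_VISIBLE,
}


def profile_for(task_type: str) -> str:
    """Select a filter profile; unknown task types get full coverage."""
    return PROFILE_BY_TASK_TYPE.get(task_type, USER_VISIBLE)
```

A newly added task type that nobody registered is therefore filtered as if it were user-visible, which errs toward over-filtering rather than leaking.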

Three-tier detection:

Severity → action matrix:

Detection                                | Severity | Action
-----------------------------------------|----------|--------------------------------
PII regex (email/phone/SSN/credit-card)  | medium   | Redact in place → [REDACTED]
Secret formats (API keys, bearer tokens) | high     | Redact + operator-flag
URL with embedded credentials            | high     | Redact entire URL
BoundaryEnforcer category violation      | critical | Drop output + canned substitute
No match                                 |          | Passthrough
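To make the "redact in place" row concrete, here is a small sketch covering two of the PII classes. The regular expressions are deliberately simplified illustrations, not the patterns used in services/ethics/output_filter.py, and the function name redact_pii is hypothetical.

```python
import re
from typing import List, Tuple

# Simplified illustrative patterns; real-world PII matching needs more care.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str) -> Tuple[str, List[str], int]:
    """Replace each match with [REDACTED]; report rules hit and count."""
    matched_rules: List[str] = []
    redactions = 0
    for name, pattern in PII_PATTERNS.items():
        text, n = pattern.subn("[REDACTED]", text)
        if n:
            matched_rules.append(name)
            redactions += n
    return text, matched_rules, redactions


filtered, rules, count = redact_pii(
    "Reach me at jane@example.com, SSN 123-45-6789."
)
# filtered == "Reach me at [REDACTED], SSN [REDACTED]."
```

Text with no match is returned unchanged with an empty rule list, which corresponds to the passthrough row.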

Regenerate-on-violation: when a boundary category fires, the decorator retries the LLM call once before surfacing the canned response (compresses user-visible failure rate; most LLM-output filter trips are non-deterministic). attempt_number + prior_attempt_decision_id propagate to the audit envelope for forensic chain visibility.
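The retry-once behavior and the forensic chain can be sketched as a bounded loop. The generate and violates callables, the decision dictionary layout, the function name complete_with_regenerate, and the placeholder canned string are assumptions for illustration; only attempt_number and prior_attempt_decision_id are names from the text above.

```python
import uuid
from typing import Callable, List, Tuple

CANNED_RESPONSE = "<canned response>"  # placeholder for the ratified text


def complete_with_regenerate(
    generate: Callable[[], str],
    violates: Callable[[str], bool],
    max_attempts: int = 2,  # the first call plus one retry
) -> Tuple[str, List[dict]]:
    """Call the LLM, retry once on a boundary violation, then fall back."""
    decisions: List[dict] = []
    prior_id = None
    for attempt in range(1, max_attempts + 1):
        output = generate()
        decision = {
            "decision_id": uuid.uuid4().hex,
            "attempt_number": attempt,
            "prior_attempt_decision_id": prior_id,
            "violation": violates(output),
        }
        decisions.append(decision)
        if not decision["violation"]:
            return output, decisions
        # Link the next attempt back to this one for the audit chain.
        prior_id = decision["decision_id"]
    return CANNED_RESPONSE, decisions
```

Each decision after the first points at its predecessor, so an operator can reconstruct the sequence of attempts from the audit records alone.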

Canned response (CXO-ratified, output-side ownership phrasing): “That came out wrong — let me try a different approach.” Cross-checked against CT v2.3 §Tone-0 cadence analysis; deliberately avoids the input-side BoundaryEnforcer’s refusal framing because the output-side correction is a different psychological situation (Piper correcting her own output, not refusing the user’s ask).

Hash-only audit invariant (Pattern-064-adjacent / Pattern-071 candidate)

The OutputFilterDecision dataclass stores hashes of content, never raw content. Storing the content an audit log is intended to govern as raw text turns the audit log into the leak amplification surface — same skeletal shape as Pattern-064 (“alive scaffolding”), different failure mode (compliance-shaped infrastructure that actively makes the underlying problem worse). CIO filed as Emerging Pattern-071 (“Audit Logs as Attack Surface”) 2026-05-15.

The invariant is enforced at two layers:

  1. Schema layer: OutputFilterDecision has original_content_hash and filtered_content_hash (sha256 hex) but no field for raw content
  2. Write-time guard: log_output_filter_decision() truncates any audit_metadata string >256 chars and flags invariant_violations[] so the audit-log layer catches future drift if a caller mutates audit_metadata with raw content

Forensic verification works via hash comparison: an operator with two events can confirm same-content-or-not without seeing either.
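Both layers of the invariant, and the hash-comparison check, can be sketched with the standard library. DecisionSketch and guard_before_write are hypothetical stand-ins for OutputFilterDecision and log_output_filter_decision(); the field names original_content_hash, filtered_content_hash, audit_metadata, and invariant_violations follow the text above, and everything else is an assumption.

```python
import hashlib
from dataclasses import asdict, dataclass, field

MAX_METADATA_CHARS = 256


def sha256_hex(content: str) -> str:
    """Hash content so the audit record never holds the raw text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class DecisionSketch:
    # Schema layer: there is deliberately no field for raw content.
    action: str
    original_content_hash: str
    filtered_content_hash: str
    audit_metadata: dict = field(default_factory=dict)
    invariant_violations: list = field(default_factory=list)


def guard_before_write(decision: DecisionSketch) -> dict:
    """Write-time guard: truncate oversized metadata strings and flag them."""
    for key, value in list(decision.audit_metadata.items()):
        if isinstance(value, str) and len(value) > MAX_METADATA_CHARS:
            decision.audit_metadata[key] = value[:MAX_METADATA_CHARS]
            decision.invariant_violations.append(key)
    return asdict(decision)


def same_content(record_a: dict, record_b: dict) -> bool:
    """Forensic check: compare two events without seeing either text."""
    return record_a["original_content_hash"] == record_b["original_content_hash"]
```

Truncation limits how much raw text a misbehaving caller can push into the log, and the flag makes the drift visible; it does not by itself prove that a short metadata string is free of sensitive content.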

Phase 3 verification (probe set)

tests/ethics/test_output_filter_probe_set_1017.py lands 25 parametrized tests:

Each probe asserts: action class, severity tier, matched rules, redactions count where applicable, hash-only invariant (raw PII/secret never appears in decision.to_dict()).

CI gate: tests/ is covered by .github/workflows/test.yml:136 (pytest tests/ --tb=short -v -m "not llm"), which picks up the probe-set file automatically. Regression = CI break.

Phase 3 follow-ups deferred: regenerate-cycle probes (attempt_number=2), multi-violation probes (PII + boundary in same output), voice-register failure mode tier (per CXO Q7 sequencing).

Where the input-side and output-side architectures meet

ADR-061 v1.0 acknowledged the floor as the de-facto ethics layer for naturally-phrased inputs. The v1.1 amendment closes the loop on the output side: the BoundaryEnforcer (the same component v1.0 hardened) now also evaluates outputs, via the OutputFilter’s Tier 2 wrapper. The principle stays: enforcement and voice are separate concerns with a typed contract between them. The contract for outputs is OutputFilterDecision; the voice handoff is the CXO-ratified canned response (or the redacted-but-passing content).

The combined surface coverage:

Together, both surfaces satisfy the four-element principle at the two boundaries where LLM content crosses a trust gate (user input → system; system output → user). The remaining LLM-touch surfaces inventoried in #1016 Phase 1 (~23 total at filing) are expected to align gradually under the same four-element discipline as Phase 4 work proceeds.

What v1.1 does not establish

Implementation evidence



Review and Ratification

v0.1 drafted by Chief Architect 2026-04-28; distributed to Lead Dev / CXO / CIO for review.

v1.0 updated 2026-04-30 with Lead Dev review feedback applied + CEO Apr 30 calibration reframe:

CXO and CIO reviews remain optional; their input on voice/experience framing and methodology framework respectively is welcome but not blocking ratification, given Lead Dev’s substantive review is the implementation-accuracy gate. Either can submit feedback for a v1.x revision.

PM ratification pending. Once ratified, this ADR is the documented-coverage prerequisite for the Phase F flag-flip per Lead Developer’s Apr 27 recommendation.