Date: July 20, 2025
Status: Accepted
Deciders: Claude Code (Architecture Assistant), Development Team
PM-038 required a production-grade staging environment to validate MCP integration performance improvements and deployment procedures. The existing development setup lacked the infrastructure complexity and monitoring capabilities needed to validate production readiness.
Implement a production-grade staging environment using Docker Compose, with comprehensive monitoring, automated deployment, and rollback capabilities.
Multi-Service Docker Compose Architecture
# 7 core services + 2-service monitoring stack
services:
- postgres-staging # Database persistence
- redis-staging # Cache and session storage
- chromadb-staging # Vector database
- temporal-staging # Workflow orchestration
- api-staging # Main application API
- web-staging # User interface
- nginx-staging # Load balancer/reverse proxy
- prometheus-staging # Metrics collection
- grafana-staging # Monitoring dashboards
Network Architecture
piper-staging-network (172.20.0.0/16)
Data Persistence Strategy
1. Database Services
postgres-staging:
image: postgres:15
environment:
POSTGRES_INITDB_ARGS: "--auth-host=scram-sha-256"
ports: ["5434:5432"] # Isolated from development (5433)
volumes:
- piper_postgres_staging_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
deploy:
resources:
limits: {memory: 1G, cpus: '1.0'}
reservations: {memory: 512M, cpus: '0.5'}
2. Application Services
api-staging:
build:
context: .
dockerfile: Dockerfile.staging
environment:
# MCP Production Configuration
- ENABLE_MCP_FILE_SEARCH=true
- USE_MCP_POOL=true
- MCP_POOL_MAX_CONNECTIONS=10
- MCP_CIRCUIT_BREAKER_ENABLED=true
depends_on:
postgres-staging: {condition: service_healthy}
redis-staging: {condition: service_healthy}
chromadb-staging: {condition: service_healthy}
3. Monitoring Stack
prometheus-staging:
image: prom/prometheus:latest
command:
- '--storage.tsdb.retention.time=30d'
- '--web.enable-lifecycle'
volumes:
- ./config/staging/prometheus.yml:/etc/prometheus/prometheus.yml:ro
grafana-staging:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=staging_grafana_admin_2025
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- ./config/staging/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro
1. Staging-Specific Environment Variables
# Application Environment
APP_ENV=staging
APP_DEBUG=false
LOG_LEVEL=INFO
# MCP Integration (Production-ready)
ENABLE_MCP_FILE_SEARCH=true
USE_MCP_POOL=true
MCP_POOL_MAX_CONNECTIONS=10
MCP_CIRCUIT_BREAKER_ENABLED=true
MCP_CONTENT_SCORING_ENABLED=true
# Infrastructure
POSTGRES_HOST=localhost
POSTGRES_PORT=5434
POSTGRES_DB=piper_morgan_staging
REDIS_HOST=localhost
REDIS_PORT=6380
# Security
POSTGRES_PASSWORD=staging_secure_password_2025
REDIS_PASSWORD=staging_redis_secure_2025
SECRET_KEY=staging_secret_key_2025
JWT_SECRET_KEY=staging_jwt_secret_2025
2. Feature Flag Configuration
# Production Features Enabled for Staging
ENABLE_CLARIFYING_QUESTIONS=true
ENABLE_MULTI_REPO=true
ENABLE_RATE_LIMITING=true
# Development Features Disabled
ENABLE_LEARNING=false # Keep disabled for staging
ENABLE_DEBUG_ENDPOINTS=false
1. Automated Deployment Script
#!/bin/bash
# scripts/deploy_staging.sh
# Phase 1: Infrastructure Services (30s)
docker-compose -f docker-compose.staging.yml up -d \
postgres-staging redis-staging chromadb-staging
sleep 30
# Phase 2: Application Services (45s)
docker-compose -f docker-compose.staging.yml up -d \
api-staging web-staging
sleep 45
# Phase 3: Monitoring and Proxy (15s)
docker-compose -f docker-compose.staging.yml up -d \
nginx-staging prometheus-staging grafana-staging
sleep 15
# Phase 4: Verification
./scripts/verify_staging_deployment.sh
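The fixed `sleep` delays between phases could be replaced by polling each container's Docker health status, so a phase continues as soon as its services are ready and fails fast when they are not. The following is a minimal sketch, not the project's actual script: `wait_for_healthy` is a hypothetical helper, and it assumes every container defines a healthcheck and that container names equal the service names (which holds only if `container_name` is set in the compose file).

```shell
# Hypothetical helper: poll a container's health status instead of sleeping.
# Usage: wait_for_healthy <container> [max_attempts] [delay_seconds]
wait_for_healthy() {
  local container="$1"
  local max_attempts="${2:-30}"
  local delay="${3:-2}"
  local attempt=1
  local status
  while [ "$attempt" -le "$max_attempts" ]; do
    # .State.Health.Status reports "starting", "healthy", or "unhealthy"
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy (attempt $attempt)"
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "$container did not become healthy after $max_attempts attempts" >&2
  return 1
}
```

Under these assumptions, Phase 1 could call `wait_for_healthy postgres-staging` (and likewise for Redis and ChromaDB) in place of `sleep 30`.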
2. Comprehensive Verification System
# 14 verification categories
test_categories=(
"basic_connectivity"
"health_endpoints"
"mcp_integration"
"api_functionality"
"performance"
"database_connectivity"
"redis_connectivity"
"chromadb_connectivity"
"container_health"
"monitoring_stack"
"security_headers"
"environment_variables"
"log_collection"
"data_persistence"
)
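One way such a category list might drive the verification run is sketched below. It assumes each category has a matching `check_<category>` shell function; both that naming convention and the `run_verification` helper are hypothetical and not taken from `verify_staging_deployment.sh`.

```shell
# Hypothetical runner: call check_<category> for each argument and tally failures.
run_verification() {
  local failed=0
  local category
  for category in "$@"; do
    if "check_${category}"; then
      echo "PASS ${category}"
    else
      echo "FAIL ${category}"
      failed=$((failed + 1))
    fi
  done
  echo "${failed} of $# categories failed"
  # Exit status is non-zero when any category failed
  [ "$failed" -eq 0 ]
}
```

In a bash script the array above could be passed as `run_verification "${test_categories[@]}"`. A category without a defined check function is reported as a failure.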
1. Kubernetes-Style Health Probes
# All services include comprehensive health checks
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
2. Service-Specific Health Endpoints
/health - Basic health status
/health/liveness - Kubernetes liveness probe
/health/readiness - Kubernetes readiness probe
/health/comprehensive - Full component health
/health/mcp - MCP-specific health (PM-038)
1. Container Resource Limits
# Production-grade resource allocation
api-staging:
deploy:
resources:
limits: {memory: 2G, cpus: '2.0'}
reservations: {memory: 1G, cpus: '1.0'}
postgres-staging:
deploy:
resources:
limits: {memory: 1G, cpus: '1.0'}
reservations: {memory: 512M, cpus: '0.5'}
2. Logging Configuration
# All services use structured logging
logging:
driver: "json-file"
options:
max-size: "50m"
max-file: "3"
1. Emergency Rollback (30 seconds)
# Complete environment shutdown
docker-compose -f docker-compose.staging.yml down
# Restore previous version
docker-compose -f docker-compose.staging.yml pull
docker-compose -f docker-compose.staging.yml up -d
2. Safe Rollback with Data Preservation
# Automated rollback script with data backup
./scripts/rollback_staging.sh --preserve-data --version=previous
3. Rollback Decision Matrix

| Issue Type | Severity | Time Limit | Action |
|------------|----------|------------|--------|
| Health check failures | High | 5 minutes | Application rollback |
| MCP performance issues | Medium | 10 minutes | Feature disable |
| Database corruption | Critical | 2 minutes | Full rollback |
| Security breach | Critical | 30 seconds | Infrastructure shutdown |
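The decision matrix could be encoded as a lookup so that alerting or on-call tooling resolves an issue type to its action consistently. This is a sketch only: the function name `rollback_action` and the snake_case issue keys are hypothetical identifiers, while the actions are the ones listed in the matrix.

```shell
# Hypothetical lookup: map an issue type to the action from the decision matrix.
rollback_action() {
  case "$1" in
    health_check_failures)  echo "Application rollback" ;;
    mcp_performance_issues) echo "Feature disable" ;;
    database_corruption)    echo "Full rollback" ;;
    security_breach)        echo "Infrastructure shutdown" ;;
    *)
      echo "unknown issue type: $1" >&2
      return 1
      ;;
  esac
}
```

For example, `rollback_action database_corruption` prints `Full rollback`, and an unrecognized key returns a non-zero status rather than silently choosing an action.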
1. Automated Backup Before Rollback
# Database backup (-T disables pseudo-TTY allocation so the redirected dump stays clean)
docker-compose -f docker-compose.staging.yml exec -T postgres-staging \
pg_dump -U piper piper_morgan_staging > \
backups/pre_rollback_$(date +%Y%m%d_%H%M%S).sql
# Configuration backup
cp .env.staging backups/env_staging_$(date +%Y%m%d).backup
2. Volume Snapshot Strategy
# Docker volume backup
docker run --rm -v piper_postgres_staging_data:/data \
-v "$(pwd)/backups":/backup ubuntu \
tar czf /backup/postgres_data_$(date +%Y%m%d).tar.gz /data
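A backup is only useful for rollback if it can be read back, so it may be worth checking each archive right after it is written. The sketch below assumes gzip-compressed tar archives like the one produced by the volume backup; `verify_backup_archive` is a hypothetical helper, not part of the existing scripts.

```shell
# Hypothetical helper: succeed only if the archive is non-empty and tar can list it.
verify_backup_archive() {
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty archive: $archive" >&2
    return 1
  fi
  # Listing the contents forces tar to decompress and parse the whole archive
  tar -tzf "$archive" >/dev/null 2>&1
}
```

It could be called as `verify_backup_archive "backups/postgres_data_$(date +%Y%m%d).tar.gz"` before any rollback step that removes data.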
1. Application Metrics
# prometheus.yml configuration
scrape_configs:
- job_name: 'piper-api-staging'
static_configs:
- targets: ['api-staging:8001']
metrics_path: '/health/metrics'
scrape_interval: 15s
2. Infrastructure Metrics
1. System Overview Dashboard
2. MCP Performance Dashboard
3. Application Performance Dashboard
1. Service Isolation
networks:
piper-staging-network:
driver: bridge
ipam:
config:
- subnet: 172.20.0.0/16
2. External Access Control
1. Service Authentication
2. Secret Management
# Environment-specific secrets
POSTGRES_PASSWORD=staging_secure_password_2025
REDIS_PASSWORD=staging_redis_secure_2025
GRAFANA_ADMIN_PASSWORD=staging_grafana_admin_2025
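Rather than committing fixed secret values, the environment file could be populated with randomly generated ones at provisioning time. The following is a sketch under that assumption; `generate_secret` is a hypothetical helper, and it relies on `/dev/urandom` being available on the host.

```shell
# Hypothetical helper: print a random hex string (default 32 bytes = 64 hex characters).
generate_secret() {
  local bytes="${1:-32}"
  # od renders the random bytes as hex; tr strips the spaces and newlines od adds
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}
```

A provisioning step might then write `POSTGRES_PASSWORD=$(generate_secret)` into a `.env.staging` file that is kept out of version control.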
Port Allocation Strategy
# Development Environment
API: 8001, Database: 5433, Redis: 6379, ChromaDB: 8000
# Staging Environment
API: 8001, Database: 5434, Redis: 6380, ChromaDB: 8001
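A port allocation like this can be checked mechanically for collisions within one environment. The sketch below uses a hypothetical helper, `find_duplicate_ports`, that takes `name:port` pairs and prints any port assigned more than once; it assumes all listed values are host ports on the same machine.

```shell
# Hypothetical helper: print ports that appear more than once in name:port pairs.
find_duplicate_ports() {
  local pair
  for pair in "$@"; do
    # Strip everything up to the last colon, leaving only the port
    echo "${pair##*:}"
  done | sort -n | uniq -d
}
```

Run against the staging values as listed, `find_duplicate_ports api:8001 database:5434 redis:6380 chromadb:8001` prints `8001`, since the API and ChromaDB are both shown on that port; the development values produce no output.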
1. Automated Testing Pipeline
2. Environment Promotion Strategy
# Development → Staging → Production
git tag staging-YYYYMMDD
./scripts/deploy_staging.sh
./scripts/verify_staging_deployment.sh
# Manual approval for production
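The promotion steps could be chained so that a failed deployment or verification stops the sequence before anyone is asked to approve production. This is a sketch only: `staging_tag` and `promote_to_staging` are hypothetical helpers, and the script paths are the ones named above.

```shell
# Hypothetical helper: build today's tag in the staging-YYYYMMDD format.
staging_tag() {
  printf 'staging-%s\n' "$(date +%Y%m%d)"
}

# Hypothetical helper: tag, deploy, and verify; each step runs only if the previous one succeeded.
promote_to_staging() {
  local tag
  tag="$(staging_tag)"
  git tag "$tag" &&
    ./scripts/deploy_staging.sh &&
    ./scripts/verify_staging_deployment.sh
}
```

Because `git tag` fails when the tag already exists, a second promotion on the same day would stop at the first step; a real script might append a sequence number or time to the tag.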
Implementation Date: July 20, 2025
Staging Environment URL: http://localhost:8001 (API), http://localhost:8081 (Web)
Risk Level: Low (well-tested patterns, comprehensive monitoring)
Business Impact: High (enables production-ready deployments)