Staging Rollback Procedures

Date: August 11, 2025
Status: Production Ready
Environment: Staging with Docker Compose
Rollback Strategy: Automated + Manual procedures

Overview

This guide describes rollback procedures for the Piper Morgan staging environment, which runs on Docker Compose. It covers automatic, manual, emergency, and per-component rollbacks, along with the scripts, verification steps, and follow-up actions for each.

Rollback Strategy

Multi-Layer Approach

  1. Automatic Rollback: Health check failures trigger immediate rollback
  2. Manual Rollback: Developer-initiated rollback for planned changes
  3. Emergency Rollback: Force rollback for critical failures
  4. Gradual Rollback: Partial rollback for specific components

Prerequisites

Required Tools

# Verify rollback tools availability
ls -la scripts/rollback_*.sh
ls -la scripts/verify_staging_deployment.sh
ls -la scripts/backup_staging.sh
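
The `ls` commands above only list the files, so a missing or non-executable script is easy to overlook. A minimal sketch of a stricter check follows. It assumes the script names used throughout this guide live in one directory; `check_rollback_tools` is a hypothetical helper, not something the repository is known to provide.

```shell
#!/usr/bin/env bash
# Report required rollback scripts that are missing or not executable.
# check_rollback_tools is a hypothetical helper; the script names come from this guide.
check_rollback_tools() {
    local scripts_dir="${1:-scripts}"
    local missing=0
    local name
    for name in rollback_staging.sh emergency_rollback.sh \
                verify_staging_deployment.sh backup_staging.sh; do
        if [ ! -x "${scripts_dir}/${name}" ]; then
            echo "missing or not executable: ${scripts_dir}/${name}" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every script is present and executable
    [ "$missing" -eq 0 ]
}
```

Calling `check_rollback_tools scripts` before a deployment returns a non-zero status if any of the four scripts cannot be run.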

Required Access

Automatic Rollback

Health Check Triggers

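The staging environment automatically triggers rollback when any of the thresholds listed under Rollback Triggers is breached:

  - Response time exceeds 500ms for 3 consecutive checks
  - Error rate exceeds 5% for 2 consecutive checks
  - Service health falls below 100% on a single check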

Automatic Rollback Process

# Health check monitoring (runs every 30 seconds)
while true; do
    ./scripts/health_check.sh
    if [ $? -ne 0 ]; then
        echo "Health check failed, triggering automatic rollback"
        ./scripts/rollback_staging.sh
        break
    fi
    sleep 30
done

Rollback Triggers

# Health check thresholds
health_checks:
  response_time:
    threshold: 500ms
    consecutive_failures: 3
    action: "rollback"

  error_rate:
    threshold: 5%
    consecutive_failures: 2
    action: "rollback"

  service_health:
    threshold: 100%
    consecutive_failures: 1
    action: "rollback"
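
The thresholds above specify a number of consecutive failures, while the monitoring loop shown earlier rolls back on the first failed check. One way the two could be reconciled is sketched below. The function names `next_failure_count` and `monitor_with_threshold` are hypothetical, and the sketch assumes a single check command whose exit status signals health.

```shell
#!/usr/bin/env bash
# Track consecutive health check failures and roll back once a threshold is reached.
# next_failure_count and monitor_with_threshold are hypothetical names.

# Print the updated consecutive-failure count: reset on success, increment on failure.
next_failure_count() {
    local current="$1" exit_status="$2"
    if [ "$exit_status" -eq 0 ]; then
        echo 0
    else
        echo $((current + 1))
    fi
}

# Run check_cmd every `interval` seconds; after `threshold` failures in a row,
# run rollback_cmd and stop monitoring.
monitor_with_threshold() {
    local threshold="$1" interval="$2" check_cmd="$3" rollback_cmd="$4"
    local failures=0 status
    while true; do
        status=0
        "$check_cmd" || status=$?
        failures=$(next_failure_count "$failures" "$status")
        if [ "$failures" -ge "$threshold" ]; then
            echo "Health check failed ${failures} times in a row, triggering rollback"
            "$rollback_cmd"
            return
        fi
        sleep "$interval"
    done
}
```

For the response time rule this would be invoked as `monitor_with_threshold 3 30 ./scripts/health_check.sh ./scripts/rollback_staging.sh`. A single passing check resets the counter, so isolated slow responses do not cause a rollback.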

Manual Rollback

Planned Rollback

# 1. Verify current deployment health
./scripts/verify_staging_deployment.sh

# 2. Create backup before rollback
./scripts/backup_staging.sh

# 3. Execute rollback
./scripts/rollback_staging.sh

# 4. Verify rollback success
./scripts/verify_staging_deployment.sh
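
When the planned rollback steps are run by hand, it is easy to continue past a failed backup. A small sketch of a runner that stops at the first failing step is shown here; `run_steps` is a hypothetical helper and each argument is assumed to be a command that takes no arguments.

```shell
#!/usr/bin/env bash
# Run each command in order; stop and report the first one that fails.
# run_steps is a hypothetical helper.
run_steps() {
    local step
    for step in "$@"; do
        echo "Running: ${step}"
        if ! "$step"; then
            echo "Step failed: ${step}" >&2
            return 1
        fi
    done
}
```

A possible invocation is `run_steps ./scripts/backup_staging.sh ./scripts/rollback_staging.sh ./scripts/verify_staging_deployment.sh`. The initial health verification is left out of this chain on purpose, because a deployment that is already failing would otherwise stop the sequence before the rollback runs.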

Emergency Rollback

# Force immediate rollback (bypasses health checks)
./scripts/emergency_rollback.sh

# Verify critical services are running
./scripts/verify_critical_services.sh

Rollback Procedures by Component

Application Rollback

# Rollback application to previous version
git checkout HEAD~1
docker-compose build app
docker-compose up -d app

# Verify application health
curl http://localhost:8001/health
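
A single `curl` right after `docker-compose up -d` can fail simply because the application has not finished starting. The sketch below polls the health endpoint a limited number of times instead. `wait_for_health` and its default attempt count and delay are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Poll a health endpoint until it responds successfully or the attempts run out.
# wait_for_health is a hypothetical helper; defaults are illustrative.
wait_for_health() {
    local url="$1" attempts="${2:-10}" delay="${3:-3}"
    local i
    for ((i = 1; i <= attempts; i++)); do
        # -f: fail on HTTP errors, -s: quiet, --max-time: bound each request
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            echo "Healthy after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "Not healthy after ${attempts} attempts: ${url}" >&2
    return 1
}
```

For the application rollback this would be called as `wait_for_health http://localhost:8001/health`, which waits up to roughly 30 seconds of delay (plus request time) with the defaults shown.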

Database Rollback

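# Restore database from the most recent backup
# (assumes backup files are named backup_<timestamp>.sql in the current directory)
LATEST_BACKUP=$(ls -t backup_*.sql | head -n 1)
./scripts/restore_staging.sh "$LATEST_BACKUP"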

# Verify database integrity
docker-compose exec app python -c "
from services.database import get_db
db = get_db()
result = db.execute('SELECT COUNT(*) FROM information_schema.tables')
print(f'Database tables: {result.fetchone()[0]}')
"

Configuration Rollback

# Restore previous configuration
git checkout HEAD~1 -- .env
git checkout HEAD~1 -- docker-compose.yml

# Restart services with old config
docker-compose down
docker-compose up -d

Monitoring Stack Rollback

# Rollback monitoring services
# (older docker-compose releases do not accept service names for `down`, so use `stop`)
docker-compose stop prometheus grafana
docker-compose up -d prometheus grafana

# Verify monitoring is working
curl http://localhost:9090/-/healthy
curl http://localhost:3001/api/health

Rollback Scripts

Primary Rollback Script

#!/bin/bash
# scripts/rollback_staging.sh

set -e

echo "🚨 Initiating staging rollback..."

# 1. Stop current deployment
echo "Stopping current deployment..."
docker-compose down

# 2. Checkout previous version
echo "Checking out previous version..."
git checkout HEAD~1

# 3. Restore from backup if available
if [ -f "backup_latest.sql" ]; then
    echo "Restoring database from backup..."
    ./scripts/restore_staging.sh backup_latest.sql
fi

# 4. Rebuild and restart
echo "Rebuilding and restarting services..."
docker-compose build
docker-compose up -d

# 5. Wait for services to be ready
echo "Waiting for services to be ready..."
sleep 30

# 6. Verify rollback success
echo "Verifying rollback success..."
./scripts/verify_staging_deployment.sh

echo "✅ Rollback completed successfully"

Emergency Rollback Script

#!/bin/bash
# scripts/emergency_rollback.sh

set -e

echo "🚨 EMERGENCY ROLLBACK - Bypassing health checks"

# 1. Force stop all services
docker-compose down --remove-orphans

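# 2. Checkout last known good version
# NOTE: HEAD~5 is a blunt fallback; check out a pre-deploy tag instead when one exists
git checkout HEAD~5  # Go back 5 commits for safety

# 3. Rebuild and restart with minimal verification
# (--build ensures the running containers match the checked-out code)
docker-compose up -d --build

# 4. Basic health check only
curl -f http://localhost:8001/health || echo "Health check failed - verify services manually"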

echo "⚠️  Emergency rollback completed - manual verification required"

Verification Script

#!/bin/bash
# scripts/verify_staging_deployment.sh

set -e

echo "🔍 Verifying staging deployment..."

# Test 1: Application health
echo "Test 1: Application health..."
curl -f http://localhost:8001/health

# Test 2: Database connection
# (-T disables pseudo-TTY allocation so exec also works from non-interactive scripts)
echo "Test 2: Database connection..."
docker-compose exec -T app python -c "from services.database import get_db; print('Database OK')"

# Test 3: Redis connection
echo "Test 3: Redis connection..."
docker-compose exec -T app python -c "import redis; r = redis.Redis(host='redis'); r.ping(); print('Redis OK')"

# Test 4: MCP health
echo "Test 4: MCP health..."
curl -f http://localhost:8001/health/mcp

# Test 5: Performance test
echo "Test 5: Performance test..."
start_time=$(date +%s%N)
curl -s http://localhost:8001/health > /dev/null
end_time=$(date +%s%N)
response_time=$(( (end_time - start_time) / 1000000 ))

if [ "$response_time" -lt 500 ]; then
    echo "✅ Performance OK: ${response_time}ms"
else
    echo "⚠️  Performance degraded: ${response_time}ms"
fi

echo "✅ All verification tests passed"
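
The performance test above relies on `date +%s%N`, which is a GNU extension and does not produce nanoseconds on every platform. A hedged alternative is to let `curl` report its own timing through the `time_total` write-out variable. The helper names `seconds_to_ms` and `measure_response_ms` below are hypothetical.

```shell
#!/usr/bin/env bash
# Measure response time with curl's own timer instead of date +%s%N.
# seconds_to_ms and measure_response_ms are hypothetical helpers.

# Convert a decimal number of seconds (for example 0.250) to whole milliseconds,
# rounding to the nearest millisecond.
seconds_to_ms() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

# Print the total request time for a URL in milliseconds.
measure_response_ms() {
    local seconds
    seconds=$(curl -s -o /dev/null -w '%{time_total}' "$1") || return 1
    seconds_to_ms "$seconds"
}
```

The result of `measure_response_ms http://localhost:8001/health` can then be compared against the 500 ms threshold with the same `-lt 500` test used in the verification script.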

Rollback Decision Matrix

When to Rollback

| Issue Type              | Severity | Rollback Action  | Verification Required    |
|-------------------------|----------|------------------|--------------------------|
| Critical Service Down   | High     | Immediate        | Full health check        |
| Performance Degradation | Medium   | After 3 failures | Performance metrics      |
| Configuration Error     | Medium   | Manual           | Configuration validation |
| Database Corruption     | High     | Immediate        | Data integrity check     |
| Security Vulnerability  | High     | Immediate        | Security scan            |
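
The matrix above can also be encoded as a lookup so that tooling and people apply the same rules. This is only a sketch: the issue-type identifiers and the function name `rollback_action_for` are invented for illustration, and the actions mirror the Rollback Action column.

```shell
#!/usr/bin/env bash
# Map an issue type to the rollback action from the decision matrix.
# rollback_action_for and the issue identifiers are hypothetical.
rollback_action_for() {
    case "$1" in
        critical-service-down|database-corruption|security-vulnerability)
            echo "immediate" ;;
        performance-degradation)
            echo "after-3-failures" ;;
        configuration-error)
            echo "manual" ;;
        *)
            echo "unknown issue type: $1" >&2
            return 1 ;;
    esac
}
```

Unknown issue types return a non-zero status rather than a default action, so an unclassified problem is escalated to a person instead of being rolled back automatically.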

Rollback Timing

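| Rollback Type | Trigger Time        | Recovery Time | Verification Time   |
|---------------|---------------------|---------------|---------------------|
| Automatic     | <30 seconds         | 2-5 minutes   | 1-2 minutes         |
| Manual        | Developer initiated | 5-10 minutes  | 2-3 minutes         |
| Emergency     | <10 seconds         | 1-3 minutes   | Manual verification |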

Monitoring During Rollback

Key Metrics to Watch

# Monitor rollback progress
watch -n 5 'docker-compose ps && echo "---" && curl -s http://localhost:8001/health | jq'

# Check service logs during rollback
docker-compose logs -f app postgres redis

Rollback Status Dashboard

# Create rollback status dashboard
cat > rollback_status.md << EOF
# Rollback Status: $(date)

## Services Status
$(docker-compose ps)

## Health Checks
$(curl -s http://localhost:8001/health | jq)

## Performance Metrics
$(curl -s http://localhost:8001/metrics | grep -E "(response_time|error_rate)")
EOF

Post-Rollback Actions

Immediate Actions

  1. Verify System Health: Run comprehensive health checks
  2. Document Rollback: Record reason and timing
  3. Notify Team: Alert relevant stakeholders
  4. Investigate Root Cause: Analyze why rollback was needed

Follow-up Actions

  1. Root Cause Analysis: Document findings and lessons learned
  2. Prevention Measures: Implement fixes to prevent recurrence
  3. Rollback Improvement: Enhance rollback procedures if needed
  4. Team Review: Conduct post-mortem if significant issues occurred

Rollback Best Practices

Before Deployment

# 1. Create backup
./scripts/backup_staging.sh

# 2. Tag current version
git tag "pre-deploy-$(date +%Y%m%d-%H%M%S)"

# 3. Verify rollback tools
./scripts/test_rollback_tools.sh
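
Tagging before each deployment makes it possible to roll back to a named version rather than counting commits. The sketch below picks the newest such tag from a list of tag names. It relies on the `pre-deploy-YYYYMMDD-HHMMSS` naming shown above, which sorts chronologically as plain text; `latest_predeploy_tag` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the newest pre-deploy tag from a list of tag names read on stdin.
# Tags named pre-deploy-YYYYMMDD-HHMMSS sort chronologically as plain strings.
# latest_predeploy_tag is a hypothetical helper.
latest_predeploy_tag() {
    grep '^pre-deploy-' | LC_ALL=C sort | tail -n 1
}
```

One way to use it is `tag=$(git tag --list 'pre-deploy-*' | latest_predeploy_tag)`, followed by `git checkout "$tag"` once the variable has been confirmed to be non-empty.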

During Deployment

# 1. Monitor health checks
./scripts/monitor_deployment.sh

# 2. Watch for warning signs:
#    - Response time increases
#    - Error rate spikes
#    - Service health degradation

After Rollback

# 1. Verify system stability
./scripts/verify_staging_deployment.sh

# 2. Document rollback details
./scripts/document_rollback.sh

# 3. Plan next deployment
./scripts/plan_next_deployment.sh

Troubleshooting Rollback Issues

Common Rollback Problems

Rollback Script Fails

# Check script permissions
chmod +x scripts/rollback_staging.sh

# Verify script dependencies
./scripts/check_rollback_dependencies.sh

# Run with debug output
bash -x scripts/rollback_staging.sh

Services Won't Start After Rollback

# Check Docker resources
docker system df
docker system prune -f

# Verify port availability
netstat -tulpn | grep -E "(8001|8081|3001|9090)"

# Check service logs
docker-compose logs [service_name]

Database Rollback Issues

# Verify backup file integrity
./scripts/verify_backup.sh backup_file.sql

# Check database permissions
docker-compose exec postgres psql -U piper_user -l

# Manual database restore
docker-compose exec -T postgres psql -U piper_user -d piper_morgan_staging < backup_file.sql

Rollback Metrics and Reporting

Rollback Performance Metrics

# Track rollback success rate
./scripts/rollback_metrics.sh

# Expected metrics:
# - Rollback success rate: >95%
# - Average rollback time: <5 minutes
# - Recovery time: <10 minutes
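
The guide does not show what `rollback_metrics.sh` does internally. As an illustration of how the success rate and average rollback time could be derived, the sketch below summarizes a log file. The log format (one line per rollback, with a result and a duration in seconds) and the name `summarize_rollbacks` are assumptions.

```shell
#!/usr/bin/env bash
# Summarize a rollback log with lines of the form: <result> <duration_seconds>
# where <result> is "success" or "failure". The log format is an assumption,
# and summarize_rollbacks is a hypothetical helper.
summarize_rollbacks() {
    awk '
        NF >= 2 { total++; sum += $2; if ($1 == "success") ok++ }
        END {
            if (total == 0) { print "no rollbacks recorded"; exit }
            printf "success_rate=%.1f%% avg_seconds=%.0f\n", 100 * ok / total, sum / total
        }
    ' "$1"
}
```

The printed values can be compared with the targets listed above, for example a success rate above 95% and an average below 300 seconds.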

Rollback Report Template

# Rollback Report

**Date**: [Date]
**Trigger**: [Automatic/Manual/Emergency]
**Reason**: [Description of issue]
**Rollback Time**: [Duration]
**Recovery Time**: [Duration]
**Services Affected**: [List]
**Root Cause**: [Analysis]
**Prevention Measures**: [Actions taken]
**Lessons Learned**: [Key insights]
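
Filling in the date and trigger at rollback time makes it more likely that a report is actually written. The sketch below generates a report file from the template above, pre-filling the fields that are known immediately and leaving the rest as placeholders. `write_rollback_report` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Write a rollback report based on the template in this guide.
# Date, trigger, and reason are filled in; other fields stay as placeholders.
# write_rollback_report is a hypothetical helper.
write_rollback_report() {
    local trigger="$1" reason="$2" out_file="$3"
    cat > "$out_file" <<EOF
# Rollback Report

**Date**: $(date '+%Y-%m-%d %H:%M:%S')
**Trigger**: ${trigger}
**Reason**: ${reason}
**Rollback Time**: [Duration]
**Recovery Time**: [Duration]
**Services Affected**: [List]
**Root Cause**: [Analysis]
**Prevention Measures**: [Actions taken]
**Lessons Learned**: [Key insights]
EOF
}
```

An example call is `write_rollback_report "Manual" "Configuration error" rollback_report.md`. The remaining placeholders are completed during the follow-up review.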

Status: Production Ready ✅
Rollback Strategy: Multi-layer approach ✅
Automation: Health check triggers ✅
Recovery Time: <5 minutes target ✅
Documentation: Comprehensive procedures ✅