This guide provides comprehensive instructions for rolling back StillMe deployments across different environments and deployment strategies.

Rollback types:
- Application Rollback: Roll back to the previous application version
- Configuration Rollback: Roll back to the previous configuration
- Database Rollback: Roll back database changes
- Infrastructure Rollback: Roll back infrastructure changes

Rollback triggers:
- Health Check Failures: Continuous health check failures
- High Error Rate: Error rate > 5%
- Performance Degradation: P95 latency > 1000ms
- Security Incidents: Security breach or compromise
- Manual Override: Manual rollback decision
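
The two numeric triggers above lend themselves to a small automated check. The following is a minimal sketch, not part of StillMe's tooling: the function name `should_rollback` and the idea of passing the measured values as arguments are assumptions made for illustration. Only the 5% and 1000 ms thresholds come from the list above.

```shell
#!/bin/sh
# Thresholds taken from the trigger list above.
MAX_ERROR_RATE_PCT=5
MAX_P95_MS=1000

# should_rollback ERROR_RATE_PCT P95_MS   (hypothetical helper)
# Exits 0 when either threshold is exceeded, 1 otherwise.
# awk is used because POSIX sh cannot compare fractional numbers.
should_rollback() {
  awk -v err="$1" -v p95="$2" \
      -v max_err="$MAX_ERROR_RATE_PCT" -v max_p95="$MAX_P95_MS" \
      'BEGIN { exit !((err + 0 > max_err + 0) || (p95 + 0 > max_p95 + 0)) }'
}

# Example (not executed here):
#   should_rollback 6.2 400 && make rollback TAG=v1.2.3
```

Both comparisons are strict, so a reading of exactly 5% or exactly 1000 ms does not trigger a rollback, which matches the ">" in the list.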
# Rollback to previous version
make rollback TAG=v1.2.3
# Or use rollback script directly
./scripts/rollback.sh --tag v1.2.3

# Check current running version
docker ps --filter "name=stillme" --format "table {{.Image}}"
# Check available versions
docker images stillme --format "table {{.Tag}}"

# Stop current container
docker stop stillme-prod
# Remove current container
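docker rm stillme-prod

# Start previous version
# (absolute host paths are used for the bind mounts; older Docker releases reject relative paths with -v)
docker run -d \
  --name stillme-prod \
  --restart unless-stopped \
  -p 8080:8080 \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/logs:/app/logs" \
  stillme:v1.2.3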
# Verify rollback
curl http://localhost:8080/healthz

# docker-compose.prod.yml
services:
  stillme-prod:
    image: stillme:v1.2.3  # Previous version
    # ... rest of configuration

# Deploy previous version
docker-compose -f docker-compose.prod.yml up -d --force-recreate
# Verify rollback
docker-compose -f docker-compose.prod.yml ps
curl http://localhost:8080/healthz

# Check deployment history
kubectl rollout history deployment/stillme -n stillme
# Check specific revision
kubectl rollout history deployment/stillme --revision=2 -n stillme

# Rollback to previous version
kubectl rollout undo deployment/stillme -n stillme
# Rollback to specific revision
kubectl rollout undo deployment/stillme --to-revision=2 -n stillme

# Check rollout status
kubectl rollout status deployment/stillme -n stillme
# Check pod status
kubectl get pods -n stillme
# Check service health
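# (port-forward blocks in the foreground, so run it in the background for the check)
kubectl port-forward svc/stillme 8080:8080 -n stillme &
PF_PID=$!
sleep 2  # give the port-forward a moment to establish
curl http://localhost:8080/healthz
kill "$PF_PID"

# Switch traffic back to blue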
kubectl patch service stillme -n stillme -p '{"spec":{"selector":{"version":"blue"}}}'
# Verify traffic switch
kubectl get svc stillme -n stillme -o yaml

# Scale down green deployment
kubectl scale deployment stillme-green --replicas=0 -n stillme
# Verify green is scaled down
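kubectl get pods -n stillme

# Pause canary rollout (custom resources need --type merge; strategic merge patches are not supported for them)
kubectl patch rollout stillme --type merge -p '{"spec":{"paused":true}}' -n stillme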
# Check rollout status
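kubectl get rollout stillme -n stillme

# Rollback canary to stable version
# (assumes the rollout is an Argo Rollouts resource and the kubectl plugin is installed;
#  plain `kubectl rollout` only handles Deployments, DaemonSets, and StatefulSets)
kubectl argo rollouts undo stillme --to-revision=1 -n stillme
# Verify rollback
kubectl argo rollouts status stillme -n stillme

# Basic rollback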
./scripts/rollback.sh --tag v1.2.3
# Rollback with namespace
./scripts/rollback.sh --tag v1.2.3 --namespace production
# Rollback with service name
./scripts/rollback.sh --tag v1.2.3 --service stillme-api
# Rollback with custom health check URL
./scripts/rollback.sh --tag v1.2.3 --url http://api.stillme.ai/healthz

| Option | Description | Default |
|---|---|---|
| `-t, --tag` | Previous tag to roll back to | Required |
| `-n, --namespace` | Kubernetes namespace | `stillme` |
| `-s, --service` | Service name | `stillme-prod` |
| `-u, --url` | Health check URL | `http://localhost:8080/healthz` |
| `-h, --help` | Show help message | - |
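
To make the defaults in the table concrete, here is a minimal sketch of how such options could be parsed in POSIX sh. It is an illustration only: the function name `parse_rollback_args` and the variable names are hypothetical, and the real `scripts/rollback.sh` may be implemented quite differently. The `-h, --help` option is left out for brevity.

```shell
#!/bin/sh
# Hypothetical option parser mirroring the options table above.
# Sets TAG, NAMESPACE, SERVICE, and HEALTH_URL; returns 2 on bad input.
parse_rollback_args() {
  # Defaults as documented in the table.
  TAG=""
  NAMESPACE="stillme"
  SERVICE="stillme-prod"
  HEALTH_URL="http://localhost:8080/healthz"

  while [ "$#" -gt 0 ]; do
    case "$1" in
      -t|--tag|-n|--namespace|-s|--service|-u|--url)
        # Every supported option takes exactly one value.
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 2
        fi
        case "$1" in
          -t|--tag)       TAG="$2" ;;
          -n|--namespace) NAMESPACE="$2" ;;
          -s|--service)   SERVICE="$2" ;;
          -u|--url)       HEALTH_URL="$2" ;;
        esac
        shift 2 ;;
      *)
        echo "unknown option: $1" >&2
        return 2 ;;
    esac
  done

  # --tag has no default, so it must be supplied.
  if [ -z "$TAG" ]; then
    echo "--tag is required" >&2
    return 2
  fi
}

# Example (not executed here):
#   parse_rollback_args "$@" || exit $?
```

Checking the argument count before `shift 2` avoids a fatal shell error when an option is given without its value.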
# .github/workflows/rollback.yml
name: Manual Rollback
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to rollback'
        required: true
        default: 'production'
        type: choice
        options:
          - production
          - staging
      tag:
        description: 'Tag to rollback to'
        required: true
        type: string
jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: ${{ github.event.inputs.environment }}
    steps:
      - uses: actions/checkout@v4
      - name: Rollback deployment
        run: |
          ./scripts/rollback.sh --tag ${{ github.event.inputs.tag }} --namespace ${{ github.event.inputs.environment }}
      - name: Verify rollback
        run: |
          curl -f http://localhost:8080/healthz
          curl -f http://localhost:8080/readyz

# Trigger rollback via GitHub CLI
gh workflow run rollback.yml -f environment=production -f tag=v1.2.3
# Or via GitHub UI
# Go to Actions > Manual Rollback > Run workflow

# Check liveness probe
curl http://localhost:8080/healthz
# Check readiness probe
curl http://localhost:8080/readyz
# Check metrics endpoint
curl http://localhost:8080/metrics

# Check application logs
kubectl logs -f deployment/stillme -n stillme
# Check for errors
kubectl logs deployment/stillme -n stillme | grep ERROR
# Check resource usage
kubectl top pods -n stillme

# Run load tests
make load-test
# Check performance metrics
curl http://localhost:8080/metrics | grep http_request_duration

# Check SLO compliance
curl http://localhost:8080/metrics | grep -E "(p95|error_rate|availability)"

# Check security headers (header names are case-insensitive, hence -i)
curl -sI http://localhost:8080/ | grep -iE "(X-|Strict-|Content-Security)"

# Run security scans
make security
# Check security compliance
curl http://localhost:8080/security/status

# Emergency rollback script
./scripts/emergency_rollback.sh
# Or manual emergency rollback
kubectl rollout undo deployment/stillme -n stillme
kubectl rollout status deployment/stillme -n stillme

# Activate kill switch
curl -X POST http://localhost:8080/security/kill-switch/activate
# Verify kill switch
curl http://localhost:8080/security/kill-switch/status

# Scale down service
kubectl scale deployment stillme --replicas=0 -n stillme
# Or delete the deployment entirely
kubectl delete deployment stillme -n stillme

- On-Call Engineer: +1-XXX-XXX-XXXX
- Security Team: security@stillme.ai
- Management: management@stillme.ai
- Incident Response: incident@stillme.ai

Pre-rollback:
- Identify Issue: Document the issue requiring rollback
- Assess Impact: Determine scope and impact of rollback
- Notify Team: Inform relevant team members
- Backup Data: Ensure data is backed up
- Test Rollback: Test rollback procedure in staging
- Prepare Rollback: Identify target version for rollback

During rollback:
- Stop Traffic: Stop traffic to affected service
- Execute Rollback: Run rollback procedure
- Verify Health: Check service health endpoints
- Test Functionality: Verify core functionality
- Monitor Metrics: Watch key performance metrics
- Check Logs: Review application logs for errors

Post-rollback:
- Verify Rollback: Confirm rollback was successful
- Monitor System: Monitor system for 24-48 hours
- Document Incident: Document incident and rollback
- Root Cause Analysis: Investigate root cause
- Update Procedures: Update rollback procedures if needed
- Team Communication: Communicate status to team
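
The "Verify Health" step in the checklist above usually needs a few retries, since a freshly rolled-back service may take some seconds to come up. The following is a minimal sketch of such a retry loop, assuming nothing about StillMe beyond the `/healthz` endpoint used elsewhere in this guide. The helper name `wait_for_healthy` and its argument order are hypothetical.

```shell
#!/bin/sh
# wait_for_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
wait_for_healthy() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Example (not executed here):
#   wait_for_healthy 10 3 curl -fsS http://localhost:8080/healthz
```

Taking the probe as a command rather than a URL keeps the loop reusable for `kubectl rollout status` or any other check that signals health through its exit code.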
# Check deployment status
kubectl get deployment stillme -n stillme
# Check pod status
kubectl get pods -n stillme
# Check events
kubectl get events -n stillme --sort-by='.lastTimestamp'
# Check logs
kubectl logs deployment/stillme -n stillme

# Check health endpoints
curl -v http://localhost:8080/healthz
curl -v http://localhost:8080/readyz
# Check service configuration
kubectl describe service stillme -n stillme
# Check ingress
kubectl describe ingress stillme -n stillme

# Check resource usage
kubectl top pods -n stillme
kubectl top nodes
# Check metrics
curl http://localhost:8080/metrics
# Check for resource limits
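kubectl describe pod -l app=stillme -n stillme

# Debug pod with an ephemeral container
# (kubectl debug needs an image; busybox is an assumed choice, substitute your own)
kubectl debug -it pod/stillme-xxx --image=busybox -n stillme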
# Port forward for debugging
kubectl port-forward svc/stillme 8080:8080 -n stillme
# Execute commands in pod
kubectl exec -it deployment/stillme -n stillme -- /bin/bash
# Check configuration
kubectl describe configmap stillme-config -n stillme
kubectl describe secret stillme-secrets -n stillme

- Automated Rollback: Implement automated rollback triggers
- Health Checks: Comprehensive health check validation
- Monitoring: Real-time monitoring during rollback
- Documentation: Document all rollback procedures
- Testing: Regular rollback testing in staging
- Staging Testing: Thorough testing in staging environment
- Canary Deployments: Use canary deployments for risky changes
- Feature Flags: Use feature flags for gradual rollouts
- Monitoring: Comprehensive monitoring and alerting
- Backup Strategy: Regular backups and recovery testing
- Incident Communication: Clear incident communication
- Status Updates: Regular status updates during rollback
- Post-Incident Review: Conduct post-incident reviews
- Lessons Learned: Document and share lessons learned
- Team Training: Regular team training on rollback procedures
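
Automated rollback triggers and monitoring, as recommended above, both depend on turning raw metrics into a number that can be compared against a threshold. The sketch below shows one way to derive an error-rate percentage from Prometheus text-format output. It is an assumption-heavy illustration: the metric name `http_requests_total`, its `status` label, and the helper name `error_rate_pct` are hypothetical and are not confirmed by anything in this guide, so adapt them to whatever the `/metrics` endpoint actually exposes.

```shell
#!/bin/sh
# error_rate_pct   (hypothetical helper)
# Reads Prometheus text-format metrics on stdin and prints the percentage
# of requests whose status label is 5xx, with two decimals.
# Assumes a counter named http_requests_total with a status label and
# no trailing timestamp, so the last field on each line is the value.
error_rate_pct() {
  awk '
    /^http_requests_total[{]/ {
      total += $NF
      if ($0 ~ /status="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * errors / total
    }
  '
}

# Example (not executed here):
#   curl -s http://localhost:8080/metrics | error_rate_pct
```

Counters are cumulative since process start, so in practice the rate would be computed from the difference between two scrapes rather than from a single one.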
- Documentation: docs/
- Issues: GitHub Issues
- Security: SECURITY.md
- Community: GitHub Discussions

Maintainer: StillMe DevOps Team. Review cadence: every 3 months.