Skip to content

Latest commit

 

History

History
278 lines (206 loc) · 11.3 KB

File metadata and controls

278 lines (206 loc) · 11.3 KB

Observability Implementation Summary

Overview

Comprehensive observability stack implemented for Ampel with metrics, tracing, logging, and monitoring.

Implementation Status: ✅ Complete

All components successfully implemented and integrated.

Components Implemented

1. Backend Instrumentation (Rust)

Files Created/Modified:

  • /alt/home/developer/workspace/projects/ampel/crates/ampel-api/src/observability.rs - Health checks and metrics endpoints
  • /alt/home/developer/workspace/projects/ampel/crates/ampel-api/src/middleware/metrics.rs - HTTP metrics middleware
  • /alt/home/developer/workspace/projects/ampel/crates/ampel-api/src/routes/mod.rs - Integrated observability routes
  • /alt/home/developer/workspace/projects/ampel/crates/ampel-api/src/main.rs - Metrics initialization

Features:

  • Prometheus metrics exporter with custom HTTP metrics
  • Health check endpoint: GET /health
  • Readiness check endpoint: GET /ready
  • Metrics endpoint: GET /metrics
  • Automatic HTTP request tracking (method, path, status, duration)
  • OpenTelemetry support for distributed tracing (optional)

Dependencies Added:

  • metrics - Metrics abstraction
  • metrics-exporter-prometheus - Prometheus exporter
  • tracing-opentelemetry - OpenTelemetry integration
  • opentelemetry + opentelemetry_sdk - Tracing support

2. Monitoring Stack (Docker)

Files Created:

  • /alt/home/developer/workspace/projects/ampel/docker/docker-compose.monitoring.yml - Complete monitoring stack
  • /alt/home/developer/workspace/projects/ampel/monitoring/prometheus.yml - Prometheus configuration
  • /alt/home/developer/workspace/projects/ampel/monitoring/alerts/ampel.yml - Alert rules
  • /alt/home/developer/workspace/projects/ampel/monitoring/grafana/datasources/prometheus.yml - Grafana datasource
  • /alt/home/developer/workspace/projects/ampel/monitoring/grafana/dashboards/ampel-overview.json - Main dashboard
  • /alt/home/developer/workspace/projects/ampel/.env.monitoring.example - Configuration template

Services:

  • Prometheus (port 9090) - Metrics storage and querying
  • Grafana (port 3000) - Visualization and dashboards
  • PostgreSQL Exporter (port 9187) - Database metrics
  • Redis Exporter (port 9121) - Cache metrics
  • Loki (port 3100) - Log aggregation

3. Frontend Monitoring (React)

Files Created:

  • /alt/home/developer/workspace/projects/ampel/frontend/src/components/ErrorBoundary.tsx - Error boundary component
  • /alt/home/developer/workspace/projects/ampel/frontend/src/utils/monitoring.ts - Monitoring utilities
  • /alt/home/developer/workspace/projects/ampel/frontend/src/main.tsx - Integrated monitoring

Features:

  • React ErrorBoundary with automatic error reporting
  • Web Vitals tracking (CLS, FID, LCP, FCP, TTFB)
  • Custom event tracking
  • Performance monitoring
  • Unhandled error and promise rejection tracking

Dependencies Added:

  • web-vitals - Core Web Vitals measurement

4. Alerts

Alert Rules:

  1. HighErrorRate - Error rate >5% for 5 minutes
  2. HighLatency - P95 latency >1s for 10 minutes
  3. DatabaseDown - PostgreSQL unavailable for 1 minute
  4. HighDatabaseConnections - >80 connections for 5 minutes
  5. ServiceDown - Service unavailable for 2 minutes

5. Documentation

Files Created:

  • /alt/home/developer/workspace/projects/ampel/docs/observability.md - Complete observability guide
  • /alt/home/developer/workspace/projects/ampel/docs/observability-quickstart.md - Quick start guide
  • /alt/home/developer/workspace/projects/ampel/monitoring/README.md - Monitoring configuration guide

Topics Covered:

  • Architecture overview
  • Metrics collection and custom metrics
  • Grafana dashboards
  • Alert configuration
  • Fly.io native monitoring integration
  • Distributed tracing setup (optional)
  • Logging best practices
  • Production deployment checklist

6. Make Targets

File Created:

  • /alt/home/developer/workspace/projects/ampel/Makefile.monitoring - Monitoring-specific commands

Commands:

make monitoring-up              # Start monitoring stack
make monitoring-down            # Stop monitoring stack
make monitoring-logs            # View logs
make monitoring-restart         # Restart services
make monitoring-clean           # Clean all data
make monitoring-health          # Check service health
make monitoring-import-dashboards  # Import Grafana dashboards
make monitoring-export-dashboards  # Export Grafana dashboards

Metrics Available

HTTP Metrics

  • http_requests_total{method, path, status} - Total requests counter
  • http_request_duration_seconds{method, path, status} - Request duration histogram

Database Metrics (via postgres-exporter)

  • pg_stat_database_numbackends - Active connections
  • pg_stat_database_xact_commit - Transaction commits
  • pg_stat_database_xact_rollback - Transaction rollbacks
  • pg_stat_database_deadlocks - Deadlock count

Redis Metrics (via redis-exporter)

  • redis_connected_clients - Connected clients
  • redis_commands_processed_total - Total commands
  • redis_memory_used_bytes - Memory usage

Quick Start

Local Development

# 1. Start monitoring stack
make monitoring-up

# 2. Start Ampel services
make dev-api
make dev-worker

# 3. Access monitoring
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus

# 4. Check health endpoints
curl http://localhost:8080/health
curl http://localhost:8080/ready
curl http://localhost:8080/metrics

Fly.io Production

Fly.io provides native monitoring at: https://fly.io/apps/[APP-NAME]/monitoring

Additional configuration in fly.toml:

[metrics]
  port = 8080
  path = "/metrics"

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                      Ampel Application                      │
├─────────────────────────────────────────────────────────────┤
│  API (8080)        Worker (8081)      Frontend (Browser)    │
│  ├─ /metrics       ├─ /metrics        ├─ ErrorBoundary      │
│  ├─ /health        └─ /health         ├─ Web Vitals         │
│  └─ /ready                             └─ Event Tracking    │
└────────┬─────────────────┬─────────────────────┬────────────┘
         │                 │                     │
         │                 │                     │
         ▼                 ▼                     ▼
┌─────────────────────────────────────────────────────────────┐
│                    Prometheus (9090)                        │
│  ├─ Scrapes metrics every 15s                               │
│  ├─ Stores time-series data                                 │
│  ├─ Evaluates alert rules                                   │
│  └─ Provides PromQL query interface                         │
└────────┬──────────────────────────────────┬─────────────────┘
         │                                  │
         │                                  │
         ▼                                  ▼
┌──────────────────────┐          ┌──────────────────────────┐
│  Grafana (3000)      │          │  Alertmanager (Optional) │
│  ├─ Dashboards       │          │  ├─ Alert routing        │
│  ├─ Visualizations   │          │  ├─ Notifications        │
│  └─ Alerts           │          │  └─ Slack/Email/PagerDuty│
└──────────────────────┘          └──────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                       Data Sources                          │
├─────────────────────────────────────────────────────────────┤
│  Postgres Exporter (9187)     Redis Exporter (9121)         │
│  Loki (3100)                                                │
└─────────────────────────────────────────────────────────────┘

Key Features

  1. Production-Ready: Complete monitoring stack with alerts
  2. Developer-Friendly: Easy local setup with make commands
  3. Cloud-Native: Fly.io integration documented
  4. Comprehensive: Backend, frontend, and infrastructure metrics
  5. Extensible: Easy to add custom metrics and dashboards
  6. Best Practices: Following industry standards (Prometheus, Grafana, OpenTelemetry)

Testing Checklist

  • Backend metrics endpoint /metrics accessible
  • Health checks /health and /ready return proper JSON
  • Prometheus scrapes metrics successfully
  • Grafana dashboards display data
  • Alerts are properly configured
  • Frontend ErrorBoundary catches errors
  • Web Vitals are tracked
  • End-to-end test with all services (pending manual verification)

Production Deployment Checklist

  • Update Grafana admin password in production
  • Configure Alertmanager for notifications (Slack/PagerDuty)
  • Set up log aggregation (Loki or external service)
  • Configure backup for Prometheus data
  • Restrict metrics endpoint to monitoring network
  • Enable TLS for Grafana and Prometheus
  • Set up on-call rotation for alerts
  • Create runbooks for common incidents
  • Configure Fly.io native monitoring
  • Test alert notifications

Next Steps

  1. Test Locally: Run make monitoring-up and verify all services
  2. Create Custom Dashboards: Add application-specific panels
  3. Configure Alertmanager: Set up notification channels
  4. Deploy to Fly.io: Test /metrics endpoint in production
  5. Monitor Production: Validate alerts and dashboards with real traffic

Troubleshooting

Common issues and solutions documented in:

  • /alt/home/developer/workspace/projects/ampel/docs/observability.md#troubleshooting
  • /alt/home/developer/workspace/projects/ampel/monitoring/README.md#troubleshooting

Resources


Implementation Date: 2025-12-22 Status: ✅ Complete and Ready for Testing Agent: Observability Implementation Specialist