This repository is the single source of truth for Dynatrace observability across all services in the ADO project. It owns two distinct layers and supports three delivery mechanisms — choose the one that fits your team.
| Layer | What it manages | Where it lives |
|---|---|---|
| Platform | Management zones, auto-tags, alerting profiles, notification integrations (Slack / MS Teams / PagerDuty / Splunk On-Call), request attributes, span attributes | terraform/platform-resources/ |
| Application | SLOs (service + per-endpoint), metric event alerts, dashboards, synthetic monitors, log metrics | scaffold/observability/ or scaffold/observability-operator/ → rendered into each app repo |

| | Monaco + Argo CD CMP | Custom Operator | Crossplane |
|---|---|---|---|
| How | Jinja2 templates → Monaco YAML → CMP sidecar | CRD manifests → Go controller | Claims → Composition → provider |
| Drift detection | CronJob every 6h | Built-in every 5 min | provider-terraform: 10 min |
| App team writes | Monaco YAML in observability/ | DynatraceSLO / DynatraceAlert CRs | ServiceObservabilityClaim |
| Backstage entity resolution | Manual entity IDs | Auto via backstageId tag | Not supported natively |
| Build cost | Zero | Already built (Go operator in operator/) | provider-terraform works today; native provider = 2–4 months |
| Best for | Existing Monaco investment | Clean API, org-specific logic | Orgs already running Crossplane |

Detailed docs for each approach:
docs/
observability-template/
│
├── scaffold/
│ ├── observability/ # Monaco approach — Jinja2 → Monaco YAML
│ │ ├── manifest.yaml.j2 # Monaco v2 project manifest
│ │ ├── environments/
│ │ │ ├── dev.yaml.j2 # SLO targets + env config
│ │ │ ├── staging.yaml.j2
│ │ │ ├── perf.yaml.j2 # Relaxed thresholds for load testing
│ │ │ └── prod.yaml.j2 # Contractual SLA targets
│ │ ├── slos/
│ │ │ ├── availability.yaml.j2 + availability-slo.json.j2
│ │ │ ├── latency.yaml.j2 + latency-slo.json.j2
│ │ │ └── endpoints/ # Per-endpoint SLOs
│ │ │ ├── critical-endpoints.yaml.j2 # Teams list their endpoints here
│ │ │ ├── endpoint-availability-slo.json.j2
│ │ │ ├── endpoint-latency-slo.json.j2
│ │ │ └── generated/ # Output of generate-endpoint-slos.py
│ │ ├── alerts/
│ │ │ ├── error-rate.yaml.j2 + error-rate.json.j2
│ │ │ ├── latency-p99.yaml.j2 + latency-p99.json.j2
│ │ │ └── error-budget-burn.yaml.j2 + error-budget-burn.json.j2
│ │ ├── dashboards/
│ │ │ └── service-overview.yaml.j2 + service-dashboard.json.j2
│ │ ├── synthetic/
│ │ │ └── health-check.yaml.j2 + http-monitor.json.j2
│ │ └── log-metrics/
│ │ └── error-log-metric.yaml.j2
│ │
│ ├── observability-operator/ # Operator approach — Jinja2 → CRD manifests
│ │ ├── dev/
│ │ │ ├── slos.yaml.j2 # DynatraceSLO CRDs (target: 99.0)
│ │ │ └── alerts.yaml.j2 # DynatraceAlert CRDs
│ │ ├── staging/
│ │ │ ├── slos.yaml.j2
│ │ │ └── alerts.yaml.j2
│ │ ├── perf/
│ │ │ └── slos.yaml.j2 # Relaxed thresholds for load tests
│ │ └── prod/
│ │ ├── slos.yaml.j2 # DynatraceSLO (target: 99.9%)
│ │ ├── alerts.yaml.j2 # errorRate + latencyP99 + burnRateFast + burnRateSlow
│ │ └── dashboard.yaml.j2 # DynatraceDashboard (service-overview template)
│ │
│ ├── scripts/ # Validation + generation scripts → app repos
│ │ ├── ddu-estimator.py # DDU consumption estimate (Monaco approach)
│ │ ├── slo-regression-check.py # Block PRs that lower SLO targets
│ │ └── generate-endpoint-slos.py # Generate Monaco configs from endpoints YAML
│ │
│ └── backstage/ # Backstage integration reference templates
│ ├── catalog-info.yaml.j2 # Backstage catalog descriptor
│ └── deployment-labels.yaml.j2 # Required k8s labels for DT auto-tagging
│
├── scripts/
│ ├── oac_utils.py # ADO REST client + Jinja2 render utilities
│ ├── bootstrap.py # Initial scaffold pipeline script
│ ├── propagate.py # Template update propagation script
│ └── drift_detector.py # Drift detection CronJob script (Monaco approach)
│
├── pipelines/
│ ├── bootstrap-pipeline.yaml # Manual — scaffolds OaC into all repos
│ ├── propagation-pipeline.yaml # Auto — pushes template updates on main push
│ └── oac-pr-validation.yaml # Per-app PR gate (YAML lint, Monaco dry-run,
│ # DDU estimate, endpoint SLO sync check,
│ # SLO regression, secret scan)
│
├── manifests/
│ ├── argocd/
│ │ ├── monaco-cmp/ # Monaco CMP sidecar (Monaco approach only)
│ │ │ ├── plugin.yaml # CMP v2 plugin — discover/init/generate hooks
│ │ │ ├── cmp-configmap.yaml
│ │ │ ├── repo-server-patch.yaml # Adds Monaco sidecar to argocd-repo-server
│ │ │ ├── external-secrets.yaml # ESO ExternalSecrets for DT credentials
│ │ │ ├── sync-hook.yaml # PostSync Job — actual Monaco deploy
│ │ │ └── kustomization.yaml
│ │ ├── applicationset-oac.yaml # Monaco: matrix(ADO repos × dev/staging/perf/prod)
│ │ └── applicationset-oac-operator.yaml # Operator: syncs CRD manifests per env dir
│ ├── kyverno/
│ │ └── enforce-oac-gitops.yaml # Blocks direct kubectl apply on OaC resources
│ └── drift-detector/ # Monaco approach drift detection
│ ├── cronjob.yaml # Runs every 6h, compares manifest hashes
│ └── rbac.yaml
│
├── operator/ # Custom Kubernetes operator (Go)
│ ├── api/v1alpha1/
│ │ ├── types.go # DynatraceSLO, DynatraceAlert, DynatraceDashboard,
│ │ │ # DynatraceNotification type definitions
│ │ ├── groupversion_info.go # Scheme registration
│ │ └── zz_generated.deepcopy.go # Generated DeepCopy methods
│ ├── controllers/
│ │ ├── dynatraceslo_controller.go # SLO reconciler (entity resolution, finalizer, drift)
│ │ ├── dynatracealert_controller.go # Alert reconciler (sloRef cross-reference)
│ │ ├── dynatracedashboard_controller.go # Dashboard reconciler (template engine, SLO watch)
│ │ └── dynatracenotification_controller.go
│ ├── internal/dynatrace/
│ │ ├── client.go # Typed DT REST API client
│ │ └── dashboard_templates.go # service-overview, slo-report, endpoint-detail
│ ├── config/
│ │ ├── crd/ # CRD YAML manifests — install with kubectl apply -k
│ │ │ ├── dynatraceslos.yaml
│ │ │ ├── dynatracealerts.yaml
│ │ │ ├── dynatracedashboards.yaml
│ │ │ ├── dynatracenotifications.yaml
│ │ │ └── kustomization.yaml
│ │ ├── rbac/ # ServiceAccount, ClusterRole, ClusterRoleBinding
│ │ └── manager/deployment.yaml # 2-replica HA deployment
│ ├── examples/
│ │ ├── payments-api.yaml # Full example: SLOs + alerts + dashboard + notification
│ │ └── dashboard-templates.yaml # All three dashboard templates side-by-side
│ ├── main.go # Manager setup, controller registration
│ └── go.mod
│
├── crossplane/ # Crossplane integration
│ ├── provider-terraform/
│ │ ├── provider.yaml # Installs provider-terraform from Upbound registry
│ │ └── workspace-platform.yaml # Workspace CR wrapping terraform/platform-resources/
│ ├── provider/
│ │ ├── PROVIDER_BUILD.md # Guide: build provider-dynatrace with upjet
│ │ └── provider-configs.yaml # ProviderConfig per environment (dev/staging/perf/prod)
│ ├── xrds/
│ │ └── service-observability-xrd.yaml # ServiceObservability XRD (team-facing API)
│ ├── compositions/
│ │ └── service-observability-composition.yaml # Expands claim → SLO + alerts + synthetic
│ └── claims/
│ └── payments-api-prod.yaml # Example ServiceObservabilityClaim
│
├── terraform/
│ ├── ado-variable-group/main.tf # ADO variable group + pipeline PAT
│ ├── dynatrace-tokens/main.tf # DT API tokens per env → Vault
│ └── platform-resources/
│ ├── main.tf # Provider config
│ ├── variables.tf # environments variable (dev/staging/perf/prod)
│ ├── alerting_variables.tf # notifications variable
│ ├── management_zones.tf # One MZ per environment (env:dev … env:prod)
│ ├── auto_tags.tf # Auto-tagging from Backstage k8s labels
│ ├── alerting_profiles.tf # One alerting profile per environment
│ ├── alerting_notifications.tf # Slack, MS Teams, PagerDuty, Splunk On-Call
│ ├── request_attributes.tf # Custom request attributes (HTTP headers + OTel)
│ ├── span_attributes.tf # OTel allow-list, masking, capture rules
│ ├── outputs.tf # MZ IDs, alerting profile IDs, notification IDs
│ └── terraform.tfvars.example
│
└── docs/ # Detailed implementation guides
├── README.md # Index + which approach to choose
├── operator/
│ ├── README.md # Architecture, reconciliation loop, HA
│ ├── getting-started.md # Install → first SLO → GitOps
│ ├── crds-reference.md # Full API reference for all 4 CRDs
│ ├── dashboard-templates.md # Template layouts, tile reference, adding templates
│ └── development.md # Build, test, extend the operator
└── crossplane/
├── README.md # Architecture, Path A vs B decision guide
├── getting-started.md # Install Crossplane → provider → first Claim
├── provider-terraform.md # Workspace anatomy, state, outputs, limitations
├── native-provider.md # upjet build guide, timeline, maintenance
└── compositions.md # XRD spec, Composition patches, extending
Applied once by the SRE team. Creates the shared Dynatrace infrastructure that all application-level configs (Monaco or operator) depend on.
One zone per environment: env:dev, env:staging, env:perf, env:prod.
Primary rule matches the environment:<label> auto-tag; namespace CONTAINS rules
provide coverage before auto-tags propagate.
| k8s label | Dynatrace tag | Backstage source |
|---|---|---|
| app.kubernetes.io/name | service:<name> | metadata.name |
| app.kubernetes.io/part-of | system:<name> | spec.system |
| app.kubernetes.io/component | component:<type> | spec.type |
| team | team:<name> | spec.owner |
| environment | environment:<env> | deployment convention |
| backstage.io/kubernetes-id | backstage-id:<id> | metadata.name |
| domain | domain:<name> | metadata.labels.domain |
| tier | tier:<name> | metadata.labels.tier |
| pod namespace (built-in) | k8s.namespace.name:<ns> | namespace convention |

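To make the mapping above concrete, the following Python sketch applies it to one pod's label set. It only illustrates the table: the real translation is performed by the Dynatrace auto-tagging rules, and `LABEL_TO_TAG_KEY` and `labels_to_tags` are hypothetical names that do not exist in this repository.

```python
# Illustrative only: mirrors the label → tag table, not code from this repo.
LABEL_TO_TAG_KEY = {
    "app.kubernetes.io/name": "service",
    "app.kubernetes.io/part-of": "system",
    "app.kubernetes.io/component": "component",
    "team": "team",
    "environment": "environment",
    "backstage.io/kubernetes-id": "backstage-id",
    "domain": "domain",
    "tier": "tier",
}


def labels_to_tags(labels: dict[str, str]) -> list[str]:
    """Return the `key:value` tags implied by a pod's labels, sorted."""
    return sorted(
        f"{LABEL_TO_TAG_KEY[label]}:{value}"
        for label, value in labels.items()
        if label in LABEL_TO_TAG_KEY  # labels outside the table yield no tag
    )


pod_labels = {
    "app.kubernetes.io/name": "payments-api",
    "backstage.io/kubernetes-id": "payments-api",
    "environment": "prod",
    "team": "platform",
    "pod-template-hash": "5d8f7c9b4",  # not in the table, so ignored
}
print(labels_to_tags(pod_labels))
# → ['backstage-id:payments-api', 'environment:prod', 'service:payments-api', 'team:platform']
```

A pod missing the `environment` label would produce no `environment:<env>` tag, which is why the management zones also carry namespace-based rules.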
| Environment | Severities routed | Delay |
|---|---|---|
| dev | AVAILABILITY, ERROR, PERFORMANCE, CUSTOM | 0 min |
| staging | All above + MONITORING_UNAVAILABLE | 0 min |
| perf | All above + RESOURCE_CONTENTION | 0 min |
| prod | All severities | 0 min (AVAILABILITY/ERROR); 5 min (PERFORMANCE) |
| Channel | Dev | Staging | Perf | Prod |
|---|---|---|---|---|
| Slack | #alerts-dev | #alerts-staging | #alerts-perf | #alerts-prod |
| MS Teams | Dev Alerts | Staging Alerts | Perf Alerts | Prod Alerts |
| PagerDuty | — | — | — | ✓ prod-p1 policy |
| Splunk On-Call | — | — | — | ✓ prod routing key |

| Attribute | Header → Span key | Purpose |
|---|---|---|
| Team | X-Backstage-Team → team | Route alerts, filter dashboards |
| Service Name | X-Backstage-Service → service.name | Service-level filtering |
| Environment | X-Backstage-Env → deployment.environment | Cross-MZ querying |
| Domain | X-Backstage-Domain → domain | Business domain grouping |
| System | X-Backstage-System → system | Backstage System grouping |
| Correlation ID | X-Correlation-ID → correlation.id | Distributed trace stitching |
| Tenant ID | X-Tenant-ID → tenant.id | Multi-tenant SLO splitting |
| Feature Flag | X-Feature-Flag → feature.flag | Incident ↔ flag correlation |
| HTTP Status Class | Derived from response code | Split error rate by 2xx/4xx/5xx |
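The last row derives a class from the response code instead of reading a header. A minimal sketch of that derivation, and of splitting a batch of responses by class, might look as follows; `http_status_class` is a hypothetical helper used for illustration, not the Dynatrace implementation.

```python
from collections import Counter


def http_status_class(status_code: int) -> str:
    """Map an HTTP response code to its class label, e.g. 503 -> '5xx'."""
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    return f"{status_code // 100}xx"


# Split a sample of response codes by class, as an error-rate chart would.
responses = [200, 201, 204, 404, 500, 503, 200]
by_class = Counter(http_status_class(code) for code in responses)
print(dict(by_class))
# → {'2xx': 4, '4xx': 1, '5xx': 2}
```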
26 OTel keys indexed via dynatrace_attribute_allow_list — queryable in DQL, Notebooks, and Davis AI.
tenant.id masked via dynatrace_attribute_masking.
Four span capture rules: always keep error spans + team-labelled spans; ignore health probes + Istio internal spans.
cd terraform/platform-resources
cp terraform.tfvars.example terraform.tfvars # fill in dt_url, dt_api_token, notifications block
terraform init && terraform plan && terraform apply
# Capture IDs needed by the app-layer configs
terraform output -json alerting_profile_ids
# → {"dev":"abc-123","staging":"def-456","perf":"ghi-789","prod":"jkl-000"}
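When the IDs are consumed by a script instead of being copied by hand, the JSON printed by `terraform output -json` can be parsed directly. The sketch below assumes the payload shape shown in the comment above; `profile_id_for` is a hypothetical helper.

```python
import json

# Sample payload copied from the example output above. In a pipeline this
# would be the stdout of `terraform output -json alerting_profile_ids`.
raw = '{"dev":"abc-123","staging":"def-456","perf":"ghi-789","prod":"jkl-000"}'


def profile_id_for(env: str, terraform_json: str) -> str:
    """Return the alerting profile ID for one environment."""
    ids = json.loads(terraform_json)
    if env not in ids:
        raise KeyError(f"no alerting profile for {env!r}; known: {sorted(ids)}")
    return ids[env]


print(profile_id_for("prod", raw))
# → jkl-000
```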
- Add required k8s labels to your Deployment (from scaffold/backstage/deployment-labels.yaml.j2):
  labels:
    app.kubernetes.io/name: payments-api
    app.kubernetes.io/part-of: checkout-platform
    backstage.io/kubernetes-id: payments-api
    environment: prod
    team: platform
    domain: checkout
- Run the bootstrap pipeline in ADO (dryRun: false, repoFilter: <repo-name>). This renders Monaco Jinja2 templates into an observability/ PR in the app repo.
- Review and merge the PR. The oac-pr-validation pipeline gates:
  - YAML syntax lint
  - Monaco static validation + staging dry-run
  - DDU estimate (blocks > 5,000 DDU/month)
  - Endpoint SLO generated files in sync
  - SLO regression check (blocks target drop > 0.1%)
  - Secret scan (blocks hardcoded DT tokens or tenant URLs)
- Argo CD detects observability/manifest.yaml and deploys to dev → staging → perf (automated) → prod (manual approval).
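The secret-scan gate can be pictured as a pair of regular expressions run over the rendered files. The sketch below is an assumption about how such a check might look, not the pipeline's actual implementation. The token pattern follows the documented Dynatrace token layout (prefix, 24-character public part, 64-character secret part); the tenant URL pattern and `find_secrets` are illustrative.

```python
import re

# Dynatrace tokens: dt0<letter><2 digits>.<24 chars>.<64 chars>
DT_TOKEN = re.compile(r"dt0[a-zA-Z][0-9]{2}\.[A-Z0-9]{24}\.[A-Z0-9]{64}")
# SaaS tenant URLs such as https://abc12345.live.dynatrace.com (assumed shape)
DT_TENANT_URL = re.compile(r"https://[a-z0-9]+\.(?:live|apps)\.dynatrace\.com")


def find_secrets(text: str) -> list[str]:
    """Return one finding per line that holds a token or a tenant URL."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if DT_TOKEN.search(line):
            findings.append(f"line {lineno}: hardcoded Dynatrace token")
        if DT_TENANT_URL.search(line):
            findings.append(f"line {lineno}: hardcoded tenant URL")
    return findings


sample = "\n".join([
    "name: payments-api",
    "url: https://abc12345.live.dynatrace.com/api/v2/slo",
    "token: dt0c01." + "A" * 24 + "." + "B" * 64,  # fake token for the demo
])
for finding in find_secrets(sample):
    print(finding)
# → line 2: hardcoded tenant URL
# → line 3: hardcoded Dynatrace token
```

A gate built this way would exit non-zero whenever `find_secrets` returns a non-empty list.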
Teams list critical endpoints in observability/slos/endpoints/critical-endpoints.yaml:
service: payments-api
endpoints:
  - id: post-payments
    method: POST
    path: /api/v1/payments
    latency_ms: 400
    slo_target: 99.9
  - id: get-payment-status
    method: GET
    path: /api/v1/payments/{id}/status
    latency_ms: 200
    slo_target: 99.95

Run the generator and commit the output:
python scripts/generate-endpoint-slos.py \
  --endpoints observability/slos/endpoints/critical-endpoints.yaml \
  --env-file observability/environments/prod.yaml
git add observability/slos/endpoints/generated/
git commit -m "chore(oac): add endpoint SLOs for payments-api"

The PR validation pipeline fails if critical-endpoints.yaml is edited but generated/ is not regenerated.
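One way to picture the sync check: compute the set of files the generator should emit for the current endpoint list and compare it with what is committed under generated/. The sketch below assumes one availability and one latency SLO per endpoint (matching the two endpoint templates in the scaffold). The file naming scheme and both function names are hypothetical, and the YAML is inlined as a dict to avoid a parser dependency.

```python
# Parsed form of critical-endpoints.yaml (inlined for the example).
config = {
    "service": "payments-api",
    "endpoints": [
        {"id": "post-payments", "method": "POST", "path": "/api/v1/payments",
         "latency_ms": 400, "slo_target": 99.9},
        {"id": "get-payment-status", "method": "GET",
         "path": "/api/v1/payments/{id}/status",
         "latency_ms": 200, "slo_target": 99.95},
    ],
}


def expected_files(cfg: dict) -> set[str]:
    """File names the generator would emit (naming scheme is hypothetical)."""
    return {
        f"{cfg['service']}-{ep['id']}-{kind}-slo.json"
        for ep in cfg["endpoints"]
        for kind in ("availability", "latency")
    }


def out_of_sync(cfg: dict, generated: set[str]) -> dict[str, set[str]]:
    """Compare the expected output with what is committed under generated/."""
    expected = expected_files(cfg)
    return {"missing": expected - generated, "stale": generated - expected}


# Only the first endpoint was regenerated before committing.
committed = {
    "payments-api-post-payments-availability-slo.json",
    "payments-api-post-payments-latency-slo.json",
}
drift = out_of_sync(config, committed)
print(sorted(drift["missing"]))
# → ['payments-api-get-payment-status-availability-slo.json', 'payments-api-get-payment-status-latency-slo.json']
```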
Edit observability/environments/prod.yaml in the app repo:
my-service:
  SLOTarget: "99.95"

Open a PR. The SLO regression gate blocks any decrease greater than 0.1%. After merge, the Argo CD PostSync Job applies the change to Dynatrace.
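The regression rule can be sketched as a comparison between the target on the base branch and the target in the PR. This is an illustration, not the contents of slo-regression-check.py; it assumes the 0.1% threshold means 0.1 percentage points, and it uses `Decimal` so that a drop of exactly 0.1 is not misjudged by binary floating point.

```python
from decimal import Decimal

MAX_DROP = Decimal("0.1")  # percentage points (assumed reading of "0.1%")


def is_regression(old_target: str, new_target: str) -> bool:
    """True when the new target is more than MAX_DROP below the old one."""
    drop = Decimal(old_target) - Decimal(new_target)
    return drop > MAX_DROP


print(is_regression("99.9", "99.95"))  # target raised
# → False
print(is_regression("99.9", "99.8"))   # exactly 0.1 lower, still allowed
# → False
print(is_regression("99.9", "99.5"))   # 0.4 lower, blocked
# → True
```

With plain floats, `99.9 - 99.8` evaluates to slightly more than `0.1`, so the second case would be blocked by accident; parsing the quoted YAML strings as decimals avoids that.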
- Update the webhook / API key in Vault at secret/dynatrace/notifications.
- Run terraform apply in terraform/platform-resources/.
- No Monaco changes needed.
The operator delivers a clean, domain-specific Kubernetes API. No Monaco YAML, no CMP sidecar, no PostSync Job, no drift CronJob — the controller handles all of it.
# Install all four CRDs
kubectl apply -k operator/config/crd/
# Deploy the operator (2 replicas, leader election)
kubectl apply -f operator/config/rbac/serviceaccount.yaml
kubectl apply -f operator/config/rbac/role.yaml
kubectl apply -f operator/config/manager/deployment.yaml
# Verify
kubectl rollout status deployment dynatrace-operator -n sre-tools

| CRD | Short name | What it creates in Dynatrace |
|---|---|---|
| DynatraceSLO | dtslo | SLO v2: availability or latency p99 |
| DynatraceAlert | dtalert | Metric event anomaly detection rule |
| DynatraceDashboard | dtdash | Dashboard from a named template |
| DynatraceNotification | dtnotif | Notification integration |

Run the bootstrap pipeline with the observability-operator/ scaffold. The PR
in the app repo contains per-env CRD manifests instead of Monaco YAML:
observability/
├── dev/ slos.yaml + alerts.yaml
├── staging/ slos.yaml + alerts.yaml
├── perf/ slos.yaml
└── prod/ slos.yaml + alerts.yaml + dashboard.yaml
Argo CD syncs these as standard Kubernetes manifests (no Monaco plugin needed). The operator controller reconciles each CRD to Dynatrace every 5 minutes.
apiVersion: oac.YOUR_ORG.io/v1alpha1
kind: DynatraceSLO
metadata:
  name: payments-api-availability
  namespace: payments-api
spec:
  environment: prod
  serviceSelector:
    backstageId: payments-api   # controller resolves to DT SERVICE entity automatically
  type: availability
  target: 99.9
  window: "-1w"

kubectl get dtslo -A
# NAMESPACE      NAME                        ENV    TYPE           TARGET   CURRENT   SYNCED
# payments-api   payments-api-availability   prod   availability   99.9     99.94     True

Three built-in templates, selected via spec.template:
| Template | What it shows |
|---|---|
| service-overview | SLO tiles + request/error rates + latency percentiles + burn rate |
| slo-report | SLO compliance history + error budget remaining + 1h/6h burn rates |
| endpoint-detail | Per-endpoint throughput, error rate, p50/p99 latency, slowest table |

The controller resolves spec.sloRefs to real Dynatrace SLO IDs from
status.dynatraceId of referenced DynatraceSLO objects before building
the dashboard JSON.
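The resolution step can be sketched as a lookup from SLO object names to their `status.dynatraceId`. The real controller is written in Go; this Python version only illustrates the logic, and `resolve_slo_refs` and the sample IDs are hypothetical.

```python
def resolve_slo_refs(
    slo_refs: list[str], slos: dict[str, dict]
) -> tuple[list[str], list[str]]:
    """Split references into resolved Dynatrace IDs and names still pending."""
    resolved, pending = [], []
    for name in slo_refs:
        dynatrace_id = slos.get(name, {}).get("status", {}).get("dynatraceId")
        if dynatrace_id:
            resolved.append(dynatrace_id)
        else:
            pending.append(name)  # SLO missing or not yet synced
    return resolved, pending


# DynatraceSLO objects keyed by name, reduced to the fields used here.
cluster_slos = {
    "payments-api-availability": {"status": {"dynatraceId": "slo-1a2b"}},
    "payments-api-latency": {"status": {}},  # created, not yet reconciled
}
ids, waiting = resolve_slo_refs(
    ["payments-api-availability", "payments-api-latency"], cluster_slos
)
print(ids, waiting)
# → ['slo-1a2b'] ['payments-api-latency']
```

A non-empty pending list corresponds to a dashboard that stays unsynced until the referenced SLOs report an ID.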
See docs/operator/ for full architecture, CRD reference, and development guide.
Wraps terraform/platform-resources/ as a Workspace CR for continuous reconciliation:
kubectl apply -f crossplane/provider-terraform/provider.yaml
kubectl apply -f crossplane/provider-terraform/workspace-platform.yaml

Outputs (MZ IDs, alerting profile IDs) are written to the dynatrace-platform-outputs Secret.
Auto-generates a Crossplane provider from the dynatrace-oss/dynatrace Terraform schema.
Once built, installs CRDs for every DT resource type and a dedicated controller per CRD.
apiVersion: oac.YOUR_ORG.io/v1alpha1
kind: ServiceObservabilityClaim
metadata:
  name: payments-api-prod
  namespace: payments-api
spec:
  serviceName: payments-api
  environment: prod
  team: platform
  slo:
    availabilityTarget: 99.9
    latencyTargetMs: 300
  alerts:
    errorRateThreshold: 0.5
    burnRateFast: 14
    burnRateSlow: 6
  synthetic:
    url: "https://payments-api.prod.internal/health"
    frequencyMinutes: 1

The Composition expands this into: availability SLO + latency SLO + error rate alert + fast burn alert + slow burn alert + synthetic monitor.
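The fan-out from one claim to six resources can be sketched as a pure function over the claim spec. The resource names and output fields below are hypothetical; the actual Composition does this with Crossplane patches, not Python.

```python
def expand_claim(spec: dict) -> list[dict]:
    """List the six resources the Composition is described as creating."""
    slo, alerts, synthetic = spec["slo"], spec["alerts"], spec["synthetic"]
    prefix = f"{spec['serviceName']}-{spec['environment']}"
    return [
        {"name": f"{prefix}-availability-slo", "target": slo["availabilityTarget"]},
        {"name": f"{prefix}-latency-slo", "thresholdMs": slo["latencyTargetMs"]},
        {"name": f"{prefix}-error-rate-alert", "threshold": alerts["errorRateThreshold"]},
        {"name": f"{prefix}-fast-burn-alert", "burnRate": alerts["burnRateFast"]},
        {"name": f"{prefix}-slow-burn-alert", "burnRate": alerts["burnRateSlow"]},
        {"name": f"{prefix}-synthetic-monitor", "url": synthetic["url"],
         "frequencyMinutes": synthetic["frequencyMinutes"]},
    ]


claim_spec = {
    "serviceName": "payments-api",
    "environment": "prod",
    "team": "platform",
    "slo": {"availabilityTarget": 99.9, "latencyTargetMs": 300},
    "alerts": {"errorRateThreshold": 0.5, "burnRateFast": 14, "burnRateSlow": 6},
    "synthetic": {"url": "https://payments-api.prod.internal/health",
                  "frequencyMinutes": 1},
}
for resource in expand_claim(claim_spec):
    print(resource["name"])
```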
See docs/crossplane/ for full architecture, provider build guide, and XRD/Composition reference.
touch .no-oac && git add .no-oac && git commit -m "chore: opt out of OaC scaffold" && git push

Bootstrap and propagation scripts skip repos with this file. Existing DT configs are not deleted.
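The skip rule can be sketched as a small predicate. The real scripts inspect repositories through the ADO REST API; this illustration assumes a local checkout for simplicity, and `skip_reason` is a hypothetical helper. The second rule (an existing observability/manifest.yaml) reflects the bootstrap re-run behaviour listed in the troubleshooting table.

```python
import tempfile
from pathlib import Path
from typing import Optional


def skip_reason(repo_root: Path) -> Optional[str]:
    """Return why a repo would be skipped, or None if it should be scaffolded."""
    if (repo_root / ".no-oac").exists():
        return "opted out via .no-oac"
    if (repo_root / "observability" / "manifest.yaml").exists():
        return "already scaffolded"
    return None


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    print(skip_reason(repo))   # fresh repo
    (repo / ".no-oac").touch()
    print(skip_reason(repo))   # after opting out
# → None
# → opted out via .no-oac
```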
Variable group oac-bootstrap-secrets — required PAT scopes:
| Scope | Reason |
|---|---|
| Code (Read & Write) | Push scaffold branches |
| Pull Request (Read & Write) | Open PRs |
| Identity (Read) | Resolve reviewer email → ADO identity |

cd terraform/ado-variable-group
terraform init && terraform apply \
-var="ado_org_service_url=https://dev.azure.com/YOUR_ORG" \
-var="ado_project=YOUR_PROJECT" \
-var="ado_pat=<admin-pat>" \
-var="pipeline_pat=<pipeline-pat>" \
-var="pr_reviewer_emails=alice@example.com,bob@example.com"

| Symptom | Cause | Fix |
|---|---|---|
| Bootstrap skips all repos | observability/manifest.yaml already exists | Normal on re-run. Use --repo-filter. |
| Monaco dry-run fails HTTP 401 | DT token expired or wrong scopes | terraform apply in dynatrace-tokens/. ESO refreshes within 1h. |
| Argo CD OutOfSync (Monaco) | CMP sidecar init hook failed | kubectl logs -n argocd deploy/argocd-repo-server -c monaco-cmp |
| Kyverno blocks ConfigMap | Direct kubectl apply on monaco-oac-state-* | Only Argo CD sync creates these. Run argocd app sync <name>. |
| Drift detector pages every 6h | PostSync Job failing | kubectl logs -n sre-tools job/monaco-deploy-<app>-<env> |
| dtslo SYNCED=False, reason=EntityResolution | backstageId tag not found in DT | Verify pod has label backstage.io/kubernetes-id and auto-tag has run |
| dtslo SYNCED=False, reason=DynatraceAPI (401) | DT token expired | terraform apply in dynatrace-tokens/ |
| Dashboard stays SYNCED=False | Referenced SLOs not yet synced | Controller retries automatically when SLO status updates |
| Operator pod CrashLoopBackOff | Can't read credential Secrets | kubectl get externalsecret -n sre-tools |
| Crossplane Workspace READY=False | Terraform init failed | kubectl describe workspace dynatrace-platform-resources -n sre-tools → Events |
| Endpoint SLO CI gate fails | generated/ not regenerated after editing critical-endpoints.yaml | Run python scripts/generate-endpoint-slos.py and commit generated/ |
| Management zone empty | environment k8s label missing on pods | kubectl get pods -n <ns> --show-labels |
| Request attributes empty in traces | Istio stripping X-Backstage-* headers | Check EnvoyFilter; OTel span attributes are the fallback |
| Span attributes not visible in DQL | OTel key not in allow-list | Add to local.span_allow_list in span_attributes.tf and re-apply |

PLATFORM LAYER — terraform/platform-resources/ (applied once by SRE)
┌───────────────────────────────────────────────────────────────────────┐
│ Management zones (env:dev/staging/perf/prod)                          │
│ Auto-tags (Backstage k8s labels → DT contextless tags)                │
│ Alerting profiles (one per env, scoped to MZ)                         │
│ Notifications (Slack / MS Teams / PagerDuty / Splunk On-Call)         │
│ Request attributes (HTTP headers + OTel span fallback)                │
│ Span attribute allow-list + masking + capture rules                   │
└───────────────────────────┬───────────────────────────────────────────┘
                            │ IDs referenced by app layer
              ┌─────────────┼─────────────┐
              │             │             │
    ┌─────────▼───────┐ ┌───▼────────┐ ┌──▼───────────────────┐
    │ MONACO APPROACH │ │ OPERATOR   │ │ CROSSPLANE APPROACH  │
    │                 │ │            │ │                      │
    │ observability/  │ │ obs-oper/  │ │ ServiceObservability │
    │ (Jinja2→Monaco) │ │ (CRD YML)  │ │ Claim                │
    │        │        │ │    │       │ │        │             │
    │ Argo CD CMP     │ │ Argo CD    │ │ Argo CD (std sync)   │
    │ sidecar         │ │ std sync   │ │        │             │
    │        │        │ │    │       │ │ Composition engine   │
    │ Monaco deploy   │ │ operator   │ │        │             │
    │ PostSync Job    │ │ controller │ │ provider controller  │
    │ Drift CronJob   │ │ every 5m   │ │ every 10m            │
    └────────┬────────┘ └────┬───────┘ └────────┬─────────────┘
             └───────────────┴──────────────────┘
                             │
                    Dynatrace REST API
                SLOs / Alerts / Dashboards
                  Synthetic / Log metrics
catalog-info.yaml (Backstage)
↓ teams mirror as Kubernetes pod labels
app.kubernetes.io/name, team, environment, domain, backstage.io/kubernetes-id
↓ Dynatrace OneAgent reads pod labels automatically
↓ dynatrace_autotag_v2 translates labels to DT contextless tags
team:platform, environment:prod, service:payments-api, domain:checkout
↓ management zone SELECTOR rule matches environment:prod
env:prod MZ scopes SLOs, alerts, dashboards to prod traffic only
↓ alerting profile routes prod alerts → PagerDuty + #alerts-prod Slack
↓ request attributes enrich every trace with team/domain/tenant context
↓ span attribute allow-list makes OTel keys queryable in DQL / Davis AI
↓ operator backstageId resolution: GET /api/v2/entities?tag(backstage-id:payments-api)
→ resolves to specific SERVICE entity ID for per-service SLO scoping
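The lookup in the last step is written in shorthand. Against the Dynatrace Entities API v2 the tag filter is passed through the `entitySelector` query parameter, so the request URL might be built as in the sketch below. `entity_lookup_url` and the base URL are hypothetical, and the operator's own client may construct the request differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def entity_lookup_url(base_url: str, backstage_id: str) -> str:
    """Build an Entities API v2 URL selecting SERVICE entities by tag."""
    selector = f'type("SERVICE"),tag("backstage-id:{backstage_id}")'
    query = urlencode({"entitySelector": selector})  # percent-encodes the selector
    return f"{base_url.rstrip('/')}/api/v2/entities?{query}"


url = entity_lookup_url("https://dt.example.invalid/", "payments-api")
print(url)

# Round-trip check: decoding the query yields the selector unchanged.
print(parse_qs(urlsplit(url).query)["entitySelector"][0])
# → type("SERVICE"),tag("backstage-id:payments-api")
```

If the selector matches no entity, the SLO cannot be scoped to a service, which corresponds to the EntityResolution failure listed in the troubleshooting table.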
Full implementation guides are in docs/:
- Operator overview — architecture, reconciliation loop, finalizers, HA
- Operator getting started — install CRDs, deploy, first SLO
- Operator CRD reference — full spec/status for all 4 CRDs
- Dashboard templates — tile layouts, adding new templates
- Operator development — build, test, extend
- Crossplane overview — Path A vs B, component map
- Crossplane getting started — install, first Claim
- provider-terraform guide — Workspace, state, outputs
- Native provider guide — upjet build, timeline
- XRD & Composition reference — full spec, patch mechanics