AWS DevOps Agent × 5G Core — Cross-Layer Investigation Demo

A hands-on workshop that demonstrates AWS DevOps Agent investigating cross-layer incidents on Amazon EKS. Using a simulated 5G Core network as the application layer, four pre-built failure scenarios show the agent tracing from Kubernetes pod failures through AWS infrastructure changes to CloudTrail audit logs, identifying who broke what, when, and from where, in roughly 2 to 8 minutes per scenario.

The Problem: Incidents Don't Respect Team Boundaries

Modern cloud-native applications — especially telco workloads running on Kubernetes — fail in ways that span multiple operational layers. A single Security Group change can cascade from AWS networking → ElastiCache connectivity → Kubernetes pod health → application-level service discovery → subscriber-facing outage. Traditional monitoring tools show you what broke. Figuring out why it broke and who caused it requires an engineer to manually correlate pod events, application logs, CloudWatch metrics, VPC configuration, and CloudTrail audit trails — often across different consoles, different teams, and different areas of expertise.

For telco networks, this problem is amplified. The blast radius of infrastructure changes isn't measured in failed HTTP requests; it's measured in subscribers losing service. A Network Repository Function (NRF) losing its Redis backend means every Network Function in the 5G Core loses service discovery simultaneously. Millions of subscribers can't register, hand over between cells, or establish data sessions, while the NOC team hunts through logs trying to figure out which layer broke first.

The Solution: AWS DevOps Agent

AWS DevOps Agent is an AI-powered operations assistant that investigates incidents across your entire AWS environment. You describe a symptom in plain language — "The catalog service is returning errors" or "AMF pods are stuck in Pending" — and it autonomously traces the full causal chain from application to infrastructure to human action.

What it connects to:

- Kubernetes API: pod status, deployment state, events (OOMKilled, FailedScheduling, ImagePullBackOff), HPA configuration, node conditions
- CloudWatch Logs: application logs, EKS control plane audit logs (who ran what kubectl commands)
- CloudWatch Metrics: Container Insights (CPU, memory, network per pod/node), custom metrics, ALB metrics
- CloudTrail: every AWS API call with principal, IP address, timestamp, and request parameters
- Topology discovery: resource relationships (pod → node → ASG → EC2 instance → Security Group → ElastiCache)
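
To make the audit-log source in the list above concrete, here is a minimal sketch of pulling the "who ran what" fields out of a single Kubernetes audit event, the JSON record that EKS control plane audit logging writes to CloudWatch Logs. The sample record, the deployment name `amf`, and the helper name `summarize_audit_event` are assumptions for illustration; the field names (`verb`, `user.username`, `sourceIPs`, `userAgent`, `objectRef`, `requestReceivedTimestamp`) follow the Kubernetes `audit.k8s.io/v1` event schema. It does not describe how the agent itself is implemented.

```python
import json

# Hypothetical audit event, trimmed to the fields used below.
SAMPLE_EVENT = json.dumps({
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "verb": "patch",
    "user": {"username": "example-user"},
    "sourceIPs": ["203.0.113.10"],
    "userAgent": "kubectl/v1.30.0 (linux/amd64)",
    "objectRef": {"resource": "deployments", "namespace": "demo-5g", "name": "amf"},
    "requestReceivedTimestamp": "2024-01-01T12:00:00.000000Z",
})


def summarize_audit_event(raw: str) -> dict:
    """Extract who did what, to which object, from where, and when."""
    event = json.loads(raw)
    ref = event.get("objectRef", {})
    return {
        "who": event.get("user", {}).get("username", "unknown"),
        "from_ip": (event.get("sourceIPs") or ["unknown"])[0],
        "client": event.get("userAgent", "unknown"),
        "action": event.get("verb", "unknown"),
        "target": f'{ref.get("namespace", "")}/{ref.get("resource", "")}/{ref.get("name", "")}',
        "when": event.get("requestReceivedTimestamp", "unknown"),
    }


if __name__ == "__main__":
    print(summarize_audit_event(SAMPLE_EVENT))
```

A `patch` on a Deployment is what a command such as `kubectl set image` produces, so filtering on `verb` and `objectRef.resource` is one plausible way to narrow the audit stream to deployment changes.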

How it investigates:

The agent reasons through the problem dynamically: checks pod status, reads logs for error patterns, follows dependency chains, inspects infrastructure configuration, and correlates with CloudTrail to identify the specific human action that caused the incident. It can also leverage saved skills and runbooks for domain-specific investigation patterns. It reports back with a complete timeline, root cause, and evidence.

What This Demo Does

This repository provides a lightweight set of 5G Core stub services on Amazon EKS — not a real 5G core, but Python microservices that speak correct 3GPP vocabulary and use real AWS dependencies (ElastiCache Redis, SQS). It includes four pre-built failure scenarios that showcase DevOps Agent's cross-layer investigation capabilities. Each scenario:

1. Injects a real infrastructure problem (one shell command)
2. Triggers a CloudWatch alarm (NOC-style alerting)
3. Prompts the DevOps Agent to investigate (paste a sentence)
4. Demonstrates the agent tracing from symptom to root cause across layers
5. Restores to healthy state (one shell command)

The goal is to prove that DevOps Agent can identify root causes that would normally require coordination across platform, networking, and application teams — in minutes instead of hours.

Architecture

A 5G Core is the brain of a mobile network, built as cloud-native microservices. Each service is a Network Function (NF):

| NF | Role | What breaks when it's down |
| --- | --- | --- |
| NRF | Service registry (backed by Redis) | All NFs lose service discovery → total core outage |
| AMF | Subscriber registration & mobility | Devices can't attach to the network or hand over between cells |
| SMF | Data session management | No new internet/data connections |
| UPF | User data plane (packet forwarding) | Active sessions lose connectivity |
| PCF | Policy & QoS decisions | No quality-of-service enforcement |

The NFs are Python stubs that speak correct 3GPP vocabulary and use real AWS dependencies (ElastiCache Redis, SQS). They are not a production 5G core — but the infrastructure, failure modes, cascading dependencies, logs, metrics, and CloudTrail events are all real.
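
The cascading dependencies described above can be sketched as a small graph walk. This is an illustrative model only: the `DEPENDS_ON` map encodes just the two relationships stated in this README (NRF is backed by Redis, and the other NFs rely on NRF for service discovery), and `blast_radius` is a hypothetical helper name, not something shipped in the repository.

```python
from collections import deque

# Direct dependencies taken from the table above: NRF is backed by Redis,
# and every other NF relies on NRF for service discovery.
DEPENDS_ON = {
    "NRF": ["Redis"],
    "AMF": ["NRF"],
    "SMF": ["NRF"],
    "UPF": ["NRF"],
    "PCF": ["NRF"],
}


def blast_radius(failed: str) -> list[str]:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for component, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, []):
            if component not in impacted:
                impacted.add(component)
                queue.append(component)
    return sorted(impacted)


if __name__ == "__main__":
    print(blast_radius("Redis"))  # ['AMF', 'NRF', 'PCF', 'SMF', 'UPF']
    print(blast_radius("UPF"))    # []
```

Losing Redis takes out NRF and, through it, every other NF, which is the total-outage case that Scenario 1 exercises.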

Demo Scenarios

Each scenario demonstrates a different cross-layer correlation — the agent starts at the symptom and works backward to the root cause:

| # | Scenario | Symptom | Root Cause | What the Agent Finds | Time |
| --- | --- | --- | --- | --- | --- |
| 1 | Security Group Change | NRF can't reach Redis: all NFs lose service discovery, subscribers can't register | VPC Security Group rule removed | Traces NRF→Redis connection failure through SG rule removal to CloudTrail; identifies who revoked the rule, when, and from what IP | ~2 min |
| 2 | ASG Capacity Ceiling | AMF pods stuck Pending during busy hour: new subscriber registrations queuing | Auto Scaling Group max capacity reached | Correlates FailedScheduling events with node CPU saturation and ASG at max; explains why Cluster Autoscaler can't provision nodes to handle busy hour traffic | ~4 min |
| 3 | Bad Deployment | AMF pods in ImagePullBackOff: active subscribers losing mobility management | `kubectl set image` with non-existent tag | Traces ImagePullBackOff to a bad image tag, finds the exact `kubectl set image` command in EKS audit logs; identifies the user, kubectl version, and source IP | ~3 min |
| 4 | HPA Scaling Storm | AMF replicas thrashing 2→7→3→6 during busy hour: intermittent registration failures as connections churn | HPA misconfigured: 15% target, 0s stabilization | Explains the feedback loop: target too low + no stabilization → rapid scale up/down → NRF connection pool churn → subscribers see intermittent 5G registration failures | ~8 min |
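
The feedback loop in Scenario 4 can be illustrated with a toy simulation of the documented Horizontal Pod Autoscaler rule, `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`. This is a simplified sketch, not the demo's actual HPA manifest: the load values, the replica bounds of 2 and 10, and the step-based stabilization window are assumptions, and the real controller's tolerance band and scaling policies are left out.

```python
import math
from collections import deque


def desired_replicas(current: int, utilization: float, target: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current * utilization / target)


def simulate(loads, target, window=1, start=2, lo=2, hi=10):
    """Replay `loads` (total CPU demand, in percent of one pod's request).

    `window` is the number of recent recommendations considered for
    scale-down stabilization; 1 means no stabilization.
    """
    replicas, recent, history = start, deque(maxlen=window), []
    for load in loads:
        utilization = load / replicas           # average CPU % per pod
        rec = desired_replicas(replicas, utilization, target)
        rec = max(lo, min(hi, rec))             # clamp to min/max replicas
        recent.append(rec)
        replicas = max(recent)                  # highest recent recommendation wins
        history.append(replicas)
    return history


if __name__ == "__main__":
    bursty = [100, 40, 95, 35]                    # hypothetical busy-hour load
    print(simulate(bursty, target=15))            # [7, 3, 7, 3]
    print(simulate(bursty, target=15, window=3))  # [7, 7, 7, 7]
    print(simulate(bursty, target=60))            # [2, 2, 2, 2]
```

With a 15% target and no stabilization, every load swing moves the replica count by several pods. Either a stabilization window or a more realistic target damps the oscillation in this model.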

Example: Scenario 1 Investigation

You inject the failure:

```
./scripts-5g/scenario-1-sg-change.sh inject
```

You give the agent a symptom (not a hint):

"The NRF service is returning errors. Investigate."

The agent autonomously:

1. Checks NRF pod status → Running (not crashed)
2. Reads NRF logs → ConnectionRefusedError to Redis endpoint
3. Verifies Redis is healthy → ElastiCache node is up
4. Inspects Security Group → port 6379 inbound rule is missing
5. Queries CloudTrail → finds RevokeSecurityGroupIngress API call
6. Reports: "User X removed the inbound rule allowing port 6379 from the pod CIDR at timestamp Y from IP Z"
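
The final two steps above amount to filtering CloudTrail records for a revoke call that covered the Redis port. The sketch below works on records shaped like the `Events` list returned by CloudTrail's LookupEvents API, where the full event is a JSON string under `CloudTrailEvent`. The sample data (user, IP address, group ID) is invented, the helper name `find_port_revocations` is hypothetical, and the `requestParameters` layout is a best-effort approximation that should be checked against a real event.

```python
import json

# Hypothetical records shaped like CloudTrail LookupEvents output.
# EventTime is shown as a string for brevity.
SAMPLE_EVENTS = [
    {
        "EventName": "DescribeSecurityGroups",
        "Username": "example-user",
        "EventTime": "2024-01-01T11:58:00Z",
        "CloudTrailEvent": json.dumps({"sourceIPAddress": "203.0.113.10"}),
    },
    {
        "EventName": "RevokeSecurityGroupIngress",
        "Username": "example-user",
        "EventTime": "2024-01-01T12:00:00Z",
        "CloudTrailEvent": json.dumps({
            "sourceIPAddress": "203.0.113.10",
            "requestParameters": {
                "groupId": "sg-0123456789abcdef0",
                "ipPermissions": {"items": [
                    {"ipProtocol": "tcp", "fromPort": 6379, "toPort": 6379},
                ]},
            },
        }),
    },
]


def find_port_revocations(events, port):
    """Return a one-line summary for each revoke call that covered `port`."""
    findings = []
    for event in events:
        if event["EventName"] != "RevokeSecurityGroupIngress":
            continue
        detail = json.loads(event["CloudTrailEvent"])
        params = detail.get("requestParameters") or {}
        rules = (params.get("ipPermissions") or {}).get("items", [])
        if any(r.get("fromPort", -1) <= port <= r.get("toPort", -1) for r in rules):
            findings.append(
                f'{event["Username"]} revoked ingress on port {port} '
                f'in {params.get("groupId", "unknown")} at {event["EventTime"]} '
                f'from {detail.get("sourceIPAddress", "unknown")}'
            )
    return findings


if __name__ == "__main__":
    for line in find_port_revocations(SAMPLE_EVENTS, 6379):
        print(line)
```

The output line has the same shape as the agent's report: who, which rule, when, and from which IP.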

Infrastructure Deployed

The infrastructure is provisioned with Terraform, and the Kubernetes workloads are deployed with a single deploy.sh script:

- VPC: 3 AZs, public/private subnets, single NAT gateway
- EKS Cluster (v1.30): system node group (2× t3.medium, tainted) + app node group (2× t3.medium, max 3)
- ElastiCache Redis: NRF service registry backend
- SQS Queue: async message processing
- Container Insights (enhanced observability): pod/node/cluster metrics
- CloudWatch Alarms: one per scenario + baseline alarms (NOC-style alerting)
- ALB Ingress Controller: internet-facing load balancer
- Cluster Autoscaler with IRSA: responds to pod scheduling pressure
- EKS Control Plane Logging: API, audit, authenticator, controller manager, scheduler
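
The app node group's ceiling of 3 nodes is what Scenario 2 runs into. As a rough sketch of the reasoning, the check below compares a node group's desired capacity with its maximum to explain why Pending pods cannot be helped by the Cluster Autoscaler. The `NodeGroup` class, the group name `app`, and the `diagnose` helper are hypothetical and only model the arithmetic, not the autoscaler's real scheduling simulation.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    desired: int   # current desired capacity of the Auto Scaling Group
    max_size: int  # configured maximum size

    @property
    def headroom(self) -> int:
        """Number of nodes that can still be added before hitting the max."""
        return max(self.max_size - self.desired, 0)


def diagnose(pending_pods: int, group: NodeGroup) -> str:
    """Explain whether scaling out can relieve the pending pods."""
    if pending_pods == 0:
        return "no pending pods"
    if group.headroom == 0:
        return (f"{pending_pods} pod(s) Pending: {group.name} is at max "
                f"({group.max_size}), so no node can be added")
    return (f"{pending_pods} pod(s) Pending: {group.name} can still add "
            f"{group.headroom} node(s)")


if __name__ == "__main__":
    print(diagnose(4, NodeGroup("app", desired=2, max_size=3)))
    print(diagnose(4, NodeGroup("app", desired=3, max_size=3)))
```

Once desired capacity equals the maximum, FailedScheduling events keep accumulating regardless of how much scheduling pressure the autoscaler observes.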

Cost: ~$5/hour while running. The EKS control plane + Redis cost ~$0.12/hour even with nodes at zero. Destroy with terraform destroy when done.

Prerequisites

| Tool | Purpose |
| --- | --- |
| AWS CLI v2 | Infrastructure provisioning, scenario scripts |
| Terraform ≥ 1.5 | Infrastructure as code |
| kubectl | Kubernetes deployment and verification |
| Helm 3 | Container Insights + ALB controller |
| jq | JSON parsing in scripts |
| AWS Account | With admin access (EKS, ElastiCache, VPC, IAM) |

Run the automated check:

```
./prerequisites.sh
```

Quick Start

Option 1: Workshop Event (CloudFormation — zero local tooling required)

Use this path if you're attending a guided workshop with AWS-provided accounts, or if you want a one-click deployment without installing Terraform/kubectl locally. Everything runs on an EC2 workstation provisioned by the stack.

Cost: ~$6/hour (EKS + Redis + EC2 workstation). Destroy the stack when done.

1. Launch the stack in your AWS account:
   - Open CloudFormation → Create Stack → Upload a template file
   - Upload cfn/workshop-infra.yaml, or use the S3 URL provided by your workshop facilitator
   - Region: us-east-1 (recommended)
   - Parameters: leave all defaults → Next → Next → ✅ Acknowledge IAM capabilities → Create Stack
2. Wait ~20 minutes for CREATE_COMPLETE. The stack provisions VPC, EKS, Redis, deploys all 5G NFs, and only signals complete when the environment is healthy.
3. Open a terminal on the workstation: go to CloudFormation → Outputs tab and click the SSMSessionUrl link. This opens a browser-based shell on the EC2 instance (no SSH or key pair needed).
   ```
   cd /opt/workshop/repo
   cat /opt/workshop/status.txt   # should say ✅ READY
   ./verify.sh                    # confirm all green
   ```
4. Create DevOps Agent Space (AWS Console, ~5 min):
   - DevOps Agent Console → Create Agent Space → name: 5g-core-demo, let it auto-create the IAM role
   - EKS Console → Cluster (devops-agent-demo) → Access tab → Create access entry
     - Principal ARN: copy the agent role from Agent Space → Capabilities → Cloud
     - Access Policy: AmazonAIOpsAssistantPolicy, Scope: Cluster
   - Verify: click Operator access in the left sidebar to open the chat interface, then ask: "List all pods in the demo-5g namespace"
     - You should see NRF, AMF, SMF, UPF, PCF, ue-simulator pods listed
5. Run scenarios!

```
./scripts-5g/scenario-1-sg-change.sh inject
# → Open Agent Space → Operator access → paste the prompt:
#   "The NRF service is returning errors. Investigate."
# → Watch the cross-layer investigation (~2 min)
./scripts-5g/scenario-1-sg-change.sh restore
```

See docs/scenario-1.md through docs/scenario-4.md for all four scenarios.

Workshop Studio events: If your facilitator pre-provisioned the environment, skip steps 1–2. Start at step 3 — the SSM link is in your Event Dashboard.

Option 2: Self-Paced (Terraform — full control)

Run these commands from your local machine (macOS or Linux) with AWS CLI configured for an account with AdministratorAccess.

Cost: ~$5/hour while running. Destroy with terraform destroy when done.

```
# Clone
git clone https://github.com/aws-samples/sample-devops-agent-5g-core-workshop.git
cd sample-devops-agent-5g-core-workshop

# 1. Check prerequisites (tools + AWS credentials)
./prerequisites.sh

# 2. Deploy infrastructure (~15 min)
cd terraform/
cp terraform.tfvars.example terraform.tfvars   # edit region if needed
terraform init && terraform apply
cd ..

# 3. Deploy 5G Core application (~2 min)
#    This auto-configures kubectl and deploys all network functions
./deploy.sh

# 4. Verify everything is healthy
./verify.sh
```

5. Create DevOps Agent Space (AWS Console — ~5 min)

This is a one-time manual step:

1. AWS Console → DevOps Agent → Create Agent Space
   - Name: 5g-core-demo (let it auto-create the IAM role)
2. EKS Console → Cluster (devops-agent-demo) → Access tab → Create access entry
   - IAM Principal ARN: copy the role from Agent Space → Capabilities → Cloud
   - Access Policy: AmazonAIOpsAssistantPolicy, Scope: Cluster
3. Verify: click Operator access in the left sidebar to open the chat interface, then ask: "List all pods in the demo-5g namespace"
   - You should see NRF, AMF, SMF, UPF, PCF, ue-simulator pods
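
If you prefer to cross-check the pod list yourself, the following is a minimal sketch that inspects the JSON printed by `kubectl get pods -n demo-5g -o json` and reports which expected workloads have no Running pod. It assumes pod names start with the manifest names under k8s-5g/ (for example `nrf-...`), which this README does not confirm, and `not_running` is a hypothetical helper rather than part of verify.sh.

```python
import json

# Assumption: pod names start with the manifest names under k8s-5g/.
EXPECTED_PREFIXES = ["nrf", "amf", "smf", "upf", "pcf", "ue-simulator"]


def not_running(pod_list_json: str, expected=EXPECTED_PREFIXES) -> list[str]:
    """Return the expected workloads that have no pod in the Running phase."""
    pods = json.loads(pod_list_json).get("items", [])
    running = [
        pod["metadata"]["name"]
        for pod in pods
        if pod.get("status", {}).get("phase") == "Running"
    ]
    return [prefix for prefix in expected
            if not any(name.startswith(prefix) for name in running)]


if __name__ == "__main__":
    # Hypothetical output: every workload is Running except AMF.
    sample = json.dumps({"items": [
        {"metadata": {"name": f"{prefix}-7d9f"},
         "status": {"phase": "Pending" if prefix == "amf" else "Running"}}
        for prefix in EXPECTED_PREFIXES
    ]})
    print(not_running(sample))  # ['amf']
```

An empty list means every expected workload has at least one Running pod, which should match what the agent lists in the chat interface.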

(Optional) Add the agent role to Terraform for alarm permissions:

```
# Edit terraform/terraform.tfvars and set devops_agent_role_arn to the role ARN from step 1
cd terraform/ && terraform apply && cd ..
```

6. Run scenarios!

```
# Inject a failure
./scripts-5g/scenario-1-sg-change.sh inject

# → Open Agent Space → Operator access → paste the prompt:
#   "The NRF service is returning errors. Investigate."
# → Watch the cross-layer investigation (~2 min)

# Restore when done
./scripts-5g/scenario-1-sg-change.sh restore
```

See docs/scenario-1.md through docs/scenario-4.md for all four scenarios with expected investigation paths.

Cleanup

```
# Remove application
kubectl delete namespace demo-5g

# Destroy infrastructure
cd terraform/ && terraform destroy
```

Repository Structure

```
├── cfn/                    CloudFormation template (Workshop Event path)
│   ├── workshop-infra.yaml One-click stack: EC2 workstation + auto-provisioning
│   └── bootstrap.sh       Standalone bootstrap (for manual EC2/Cloud9 use)
├── terraform/              Infrastructure (VPC, EKS, Redis, SQS, IAM, Alarms)
│   ├── main.tf            Provider + VPC
│   ├── eks.tf             Cluster, node groups, addons, IRSA
│   ├── elasticache.tf     Redis cluster
│   ├── alarms.tf          CloudWatch alarms (per-scenario + baseline)
│   └── outputs.tf         Cluster name, Redis endpoint, SQS URL
├── k8s-5g/                 5G Core Kubernetes manifests
│   ├── namespace.yaml
│   ├── nrf.yaml           Network Repository Function (+ Redis connection)
│   ├── amf.yaml           Access & Mobility Management (+ HPA)
│   ├── smf.yaml           Session Management Function
│   ├── upf.yaml           User Plane Function
│   ├── pcf.yaml           Policy Control Function
│   └── ue-simulator.yaml  Load generator (simulates subscriber traffic)
├── scripts-5g/             Scenario inject/restore scripts
│   ├── scenario-1-sg-change.sh
│   ├── scenario-2-asg-ceiling.sh
│   ├── scenario-3-bad-deploy.sh
│   └── scenario-4-scaling-storm.sh
├── docs/                   Full workshop documentation
│   ├── introduction.md    5G primer, DevOps Agent overview, telco value prop
│   ├── workshop-guide.md  Step-by-step setup instructions
│   ├── scenario-1.md      SG change walkthrough
│   ├── scenario-2.md      ASG ceiling walkthrough
│   ├── scenario-3.md      Bad deployment walkthrough
│   ├── scenario-4.md      Scaling storm walkthrough
│   └── images/            Screenshots of agent investigations
├── deploy.sh               Post-Terraform K8s deployment (Helm + manifests)
├── verify.sh               Health check (pods, connectivity, agent access)
├── prerequisites.sh        Tool + credential checker
└── README.md               This file
```

Documentation

| Document | What's inside |
| --- | --- |
| Introduction | 5G Core concepts, DevOps Agent capabilities, telco value proposition |
| Workshop Guide | Full setup walkthrough with screenshots (Terraform → EKS → Agent Space) |
| Scenario 1: SG Change | Security Group investigation: inject, prompt, expected path, restore |
| Scenario 2: ASG Ceiling | Compute capacity investigation |
| Scenario 3: Bad Deploy | CI/CD audit trail investigation |
| Scenario 4: Scaling Storm | HPA feedback loop investigation |

Why 5G?

The 5G Network Functions use proper 3GPP vocabulary (SUPI, PDU sessions, S-NSSAI, DNN, 5QI, Nnrf reference points) so the demo resonates with telco engineers and NOC teams. But the underlying failure modes are universal EKS patterns — Security Group misconfigurations, ASG scaling limits, bad deployments, and HPA tuning issues happen in every Kubernetes environment. The same scenarios apply to any microservices architecture running on EKS.

The telco framing amplifies the business impact narrative: "pod restart" becomes "2 million subscribers can't register," which makes the value of fast root-cause identification viscerally clear.

Contributing

This is a demo/workshop repository. If you're adapting it for a different vertical (e-commerce, fintech, gaming), the pattern is:

1. Replace k8s-5g/ manifests with your domain's microservices
2. Keep the same Terraform infrastructure (it's generic EKS + Redis)
3. Rewrite scenario scripts to target your app's failure points
4. Update docs with your domain's vocabulary and impact language
