| 1 | +--- |
| 2 | +sidebar_position: 2 |
| 3 | +--- |
| 4 | + |
| 5 | +# Find the Root Cause of a Failing Service |
| 6 | + |
| 7 | +Checkout is failing. The alerts are loud: HTTP 500s climbing on `PlaceOrder`. But the checkout pods are green: running, no restarts, nothing in the events. The symptom is in one service; the cause is somewhere else. |
| 8 | + |
| 9 | +This is the investigation NuBi is built for. Below is a single conversation that goes from "checkout is broken" to the function responsible in the source code, and then to the fix. You ask in plain language and drill down on what comes back. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Start with the symptom |
| 14 | + |
| 15 | +Ask the question you'd ask a senior engineer. |
| 16 | + |
| 17 | +> **You ask NuBi** |
| 18 | +> |
| 19 | +> Checkout is throwing errors in the demo namespace. What's the root cause? |
| 20 | + |
| 21 | +NuBi checks checkout, then walks the [Knowledge Graph](../../knowledge-graph.md) downstream to follow the failure to its source. It answers with a 5-Whys chain: |
| 22 | + |
| 23 | + |
| 24 | + |
| 25 | +Checkout itself is healthy; it is only relaying a failure. The problem is downstream: `product-catalog` is returning gRPC code 13 (INTERNAL) because a feature flag named **"Product Catalog Fail"** is switched on. |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## Confirm the dependency |
| 30 | + |
| 31 | +You want it explicit: which service, and how the failure travels. |
| 32 | + |
| 33 | +> **You ask NuBi** |
| 34 | +> |
| 35 | +> Which service is actually causing it, and how are they connected? |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | +Now the relationship is explicit: `checkout` calls `product-catalog` during `PlaceOrder` to fetch product details, so a failed lookup breaks the order. NuBi traced that through the dependency graph. |
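The idea of following a failure through a dependency graph can be sketched in a few lines of Python. This is an illustration only, not NuBi's implementation: the `CALLS` map, the `ERRORING` flags, and `find_root_cause` are hypothetical names, and only the `checkout` → `product-catalog` edge comes from the conversation above.

```python
from collections import deque

# Hypothetical call graph: caller -> services it calls.
# Only the checkout -> product-catalog edge is taken from the walkthrough.
CALLS = {
    "frontend": ["checkout"],
    "checkout": ["product-catalog", "payment"],
    "product-catalog": [],
    "payment": [],
}

# Hypothetical per-service error flags, as an observability backend might report them.
ERRORING = {"checkout": True, "product-catalog": True, "payment": False}


def find_root_cause(start):
    """Return the longest chain of erroring services reachable downstream of `start`."""
    best = [start]
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if len(path) > len(best):
            best = path
        for callee in CALLS.get(path[-1], []):
            # Follow only edges whose target is itself erroring, and skip cycles.
            if ERRORING.get(callee) and callee not in path:
                queue.append(path + [callee])
    return best


print(" -> ".join(find_root_cause("checkout")))
```

The last service on the returned path is the deepest one still reporting errors, which makes it the most likely origin of the failure rather than another relay.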
| 40 | + |
| 41 | +--- |
| 42 | + |
| 43 | +## Demand evidence |
| 44 | + |
| 45 | +Conclusions are cheap. Ask for proof. |
| 46 | + |
| 47 | +> **You ask NuBi** |
| 48 | +> |
| 49 | +> Show me the exact error and a failing trace. |
| 50 | + |
| 51 | +NuBi pulls the literal error strings and a real failing request, with the endpoint, status, duration, and trace ID: |
| 52 | + |
| 53 | + |
| 54 | + |
| 55 | +`failed to get product "OLJCESPC7Z"`. `13 INTERNAL: Product Catalog Fail Feature Flag Enabled`. A `PlaceOrder` trace returning 500. No hand-waving. The cause is on the record. |
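If you want to process error strings like these yourself, a minimal sketch follows. It assumes the `<code> <STATUS>: <message>` shape of the second string quoted above; the `parse_grpc_error` helper is hypothetical and not part of NuBi.

```python
import re

# Matches "<numeric code> <STATUS NAME>: <message>", the shape of the
# second error string quoted above.
GRPC_ERROR = re.compile(r"^(?P<code>\d+) (?P<name>[A-Z_]+): (?P<message>.+)$")


def parse_grpc_error(text):
    """Split a gRPC-style error string into code, status name, and message.

    Returns None when the text does not have that shape.
    """
    match = GRPC_ERROR.match(text.strip())
    if match is None:
        return None
    return {
        "code": int(match["code"]),
        "name": match["name"],
        "message": match["message"],
    }


print(parse_grpc_error("13 INTERNAL: Product Catalog Fail Feature Flag Enabled"))
print(parse_grpc_error('failed to get product "OLJCESPC7Z"'))
```

The first call yields a structured record with code 13; the second returns `None` because that string is the caller's wrapper message, not a gRPC status line.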
| 56 | + |
| 57 | +--- |
| 58 | + |
| 59 | +## Go to the code |
| 60 | + |
| 61 | +The runtime cause is settled. Now find the line. |
| 62 | + |
| 63 | +:::info Needs a source-control integration |
| 64 | +This step uses NuBi's Code Agent, which reads the service's repository. Connect the repository through a [source-control integration](../../../integrations/Code%20Repository/GitHub/github-integration.md) first. Without it, you still get everything above. |
| 65 | +::: |
| 66 | + |
| 67 | +> **You ask NuBi** |
| 68 | +> |
| 69 | +> Look at the product-catalog source and tell me which code path returns the error. |
| 70 | + |
| 71 | + |
| 72 | + |
| 73 | +There it is: `checkProductFailure` returns an error when the requested product is `OLJCESPC7Z` and the flag is on. NuBi went from a customer-facing 500 to a named function in a cited file. |
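To make the shape of that code path concrete, here is a rough Python analogue. It is not the service's actual source: `check_product_failure`, `get_product`, `InternalError`, and the way the flag value is passed in are all assumptions made for illustration. Only the product ID and the error message come from the evidence above.

```python
FAILING_PRODUCT_ID = "OLJCESPC7Z"  # product ID seen in the error above


class InternalError(Exception):
    """Stand-in for a gRPC status with code 13 (INTERNAL)."""

    code = 13


def check_product_failure(product_id, flag_enabled):
    """True only when the flag is on AND the request is for the one affected product."""
    return flag_enabled and product_id == FAILING_PRODUCT_ID


def get_product(product_id, flag_enabled, catalog):
    """Look up a product, failing deliberately when the flag-gated path is hit."""
    if check_product_failure(product_id, flag_enabled):
        raise InternalError("Product Catalog Fail Feature Flag Enabled")
    return catalog[product_id]


catalog = {"OLJCESPC7Z": "affected product", "OTHER": "unaffected product"}
print(get_product("OTHER", True, catalog))        # other products still resolve
print(get_product("OLJCESPC7Z", False, catalog))  # flag off: normal lookup
```

Both conditions have to hold for the failure to fire, which is why only some orders break and why turning the flag off is enough to stop it.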
| 74 | + |
| 75 | +--- |
| 76 | + |
| 77 | +## Close it out |
| 78 | + |
| 79 | +> **You ask NuBi** |
| 80 | +> |
| 81 | +> How do I fix it? |
| 82 | + |
| 83 | +NuBi doesn't just name the fix. It hands you the exact steps: which ConfigMap holds the flag, the command to edit it, the field to flip, and what happens after you save. |
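As a rough picture of what "the field to flip" can mean, the sketch below switches a flag's default variant in a flagd-style JSON document. Everything specific here is an assumption: the `productCatalogFailure` key, the document layout, and the `switch_flag_off` helper are hypothetical. The real ConfigMap, command, and field are the ones NuBi reports for your cluster.

```python
import json

# Hypothetical flag document, shaped like a flagd-style flag file
# that might be stored in a ConfigMap. The key name is an assumption.
FLAGS_JSON = """
{
  "flags": {
    "productCatalogFailure": {
      "state": "ENABLED",
      "variants": {"on": true, "off": false},
      "defaultVariant": "on"
    }
  }
}
"""


def switch_flag_off(document, flag_key):
    """Return the document with `flag_key` defaulting to its "off" variant."""
    config = json.loads(document)
    flag = config["flags"][flag_key]
    if "off" not in flag["variants"]:
        raise ValueError(f"flag {flag_key!r} has no 'off' variant")
    flag["defaultVariant"] = "off"
    return json.dumps(config, indent=2)


print(switch_flag_off(FLAGS_JSON, "productCatalogFailure"))
```

The change touches only the default variant and leaves the flag definition in place, so the failure scenario can be re-enabled later by flipping the same field back.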
| 84 | + |
| 85 | + |
| 86 | + |
| 87 | +That's the whole fix. If you have a [ticketing](../../tickets.md) or source-control integration connected, follow with **"open a ticket with the evidence"** or **"open a pull request to fix it"** and NuBi carries the investigation into a tracked action. |
| 88 | + |
| 89 | +--- |
| 90 | + |
| 91 | +## Tips for your own investigations |
| 92 | + |
| 93 | +- **Lead with the symptom, not your theory.** Tell NuBi what's failing and let it find the cause. "Checkout is erroring" beats "is product-catalog down?", because you don't yet know it's product-catalog. |
| 94 | +- **Drill down, don't restart.** NuBi keeps the thread. "Which service?" → "show me a trace" → "now the code" gets further than three cold questions. |
| 95 | +- **Always ask for the trace.** A failing trace ID turns a plausible answer into a provable one. |
| 96 | +- **Name the scope.** A namespace or service in the question makes the answer faster and sharper. |
| 97 | + |
| 98 | +--- |
| 99 | + |
| 100 | +That whole chain ran in one conversation, with several agents handling triage, knowledge-graph traversal, trace retrieval, and code analysis. Along the way NuBi recovered from a failed log fetch on its own and compressed its own working context to stay focused. You stitched nothing together; you just asked. |