How to build a debugging agent that works for your whole support team
A debugging agent your whole team can rely on across every shift: it answers from curated error patterns first, falls back to your docs and read-only source, and keeps the troubleshooting docs current on its own.
Investigating a production issue tends to slow down at the same two points, and I think it is the same at any growing startup.
The first is the troubleshooting documentation, or the lack of it. The fix for a recurring problem usually lives in some engineer's head, so the first step is always finding the go-to person for it. Teams do try to keep support docs current, and every investigation ends with a note to update the doc. But the team usually gets busy with something else, and that update rarely happens, at least not regularly enough to make a difference.
The second is dev team escalation. Some problems are completely new, the kind no support engineer has seen before, and they have to go to the dev team. An issue reported at midnight has to wait until morning. If it is reported during the day, a developer has to context-switch into an incident investigation, which eats into their own sprint deliverables.
We had both of these problems at XPR. To be fair to the team, we iterate quickly: every three-week sprint enhances existing functionality and adds many new features and integrations. That pace makes the documentation hard to keep current, and it sometimes introduces new issues.
So, two problems:
- Knowledge sitting with individuals, not the team.
- Needing developer involvement for anything that requires reading code.
Solution
The fix was to stop relying on people for this and let an AI agent do the initial investigation and documentation.
We built a debugging agent that the support team pings in Slack whenever they are working on a client issue. It is always on, and every person on every shift gets the same answer from it. The knowledge lives in the agent and its documents now, not in whoever happens to be awake.
It also keeps the documentation up to date on its own. When it works out a new problem, it writes the answer back into the troubleshooting docs so the next person does not have to work it out again. And when an issue turns out to be a real bug in our product rather than a known fix, it opens a Jira ticket for the development team.
How it works
A request usually arrives in a support ticket as a screenshot or an error message from a live client location. The support person tags Shikau (our AI agent) on Slack and either uploads the screenshots and logs or points it to the ticket.
For each product, the agent has access to an architecture reference doc covering the architecture, tech stack, workflow, and key source files, so it has context before it begins an investigation. With that in hand, it works through three layers in order, cheapest first.
- Error patterns. It first checks a store of common error patterns, each mapped to what the error is and the steps to resolve it. Most incoming issues are repeats, so this is where most of them end. The agent recognizes the pattern and replies with the fix.
- Troubleshooting docs. If the error is not a known pattern, it reads the troubleshooting articles in detail. These cover the cases that are documented but not yet reduced to a clean pattern.
- Read-only source. If neither layer explains the error, the agent reads the product source, which it has in read-only mode, and works out what could have caused it. This is the slowest and most expensive path.
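The three layers above amount to an ordered fallthrough that stops at the first layer able to answer. Here is a minimal sketch in Python, assuming each layer can be modeled as a function that returns a fix or nothing. The function names, the sample pattern, and the placeholder bodies are hypothetical and are not the agent's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    layer: str  # which layer produced the answer
    text: str   # the fix or explanation sent back to the support person


# A layer takes the error text and returns a fix, or None if it cannot help.
Layer = Callable[[str], Optional[str]]


def investigate(error: str, layers: list[tuple[str, Layer]]) -> Optional[Answer]:
    """Try each layer in order, cheapest first, and stop at the first answer."""
    for name, lookup in layers:
        result = lookup(error)
        if result is not None:
            return Answer(layer=name, text=result)
    return None


# Hypothetical pattern store: error signature -> resolution steps.
PATTERNS = {"connection refused": "Check that the sync service is running."}


def match_pattern(error: str) -> Optional[str]:
    lowered = error.lower()
    for signature, fix in PATTERNS.items():
        if signature in lowered:
            return fix
    return None


def search_docs(error: str) -> Optional[str]:
    return None  # placeholder: a real layer would search the troubleshooting articles


def read_source(error: str) -> Optional[str]:
    return f"Traced from source: {error}"  # placeholder for the expensive path


LAYERS = [
    ("error_patterns", match_pattern),
    ("troubleshooting_docs", search_docs),
    ("read_only_source", read_source),
]

answer = investigate("ERROR: Connection refused by upstream", LAYERS)
```

Keeping the layers in a plain ordered list makes the cost ordering explicit, and the source layer is only reached when both cheaper layers return nothing.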
When the agent resolves something from the source that was not documented, it updates the error patterns and troubleshooting docs so the same error does not send it back into the code next time. If the problem is a bug in our application and needs a code change, the agent opens a Jira ticket for the development team with the context it gathered.
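The write-back step can be pictured as recording each source-level finding and then routing it. The sketch below is an illustration under assumptions: the `Finding` and `KnowledgeStore` names are hypothetical, an in-memory dict stands in for the error patterns document, and a list stands in for a Jira client. It also assumes a finding is documented even when a ticket is opened, which the process described above does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    signature: str           # short, stable description of the error
    resolution: str          # what the agent worked out from the source
    needs_code_change: bool  # True when the cause is a product bug


@dataclass
class KnowledgeStore:
    patterns: dict = field(default_factory=dict)      # stand-in for the error patterns doc
    ticket_queue: list = field(default_factory=list)  # stand-in for a Jira client

    def record(self, finding: Finding) -> str:
        # Assumption: document the finding in both cases, so a repeat of the
        # error is a cheap lookup even while the bug is still open.
        self.patterns[finding.signature] = finding.resolution
        if finding.needs_code_change:
            self.ticket_queue.append(finding)
            return "ticket_opened"
        return "documented"


store = KnowledgeStore()
outcome = store.record(
    Finding("sync job times out", "Increase the retry window in settings.", False)
)
```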
The architecture and setup
The agent is centrally hosted, the same as the rest of our automations. It runs on its own EC2 instance in our AWS account. I cover the hosting in the overview post and go deeper in an upcoming architecture post.
A few things define what it can reach.
It maintains two documents of its own: the common error patterns and the more detailed troubleshooting articles. These are the first and second things it consults, ahead of any source code. They are a curated layer in front of the model. The more the agent answers from them, the less it reasons from scratch. Along with these, it has the per-product architecture reference doc mentioned earlier, which we maintain and it only reads.
The product source is available to it in read-only mode, and only as a last resort. It can read any file in the repository when it needs to trace an error to a cause, but it cannot push a commit or open a pull request. A code change is always routed through Jira by opening a ticket.
Access to cloud logs in CloudWatch is scoped with a dedicated IAM policy: the agent can read the log streams it needs and nothing else. Running on our own instance lets us set and audit these boundaries ourselves, which is why we host these agents this way rather than using a hosted solution.
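To make the log scoping concrete, here is a sketch of what a read-only CloudWatch Logs policy could look like, built as a Python dict so it can be reviewed and serialized to JSON. The region, account ID, and log group name are placeholders, and this is not the policy we actually run. The three actions listed are standard CloudWatch Logs read actions; the exact resource ARN form each action expects is worth checking against the AWS documentation before using anything like this.

```python
import json

# Placeholder values: substitute the real region, account ID, and log groups.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
LOG_GROUPS = ["/example/app-service"]


def read_only_logs_policy(region: str, account_id: str, log_groups: list[str]) -> dict:
    """Build an IAM policy that allows reading the named log groups only."""
    resources = []
    for name in log_groups:
        base = f"arn:aws:logs:{region}:{account_id}:log-group:{name}"
        resources.append(base)         # the log group itself
        resources.append(f"{base}:*")  # the log streams inside it
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSelectedLogGroups",
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:GetLogEvents",
                    "logs:FilterLogEvents",
                ],
                "Resource": resources,
            }
        ],
    }


policy_json = json.dumps(
    read_only_logs_policy(REGION, ACCOUNT_ID, LOG_GROUPS), indent=2
)
```

Because IAM denies anything not explicitly allowed, a policy with no write or delete actions and no wildcard resources leaves the agent unable to touch other log groups.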
Key highlights
Here is a summary of the key choices:
Patterns first, source last, to control cost. Reading a large log file against a large codebase is expensive. A single log can carry the same error repeated ninety times, and re-deriving the fix from source on every one of them would burn tokens for no reason. Checking the documented patterns first means the common case is answered from a cheap lookup, and the model only does the expensive work on genuinely new errors.
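One way to avoid paying for the same error ninety times is to collapse a log into unique error signatures before any lookup or model call. The sketch below assumes that error lines contain the word ERROR and that timestamps, long hex ids, and numbers are the parts that vary between repeats. Both the regular expressions and the function names are illustrative and would need tuning to a real log format.

```python
import re
from collections import Counter

# Assumed normalizations: the parts of a line that differ between repeats.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?")
_HEX_ID = re.compile(r"\b[0-9a-f]{8,}\b", re.IGNORECASE)
_NUMBER = re.compile(r"\d+")


def signature(line: str) -> str:
    """Reduce a log line to a stable signature by masking variable parts."""
    line = _TIMESTAMP.sub("<ts>", line)
    line = _HEX_ID.sub("<id>", line)
    line = _NUMBER.sub("<n>", line)
    return " ".join(line.split())  # collapse runs of whitespace


def unique_errors(log_text: str) -> Counter:
    """Count how often each distinct error signature appears in a log."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        if "ERROR" in line:
            counts[signature(line)] += 1
    return counts
```

With this in front of the pattern lookup, a log where one error repeats with different timestamps and order numbers collapses to a single signature plus a count, so only one lookup is needed for it.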
The docs maintain themselves. The agent updates the troubleshooting articles as it resolves new issues, so the documentation actually stays current. Because every support person reads the same documents through the same agent, that knowledge is identical on every shift. An issue worked out on the night shift is already documented when it recurs on a different shift the next day. There is no go-to person to wait for.
Real bugs become Jira tickets. When something is an actual defect, the agent files it for the development team with the log and context attached. The dev team gets less noise, because the agent has already filtered out everything that was a known fix, and the issues that do reach them are logged properly instead of relayed through a Slack message that may get lost.
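As an illustration of a ticket with the log and context attached, the sketch below builds a request body in the shape accepted by the create-issue endpoint of the Jira REST API version 2 (`POST /rest/api/2/issue`). It only constructs the payload and does not send it. The project key, the sample values, and the `build_bug_issue` helper are hypothetical, and the issue type name "Bug" is assumed to exist in the target project.

```python
import json


def build_bug_issue(
    project_key: str,
    summary: str,
    log_excerpt: str,
    context: str,
    max_log_chars: int = 2000,
) -> dict:
    """Build a create-issue payload with the context and a trimmed log excerpt."""
    # {noformat} is Jira wiki markup that keeps the log text unformatted.
    description = (
        f"{context}\n\n"
        "Log excerpt:\n"
        f"{{noformat}}\n{log_excerpt[:max_log_chars]}\n{{noformat}}"
    )
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Bug"},
        }
    }


payload = build_bug_issue(
    project_key="SUP",  # hypothetical project key
    summary="Sync job fails with timeout",
    log_excerpt="ERROR order failed: timeout",
    context="Reported by support; not matched by known patterns or docs.",
)
body = json.dumps(payload)
```

Trimming the log excerpt keeps the ticket readable; the full log can stay in CloudWatch and be referenced from the description.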