Peter Wang

Managing human workflows is more complex than writing code

When I began my on-call rotation last winter, I started looking for patterns by making a spreadsheet, posting stacktraces and grouping them manually. It worked; we quickly identified the alert types responsible for roughly 80% of the volume. But our vision is to build a truly AI-native company.

So I decided to automate on-call.

The core problem was that resolving issues was split between two unconnected workflows managed in two separate Slack channels. Customer complaints arrived through one channel, staffed by our Deployed Intelligence (DI) team, while automated system errors came through another. A user reporting stale General Ledger (GL) data might be caused by the same underlying bug showing up as an exception in the system channel, but we were only making that connection if the on-call engineer happened to notice both. Most of the time, the on-call would push an immediate fix, create a ticket for root-cause investigation, and move on.

Each of these tasks - maintaining attention across repetitive alerts and pattern-matching across disconnected channels - was a natural fit for LLMs. So I built Mantis, an orchestration layer extending from raw alert ingestion all the way to drafting customer communication (though a person approves anything that reaches customers).

The system we ended up building was a strong first step. Before Mantis, being on-call consumed at least 90% of the on-call engineer's time; with Mantis, it was reduced to about one-third, without any degradation in the overall level of customer support. But adoption of the system by our engineers was mixed. This after action report covers the technical decisions we made, as well as the challenges we encountered, as we made our first foray into automating on-call.

Building the Orchestration System

At Basis, bugs are investigated by an agent called Clueso. Clueso has access to our monorepo, BetterStack logs, and our unified MCP server. Clueso traces function interactions across the codebase, queries internal systems, and generates root-cause analyses. This saves about 20 minutes of manual debugging time per complex investigation.

We considered third-party options for the overall workflow but determined we would be uncomfortable plugging Clueso into an external product because that would have exposed deep access to our codebase, databases, and logs to a vendor.

So Mantis became an orchestration layer around Clueso and also a notification system that extends all the way to customer communication. It runs as a Python/FastAPI application with a daemon process that operates sequential processing loops.

Ingest. We consume alerts and store them in a Postgres database. Our core system dual-publishes alerts to Slack and the database; alerts raised by the DI team are consumed by a Slack webhook listener. Each alert carries structured metadata - prediction IDs, user and firm identifiers, and the originating product within the application - which gives downstream steps rich context to work with.
Cluster. The alerts are first normalized deterministically: we strip UUIDs, timestamps, and other high-cardinality fields so that only the error structure remains. This deterministic pass handles the easy cases: identical errors that differ only by which user or book they affect. We then feed the normalized alerts into Haiku (chosen for its speed) to identify semantic clusters that the regex pass would miss. Daily alert volume drops roughly 10x at the clustering step.
Investigate. Each cluster is sent to Clueso. Clueso returns a structured report with an overview of the issue, a timeline, code references, and recommended next steps. Clueso's biggest strength is its ability to traverse multiple systems (e.g. Modal, Prefect, etc.), which enables it to understand interactions and cross-check with the codebase to identify root causes.
Aggregate. Clueso's investigation reports go through another Mantis layer that groups investigations that share a root cause. This was implemented with a two-layer filter: first, Gemini Flash (chosen for its speed and 1-million-token context) performs a fast scan across all issues to find the most related issues. Then Sonnet (chosen for its reasoning capacity while avoiding the additional cost of Opus) conducts a more detailed analysis to determine if they truly share a root cause. This aggregation step condenses volume yet again: the number of underlying issues is generally about 10% of the number of clusters, or about 1% of raw alert volume.

The final result of these steps are reports containing all the necessary information and data for a human engineer to reproduce, fix, and verify the bug and its resolution.

The Hard Problem: Intelligence vs Determinism

The primary technical challenge raised by Mantis has been weighing the use of LLM intelligence versus deterministic pattern matching. Proper issue clustering ended up requiring both approaches.

We had alerts that were scoped to particular users where errors would be identical except for a different UUID. The smaller models we used for this task were surprisingly bad at grouping these together, even though a human would do so easily. The solution was simple: deterministic normalization first. Strip the variable fields with regex, then compare structures.

There were also cases where alerts with different structures should also be clustered together. For example, a reconciliation error in different months could have different alert structures, such as "reconciliation error" and "GL tie-out exception." These could be the same issue if our system ingested malformed GL data. We erred on the side of less de-duplication: running an investigation too many times was less costly than missing a truly different error.

Deployment and Initial Impact

We rolled Mantis out gradually, maintaining redundancy with our original workflow. We continued to publish alerts to our Slack channels while also feeding them to the Postgres database for processing by Mantis. We added more alert classes to this dual-publish pattern over time, increasing the volume of data Mantis processed.

To close the investigation loop, Mantis includes an issue-management routine that creates Linear tickets and assigns them to an engineer by parsing git history to determine original authorship. Optionally, it also sends the ticket directly to a coding agent for an automated fix.

On-call engineers generally appreciated Mantis. Rather than manually scan multiple Slack channels, they would monitor the unified Mantis dashboard, which bubbled new issues to the top. By the time a human saw an issue, most of the initial assessment - investigation, root cause analysis, customer impact analysis - had already been automated and completed.

Helping Our Teams Close the Loop with Customers

The impact of Mantis that most surprised us was what happened after triage. In most incident management systems, the pipeline ends with the creation of an engineering ticket. The questions, “Which customers were affected, and should we tell them?” are generally a separate, manual process. We automated part of it.

When an on-call engineer reviews an issue in the Mantis dashboard and determines that it impacts customers, they can trigger a specialized outreach investigation. A second Clueso agent queries internal systems and produces a structured result that includes a recommendation (reach out or just monitor), a customer-friendly summary of what happened, a list of impacted users grouped by firm, and a pre-drafted customer message.

Here's what an actual auto-generated outreach ticket looks like (lightly redacted):

System Issue: Customer Outreach Potentially Needed

Issue: Bank Statement Uploads Stuck in Syncing State - 12 books affected

Severity: 9

Recommendation: Outreach Recommended

What Happened: When users upload bank statements as PDF or spreadsheet files, some uploads get stuck showing a ‘syncing’ status indefinitely. This affects about 12 client books across 4 accounting firms.

Impacted User: [Name] - Blocked. Uploaded a bank statement then spent 37 minutes across two sessions repeatedly refreshing the page, clicking ‘Sync All’, re-uploading, and cancelling 3 times before logging out.

Suggested Customer Message: We've identified an issue where some bank statement uploads may appear stuck in a 'syncing' state. Our engineering team is actively working on a fix. In the meantime, please avoid re-uploading the same statements, as this could create duplicate transactions once the issue is resolved.

Internal Notes: Root cause: in [redacted].py, when extraction returns 0 transactions, the code returns early without creating a VendorRun. The status API checks for a VendorRun - no run means PENDING forever.

The system then automatically creates one Pylon support ticket per affected firm, using a database uniqueness constraint to prevent duplicate tickets. Each ticket lands in our DI team’s queue already matched to the correct customer account, complete with the draft message, a users-with-access table, and internal technical notes.

The on-call engineer's total involvement in this entire flow is about 30 seconds. The DI team member reviews the pre-drafted message, adjusts tone if needed, and sends it.

‍

Challenges

In addition to Mantis’s clear benefits, two challenges emerged relatively quickly.

Challenge #1: Confidently wrong investigations. Agent investigations could sometimes be wrong - convincingly, confidently wrong - and verifying investigations took almost the same amount of time as conducting an original inquiry. For senior engineers who couldn't be fooled as easily, Mantis was a huge help because they could validate, or invalidate, an investigation. But for more junior engineers, the agent's logic often seemed sufficiently plausible (without being fully concrete). For example, we noticed that Clueso would often cite “race conditions” as the reason for an error when in fact the underlying issue was something else entirely. When an erroneous report was validated by an engineer and acted upon, it created more work for the team.

As Andrej Karpathy observed, writing code (generation) and reading code (discrimination) are different capabilities. There is a corollary in investigating issues: conducting an investigation and reading an investigation are different skills. To deeply understand a system, an engineer must take time wrestling with nuances, going down wrong paths, and chasing down correlated logs. When AI does that wrestling for you, you lose the muscle memory and can't always tell when the AI got it wrong. Over time, this eroded trust in the system.

Challenge #2: Incomplete adoption. Mantis didn't have the benefit of full internal network effects because it was an experimental side project rather than an organizational refactor of our overall on-call structure. A lot of work continued to be done outside the Mantis framework and the Linear tickets it created, so when an engineer did see an issue on Mantis, it might not reflect operational reality within the team. At times an issue raised by Mantis had already been fixed by someone else without being marked.

This was not a technical deficiency. But from a human workflow perspective, the fact that Mantis was being used by only half the technical staff - the on-call people but not the day-to-day engineering teams - made the system far less effective. This led to an adoption-versus-iteration problem: the best way to iterate on Mantis and improve it would be through wider adoption, but the fact that it wasn't perfect undermined adoption.

Where Mantis Is Now

Mantis is still running. The core alert pipeline continues to process alerts and the customer outreach automation remains one of our most-used internal workflows. Our DI team relies on it as a primary channel for proactive customer communication. But the automated investigation and triage features have seen reduced adoption. On-call engineers increasingly use Mantis as a utility layer - looking up specific error codes from prediction pushes, checking alert lineage, viewing prediction payloads - rather than an overall orchestration interface.

What's Next

We're exploring three directions: proactive scanning that continuously monitors logs for anomalies that haven't triggered alerts yet; a resolution feedback loop that verifies whether deployed fixes actually resolve the underlying issue, using this signal to improve future investigations; and auto-fixing low-risk issues like config updates and typo fixes, where changes could merge automatically after passing tests without human review.

But the biggest remaining lever is organizational, not technical. The customer outreach pipeline succeeded because it delivered clear value to a specific team (DI) who adopted it as their own. The investigation pipeline stalled because it asked engineers to change how they work without making that change feel inevitable. If we could go back, we'd spend more time making Mantis the canonical place where operational state lives.

If you’re interested in working with AI-native engineers who experiment, learn, and iterate, we’re hiring.

Peter Wang is a Member of Technical Staff at Basis. Seth Schiesel contributed to this post.

LinkedIn X

We Automated On-Call. Building the System Was the Easy Part.

Building the Orchestration System

The Hard Problem: Intelligence vs Determinism

Deployment and Initial Impact

Helping Our Teams Close the Loop with Customers

Challenges

Where Mantis Is Now

What's Next

Recommended reading

How Accounting Landed Me a Sales Job

Seven Questions with Annette Garcia, Accounting Product Operations Lead

Your team needs a unified MCP. Here’s a recipe.

Put Basis to work.