We run AI agents inside BauGPT. Here's what it taught us about building them.

We build AI for the construction industry. We also run AI agents inside our own company to handle scheduling, ticket routing, code review, and content ops.

That puts us in a weird position. We're the builder and the customer at the same time. When an agent screws up internally, I don't read about it in a support ticket. I'm dealing with the fallout.

That's been more valuable than any benchmark.

Here's what broke, what we learned, and what we changed.

What we actually run

Not in the abstract. The specific list:

A scheduling agent reads our engineering calendar and Slack, then flags conflicts and missed standups. It's been running for about eight months.

A ticket routing agent takes incoming work requests and classifies them by project and urgency. We rebuilt it twice.

A code review agent does a first pass on pull requests. A human still approves the merge, but the agent does the initial read and leaves comments.

A content ops agent monitors our LinkedIn queue, tracks approved post outlines, and drafts content based on research briefs. This one needs the most hands-on supervision.

Between these four, we're running agent-assisted ops pretty much every day. Some of it works smoothly. Some of it has failed in ways I didn't see coming.

Where they fail

The failure I expected going in: hallucinations. The agent gives confidently wrong information. A summary that gets a detail backwards. A routing decision that makes no sense.

Those happen. They're annoying. They're not the failure that actually hurt us.

The failure that surprised me: trust collapse.

Here's the specific case. Our scheduling agent had been running well for a few months. It was catching conflicts. It was right probably 85% of the time. Good, not perfect. Useful.

Then it missed two back-to-back conflicts in a single week. Both caused real problems: one was a missed standup that blocked a release, one was a double-booked external call. Not catastrophic, but genuinely costly.

After that week, three people on the team effectively stopped using the agent's output. Not because it stopped working. Its accuracy didn't change. But the two misses were the kind that sting, and the informal conclusion was "we can't rely on this for scheduling."

The agent didn't break. The trust did. And once trust breaks, you're effectively running no agent at all.

I keep coming back to that failure pattern because it's so different from how we usually talk about AI reliability. The conversation is almost always about accuracy rates. 80%, 90%, 95%. But a scheduling agent that fails during high-stakes moments is worse than one that fails during low-stakes ones, even if the raw accuracy number is identical. The type of error matters more than the count.
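To make that concrete, here is a minimal sketch of two agents with identical accuracy but very different damage. Everything in it is an illustrative assumption rather than our real data: the `Outcome` type, the `stakes` weights (1 for a routine check, 10 for a high-stakes one), and the count of 20 checks are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outcome:
    correct: bool
    stakes: float  # hypothetical cost incurred if the agent misses this item


def accuracy(outcomes: List[Outcome]) -> float:
    """Share of items the agent got right, ignoring how much each one mattered."""
    return sum(o.correct for o in outcomes) / len(outcomes)


def miss_cost(outcomes: List[Outcome]) -> float:
    """Total stakes of the items the agent got wrong."""
    return sum(o.stakes for o in outcomes if not o.correct)


def build_run(total: int, missed_stakes: List[float]) -> List[Outcome]:
    """Build a run of `total` checks where only the listed misses went wrong."""
    hits = [Outcome(True, 1.0) for _ in range(total - len(missed_stakes))]
    misses = [Outcome(False, s) for s in missed_stakes]
    return hits + misses


# Both agents miss 3 of 20 checks, so both are 85% accurate.
agent_a = build_run(20, [1.0, 1.0, 1.0])    # three routine misses
agent_b = build_run(20, [1.0, 10.0, 10.0])  # two misses on high-stakes items

print(accuracy(agent_a), miss_cost(agent_a))
print(accuracy(agent_b), miss_cost(agent_b))
```

Under these made-up weights, the accuracy column is identical for both agents while the cost of the second agent's misses is seven times higher. An accuracy dashboard would not show the difference. The people absorbing the misses would.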

A 2025 survey of 306 AI agent practitioners found that reliability is the top barrier to enterprise AI adoption. Not cost. Not setup complexity. Reliability. RAND's 2026 meta-analysis of 65 enterprise AI initiatives put the overall failure rate at 80.3%. The breakdown is more interesting than the headline: a third of those projects were abandoned before they ever reached production. Another 28% made it to production but couldn't earn real usage.

We almost became that second category with our scheduling agent. It was in production. It wasn't earning usage.

What those failures revealed about our users

We build BauGPT for Bauleiter and project managers. People running construction sites, coordinating 30 or 50 subcontractors, managing material deliveries against tight timelines. Wrong information in that environment doesn't just waste an hour. It can blow a concrete pour window or trigger delay penalties.

Living with our own agents taught us something we'd technically understood but hadn't felt: in high-stakes environments, users need to know the failure mode before they'll trust the tool. Not after the first failure. Before it.

Only 27% of AEC professionals currently use AI in their operations, per a 2026 ASCE survey. When asked why adoption is low, 57% of that group cited reliability and accuracy concerns. I think that 57% figure is actually low. The ones who didn't cite it probably haven't tried AI in a genuinely high-stakes scenario yet.

Construction has been here before. BIM software was supposed to automate coordination. It added a new category of coordination instead. Estimating tools that saved three hours in the demo added two back in corrections. Scheduling software that required constant manual overrides. There's a deep institutional skepticism in the industry, and it's completely rational. These workers have been let down by overpromised tech for 20 years.

Our internal scheduling failure reminded me of this: the construction professionals using BauGPT aren't wrong to approach AI with skepticism. They've calibrated their trust based on what they've actually seen. We need to earn it, not assume it.

The one design principle we changed

This is the part that directly changed how we build BauGPT features.

We stopped optimizing for coverage and started optimizing for confidence.

Coverage means the agent attempts to answer 100% of queries. It handles every case. No edge case left behind.

Confidence means when the agent gives an answer, the user can trust it's right. The agent declines to answer when it doesn't know.

Old approach: build a feature that answers every question about project schedules.

New approach: build a feature that answers only when it has high confidence, and says "I'm not sure, you should check with the site manager" when it doesn't.

The "I don't know" answer sounds like a product shortcoming. It isn't. A confident wrong answer is a product failure. An honest "I can't reliably answer that" keeps the trust relationship alive.

We rebuilt our ticket routing agent on this principle. It now routes only the tickets it's confident about and flags everything else for human triage. It routes fewer tickets today than before the rebuild. Usage went up. People trust the ones it does route because they're almost always correct.
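The routing rule described above can be sketched in a few lines. This is a simplified illustration, not our production code: the `Prediction` and `RoutingDecision` types, the `route_ticket` function, and the 0.9 threshold are all hypothetical, and the sketch assumes the classifier exposes a confidence score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff for illustration; a real value would be tuned on real tickets.
CONFIDENCE_THRESHOLD = 0.9


@dataclass(frozen=True)
class Prediction:
    project: str
    urgency: str
    confidence: float  # assumed classifier score in [0, 1]


@dataclass(frozen=True)
class RoutingDecision:
    routed: bool
    project: Optional[str]
    urgency: Optional[str]
    reason: str


def route_ticket(pred: Prediction,
                 threshold: float = CONFIDENCE_THRESHOLD) -> RoutingDecision:
    """Route only when the classifier is confident; otherwise hand off to a human."""
    if pred.confidence >= threshold:
        return RoutingDecision(True, pred.project, pred.urgency, "auto-routed")
    # Declining is a first-class outcome here, not an error path.
    return RoutingDecision(False, None, None, "flagged for human triage")


confident = route_ticket(Prediction("site-app", "high", 0.97))
unsure = route_ticket(Prediction("site-app", "low", 0.55))
print(confident)
print(unsure)
```

The design choice that matters is that the low-confidence branch returns no guess at all. A best guess attached to a flagged ticket tends to get accepted anyway, which brings the confident wrong answer back in through the side door.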

When you optimize for coverage, you're optimizing for the demo. When you optimize for confidence, you're optimizing for adoption after the demo.

This is probably part of why 88% of AI agents reportedly never reach production. Coverage is easy to show in a controlled environment. Confidence only shows up under real conditions. Most teams are measuring the wrong thing.

The fastest way to understand what your users need

I didn't write this to sell BauGPT. I wrote it because I think the "build AI products while using AI internally" pattern is an actual competitive advantage, and not enough teams are treating it that way.

When our scheduling agent failed, I felt the same frustration a Bauleiter feels when software lets them down. That's not something you can get from user interviews. You have to live it.

The lesson isn't "AI agents fail and that's a problem to solve." It's more specific: trust in AI agents is fragile in exactly the ways that matter most to users with high-stakes jobs. An agent can have great accuracy metrics and still lose adoption if it fails at the wrong moment. High-stakes users need the agent to be honest about what it doesn't know, not optimized to always have an answer.

We changed BauGPT because of what we noticed inside our own ops. Only 16% of contractors currently use AI for scheduling, with 60% reporting no plans to adopt it. I don't think the problem is awareness or cost. The tools haven't earned trust yet. Ours hadn't either, until we figured out that confidence beats coverage.

The fastest way to understand what your users need from AI is to use it yourself and notice what you stop trusting.

We stopped trusting our scheduling agent. We rebuilt it to be more honest about what it didn't know. That's the same thing we're doing in BauGPT.

Trust is the product. Everything else is implementation.
