I deployed 40+ TestFlight builds in a single week while I was at a team offsite in Munich. I barely touched my laptop.
That's not an exaggeration — I was in meetings most of the day. The builds shipped anyway.
Here's the system behind it, including the two times it broke catastrophically, and what I learned.
The Problem With Human-in-the-Loop Development
When I started BauGPT, I was the bottleneck for everything.
A bug fix would sit in my head until I had two free hours to actually write and test it. Feature ideas piled up in Notion. My phone would buzz with user complaints at 11pm and I'd either ignore it (bad) or try to fix it at midnight (worse).
The traditional developer workflow:
- Identify task
- Open laptop
- Write code
- Test manually
- Build
- Deploy
At every step, a human is required. Which means nothing happens while you're sleeping, traveling, or living your life.
The Shift: AI Agents That Actually Ship
About four months ago, I started treating my AI assistant (James) not as a code assistant but as a full engineer. Not "help me write this function" — more like "here's the task, figure it out and deploy when done."
The difference sounds subtle. It isn't.
When you give Claude Code a task and say "push to TestFlight when it works," something interesting happens: it does pre-flight checks, writes the code, runs local tests, fixes errors, and ships — without waiting for you.
The key ingredients:
1. Trusted build infrastructure
Everything runs through Fastlane. The beta_quick lane builds, signs, and uploads in about 4 minutes. No manual steps. If the deploy step needs a human, the whole system breaks.
2. A real task tracking system
I use a custom TKT system (PostgreSQL, exposed as an API). Every task has a status, an owner, and a next step. When the agent finishes, it updates the task to resolved. I can look at the board and see exactly what shipped without reading a single log.
3. Explicit definition of done
This took me a while to get right. Vague tasks ("improve the chat UI") produce vague results. Now every task spec includes the specific behavior to change, acceptance criteria, and edge cases to handle.
4. A Mattermost channel per ticket
Every non-trivial task gets its own channel. The agent posts updates there. I can ignore it if it's going well, or read the thread if something smells off.
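The ingredients above imply a concrete task shape. Here's a minimal sketch of what that might look like; the field names (status, owner, acceptance, and so on) are my assumptions, not the actual TKT schema:

```typescript
type TaskStatus = "open" | "in_progress" | "resolved";

// Hypothetical task record combining the status/owner fields from the
// tracking system with the definition-of-done fields from the spec format.
interface TaskSpec {
  id: string;
  status: TaskStatus;
  owner: string;        // "agent" or a human
  behavior: string;     // the specific behavior to change
  acceptance: string[]; // checkable acceptance criteria
  edgeCases: string[];  // edge cases the agent must handle
  channel: string;      // the per-ticket Mattermost channel
}

// A spec is well-defined only if it names concrete acceptance criteria;
// a vague task with none would be bounced back before the agent picks it up.
function isWellDefined(spec: TaskSpec): boolean {
  return spec.behavior.length > 0 && spec.acceptance.length > 0;
}

// When the agent finishes, it flips the task to resolved on the board.
function resolve(spec: TaskSpec): TaskSpec {
  return { ...spec, status: "resolved" };
}
```

Any API-accessible tracker can carry this shape; the point is that "done" is a state transition the agent performs itself, not a message waiting for a human.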
What a Real Deployment Looks Like
I'll walk through a recent one: adding skeleton loaders with shimmer animation to the BauGPT mobile app.
Task description:
Add shimmer animation to skeleton loaders in empty state.
1.2s smooth animation loop (0.3 to 0.7 opacity).
Stops when real content loads. Use darker textMuted color.
Acceptance: app visible in simulator, no crashes, shimmer obvious on Conversations screen.
The agent found the existing skeleton component, added Animated.Value with interpolation, used useEffect to start the loop on mount, tested in the Expo dev server, ran beta_quick to build and upload, and posted to Mattermost: "Build #24 live on TestFlight. Shimmer animation pulsing at 1.2s. Stops on content load."
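The agent drove this through React Native's Animated API, but the timing logic itself can be sketched framework-free. The function below is my reconstruction of the spec, not the agent's actual code: opacity as a triangle wave over the 1.2 s loop, clamped between 0.3 and 0.7, with a flag for when real content has loaded.

```typescript
// Shimmer opacity per the task spec: 1.2 s loop, 0.3 -> 0.7 -> 0.3.
// In the app this would drive an Animated.Value; here it's a pure
// function of elapsed time, which makes the behavior easy to check.
const PERIOD_MS = 1200;
const MIN_OPACITY = 0.3;
const MAX_OPACITY = 0.7;

function shimmerOpacity(tMs: number, loading: boolean): number {
  // Per the spec, the shimmer stops when real content loads.
  if (!loading) return 1;
  const phase = (tMs % PERIOD_MS) / PERIOD_MS;           // 0..1 through one loop
  const tri = phase < 0.5 ? phase * 2 : (1 - phase) * 2; // 0 -> 1 -> 0
  return MIN_OPACITY + tri * (MAX_OPACITY - MIN_OPACITY);
}
```

A pure function like this is also what makes the acceptance criteria checkable without a simulator: you can assert the opacity at t = 0, t = 600 ms, and t = 1200 ms before ever rendering a frame.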
Start to TestFlight: 11 minutes. I was in the shower.
The Failures (Important)
Two things broke badly.
Failure 1: The render rules crash
Build #38 deployed an app that crashed on launch. The agent added custom render rules to the markdown component, which replaced ALL default render rules — not just the table ones. Text, headings, and lists had nothing to render with. Crash on first message.
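The fix pattern is worth spelling out, because it's an easy mistake in any rules-as-object API. The names below (defaultRules, the rule signatures) are illustrative, not the real markdown library's API:

```typescript
type RenderRule = (content: string) => string;

// Stand-in for the library's built-in rules for every node type.
const defaultRules: Record<string, RenderRule> = {
  text: (c) => c,
  heading: (c) => c.toUpperCase(),
  list: (c) => `- ${c}`,
};

const customTableRule: RenderRule = (c) => `[table] ${c}`;

// Buggy version: passing ONLY the custom rule means every non-table
// node now has no rule at all -> crash on the first rendered message.
const buggyRules: Record<string, RenderRule> = { table: customTableRule };

// Fixed version: spread the defaults first, then override only "table".
const fixedRules: Record<string, RenderRule> = { ...defaultRules, table: customTableRule };
```

Both versions compile cleanly, which is exactly why a "does it compile?" pre-flight check let this through.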
What I learned: the pre-flight check was "does it compile?" — it didn't catch this logic error. Now every task that touches core render or data pipelines requires a smoke test in the simulator before uploading. The Fastlane verify_launch lane catches this now.
Failure 2: The nested scroll conflict
We tried adding horizontal scroll to markdown tables inside the chat feed (vertical FlatList). The gesture conflict took 5 builds to resolve. The agent kept "fixing" it with approaches that compiled but failed at runtime in ways you only see with actual touch input.
What I learned: some bugs require human judgment. I now include in task specs: "If this touches gesture handling or nested scrolls, do a manual test and describe what you see before deploying."
The Numbers After 4 Months
Before the system: ~2-3 deploys per week, all manual, mostly nights and weekends.
After: 10-20 deploys per week. Most happen during business hours or while I'm asleep. Build velocity roughly 5x. Bug resolution time dropped from "eventually" to same-day for anything clearly scoped.
More importantly: I stopped being the bottleneck. Ideas go in, working builds come out. The gap is hours instead of weeks.
What This Isn't
This isn't "AI replaces engineers." The hard parts — figuring out what to build, why it matters, what users actually need — still require a human.
What's automated is the mechanical part: writing the implementation once you've decided what to build, running tests, deploying, updating the task board, posting the update.
That mechanical work was probably 40% of my old time. Now I cover it by writing a clear spec instead.
The Setup
The components:
- Claude Code as the primary coding agent (better than ChatGPT at making autonomous decisions without asking permission at every step)
- Fastlane for iOS builds (the beta_quick lane)
- TKT for task tracking (PostgreSQL + REST API; any API-accessible system works)
- Mattermost for notifications (one channel per ticket)
- OpenClaw as the orchestration layer that manages sessions and schedules cron jobs
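The Mattermost side is the simplest piece: incoming webhooks accept a JSON payload with a text field and an optional channel override, so the agent's updates are a single POST. The webhook URL and helper below are placeholders, not my actual setup:

```typescript
// Placeholder webhook URL; a real one comes from Mattermost's
// incoming-webhook integration settings.
const WEBHOOK_URL = "https://mattermost.example.com/hooks/xxx";

interface MattermostPayload {
  channel: string; // overrides the webhook's default channel
  text: string;
}

// Builds the kind of per-ticket status post described above.
function buildTicketUpdate(ticketChannel: string, buildNo: number, summary: string): MattermostPayload {
  return {
    channel: ticketChannel, // one channel per ticket
    text: `Build #${buildNo} live on TestFlight. ${summary}`,
  };
}

// Sending is a plain POST of the JSON body, e.g.:
// await fetch(WEBHOOK_URL, { method: "POST", body: JSON.stringify(payload) });
```

Because it's just HTTP, the agent needs no Mattermost SDK; any step in the pipeline can post to its ticket's channel.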
Total setup time to get this working: about 2 weeks of iteration. Mostly on the task spec format and the definition-of-done problem.
The payoff: I shipped a full mobile app with auth, AI chat, prompt library, skeleton loaders, gesture fixes, App Store submission, and TestFlight builds — largely while traveling.
Building BauGPT — AI for construction in Germany. If you're doing something similar with autonomous AI agents, let's talk.