Everybody's building with AI right now. OpenAI has a free tier. Anthropic gives you credits. Google hands out Gemini API access like candy at a conference.
So you build your MVP on free credits. Ship it. Get users. Life is good.
Then your free tier runs out and you discover what AI actually costs at scale. That's where it gets interesting.
Our First API Bill
When BauGPT hit about 500 daily active users, our OpenAI bill was around €2,800/month. For context, that was a single API. Just chat completions. No embeddings, no fine-tuning, no image generation.
The math is simple and brutal. Each construction worker query hits our RAG pipeline, which means:
- Embedding the query (~0.1 cent)
- Vector search (free, self-hosted)
- LLM completion with retrieved context (~3-8 cents depending on context length)
At 500 users doing maybe 10 queries a day, you're looking at 5,000 completions. Average cost per completion around 5 cents. That's €250/day or €7,500/month if everyone used the expensive model.
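The back-of-envelope math above is worth making explicit. Here's a tiny cost model using the per-query figures quoted — these are our rough averages, not anything from a provider's pricing page, so treat the numbers as assumptions:

```typescript
// Rough monthly cost model for a RAG pipeline (all figures in euros).
// The per-query costs are assumptions based on our own averages.
interface CostInputs {
  dailyActiveUsers: number;
  queriesPerUserPerDay: number;
  embeddingCostPerQuery: number;  // ~€0.001 per query embedding
  completionCostPerQuery: number; // ~€0.03–0.08 depending on context length
}

function monthlyApiCost(c: CostInputs, daysPerMonth = 30): number {
  const queriesPerDay = c.dailyActiveUsers * c.queriesPerUserPerDay;
  const costPerDay =
    queriesPerDay * (c.embeddingCostPerQuery + c.completionCostPerQuery);
  return costPerDay * daysPerMonth;
}

// Worst case: every query hits the expensive model at ~5 cents average.
// Completions only here, to match the €7,500 figure — query embeddings
// at ~0.1 cent each add only about €5/day on top.
const worstCase = monthlyApiCost({
  dailyActiveUsers: 500,
  queriesPerUserPerDay: 10,
  embeddingCostPerQuery: 0,
  completionCostPerQuery: 0.05,
});
```

Notice how completely the completion cost dominates: embeddings are a rounding error next to it, which is why routing completions (not embeddings) is where the savings are.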
We got it down to €2,800 through model routing. But getting there took a month of work.
Rate Limits Will Ruin Your Launch Day
Here's something nobody warns you about: rate limits on "free" and even paid tiers are aggressive.
OpenAI's rate limits are per-minute and per-day. When we launched a new feature and got a spike of 200 concurrent users, we hit the TPM (tokens per minute) ceiling within 20 minutes. Requests started failing. Users got error messages. On launch day.
The fix? Your options:
- Pay significantly more for higher rate limit tiers
- Build a queue system (we use Bull + Redis) that throttles requests
- Add a fallback model from a different provider
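The fallback option is the simplest of the three to sketch. Our real setup sits behind a Bull queue, but stripped down, the core pattern is: try the primary provider, and on a rate-limit error fall through to the next one. The `Provider` shape and `RateLimitError` class here are illustrative, not any SDK's actual API:

```typescript
// Minimal provider-fallback sketch. In production this runs behind a
// queue; here we show just the fallback logic. Types are illustrative.
type Provider = {
  name: string;
  complete: (prompt: string) => Promise<string>;
};

class RateLimitError extends Error {}

async function completeWithFallback(
  providers: Provider[],
  prompt: string,
): Promise<string> {
  let lastError: unknown;
  for (const p of providers) {
    try {
      return await p.complete(prompt);
    } catch (err) {
      // A real bug should surface immediately; only rate limits fall through.
      if (!(err instanceof RateLimitError)) throw err;
      lastError = err; // rate-limited: try the next provider
    }
  }
  throw lastError; // every provider was rate-limited
}
```

The catch is everything from the "Vendor Lock-In" section below: a fallback provider only helps if your prompts and parsers already work on it, which is exactly the migration work that takes weeks.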
We ended up doing all three. The queue system alone took two days to build properly. That's two days of engineering time on infrastructure that has nothing to do with your actual product.
And it gets worse during peak hours. German construction workers start their day at 6 AM. Between 6 and 8 AM, our traffic spikes 4x. Rate limits don't care about your users' schedules.
Quality Degradation Is Real
This one's subtle and dangerous.
GPT-4 in January 2025 was not the same model as GPT-4 in June 2025. Same API endpoint, same model name, different behavior. Our construction-specific prompts that worked perfectly started producing vague answers. Nothing broke. Nothing threw errors. The answers just got worse.
We caught it because we run weekly eval sets — 50 construction questions with known correct answers. Accuracy dropped from 94% to 87% over two months. No code changes on our end.
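The eval harness itself doesn't need to be sophisticated. A sketch of the idea — fixed questions with known answers, scored against whatever the current model returns. The exact-match grading below is deliberately naive (our real checks are looser), but the shape is the same:

```typescript
// Weekly eval sketch: score a model's answers against known-good ones.
// Exact-match grading on normalized strings is a simplification; real
// grading for free-text answers needs fuzzier comparison.
interface EvalCase {
  question: string;
  expected: string;
}

function accuracy(
  cases: EvalCase[],
  answer: (question: string) => string,
): number {
  const normalize = (s: string) => s.trim().toLowerCase();
  const correct = cases.filter(
    (c) => normalize(answer(c.question)) === normalize(c.expected),
  ).length;
  return correct / cases.length;
}
```

Run this on a schedule, chart the number, and alert when it drops. A 94% → 87% slide over two months is invisible in logs and obvious on a graph.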
This is the hidden cost nobody talks about. You're building on someone else's foundation, and they can renovate it without telling you. You need monitoring, evaluation pipelines, and the ability to switch models fast.
What Vendor Lock-In Actually Looks Like
"Just switch to Anthropic if OpenAI gets expensive."
Sure. Let me walk you through what "just switch" means in practice:
Prompt migration. Every model responds differently to the same prompt. Our construction safety prompts — the ones that need to be accurate because wrong answers can cause injuries — took two weeks to re-tune for Claude. Two weeks of an engineer testing edge cases, adjusting system prompts, running eval sets.
Output parsing. Models format responses differently. GPT-4 loves markdown tables. Claude prefers structured lists. If your frontend parses model output (and it does), you're rewriting parsers.
Embedding compatibility. You can't mix embeddings from different providers in the same vector store. Switching embedding models means re-embedding your entire corpus. For us that's thousands of construction documents. The re-embedding job runs about 6 hours and costs around €200.
Behavioral quirks. Claude is more cautious about safety-critical answers (good for construction). GPT-4 is better at structured data extraction. Gemini is cheaper but less reliable on German technical terminology. Each model has a personality that affects your product.
"Just switch" is a two-week project minimum. And that's if you planned for it.
What We Actually Spend (Real Numbers)
After a year of optimizing, here's roughly what BauGPT's AI infrastructure costs monthly:
- LLM completions: ~€1,200 (down from €2,800 after model routing)
- Embeddings: ~€80 (re-embed new documents weekly)
- Vector DB hosting: €0 (self-hosted Postgres + pgvector)
- Queue infrastructure: ~€50 (Redis on a small instance)
- Monitoring and evals: ~€30 (custom scripts, minimal compute)
Total: about €1,360/month for ~500 DAU.
The model routing was the biggest win. Simple queries ("What does DIN 1045 say about minimum cover?") go to a smaller, cheaper model. Complex interpretation questions ("Who's liable under VOB/B §13 if the defect appears after 3 years?") go to the expensive model. This one change cut completion costs by more than half.
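Our production router is a small classifier, but the decision it makes can be illustrated with a keyword-and-length heuristic. Everything here — the model names, the cue list, the 40-word threshold — is a placeholder to show the shape, not our actual rules:

```typescript
// Illustrative routing decision: send queries with legal/interpretation
// cues (or very long queries) to the expensive model, everything else
// to the cheap one. Cues and threshold are made-up examples.
function routeModel(query: string): "cheap-model" | "expensive-model" {
  const interpretationCues = /liab|haft|§|warrant|gewährleist|interpret|dispute/i;
  const isComplex =
    interpretationCues.test(query) || query.split(/\s+/).length > 40;
  return isComplex ? "expensive-model" : "cheap-model";
}
```

The important design property is that routing is a pure function of the query, so you can replay your eval set through it and measure exactly what the cheaper path costs you in accuracy before turning it on.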
The Actual Cost of "Free"
Here's what your free API tier doesn't include:
- Rate limit handling infrastructure (2-5 days engineering)
- Model monitoring and eval pipeline (3-5 days engineering)
- Multi-model routing for cost optimization (1-2 weeks engineering)
- Fallback system for outages (1-2 days engineering)
- Prompt migration capability (ongoing, ~2 weeks per new model)
That's roughly 4-6 weeks of engineering work before you have a production-ready AI system. At startup engineering rates, that's €20-40K in labor.
The API credits? Those were maybe worth €500.
What I'd Do Differently
If I started over:
Build for multi-model from day one. Abstract the LLM layer behind an interface. We didn't do this and paid for it later. It took us a week to refactor when we should have spent a day on it upfront.
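What "abstract the LLM layer behind an interface" means concretely: call sites depend on one interface, and each provider gets an adapter behind it. The names below are illustrative (the fake client stands in for a real SDK call):

```typescript
// The multi-model abstraction sketched: one interface, one adapter per
// provider. Interface shape and names are illustrative.
interface LlmClient {
  complete(opts: {
    system: string;
    user: string;
    maxTokens?: number;
  }): Promise<string>;
}

// A stand-in adapter; a real one would wrap a provider SDK and map its
// request/response shapes onto this interface.
class FakeProviderClient implements LlmClient {
  async complete(opts: { system: string; user: string }): Promise<string> {
    return `[${opts.system}] ${opts.user}`;
  }
}

// Call sites only see LlmClient, so swapping providers is a change in
// the wiring, not a codebase-wide refactor.
async function answerQuery(llm: LlmClient, question: string): Promise<string> {
  return llm.complete({
    system: "You are a construction assistant.",
    user: question,
  });
}
```

The interface doesn't make prompt re-tuning or re-embedding free — nothing does — but it turns "switch providers" from a grep-the-codebase project into a single new adapter plus eval runs.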
Set up evals before launch. Not after you notice quality dropping. Before. 50 test cases with known answers. Run them weekly. Automate the comparison.
Budget 3x your estimated API costs. Whatever you think AI will cost in production, triple it. You'll hit rate limit upgrades, re-embedding costs, and usage spikes you didn't predict.
Self-host what you can. We moved embeddings to a local model and vector search to pgvector on our existing Postgres. Saved about €400/month and eliminated two external dependencies.
The "free" AI API is a great way to prototype. It's a terrible way to plan a business.