What I Learned Shipping LLMs to Production
“It's not about the model size. It's about everything else.”
The first time I deployed an LLM to production, I thought the hard part was over. I had fine-tuned the model, benchmarked it against a dozen metrics, and even gotten approval from the legal team. Easy, right?
Wrong. So very wrong.
The Wake-Up Call
Three hours after launch, our API response times spiked to 15 seconds. Users were abandoning sessions. The support queue was filling up. And I was debugging in production at 2am with a cold cup of coffee.
Here's what nobody tells you about production LLMs:
1. Latency Is Everything
Users will wait 2 seconds. Maybe 3 if they're patient. After that, they're gone. Your beautiful 70B-parameter model means nothing if it takes 10 seconds to respond.
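The cheapest win we found was streaming, because perceived latency is time-to-first-token, not total generation time. Here's a minimal sketch of the idea; `generate_stream` is a hypothetical stand-in for whatever streaming API your model actually serves:

```python
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call
    (e.g. an HTTP endpoint that yields tokens as they arrive)."""
    for token in ["Hello", ", ", "world", "!"]:
        time.sleep(0.5)  # simulated per-token latency
        yield token

def respond(prompt: str) -> str:
    """Flush tokens to the user as they arrive instead of waiting
    for the full completion. The user sees progress after the first
    token, even if the whole answer takes much longer."""
    start = time.monotonic()
    chunks = []
    for i, token in enumerate(generate_stream(prompt)):
        if i == 0:
            print(f"[first token after {time.monotonic() - start:.2f}s]")
        print(token, end="", flush=True)  # send to the client immediately
        chunks.append(token)
    print(f"\n[full response after {time.monotonic() - start:.2f}s]")
    return "".join(chunks)

respond("say hello")
```

Same model, same total generation time, but the session feels alive after half a second instead of dead for two.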
2. Context Windows Are Expensive
Every token in your context window costs money and time. I learned to treat context like premium real estate.
“The best context is the context you don't need to include.”
We built a retrieval system that surfaces only the information relevant to the current query; that alone cut our costs by 60%.
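The core idea is just ranking plus a hard token budget. A minimal sketch, with word overlap standing in for the embedding similarity we actually used and a whitespace split standing in for a real tokenizer (both are assumptions to keep the example self-contained):

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score (word overlap). In production, use
    embedding cosine similarity or a proper retriever."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def build_context(query: str, chunks: list[str], token_budget: int = 512) -> str:
    """Rank chunks by relevance and pack only the best ones into a
    fixed token budget: context as premium real estate."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    picked, used = [], 0
    for ch in ranked:
        n_tokens = len(ch.split())  # crude estimate; use a real tokenizer
        if used + n_tokens > token_budget:
            break
        picked.append(ch)
        used += n_tokens
    return "\n\n".join(picked)

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office dog is named Biscuit.",
    "To request a refund, email support with your order ID.",
]
print(build_context("how do I get a refund", chunks, token_budget=20))
```

The budget is the important part: it turns "include everything that might help" into a forced ranking decision on every request.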
3. Graceful Degradation Saves Lives
When the model fails (and it will), have a plan. A simple rule-based fallback is infinitely better than a spinner that never stops.
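A minimal sketch of what that plan can look like: a hard deadline on every model call, with a canned reply when it blows. `call_model` is a hypothetical placeholder for your real inference client:

```python
import concurrent.futures
import time

# Shared pool so answer() returns the moment the deadline passes;
# a hung call just occupies a worker thread until it gives up.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

FALLBACK_ANSWER = (
    "Sorry, I can't generate a full answer right now. "
    "Try rephrasing your question, or check the FAQ."
)

def call_model(prompt: str) -> str:
    """Hypothetical model call; swap in your real inference client."""
    time.sleep(10)  # simulate a model having a very bad night
    return "a real completion"

def answer(prompt: str, timeout_s: float = 3.0) -> str:
    """Give the model a hard deadline, then degrade to a rule-based
    reply instead of a spinner that never stops."""
    future = _pool.submit(call_model, prompt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # timeout, OOM, provider outage: all degrade the same way
        return FALLBACK_ANSWER

print(answer("why is my order late?"))  # prints the fallback after ~3s
```

A canned answer isn't great, but it's honest, fast, and keeps the user in the product instead of refreshing the page.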
The Real Lesson
The model is maybe 20% of the work. The other 80%? Infrastructure, monitoring, fallbacks, rate limiting, cost optimization, and explaining to stakeholders why the AI sometimes says weird things.
Ship ugly, ship early, ship often. Then make it beautiful.
margin scribbles:
— 2am debugging sessions build character... allegedly
— our first bill was $12k. for ONE day.
— I should put this on a t-shirt
thanks for reading!
→ found this useful? let me know at hello@meghavi.me