June 15, 2024

What I Learned Shipping LLMs to Production


“It's not about the model size. It's about everything else.”

The first time I deployed an LLM to production, I thought the hard part was over. I had fine-tuned the model, benchmarked it against a dozen metrics, and even got approval from the legal team. Easy, right?

Wrong. So very wrong.

The Wake-Up Call

Three hours after launch, our API response times spiked to 15 seconds. Users were abandoning sessions. The support queue was filling up. And I was debugging in production at 2am with a cold cup of coffee.

Here's what nobody tells you about production LLMs:

1. Latency is Everything

Users will wait 2 seconds. Maybe 3 if they're patient. After that, they're gone. Your beautiful 70B parameter model means nothing if it takes 10 seconds to respond.

What actually worked:

• Aggressive caching (90% of queries are variations of the same thing)
• Streaming responses (start showing output immediately)
• Model distillation (sometimes 7B is enough)
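To make the caching point concrete, here's a minimal sketch of the idea, not our actual implementation. The `ResponseCache` class and its `_normalize` helper are invented for illustration: the trick is keying the cache on a normalized query so near-duplicate phrasings share one entry.

```python
class ResponseCache:
    """Toy response cache keyed on a normalized query string."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivial variations hit the same entry.
        return " ".join(query.lower().split())

    def get(self, query: str):
        return self._store.get(self._normalize(query))

    def put(self, query: str, response: str) -> None:
        self._store[self._normalize(query)] = response
```

In production you'd also want TTLs, an eviction policy, and embedding-based matching to catch paraphrases that string normalization misses.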
2. Context Windows Are Expensive

Every token in your context window costs money and time. I learned to treat context like premium real estate.

“The best context is the context you don't need to include.”

We built a retrieval system that surfaces only the most relevant information. Cut our costs by 60%.
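A crude sketch of the context-trimming idea (all names hypothetical; real retrieval scores with embeddings, and plain word overlap stands in here): rank candidate chunks by relevance to the query, then pack only the best ones into a fixed token budget.

```python
def score(query: str, chunk: str) -> float:
    # Stand-in relevance score: fraction of query words appearing in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def build_context(query: str, chunks: list[str], budget: int = 512) -> str:
    """Greedily pack the most relevant chunks into a token budget."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    picked, used = [], 0
    for ch in ranked:
        tokens = len(ch.split())  # rough token estimate, not a real tokenizer
        if used + tokens > budget:
            continue
        picked.append(ch)
        used += tokens
    return "\n".join(picked)
```

The budget cap is what saves money: everything that doesn't make the cut is context you didn't pay for.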

3. Graceful Degradation Saves Lives

When the model fails (and it will), have a plan. A simple rule-based fallback is infinitely better than a spinner that never stops.
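As a sketch of that fallback pattern (the `call_model` argument is a stand-in for whatever your inference client is): wrap the call in a timeout and return a canned rule-based reply on any failure, so the user always gets something instead of an endless spinner.

```python
import concurrent.futures

FALLBACK = "Sorry, something went wrong. Please try again or browse the FAQ."

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def answer_with_fallback(query: str, call_model, timeout_s: float = 3.0) -> str:
    """Return the model's answer, or a deterministic fallback on timeout/error."""
    future = _pool.submit(call_model, query)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout, rate limit, connection reset, whatever: degrade gracefully.
        return FALLBACK
```

Note that a timed-out worker thread keeps running in the background; a real deployment would also cancel or fence off the abandoned request.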

The Real Lesson

The model is maybe 20% of the work. The other 80%? Infrastructure, monitoring, fallbacks, rate limiting, cost optimization, and explaining to stakeholders why the AI sometimes says weird things.

Ship ugly, ship early, ship often. Then make it beautiful.

margin scribbles:

— 2am debugging sessions build character... allegedly

— our first bill was $12k. for ONE day.

— I should put this on a t-shirt

thanks for reading!

→ found this useful? let me know at hello@meghavi.me


Meghavi Rao

Applied AI Engineer
