What I Learned Shipping LLMs to Production
“It's not about the model size. It's about everything else.”
The first time I deployed an LLM to production, I thought the hard part was over. I had fine-tuned the model, benchmarked it against a dozen metrics, and even gotten approval from the legal team. Easy, right?
Wrong. So very wrong.
The Wake-Up Call
Three hours after launch, our API response times spiked to 15 seconds. Users were abandoning sessions. The support queue was filling up. And I was debugging in production at 2am with a cold cup of coffee.
Here's what nobody tells you about production LLMs:
1. Latency Is Everything
Users will wait 2 seconds. Maybe 3 if they're patient. After that, they're gone. Your beautiful 70B-parameter model means nothing if it takes 10 seconds to respond.
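The cheapest win we found was streaming, because perceived latency is time-to-first-token, not total generation time. Here's a minimal sketch of the idea; `generate_stream` is a hypothetical stand-in for whatever streaming API your model actually serves:

```python
import time
from typing import Iterator

def generate_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call
    (e.g. an HTTP endpoint that yields tokens as they arrive)."""
    for token in ["Hello", ", ", "world", "!"]:
        time.sleep(0.5)  # simulated per-token latency
        yield token

def respond(prompt: str) -> str:
    """Flush tokens to the user as they arrive instead of waiting
    for the full completion. The user sees progress after the first
    token, even if the whole answer takes much longer."""
    start = time.monotonic()
    chunks = []
    for i, token in enumerate(generate_stream(prompt)):
        if i == 0:
            print(f"[first token after {time.monotonic() - start:.2f}s]")
        print(token, end="", flush=True)  # send to the client immediately
        chunks.append(token)
    print(f"\n[full response after {time.monotonic() - start:.2f}s]")
    return "".join(chunks)

respond("say hello")
```

Same model, same total generation time, but the session feels alive after half a second instead of dead for two.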
2. Context Windows Are Expensive
Every token in your context window costs money and time. I learned to treat context like premium real estate.
“The best context is the context you don't need to include.”
We built a retrieval system that surfaces only the information relevant to the current query; that alone cut our costs by 60%.
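The core idea is just ranking plus a hard token budget. A minimal sketch, with word overlap standing in for the embedding similarity we actually used and a whitespace split standing in for a real tokenizer (both are assumptions to keep the example self-contained):

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score (word overlap). In production, use
    embedding cosine similarity or a proper retriever."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def build_context(query: str, chunks: list[str], token_budget: int = 512) -> str:
    """Rank chunks by relevance and pack only the best ones into a
    fixed token budget: context as premium real estate."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    picked, used = [], 0
    for ch in ranked:
        n_tokens = len(ch.split())  # crude estimate; use a real tokenizer
        if used + n_tokens > token_budget:
            break
        picked.append(ch)
        used += n_tokens
    return "\n\n".join(picked)

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office dog is named Biscuit.",
    "To request a refund, email support with your order ID.",
]
print(build_context("how do I get a refund", chunks, token_budget=20))
```

The budget is the important part: it turns "include everything that might help" into a forced ranking decision on every request.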
3. Graceful Degradation Saves Lives
When the model fails (and it will), have a plan. A simple rule-based fallback is infinitely better than a spinner that never stops.
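A minimal sketch of what that plan can look like: a hard deadline on every model call, with a canned reply when it blows. `call_model` is a hypothetical placeholder for your real inference client:

```python
import concurrent.futures
import time

# Shared pool so answer() returns the moment the deadline passes;
# a hung call just occupies a worker thread until it gives up.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

FALLBACK_ANSWER = (
    "Sorry, I can't generate a full answer right now. "
    "Try rephrasing your question, or check the FAQ."
)

def call_model(prompt: str) -> str:
    """Hypothetical model call; swap in your real inference client."""
    time.sleep(10)  # simulate a model having a very bad night
    return "a real completion"

def answer(prompt: str, timeout_s: float = 3.0) -> str:
    """Give the model a hard deadline, then degrade to a rule-based
    reply instead of a spinner that never stops."""
    future = _pool.submit(call_model, prompt)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # timeout, OOM, provider outage: all degrade the same way
        return FALLBACK_ANSWER

print(answer("why is my order late?"))  # prints the fallback after ~3s
```

A canned answer isn't great, but it's honest, fast, and keeps the user in the product instead of refreshing the page.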
The Real Lesson
The model is maybe 20% of the work. The other 80%? Infrastructure, monitoring, fallbacks, rate limiting, cost optimization, and explaining to stakeholders why the AI sometimes says weird things.
Ship ugly, ship early, ship often. Then make it beautiful.
margin scribbles:
— 2am debugging sessions build character... allegedly
— our first bill was $12k. for ONE day.
— I should put this on a t-shirt
thanks for reading!
→ found this useful? let me know at hello@meghavi.me