November 28, 2023

Debugging ML Systems: A Survival Guide

10 min

“When print statements aren't enough.”

You've been there. The model works on your laptop. It works in the notebook. But in production? Complete garbage.

Welcome to ML debugging hell. I've spent way too much time here. Let me share what I've learned.

The Fundamental Problem

ML systems fail silently. A bug in traditional software crashes. A bug in ML just... produces slightly wrong results. Maybe. Sometimes. Under certain conditions.

This is why debugging ML is so hard.

The Debugging Hierarchy

When something goes wrong, check these in order:

1. Data, Data, Data

90% of ML bugs are data bugs. I'm not exaggerating.

Training/serving skew - Your training data looks nothing like production

Label noise - Your labels are wrong

Feature leakage - You're cheating without knowing it

Distribution shift - The world changed, your model didn't

2. The Preprocessing Pipeline

The second most common source of bugs. Things I've seen:

Normalization applied during training but not inference

Tokenizers with different vocabularies

Image resizing with different interpolation methods

Timezone bugs (always timezone bugs)

3. The Model Itself

Only after ruling out data and preprocessing should you look at the model.

Check gradients (vanishing? exploding?)

Check activations (all zeros? all the same?)

Check loss curves (actually decreasing?)

Practical Tools

Assertions everywhere. Check shapes. Check ranges. Check types. Be paranoid.

Log everything. Input, output, intermediate states. You'll need it later.

Build a test set you trust. If you can't measure it, you can't debug it.

Version control your data. DVC is your friend.

The Hardest Lesson

Sometimes the model is working exactly as expected. The problem is your expectations.

“The model isn't wrong. Your mental model of the model is wrong.”

Step back. Understand what the model actually learned. Often, it's doing something clever that you didn't anticipate.

Final Thoughts

ML debugging is a skill. Like any skill, it gets easier with practice. Keep a debug journal. Write down what went wrong and how you fixed it. Future you will thank present you.

And always, always check the data first.

margin scribbles:

— I have PTSD from timezone bugs

— literally 90% of my debugging time

— I need to frame this quote

thanks for reading!

→ found this useful? let me know at hello@meghavi.me

more thoughts:

What I Learned Shipping LLMs to Production

8 min

Embeddings: The Unsung Hero of Modern AI

6 min