November 28, 2023

Debugging ML Systems: A Survival Guide

“When print statements aren't enough.”

You've been there. The model works on your laptop. It works in the notebook. But in production? Complete garbage.

Welcome to ML debugging hell. I've spent way too much time here. Let me share what I've learned.

The Fundamental Problem

ML systems fail silently. In traditional software, a bug usually crashes the program. In ML, a bug just... produces slightly wrong results. Maybe. Sometimes. Under certain conditions.

This is why debugging ML is so hard.

The Debugging Hierarchy

When something goes wrong, check these in order:

1. Data, Data, Data

In my experience, about 90% of ML bugs are data bugs. I'm not exaggerating.

  • Training/serving skew - Your training data looks nothing like production
  • Label noise - Your labels are wrong
  • Feature leakage - You're cheating without knowing it
  • Distribution shift - The world changed, your model didn't
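
To make the skew and shift checks concrete, here is a minimal sketch that compares per-feature means between a training sample and a serving sample. The function name `skew_report`, the feature names, and the 0.5 standard-deviation threshold are all illustrative assumptions, not a standard; a real pipeline would likely use a proper statistical test on much larger samples.

```python
import math
import statistics


def skew_report(train, serving, threshold=0.5):
    """Flag features whose serving mean drifts away from the training mean.

    train / serving: dict mapping feature name -> list of numeric values.
    A feature is flagged when its serving mean is more than `threshold`
    training standard deviations from the training mean (assumed cutoff).
    """
    flagged = {}
    for name, train_values in train.items():
        if name not in serving:
            flagged[name] = math.inf  # feature missing at serving time
            continue
        mu = statistics.fmean(train_values)
        sigma = statistics.pstdev(train_values)
        shift = abs(statistics.fmean(serving[name]) - mu)
        # Guard against constant training features, where sigma is 0.
        if sigma > 0:
            score = shift / sigma
        else:
            score = 0.0 if shift == 0 else math.inf
        if score > threshold:
            flagged[name] = score
    return flagged


# Hypothetical samples: "clicks" was logged in different units in production.
train = {"age": [20, 30, 40, 50], "clicks": [0, 1, 2, 3]}
serving = {"age": [21, 29, 41, 49], "clicks": [100, 120, 90, 110]}
report = skew_report(train, serving)
print(report)
```

Running something like this on a schedule, against a fresh slice of production inputs, tends to catch skew long before anyone notices the predictions going bad.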

2. The Preprocessing Pipeline

The second most common source of bugs. Things I've seen:

  • Normalization applied during training but not inference
  • Tokenizers with different vocabularies
  • Image resizing with different interpolation methods
  • Timezone bugs (always timezone bugs)
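
One way to avoid the first bug on that list is to fit normalization parameters once, on training data, and ship them with the model so serving loads the exact same numbers. The sketch below assumes a single numeric feature and a hypothetical `Normalizer` class; it is an illustration of the idea, not any particular library's API.

```python
import json
import statistics
from dataclasses import asdict, dataclass


@dataclass
class Normalizer:
    """Standardization parameters learned once, on training data only."""

    mean: float
    std: float

    @classmethod
    def fit(cls, values):
        std = statistics.pstdev(values)
        # Fall back to 1.0 so a constant feature does not divide by zero.
        return cls(mean=statistics.fmean(values), std=std or 1.0)

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload):
        return cls(**json.loads(payload))


# Training side: fit, then save the parameters alongside the model artifact.
train_values = [10.0, 20.0, 30.0]
saved = Normalizer.fit(train_values).to_json()

# Serving side: load the saved parameters instead of refitting or skipping.
serving_normalizer = Normalizer.from_json(saved)
print(serving_normalizer.transform([20.0, 40.0]))
```

The same pattern applies to tokenizer vocabularies and image resizing settings: treat them as part of the model artifact, not as code that happens to live in two places.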

3. The Model Itself

Only after ruling out data and preprocessing should you look at the model.

  • Check gradients (vanishing? exploding?)
  • Check activations (all zeros? all the same?)
  • Check loss curves (actually decreasing?)
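
The three checks above can be sketched without committing to a framework. This example assumes you have already pulled gradients and activations out of your model as flat lists of floats (how you do that depends on your framework). The function names and thresholds are illustrative assumptions, not universal constants.

```python
import math


def grad_norm_report(grads, vanish_below=1e-7, explode_above=1e3):
    """Label each layer's gradient by its L2 norm.

    grads: dict mapping layer name -> flat list of gradient values.
    """
    report = {}
    for layer, values in grads.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            report[layer] = "non-finite"
            continue
        norm = math.sqrt(sum(v * v for v in values))
        if norm < vanish_below:
            report[layer] = "vanishing"
        elif norm > explode_above:
            report[layer] = "exploding"
        else:
            report[layer] = "ok"
    return report


def activations_look_dead(values, tol=1e-12):
    """True if every activation is (nearly) the same value, including all zeros."""
    return max(values) - min(values) <= tol


def loss_is_decreasing(losses, window=3):
    """True if the mean of the last `window` losses is below the first `window`."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head


# Hypothetical values for three layers of a small network.
grads = {
    "embed": [0.0, 1e-9, -1e-9],
    "hidden": [0.1, -0.2, 0.05],
    "head": [5e3, -2e3, 1e3],
}
print(grad_norm_report(grads))
print(activations_look_dead([0.0, 0.0, 0.0]))
print(loss_is_decreasing([2.3, 2.1, 2.0, 1.6, 1.5, 1.4]))
```

Comparing averaged windows rather than individual steps keeps a noisy loss curve from triggering false alarms.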

Practical Tools

  • Assertions everywhere. Check shapes. Check ranges. Check types. Be paranoid.
  • Log everything. Input, output, intermediate states. You'll need it later.
  • Build a test set you trust. If you can't measure it, you can't debug it.
  • Version control your data. DVC is your friend.
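
Here is a rough sketch of what "assertions everywhere" plus logging might look like at the entry point of an inference function. The expected feature count, the value range, and the `validate_batch` name are hypothetical; substitute whatever your model actually expects.

```python
import logging
import math

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference")

EXPECTED_FEATURES = 3  # hypothetical model input width


def validate_batch(batch, low=-10.0, high=10.0):
    """Fail loudly on malformed input instead of predicting garbage."""
    assert len(batch) > 0, "empty batch"
    for i, row in enumerate(batch):
        # Shape check: every row has the width the model was trained on.
        assert len(row) == EXPECTED_FEATURES, (
            f"row {i}: expected {EXPECTED_FEATURES} features, got {len(row)}"
        )
        for j, value in enumerate(row):
            # Type and range checks: floats only, finite, within assumed bounds.
            assert isinstance(value, float), f"row {i}, col {j}: {type(value).__name__}"
            assert math.isfinite(value), f"row {i}, col {j}: non-finite value"
            assert low <= value <= high, f"row {i}, col {j}: {value} out of range"
    log.info("validated batch: rows=%d", len(batch))
    return batch


validate_batch([[0.1, -1.2, 3.0], [0.0, 0.5, 9.9]])

try:
    validate_batch([[0.1, -1.2, 300.0]])  # an un-normalized feature slipped through
except AssertionError as err:
    log.error("rejected batch: %s", err)
```

One caveat: Python strips `assert` statements when run with the `-O` flag, so for checks that must survive in production, raising an explicit exception is the safer choice.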

The Hardest Lesson

Sometimes the model is working exactly as expected. The problem is your expectations.

“The model isn't wrong. Your mental model of the model is wrong.”

Step back. Understand what the model actually learned. Often, it's doing something clever that you didn't anticipate.

Final Thoughts

ML debugging is a skill. Like any skill, it gets easier with practice. Keep a debug journal. Write down what went wrong and how you fixed it. Future you will thank present you.

And always, always check the data first.

margin scribbles:

  • I have PTSD from timezone bugs
  • literally 90% of my debugging time
  • I need to frame this quote

thanks for reading!

→ found this useful? let me know at hello@meghavi.me

Meghavi Rao

Applied AI Engineer