Debugging ML Systems: A Survival Guide
“When print statements aren't enough.”
You've been there. The model works on your laptop. It works in the notebook. But in production? Complete garbage.
Welcome to ML debugging hell. I've spent way too much time here. Let me share what I've learned.
The Fundamental Problem
ML systems fail silently. A bug in traditional software usually crashes or throws an error. A bug in ML just... produces slightly wrong results. Maybe. Sometimes. Under certain conditions.
This is why debugging ML is so hard.
The Debugging Hierarchy
When something goes wrong, check these in order:
1. Data, Data, Data
In my experience, about 90% of ML bugs are data bugs. I'm not exaggerating.
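Here is a minimal sketch of the kind of data check I mean. It assumes your records fit in memory as a list of dicts with a `label` field; `summarize_dataset` and the sample rows are made up for illustration, not taken from any library.

```python
from collections import Counter


def summarize_dataset(rows, label_key="label"):
    """Return basic health stats for a list of dict records (hypothetical helper)."""
    # Count missing values (None or NaN) per field.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            is_nan = isinstance(value, float) and value != value
            if value is None or is_nan:
                missing[key] += 1

    # Exact duplicate rows often point to a bad join or leakage between splits.
    fingerprints = Counter(
        tuple(sorted((key, repr(value)) for key, value in row.items()))
        for row in rows
    )
    duplicates = sum(count - 1 for count in fingerprints.values())

    # Label balance: compare this against what you saw at training time.
    labels = Counter(row.get(label_key) for row in rows)

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "labels": dict(labels),
    }


# Made-up sample data with one missing age, one NaN income, and one duplicate.
rows = [
    {"age": 34, "income": 52000.0, "label": 1},
    {"age": None, "income": 48000.0, "label": 0},
    {"age": 34, "income": 52000.0, "label": 1},
    {"age": 51, "income": float("nan"), "label": 0},
]
report = summarize_dataset(rows)
print(report)
```

Run something like this on both the training data and a sample of production inputs. If the two reports disagree, you have a lead before you have touched the model.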
2. The Preprocessing Pipeline
The second most common source of bugs. Things I've seen:
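To make this concrete, here is a sketch of one bug in that family: training/serving skew, where the training code and the serving code compute the "same" feature differently. Everything in it is an assumption for illustration. `train_hour_feature`, `serve_hour_feature`, and `find_skew` are hypothetical names, and the timezone mix-up is just one plausible way the two paths can drift apart.

```python
from datetime import datetime, timezone


def train_hour_feature(timestamp: str) -> int:
    """Hypothetical training-time feature: hour of day, normalized to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc).hour


def serve_hour_feature(timestamp: str) -> int:
    """Hypothetical serving-time feature with a bug: uses the wall-clock hour as given."""
    return datetime.fromisoformat(timestamp).hour


def find_skew(samples, train_fn, serve_fn):
    """Return (input, train_value, serve_value) for every input where the paths disagree."""
    return [
        (sample, train_fn(sample), serve_fn(sample))
        for sample in samples
        if train_fn(sample) != serve_fn(sample)
    ]


samples = [
    "2024-03-10T12:00:00+00:00",  # already UTC, so both paths agree
    "2024-03-10T23:30:00-08:00",  # UTC-8 offset, which is 07:30 UTC the next day
]
mismatches = find_skew(samples, train_hour_feature, serve_hour_feature)
print(mismatches)
```

Nothing crashes here. The serving path happily returns an hour, just not the one the model was trained on. Feeding the same raw inputs through both paths and diffing the outputs is a cheap way to catch it.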
3. The Model Itself
Only after ruling out data and preprocessing should you look at the model.
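One common sanity check at this stage (my suggestion, not the only way to do it) is to see whether the model can overfit a tiny batch. If it cannot drive the loss to near zero on four examples, something in the model or training loop is likely broken. The sketch below uses a one-feature linear model and plain gradient descent in pure Python so it stays self-contained; `fit_tiny_batch` and the data are made up for illustration.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_tiny_batch(xs, ys, lr=0.1, steps=500):
    """Fit y = w*x + b with plain gradient descent, starting from zero."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Four made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

initial_loss = mse(0.0, 0.0, xs, ys)
w, b = fit_tiny_batch(xs, ys)
final_loss = mse(w, b, xs, ys)
print(f"loss went from {initial_loss:.2f} to {final_loss:.2e}")
```

The same idea carries over to a real model: take a handful of examples, turn off regularization and augmentation, and train until the loss is close to zero. If it stalls, look at the loss function, the labels, and the optimizer settings before anything fancier.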
Practical Tools
The Hardest Lesson
Sometimes the model is working exactly as trained. The problem is your expectations.
“The model isn't wrong. Your mental model of the model is wrong.”
Step back. Understand what the model actually learned. Often, it's doing something clever that you didn't anticipate.
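One way to see what a model leans on is permutation importance: shuffle one feature column at a time and measure how much accuracy drops. The sketch below is a bare-bones version in pure Python, not a replacement for a library implementation. The `predict` function stands in for a trained model and is hypothetical, as is the toy data.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, seed=0):
    """Return the accuracy drop caused by shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the dataset with only column j shuffled.
        shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
        drops.append(baseline - accuracy(predict, shuffled, y))
    return drops


def predict(row):
    """Hypothetical stand-in for a trained model: it only looks at feature 0."""
    return int(row[0] > 0.5)


# Toy data: feature 0 determines the label, feature 1 is unrelated.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [i % 2 for i in range(20)]

drops = permutation_importance(predict, X, y)
print(drops)
```

A feature whose shuffle costs nothing is one the model ignores. A feature whose shuffle tanks accuracy is one it depends on. When that ranking surprises you, the gap between the two is usually where your mental model went wrong.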
Final Thoughts
ML debugging is a skill. Like any skill, it gets easier with practice. Keep a debug journal. Write down what went wrong and how you fixed it. Future you will thank present you.
And always, always check the data first.
margin scribbles:
— I have PTSD from timezone bugs
— literally 90% of my debugging time
— I need to frame this quote
thanks for reading!
→ found this useful? let me know at hello@meghavi.me