A Hidden Error That Can Seriously Affect Your Deep Learning Models
...and here’s how to prevent it.
Deep learning models may fail to converge due to various reasons.
Some causes are common and easy to spot, and therefore quickly rectifiable: a learning rate that is too high or too low, missing data normalization, missing batch normalization, and so on.
But the problem arises when the cause isn’t that apparent. It can take serious time to debug if you are unaware of it.
Today, I want to talk about one such data-related mistake, which I once committed during my early days in machine learning. Admittedly, it took me quite some time to figure it out back then because I had no idea about the issue.
Consider a classification neural network trained using mini-batch gradient descent:
Mini-batch gradient descent: Update network weights using a few data points at a time.
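To make the setting concrete, here is a minimal sketch of mini-batch gradient descent on a toy logistic-regression model (all data and hyperparameters below are illustrative, not from the experiment in this post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 2 features, binary labels (purely illustrative)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # model weights
b = 0.0           # bias
lr = 0.1
batch_size = 32

# Mini-batch gradient descent: the weights are updated after every
# small batch, not once after seeing the entire dataset.
for start in range(0, len(X), batch_size):
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    p = 1 / (1 + np.exp(-(xb @ w + b)))   # sigmoid predictions
    grad_w = xb.T @ (p - yb) / len(xb)    # gradient of the log loss
    grad_b = np.mean(p - yb)
    w -= lr * grad_w                      # update per mini-batch
    b -= lr * grad_b
```

The key point for what follows: each update here depends on the weights left behind by the previous mini-batch, so the order of the batches matters.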
Here, we train two different neural networks:
Version 1: The dataset is ordered by labels.
Version 2: The dataset is properly shuffled, so the labels appear in random order.
And, of course, before training, we ensure that both networks have the same initial weights, learning rate, and other settings.
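The two orderings can be sketched as follows (the dataset here is a hypothetical stand-in; only the index manipulation matters):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: features X and integer class labels y
X = rng.normal(size=(600, 4))
y = rng.integers(0, 3, size=600)

# Version 1: order the samples by label (all class-0 first, then class-1, ...)
order_v1 = np.argsort(y, kind="stable")
X_v1, y_v1 = X[order_v1], y[order_v1]

# Version 2: shuffle the samples uniformly at random
order_v2 = rng.permutation(len(y))
X_v2, y_v2 = X[order_v2], y[order_v2]

# With a batch size of 32, version 1's first mini-batch contains a single
# class, while version 2's first mini-batch mixes classes.
```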
The following image depicts the epoch-by-epoch performance of the two models. On the left, we have the model trained on label-ordered data, and the one on the right was trained on the shuffled dataset.
It is clear that the model receiving a label-ordered dataset miserably fails to converge while the other model converges seamlessly.
Why does that happen?
Now, if you think about it for a second, overall, both models received the same data, didn’t they?
Yet, the order in which the data was fed to these models totally determined their performance.
I vividly remember that when I faced this issue, I knew that my data was ordered by labels.
Yet, it never occurred to me that the ordering might influence model performance, because the data is the same regardless of how it is ordered.
But later, I realized that this holds only when the model sees the entire dataset and updates the weights in one go, i.e., in batch gradient descent, as depicted below:
But in the case of mini-batch gradient descent, the weights are updated after every mini-batch.
Thus, the prediction and weight update on each subsequent mini-batch are influenced by the previous mini-batches.
With label-ordered data, where samples of the same class are grouped together, mini-batch gradient descent leads the model to learn patterns specific to the class it saw almost exclusively early in training.
In contrast, randomly ordered data ensures that each mini-batch contains a balanced representation of classes. This allows the model to learn a more comprehensive set of features throughout the training process.
Of course, the idea of shuffling is not valid for time-series datasets as their temporal structure is important.
The good thing is that if you happen to use, say, PyTorch’s DataLoader, fixing this is a one-liner: pass shuffle=True (note that it defaults to False, so shuffling is not automatic). But if you have a custom implementation, ensure that you are not making any such error.
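As a quick sketch, here is how per-epoch shuffling is enabled with PyTorch’s DataLoader (the tensors below are illustrative stand-ins for a real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only for a reproducible shuffle in this sketch

# Hypothetical tensors standing in for features and labels; the labels are
# deliberately label-ordered: 50 samples of class 0, then 50 of class 1.
X = torch.randn(100, 4)
y = (torch.arange(100) >= 50).long()

dataset = TensorDataset(X, y)

# shuffle=True re-shuffles the sample order at the start of every epoch.
# Note: the default is shuffle=False, so you must opt in explicitly.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for xb, yb in loader:
    pass  # your training step runs on each randomly composed mini-batch
```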
Before I end, one thing that you must ALWAYS remember when training neural networks is that these models can proficiently learn patterns that do not actually exist in your dataset. So never give them any chance to do so.
👉 Over to you: What are some other uncommon sources of error in training deep learning models?
Thanks for reading Daily Dose of Data Science! Subscribe for free to learn something new and insightful about Python and Data Science every day. Also, get a Free Data Science PDF (550+ pages) with 320+ tips.
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!