Daily Dose of Data Science


Why Is It Important To Shuffle Your Dataset Before Training An ML Model

…And what happens if you don't.

Avi Chawla
Mar 18, 2023

ML models may fail to converge for many reasons. Here is one that is often overlooked.

If your training data is ordered by label, the model's convergence and accuracy can suffer. It is a mistake that easily goes unnoticed.

In the above demonstration, I trained two neural nets on the same data. Both networks had the same initial weights, learning rate, and other settings.

However, in one of them, the data was ordered by labels, while in the other, it was randomly shuffled.

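As shown, the model receiving a label-ordered dataset fails to converge. When the data is sorted by label, each batch contains mostly (or only) one class, so every update pulls the weights toward whichever class is currently being seen. Shuffling the dataset instead gives the network a more representative sample of the data in each batch, which leads to better generalization and performance.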
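
To make the batch-composition point concrete, here is a minimal sketch using only the standard library. The toy dataset (3 classes, 8 samples each), the batch size, and the helper `batch_label_counts` are assumptions made up for illustration; this is not the code behind the demonstration.

```python
import random
from collections import Counter


def batch_label_counts(labels, batch_size):
    """Return one Counter of labels per consecutive mini-batch."""
    return [
        Counter(labels[i:i + batch_size])
        for i in range(0, len(labels), batch_size)
    ]


# Hypothetical toy dataset: 3 classes, 8 samples each, sorted by label.
ordered_labels = [0] * 8 + [1] * 8 + [2] * 8

# Shuffle a copy with a fixed seed so the run is repeatable.
shuffled_labels = ordered_labels[:]
random.Random(0).shuffle(shuffled_labels)

ordered_batches = batch_label_counts(ordered_labels, batch_size=8)
shuffled_batches = batch_label_counts(shuffled_labels, batch_size=8)

for name, batches in [("ordered", ordered_batches), ("shuffled", shuffled_batches)]:
    print(name)
    for i, counts in enumerate(batches):
        print(f"  batch {i}: {dict(sorted(counts.items()))}")
```

In the ordered case every batch holds a single class, so each gradient step only ever sees one label. After shuffling, batches will typically contain a mix of classes.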

In general, it's good practice to shuffle the dataset before training. This keeps the model from picking up spurious patterns that come from the order of the labels rather than from the data itself.
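
One detail that is easy to get wrong: the features and the labels have to be shuffled with the same permutation, otherwise the samples end up paired with the wrong labels. A small sketch of that idea, where `shuffle_together` and the toy data are hypothetical names chosen for this example:

```python
import random


def shuffle_together(features, labels, seed=None):
    """Shuffle features and labels using one shared permutation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    # Apply the same index order to both lists so pairs stay aligned.
    return [features[i] for i in indices], [labels[i] for i in indices]


# Hypothetical label-ordered toy data.
features = [[0.1], [0.2], [0.3], [0.4]]
labels = [0, 0, 1, 1]

shuffled_X, shuffled_y = shuffle_together(features, labels, seed=42)
print(list(zip(shuffled_X, shuffled_y)))
```

Shuffling a list of indices, rather than the two lists separately, is what keeps every feature row attached to its original label.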

In fact, it is also recommended to reshuffle the data in every epoch, so that each batch is made up of a different set of samples from one epoch to the next.
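
Here is a rough sketch of what per-epoch reshuffling can look like in a hand-written training loop. The generator `epoch_batches`, its parameters, and the sizes used below are assumptions for illustration only.

```python
import random


def epoch_batches(num_samples, batch_size, num_epochs, seed=0):
    """Yield (epoch, batch_indices), drawing a new sample order each epoch."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(num_epochs):
        rng.shuffle(indices)  # reshuffle at the start of every epoch
        for start in range(0, num_samples, batch_size):
            # Slicing copies the indices, so later shuffles do not alter this batch.
            yield epoch, indices[start:start + batch_size]


batches_by_epoch = {}
for epoch, batch in epoch_batches(num_samples=10, batch_size=4, num_epochs=2):
    batches_by_epoch.setdefault(epoch, []).append(batch)

for epoch, batches in batches_by_epoch.items():
    print(f"epoch {epoch}: {batches}")
```

Every epoch still visits each sample exactly once, but the grouping into batches changes. Most frameworks offer this out of the box. For example, passing `shuffle=True` to PyTorch's `DataLoader` reshuffles the data at every epoch.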

P.S. I made this mistake myself a few years back. It had never occurred to me that shuffling was that important.

What are your thoughts on this? Let me know :)




Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn.

2 Comments
Surya Neelakandan
Mar 18

Hey,

In this, I couldn't understand what shuffling batch epochs meant. I did undergo a formal ML course, so I am familiar with the basic terminology. But I don't understand this batch epoch. Could you help me with that?

Thank you.
