The Most Common Misconception Pandas Users Have About the apply() Method
Avoid the apply() method whenever you can.
The apply() method in Pandas is the most common way to apply a function along an axis of a DataFrame or Series.
In my experience, when using apply(), most Pandas users believe that it is a vectorized method.
In other words, they believe that apply() operates efficiently and performs element-wise operations like other vectorized operations in Pandas.
But this is NOT true.
Contrary to this common belief, every Pandas user MUST know that Pandas’ apply() method is NOT vectorized.
Instead, it’s just a glorified Python for-loop: it calls your function once per row, column, or element and offers none of the vectorization-based optimization that one might expect.
As a result, the code runs at native Python speed, i.e., slow.
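One way to convince yourself of this is to count how often the function actually gets called. The snippet below is a minimal sketch: `count_calls` and `sum_row` are hypothetical names I made up for illustration, and the tiny DataFrame is arbitrary.

```python
import pandas as pd


def count_calls(df):
    """Count how often a row function runs under apply() and under a plain loop."""
    counter = {"n": 0}

    def sum_row(row):  # hypothetical row-wise function, for illustration only
        counter["n"] += 1
        return row["a"] + row["b"]

    # Row-wise apply(): Pandas invokes sum_row for each row
    via_apply = df.apply(sum_row, axis=1)
    apply_calls = counter["n"]

    # Explicit Python loop over the same rows
    counter["n"] = 0
    via_loop = pd.Series(
        [sum_row(row) for _, row in df.iterrows()], index=df.index
    )
    loop_calls = counter["n"]

    same_result = bool((via_apply == via_loop).all())
    return apply_calls, loop_calls, same_result


df = pd.DataFrame({"a": range(5), "b": range(5, 10)})
apply_calls, loop_calls, same_result = count_calls(df)
print(apply_calls, loop_calls, same_result)
```

Both paths produce the same values, and apply() needs at least one Python-level call per row, just like the loop does.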
What are the solutions?
One solution is to eliminate the apply() method by using a vectorized approach instead.
But I understand that, at times, coming up with a vectorized approach is difficult.
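To illustrate the first solution, here is a small sketch of replacing apply() with vectorized operations. The column names and the even/odd rule are assumptions chosen purely for the example; real logic may not translate this cleanly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1_000), "b": np.arange(1_000)})

# Row-wise version: Python calls the lambda once per row
slow_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized version: a single column-level operation
fast_sum = df["a"] + df["b"]

# Simple conditional logic can often be vectorized with np.where
slow_flag = df["a"].apply(lambda x: "even" if x % 2 == 0 else "odd")
fast_flag = pd.Series(
    np.where(df["a"] % 2 == 0, "even", "odd"), index=df.index
)

print((slow_sum == fast_sum).all(), (slow_flag == fast_flag).all())
```

The vectorized versions give the same values while pushing the loop down into compiled code.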
Another solution that I find handy is to parallelize the apply() method by using third-party optimized libraries instead.
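To give a feel for the general idea, here is a rough sketch that uses only the standard library: split the DataFrame into chunks, run apply() on each chunk in a separate worker, and stitch the results back together. This is my own simplified illustration, not how any particular library is implemented, and `sum_row`, `apply_chunk`, and `parallel_apply` are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd


def sum_row(row):
    # Hypothetical row-wise function; must live at module level to be picklable
    return row["a"] + row["b"]


def apply_chunk(chunk):
    # Each worker still runs a plain, non-vectorized apply() on its own chunk
    return chunk.apply(sum_row, axis=1)


def parallel_apply(df, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply sum_row to every row of df, one chunk of rows per task."""
    # Evenly spaced row boundaries; empty chunks are skipped
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [
        df.iloc[start:stop]
        for start, stop in zip(bounds[:-1], bounds[1:])
        if stop > start
    ]
    with executor_cls(max_workers=n_workers) as pool:
        # map() preserves chunk order, so concat restores the original row order
        parts = list(pool.map(apply_chunk, chunks))
    return pd.concat(parts)
```

When using process-based workers, call `parallel_apply(df)` from under an `if __name__ == "__main__":` guard. Starting processes and copying chunks has a cost, so on small DataFrames this can easily be slower than a plain apply().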
The image below compares the run-time of Pandas apply() with four alternatives that support parallelization:
It is evident that Pandas’ apply() is not the optimal way to apply a function. In fact, it’s the slowest of all five.
There are a couple of reasons for this:
Pandas ALWAYS runs on a single core of a CPU. Therefore, it does not possess any parallelization capabilities that it could possibly leverage.
The apply() method is not vectorized. Therefore, it does not possess any vectorization capabilities either.
Honestly speaking, while the four external libraries shown in the visual above do not possess any vectorization capabilities either, they do leverage parallelization.
That is how we get to see a massive run-time improvement when we use them.
Here, please note that even though mapply() is the fastest here, it does not mean it will always be the fastest. Consider benchmarking on your own dataset first.
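A benchmark like this does not need much code. Below is a minimal sketch using the standard `timeit` module; the `benchmark` helper, the random data, and the two candidates are assumptions for illustration, and you would swap in your own DataFrame and whichever approaches you want to compare.

```python
import timeit

import numpy as np
import pandas as pd


def benchmark(candidates, df, repeat=3):
    """Return the best wall-clock time in seconds per candidate, fastest first."""
    timings = {}
    for name, func in candidates.items():
        # number=1: each measurement is a single full pass over df
        timings[name] = min(
            timeit.repeat(lambda: func(df), number=1, repeat=repeat)
        )
    return dict(sorted(timings.items(), key=lambda item: item[1]))


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20_000, 2)), columns=["a", "b"])

# Hypothetical candidates; add an entry per library you want to try
candidates = {
    "apply": lambda d: d.apply(lambda row: row["a"] + row["b"], axis=1),
    "vectorized": lambda d: d["a"] + d["b"],
}

for name, seconds in benchmark(candidates, df).items():
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over several repeats reduces noise from other processes, but the ranking can still change with data size, dtypes, and the number of available cores.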
Moreover, I know that the add_row() method I demonstrated in the image above can be easily vectorized. I picked this particular example just for the sake of simplicity.
As a parting note, remember that your first attempt must ALWAYS be to write vectorized operations.
Consider these third-party libraries only when you see no scope to write vectorized code, and you see no other option but to use apply().
Get started with these libraries here:
Parallel Pandas: https://pypi.org/project/parallel-pandas/
👉 Over to you: What other techniques do you commonly use to optimize Pandas’ operations?