Avoid Using Pandas' Apply() Method At All Times
Clearing a common misconception about a popular method.
The apply() method in Pandas is the most common way to apply a function along an axis of a DataFrame or Series.
But contrary to common belief, Pandas' apply() is NOT vectorized. Instead, it's a glorified for-loop. Thus, it does not offer any inherent optimization, and the code runs at native Python speed.
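To make the "glorified for-loop" point concrete, here is a minimal sketch. The DataFrame, the `add_columns` function, and the `call_log` list are made up for illustration. It records every invocation of the function and compares the result with an explicit Python loop over the rows.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical data).
df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

call_log = []  # one entry is appended per Python-level call


def add_columns(row):
    # Runs as ordinary Python code, once for each row it is handed.
    call_log.append(row.name)
    return row["a"] + row["b"]


via_apply = df.apply(add_columns, axis=1)

# An explicit loop over the rows produces the same values.
via_loop = pd.Series(
    [row["a"] + row["b"] for _, row in df.iterrows()],
    index=df.index,
)

print("function calls:", len(call_log), "for", len(df), "rows")
print("same values:", via_apply.tolist() == via_loop.tolist())
```

The function is invoked at least once per row, which is exactly the work a hand-written loop would do.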
One solution is to eliminate the apply() method by using a vectorized approach.
But it is understandable that at times, coming up with a vectorized approach is difficult. (Here’s one of my previous guides on this: If You Are Not Able To Code A Vectorized Approach, Try This)
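When a vectorized form does exist, the rewrite is often short. The sketch below assumes a made-up frame with `price` and `qty` columns and computes a row-wise product both ways. Timings are printed rather than quoted, since they depend on the machine and the Pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, 10_000),
    "qty": rng.integers(1, 10, 10_000),
})

# Row-wise: the lambda is called once per row in Python.
start = time.perf_counter()
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)
apply_seconds = time.perf_counter() - start

# Vectorized: a single column-level multiplication.
start = time.perf_counter()
fast = df["price"] * df["qty"]
vectorized_seconds = time.perf_counter() - start

print(f"apply():    {apply_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
print("same result:", np.allclose(slow.to_numpy(), fast.to_numpy()))
```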
Another solution is to parallelize the apply() method by using external libraries.
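The general idea behind such libraries can be sketched with the standard library alone: split the rows into chunks, run apply() on each chunk as a separate task, and concatenate the results. This is not the API of any particular library. `chunked_apply` and `row_total` are hypothetical names, and the sketch assumes a non-empty DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def chunked_apply(df, func, executor, n_chunks=4):
    """Hypothetical helper: row-wise apply(), one chunk of rows per task."""
    # Split by row position so every chunk keeps its original index labels.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [
        df.iloc[start:stop]
        for start, stop in zip(bounds[:-1], bounds[1:])
        if stop > start  # skip empty chunks when there are few rows
    ]
    futures = [executor.submit(chunk.apply, func, axis=1) for chunk in chunks]
    # Collect results in submission order to preserve the row order.
    return pd.concat([future.result() for future in futures])


def row_total(row):
    return row["a"] + row["b"]


df = pd.DataFrame({"a": range(8), "b": range(8, 16)})

# Threads keep this sketch portable; see the note below about processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = chunked_apply(df, row_total, pool)

serial = df.apply(row_total, axis=1)
print("same values:", parallel.tolist() == serial.tolist())
```

Threads will not speed up a pure-Python function because of the GIL. For a real multi-core run, pass a `concurrent.futures.ProcessPoolExecutor` instead, keep the row function at module level so it can be pickled, and put the calling code under an `if __name__ == "__main__":` guard.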
The image above compares the run-time of alternatives that support parallelization.
It is evident that Pandas’ apply() is not the optimal way to apply a function.
Get started with these libraries here:
Parallel Pandas: https://pypi.org/project/parallel-pandas/
👉 Over to you: What are some other techniques you commonly use to optimize Pandas’ operations?