How to Read Multiple CSV Files Efficiently
Data is often split into multiple CSV files before it is handed to the DS/ML team.
By default, Pandas reads CSV files on a single thread, so one has to iterate over the list of files and read them one by one before any further processing.
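For reference, the one-by-one approach usually looks something like the sketch below. The function name `read_csvs_sequentially` and the glob pattern argument are illustrative choices, not something from the original post, and the sketch assumes all files share the same columns.

```python
import glob

import pandas as pd


def read_csvs_sequentially(pattern):
    """Read every CSV matching `pattern`, one file at a time, into one DataFrame.

    Hypothetical helper for illustration; assumes the files share a schema.
    """
    # Sort the paths so the row order is reproducible across runs.
    paths = sorted(glob.glob(pattern))
    # Each pd.read_csv call finishes before the next one starts.
    frames = [pd.read_csv(path) for path in paths]
    # Stack the per-file frames and rebuild a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Calling `read_csvs_sequentially("data/*.csv")` (with a placeholder path) returns a single DataFrame, but the total read time grows with the number of files because they are parsed back to back.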
"Datatable" can provide a quick fix for this. Instead of reading the files iteratively with Pandas, you can hand the whole list of files to Datatable. Because its CSV reader is multi-threaded, it gives a significant speedup over Pandas.
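A minimal sketch of the Datatable route is shown below. It relies on `dt.iread` (which yields one Frame per source), `dt.rbind` (which stacks Frames row-wise), and `Frame.to_pandas()`. The helper name `read_csvs` and the fallback to Pandas when Datatable is not installed are assumptions added for illustration, not part of the original post.

```python
import glob

import pandas as pd

try:
    import datatable as dt  # optional dependency: pip install datatable
except ImportError:
    dt = None  # fall back to plain Pandas below


def read_csvs(pattern):
    """Read every CSV matching `pattern` into a single Pandas DataFrame.

    Hypothetical helper for illustration; assumes the files share a schema.
    """
    paths = sorted(glob.glob(pattern))
    if dt is not None:
        # iread yields one datatable Frame per file, parsed by the
        # multi-threaded reader.
        frames = list(dt.iread(paths))
        # rbind stacks the Frames row-wise; convert once at the end.
        return dt.rbind(frames).to_pandas()
    # Fallback (an assumption of this sketch): sequential Pandas reads.
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```

Converting to Pandas only at the end keeps the parsing and concatenation inside Datatable. How large the speedup is depends on file sizes, core count, and disk speed, so it is worth timing on your own data.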
The performance gain is not limited to I/O; it is observed in many other tabular operations as well.
Read more here: DataTable Docs.