It’s Time to Say Goodbye to pd.read_csv() and pd.to_csv()

<p>Reading and writing CSVs with Pandas runs serially, which makes these input-output operations slow and time-consuming on large files. It&#39;s frustrating because I see ample scope for parallelization here, but unfortunately, Pandas does not provide this functionality (yet). Although I am never in favor of creating CSVs with Pandas in the first place (I explain why in a separate post), I understand that there might be situations where one has no other choice but to work with CSVs.</p>
<p>Therefore, in this post, we will explore <strong>Dask</strong> and <strong>DataTable</strong>, two of the most popular Pandas-like libraries among data scientists. We&rsquo;ll rank Pandas, Dask, and DataTable by their performance on the following two parameters:</p>
<ol>
<li><strong>Time taken to read the CSV and obtain a PANDAS DATAFRAME</strong>
<p>If we read a CSV through Dask or DataTable, we get a Dask DataFrame or a DataTable DataFrame respectively, <strong>not a Pandas DataFrame</strong>. Assuming that we want to stick to the traditional Pandas syntax and functions (due to familiarity), we would first have to convert the result to a Pandas DataFrame, so the time for that conversion counts toward the read time.</p></li>
<li><strong>Time taken to store a PANDAS DATAFRAME to a CSV</strong>
<p>The objective is to generate a CSV file from a given Pandas DataFrame. For Pandas, we already know the <em>df.to_csv()</em> method. However, to create a CSV with Dask or DataTable, we first need to convert the given Pandas DataFrame to their respective DataFrames and then store them in a CSV. Thus, we&rsquo;ll also include the time taken for this DataFrame conversion in the analysis.</p></li>
</ol>
<p>The setup for the experiments is as follows:</p>
<ol>
<li>For experimentation purposes, I generated a random dataset in Python with a variable number of rows and thirty columns, covering string, float, and integer data types.</li>
<li>I repeated each experiment described below five times to reduce randomness and draw fair conclusions from the observed results. 
The figures I report in the section below are averages across the five runs.</li>
<li>Python environment and libraries:
<ul>
<li>Python 3.9.12</li>
<li>Pandas 1.4.2</li>
<li>DataTable 1.0.0</li>
<li>Dask 2022.02.1</li>
</ul></li>
</ol>
<h1><strong>Experiment 1: Time taken to read the CSV</strong></h1>
<p>The plot below depicts the time taken (in seconds) by Pandas, Dask, and DataTable to read a CSV file and generate a Pandas DataFrame. The number of rows in the CSV ranges from 100k to 5 million.</p>
<p><a href="https://towardsdatascience.com/its-time-to-say-goodbye-to-pd-read-csv-and-pd-to-csv-27fbc74e84c5">Read More</a></p>
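<p>To make the two measured paths concrete, here is a minimal sketch of what reading a CSV into a Pandas DataFrame and writing a Pandas DataFrame to a CSV could look like with each library. The function names, the <code>npartitions=4</code> default, and the <code>single_file=True</code> choice are my own assumptions for illustration, not necessarily what the benchmark used. Imports sit inside each function so that only the libraries you actually call need to be installed.</p>

```python
def read_with_pandas(path):
    """Baseline: parse the CSV straight into a Pandas DataFrame."""
    import pandas as pd
    return pd.read_csv(path)


def read_with_dask(path):
    """Read with Dask, then materialize the result as a Pandas DataFrame."""
    import dask.dataframe as dd  # assumes Dask is installed
    return dd.read_csv(path).compute()


def read_with_datatable(path):
    """Read with DataTable, then convert the result to a Pandas DataFrame."""
    import datatable as dt  # assumes DataTable is installed
    return dt.fread(path).to_pandas()


def write_with_pandas(df, path):
    """Baseline: write a Pandas DataFrame to a CSV."""
    df.to_csv(path, index=False)


def write_with_dask(df, path, npartitions=4):
    """Convert to a Dask DataFrame first, then write a single CSV file.

    npartitions=4 is an arbitrary illustrative value.
    """
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, npartitions=npartitions)
    ddf.to_csv(path, single_file=True, index=False)


def write_with_datatable(df, path):
    """Convert to a DataTable Frame first, then write the CSV."""
    import datatable as dt
    dt.Frame(df).to_csv(path)
```

<p>Note that in every Dask and DataTable function the conversion step is part of the call, so timing the whole function captures both the I/O and the conversion cost.</p>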
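<p>The experimental setup (a random thirty-column dataset, and five repetitions averaged per measurement) can be approximated with the standard library alone. This is only a sketch under my own assumptions: the helper names <code>write_random_csv</code> and <code>average_runtime</code> are hypothetical, and cycling the columns through integer, float, and string types is a guess at the column mix, which the post does not specify.</p>

```python
import csv
import random
import statistics
import string
import time


def write_random_csv(path, n_rows, n_cols=30, seed=0):
    """Write n_rows of random data; columns cycle through int, float, str."""
    rng = random.Random(seed)
    makers = [
        lambda: rng.randint(0, 1_000_000),                          # integer column
        lambda: rng.random(),                                       # float column
        lambda: "".join(rng.choices(string.ascii_lowercase, k=8)),  # string column
    ]
    header = [f"col_{i}" for i in range(n_cols)]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for _ in range(n_rows):
            writer.writerow([makers[i % 3]() for i in range(n_cols)])


def average_runtime(func, *args, repeats=5, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

<p>For example, after <code>write_random_csv("data.csv", 100_000)</code>, a call such as <code>average_runtime(pd.read_csv, "data.csv")</code> would give the mean read time over five runs for Pandas; the same helper can wrap any Dask or DataTable read or write function.</p>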
Tags: CSV DataFrame