It’s Time to Say Goodbye to pd.read_csv() and pd.to_csv()
<p>By default, Pandas reads and writes CSV files serially on a single thread, which makes these input-output operations slow on large files. It's frustrating, because I see ample scope for parallelization here, but the default Pandas CSV reader and writer do not provide this functionality (yet). Although I am never in favor of creating CSVs with Pandas in the first place, I understand that there might be situations where one has no other choice but to work with CSVs.</p>
<p>Therefore, in this post, we will explore <strong>Dask</strong> and <strong>DataTable</strong>, two popular Pandas-like libraries among data scientists. We’ll rank Pandas, Dask, and DataTable based on their performance on the following parameters:</p>
<ol>
<li><strong>Time taken to read the CSV and obtain a PANDAS DATAFRAME</strong></li>
</ol>
<p>If we read a CSV through Dask or DataTable, we get a Dask DataFrame or a DataTable DataFrame respectively, <strong>not a Pandas DataFrame</strong>. Assuming that we want to stick to the traditional Pandas syntax and functions (due to familiarity), we have to convert the result to a Pandas DataFrame first, so the measured time includes this conversion.</p>
<ol start="2">
<li><strong>Time taken to store a PANDAS DATAFRAME to a CSV</strong></li>
</ol>
<p>The objective is to generate a CSV file from a given Pandas DataFrame. For Pandas, we already have the <em>df.to_csv()</em> method. To create a CSV with Dask or DataTable, however, we first need to convert the Pandas DataFrame to the library's own DataFrame and then write that to a CSV. Thus, we’ll also include the time taken for this DataFrame conversion in the analysis.</p>
<ol>
<li>For experimentation purposes, I generated a random dataset in Python with a varying number of rows and thirty columns, covering string, float, and integer data types.</li>
<li>I repeated each experiment described below five times to reduce randomness and draw fair conclusions from the observed results. The figures I report in the section below are averages across the five runs.</li>
<li>Python environment and libraries:</li>
</ol>
<ul>
<li>Python 3.9.12</li>
<li>Pandas 1.4.2</li>
<li>DataTable 1.0.0</li>
<li>Dask 2022.02.1</li>
</ul>
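<p>A dataset of the kind described in the setup above could be generated along these lines. This is only a sketch: the column names, value ranges, seed, and the small string vocabulary are hypothetical choices, and the original generation script may differ.</p>

```python
import numpy as np
import pandas as pd


def make_random_frame(n_rows, n_cols=30, seed=0):
    """Build a DataFrame whose columns cycle through int, float, and string types."""
    rng = np.random.default_rng(seed)
    # Hypothetical vocabulary for the string columns.
    words = np.array(["alpha", "beta", "gamma", "delta", "epsilon"])
    columns = {}
    for i in range(n_cols):
        kind = i % 3
        if kind == 0:
            columns[f"int_{i}"] = rng.integers(0, 1_000_000, size=n_rows)
        elif kind == 1:
            columns[f"float_{i}"] = rng.random(n_rows)
        else:
            columns[f"str_{i}"] = rng.choice(words, size=n_rows)
    return pd.DataFrame(columns)
```

<p>Calling <code>make_random_frame(100_000).to_csv("data.csv", index=False)</code> would then produce an input file for the read experiment; <code>data.csv</code> is a placeholder name.</p>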
<h1><strong>Experiment 1: Time taken to read the CSV</strong></h1>
<p>The plot below depicts the time taken (in seconds) by Pandas, Dask, and DataTable to read a CSV file and generate a Pandas DataFrame. The number of rows of the CSV ranges from 100k to 5 million.</p>
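<p>The measurement itself, averaged over five runs as described in the setup, can be sketched with a small timing helper. The helper name and its signature are illustrative; it simply averages wall-clock time from <code>time.perf_counter</code>.</p>

```python
import statistics
import time


def average_time(func, *args, repeats=5, **kwargs):
    """Return the mean wall-clock seconds over `repeats` calls of func(*args, **kwargs)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)
```

<p>For example, <code>average_time(pd.read_csv, "data.csv")</code> would time the Pandas reader on a placeholder file, and the same call works for any function that reads or writes a CSV.</p>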
<p><a href="https://towardsdatascience.com/its-time-to-say-goodbye-to-pd-read-csv-and-pd-to-csv-27fbc74e84c5">Read More</a></p>