Pandas 2.0: A Game-Changer for Data Scientists?
<p><strong>Due to its extensive functionality and versatility, </strong><code>pandas</code><strong> has secured a place in every data scientist’s heart.</strong></p>
<p>From data input/output to data cleaning and transformation, it’s nearly impossible to think about data manipulation without <code>import pandas as pd</code>, <em>right</em>?</p>
<p><em>Now, bear with me:</em> with such a buzz around LLMs over the past months, I have somehow let slide the fact that <code>pandas</code> has just undergone a major release! Yep, <code>pandas 2.0</code> <a href="https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html" rel="noopener ugc nofollow" target="_blank">is out and came with guns blazing</a>!</p>
<p>Although I wasn’t aware of all the hype, the <a href="https://tiny.ydata.ai/dcai-medium" rel="noopener ugc nofollow" target="_blank">Data-Centric AI Community</a> promptly came to the rescue:</p>
<p><img alt="" src="https://miro.medium.com/v2/1*FupguULZd5TceCbWPPq9lg.png" style="width:700px" /></p>
<p>The 2.0 release seems to have created quite an impact in the data science community, with a lot of users praising the modifications added in the new version. Screenshot by Author.</p>
<p><strong>Fun fact:</strong> <em>Were you aware this release was in the making for an astonishing 3 years? Now that’s what I call “commitment to the community”!</em></p>
<p><em>So what does </em><code><em>pandas 2.0</em></code><em> bring to the table? Let’s dive right into it!</em></p>
<h1>1. Performance, Speed, and Memory-Efficiency</h1>
<p>As we all know, <code>pandas</code> was built on top of <code>numpy</code>, which was never designed to be a backend for dataframe libraries. For that reason, one of the major limitations of <code>pandas</code> has been in-memory processing of larger datasets.</p>
<p><strong>In this release, the big change comes from the introduction of the Apache Arrow backend for pandas data.</strong></p>
<p>Essentially, Arrow is a standardized in-memory columnar data format with libraries available for several programming languages (C, C++, R, Python, among others). For Python there is <a href="https://arrow.apache.org/docs/python/" rel="noopener ugc nofollow" target="_blank">PyArrow</a>, which is based on the C++ implementation of Arrow and is therefore <em>fast</em>!</p>
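<p>To make the backend switch concrete, here is a minimal sketch, assuming <code>pandas</code> 2.0 or later and an optional <code>pyarrow</code> installation. The tiny CSV and the helper <code>read_with_backend</code> are made up for illustration; the only pandas feature it relies on is the <code>dtype_backend</code> argument of <code>pd.read_csv</code>.</p>

```python
import io

import pandas as pd

# Hypothetical toy data, only for illustration (note the missing score in row 2).
CSV_TEXT = "id,city,score\n1,Lisbon,9.5\n2,Porto,\n3,Braga,7.0\n"


def read_with_backend(csv_text, backend=None):
    """Read CSV text with the default (numpy) backend or a requested dtype backend."""
    kwargs = {} if backend is None else {"dtype_backend": backend}
    return pd.read_csv(io.StringIO(csv_text), **kwargs)


# Default behaviour: columns are backed by numpy dtypes.
df_numpy = read_with_backend(CSV_TEXT)
print(df_numpy.dtypes)

# Opt in to Arrow-backed columns (requires pandas >= 2.0 and pyarrow).
try:
    import pyarrow  # noqa: F401  (only checking that it is installed)

    df_arrow = read_with_backend(CSV_TEXT, backend="pyarrow")
    print(df_arrow.dtypes)
except ImportError:
    df_arrow = None
    print("pyarrow is not installed; skipping the Arrow-backed read")
```

<p>Comparing the two <code>dtypes</code> printouts should show the same columns backed by different types. The numpy backend remains the default, so the Arrow backend is something you opt into per call.</p>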
<p><a href="https://towardsdatascience.com/pandas-2-0-a-game-changer-for-data-scientists-3cd281fcc4b4">Website</a></p>