Sneak peek of topics you better know before taking the Associate ML Certification exam — Part 1: Pandas UDFs

<p>Pandas UDFs were introduced in Apache Spark 2.3 to let users apply pandas functionality within Spark. They are built on top of Apache Arrow, which enables vectorized operations and makes UDFs faster and more efficient. Apache Arrow is an in-memory columnar data format that Spark uses to transfer data efficiently between JVM and Python processes. More information can be found in the official Apache Arrow in PySpark user&nbsp;<a href="https://spark.apache.org/docs/3.2.1/api/python/user_guide/sql/arrow_pandas.html" rel="noopener ugc nofollow" target="_blank">guide</a>. By leveraging Apache Arrow under the hood, Pandas UDFs can increase performance by up to 100x compared with row-at-a-time Python UDFs.</p> <p>Originally, in Spark 2.3, there were two types of Pandas UDFs: scalar and grouped map. In Spark 3.0, they were split into two API categories: Pandas UDFs and Pandas function APIs, with grouped map now supported through the Pandas function APIs. To address the complexity of the old Pandas UDFs, Spark 3.0 also redesigned them to use Python type hints. The result is more Pythonic code that is easier to learn, read, and debug. You no longer need to remember UDF types; instead, the type hints express which kind of Pandas UDF you are defining.</p> <p><a href="https://medium.com/@mojganmazouchi/sneak-peek-of-topics-you-better-know-before-taking-the-associate-ml-certification-exam-part-1-2c5aaa53eeec"><strong>Visit Now</strong></a></p>
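<p>To make the two Spark 3.0 API categories concrete, here is a minimal sketch, assuming pyspark 3.0 or later with pyarrow installed. The function names, the <code>city</code> and <code>temp_f</code> columns, and the sample rows are hypothetical and chosen only for illustration. The first function carries Series-to-Series type hints, so it can be wrapped as a Pandas UDF; the second takes and returns a pandas DataFrame, so it can be passed to the grouped map Pandas function API <code>applyInPandas</code>.</p>

```python
import pandas as pd


def to_celsius(fahrenheit: pd.Series) -> pd.Series:
    """Series -> Series type hints mark this as a scalar-style Pandas UDF."""
    # Vectorized: operates on a whole batch of values at once, not row by row.
    return (fahrenheit - 32.0) * 5.0 / 9.0


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Grouped map logic: receives all rows of one group as a pandas DataFrame."""
    return pdf.assign(temp_f=pdf["temp_f"] - pdf["temp_f"].mean())


def run_with_spark(spark):
    """Apply both functions to a small, hypothetical Spark DataFrame.

    pyspark is imported lazily so the pandas functions above can be
    tried out without a Spark installation.
    """
    from pyspark.sql.functions import pandas_udf

    df = spark.createDataFrame(
        [("a", 32.0), ("a", 212.0), ("b", 50.0)], ["city", "temp_f"]
    )

    # Pandas UDF: the UDF type is inferred from the Python type hints.
    to_celsius_udf = pandas_udf(to_celsius, returnType="double")
    scalar_result = df.withColumn("temp_c", to_celsius_udf(df["temp_f"]))

    # Pandas function API: grouped map through applyInPandas.
    grouped_result = df.groupBy("city").applyInPandas(
        subtract_group_mean, schema="city string, temp_f double"
    )
    return scalar_result, grouped_result
```

<p>With an active <code>SparkSession</code> named <code>spark</code>, calling <code>run_with_spark(spark)</code> returns both resulting DataFrames. Keeping the pandas logic in plain functions is a design choice of this sketch, not a requirement: it lets you unit test the logic with pandas alone before handing it to Spark.</p>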
Tags: Pandas UDFs