SOLID Principles for Spark Applications

This blog post explores whether PySpark applications can incorporate SOLID principles for data engineering tasks.

Here's the series on SOLID principles with Python for data engineering tasks that I've been working on:

- [SOLID principles in data engineering — Part 1](https://medium.com/data-engineer-things/solid-principles-in-data-engineering-part-1-49d6025fe0c9)
- [SOLID principles in data engineering — Part 2](https://medium.com/data-engineer-things/solid-principles-in-data-engineering-part-2-52b9ce2c7070)
- [SOLID principles in data engineering — Part 3](https://medium.com/data-engineer-things/solid-principles-in-data-engineering-part-3-249d5869266f)

I've also explored how functional programming applies to data engineering in these posts:

- [Functional programming in data engineering — Part 1](https://medium.com/data-engineer-things/functional-programming-in-data-engineering-with-python-part-1-c2c4f677f749)
- [Functional programming in data engineering — Part 2](https://medium.com/data-engineer-things/functional-programming-in-data-engineering-with-python-part-2-3bf4c04769df)

# What is Spark?

Spark is a framework for processing large volumes of data distributed across multiple machines at the same time. It was originally written in Scala, a programming language that supports both object-oriented programming (OOP) and functional programming (FP).

Spark mainly deals with immutable data structures like RDDs and DataFrames. If a data structure is immutable, it cannot be modified once it is created; the only way to change it is to create an entirely new structure that includes the desired changes (a short sketch of this behaviour follows the OOP section below).

# What is PySpark?

PySpark is the Python API for interacting with the services provided by Spark. It lets developers leverage the power of Spark while using the simplicity of Python, making it easier to build robust applications that process large amounts of data.

In other words, Python programmers can stick with Python and still take advantage of Spark's distributed computing capabilities through PySpark.

# Object-oriented programming (OOP)

Object-oriented programming (OOP) is a programming paradigm that emphasises the use of objects. In OOP, data is stored in the **instances** of classes (*not in the classes themselves*), so a class must first be instantiated into an object before data can be stored in it. The data stored in a class's instance is called its "**state**".

OOP often deals with mutable data, that is, data that can be modified after it is created.
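To make this concrete, here is a minimal sketch of instance state in plain Python. The `Pipeline` class and its fields are hypothetical, chosen purely for illustration:

```python
# A hypothetical Pipeline class: data ("state") lives in the
# instance, and that state can be modified after creation.
class Pipeline:
    def __init__(self, name: str):
        self.name = name            # state stored on the instance
        self.records_processed = 0  # more mutable state

    def process(self, batch: list) -> None:
        # Mutates the instance's state in place
        self.records_processed += len(batch)


pipeline = Pipeline("orders")      # instantiate the class into an object
pipeline.process([1, 2, 3])
print(pipeline.records_processed)  # 3 -- the same object has changed
```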
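By contrast, here is the immutability described in the Spark section above, sketched with PySpark's DataFrame API (the column names and values are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("immutability-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Transformations never modify df in place; they return a new DataFrame
df_upper = df.withColumn("name", F.upper(F.col("name")))

df.show()        # original DataFrame, unchanged
df_upper.show()  # an entirely new DataFrame with the desired change
```

The original `df` is left untouched; every transformation yields a new, independent DataFrame, which is why Spark code tends to lean on the functional style rather than on mutable object state.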