Spark Job Optimizations & Databricks

There are several ways a Spark job can be optimized, and choosing the right optimization is crucial for reducing overall runtime and compute cost.

In my project, I was tasked with optimizing our Spark jobs and reducing the overall runtime of long-running Airflow DAGs. In this blog, I am going to discuss a few Spark optimizations I used and the changes I made to cluster configurations to reduce runtime and compute cost.

![](https://miro.medium.com/v2/resize:fit:700/1*YG_eQp839_Q4J4vZxJfxkw.png)

# 1. JDBC connection

One of the long-running DAGs was taking about an hour. In the Spark code, a JDBC connection was used to fetch data from a table. Choosing the right parameters and configuration for the JDBC read is important for fetching the data faster: without partitioning options, Spark reads the entire table through a single connection in a single task, which serializes the whole transfer. A sketch of a partitioned read follows below.
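Here is a minimal PySpark sketch of a partitioned JDBC read. The connection URL, table name, credentials, partition column, and bounds are all hypothetical placeholders, not values from the original job. The `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options split the read across parallel tasks, and `fetchsize` controls how many rows the driver pulls per network round trip.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-example").getOrCreate()

# Hypothetical connection details; replace with your own source.
jdbc_url = "jdbc:postgresql://db-host:5432/sales"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")      # hypothetical table name
    .option("user", "reader")
    .option("password", "secret")
    # Parallelize the read: Spark issues numPartitions range queries over
    # partitionColumn. The bounds only decide how the ranges are split;
    # they do not filter rows outside the range.
    .option("partitionColumn", "order_id")   # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    # Rows the JDBC driver fetches per round trip (default varies by driver).
    .option("fetchsize", "10000")
    .load()
)

print(df.rdd.getNumPartitions())  # should report 8 parallel partitions
```

With a sensible partition column (evenly distributed, indexed on the source database), this turns one long sequential scan into several concurrent ones, which is usually the single biggest win for JDBC-bound Spark jobs.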