Spark Job Optimizations & Databricks

There are several ways a Spark job can be optimized, and choosing the right optimization is crucial for reducing overall runtime and compute cost.

In my project, I was tasked with optimizing our Spark jobs and reducing the overall runtime of long-running Airflow DAGs. In this blog, I am going to discuss a few Spark optimizations I used and the changes I made to cluster configurations to reduce runtime and compute cost.

![](https://miro.medium.com/v2/resize:fit:700/1*YG_eQp839_Q4J4vZxJfxkw.png)

# 1. JDBC connection

One of the long-running DAGs was taking about an hour. In the Spark code, a JDBC connection was used to fetch data from a table. Choosing the right parameters and configuration for the JDBC read is important for fetching the data faster: without partitioning options, Spark reads the entire table through a single connection in a single task, which serializes the whole transfer. A sketch of a partitioned read follows below.
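Here is a minimal PySpark sketch of a partitioned JDBC read. The connection URL, table name, credentials, partition column, and bounds are all hypothetical placeholders, not values from the original job. The `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options split the read across parallel tasks, and `fetchsize` controls how many rows the driver pulls per network round trip.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-example").getOrCreate()

# Hypothetical connection details; replace with your own source.
jdbc_url = "jdbc:postgresql://db-host:5432/sales"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")      # hypothetical table name
    .option("user", "reader")
    .option("password", "secret")
    # Parallelize the read: Spark issues numPartitions range queries over
    # partitionColumn. The bounds only decide how the ranges are split;
    # they do not filter rows outside the range.
    .option("partitionColumn", "order_id")   # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    # Rows the JDBC driver fetches per round trip (default varies by driver).
    .option("fetchsize", "10000")
    .load()
)

print(df.rdd.getNumPartitions())  # should report 8 parallel partitions
```

With a sensible partition column (evenly distributed, indexed on the source database), this turns one long sequential scan into several concurrent ones, which is usually the single biggest win for JDBC-bound Spark jobs.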