Automated Web Scraping on Databricks

<p>Web scraping on a local machine is straightforward: install webdriver-manager and launch the Chrome browser. More often than not, though, you want to fetch the latest information periodically. To run the same code automatically at regular intervals and get the latest data from a web page, you need to set up the process on a cloud-based platform. This raises a number of problems with installing Chrome and its driver on the compute cluster. This article explains how to do this setup in Databricks and how to deal with some common issues that arise.</p>
<p>The process can be divided into 3 simple steps:</p>
<ul>
<li>Set up a dynamic shell script to install the necessary drivers and packages</li>
<li>Add the script to the cluster so that it runs during cluster initialization</li>
<li>Launch the driver to scrape the web page</li>
</ul>
<p>To set up your dynamic shell script, launch your Databricks workspace, spin up a cluster, and attach it to a notebook. Then create a dynamic shell script that installs ChromeDriver and Google Chrome by running the script below.</p>
<p><a href="https://medium.com/@pratyushaaddula/automated-web-scraping-on-databricks-74db2ea01dbc"><strong>Click Here</strong></a></p>
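<p>The three steps above can be sketched roughly as follows. This is a minimal illustration and not the script from the linked article. It assumes a Debian/Ubuntu-based cluster image with <code>apt-get</code> and <code>wget</code>, and Selenium 4.6 or later so that Selenium Manager can locate a matching driver (webdriver-manager would be an alternative). The helper names <code>build_init_script</code>, <code>write_init_script</code>, <code>chrome_arguments</code>, <code>launch_driver</code>, and <code>scrape_title</code> are hypothetical, and the location where the script must be stored depends on how your workspace handles cluster init scripts.</p>

```python
from pathlib import Path

# Google's direct download link for the stable Chrome .deb package.
CHROME_DEB_URL = (
    "https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb"
)


def build_init_script():
    """Step 1: return the text of a shell script that installs Google Chrome.

    Assumption: the cluster nodes run Debian/Ubuntu, so apt-get is available.
    """
    lines = [
        "#!/bin/bash",
        "set -e",
        "apt-get update",
        f"wget -q {CHROME_DEB_URL} -O /tmp/chrome.deb",
        "apt-get install -y /tmp/chrome.deb",
    ]
    return "\n".join(lines) + "\n"


def write_init_script(path):
    """Step 2: save the script so it can be registered as a cluster init script.

    Assumption: `path` is a location your workspace accepts for init scripts.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(build_init_script())
    return target


def chrome_arguments():
    """Chrome flags commonly needed on a server without a display."""
    return ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]


def launch_driver():
    """Step 3: start headless Chrome through Selenium."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in chrome_arguments():
        options.add_argument(argument)
    # Assumption: Selenium >= 4.6, whose Selenium Manager resolves the driver.
    return webdriver.Chrome(options=options)


def scrape_title(url):
    """Fetch a page and return its title, always closing the browser."""
    driver = launch_driver()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()
```

<p>In a notebook you would call <code>write_init_script</code> once, register the saved file in the cluster's init script settings, restart the cluster, and then call <code>scrape_title</code> from a scheduled job. Running as root inside a container is why <code>--no-sandbox</code> is usually required there.</p>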
Tags: Web Scraping