Data Engineering End-to-End Project — Spark, Kafka, Airflow, Docker, Cassandra, Python

First of all, please visit my repo to better understand the whole process. This project illustrates a streaming data pipeline and covers a large part of the modern data engineering tech stack. I used macOS for this project.

# Tech Stack

1. Python
2. API
3. Apache Airflow
4. Apache Kafka
5. Apache Spark
6. Apache Cassandra
7. Docker

# Introduction

We will use the [Random Name API](https://randomuser.me/) to get the data. It generates new random data every time we trigger the API. We will fetch the data using our first [Python script](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/stream_to_kafka.py) and run it regularly to illustrate streaming data. This script also writes the API data to the Kafka topic (a minimal sketch of this producer logic is shown at the end of this section). We will schedule and orchestrate this process with the [Airflow DAG script](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/stream_to_kafka_dag.py). Once the data is written to the Kafka topic, we can read it with the [Spark Structured Streaming script](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/spark_streaming.py). Then, we will write the modified data to Cassandra using the same script. All the services will be running as [Docker containers](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/docker-compose.yml).

# Apache Airflow

We will use [Puckel's Docker-Airflow repo](https://github.com/dogukannulu/docker-airflow) to run Airflow as a container. Special thanks to Puckel!

We should first run the following command to clone the necessary repo onto our local machine.

```bash
git clone https://github.com/dogukannulu/docker-airflow.git
```

After cloning the repo, we should run the following command to build the image with the additional dependencies.

```bash
docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
```

I've modified the [docker-compose-LocalExecutor.yml file](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/docker-compose-LocalExecutor.yml) and added it to the repo as well. With this modified version, the Airflow container is bound to the Kafka and Spark containers, and the necessary modules and libraries are installed automatically. For that, the [`requirements.txt`](https://github.com/dogukannulu/kafka_spark_structured_streaming/blob/main/requirements.txt) file must be in the working directory. We can start the container with the following command.

```bash
docker-compose -f docker-compose-LocalExecutor.yml up -d
```

Now you have a running Airflow container, and you can access the UI at `http://localhost:8080`.
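Once Airflow picks up the DAG script, you should see it in the UI. To give an idea of what the scheduling step looks like, here is a minimal sketch in the spirit of `stream_to_kafka_dag.py`; the DAG id, task id, schedule, and the imported `start_streaming` function are assumptions, so refer to the actual script in the repo.

```python
# Rough sketch of a DAG that triggers the streaming script on a schedule.
# DAG id, task id, schedule, and the imported function are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

from stream_to_kafka import start_streaming  # assumed entry point of the producer script

default_args = {
    "owner": "airflow",
    "start_date": datetime(2023, 1, 1),
    "retries": 1,
}

with DAG(
    dag_id="random_people_names",      # assumed DAG id
    default_args=default_args,
    schedule_interval="0 1 * * *",     # assumed schedule: once a day at 01:00
    catchup=False,
) as dag:
    stream_task = PythonOperator(
        task_id="kafka_stream",
        python_callable=start_streaming,  # writes the API data to the Kafka topic
    )
```

A plain `PythonOperator` is enough here because the producer script is ordinary Python; the operator simply calls it on the given schedule.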
If some error occurs with the libraries and packages, we can go into the Airflow container and install them manually.
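To make the streaming part of the Introduction more concrete, here is a minimal sketch of the producer logic described there, assuming `kafka-python`, a broker at `localhost:9092`, and a topic named `random_names`; the real implementation is in `stream_to_kafka.py` in the repo.

```python
# Minimal sketch: fetch one random user from the API and send it to Kafka.
# Broker address, topic name, and the selected fields are assumptions.
import json

import requests
from kafka import KafkaProducer


def fetch_random_user() -> dict:
    """Call the Random Name API and keep only a few fields."""
    results = requests.get("https://randomuser.me/api/").json()["results"][0]
    return {
        "full_name": f"{results['name']['first']} {results['name']['last']}",
        "gender": results["gender"],
        "email": results["email"],
        "country": results["location"]["country"],
    }


def main() -> None:
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                       # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("random_names", value=fetch_random_user())      # assumed topic name
    producer.flush()


if __name__ == "__main__":
    main()
```

Running a script like this repeatedly (for example, from the Airflow DAG sketched above) produces a continuous stream of messages that the Spark Structured Streaming job can then read and write to Cassandra.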