This is the second article in the series “Building a managed streaming data pipeline”. If you haven’t read the first one, here is the link:
Unleash the Spark: Create an Amazon EMR Serverless Cluster with Terraform and run PySpark jobs
Ignite Your Data Revolution and Harness the Power of Amazon EMR Serverless
In the first article I showed you how to run a PySpark job on Amazon EMR Serverless, managing the infrastructure with Terraform. This time, I’ll show you how to deploy a Kafka cluster with Amazon MSK, also using Terraform. Finally, the next article will put all the pieces together and build the streaming data pipeline using both technologies.
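To give a taste of what deploying MSK with Terraform looks like, here is a minimal sketch of an `aws_msk_cluster` resource. The cluster name, Kafka version, instance type, and the `var.private_subnet_ids` / `var.msk_security_group_id` variables are placeholder assumptions, not the article’s actual configuration:

```hcl
# Minimal sketch of an Amazon MSK cluster (hypothetical names and
# network IDs; the full configuration is developed in the article).
resource "aws_msk_cluster" "example" {
  cluster_name           = "streaming-pipeline" # assumed name
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3 # one broker per AZ is a common baseline

  broker_node_group_info {
    instance_type   = "kafka.t3.small"
    client_subnets  = var.private_subnet_ids      # assumed variable
    security_groups = [var.msk_security_group_id] # assumed variable

    storage_info {
      ebs_storage_info {
        volume_size = 100 # GiB of EBS storage per broker
      }
    }
  }
}
```

Note that `number_of_broker_nodes` must be a multiple of the number of subnets in `client_subnets`, since MSK distributes brokers evenly across Availability Zones.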
Basic knowledge of and some experience with Apache Kafka are recommended to fully understand the configurations provided in the Terraform files.