AWS Cloud Data Engineering End-to-End Project — EMR, EC2, Glue, S3, Spark, Zeppelin

# Overview

In this project, we will upload a CSV file to an S3 bucket, either manually or with automated Python/Shell scripts, and create a corresponding Glue Data Catalog table. The main part will be setting up a new EMR cluster. Once the cluster is running, we will run a Spark job from a Zeppelin notebook to modify the data, write the result back to S3 as a Parquet file, and register it as another Glue Data Catalog table. In the end, we will query the data using AWS Athena and S3 Select.

# S3 Bucket

We will need 3 buckets for this project: source, target, and log. We will upload the source data into the source bucket. The source bucket's name should be unique and describe the process (**dirty-transactions-from-csv-to-parquet** for this project). We will upload our [initial CSV file](https://github.com/dogukannulu/glue_etl_job_data_catalog_s3/blob/main/data_sources/dirty_transactions.csv) into this bucket with the key **dirty_transactions/dirty_transactions.csv**. If we want to upload the data automatically from inside the EC2 instance, a minimal script such as the sketch below can be used.
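The automated upload from the EC2 instance can be done with a short boto3 script. The following is a minimal sketch, assuming the instance has an IAM role (or configured credentials) with `s3:PutObject` permission on the source bucket; the bucket name and object key come from this project, while the local file path is a placeholder you would adjust.

```python
import boto3
from botocore.exceptions import ClientError

# Bucket and key used in this project; LOCAL_FILE is a placeholder path on the EC2 instance.
SOURCE_BUCKET = "dirty-transactions-from-csv-to-parquet"
OBJECT_KEY = "dirty_transactions/dirty_transactions.csv"
LOCAL_FILE = "dirty_transactions.csv"


def upload_csv_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload the local CSV file to the source S3 bucket."""
    # Credentials are resolved from the instance profile / IAM role or the AWS CLI config.
    s3 = boto3.client("s3")
    try:
        s3.upload_file(local_path, bucket, key)
        print(f"Uploaded {local_path} to s3://{bucket}/{key}")
    except ClientError as err:
        print(f"Upload failed: {err}")
        raise


if __name__ == "__main__":
    upload_csv_to_s3(LOCAL_FILE, SOURCE_BUCKET, OBJECT_KEY)
```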
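For orientation, the Spark job we will later run from the Zeppelin notebook on EMR boils down to reading the CSV from the source bucket, applying the modifications, and writing Parquet to the target bucket. The snippet below is only a rough sketch: the target bucket name and the cleaning steps are placeholders, not the actual transformations covered later in the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Source path comes from this project; the target path is a hypothetical example.
SOURCE_PATH = "s3://dirty-transactions-from-csv-to-parquet/dirty_transactions/dirty_transactions.csv"
TARGET_PATH = "s3://<your-target-bucket>/clean_transactions/"

# In Zeppelin on EMR a SparkSession named "spark" already exists; this builder is for standalone runs.
spark = SparkSession.builder.appName("dirty-transactions-to-parquet").getOrCreate()

# Read the raw CSV from the source bucket.
df = spark.read.option("header", "true").option("inferSchema", "true").csv(SOURCE_PATH)

# Placeholder modifications: trim string columns and drop fully empty rows.
cleaned = (
    df.select(
        [F.trim(F.col(name)).alias(name) if dtype == "string" else F.col(name)
         for name, dtype in df.dtypes]
    )
    .dropna(how="all")
)

# Write the result back to S3 as Parquet.
cleaned.write.mode("overwrite").parquet(TARGET_PATH)
```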
Tags: AWS Cloud