Overview
In this project, we will upload a CSV file to an S3 bucket, either manually or with automated Python/Shell scripts, and create a corresponding Glue Data Catalog table. The core of the project is standing up a new EMR cluster. Once it is running, we will use a Zeppelin notebook to run a Spark job that modifies the data, write the result back to S3 as a Parquet file, and register it as another Glue Data Catalog table. Finally, we will query the data with AWS Athena and S3 Select.
S3 Bucket
In this project, we will need three buckets: source, target, and log. The source data goes into the source bucket, whose name should be unique and describe the process (dirty-transactions-from-csv-to-parquet for this project). We will upload our initial CSV file to this bucket under the key dirty_transactions/dirty_transactions.csv. If we want to upload the data automatically from inside the EC2 instance, all the details can be found in the article below.
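If you prefer to script the upload rather than use the console, a minimal sketch with boto3 might look like the following. The bucket name and object key match the ones above; the local file path and the function name are assumptions for illustration, and the bucket-creation call assumes us-east-1 (other regions require a LocationConstraint).

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "dirty-transactions-from-csv-to-parquet"  # source bucket from this project
KEY = "dirty_transactions/dirty_transactions.csv"  # object key used above
LOCAL_FILE = "dirty_transactions.csv"              # assumed local path to the CSV

def upload_source_csv() -> None:
    """Create the source bucket if it does not exist, then upload the CSV under the expected key."""
    s3 = boto3.client("s3")
    try:
        s3.head_bucket(Bucket=BUCKET)  # succeeds if the bucket already exists and we can access it
    except ClientError:
        s3.create_bucket(Bucket=BUCKET)  # us-east-1 only; elsewhere pass CreateBucketConfiguration
    s3.upload_file(LOCAL_FILE, BUCKET, KEY)
    print(f"Uploaded {LOCAL_FILE} to s3://{BUCKET}/{KEY}")

if __name__ == "__main__":
    upload_source_csv()
```

Run from an EC2 instance with an instance profile that allows s3:CreateBucket and s3:PutObject, no explicit credentials are needed; boto3 picks them up from the instance metadata.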