In one of my previous articles on using AWS Glue, I showed how you could use an external Python database library (pg8000) in your AWS Glue job to perform database operations. Click on the link below to read that story.
Connecting external database libraries with AWS Glue
Using pg8000
At the end of that article, I warned that the Glue job cannot parallelise such database operations and that you might run into issues when processing large data sets.
In fact, that is exactly what happened to me, and this is what I did to fix it.
At the end of one particular daily job, I wanted to clear out stale data from a Postgres table after inserting new data, keeping about 6 days’ worth of data in the table at any one time. Effectively, I needed to run something like this at the end of my job:
delete from myschema.mytable where last_update_date < current_date - 6
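For context, here is a minimal sketch of how a delete like this might be issued from Python through a DB-API driver such as pg8000. It is an illustration rather than the exact code from my job: the function names `build_purge_statement` and `purge_stale_rows` are hypothetical, and I am assuming the driver uses the `format` parameter style (`%s`). The cutoff date is computed on the client instead of using `current_date` on the server, so the two can differ slightly if their clocks or time zones differ.

```python
import datetime


def build_purge_statement(retention_days=6, today=None):
    """Return (sql, params) for deleting rows older than the retention window.

    Hypothetical helper: the table and column names come from the statement
    above, and the %s placeholder assumes a 'format' paramstyle driver.
    """
    today = today or datetime.date.today()
    cutoff = today - datetime.timedelta(days=retention_days)
    sql = "DELETE FROM myschema.mytable WHERE last_update_date < %s"
    return sql, (cutoff,)


def purge_stale_rows(conn, retention_days=6):
    """Run the delete on an open DB-API connection and return the row count."""
    sql, params = build_purge_statement(retention_days)
    cursor = conn.cursor()
    try:
        # A single statement on a single connection: nothing here is
        # distributed across the Glue workers.
        cursor.execute(sql, params)
        deleted = cursor.rowcount
        conn.commit()
    finally:
        cursor.close()
    return deleted
```

Any open DB-API connection, for example one returned by pg8000's `connect` function, could be passed in as `conn`. Note that the whole delete runs as one statement on that one connection.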
My first try was with the pg8000 Python library, but the Glue job started to fail with an error similar to this.