Tech Talk: Is ETL in the cloud possible without coding? (Part 3)

In this last part of my Tech Talk (Part 1 and Part 2 here), I would like to describe our experience with AWS on an ETL project for a customer in the digital marketing industry. The need for a Data Warehouse came from the growing volume and complexity of the data. At first, I found that joining data from different sources in Tableau (blending) resulted in very sluggish dashboards. We needed to put another solution in place.

There were two data sources for the use cases we worked on: a PostgreSQL database hosted in the AWS cloud and Google BigQuery containing analytics data. The client has experience with the cloud and follows good DevOps practices like Infrastructure as Code (IaC). With the vendor selected, the only decision left for us was the ETL tool. The two options we considered were AWS Glue or Lambda functions written in Python triggered by Step Functions. Tempted by the data catalog functionality, we chose Glue.

Extraction (ETL)

Glue is, in essence, a managed Spark cluster. In AWS terminology it is 'serverless,' since you don't get to configure any details other than the number of nodes. Glue Studio allows you to create no-code ETL flows, but I will elaborate on this further below. As described in the previous articles, the first step in ETL is Extraction. This is where the first problems appeared, and Glue, unfortunately, proved that it is still in its infancy.

In the Glue environment, there are so-called crawlers that can scan databases to list all the tables and their metadata. You can then extract the desired tables. Unfortunately, the crawlers only recognize tables and not materialized views, which contain certain filters and are prevalent in the client's PostgreSQL setup. This was no big issue; we could work with the raw tables.
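To give an idea of what this looks like outside the console, here is a minimal sketch of registering and starting such a crawler with boto3; the crawler name, IAM role, connection name, and schema path are all hypothetical placeholders, not the client's setup.

```python
import boto3

# Minimal sketch: register a crawler over a JDBC connection (all names are placeholders).
glue = boto3.client("glue", region_name="eu-west-1")

glue.create_crawler(
    Name="postgres-source-crawler",                          # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder IAM role
    DatabaseName="source_catalog",                           # Data Catalog database to populate
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "postgres-connection",     # pre-created Glue connection
                "Path": "sourcedb/public/%",                 # schema path to scan
            }
        ]
    },
)

# Run it; tables (but not materialized views) then show up in the Data Catalog.
glue.start_crawler(Name="postgres-source-crawler")
```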

The second problem was the BigQuery connection. This connector is not available out of the box; you need to install it from the AWS Marketplace. This has two consequences. Firstly, you cannot crawl databases on the Google side, so you need to know the table names and database schemas in advance. We solved it by copying the desired tables onto S3 buckets and running the crawler over them. Secondly, you will not be able to recreate this connector with IaC tools such as Terraform, which means that fully automated infrastructure deployment is impossible; a human operator is needed.
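Once the S3 copy has been crawled, the data can be read in a Glue job straight from the Data Catalog. A minimal sketch, assuming a catalog database and table created by that crawler (both names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Sketch: read the BigQuery data that was copied to S3 and crawled into the
# Data Catalog (database and table names below are hypothetical).
glue_context = GlueContext(SparkContext.getOrCreate())

analytics_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="bigquery_copy",   # catalog database populated by the S3 crawler
    table_name="ga_sessions",   # hypothetical table name
)

print(analytics_dyf.count())
analytics_dyf.printSchema()
```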

Transformation (ETL)

Glue uses a PySpark dialect. Instead of data frames, you work with dynamic frames, which can accept more than one type per column (as a list of choices). Also, in theory, there is a UI designed to let you perform ETL without code. From my experience, the number of transformations provided by AWS is minimal: you have joins, basic aggregations, select, and a limited filter. For example, there is no way to create calculated fields. There is a custom code block, but using it is very cumbersome since you cannot debug or preview your (dynamic) data frames. I lost a couple of days before realizing I couldn't develop anything with this tool.
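As an illustration of the kind of calculated field the UI cannot express, here is a sketch of adding one in code by dropping from a dynamic frame down to a plain Spark DataFrame and back; the column names are made up for the example.

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Sketch: add a calculated field by converting the DynamicFrame to a Spark
# DataFrame and back (column names are hypothetical).
def add_net_revenue(dyf, glue_context):
    df = dyf.toDF()  # DynamicFrame -> Spark DataFrame
    df = df.withColumn("net_revenue", F.col("gross_revenue") - F.col("refunds"))
    return DynamicFrame.fromDF(df, glue_context, "with_net_revenue")
```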

So, how do you do it? 100% code! First, you have to create so-called development endpoints. It's a two-step activity that takes some 20 minutes every time, and you even get charged for the time the endpoint spends launching! On top of a dev endpoint, you create a SageMaker notebook (a similar concept to Jupyter Notebooks or Google Colab), and only there can you do something productive. A good way to start is to generate some boilerplate code with all the imports and the creation of the Glue and Spark contexts. The notebook is user-friendly and allows you to debug your code and view variables and data frames. If you have worked with Databricks, you know what I mean.
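For reference, that generated boilerplate looks roughly like this (in a notebook you can skip the job arguments and just build the contexts):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: parse the job arguments, build the Spark and
# Glue contexts, and initialise the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... extraction and transformation code goes here ...

job.commit()
```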

Load (ETL)

The third and last step is Loading. As last time, the task was to put several dimension and fact tables, stored in the Parquet format, into the data warehouse, another PostgreSQL database. As with the Azure solution, we could not define important load options such as truncating, dropping, or overwriting tables. This forced us to write the loading step manually as PySpark scripts using a JDBC driver.
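A sketch of that manual load step, with placeholder connection details, paths, and table names standing in for the real ones:

```python
from pyspark.sql import SparkSession

# Sketch: read the Parquet output and push it to the warehouse over JDBC
# (connection details, paths, and table names are placeholders).
spark = SparkSession.builder.getOrCreate()

dim_customer = spark.read.parquet("s3://example-bucket/warehouse/dim_customer/")

(
    dim_customer.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse-host:5432/dwh")
    .option("dbtable", "public.dim_customer")
    .option("user", "etl_user")          # in practice, pull credentials from a secret store
    .option("password", "REPLACE_ME")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")                   # the load option the no-code flow would not let us set
    .save()
)
```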

Summary

Both Azure Data Factory and AWS Glue turned out to be limited as no-code ETL tools. Maybe no-code ETL will be possible one day, but as of early 2022, it's not an option. In the ETL world, you will have to get your hands dirty with coding!
