Tech Talk: Is ETL in the cloud possible without coding? (Part 2)


In this article, following Part 1, I would like to illustrate the setup of an ETL flow project for a customer in the hospitality industry. Since the customer’s desired front-end tool is Power BI, we proposed Microsoft as the vendor for the cloud migration. Azure Portal is a very user-friendly cloud UI with thorough documentation, so it is a good option for starters. The only downside is that Azure is slightly more expensive than its competitors.

There was only one data source for this proof of concept, highlighted in purple in the as-is data architecture diagram:

(Figure: as-is data architecture)

Extraction (ETL)

The first step in ETL is Extraction. Our task was to pull nearly 30 tables from the MySQL database, and we used Azure Data Factory (ADF) for this. Indeed, it is one of the few activities where coding is unnecessary. The only tricky part is that the MySQL database sits on a private network, which means you need to install a self-hosted integration runtime on a VM inside that network; the installation process is not the easiest. Remember also to provide this VM with a Java Runtime Environment so the copy activities can write the tables out as parquet files. Parquet is an efficient columnar format that carries its own schema and metadata. Snappy is the default compression codec, and we highly recommend it.
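To make this concrete, here is a minimal PySpark sketch of what reading one of those extracted parquet files back in Databricks could look like; the mount path and table name are placeholders, not the customer's actual setup:

# Minimal sketch (PySpark), assuming the ADF copy activities land the tables
# as snappy-compressed parquet files in a mounted data lake path.
# The path below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet carries its own schema, so the column types defined at
# extraction time come back without any extra configuration.
reservations = spark.read.parquet("/mnt/landing/mysql/reservations")
reservations.printSchema()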

ADF has the following hierarchy: Data Factory > Pipelines > Activities. Once you have linked your ADF with all the needed resources (in our case: the MySQL database, Azure data storage, and Databricks), you can start extracting data with so-called ‘copy activities.’ ADF offers two options: ingest templates and user-created pipelines.

In the first case, you scan your data source, select the tables of interest, and copy them whole into the sink of your choice.

In the second case, you must create a separate copy activity per database table. Yes, it’s tedious, but you can attach a SQL query to each activity. This can save you a lot of storage space if you want to remove deprecated columns or filter out rows (e.g., based on the date). Also, check whether columns of type datetime were extracted correctly; if not, a type cast is needed, as in the sketch below.
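Here is a hedged PySpark sketch of that check; the table path and column name are hypothetical:

# Minimal sketch (PySpark): verifying and fixing a datetime column after
# extraction. The path and column name below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bookings = spark.read.parquet("/mnt/landing/mysql/bookings")
bookings.printSchema()  # did check_in_date arrive as string or timestamp?

# If it came through as a string, cast it back explicitly
bookings = bookings.withColumn(
    "check_in_date",
    F.to_timestamp("check_in_date", "yyyy-MM-dd HH:mm:ss")
)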

Transformation (ETL)

The second step in ETL is Transformation. We did this using two Spark clusters (one for development and one for batch jobs) managed by the Databricks environment. Databricks notebooks are a convenient way to write your transformation code in Python through the PySpark API (or in Scala), and to debug and monitor the individual jobs running on the cluster.
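To give a flavour of what such a notebook cell looks like, here is a small, hedged sketch of a typical transformation; all table paths and column names are made up for illustration:

# Minimal sketch (PySpark): building a simple fact table in a Databricks
# notebook. Paths and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bookings = spark.read.parquet("/mnt/landing/mysql/bookings")
rooms = spark.read.parquet("/mnt/landing/mysql/rooms")

# Join, derive a measure, and aggregate into a daily revenue fact table
fact_revenue = (
    bookings.join(rooms, "room_id")
    .withColumn("nights", F.datediff("check_out_date", "check_in_date"))
    .groupBy("hotel_id", F.to_date("check_in_date").alias("stay_date"))
    .agg(F.sum(F.col("nights") * F.col("price_per_night")).alias("revenue"))
)

fact_revenue.write.mode("overwrite").parquet("/mnt/curated/fact_revenue")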

Load (ETL)

The third and last step is Loading. The task was to put several dimension and fact tables, stored as parquet, into the data warehouse, which is an Azure SQL database. Surprisingly, ADF turned out to be quite limited here: you cannot load multiple parquet files automatically, and you cannot define important load options such as truncating, dropping, or overwriting tables. This led us to put the loading scripts, which write over a JDBC connection, in Databricks, using code along the following lines (the PostgreSQL URL below is just a placeholder; in our case the connection string pointed to the Azure SQL database):

df_name.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

There are plenty of options available with this method; they are described in the Spark documentation: JDBC To Other Databases - Spark 3.2.1 Documentation (apache.org)
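For completeness, here is a hedged sketch of what the load to the Azure SQL database could look like with some of those options; the server, database, table, and credential values are placeholders, and it assumes a SQL Server JDBC driver is available on the cluster:

# Minimal sketch (PySpark): loading a curated table into an Azure SQL
# database over JDBC, with truncate/overwrite behaviour controlled in code.
# Server, database, table, and credential values are placeholders.
# In a Databricks notebook, `spark` is already available.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<database>;encrypt=true"
)

fact_revenue = spark.read.parquet("/mnt/curated/fact_revenue")

fact_revenue.write \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "dbo.fact_revenue") \
    .option("user", "<username>") \
    .option("password", "<password>") \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()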

Up next

Look out for Part 3, where I'll describe another cloud story using AWS to help the client establish their ETL flows and data warehouses.
