ETL – From SSIS to Azure

by

I recently built a set of traditional on-prem SSIS ETL processes for a data warehouse, only to discover that the client had chosen Azure as the server platform, post development. SSIS (SQL Server Integration Services) is a Microsoft tool that is used to extract, transform and load (ETL) data from one database/source to another. Azure provides a similar service which they call Azure Data Factory (ADF). This service does similar data preparation and transformation but differs in configuration from a developer’s point of view while adding additional functionality such as machine learning and power query. Since the ETL processes had already been designed in SSIS, the simplest option was to create the Staging and DWH (Production) SQL databases on Azure.

So far so good… what followed was a simple ETL deployment via SQL Server Management Studio (SSMS) to Azure and the processes were ready for their first execution. The events that transpired soon clarified that this approach was not going to work – the problem being that Azure SQL databases do not allow queries initiated in one database to use data from another database, known as cross database querying!

In this first option I discovered there are workarounds such as Azure Elastic Database Query which allow for cross database querying, but upon digging deeper, I realised that SSIS redesign and additional data movement processing costs were imminent.

(Above diagram) Elastic database query allows you to query multiple different databases with Transact-SQL (T-SQL) and therefore negates the need to move data from one DB to another in ADF.

Option 2 was to merge the Staging and DWH SQL databases into a single DB, which bypassed the no cross querying Azure rule, utilising the table schema to separate Staging and DWH objects. This involved the modification of a couple hundred stored procedures, views and functions which I estimated would only require a couple of days to complete! That was still quicker than redesigning and retesting the ETL processes.

Option 3 was to utilise an Azure SQL managed instance. This turned out to be the simplest option with no ETL redesign and no object modification required. Azure SQL Managed Instance (SMI) is essentially the same as a traditional on-prem SQL server, offering all of the functionality you would be accustomed to in that environment.

Once the ETL processes had been deployed to the SMI it was necessary to chain up the respective ETL processes in the correct order and to schedule accordingly. This was achieved by creating pipelines, which is an Azure Data Factory feature that automatically builds and tests code projects and delivers code to the database. These pipelines then simply allowed for the execution of traditional SSIS processes in the Azure environment, which meant I did not have to do any redesign and that was great news.

Azure Pipeline Example

(Above diagram) An example of a pipeline within Azure data factory. You will notice the similarities between SSIS and ADF such as the regular precedence constraints between tasks.

Well done and thank you Microsoft for incorporating a SQL managed instance within Azure!