Migration of Data Pipelines from a Costly SaaS Tool to the Ultimate Open-Source Stack

In the era of easy funding and high valuations, growth was the name of the game. The question of ‘How can we quickly grow our user base?’ was on everyone’s mind. A key enabling factor here is the data related to a product. Marketing wants to see the conversion rate from the newly implemented advertising platform to decide how to distribute their budget. The Product Manager wants to see how the test on the new A/B testing infrastructure ran, to decide whether or not to implement the new feature. Both of these use cases lead fairly directly to growth, and both depend on making data available through some sort of ETL, modelling, and dashboarding process. So we can see how making data available quickly can lead to growth. This is where Software as a Service (SaaS) ETL tools enter the fray.

At first, it is easy and cheap. And maybe it stays that way: maybe the company has some very basic reporting needs, imports 1 GB of data a week for weekly reporting, does some fairly standard transformations in SQL, and makes the results available to a few users across a couple of dashboards. Sometimes this is the case. But the fact that you are reading this article makes it unlikely that this is your case. More often, the company accumulates more use cases, with both greater complexity and increasing data volumes. As a result, implementing new data pipelines in the SaaS tool becomes difficult and messy (sometimes the more complex use cases cannot be implemented at all), and running the growing data volumes through the existing pipelines becomes very expensive.

Therefore, more and more companies, both growing and mature, are looking to move away from SaaS ETL platforms to data infrastructure that is more efficient and better suited to their needs. In this article, we discuss one example of how DataZ helped one of its customers do just that; in this case, DataZ was heavily involved in all aspects of the design, implementation, and client upskilling. But DataZ prides itself on doing what is best for its clients, and so the level of involvement is tailored to each client’s needs and skill set.

Migration from Legacy ETL tool to AWS hosted Airflow, Meltano, and dbt on Snowflake

Motivation

A growing FinTech company was looking to transform its data operations. They were using a legacy SaaS ETL tool to orchestrate their Extract, Load, and Transform tasks, with Snowflake as their data warehouse (see Figure 1 below). Whilst they were fairly happy with Snowflake, they were less so with the ETL tool. They found that:

As a result, they brought DataZ in to review the current infrastructure and understand the requirements of any new infrastructure, in order to then design and implement that new infrastructure alongside them.

After a short assessment period, in which DataZ embedded with the client’s data team to really understand their work, the company’s strategy, and the team’s capability, a new infrastructure was proposed, as seen in Figure 2 below. The new infrastructure was hosted on AWS and based around Airflow for orchestration, with Meltano deployed via Docker to Kubernetes pods to run the Extract-Load, dbt Core running the transformations on Snowflake, and the code-base version controlled on GitHub, leveraging GitHub Actions for CI/CD. As part of the process, a new data model would be designed and implemented within Snowflake in place of the old one.

Below we see the before and after states of the data infrastructure.

This addressed the pain points of the current infrastructure discussed above:

Later in this article, we will pull out tangible examples of how the new architecture addressed each of the above points day-to-day.

Implementation

How did we go about implementing this?

We adopted a staged, agile approach to implementation: this allowed benefits to start being realised as soon as possible, let us tweak the exact implementation as new components were tested in real scenarios, and gave the users confidence that the new data infrastructure worked as they expected and made their jobs easier.

The migration to the new infrastructure was made in 3 steps:

1. Migrate the EL processes to Meltano, orchestrated by Airflow

Once the source connections (Mongo, Postgres, MSSQL, various APIs and external systems) and target connections (mainly Snowflake) were set up in Meltano and deployed as a Docker image, Airflow was set up to trigger Kubernetes pods running that Docker image. This enabled all the data sources to be loaded to their respective targets.
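
To make this concrete, the sketch below shows roughly what such an orchestration task can look like, assuming a pre-built Meltano Docker image and a Postgres-to-Snowflake pipeline; the image name, Kubernetes namespace, plugin names, and schedule are illustrative assumptions rather than the client’s actual configuration.

```python
# Illustrative sketch only: Airflow triggers a Meltano EL run inside a Kubernetes pod.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="meltano_el_postgres_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    run_el = KubernetesPodOperator(
        task_id="meltano_run_el",
        name="meltano-el",
        namespace="data-pipelines",                           # assumed namespace
        image="registry.example.com/meltano-project:latest",  # assumed image name
        cmds=["meltano"],
        arguments=["run", "tap-postgres", "target-snowflake"],
        get_logs=True,
    )
```

A task like this can be defined per source system, so that each extract-load is scheduled and retried independently.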

Before the tables loaded by Meltano were put to use, the incremental loading of the data sources was run for a test period and checked against the existing loads from the legacy ETL tool to confirm that the same data was being loaded.
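
The checks themselves can be as simple as comparing row counts (and, where needed, checksums or high-water-mark timestamps) between the two loads. A minimal sketch follows, with placeholder schema, table, and connection names rather than the client’s real objects.

```python
# Illustrative reconciliation check between the Meltano-loaded schema and the
# schema still loaded by the legacy ETL tool. All object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder; real credentials came from a secrets store
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
)

QUERY = """
    SELECT
        (SELECT COUNT(*) FROM RAW_MELTANO.ORDERS) AS meltano_rows,
        (SELECT COUNT(*) FROM RAW_LEGACY.ORDERS)  AS legacy_rows
"""

with conn.cursor() as cur:
    meltano_rows, legacy_rows = cur.execute(QUERY).fetchone()

if meltano_rows != legacy_rows:
    raise ValueError(
        f"Row count mismatch for ORDERS: Meltano={meltano_rows}, legacy={legacy_rows}"
    )
print("Row counts match for ORDERS")
```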

Once this was confirmed, the data transformation processes were re-pointed (simply via views) at the new data sources loaded by Meltano, giving a seamless transition from loading via the ETL tool to loading via Meltano.
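
A minimal sketch of that re-pointing, assuming the downstream SQL reads from a RAW schema and Meltano loads into a RAW_MELTANO schema (both names, and the table list, are illustrative):

```python
# Illustrative re-pointing: redefine the schema that downstream transformations
# already read from as views over the schema Meltano now loads into.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="ANALYTICS_WH", database="ANALYTICS",
)

TABLES = ["ORDERS", "CUSTOMERS", "PAYMENTS"]  # placeholder table list

with conn.cursor() as cur:
    for table in TABLES:
        cur.execute(
            f"CREATE OR REPLACE VIEW RAW.{table} AS SELECT * FROM RAW_MELTANO.{table}"
        )
```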

All of the data loads in the ETL tool were then switched off, immediately leading to a massive cost saving (a 95% cost reduction for the data EL processes).

2. Migrate the transformations to Airflow as is

With the data now loading via Meltano, orchestrated by Airflow, the transformations (mostly SQL scripts) were migrated to be orchestrated by Airflow. Again, these were run in parallel in separate schemas in Snowflake for a short period to monitor and confirm that the transformations gave the same output. The transformations were then switched off in the ETL tool, and all the reporting and analytics were migrated to the tables produced by the Airflow orchestration.
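
As a rough sketch of this stage, the migrated SQL scripts can be run in order against Snowflake via Airflow’s generic SQL operator; the script names, search path, and connection id below are assumptions for illustration.

```python
# Illustrative sketch: the existing SQL transformation scripts, now version
# controlled in the repo, are executed in sequence against Snowflake by Airflow.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="legacy_transformations",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    template_searchpath="/opt/airflow/sql",  # assumed location of the migrated scripts
) as dag:
    build_staging = SQLExecuteQueryOperator(
        task_id="build_staging",
        conn_id="snowflake_default",  # assumed Airflow connection id
        sql="01_staging.sql",
    )
    build_reporting = SQLExecuteQueryOperator(
        task_id="build_reporting",
        conn_id="snowflake_default",
        sql="02_reporting.sql",
    )
    build_staging >> build_reporting
```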

This stage of the migration was implemented very quickly, and at this point the ETL tool was redundant: all of the Extract, Load, Transform, and orchestration had been migrated off it, so it was completely switched off and the contract was terminated.

3. Migrate to the new data model and transformation procedure implemented in dbt

A new data model was designed, then incrementally implemented in dbt and initially run alongside the old data model. Where the business logic of the new data model matched the old, those fields were aliased and injected into the old data model to replace its feeds, allowing us to slowly deprecate some of the old transformations. This avoided having the same logic defined in two places and eased the transition to the new data model, reducing the opportunities for error whilst ensuring the new transformation processes were stable and suitable for the business.
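
A hedged sketch of how the new dbt model can be built and tested alongside the legacy transformations follows; the dbt project path, the ‘new_model’ selector, and the schedule are assumptions for illustration.

```python
# Illustrative sketch: dbt Core builds and tests the new data model on a schedule,
# running alongside the legacy transformation DAG. Paths and selectors are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # assumed checkout location of the dbt project

with DAG(
    dag_id="dbt_new_data_model",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --select +new_model",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --select +new_model",
    )
    dbt_run >> dbt_test
```

One way to do the ‘alias and inject’ step described above is via views over the new dbt models, keeping a single definition of the shared business logic.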

It should be noted that the benefits of having put the old data model’s transformations into a version-controlled environment (in order to run them on Airflow) were already being felt here. Analysing the business logic of the old data model was massively streamlined simply by being able to full-text search the old SQL scripts and navigate them in an IDE, rather than having them buried in layers of the ETL tool’s web UI, where you need to click through several screens every time you want to check something different.

Eventually, all the reporting and analytics were switched over to the new data model, the orchestration for the old data model was turned off, and its transformation scripts were deprecated.

Summary of Staged Approach

This concluded the staged migration from the old data infrastructure to the new. By adopting this staged approach, we ensured the migration succeeded with maximum fit for the customer’s use cases and minimal unforeseen delays, whilst also realising the value (operationally and cost-wise) incrementally along the way, therefore maximising that value. The staged approach allowed us to deal with issues as they came up and to tweak the design as we got feedback on the implemented infrastructure, preventing the ‘surprise’ technical debt and end-of-project iterations that a non-staged approach would have incurred.

Impact

What specific benefits did the client see with this new data infrastructure?

Cost Savings

The clearest and most immediately tangible benefit of the project was cost savings. The largest savings were realised during the first stage of the implementation, which is one of the reasons it was chosen to be implemented first.

By moving to this implementation, the client reduced their costs related to orchestration, extract and load by 85% and their data-warehousing costs by 61%, leading to an overall 72% cost reduction of their data infrastructure.

This cost reduction was based on their data volumes at the time; with the greater data volumes expected in the future, it would translate to an even higher percentage saving versus not migrating, due to economies of scale (certain fixed costs do not strictly increase with volume).

Capability

Both the client data team’s skills and the range of data tasks the new infrastructure could perform were improved. The client team upskilled as they built the infrastructure alongside us, and, being based on extensible frameworks and cloud services in the client’s own account, the infrastructure could be customised and scaled as needed.

A couple of tangible examples include:

Iteration Speed

By introducing best practices such as CI/CD via GitHub Actions and Dockerisation, testability and testing features via dbt, Airflow, and Meltano, and customised monitoring and alerting integrations with central tools such as Slack and email, the client could both develop and deploy quickly and smoothly, and react quickly and precisely when there was an issue.

This allowed not just faster iteration, but also more complex and risky deployments to be made (linking to the capability above). Deployment was now automated, minimising the chances of errors; and if issues did arise, alerting would notify the team early and the deployment could be rolled back to a stable version, minimising any potential impact.
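
As one hedged example of the alerting side, an Airflow failure callback can post straight to a Slack incoming webhook; the Variable name and message format below are illustrative, not the client’s actual integration.

```python
# Illustrative alerting hook: an Airflow on_failure_callback that posts the failed
# task and a link to its logs to a Slack incoming webhook. Names are placeholders.
import requests
from airflow.models import Variable


def notify_slack_on_failure(context):
    """Post details of a failed task instance to Slack."""
    ti = context["task_instance"]
    message = (
        f":red_circle: Task `{ti.task_id}` in DAG `{ti.dag_id}` failed.\n"
        f"Logs: {ti.log_url}"
    )
    webhook_url = Variable.get("slack_alerts_webhook")  # assumed Airflow Variable name
    requests.post(webhook_url, json={"text": message}, timeout=10)


# Attached to each DAG via default_args, e.g.:
# default_args = {"on_failure_callback": notify_slack_on_failure}
```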

Just a few of the tangible examples of the day-to-day impact that could not have happened before:

Quality

Through testing, pull-request code reviews, and the ability to iterate rapidly, the quality of the resulting work is greatly improved, and any bugs or issues that do occur can be rectified quickly. Both the quality itself and the quick reaction when issues occur lead to more trust in the data from end consumers, and so more value realised by the business from the data.

In addition, having quality-controlled code in one place provides harder-to-quantify benefits, helping ensure quality in future developments through situational awareness.

Closing Remarks

As discussed, current market pressures are increasingly leading companies to shift away from SaaS data tooling. We have seen here an example of how this can be done very successfully, both in terms of the migration between infrastructures and the advantages realised in day-to-day use of the new infrastructure.

In this case, DataZ was heavily involved in all aspects of the migration, but we are happy to help in whatever way is most suitable for your business: from advising on the architecture of a new solution and the migration, to working alongside you to implement and migrate to the new solution, upskilling your team along the way. We will work with you to support the specific needs of your business.

  • All product names, logos, and brands are the property of their respective owners. Use of these names, logos, and brands does not imply endorsement.