Migration of Data Pipelines from a Costly SaaS Tool to the Ultimate Open-Source Stack
In the era of easy funding and high valuations, growth was the name of the game. The question 'How can we quickly grow our user base?' was on everyone's mind. A key enabler here is product data. Marketing wants to see the conversion rate from the newly implemented advertising platform to decide how to distribute their budget. The Product Manager wants to see how the experiment run on the new A/B testing infrastructure performed to decide whether or not to implement the new feature. Both of these use cases lead pretty directly to growth, and both depend on making data available through some sort of ETL, modelling, and dashboarding process. So making data available quickly can drive growth. This is where Software as a Service (SaaS) ETL tools enter the fray.
At first, it is easy and cheap. And maybe it stays that way: maybe the company has some very basic reporting needs, imports 1 GB of data a week for weekly reporting, does some fairly standard transformations in SQL, and makes the results available to a few users across a couple of dashboards. Sometimes this is the case. But the fact that you are reading this article makes it unlikely that it is yours. More often, the company accumulates more use cases, with both more complexity and growing data volumes. As a result, implementing new data pipelines in the SaaS tool becomes difficult and messy (sometimes the more complex use cases cannot be implemented at all), and running the growing data volumes through the existing pipelines becomes very expensive.
Therefore, more and more companies, both growing and mature, are looking to move away from SaaS ETL platforms to a data infrastructure that is more efficient and better suited to their needs. In this article, we discuss one example of how DataZ helped one of its customers do just that. In this case, DataZ was heavily involved in all aspects of the design, implementation, and client upskilling; however, DataZ prides itself on doing what is best for its clients, so the level of involvement is always tailored to the client's needs and skill set.
Migration from a Legacy ETL Tool to AWS-Hosted Airflow, Meltano, and dbt on Snowflake
Motivation
A growing FinTech company was looking to transform its data operations. They were using a legacy SaaS ETL tool to orchestrate their Extract, Load, and Transform tasks, with Snowflake as their data warehouse (see Figure 1 below). Whilst they were fairly happy with Snowflake, they were less so with the ETL tool. They found that:
- They were running into issues that they couldn't properly debug themselves and that took the ETL tool's vendor weeks or months to address.
- They were struggling to implement more complex tasks because they were shoehorned into using only the features available within the ETL tool.
- They wanted to move towards DataOps practices more in line with software engineering best practices but, again, were limited by the platform's features.
- The costs were rising steeply as their data volumes grew.
As a result, they brought DataZ in to review the current infrastructure, understand the requirements of any new infrastructure, and then design and implement that new infrastructure alongside them.
After a short assessment period, during which DataZ embedded with the client's data team to really understand their work, the company's strategy, and the team's capability, a new infrastructure was proposed, as seen in Figure 2 below. The new infrastructure was hosted on AWS and based around Airflow for orchestration, with Meltano deployed via Docker to Kubernetes pods to run the Extract-Load, dbt Core running the transformations on Snowflake, and the code base version-controlled on GitHub, leveraging GitHub Actions for CI/CD. As part of the process, a new data model would be designed and implemented within Snowflake in place of the old data model.
Below we see the before and after states of the data infrastructure.
This addressed the pain points of the current infrastructure discussed above:
- Debugging & fixing – The new infrastructure is based on open-source packages, so you can debug as you would your own code and implement fixes, either inside or outside that code base, yourself.
- Implementing more complex tasks – You can now integrate your own code anywhere in the infrastructure, whether triggered by Airflow or not.
- DataOps – The packages implemented (dbt, Meltano, Airflow) all support software engineering best practices out of the box, in areas like logging, testing, and moving between environments.
- Costs:
- Orchestration, Extract & Load (AWS) – no more paying extortionate fees per megabyte of data ingested and per 'action' run; just the much cheaper cost of the uptime of the machines used in AWS, whilst taking advantage of Meltano's and Airflow's built-in support for logging.
- Transform (Snowflake) – yes, Snowflake can be expensive if it isn't used properly. Costs are reasonable when you take steps to optimise your Snowflake compute (and, to a lesser extent, storage) usage with, for example, a well-designed, incremental data model alongside correct clustering. You then also keep all the advantages Snowflake offers for analytics use cases.
Later in this article, we will pull out tangible examples of how the new architecture addressed each of these points day-to-day.
Implementation
How did we go about implementing this?
We adopted a staged, agile approach to implementation: to start realising benefits as soon as possible, to allow tweaks to the exact implementation as each new piece was tested in real scenarios, and to give users confidence that the new data infrastructure works as they expect and makes their jobs easier.
The migration to the new infrastructure was made in 3 steps:
- Migrate the EL processes to Meltano, orchestrated by Airflow
- Migrate the transformations to Airflow ‘as is’ (i.e., with the same data model and transformation procedure)
- Migrate to the new data model and transformation procedure implemented in dbt
1. Migrating the EL processes to Meltano, orchestrated by Airflow
Once the source connections (MongoDB, Postgres, MSSQL, various APIs and external systems) and the target connections (mainly Snowflake) were set up in Meltano and packaged as a Docker image, Airflow was configured to trigger Kubernetes pods running that image. This enabled all the data sources to be loaded into their respective targets.
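To make the orchestration pattern concrete, a minimal sketch of an Airflow DAG triggering a Meltano extract-load job on Kubernetes is shown below. The DAG name, image, namespace, and tap/target pair are hypothetical placeholders rather than the client's actual configuration.

```python
# Minimal sketch (Airflow 2.4+ with the cncf.kubernetes provider): one task
# that spins up a Kubernetes pod from the Meltano Docker image and runs a
# single extract-load job. All names here are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="meltano_extract_load",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",              # set to the required load frequency
    catchup=False,
) as dag:
    load_postgres_to_snowflake = KubernetesPodOperator(
        task_id="el_postgres_to_snowflake",
        name="meltano-el-postgres",
        namespace="data-pipelines",                            # hypothetical namespace
        image="registry.example.com/meltano-project:latest",   # hypothetical image
        # Each source/target pair is a single `meltano run` invocation.
        cmds=["meltano"],
        arguments=["run", "tap-postgres", "target-snowflake"],
        get_logs=True,
    )
```

In practice there would be one such task (or task group) per source system, with credentials injected via Kubernetes secrets rather than baked into the image.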
Before the tables loaded by Meltano were put to use, the incremental loads ran for a test period and were checked against the existing loads from the ETL tool to confirm that the same data was being loaded.
Once this was confirmed, the data transformation processes were re-pointed (simply via views) at the new Meltano-loaded sources, giving a seamless transition from loading via the ETL tool to loading via Meltano.
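A minimal sketch of the kind of parity check used during such a test period is shown below: comparing row counts for the same tables loaded by the ETL tool and by Meltano into separate schemas. The connection details, schema names, and table names are hypothetical placeholders; a real check would typically also compare incremental keys or checksums.

```python
# Minimal sketch: compare row counts between the legacy-loaded and
# Meltano-loaded copies of each table in Snowflake. Credentials, schema
# and table names are hypothetical placeholders.
import snowflake.connector

TABLES = ["customers", "transactions"]  # hypothetical table names

conn = snowflake.connector.connect(
    account="my_account",      # in practice, pulled from a secrets manager
    user="comparison_user",
    password="...",
    warehouse="COMPARE_WH",
    database="ANALYTICS",
)

cur = conn.cursor()
for table in TABLES:
    cur.execute(f"SELECT COUNT(*) FROM LEGACY_RAW.{table}")
    legacy_count = cur.fetchone()[0]
    cur.execute(f"SELECT COUNT(*) FROM MELTANO_RAW.{table}")
    meltano_count = cur.fetchone()[0]
    status = "OK" if legacy_count == meltano_count else "MISMATCH"
    print(f"{table}: legacy={legacy_count} meltano={meltano_count} -> {status}")
cur.close()
conn.close()
```

Once the loads matched over the test window, re-pointing could then be as simple as a `CREATE OR REPLACE VIEW LEGACY_RAW.customers AS SELECT * FROM MELTANO_RAW.customers` run for each table.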
All of the data loads in the ETL tool were then switched off, immediately delivering a large cost saving (a 95% reduction in the cost of the data EL processes).
2. Migrate the transformations to Airflow as is
With the data now loading via Meltano, orchestrated by Airflow, the transformations (mostly SQL scripts) were migrated to be orchestrated by Airflow as well. Again, these were run in parallel in separate schemas in Snowflake for a short period to monitor and confirm that the transformations gave the same output. The transformations were then switched off in the ETL tool, and all the reporting and analytics were migrated to the tables produced by the Airflow orchestration.
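To illustrate, migrating the SQL scripts 'as is' can be little more than pointing Airflow at the existing files and preserving their original execution order. The sketch below uses the Snowflake provider's SnowflakeOperator; the script names, connection id, and search path are hypothetical placeholders rather than the client's actual setup.

```python
# Minimal sketch: run the existing transformation SQL scripts, unchanged,
# in their original order via Airflow's Snowflake provider. Script names,
# connection id, and search path are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

TRANSFORM_SCRIPTS = [          # hypothetical scripts, listed in execution order
    "staging_customers.sql",
    "fact_transactions.sql",
    "reporting_summary.sql",
]

with DAG(
    dag_id="legacy_transformations",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    template_searchpath="/opt/airflow/sql",  # directory holding the .sql files
) as dag:
    previous = None
    for script in TRANSFORM_SCRIPTS:
        task = SnowflakeOperator(
            task_id=script.replace(".sql", ""),
            snowflake_conn_id="snowflake_default",
            sql=script,  # .sql files are templated and loaded from the search path
        )
        if previous is not None:
            previous >> task  # preserve the sequential ordering of the old tool
        previous = task
```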
The implementation of this stage of the migration was very quick, and at this point the ETL tool was redundant. All of the extract, load, transform, and orchestration work had been migrated off it, so it was switched off completely and the contract was terminated.
3. Migrate to the new data model and transformation procedure implemented in dbt
A new data model was designed, then incrementally implemented in dbt and initially run alongside the old data model. Where the business logic of the new model matched the old one, the new fields were aliased and injected into the old data model to replace its feeds, allowing some of the old transformations to be deprecated gradually. This avoided having the same logic defined in two places and eased the transition to the new data model, reducing the opportunity for error whilst ensuring the new transformation processes were stable and suitable for the business.
It should be noted that the benefits of having put the old data model's transformations in a version-controlled environment (in order to run them on Airflow) were already being felt here. The process of analysing the business logic in the old data model was massively streamlined simply by being able to full-text search the old SQL scripts and navigate them in an IDE, rather than having them buried in layers of the ETL tool's web UI, where you need to click through several screens every time you want to check something different.
Eventually, all the reporting and analytics were switched over to the new data model, the orchestration for the old data model was turned off, and its transformation scripts were deprecated.
Summary of Staged Approach
Impact
Cost Savings
The clearest and most immediately tangible benefit of the project was cost savings. The largest savings were realised during the first stage of the implementation, which is one of the reasons it was chosen to be implemented first.
By moving to this implementation, the client reduced their orchestration, extract, and load costs by 85% and their data-warehousing costs by 61%, an overall 72% reduction in the cost of their data infrastructure.
This reduction was based on their data volumes at the time; with the greater data volumes expected in the future, the percentage saving versus not migrating would be even higher, due to economies of scale (certain fixed costs do not increase with volume).
Capability
Both the client data team's skills and the range of data tasks the new infrastructure could perform improved. The client team upskilled as they built the infrastructure alongside us, and, being based on extensible frameworks and cloud services in the client's own account, the infrastructure could be customised and scaled as needed.
A couple of tangible examples include:
- Previously, the team had no experience of implementing testing or CI/CD workflows; now both are part of every feature development.
- When the business needed a major back-end data migration to support a business-critical restructure of one of the core components, this would previously have required either a manual workaround outside the data infrastructure or a transactional process implemented by the back-end team in a code base not designed for moving and transforming large amounts of data. Instead, it was implemented as a new set of data pipelines within the data infrastructure, tested to the back-end and product teams' acceptance criteria.
Iteration Speed
By introducing best practices such as CI/CD via GitHub Actions and Dockerisation, testability and testing features via dbt, Airflow, and Meltano, and customised monitoring and alerting integrations with central tools such as Slack and email, the client could develop and deploy quickly and smoothly, and react quickly and precisely when there was an issue.
This allowed not just faster iteration but also more complex and riskier deployments to be made (linking to the capability above). Deployments were now automated, minimising the chance of errors; if issues did arise, alerting notified the team early and the deployment could be rolled back to a stable version, minimising any potential impact.
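As an illustration of the alerting side, an Airflow failure callback that posts to a Slack incoming webhook can look roughly like the sketch below. The webhook URL is a placeholder, and the client's actual integration may have differed in detail.

```python
# Minimal sketch: an Airflow on_failure_callback that posts failed-task
# details to a Slack incoming webhook. The webhook URL is a placeholder.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def notify_slack_on_failure(context):
    """Summarise a failed task instance and post it to Slack."""
    ti = context["task_instance"]
    message = (
        ":red_circle: Airflow task failed\n"
        f"*DAG*: {ti.dag_id}\n"
        f"*Task*: {ti.task_id}\n"
        f"*Run*: {context['logical_date']}\n"
        f"*Logs*: {ti.log_url}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)


# Attach via default_args so every task in a DAG inherits it, e.g.:
# default_args = {"on_failure_callback": notify_slack_on_failure}
```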
Just a few tangible examples of the day-to-day impact that could not have happened before:
- Using dbt, analysts and engineers could deploy the data model and its dependencies that they were working on from their own dev branch into a separate development environment (schema) on Snowflake, while reading from production data. This allowed easy set-up and tear-down of data warehouse environments, whilst working with the realism of production data (and all its edge cases) without actually impacting anything in production.
- Using GitHub Actions, any changes from a merged branch would be tested and deployed in minutes, without missing anything, and to the exact state of the main branch. Previously, deployments of changed code were not tested or regression-tested, and deployments consisted of manually copying, pasting, and re-running scripts. As a result, some changes were missed, changes broke old business logic, and nobody was confident in the data being output. This is no longer the case.
- As mentioned before, just being able to full-text search the data model, orchestration, and EL code to find a feature or data pipeline that you know exists but can't remember where it was implemented is a simple quality-of-life feature, but an extremely valuable one!
- Leveraging dbt really streamlines both the implementation and the running of efficient data models. It removes the need for almost all the boilerplate code around making your data models incremental with merges or inserts, making efficient models easy to implement. dbt's internal dependency resolution then lets you run all the transform dependencies required for a data model in a single command, without having to manually add a newly developed table to the right stage of a transform pipeline (see the sketch below).
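For example, dbt's selector syntax can build a model together with everything upstream of it in one command, e.g. `dbt run --select +fct_orders`. The sketch below shows the same invocation through dbt Core's programmatic API (available from dbt Core 1.5); the model name is a hypothetical placeholder.

```python
# Minimal sketch (dbt Core 1.5+): run a model plus all of its upstream
# dependencies in a single invocation. "fct_orders" is a hypothetical model.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke(["run", "--select", "+fct_orders"])

if not result.success:
    raise RuntimeError("dbt run failed")
```

The same selector would typically be what the orchestrator invokes, so dependency ordering lives in the dbt project rather than in the orchestration code.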
Quality
Through testing, pull-request code reviews, and rapid iteration, the quality of the resulting work is greatly improved, and any bugs or issues that do occur can be rectified quickly. Both the higher quality and the quick reaction when issues do occur lead to more trust in the data from end consumers, and so more value realised by the business from its data.
In addition, having quality-controlled code in one place brings harder-to-quantify benefits: the situational awareness it provides helps ensure quality in future developments.
Closing Remarks
As discussed, current market pressures are increasingly leading companies to want to shift away from SaaS data tooling. We have seen here an example of how this can be done very successfully, both in terms of the migration between infrastructures and the advantages realised once the new infrastructure is in use.
In this case DataZ was heavily involved in all aspects of the migration, but we would be happy to help in the way that is most suitable to your business. From just advising on the architecture of a new solution and the migration, to also working alongside you to implement and migrate to the new solution, upskilling your team along the way – we will work with you to support the specific needs of your business.
- All product names, logos, and brands are the property of their respective owners. Use of these names, logos, and brands does not imply endorsement.