Friday, September 20, 2019

A pattern for synchronizing data from a legacy system to microservices

Abstract

A recurring need in the teams I work with seems to be to move and continually synchronize data from a legacy system - typically a monolith - to a new microservice based replacement. In this post I will outline a data synchronization solution that
  • Allows the microservice side to catch up on historical data
  • Synchronizes new data to the microservices as it is produced in the legacy system
  • Lets us control the speed of synchronization, and as a result the load on the legacy system
  • Is repeatable, so the same data can easily be synchronized again if need be
In the interest of keeping this post short I will only show a high level view of the solution.

Microservices need legacy data

When replacing an existing legacy system with a modern microservice based system, teams (sensibly) tend to do so in steps, following the strangler pattern. This means that for a while both systems run in production. Some functionality is handled in the legacy system and some in the microservices. At first very little is handled in the microservices, but gradually more and more functionality is handled there. To support those small early steps implemented in the microservices, data from the legacy side is often needed. For instance, a team in the process of moving a back office system to a microservice architecture might want to implement a new sales dashboard with microservices. To do so they will need order data and possibly other data too, but let's just focus on the order data for now. Orders are still being taken in the legacy system, so order data is still being produced on the legacy side. But the new sales dashboard needs both historical orders and new orders to work correctly.

To make things more interesting, let's say the legacy system and the microservices are in different data centers - maybe the system is moving from on prem to the cloud as part of the microservice effort.

Solution: A data pump 

A solution to the situation above is to implement a data pump that sends any updates to relevant data in the legacy database over to the microservices. In the example that means new orders as well as changes to orders.

This solution has two components: a data pump, which is deployed in the legacy environment, and a data sink, which is deployed in the microservices environment. The data pump tracks which data has already been sent over to the microservices and sends new data over as it is produced in the legacy system. The data sink simply receives the data from the pump and posts it onto a queue. This enables any and all microservices interested in the data - e.g. new or updated orders - to subscribe to such messages on the queue and to build up their models of that data in their own databases.
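To make the shape of this concrete, here is a minimal sketch (in Python) of what such a pump could look like. Everything in it is illustrative rather than taken from a real system: it assumes a hypothetical orders table with a monotonically increasing change-tracking column called row_version, a small bookkeeping table called pump_tracking holding the high-water mark of what has been sent, and an HTTP endpoint exposed by the data sink. A production pump would of course need error handling, logging and so on.

import time
import sqlite3
import requests

SINK_URL = "https://sink.example.com/orders"  # hypothetical endpoint exposed by the data sink
BATCH_SIZE = 100

def load_high_water_mark(conn):
    # The pump tracks what it has already sent in a small bookkeeping table.
    row = conn.execute("SELECT last_row_version FROM pump_tracking").fetchone()
    return row[0] if row else 0

def save_high_water_mark(conn, value):
    conn.execute("UPDATE pump_tracking SET last_row_version = ?", (value,))
    conn.commit()

def pump_once(conn):
    # Read the next batch of new or changed orders after the high-water mark.
    mark = load_high_water_mark(conn)
    rows = conn.execute(
        "SELECT id, customer_id, total, row_version FROM orders "
        "WHERE row_version > ? ORDER BY row_version LIMIT ?",
        (mark, BATCH_SIZE)).fetchall()
    if not rows:
        return 0
    payload = [{"id": r[0], "customerId": r[1], "total": r[2]} for r in rows]
    # Send the batch to the data sink in the microservices environment.
    requests.post(SINK_URL, json=payload, timeout=30).raise_for_status()
    # Only advance the mark once the sink has acknowledged the batch.
    save_high_water_mark(conn, rows[-1][3])
    return len(rows)

def run_pump(db_path, poll_interval_seconds=5):
    conn = sqlite3.connect(db_path)
    while True:
        if pump_once(conn) == 0:
            time.sleep(poll_interval_seconds)  # caught up - wait for new legacy data

The data sink is even simpler. A sketch, assuming Flask for the HTTP endpoint and RabbitMQ (via pika) as the queue - any web framework and message broker would do:

import json
import pika
from flask import Flask, request

app = Flask(__name__)

def publish(orders):
    # Put each received order on a queue that interested microservices subscribe to.
    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="legacy-orders", durable=True)
    for order in orders:
        channel.basic_publish(exchange="", routing_key="legacy-orders",
                              body=json.dumps(order))
    connection.close()

@app.route("/orders", methods=["POST"])
def receive_orders():
    publish(request.get_json())
    return "", 202

if __name__ == "__main__":
    app.run(port=5000)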

With the described solution in place any historical data can be sent over to the microservices. That may take a while, especially if the legacy database cannot take too much additional load. In such cases the data pump can be throttled. Once the pump has caught up sending over historical data it will continue to send over new data in near real time.
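Throttling can be as simple as capping how many batches the pump sends per minute. A minimal sketch, reusing the hypothetical pump_once function from the sketch above:

import time

def run_throttled_pump(conn, max_batches_per_minute=30):
    delay = 60.0 / max_batches_per_minute
    while True:
        sent = pump_once(conn)
        # Pause between batches so the legacy database never sees more load than
        # it can handle, whether catching up on history or trailing live changes.
        time.sleep(delay if sent else 5)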

If we put a bit of up front design into the data pump we can also support restarting the process - the pump tracks what it has already sent, so resetting that tracking will make it start over. That's sometimes useful if we e.g. don't get the receiving microservice right on the first attempt, or if we introduce new microservices that also need the data.
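With the high-water mark approach sketched above, restarting the synchronization is just a matter of resetting the bookkeeping - again assuming the hypothetical pump_tracking table:

def reset_pump(conn):
    # Resetting the high-water mark makes the pump send everything again,
    # starting from the oldest tracked change.
    conn.execute("UPDATE pump_tracking SET last_row_version = 0")
    conn.commit()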

This is a solution I have seen used with success in several of my clients' systems, and I think it is applicable in many more systems too.