Make data migration easy with Debezium and Apache Kafka

The challenge was ensuring this aggregation occurred in real time.

1. Fetching External Data:

The final step involved fetching additional data from an external service to complete the “God object.” Since we didn’t control this service, we needed a caching mechanism for the aggregated MySQL data. This way, if the database changes were reprocessed, we could avoid redundant table aggregations.

On a data level, the transformation looked like this:
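
For illustration, the aggregation collapses several normalized MySQL rows, plus the external-service response, into one denormalized document per key. The table and field names here are purely illustrative, not the actual schema.

Source rows, spread across normalized MySQL tables:

    { "order":    { "id": 42, "customer_id": 7, "status": "PAID" },
      "customer": { "id": 7,  "name": "Alice" },
      "items":    [ { "order_id": 42, "sku": "A-100", "qty": 2 } ] }

Resulting snapshot, keyed by order id and enriched with the external data:

    { "orderId": 42,
      "status": "PAID",
      "customer": { "id": 7, "name": "Alice" },
      "items": [ { "sku": "A-100", "qty": 2 } ],
      "external": { "shippingEta": "2025-01-15" } }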

You might wonder, “Why are we talking about Apache Kafka for this use case?”

While we already use Apache Kafka extensively as a message broker, its ecosystem offers much more than basic messaging. Here’s what makes Kafka such a powerful tool:

  • Debezium: A Change Data Capture (CDC) solution to stream database changes into Kafka as events.
  • MirrorMaker: Helps mirror topics from one Kafka cluster to another.
  • Kafka Streams: Enables in-flight data transformations and aggregations.
  • Compacted Topics: A built-in topic configuration that lets a topic behave like a cache, retaining the latest event for each unique key (see the example just after this list).

With these tools, Kafka becomes an end-to-end platform for streaming, processing, and storing event-driven data.
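
To make the last point concrete: compaction is a per-topic setting, cleanup.policy=compact, which switches the topic from time-based retention to keeping only the most recent record for each key. A minimal sketch with the stock Kafka CLI (topic name, partition count, and replication factor are illustrative):

    # Create a compacted topic: the log cleaner keeps only the latest
    # event per key and removes older values in the background.
    kafka-topics.sh --create \
      --bootstrap-server localhost:9092 \
      --topic aggregated-entities \
      --partitions 6 \
      --replication-factor 3 \
      --config cleanup.policy=compact \
      --config min.cleanable.dirty.ratio=0.1

A consumer that reads such a topic from the beginning effectively rebuilds the cache: for each key, it eventually sees only the latest snapshot.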

Steps we took

Let’s walk through the steps we followed to achieve our goal.

We realised that in addition to the usual Kafka producer/consumer functionality, we needed:

1. A compacted topic to aggregate changes from the database.

2. Debezium as a CDC solution to stream real-time database changes into Kafka.

However, we decided against using Kafka Streams to aggregate data across MySQL tables and Kafka topics, as it wasn’t necessary for our specific use case.

1. Streaming MySQL modifications with Debezium

The first challenge was streaming real-time changes from MySQL. To choose the right tool, we had to consider two key nuances of our task:

1. The data arrives continuously in real time, so a one-time snapshot alone wouldn't be enough.

2. We needed to ensure every record in MySQL was processed at least once, including historical data.

Debezium proved to be the perfect fit for our requirements.
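
Requirement 2 is covered by Debezium's snapshotting: when the MySQL connector's snapshot.mode is left at its default of initial, the connector first reads a consistent snapshot of the existing tables and only then starts tailing the binlog, so historical rows and live changes alike reach Kafka at least once. In the connector configuration this is a single property:

    "snapshot.mode": "initial"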

According to the official website:

Debezium is an open-source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.

If you already have existing Kafka Connect and Debezium infrastructure, setting up a new Debezium connector is straightforward.

That said, there is one drawback: Troubleshooting can be challenging because issues often arise at the infrastructure level rather than in the application code.

From an architectural perspective, the setup is simple: we reused the existing Debezium and Kafka Connect clusters.

No code required: Setting up the Debezium connector

Creating a new connector didn’t require writing any application code. Instead, we defined a configuration file, which was similar to this example from the official documentation:
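
A sketch of that configuration, closely following the MySQL connector example in the Debezium documentation; hostnames, credentials, and database names are placeholders, and exact property names depend on the Debezium version (topic.prefix, for instance, replaced database.server.name in Debezium 2.0):

    {
      "name": "inventory-connector",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "tasks.max": "1",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "topic.prefix": "dbserver1",
        "database.include.list": "inventory",
        "snapshot.mode": "initial",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schemahistory.inventory"
      }
    }

Registering the connector is a single HTTP POST of this JSON to the Kafka Connect REST API's /connectors endpoint; Kafka Connect then schedules the connector and starts streaming changes from the configured tables.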
