An End to Silos: How to Integrate your Databases with Apache Kafka and CDC


Author: Robin Moffatt

One of the most common questions and topics that I see come up on community resources such as StackOverflow and the Confluent Community Slack group is getting data from a database into Apache Kafka® and vice versa. The database could be SQL Server, Oracle, or DB2, but regardless of the actual technology, the options for doing this are broadly the same. In this blog, we will look at and discuss each option. It may be evident to many readers, but it’s worth restating: since this is Kafka, a streaming platform, we are talking about streaming integration of data, not just bulk static copies of the data.

Databases? How Twentieth Century.

Before we dive into this, remember that you don’t always need a database. If you want to ingest data from a database into Kafka, ask yourself: how did that data get into the database to begin with, and would it be better to instead ingest the data from the source directly into Kafka itself? If you’re streaming data from Kafka into a database, why are you landing it in a database at all?

Sometimes you need a database, but not always. Assuming you’ve decided that you do need a database, let’s look first at how we can get data from a database into Apache Kafka.

The two options to consider are the JDBC connector for Kafka Connect and a log-based Change Data Capture (CDC) tool that integrates with Kafka Connect. If you’re considering doing something different, make sure you understand the reason for doing it, as these are the two standard patterns generally followed.

Kafka Connect

The Kafka Connect API is a core component of Apache Kafka, introduced in version 0.9. It provides scalable and resilient integration between Kafka and other systems.

It is driven purely by configuration files, providing an easy integration point for developers.

Kafka Connect can be run as a clustered process across multiple nodes and handles all the tricky business of integration, including:

  • Scale-out of ingest and egress across nodes for greater throughput
  • Automatic restart and failover of tasks in the event of node failure
  • Automatic offset management
  • Automatic preservation of source data schema
  • Utilisation of data’s schema to create target objects (e.g. Hive tables when streaming to HDFS, RDBMS tables when streaming to a database)
  • Schema evolution and compatibility support (in conjunction with the Confluent Schema Registry)
  • Automatic serialisation and deserialisation of data
  • Single Message Transformations
  • Exactly once processing semantics (on supported connectors)

To use Kafka Connect you simply need a plugin that integrates with the technology that you’re interested in. Confluent Open Source ships with several of these plugins, including JDBC, HDFS, Elasticsearch, and S3. You can find dozens of plugins for Kafka Connect here.
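If you already have a Connect worker running, a quick way to check which plugins it has installed is its REST API. Here’s a minimal sketch in Python, assuming a worker listening on the default port 8083 (the URL is an assumption for your environment):

```python
import requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# List the connector plugins installed on the worker
response = requests.get(f"{CONNECT_URL}/connector-plugins")
response.raise_for_status()

for plugin in response.json():
    # Each entry reports the plugin class, and usually its type and version
    print(plugin["class"], plugin.get("type"), plugin.get("version"))
```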

 

JDBC plugin for Kafka Connect

The Confluent JDBC Connector for Kafka Connect enables you to stream data to and from Kafka and any RDBMS that supports JDBC (which is to say, pretty much any of them). It can stream entire schemas or just individual tables. It can pull the entire contents (bulk) or do an incremental fetch of data that’s changed since the last poll, using a numeric key column, an update timestamp, or both.

It’s a super-easy way to get started with streaming data into Kafka from databases. You can see an example of it in action in this article, streaming data from MySQL into Kafka. Also check out the quickstart and the full documentation. The JDBC Connector also gives you a way to stream data from Kafka into a database—see details and examples in the quickstart here.
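To make this concrete, here is a hedged sketch of registering a JDBC source connector through the Kafka Connect REST API. The connection URL, credentials, table name, and column names are illustrative assumptions to be replaced with your own; the config keys (mode, incrementing.column.name, timestamp.column.name, and so on) are those documented for the Confluent JDBC connector:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# Hypothetical JDBC source connector: incremental fetch from an "orders" table
connector = {
    "name": "jdbc-source-orders",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        # Connection details below are placeholders -- substitute your own
        "connection.url": "jdbc:mysql://mysql:3306/demo?user=connect&password=connect",
        "table.whitelist": "orders",            # table(s) to ingest
        "mode": "timestamp+incrementing",       # incremental fetch using both columns
        "incrementing.column.name": "id",       # numeric key column (assumed)
        "timestamp.column.name": "updated_at",  # update-timestamp column (assumed)
        "topic.prefix": "mysql-",               # topics are named <prefix><table>
        "poll.interval.ms": "10000",            # how often to poll the database
    },
}

response = requests.post(f"{CONNECT_URL}/connectors", json=connector)
response.raise_for_status()
print(response.json())
```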

Because the JDBC Connector uses the Kafka Connect API, it has several great features when it comes to streaming data from databases into Kafka:

  • Configuration-only interface for developers—no coding!
  • Schemas
    - The source database schema is preserved for use downstream in consuming Kafka applications
    - When streaming data from Kafka to a database, the connector will use the schema to execute the necessary DDL on the target to create the destination objects before streaming data to them
  • Kafka Connect can scale out the number of tasks to increase the throughput. For example, it can ingest data from multiple tables in parallel. Just make sure you don’t flatten your database with too many concurrent requests!
  • Database-specific dialects (in progress)

So the JDBC Connector is a great place to start: it is good for prototyping, for streaming smaller tables into Kafka, and for streaming Kafka topics into a relational database.
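As a companion sketch, the reverse direction looks much the same: a JDBC sink connector config that streams a Kafka topic into a relational database, with auto.create using the record’s schema to create the target table (as noted in the schema bullet above). Again, the topic, table, and connection details are assumptions:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# Hypothetical JDBC sink connector: stream a Kafka topic into a database table
connector = {
    "name": "jdbc-sink-orders",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        # Connection details below are placeholders -- substitute your own
        "connection.url": "jdbc:postgresql://postgres:5432/analytics?user=connect&password=connect",
        "topics": "mysql-orders",   # Kafka topic(s) to stream into the database
        "auto.create": "true",      # use the record schema to create the target table
        "insert.mode": "upsert",    # upsert rather than plain insert
        "pk.mode": "record_value",  # take the key from a field in the record value
        "pk.fields": "id",          # assumed key field
    },
}

response = requests.post(f"{CONNECT_URL}/connectors", json=connector)
response.raise_for_status()
```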

Log-based Change-Data-Capture (CDC) tools and Kafka

If you want to go “the whole hog” with integrating your database with Kafka, then log-based Change-Data-Capture (CDC) is the route to take. Done properly, CDC enables you to stream every single event from a database into Kafka. Broadly speaking, relational databases use a transaction log (also known as a binlog or redo log) to which every event in the database is written. Update a row, insert a row, delete a row – it all goes to the database’s transaction log. CDC tools work by reading this transaction log to extract, at very low latency and with low impact, the events that are occurring on the database (or on a schema/table within it).

Many CDC tools exist, serving a broad range of sources. Some specialise in broad coverage of source systems, others in just specific ones. The common factor uniting most of them is close integration with Apache Kafka and Confluent Platform. Being able to stream your data from a database not just into Kafka, but to do so with support for things such as the preservation of schemas through the Schema Registry, is a defining feature of these CDC tools. Some are built using the Kafka Connect framework itself (and tend to offer a richer degree of integration), whilst others use the Kafka Producer API in conjunction with support for the Schema Registry and so on.
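For the CDC tools that are built on Kafka Connect, the registration step looks familiar. The sketch below uses Debezium’s MySQL connector as an example; the hostnames, credentials, and server/topic names are illustrative assumptions, and the exact config keys vary by connector and version:

```python
import requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# Hypothetical log-based CDC connector (Debezium MySQL) reading the binlog
connector = {
    "name": "debezium-mysql-demo",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        # All connection details below are placeholders -- substitute your own
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "42",      # unique ID for the binlog reader
        "database.server.name": "demo",  # logical name, used as the topic prefix
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "dbhistory.demo",
    },
}

response = requests.post(f"{CONNECT_URL}/connectors", json=connector)
response.raise_for_status()
```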

A number of CDC tools offer vendor support and integration with Confluent Platform (as of March 2018).

So what’s the catch with CDC? There isn’t one, per se. CDC is low impact, low latency, and gives you full data fidelity. But there are a few reasons why you may not use a CDC tool when integrating a database with Kafka, at least to start with:

  • CDC tools are more complex than the straightforward JDBC connector.
    - This complexity may well be worth it given your requirements, but be aware that you are adding more moving parts to your overall solution.
    - For rapid prototyping CDC can be overkill (and the JDBC connector fits well here).
    - CDC is more complex to set up at first, because of the nature of the integration with the relatively low-level log files.
    - It often requires administration access to the database for initial setup, which can be a speedbump to rapid prototyping.
  • Cost considerations: many CDC tools are commercial offerings (typically those that work with proprietary sources).

Conclusion

My steer on the CDC vs JDBC debate is that JDBC is great for prototyping and fine for low-volume workloads. But if you are using the JDBC connector, you should consider:

  • It doesn’t give you true CDC (it can’t capture delete records, nor provide before/after record versions)
  • There is latency in detecting new events
  • The impact of polling the source database continually (and balancing this with the desired latency)
  • Unless you’re doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don’t own the schema, this becomes a problem.

Also bear in mind what your requirements are for the data. Do you simply want to stream data from the database so that you can use it in a traditional analytics/ETL sense? Or are you building event-driven applications? The former gives you more leeway on how you implement it. The latter almost certainly necessitates a log-based CDC approach, because you need not only every event (rather than just the state at an arbitrary point in time) but also the delete events.

For analytics and ETL, streaming data in from a database is a powerful way to integrate existing systems with Kafka. The great thing is that because Apache Kafka decouples sources and targets, you can easily switch one out for another. For example, an iterative development could look like:

  1. Stream data from the database into Kafka using the JDBC Connector – prove the value of the data being in Kafka in realtime to drive applications and requirements. Typically done as a prototype/sandbox/skunk-works project.
  2. Once proven out, more formally adopt the source of data into Kafka, and consider the use of CDC. Existing consuming applications and targets stay untouched, because the data is still the same (it is just being streamed into Kafka using a different method).

Whichever way you decide to go, hopefully we have shown you that integrating your database with Kafka doesn’t have to be a daunting task.