By Tyler Keenan
As more and more organizations have come to rely on streaming data to provide real-time insights, a number of applications have sprung up to handle the myriad technical challenges that streaming data presents. One of the most popular options is Apache Kafka. In this article, we’ll take a brief look at what Kafka is, how it works, and what challenges it’s meant to solve.
What Is Kafka?
Let’s start with the basic question: What is Kafka and how does it work? Kafka started out as a project at LinkedIn to make data ingestion with Hadoop easier. Since then, it’s evolved into what Apache describes as a “distributed streaming platform,” but what does that mean? In short, Kafka is a publish-subscribe messaging system that also processes and stores data as it passes through.
Like other publish-subscribe message brokers, it lets different systems broadcast and receive events without having to know exactly where those events are going or coming from. That said, Kafka has a few key advantages over other message brokers:
- It’s general purpose. Kafka is meant to connect multiple systems, which makes it attractive for large enterprises as well as small startups cobbling together their own applications. It’s equally adept at activity tracking, operational monitoring, log aggregation, and stream processing.
- It takes durability seriously. While all message brokers act as temporary storage for messages in transit, Kafka goes to the trouble of writing data to disk and replicating it across servers. That means data is much less likely to be lost in transit, making Kafka an attractive option for applications that demand both speed and retention, for example, applications subject to regulatory and compliance mandates.
- It enables real-time processing of data streams. The Kafka Streams library is designed for building streaming applications that handle core business functions without adding extra complexity or dependencies. That can be a major advantage for applications that need to process streaming data in real time but don’t need the heavy analytics tools of a Spark- or Flink-style framework.
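The decoupling in the first point is the heart of the publish-subscribe model: producers name a topic, never a recipient. Here’s a minimal in-memory sketch of that idea (an illustrative model, not the Kafka API; the `Broker` class and names are hypothetical):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory publish-subscribe broker -- illustrative only, not Kafka."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # The publisher only names a topic; it never knows who is listening.
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
audit_log, metrics = [], []

# Two independent systems subscribe to the same stream of events.
broker.subscribe("logins", audit_log.append)
broker.subscribe("logins", metrics.append)

# One publish fans out to every subscriber.
broker.publish("logins", {"user": "alice", "action": "login"})
```

After the publish, both `audit_log` and `metrics` hold the login event, even though the publisher knows nothing about either of them.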
How Does It Work?
Before we go any further, let’s cover a little vocabulary. Kafka’s main abstraction is the topic, which represents a stream of records in a specific category. For example, all records of users logging in or out of an application might go to a topic called “Logins.” Any number of subscribers can subscribe to “Logins,” and each will pull new messages from that stream at whatever rate it can handle.
What happens when a subscriber can’t keep up? Kafka holds on to all messages for a configurable retention period, so a lagging consumer can catch up before the data is discarded. This emphasis on durability is one of the main reasons Kafka is attractive to large enterprises where data retention and event logging are major considerations.
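Those two ideas — each consumer reading at its own pace, and messages expiring only after a retention window — can be sketched as an append-only log with per-consumer offsets. This is an illustrative model of the behavior described above, not Kafka’s actual implementation:

```python
import time

class Topic:
    """Toy append-only log with per-consumer offsets and time-based retention.

    Illustrative model only -- real Kafka partitions the log and uses
    absolute offsets rather than list indices.
    """
    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.log = []      # list of (timestamp, record)
        self.offsets = {}  # consumer name -> index of next record to read

    def append(self, record):
        self.log.append((time.time(), record))

    def poll(self, consumer, max_records=10):
        # Each consumer tracks its own position, so a slow consumer
        # simply lags behind instead of holding anyone else up.
        start = self.offsets.get(consumer, 0)
        batch = [record for _, record in self.log[start:start + max_records]]
        self.offsets[consumer] = start + len(batch)
        return batch

    def expire(self):
        # Drop records older than the retention window; consumers that
        # had not yet read a pruned record lose it.
        cutoff = time.time() - self.retention_seconds
        keep_from = 0
        while keep_from < len(self.log) and self.log[keep_from][0] < cutoff:
            keep_from += 1
        self.log = self.log[keep_from:]
        self.offsets = {c: max(0, off - keep_from)
                        for c, off in self.offsets.items()}

topic = Topic(retention_seconds=3600)
topic.append({"user": "alice", "event": "login"})
topic.append({"user": "bob", "event": "logout"})

fast = topic.poll("fast-consumer")                # reads both records
slow = topic.poll("slow-consumer", max_records=1) # reads only the first
later = topic.poll("slow-consumer")               # picks up where it left off
```

The slow consumer’s second `poll` returns the record it missed, because the log retained it and the consumer’s own offset remembered where it stopped.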
Tools like Spark are great for heavy-duty streaming analytics. But what if your needs are more in the realm of simple processing?
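For that lighter-weight case, the kind of stateful transformation Kafka Streams offers can be as simple as a running count per key. A minimal Python sketch of that shape of computation (illustrative only; Kafka Streams itself is a Java library, and `count_by_key` is a hypothetical helper):

```python
from collections import Counter

def count_by_key(events, key):
    """Consume a stream of events and keep a running count per key,
    yielding the updated (key, count) pair after each event."""
    counts = Counter()
    for event in events:
        k = event[key]
        counts[k] += 1
        yield k, counts[k]

logins = [
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "login"},
    {"user": "alice", "action": "login"},
]

updates = list(count_by_key(logins, key="user"))
# updates == [("alice", 1), ("bob", 1), ("alice", 2)]
```

Each incoming event immediately produces an updated count, with no batch job or external analytics cluster involved — which is the niche simple stream processing fills.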