<!--
 ~ Licensed to the Apache Software Foundation (ASF) under one or more
 ~ contributor license agreements. See the NOTICE file distributed with
 ~ this work for additional information regarding copyright ownership.
 ~ The ASF licenses this file to You under the Apache License, Version 2.0
 ~ (the "License"); you may not use this file except in compliance with
 ~ the License. You may obtain a copy of the License at
 ~
 ~    http://www.apache.org/licenses/LICENSE-2.0
 ~
 ~ Unless required by applicable law or agreed to in writing, software
 ~ distributed under the License is distributed on an "AS IS" BASIS,
 ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 ~ See the License for the specific language governing permissions and
 ~ limitations under the License.
 ~-->

<script><!--#include virtual="js/templateData.js" --></script>

<pre id="streams-template" type="text/x-handlebars-template"> |
|
    <h1>Streams</h1>

    <ol class="toc">
        <li>
            <a href="#streams_overview">Overview</a>
        </li>
        <li>
            <a href="#streams_concepts">Core concepts</a>
        </li>
        <li>
            <a href="#streams_developer">Developer guide</a>
            <ul>
                <li><a href="#streams_concepts">Core concepts</a>
                <li><a href="#streams_processor">Low-level processor API</a>
                <li><a href="#streams_dsl">High-level streams DSL</a>
            </ul>
        </li>
        <li>
            <a href="#streams_upgrade">Upgrade guide and API changes</a>
        </li>
    </ol>

<h2><a id="streams_overview" href="#streams_overview">Overview</a></h2> |
|
|
|
<p> |
|
Kafka Streams is a client library for processing and analyzing data stored in Kafka and either write the resulting data back to Kafka or send the final output to an external system. It builds upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple yet efficient management of application state. |
|
Kafka Streams has a <b>low barrier to entry</b>: You can quickly write and run a small-scale proof-of-concept on a single machine; and you only need to run additional instances of your application on multiple machines to scale up to high-volume production workloads. Kafka Streams transparently handles the load balancing of multiple instances of the same application by leveraging Kafka's parallelism model. |
|
</p> |
|
<p> |
|
Some highlights of Kafka Streams: |
|
</p> |
|
|
|
<ul> |
|
<li>Designed as a <b>simple and lightweight client library</b>, which can be easily embedded in any Java application and integrated with any existing packaging, deployment and operational tools that users have for their streaming applications.</li> |
|
<li>Has <b>no external dependencies on systems other than Apache Kafka itself</b> as the internal messaging layer; notably, it uses Kafka's partitioning model to horizontally scale processing while maintaining strong ordering guarantees.</li> |
|
<li>Supports <b>fault-tolerant local state</b>, which enables very fast and efficient stateful operations like joins and aggregations.</li> |
|
<li>Employs <b>one-record-at-a-time processing</b> to achieve millisecond processing latency, and supports <b>event-time based windowing operations</b> with late arrival of records.</li> |
|
<li>Offers necessary stream processing primitives, along with a <b>high-level Streams DSL</b> and a <b>low-level Processor API</b>.</li> |
|
|
|
</ul> |
|
<br> |
|
|
|
<h2><a id="streams_concepts" href="#streams_concepts">Overview</a></h2> |
|
|
|
<p> |
|
We first summarize the key concepts of Kafka Streams. |
|
</p> |
|
|
|
<h5><a id="streams_topology" href="#streams_topology">Stream Processing Topology</a></h5> |
|
|
|
<ul> |
|
<li>A <b>stream</b> is the most important abstraction provided by Kafka Streams: it represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a <b>data record</b> is defined as a key-value pair.</li> |
|
<li>A <b>stream processing application</b> is any program that makes use of the Kafka Streams library. It defines its computational logic through one or more <b>processor topologies</b>, where a processor topology is a graph of stream processors (nodes) that are connected by streams (edges).</li> |
|
<li>A <b>stream processor</b> is a node in the processor topology; it represents a processing step to transform data in streams by receiving one input record at a time from its upstream processors in the topology, applying its operation to it, and may subsequently produce one or more output records to its downstream processors. </li> |
|
</ul> |
|
<img class="centered" src="/{{version}}/images/streams-concepts-topology.jpg"> |
|
|
|
|
|
<p> |
|
Kafka Streams offers two ways to define the stream processing topology: the <a href="#streams_dsl"><b>Kafka Streams DSL</b></a> provides |
|
the most common data transformation operations such as <code>map</code>, <code>filter</code>, <code>join</code> and <code>aggregations</code> out of the box; the lower-level <a href="#streams_processor"><b>Processor API</b></a> allows |
|
developers define and connect custom processors as well as to interact with <a href="#streams_state">state stores</a>. |
|
</p> |
|
|
|
<h5><a id="streams_time" href="#streams_time">Time</a></h5> |
|
|
|
<p> |
|
A critical aspect in stream processing is the notion of <b>time</b>, and how it is modeled and integrated. |
|
For example, some operations such as <b>windowing</b> are defined based on time boundaries. |
|
</p> |
|
<p> |
|
Common notions of time in streams are: |
|
</p> |
|
|
|
<ul> |
|
<li><b>Event time</b> - The point in time when an event or data record occurred, i.e. was originally created "at the source". <b>Example:</b> If the event is a geo-location change reported by a GPS sensor in a car, then the associated event-time would be the time when the GPS sensor captured the location change.</li> |
|
<li><b>Processing time</b> - The point in time when the event or data record happens to be processed by the stream processing application, i.e. when the record is being consumed. The processing time may be milliseconds, hours, or days etc. later than the original event time. <b>Example:</b> Imagine an analytics application that reads and processes the geo-location data reported from car sensors to present it to a fleet management dashboard. Here, processing-time in the analytics application might be milliseconds or seconds (e.g. for real-time pipelines based on Apache Kafka and Kafka Streams) or hours (e.g. for batch pipelines based on Apache Hadoop or Apache Spark) after event-time.</li> |
|
<li><b>Ingestion time</b> - The point in time when an event or data record is stored in a topic partition by a Kafka broker. The difference to event time is that this ingestion timestamp is generated when the record is appended to the target topic by the Kafka broker, not when the record is created "at the source". The difference to processing time is that processing time is when the stream processing application processes the record. <b>For example,</b> if a record is never processed, there is no notion of processing time for it, but it still has an ingestion time. |
|
</ul> |
|
<p> |
|
The choice between event-time and ingestion-time is actually done through the configuration of Kafka (not Kafka Streams): From Kafka 0.10.x onwards, timestamps are automatically embedded into Kafka messages. Depending on Kafka's configuration these timestamps represent event-time or ingestion-time. The respective Kafka configuration setting can be specified on the broker level or per topic. The default timestamp extractor in Kafka Streams will retrieve these embedded timestamps as-is. Hence, the effective time semantics of your application depend on the effective Kafka configuration for these embedded timestamps. |
|
</p> |
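    <p>
    For illustration, here is a minimal sketch of the relevant Kafka-side settings that control the embedded timestamp type (assuming a 0.10.x broker; consult the broker and topic configuration reference of your version for the authoritative list):
    </p>

    <pre>
# broker-level default (server.properties), applied to all topics unless overridden:
#   CreateTime    -> keep the producer-provided timestamp (event-time)
#   LogAppendTime -> overwrite with the broker's append time (ingestion-time)
log.message.timestamp.type=CreateTime

# per-topic override (topic-level config):
message.timestamp.type=LogAppendTime
    </pre>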
    <p>
    Kafka Streams assigns a <b>timestamp</b> to every data record
    via the <code>TimestampExtractor</code> interface.
    Concrete implementations of this interface may retrieve or compute timestamps based on the actual contents of data records, such as an embedded timestamp field,
    to provide event-time semantics, or use any other approach such as returning the current wall-clock time at the time of processing,
    thereby yielding processing-time semantics to stream processing applications.
    Developers can thus enforce different notions of time depending on their business needs. For example,
    per-record timestamps describe the progress of a stream with regard to time (although records may be out-of-order within the stream) and
    are leveraged by time-dependent operations such as joins.
    </p>
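    <p>
    As a minimal sketch of a custom extractor that reads an event timestamp embedded in the record value (the <code>PageView</code> value type and its <code>getTimestamp()</code> accessor are hypothetical and only serve as an illustration):
    </p>

    <pre>
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class PageViewTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof PageView) {
            // event-time semantics: use the timestamp carried inside the (hypothetical) payload
            return ((PageView) value).getTimestamp();
        }
        // fall back to the timestamp embedded in the Kafka message itself
        return record.timestamp();
    }
}

// to use it, register the extractor in the application configuration, e.g.:
// props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, PageViewTimestampExtractor.class.getName());
    </pre>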
    <p>
    Finally, whenever a Kafka Streams application writes records to Kafka, it will also assign timestamps to these new records. The way the timestamps are assigned depends on the context:
    </p>
    <ul>
        <li> When new output records are generated by processing some input record, for example, via <code>context.forward()</code> triggered in the <code>process()</code> function call, output record timestamps are inherited directly from the input record timestamps.</li>
        <li> When new output records are generated via periodic functions such as <code>punctuate()</code>, the output record timestamp is defined as the current internal time (obtained through <code>context.timestamp()</code>) of the stream task.</li>
        <li> For aggregations, the timestamp of a resulting aggregate update record will be that of the latest-arriving input record that triggered the update.</li>
    </ul>

    <h5><a id="streams_state" href="#streams_state">States</a></h5>

    <p>
    Some stream processing applications don't require state, which means the processing of a message is independent of
    the processing of all other messages.
    However, being able to maintain state opens up many possibilities for sophisticated stream processing applications: you
    can join input streams, or group and aggregate data records. Many such stateful operators are provided by the <a href="#streams_dsl"><b>Kafka Streams DSL</b></a>.
    </p>
    <p>
    Kafka Streams provides so-called <b>state stores</b>, which can be used by stream processing applications to store and query data.
    This is an important capability when implementing stateful operations.
    Every task in Kafka Streams embeds one or more state stores that can be accessed via APIs to store and query data required for processing.
    These state stores can be a persistent key-value store, an in-memory hashmap, or another convenient data structure.
    Kafka Streams offers fault-tolerance and automatic recovery for local state stores.
    </p>
    <p>
    Kafka Streams allows direct read-only queries of the state stores by methods, threads, processes or applications external to the stream processing application that created the state stores. This is provided through a feature called <b>Interactive Queries</b>. All stores are named, and Interactive Queries exposes only the read operations of the underlying implementation.
    </p>
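    <p>
    As a minimal sketch of Interactive Queries (assuming the application has already been started and has materialized a key-value store named "Counts", as in the processor example further below), a caller can query the local state through the running <code>KafkaStreams</code> instance:
    </p>

    <pre>
// 'streams' is the running KafkaStreams instance of this application (see the developer guide below)
ReadOnlyKeyValueStore<String, Integer> counts =
    streams.store("Counts", QueryableStoreTypes.<String, Integer>keyValueStore());

// read-only point lookup against the local state of this application instance
Integer count = counts.get("alice");
    </pre>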
    <br>

    <h2><a id="streams_developer" href="#streams_developer">Developer Guide</a></h2>

    <p>
    There is a <a href="#quickstart_kafkastreams">quickstart</a> example that shows how to run a stream processing program coded with the Kafka Streams library.
    This section focuses on how to write, configure, and execute a Kafka Streams application.
    </p>

    <p>
    As we have mentioned above, the computational logic of a Kafka Streams application is defined as a <a href="#streams_topology">processor topology</a>.
    Currently Kafka Streams provides two sets of APIs to define the processor topology, which will be described in the subsequent sections.
    </p>

    <h4><a id="streams_processor" href="#streams_processor">Low-Level Processor API</a></h4>

    <h5><a id="streams_processor_process" href="#streams_processor_process">Processor</a></h5>

    <p>
    Developers can define their customized processing logic by implementing the <code>Processor</code> interface, which
    provides the <code>process</code> and <code>punctuate</code> methods. The <code>process</code> method is invoked on each
    received record, and the <code>punctuate</code> method is invoked periodically based on elapsed time.
    In addition, the processor can keep the current <code>ProcessorContext</code> instance, which is initialized in the
    <code>init</code> method, and use the context to schedule the punctuation period (<code>context().schedule</code>), to
    forward the modified / new key-value pair to downstream processors (<code>context().forward</code>), to commit the current
    processing progress (<code>context().commit</code>), and so on.
    </p>

    <pre>
public class MyProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Integer> kvStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.context.schedule(1000);
        this.kvStore = (KeyValueStore<String, Integer>) context.getStateStore("Counts");
    }

    @Override
    public void process(String dummy, String line) {
        String[] words = line.toLowerCase().split(" ");

        for (String word : words) {
            Integer oldValue = this.kvStore.get(word);

            if (oldValue == null) {
                this.kvStore.put(word, 1);
            } else {
                this.kvStore.put(word, oldValue + 1);
            }
        }
    }

    @Override
    public void punctuate(long timestamp) {
        KeyValueIterator<String, Integer> iter = this.kvStore.all();

        while (iter.hasNext()) {
            KeyValue<String, Integer> entry = iter.next();
            context.forward(entry.key, entry.value.toString());
        }

        iter.close();
        context.commit();
    }

    @Override
    public void close() {
        this.kvStore.close();
    }
}
    </pre>

    <p>
    In the above implementation, the following actions are performed:
    </p>

    <ul>
        <li>In the <code>init</code> method, schedule the punctuation every 1 second and retrieve the local state store by its name "Counts".</li>
        <li>In the <code>process</code> method, upon each received record, split the value string into words, and update their counts in the state store (we will talk about this feature later in this section).</li>
        <li>In the <code>punctuate</code> method, iterate over the local state store, send the aggregated counts to the downstream processor, and commit the current stream state.</li>
    </ul>

<h5><a id="streams_processor_topology" href="#streams_processor_topology">Processor Topology</a></h5> |
|
|
|
<p> |
|
With the customized processors defined in the Processor API, developers can use the <code>TopologyBuilder</code> to build a processor topology |
|
by connecting these processors together: |
|
|
|
<pre> |
|
TopologyBuilder builder = new TopologyBuilder(); |
|
|
|
builder.addSource("SOURCE", "src-topic") |
|
|
|
.addProcessor("PROCESS1", MyProcessor1::new /* the ProcessorSupplier that can generate MyProcessor1 */, "SOURCE") |
|
.addProcessor("PROCESS2", MyProcessor2::new /* the ProcessorSupplier that can generate MyProcessor2 */, "PROCESS1") |
|
.addProcessor("PROCESS3", MyProcessor3::new /* the ProcessorSupplier that can generate MyProcessor3 */, "PROCESS1") |
|
|
|
.addSink("SINK1", "sink-topic1", "PROCESS1") |
|
.addSink("SINK2", "sink-topic2", "PROCESS2") |
|
.addSink("SINK3", "sink-topic3", "PROCESS3"); |
|
</pre> |
|
|
|
There are several steps in the above code to build the topology, and here is a quick walk through: |
|
|
|
<ul> |
|
<li>First of all a source node named "SOURCE" is added to the topology using the <code>addSource</code> method, with one Kafka topic "src-topic" fed to it.</li> |
|
<li>Three processor nodes are then added using the <code>addProcessor</code> method; here the first processor is a child of the "SOURCE" node, but is the parent of the other two processors.</li> |
|
<li>Finally three sink nodes are added to complete the topology using the <code>addSink</code> method, each piping from a different parent processor node and writing to a separate topic.</li> |
|
</ul> |
|
</p> |
|
|
|
<h5><a id="streams_processor_statestore" href="#streams_processor_statestore">Local State Store</a></h5> |
|
|
|
<p> |
|
Note that the Processor API is not limited to only accessing the current records as they arrive, but can also maintain local state stores |
|
that keep recently arrived records to use in stateful processing operations such as aggregation or windowed joins. |
|
To take advantage of this local states, developers can use the <code>TopologyBuilder.addStateStore</code> method when building the |
|
processor topology to create the local state and associate it with the processor nodes that needs to access it; or they can connect a created |
|
local state store with the existing processor nodes through <code>TopologyBuilder.connectProcessorAndStateStores</code>. |
|
|
|
<pre> |
|
TopologyBuilder builder = new TopologyBuilder(); |
|
|
|
builder.addSource("SOURCE", "src-topic") |
|
|
|
.addProcessor("PROCESS1", MyProcessor1::new, "SOURCE") |
|
// create the in-memory state store "COUNTS" associated with processor "PROCESS1" |
|
.addStateStore(Stores.create("COUNTS").withStringKeys().withStringValues().inMemory().build(), "PROCESS1") |
|
.addProcessor("PROCESS2", MyProcessor3::new /* the ProcessorSupplier that can generate MyProcessor3 */, "PROCESS1") |
|
.addProcessor("PROCESS3", MyProcessor3::new /* the ProcessorSupplier that can generate MyProcessor3 */, "PROCESS1") |
|
|
|
// connect the state store "COUNTS" with processor "PROCESS2" |
|
.connectProcessorAndStateStores("PROCESS2", "COUNTS"); |
|
|
|
.addSink("SINK1", "sink-topic1", "PROCESS1") |
|
.addSink("SINK2", "sink-topic2", "PROCESS2") |
|
.addSink("SINK3", "sink-topic3", "PROCESS3"); |
|
</pre> |
|
|
|
</p> |
|
|
|
    In the next section we present another way to build the processor topology: the Kafka Streams DSL.

    <h4><a id="streams_dsl" href="#streams_dsl">High-Level Streams DSL</a></h4>

    To build a processor topology using the Streams DSL, developers can use the <code>KStreamBuilder</code> class, which extends <code>TopologyBuilder</code>.
    A simple example is included with the source code for Kafka in the <code>streams/examples</code> package. The rest of this section will walk
    through some code to demonstrate the key steps in creating a topology using the Streams DSL, but we recommend that developers read the full example source
    code for details.

    <h5><a id="streams_duality" href="#streams_duality">Duality of Streams and Tables</a></h5>

    <p>
    Before we discuss concepts such as aggregations in Kafka Streams we must first introduce tables, and most importantly the relationship between tables and streams:
    the so-called <a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">stream-table duality</a>.
    Essentially, this duality means that a stream can be viewed as a table, and vice versa. Kafka's log compaction feature, for example, exploits this duality.
    </p>

    <p>
    A simple form of a table is a collection of key-value pairs, also called a map or associative array. Such a table may look as follows:
    </p>
    <img class="centered" src="/{{version}}/images/streams-table-duality-01.png">

    The <b>stream-table duality</b> describes the close relationship between streams and tables.
    <ul>
        <li><b>Stream as Table</b>: A stream can be considered a changelog of a table, where each data record in the stream captures a state change of the table. A stream is thus a table in disguise, and it can be easily turned into a "real" table by replaying the changelog from beginning to end to reconstruct the table. Similarly, in a more general analogy, aggregating data records in a stream - such as computing the total number of pageviews by user from a stream of pageview events - will return a table (here with the key and the value being the user and its corresponding pageview count, respectively).</li>
        <li><b>Table as Stream</b>: A table can be considered a snapshot, at a point in time, of the latest value for each key in a stream (a stream's data records are key-value pairs). A table is thus a stream in disguise, and it can be easily turned into a "real" stream by iterating over each key-value entry in the table.</li>
    </ul>

    <p>
    Let's illustrate this with an example. Imagine a table that tracks the total number of pageviews by user (first column of diagram below). Over time, whenever a new pageview event is processed, the state of the table is updated accordingly. Here, the state changes between different points in time - and different revisions of the table - can be represented as a changelog stream (second column).
    </p>
    <img class="centered" src="/{{version}}/images/streams-table-duality-02.png">

    <p>
    Interestingly, because of the stream-table duality, the same stream can be used to reconstruct the original table (third column):
    </p>
    <img class="centered" src="/{{version}}/images/streams-table-duality-03.png">

    <p>
    The same mechanism is used, for example, to replicate databases via change data capture (CDC) and, within Kafka Streams, to replicate its so-called state stores across machines for fault-tolerance.
    The stream-table duality is such an important concept that Kafka Streams models it explicitly via the <a href="#streams_kstream_ktable">KStream and KTable</a> interfaces, which we describe in the next sections.
    </p>

    <h5><a id="streams_kstream_ktable" href="#streams_kstream_ktable">KStream and KTable</a></h5>
    The DSL uses two main abstractions. A <b>KStream</b> is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded data set.
    A <b>KTable</b> is an abstraction of a changelog stream, where each data record represents an update. More precisely, the value in a data record is considered to be an update of the last value for the same record key,
    if any (if a corresponding key doesn't exist yet, the update will be considered a create). To illustrate the difference between KStreams and KTables, let's imagine the following two data records are being sent to the stream:

    <pre>
    ("alice", 1) --> ("alice", 3)
    </pre>

    If these records were a KStream and the stream processing application were to sum the values, it would return <code>4</code>. If these records were a KTable, it would return <code>3</code>, since the last record would be considered an update.

    <h5><a id="streams_dsl_source" href="#streams_dsl_source">Create Source Streams from Kafka</a></h5>

    <p>
    Either a <b>record stream</b> (defined as <code>KStream</code>) or a <b>changelog stream</b> (defined as <code>KTable</code>)
    can be created as a source stream from one or more Kafka topics (for <code>KTable</code> you can only create the source stream
    from a single topic).
    </p>

    <pre>
KStreamBuilder builder = new KStreamBuilder();

KStream<String, GenericRecord> source1 = builder.stream("topic1", "topic2");
KTable<String, GenericRecord> source2 = builder.table("topic3", "stateStoreName");
    </pre>

    <h5><a id="streams_dsl_windowing" href="#streams_dsl_windowing">Windowing a stream</a></h5>
    A stream processor may need to divide data records into time buckets, i.e. to <b>window</b> the stream by time. This is usually needed for join and aggregation operations. Kafka Streams currently defines the following types of windows (a short sketch follows the list):
    <ul>
        <li><b>Hopping time windows</b> are windows based on time intervals. They model fixed-sized, (possibly) overlapping windows. A hopping window is defined by two properties: the window's size and its advance interval (aka "hop"). The advance interval specifies by how much a window moves forward relative to the previous one. For example, you can configure a hopping window with a size of 5 minutes and an advance interval of 1 minute. Since hopping windows can overlap, a data record may belong to more than one such window.</li>
        <li><b>Tumbling time windows</b> are a special case of hopping time windows and, like the latter, are windows based on time intervals. They model fixed-size, non-overlapping, gap-less windows. A tumbling window is defined by a single property: the window's size. A tumbling window is a hopping window whose window size is equal to its advance interval. Since tumbling windows never overlap, a data record will belong to one and only one window.</li>
        <li><b>Sliding windows</b> model a fixed-size window that slides continuously over the time axis; here, two data records are said to be included in the same window if the difference of their timestamps is within the window size. Thus, sliding windows are not aligned to the epoch, but to the data record timestamps. In Kafka Streams, sliding windows are used only for join operations, and can be specified through the <code>JoinWindows</code> class.</li>
    </ul>
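    <p>
    As a rough sketch of the first two window types (reusing the <code>TimeWindows</code> factory style from the aggregation example later in this section; the window names are placeholders and exact factory signatures may differ between Kafka versions):
    </p>

    <pre>
// a hopping window with a size of 5 minutes and an advance interval ("hop") of 1 minute;
// a record may fall into several overlapping windows
TimeWindows hopping = TimeWindows.of("hopping-counts", 5 * 60 * 1000L).advanceBy(60 * 1000L);

// a tumbling window is a hopping window whose advance interval equals its size,
// so omitting advanceBy() yields non-overlapping, gap-less windows
TimeWindows tumbling = TimeWindows.of("tumbling-counts", 5 * 60 * 1000L);
    </pre>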

    <p>
    In the Kafka Streams DSL users can specify a <b>retention period</b> for the window. This allows Kafka Streams to retain old window buckets for a period of time in order to wait for the late arrival of records whose timestamps fall within the window interval.
    If a record arrives after the retention period has passed, the record cannot be processed and is dropped.
    </p>

    <p>
    Late-arriving records are always possible in real-time data streams. However, how late records are handled depends on the effective <a href="#streams_time">time semantics</a>. Using processing-time, the semantics are "when the data is being processed",
    which means that the notion of late records is not applicable as, by definition, no record can be late. Hence, late-arriving records can only really be considered as such (i.e. as arriving "late") for event-time or ingestion-time semantics. In both cases,
    Kafka Streams is able to properly handle late-arriving records.
    </p>

    <h5><a id="streams_dsl_joins" href="#streams_dsl_joins">Join multiple streams</a></h5>
    A <b>join</b> operation merges two streams based on the keys of their data records, and yields a new stream. A join over record streams usually needs to be performed on a windowing basis because otherwise the number of records that must be maintained for performing the join may grow indefinitely. In Kafka Streams, you may perform the following join operations:
    <ul>
        <li><b>KStream-to-KStream Joins</b> are always windowed joins, since otherwise the memory and state required to compute the join would grow infinitely in size. Here, a newly received record from one of the streams is joined with the other stream's records within the specified window interval to produce one result for each matching pair, based on a user-provided <code>ValueJoiner</code>. A new <code>KStream</code> instance representing the result stream of the join is returned from this operator.</li>
        <li><b>KTable-to-KTable Joins</b> are join operations designed to be consistent with their counterparts in relational databases. Here, both changelog streams are materialized into local state stores first. When a new record is received from one of the streams, it is joined with the other stream's materialized state store to produce one result for each matching pair, based on a user-provided <code>ValueJoiner</code>. A new <code>KTable</code> instance representing the result stream of the join, which is also a changelog stream of the represented table, is returned from this operator.</li>
        <li><b>KStream-to-KTable Joins</b> allow you to perform table lookups against a changelog stream (<code>KTable</code>) upon receiving a new record from another record stream (<code>KStream</code>). An example use case would be to enrich a stream of user activities (<code>KStream</code>) with the latest user profile information (<code>KTable</code>). Only records received from the record stream will trigger the join and produce results via the <code>ValueJoiner</code>, not vice versa (i.e., records received from the changelog stream will be used only to update the materialized state store). A new <code>KStream</code> instance representing the result stream of the join is returned from this operator.</li>
    </ul>

    Depending on the operands, the following join operations are supported: <b>inner joins</b>, <b>outer joins</b> and <b>left joins</b>. Their semantics are similar to the corresponding operators in relational databases.

    <h5><a id="streams_dsl_aggregations" href="#streams_dsl_aggregations">Aggregate a stream</a></h5>
    An <b>aggregation</b> operation takes one input stream, and yields a new stream by combining multiple input records into a single output record. Examples of aggregations are computing counts or sums. An aggregation over record streams usually needs to be performed on a windowing basis because otherwise the number of records that must be maintained for performing the aggregation may grow indefinitely.

    <p>
    In the Kafka Streams DSL, an input stream of an aggregation can be a <code>KStream</code> or a <code>KTable</code>, but the output stream will always be a <code>KTable</code>.
    This allows Kafka Streams to update an aggregate value upon the late arrival of further records after the value was produced and emitted.
    When such late arrival happens, the aggregating <code>KStream</code> or <code>KTable</code> simply emits a new aggregate value. Because the output is a <code>KTable</code>, the new value is considered to overwrite the old value with the same key in subsequent processing steps.
    </p>

    <h5><a id="streams_dsl_transform" href="#streams_dsl_transform">Transform a stream</a></h5>

    <p>
    Besides join and aggregation operations, there is a list of other transformation operations provided for <code>KStream</code> and <code>KTable</code> respectively.
    Each of these operations may generate one or more <code>KStream</code> and <code>KTable</code> objects and
    is translated into one or more connected processors in the underlying processor topology.
    All these transformation methods can be chained together to compose a complex processor topology.
    Since <code>KStream</code> and <code>KTable</code> are strongly typed, all these transformation operations are defined as
    generic functions where users can specify the input and output data types.
    </p>

    <p>
    Among these transformations, <code>filter</code>, <code>map</code>, <code>mapValues</code>, etc., are stateless
    transformation operations and can be applied to both <code>KStream</code> and <code>KTable</code>,
    where users can usually pass a customized function to these functions as a parameter, such as <code>Predicate</code> for <code>filter</code>,
    <code>KeyValueMapper</code> for <code>map</code>, etc:
    </p>

    <pre>
// written in Java 8+, using lambda expressions
KStream<String, String> mapped = source1.mapValues(record -> (String) record.get("category"));
    </pre>

    <p>
    Stateless transformations, by definition, do not depend on any state for processing, and hence implementation-wise
    they do not require a state store associated with the stream processor; stateful transformations, on the other hand,
    require accessing an associated state for processing and producing outputs.
    For example, in <code>join</code> and <code>aggregate</code> operations, a windowing state is usually used to store all the records received so far
    within the defined window boundary. The operators can then access these accumulated records in the store and compute
    based on them.
    </p>

    <pre>
// written in Java 8+, using lambda expressions
KTable<Windowed<String>, Long> counts = source1.groupByKey().aggregate(
    () -> 0L,  // initial value
    (aggKey, value, aggregate) -> aggregate + 1L,   // aggregating value
    TimeWindows.of("counts", 5000L).advanceBy(1000L), // intervals in milliseconds
    Serdes.Long() // serde for aggregated value
);

KStream<String, String> joined = source1.leftJoin(source2,
    (record1, record2) -> record1.get("user") + "-" + record2.get("region")
);
    </pre>

    <h5><a id="streams_dsl_sink" href="#streams_dsl_sink">Write streams back to Kafka</a></h5>

    <p>
    At the end of the processing, users can choose to (continuously) write the final result streams back to a Kafka topic through
    <code>KStream.to</code> and <code>KTable.to</code>.
    </p>

    <pre>
joined.to("topic4");
    </pre>

    If your application needs to continue reading and processing the records after they have been materialized
    to a topic via <code>to</code> above, one option is to construct a new stream that reads from the output topic;
    Kafka Streams provides a convenience method called <code>through</code>:

    <pre>
// equivalent to
//
// joined.to("topic4");
// materialized = builder.stream("topic4");
KStream<String, String> materialized = joined.through("topic4");
    </pre>

    <br>
    <p>
    Besides defining the topology, developers will also need to configure their applications
    in <code>StreamsConfig</code> before running them. A complete list of
    Kafka Streams configs can be found <a href="#streamsconfigs"><b>here</b></a>.
    </p>
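    <p>
    As a minimal sketch of how such a configuration might look (the application id, bootstrap servers, and serde choices below are placeholders, not prescriptions), a topology built with either <code>TopologyBuilder</code> or <code>KStreamBuilder</code> can then be started with <code>KafkaStreams</code>:
    </p>

    <pre>
Properties props = new Properties();
// the application id is used to name internal topics, state directories and the consumer group
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// default serdes used when a topology step does not specify its own
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

// 'builder' is the TopologyBuilder or KStreamBuilder assembled in the previous sections
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

// ... and eventually, on shutdown:
streams.close();
    </pre>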

    <h2><a id="streams_upgrade" href="#streams_upgrade">Upgrade guide and API changes</a></h2>

    <h3><a id="streams_upgrade_1020" href="#streams_upgrade_1020">Upgrading a Kafka Streams Application</a></h3>

    <h4>Upgrading from 0.10.1.x to 0.10.2.0</h4>
    <p>
    See the <a href="../#upgrade_1020_streams">Upgrade Section</a> for details.
    </p>

    <h3><a id="streams_api_changes" href="#streams_api_changes">Streams API changes in 0.10.2.0</a></h3>
    <ul>
        <li> New methods in <code>KafkaStreams</code> (see the sketch after this list):
            <ul>
                <li> set a listener to react on application state change via <code>#setStateListener(StateListener listener)</code> </li>
                <li> retrieve the current application state via <code>#state()</code> </li>
                <li> retrieve the global metrics registry via <code>#metrics()</code> </li>
                <li> apply a timeout when closing an application via <code>#close(long timeout, TimeUnit timeUnit)</code> </li>
                <li> specify a custom indent when retrieving Kafka Streams information via <code>#toString(String indent)</code> </li>
            </ul>
        </li>
        <li> Parameter updates in <code>StreamsConfig</code>:
            <ul>
                <li> parameter <code>zookeeper.connect</code> was deprecated
                    <ul>
                        <li> a Kafka Streams application no longer interacts with ZooKeeper for topic management but uses the new broker admin protocol
                            (cf. <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-TopicAdminSchema.1">KIP-4, Section "Topic Admin Schema"</a>) </li>
                        <li> thus, parameter <code>zookeeper.connect</code> is ignored in 0.10.2 and should be removed from <code>StreamsConfig</code> </li>
                    </ul>
                </li>
                <li> added many new parameters for metrics, security, and client configurations </li>
            </ul>
        </li>
        <li> Changes in the <code>StreamsMetrics</code> interface:
            <ul>
                <li> removed method: <code>#addLatencySensor()</code> </li>
                <li> added methods: <code>#addLatencyAndThroughputSensor()</code>, <code>#addThroughputSensor()</code>, <code>#recordThroughput()</code>,
                    <code>#addSensor()</code>, <code>#removeSensor()</code> </li>
            </ul>
        </li>
        <li> New methods in <code>TopologyBuilder</code>:
            <ul>
                <li> added overloads for <code>#addSource()</code> that allow defining an <code>auto.offset.reset</code> policy per source node </li>
                <li> added methods <code>#addGlobalStore()</code> to add global <code>StateStore</code>s </li>
            </ul>
        </li>
        <li> New methods in <code>KStreamBuilder</code>:
            <ul>
                <li> added overloads for <code>#stream()</code> and <code>#table()</code> that allow defining an <code>auto.offset.reset</code> policy per input stream/table </li>
                <li> <code>#table()</code> always requires a store name </li>
                <li> added method <code>#globalKTable()</code> to create a <code>GlobalKTable</code> </li>
            </ul>
        </li>
        <li> New joins for <code>KStream</code>:
            <ul>
                <li> added overloads for <code>#join()</code> to join with <code>KTable</code> </li>
                <li> added overloads for <code>#join()</code> and <code>#leftJoin()</code> to join with <code>GlobalKTable</code> </li>
            </ul>
        </li>
        <li> Aligned <code>null</code>-key handling for <code>KTable</code> joins:
            <ul>
                <li> like all other KTable operations, <code>KTable-KTable</code> joins no longer throw an exception on <code>null</code> key records, but drop those records silently </li>
            </ul>
        </li>
        <li> New window type <em>Session Windows</em>:
            <ul>
                <li> added class <code>SessionWindows</code> to specify session windows </li>
                <li> added overloads for the <code>KGroupedStream</code> methods <code>#count()</code>, <code>#reduce()</code>, and <code>#aggregate()</code>
                    to allow session window aggregations </li>
            </ul>
        </li>
        <li> Changes to <code>TimestampExtractor</code>:
            <ul>
                <li> method <code>#extract()</code> has a second parameter now </li>
                <li> new default timestamp extractor class <code>FailOnInvalidTimestamp</code>
                    (it gives the same behavior as the old (and removed) default extractor <code>ConsumerRecordTimestampExtractor</code>) </li>
                <li> new alternative timestamp extractor classes <code>LogAndSkipOnInvalidTimestamp</code> and <code>UsePreviousTimeOnInvalidTimestamp</code> </li>
            </ul>
        </li>
        <li> Relaxed type constraints of many DSL interfaces, classes, and methods (cf. <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-100+-+Relax+Type+constraints+in+Kafka+Streams+API">KIP-100</a>). </li>
    </ul>
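    <p>
    To illustrate a few of the new <code>KafkaStreams</code> methods listed above, here is a small sketch (the logging calls and timeout value are illustrative only):
    </p>

    <pre>
KafkaStreams streams = new KafkaStreams(builder, props);

// react to application state changes, e.g. for monitoring or alerting
streams.setStateListener((newState, oldState) ->
    System.out.println("State changed from " + oldState + " to " + newState));

streams.start();
System.out.println("Current state: " + streams.state());

// on shutdown, wait at most 30 seconds for a clean close
streams.close(30, TimeUnit.SECONDS);
    </pre>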
</script>

<!--#include virtual="../includes/_header.htm" -->
<!--#include virtual="../includes/_top.htm" -->
<div class="content documentation documentation--current">
    <!--#include virtual="../includes/_nav.htm" -->
    <div class="right">
        <!--#include virtual="../includes/_docs_banner.htm" -->
        <ul class="breadcrumbs">
            <li><a href="/documentation">Documentation</a></li>
        </ul>
        <div class="p-streams"></div>
    </div>
</div>
<!--#include virtual="../includes/_footer.htm" -->
<script>
$(function() {
    // Show selected style on nav item
    $('.b-nav__streams').addClass('selected');

    // Display docs subnav items
    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
});
</script>