Kafka today uses ZkClient, a wrapper client around the raw Zookeeper client. This library only exposes synchronous apis to the user. Synchronous apis mean we must wait an entire round trip before doing the next operation.
This becomes problematic with partition-heavy clusters, as we find the controller spending a significant amount of time just sending many sequential reads and writes to zookeeper at the per-partition granularity. This especially becomes an issue during:
- controller failover, where the newly elected controller effectively reads all zookeeper state.
- broker failures and controlled shutdown. The controller tries to elect a new leader for partitions previously led by the broker. The controller also removes the broker from isr on partitions for which the broker was a follower. These all incur partition-granular reads and writes to zookeeper.
As a first step in addressing these issues, we built a low-level wrapper client called ZookeeperClient in KAFKA-5501 that encourages pipelined, asynchronous apis.
This patch converts the controller to use the async ZookeeperClient to improve controller failover, broker failure handling, and controlled shutdown times.
Some notable changes made in this patch:
- All ControllerEvents now defer access to zookeeper at processing time instead of enqueue time as was intended with the single-threaded event queue model patch from KAFKA-5028. This results in a fresh view of the zookeeper state by the time we process the event. This reverts the hacks from KAFKA-5502 and KAFKA-5879.
- We refactored PartitionStateMachine and ReplicaStateMachine to process multiple partitions and replicas in batch rather than one-at-a-time so that we can send a batch of requests over to ZookeeperClient to pipeline.
- We've decouple ZookeeperClient handler registration from watcher registration. Previously, these two were coupled, which meant handler registrations actually sent out a request to the zookeeper ensemble to do the actual watcher registration. In KafkaController.onControllerFailover, we register partition modification handlers (thereby registering watchers) and additionally lookup the partition assignments for every topic in the cluster. We can shave a bit of time off failover if we merge these two operations. We can do this by decoupling ZookeeperClient handler registration from watcher registration. This means ZookeeperClient's registration apis have been changed so that they are purely in-memory operations, and they only take effect when the client sends ExistsRequest, GetDataRequest, or GetChildrenRequest.
- We've simplified the logic for updating LeaderAndIsr such that if we get a BADVERSION error code, the controller will now just retry in the next round by reading the new state and trying the update again. This simplifies logic when updating the partition leader epoch, removing replicas from isr, and electing leaders for partitions.
- We've implemented KAFKA-5083: always leave the last surviving member of the ISR in ZK. This means that if people re-disabled unclean leader election, we can still try to elect the leader from the last in-sync replica.
- ZookeeperClient's handlers have been changed so that their methods default to no-ops for convenience.
- All znode paths and definitions for znode encoding and decoding have been consolidated as static methods in ZkData.scala.
- The partition leader election algorithms have been refactored as pure functions so that they can be easily unit tested.
- PartitionStateMachine and ReplicaStateMachine now have unit tests.
Author: Onur Karaman <okaraman@linkedin.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes#3765 from onurkaraman/KAFKA-5642
long sizeBytes() {
long sizeInBytes = 0;
for (final NamedCache namedCache : caches.values()) {
sizeInBytes += namedCache.sizeInBytes();
}
return sizeInBytes;
}
The summation w.r.t. sizeInBytes may overflow.
Check similar to what is done in size() should be performed.
Author: siva santhalingam <siva.santhalingam@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#4041 from shivsantham/kafka-6023
Author: Dong Lin <lindong28@gmail.com>
Reviewers: Tom Bentley <tbentley@redhat.com>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes#3874 from lindong28/KAFKA-5163
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#3814 from omkreddy/KAFKA-4504
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax<matthias@confluent.io>, Bill Bejeck <bill@confluent.io>
Closes#4074 from dguy/minor-session-window-equals
1. Added missing Javadocs in public interfaces.
2. Added missing upgrade web docs.
3. Minor improvements on exception messages.
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>, Matthias J. Sax <matthias@confluent.io>, Antony Stubbs <antony.stubbs@gmail.com>
Closes#4071 from guozhangwang/KMinor-javadoc-gaps
Author: Jacek Laskowski <jacek@japila.pl>
Reviewers: Apurva Mehta <apurva@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#4038 from jaceklaskowski/KAFKA-4818-isolationLevel
- Remove "list commits" since we never use it
- Fix release branch detection to just look
for branches that start with digits
- Make script executable
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4067 from ijuma/merge-script-improvements
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4064 from mjsax/minor-add-state-serdes-test
Multiple inflights means that when there are rolling bounces or other cluster instability, there is an increased likelihood of having previously tried batch expire in the accumulator. This is a fatal error
for a transactional producer, causing the `TransactionalMessageCopier` to exit. To work around this, we bump the request timeout. We can get rid of this when KIP-91 is merged.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4039 from apurvam/MINOR-bump-request-timeout-in-transactional-message-copier
Author: Bill Bejeck <bill@confluent.io>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax<matthias@confluent.io>
Closes#4068 from bbejeck/MINOR_fix_java_doc_example_for_1_0_API
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#4063 from mjsax/minor-improve-store-parameter-checks
The following happens on Windows for `HUP`:
[2017-10-11 21:45:11,642] FATAL (kafka.Kafka$)
java.lang.IllegalArgumentException: Unknown signal: HUP
at sun.misc.Signal.<init>(Unknown Source)
at kafka.Kafka$.registerHandler$1(Kafka.scala:67)
at kafka.Kafka$.registerLoggingSignalHandler(Kafka.scala:73)
at kafka.Kafka$.main(Kafka.scala:82)
at kafka.Kafka.main(Kafka.scala)
I thought it was safer not to register them at all since the additional
logging is a nice to have and we haven't tested it on Windows.
Also changed map to be concurrent and removed stray
printStackTrace in test.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Damian Guy <damian.guy@gmail.com>
Closes#4066 from ijuma/dont-register-signal-handler-windows
I found this by running the tests while I happened to
have a kafka broker running.
Author: Tom Bentley <tbentley@redhat.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4065 from tombentley/MINOR-random-port
Use Scala string templates instead of format
Author: Mickael Maison <mickael.maison@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4058 from mimaison/minor_AFT_logging
guozhangwang Please review.
Author: Manjula K <manjula@kafka-summit.org>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4059 from manjuapu/redesign-streams-page
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Ewen Cheslack-Postava <ewen@confluent.io>
Closes#4054 from guozhangwang/KMinor-pre-1.0-release
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4046 from mjsax/kafka-5541-minor-follow-up
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4051 from mjsax/minor-kip-182-follow-up
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#4035 from hachikuji/KAFKA-5547-followup and squashes the following commits:
f6b04ce1a [Jason Gustafson] Add a couple missed common fields
d3473b14d [Jason Gustafson] Fix compilation errors and a few warnings
58a0ae695 [Jason Gustafson] MINOR: Avoid some unnecessary collection copies in KafkaApis
With these changes, we are ensuring that the partitions being reassigned are from non-zero offsets. We also ensure that every message in the log has producerId and sequence number.
This means that it successfully reproduces https://issues.apache.org/jira/browse/KAFKA-6003.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4029 from apurvam/KAFKA-6016-add-idempotent-producer-to-reassign-partitions
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4048 from mjsax/fix-eos-test-race-condition
Author: bartdevylder <bartdevylder@gmail.com>
Author: Bart De Vylder <bartdevylder@gmail.com>
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Closes#4044 from bartdevylder/KAFKA-6026
Author: Xin Li <Xin.Li@trivago.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Closes#4043 from lisa2lisa/fix
(cherry picked from commit bb27215ceac9caaac446b5fe43d3f3284624fcb4)
Signed-off-by: Jason Gustafson <jason@confluent.io>
Author: Jason Gustafson <jason@confluent.io>
Reviewers: tedyu <yuzhihong@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#4047 from hachikuji/factor-out-some-common-fields
…up rebalances in progress
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#3506 from cmccabe/KAFKA-5565
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4037 from mjsax/kafka-5541-dont-rethrow-on-suspend-or-close-2
- improve tests to get rid of calls to `sleep` in Python
- fixed some flaky test conditions
- improve debugging
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#3542 from mjsax/failing-eos-system-tests
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4023 from ijuma/kafka-5829-avoid-reading-older-segments-on-hard-shutdown
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Xavier Léauté <xavier@confluent.io>, Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>
Closes#4031 from guozhangwang/KMinor-assigned-task-log4j
Add a `with(Serde keySerde, Serde valSerde)` to `Materialized` for cases where people don't care about the state store name.
Author: Damian Guy <damian.guy@gmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Matthias J. Sax <matthias@confluent.io>
Closes#4009 from dguy/materialized
It is possible for batches with sequence numbers to be in the `deque` while at the same time the in flight batches in the `TransactionManager` are removed due to a producerId reset.
In this case, when the batches in the `deque` are drained, we will get a `NullPointerException` in the background thread due to this line:
```java
if (first.hasSequence() && first.baseSequence() != transactionManager.nextBatchBySequence(first.topicPartition).baseSequence())
```
Particularly, `transactionManager.nextBatchBySequence` will return null, because there no inflight batches being tracked.
In this patch, we simply allow the batches in the `deque` to be drained if there are no in flight batches being tracked in the TransactionManager. If they succeed, well and good. If the responses come back with an error, the batces will be ultimately failed in the producer with an `OutOfOrderSequenceException` when the response comes back.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4022 from apurvam/KAFKA-6015-npe-in-record-accumulator
This change allows for testing custom Processors and Transformers that call `schedule` and `commit` using KStreamTestDriver, by _not_ throwing `UnsupportedOperationException`.
This PR is my original work.
Author: Mats Julian Olsen <mats@plysjbyen.net>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#3992 from mewwts/allow-schedule-and-commit
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#3819 from guozhangwang/KMinor-rocksDB-573
Updated a couple places in docs with the 'registered' trademark symbol.
Author: Derrick Or <derrickor@gmail.com>
Reviewers: Jun Rao <junrao@gmail.com>
Closes#4028 from derrickdoo/kafka-trademark-status
Added metrics to the Connect worker and rebalancing metrics to the distributed herder.
This is built on top of #3987, and I can rebase this PR once that is merged.
Author: Randall Hauch <rhauch@gmail.com>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#4011 from rhauch/kafka-5903
Changed these topic titles:
- Write your own Streams Applications -> Tutorial: Write a Streams Application
- Play with a Streams Application -> Run the Streams Demo Application
Author: Joel Hamill <joel@Joel-Hamill-Confluent.local>
Author: Joel Hamill <11722533+joel-hamill@users.noreply.github.com>
Reviewers: Michael G. Noll <michael@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4017 from joel-hamill/joel-hamill/streams-titles