Previously, Trogdor only handled "Faults." Now, Trogdor can handle
"Tasks" which may be either faults, or workloads to execute in the
background.
The Agent and Coordinator have been refactored from a
mutexes-and-condition-variables paradigm into a message passing
paradigm. No locks are necessary, because only one thread can access
the task state or worker state. This makes them a lot easier to reason
about.
The MockTime class can now handle mocking deferred message passing
(adding a message to an ExecutorService with a delay). I added a
MockTimeTest.
MiniTrogdorCluster now starts up Agent and Coordinator classes in
paralle in order to minimize junit test time.
RPC messages now inherit from a common Message.java class. This class
handles implementing serialization, equals, hashCode, etc.
Remove FaultSet, since it is no longer necessary.
Previously, if CoordinatorClient or AgentClient hit a networking
problem, they would throw an exception. They now retry several times
before giving up. Additionally, the REST RPCs to the Coordinator and
Agent have been changed to be idempotent. If a response is lost, and
the request is resent, no harm will be done.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#4073 from cmccabe/KAFKA-6060
Author: Bill Bejeck <bill@confluent.io>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4168 from bbejeck/MINOR_update_streams_produer_timeout_in_system_test
ijuma ewencp
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4171 from guozhangwang/KMinor-update-releasepy
It was committed inadvertently.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4172 from ijuma/remove-out-folder
guozhangwang Please review
Author: Manjula K <manjula@kafka-summit.org>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4164 from manjuapu/ny-trivago-logos
This is a followup to #4137
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Closes#4146 from apurvam/MINOR-followups-to-bump-epoch-on-expire-patch
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4128 from mjsax/minor-cleanup
minor fix
Author: Richard Yu <richardyu@Richards-Air.attlocal.net>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4110 from ConcurrencyPractitioner/trunk
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4158 from ijuma/kafka-2903-file-records-read-slice-size-greater
Author: Tom Bentley <tbentley@redhat.com>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4157 from tombentley/KAFKA-6130-verifiable-consumer-max-messages
I kept zkUtils for the call to AdminUtils.createTopic(). AdminUtils can be done in another PR.
Is there a reason why we use TopicAndPartition instead of TopicPartition in KafkaControllerZkUtils ?
Author: Mickael Maison <mickael.maison@gmail.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes#4111 from mimaison/KAFKA-6073
This has been disabled since the start and since
it's removed in TLS 1.3, there are no plans to
ever support it.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4034 from ijuma/remove-tls-renegotiation-support
Failure to close the producer could cause a transient failure, more details
below.
The request timeout was only 2 seconds, exceptions thrown were not
propagated and the producer would not be closed. If the exception
was thrown during `send`, we did not increment `numMessages`
allowing the test to pass.
I have increased the timeout to 10 seconds and made sure that
exceptions are propagated.
Example of the error:
```text
kafka.api.SaslSslAdminClientIntegrationTest > classMethod STARTED
kafka.api.SaslSslAdminClientIntegrationTest > classMethod FAILED
java.lang.AssertionError: Found unexpected threads, allThreads=Set(metrics-meter-tick-thread-2, Signal Dispatcher, main, Reference Handler, scala-execution-context-global-164, kafka-producer-network-thread | producer-1, scala-execution-context-global-166, Test worker, scala-execution-context-global-1249, /0:0:0:0:0:0:0:1:58910 to /0:0:0:0:0:0:0:1:43025 workers Thread 2, Finalizer, /0:0:0:0:0:0:0:1:58910 to /0:0:0:0:0:0:0:1:43025 workers Thread 3, scala-execution-context-global-163, metrics-meter-tick-thread-1)
```
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4144 from ijuma/ensure-producer-is-closed-test-alter-replica-log-dirs
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Tom Bentley <tbentley@redhat.com>, Ismael Juma <ismael@juma.me.uk>
Closes#3951 from omkreddy/KIP-103-DOCS
Author: Vahid Hashemian <vahidhashemian@us.ibm.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4139 from vahidhashemian/minor/indentation_fix_1710
A description of the problem is in the JIRA. I have added an integration test which reproduces the original scenario, and also added unit test cases.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>, Ted Yu <yuzhihong@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Closes#4137 from apurvam/KAFKA-6119-bump-epoch-when-expiring-transactions
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#4140 from rajinisivaram/KAFKA-6131-txn-concurrentmap
The output of `wmic` can be very long and could truncate the search keywords in the existing command. If those keywords are truncated no process is returned in the output. An update is suggested to the command by which the query is performed inside the `wmic` command itself instead of using pipes and `find`.
Author: Vahid Hashemian <vahidhashemian@us.ibm.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4083 from vahidhashemian/minor/improve_quickstart_for_windows_wmic
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Vahid Hashemian <vahidhashemian@us.ibm.com>, Damian Guy <damian.guy@gmail.com>, Bill Bejeck <bill@confluent.io>
Closes#4136 from guozhangwang/K6100-rocksdb-580-regression
Even though this class is internal, it's widely
used by other projects and it's better to avoid
breaking them until we have a publicly supported
test library.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4138 from ijuma/revert-embedded-zookeeper-rename
- kafka.controller.ZookeeperClient -> kafka.zookeeper.ZooKeeperClient
- kafka.controller.ControllerZkUtils -> kafka.zk.KafkaZkClient
- kafka.controller.ZkData -> kafka.zk.ZkData
- Renamed various fields to match new names and for consistency
- A few clean-ups in ZkData
- Document intent
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Onur Karaman <okaraman@linkedin.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Jun Rao <junrao@gmail.com>
Closes#4112 from ijuma/rename-zookeeper-client-and-move-to-zookeper-package
related to https://github.com/apache/kafka-site/pull/103
Author: Joel Hamill <git config --global user.email>
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Closes#4133 from joel-hamill/dev-guide-title
1. After the poll call, re-check if the state has been changed or not; if yes, initialize the tasks again.
2. Minor log4j improvements.
Author: Guozhang Wang <wangguoz@gmail.com>
Author: Damian Guy <damian.guy@gmail.com>
Author: Jason Gustafson <jason@confluent.io>
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>, Matthias J. Sax <matthias@confluent.io>, Ted Yu <yuzhihong@gmail.com>
Closes#4096 from guozhangwang/KHotfix-restore-only
A couple of root causes of this flaky test is fixed:
1. The MockTime was incorrectly used across multiple test methods within the class, as a class rule. Instead we set it on each test case; also remove the scala MockTime dependency.
2. List topics may not contain the deleted topics while their ZK paths are yet to be deleted; so the delete-check-recreate pattern may fail to successfully recreate the topic at all. Change the checking to read from zk path directly instead.
Another minor fix is to remove the misleading wait condition error message as the accumData is always empty.
Author: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>, Matthias J. Sax <matthias@confluent.io>
Closes#4095 from guozhangwang/KMinor-reset-integration-test
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Alex Ayars <alex.ayars@confluent.io>
Closes#4084 from cmccabe/KAFKA-6070
It seems to output a few false positives, but still
worth verifying.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4117 from ijuma/dependency-check
Mainly for Java 9 fixes and improved compilation times (5-10% reduction):
http://www.scala-lang.org/news/2.12.4
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4102 from ijuma/update-scala-version
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#4104 from mjsax/minor-uncaught-exception-handler
The methods resetReconnectBackoff and updateReconnectBackoff in ClusterConnectionStates both take an instance of a private inner class as parameter and thus cannot be called from outside the class anyway.
Author: Soenke Liebau <soenke.liebau@opencore.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4114 from soenkeliebau/MINOR_private
* Fix issue in `retryRequestsUntilConnected` where the same response
could appear multiple times (implies that we are lacking test coverage)
* Introduce type member in AsyncRequest for the AsyncResponse
type and refactor the code to eliminate most downcasts
* Remove a number of unnecessary collection copies in
`retryRequestsUntilConnected`
* Move ControllerContext to its own file
* Rename getACL/setACL to getAcl/setAcl to match Kafka naming
convention
* Replace tuple of 3 elements with case class in one place (we
should do this in other places too)
* Extract `send` and `shouldWatch` from
`ZooKeeperClient.handleRequests`
* Use pattern matching instead of if/else chains in a few places (we
should do it in more places)
* A couple of renames to avoid overloads and hence benefit from
better type inference
* Use Option and default arguments instead of passing null in
some places
* `Expired` is no longer a case class since it has no parameters,
but it has state
* Various minor clean-ups
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jun Rao <junrao@gmail.com>, Onur Karaman <okaraman@linkedin.com>
Closes#4088 from ijuma/async-zkclient-cleanups
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes#4103 from rajinisivaram/KAFKA-6042-group-deadlock
(cherry picked from commit 5ee157126d)
Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
The idempotent producer doesn't change that setting any more and the
accepted range has changed.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Apurva Mehta <apurva@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#4097 from ijuma/fix-javadoc-wrt-max-in-flight-for-idempotent
Author: Tommy Becker <tobecker@tivo.com>
Reviewers: Bill Bejeck <bill@confluent.io>, Damian Guy <damian.guy@gmail.com>
Closes#4081 from twbecker/KAFKA-6069
Currently, in branches _trunk_, _0.11.0_, and _1.0_ the property **max.in.flight.requests.per.connection** is incorrectly misspelled as _max.inflight.requests.per.connection_
harshach ijuma guozhangwang can you please review. Thank you.
Author: Hugo Louro <hmclouro@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4094 from hmcl/trunk_MINOR_Doc_InflightProp
Rearranged the testAddPartitionDuringDeleteTopic() test to keep the
likelyhood of the race condition.
Author: Maytee Chinavanichkit <maytee.chinavanichkit@linecorp.com>
Reviewers: Jun Rao <junrao@gmail.com>
Closes#4056 from mayt/KAFKA-6051
Kafka today uses ZkClient, a wrapper client around the raw Zookeeper client. This library only exposes synchronous apis to the user. Synchronous apis mean we must wait an entire round trip before doing the next operation.
This becomes problematic with partition-heavy clusters, as we find the controller spending a significant amount of time just sending many sequential reads and writes to zookeeper at the per-partition granularity. This especially becomes an issue during:
- controller failover, where the newly elected controller effectively reads all zookeeper state.
- broker failures and controlled shutdown. The controller tries to elect a new leader for partitions previously led by the broker. The controller also removes the broker from isr on partitions for which the broker was a follower. These all incur partition-granular reads and writes to zookeeper.
As a first step in addressing these issues, we built a low-level wrapper client called ZookeeperClient in KAFKA-5501 that encourages pipelined, asynchronous apis.
This patch converts the controller to use the async ZookeeperClient to improve controller failover, broker failure handling, and controlled shutdown times.
Some notable changes made in this patch:
- All ControllerEvents now defer access to zookeeper at processing time instead of enqueue time as was intended with the single-threaded event queue model patch from KAFKA-5028. This results in a fresh view of the zookeeper state by the time we process the event. This reverts the hacks from KAFKA-5502 and KAFKA-5879.
- We refactored PartitionStateMachine and ReplicaStateMachine to process multiple partitions and replicas in batch rather than one-at-a-time so that we can send a batch of requests over to ZookeeperClient to pipeline.
- We've decouple ZookeeperClient handler registration from watcher registration. Previously, these two were coupled, which meant handler registrations actually sent out a request to the zookeeper ensemble to do the actual watcher registration. In KafkaController.onControllerFailover, we register partition modification handlers (thereby registering watchers) and additionally lookup the partition assignments for every topic in the cluster. We can shave a bit of time off failover if we merge these two operations. We can do this by decoupling ZookeeperClient handler registration from watcher registration. This means ZookeeperClient's registration apis have been changed so that they are purely in-memory operations, and they only take effect when the client sends ExistsRequest, GetDataRequest, or GetChildrenRequest.
- We've simplified the logic for updating LeaderAndIsr such that if we get a BADVERSION error code, the controller will now just retry in the next round by reading the new state and trying the update again. This simplifies logic when updating the partition leader epoch, removing replicas from isr, and electing leaders for partitions.
- We've implemented KAFKA-5083: always leave the last surviving member of the ISR in ZK. This means that if people re-disabled unclean leader election, we can still try to elect the leader from the last in-sync replica.
- ZookeeperClient's handlers have been changed so that their methods default to no-ops for convenience.
- All znode paths and definitions for znode encoding and decoding have been consolidated as static methods in ZkData.scala.
- The partition leader election algorithms have been refactored as pure functions so that they can be easily unit tested.
- PartitionStateMachine and ReplicaStateMachine now have unit tests.
Author: Onur Karaman <okaraman@linkedin.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes#3765 from onurkaraman/KAFKA-5642