Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.
Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
Move the error code resetting logic from the onPartitionsRevoked callback into the streamthread directly after we've decided to rejoin the group, since onPartitionsRevoked are not guaranteed to be triggered.
Ran system tests on the originally failed StreamsUpgradeTest 10 times and passed.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Jun Rao <junrao@gmail.com>
This is the implementation for [KIP-503](https://cwiki.apache.org/confluence/display/KAFKA/KIP-503%3A+Add+metric+for+number+of+topics+marked+for+deletion)
When deleting a large number of topics, the Controller can get quite bogged down. One problem with this is the lack of visibility into the progress of the Controller. We can look into the ZK path for topics marked for deletion, but in a production environment this is inconvenient. This PR adds a JMX metric `kafka.controller:type=KafkaController,name=TopicsToDeleteCount` to make it easier to see how many topics are being deleted.
Reviewers: Stanislav Kozlovski <stanislav@confluent.io>, Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
Author: asutosh936 <asutosh.pandya@hotmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Vahid Hashemian <vahid.hashemian@gmail.com>, Jason Gustafson <jason@confluent.io>
Closes#7141 from asutosh936/KAFKA-8698
In a KTable context, we should not pass null into a user-supplied serde.
Testing: I verified that the change to the test results in test failures without the patch.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>,
This patch fixes a bug in the handling of MESSAGE_TOO_LARGE errors. The large batch is split, the smaller batches are re-added to the accumulator, and the batch is deallocated, but it was not removed from the list of in-flight batches. When the batch was eventually expired from the in-flight batches, the producer would try to deallocate it a second time, causing an error. This patch changes the behavior to correctly remove the batch from the list of in-flight requests.
Reviewers: Luke Stephenson, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
`testProduceAfterLogDirFailureOnLeader` currently disables producer retries in
order to catch and validate the exception thrown by a failure, and then tries to
produce successfully once the leadership changes. This second produce can
intermittently fail, causing test flakiness. This patch splits these validations
into two tests in order to allow retries for the produce request after the
leadership change.
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Ismael Juma <ismael@juma.me.uk>
If retrieving metadata during `poll(Duration)` repeatedly takes longer than the
poll timeout, we don't make progress and `waitUntilTrue` eventually times out
causing the test to fail. This behaviour differs from the older `poll(long)` that
would block until metadata retrieval had completed. The flakiness was likely
introduced when we switched from the latter to the former.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Reverts the TreeMap -> ConcurrentSkipListMap change that caused a performance regression in 2.3, and fixes the ConcurrentModificationException by copying (just) the key set to iterate over
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Refactors the DescribeDelegationToken to use the generated RPC classes.
Author: Viktor Somogyi-Vass <viktorsomogyi@gmail.com>
Author: Viktor Somogyi <viktorsomogyi@gmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7154 from viktorsomogyi/refactor-describe-dt
This patch updates ConsumerGroupCommandTest.scala to use the maximum possible number of AdminClient retries. The test runs will still be bounded by the request timeout. This address flakiness in tests such as testResetOffsetsNotExistingGroup and testResetOffsetsExistingTopic, which was caused by group coordinators being intermittently unavailable.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Update the javadoc of SockerServer#startup(). SocketServer#startProcessors() does not exist any more and it has been replaced by SocketServer#startDataPlaneProcessors() and SocketServer#startControlPlaneProcessor().
Reviewers: Jason Gustafson <jason@confluent.io>
Credit to @lbradstreet for profiling the producer with a large number of partitions.
Cache `topicMetadata`, `brokers` and `controller` in the `MetadataResponse`
the first time it's needed avoid unnecessary recomputation. We were previously
computing`brokersMap` 4 times per partition in one code path that was invoked from
multiple places. This is a regression introduced via a42f16f980 and first released
in 2.3.0.
The `Cluster` constructor became significantly more allocation heavy due to
2c44e77e2f20, first released in 2.2.0. Replaced `merge` calls with more verbose,
but more efficient code. Added a test to verify that the returned collections are
unmodifiable.
Add `topicAuthorizedOperations` and `clusterAuthorizedOperations` to
`MetadataResponse` and remove `data()` method.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Lucas Bradstreet <lucasbradstreet@gmail.com>, Colin P. McCabe <cmccabe@confluent.io>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Justine Olshan <jolshan@confluent.io>
This patch stores `clusterId` in the `meta.properties` file. During startup, the broker checks that it joins the correct cluster and fails fast otherwise.
The `meta.properties' is versioned. I have decided to not bump the version because 1) the clusterId is null anyway if not present in the file; and 2) bumping it means that rolling back to a previous version won't work.
I have refactored the way the metadata is read and written as it was strongly coupled with the brokerId bits. Now, the metadata is read independently during the startup and used to 1) check the clusterId and 2) get or generate the brokerId (as before).
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
Add the AlterPartitionReassignments and ListPartitionReassignments APIs. Also remove an unused methodlength suppression for KafkaAdminClient.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Viktor Somogyi <viktorsomogyi@gmail.com>
Kafka ships with default ZK configuration. With the upgrade to ZK 3.5, our defaults include running ZK's AdminServer on port 8080. This is an unfortunate default as it tends to cause conflicts.
I suggest we default to disable ZK's AdminServer in the default ZK configs that we ship. Users who want to use AdminServer can enable it and set the port to something that works for them. Realistically, in most production environments, a different ZK server will be used anyway. So this is mostly to save new users who are trying Kafka on their own machine from running into accidental and frustrating port conflicts.
Author: Gwen Shapira <gwen@confluent.io>
Reviewers: Ismael Juma
Closes#7203 from gwenshap/zk_disable_adminserver
The RecordAccumulator ready calls `leaderFor` unnecessarily when the ProducerBatch
queue is empty. When producing to many partitions, the queue is often empty and the
`leaderFor` call can be expensive in comparison. Remove the unnecessary call.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>, Richard Yu <yohan.richard.yu@gmail.com>, Guozhang Wang <guozhang@confluent.io>
Iterator#remove has a default implementation that throws UnsupportedOperatorException so there's no need to override it with the same thing.
Should be cherry-picked back to whenever we switched to Java 8
Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <guozhang@confluent.io>
InMemoryKeyValueStore uses ConcurrentSkipListMap#size which takes linear time as it iterates over the entire map. We should just track size ourselves for approximateNumEntries
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>
This adds a basic system test that enables rack-aware brokers with the rack-aware replica selector for fetch from followers (KIP-392). The test asserts that the follower was read from at least once and that all the messages that were produced were successfully consumed.
Reviewers: Jason Gustafson <jason@confluent.io>
Added the ability for the connector handles and task handles, which are used by the monitorable source and sink connectors used to verify the functionality of the Connect framework, to record the number of times the connector and tasks have each been started, and to allow a test to obtain a `RestartLatch` that can be used to block until the connectors and/or tasks have been restarted a specified number of types.
Typically, a test will get the `ConnectorHandle` for a connector, and call the `ConnectorHandle.expectedRestarts(int)` method with the expected number of times that the connector and/or tasks will be restarted, and will hold onto the resulting `RestartLatch`. The test will then change the connector (or otherwise cause the connector to restart) one or more times as desired, and then call `RestartLatch.await(long, TimeUnit)` to block the test up to a specified duration for the connector and all tasks to be started the specified number of times.
This commit also increases several of the maximum wait times used in other integration tests. It doesn’t hurt to potentially wait longer, since most test runs will not need to wait the maximum amount of time anyway. However, in the rare cases that do need that extra time, waiting a bit more is fine if we can reduce the flakiness and minimize test failures that happened to time out too early.
Unit tests were added for the new `RestartLatch` and `StopAndStartCounter` utility classes. This PR only affects the tests and does not affect any runtime code or API.
**This should be merged on `trunk` and backported to the `2.3.x` branch.**
Author: Randall Hauch <rhauch@gmail.com>
Reviewers: Konstantine Karantasis, Arjun Satish
Closes#7019 from rhauch/kafka-8391
When calling readLogToEnd(), the KafkaBasedLog worker thread should catch TimeoutException and log a warning, which can occur if brokers are unavailable, otherwise the worker thread terminates.
Includes an enhancement to MockConsumer that allows simulating exceptions not just when polling but also when querying for offsets, which is necessary for testing the fix.
Author: Paul Whalen <pgwhalen@gmail.com>
Reviewers: Randall Hauch <rhauch@gmail.com>, Arjun Satish <arjun@confluent.io>, Ryanne Dolan <ryannedolan@gmail.com>
Corrected the AbstractHerder to correctly identify task configs that contain variables for externalized secrets. The original method incorrectly used `matcher.matches()` instead of `matcher.find()`. The former method expects the entire string to match the regex, whereas the second one can find a pattern anywhere within the input string (which fits this use case more correctly).
Added unit tests to cover various cases of a config with externalized secrets, and updated system tests to cover case where config value contains additional characters besides secret that requires regex pattern to be found anywhere in the string (as opposed to complete match).
Author: Arjun Satish <arjun@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
The assignReplicasToBrokers method has helpful but a large unformatted
javadoc comment that results in a big blob in generated html. This
change formats the comment so that generated javadoc is nice.
Reviewers: Stanislav Kozlovski <familyguyuser192@windowslive.com>, Jason Gustafson <jason@confluent.io>
Since `Metrics` was constructed with `enableExpiration=false`, this was
not a source of flakiness given the current implementation. This could
change in the future, so good to follow the class contract.
Included a few clean-ups with regards to redundant casts and type parameters
as well as usage of try with resources for inline usage of `Metrics`.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Avoids calling both `containsKey` and `get` from isMuted, when only a single get is necessary.
Also avoid calling `remove` unless necessary. This could be a reduction of map operations
from 3 to 1.
isMuted showed up as a hotspot in profiling when using the producer with high numbers of partitions.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Connector validation fails if an alias is used for the converter since the validation for that is done via `ConfigDef.validateAll(...)`, which in turn invokes `Class.forName(...)` on the alias. Even though the class is successfully loaded by the DelegatingClassLoader, some Java implementations will refuse to return a class from `Class.forName(...)` whose name differs from the argument provided.
This commit alters `ConfigDef.parseType(...)` to first invoke `ClassLoader.loadClass(...)` on the class using our class loader in order to get a handle on the actual class object to be loaded, then invoke `Class.forName(...)` with the fully-qualified class name of the to-be-loaded class and return the result. The invocation of `Class.forName(...)` is necessary in order to allow static initialization to take place; simply calling `ClassLoader.loadClass(...)` is insufficient.
Also corrected a unit test that relied upon the old behavior.
Author: Chris Egerton <chrise@confluent.io>
Reviewers: Robert Yokota <rayokota@gmail.com>, Randall Hauch <rhauch@gmail.com>
We recently introduced a bunch of flaky tests in the AdminClientIntegrationTest.
These tests are failing very frequently. We should ignore the tests in order to
make the build stable until we have a fix.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The introduction of KIP-480: Sticky Producer Partitioner had the
side effect that generateAndProduceMessages can often write
messages to a lower number of partitions to improve batching.
testDescribeLogDirsRequest (and potentially other tests) relies
on the messages being written somewhat uniformly to the topic
partitions. We fix the issue by including a monotonically
increasing key in the produced messages.
I also included a couple of minor clean-ups I noticed while
debugging the issue.
The test failed very frequently when executed locally before the
change and it passed 100 times consecutively after the change.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Author: Mickael Maison <mickael.maison@gmail.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Viktor Somogyi <viktorsomogyi@gmail.com>
Closes#7038 from mimaison/KAFKA-8598
1. Add onPartitionsLost into the RebalanceListener, which will be triggered when the consumer found that the generation is reset due to fatal errors in response handling.
2. Semantical behavior change: with COOPERATIVE protocol, if the revoked / lost partitions are empty, do not trigger the corresponding callback at all. For added partitions though, even if it is empty we would still trigger the callback as a way to notify the rebalance event; with EAGER protocol, revoked / assigned callbacks are always triggered.
The ordering of the callback would be the following:
a. Callback onPartitionsRevoked / onPartitionsLost triggered.
b. Update the assignment (both revoked and added).
c. Callback onPartitionsAssigned triggered.
In this way we are assured that users can still access the partitions being revoked, whereas they can also access the partitions being added.
3. Semantical behavior change (KAFKA-4600): if the rebalance listener throws an exception, pass it along all the way to the consumer.poll caller, but still completes the rest of the actions. Also, the newly assigned partitions list does not gets affected with exception thrown since it is just for notifying the users.
4. Semantical behavior change: the ConsumerCoordinator would not try to modify assignor's returned assignments, instead it will validate that assignments and set the error code accordingly: if there are overlaps between added / revoked partitions, it is a fatal error and would be communicated to all members to stop; if revoked is not empty, it is an error indicate re-join; otherwise, it is normal.
5. Minor: with the error code removed from the Assignment, ConsumerCoordinator will request re-join if the revoked partitions list is not empty.
6. Updated ConsumerCoordinatorTest accordingly. Also found a minor bug in MetadataUpdate that removed topic would still be retained with null value of num.partitions.
6. Updated a few other flaky tests that are exposed due to this change.
Reviewers: John Roesler <vvcephei@users.noreply.github.com>, A. Sophie Blee-Goldman <sophie@confluent.io>, Jason Gustafson <jason@confluent.io>
Author: Mickael Maison <mickael.maison@gmail.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Viktor Somogyi <viktorsomogyi@gmail.com>
Closes#7098 from mimaison/KAFKA-8599
2.9.9.1 and 2.9.9.2 include security fixes while 2.9.9.3 fixes a regression
introduced in 2.9.9.2.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>