Right now we only have very generic FailedProduceRequestsPerSec and FailedFetchRequestsPerSec metrics that mark whenever a record is failed on the broker side. To improve the debugging UX, I added 4 new metrics in BrokerTopicStats to log various scenarios when an InvalidRecordException is thrown when LogValidator fails to validate a record:
-- NoKeyCompactedTopicRecordsPerSec: counter of failures by compacted records with no key
-- InvalidMagicNumberRecordsPerSec: counter of failures by records with invalid magic number
-- InvalidMessageCrcRecordsPerSec: counter of failures by records with crc corruption
-- NonIncreasingOffsetRecordsPerSec: counter of failures by records with invalid offset
Reviewers: Robert Yokota <rayokota@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
This patch ensures that relevant system configurations are included in exception messages when zk security validation fails.
Reviewers: Vikas Singh <soondenana@users.noreply.github.com>, José Armando García Sancio <jsancio@users.noreply.github.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Added unit test for recent fix of `KafkaConfigBackingStore`.
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
Corrected the `KafkaConfigBackingStore` logic to notify of only the changed tasks, rather than all tasks. This was not noticed before because Connect always stopped and restarted all tasks during a rebalanced, but since 2.3 the incremental rebalance logic exposed this bug.
Author: Luying Liu <lyliu@lyliu-mac.freewheelmedia.net>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>
New Java Authorizer API and a new out-of-the-box authorizer (AclAuthorizer) that implements the new interface.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Fix a bug where ClientCompatibilityFeaturesTest fails when running multiple iterations.
Also, fix a typo in tests/docker/Dockerfile.
Reviewers: Ismael Juma <ismael@juma.me.uk>
In case of version probing we would skip the logic for setting cluster / assigned tasks; since these values are initialized as null they are vulnerable to NPE when code changes.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>
The tag key for store level metrics specified in StreamsMetricsImpl
is unified with the tag keys on thread and task level.
Reviewers: Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
The jmh LRUCacheBenchmark will exhibit an int overflow when run on a fast machine:
```
java.lang.ArrayIndexOutOfBoundsException: Index -3648 out of bounds for length 10000
at org.apache.kafka.jmh.cache.LRUCacheBenchmark.testCachePerformance(LRUCacheBenchmark.java:70)
at org.apache.kafka.jmh.cache.generated.LRUCacheBenchmark_testCachePerformance_jmhTest.testCachePerformance_thrpt_jmhStub(LRUCacheBenchmark_testCachePerformance_jmhTest.java:119)
at org.apache.kafka.jmh.cache.generated.LRUCacheBenchmark_testCachePerformance_jmhTest.testCachePerformance_Throughput(LRUCacheBenchmark_testCachePerformance_jmhTest.java:83)
```
Reviewers: Jason Gustafson <jason@confluent.io>
As part of commit 4d1ee26a13 streams
version 2.3.0 test jar was added, but there was a simple typo in the
path that specified the version.
`ducker-ak up` was failing because of that. Fixed that.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Reviewers: cpettitt-confluent <53191309+cpettitt-confluent@users.noreply.github.com>, A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Currently on commit streams will attempt to delete offsets from repartition topics. However, if a topology does not have any repartition topics, then the recordsToDelete map will be empty.
This PR adds a check that the recordsToDelete is not empty before executing the AdminClient#deleteRecords() method.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
- Call `assertNoNonDaemonThreads` in test method instead of tear down method
to avoid situation where parent's class tear down is not invoked.
- Pass the thread prefix in tests that call `assertNoNonDaemonThreads` so that it
works correctly.
- Rename `verifyNonDaemonThreadsStatus` to `assertNoNonDaemonThreads` to
make it clear that it may throw.
Reviewers: Anna Povzner <anna@confluent.io>, Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This patch fixes the quota system test whose JMX tool relies on the existence
of these metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Nikhil Bhatia <nikhil@confluent.io>, Tu V. Tran <tuvtran97@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Splits the existing StickyAssignor logic into an AbstractStickyAssignor class, which is extended by the existing (eager) StickyAssignor and by the new CooperativeStickyAssignor which supports incremental cooperative rebalancing.
There is no actual change to the logic -- most methods from StickyAssignor were moved to AbstractStickyAssignor to be shared with CooperativeStickyAssignor, and the abstract MemberData memberData(Subscription) method converts the Subscription to the embedded list of owned partitions for each assignor.
The "generation" logic is left in, however this is always Optional.empty() for the CooperativeStickyAssignor as onPartitionsLost should always be called when a generation is missed.
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Make sure to show the message key, even when the message value is null.
This changes the output of one of the tools. Is the output of the tool considered a public API? Does this need a discussion or a KIP?
Testing: Ran the tool on a compacted topic. Previously, the tool did not show any message keys for tombstone messages (messages where the value is null). Now, the tool shows message keys.
Reviewers: Mickael Maison <mimaison@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>
When sending bad records, the Trogdor task will fail if the final record produced is bad. Instead we should catch the exception to allow the task to finish since sending bad records is a valid use case.
Reviewers: Tu V. Tran <tuvtran97@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
RocksDB metrics are added to the Kafka metrics. For each segmented state store only
one set of metrics is exposed rather than one set of metrics for each segment.
The metrics are not computed yet.
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Make offsets immutable to users of RecordCollector.offsets. Fix up an
existing case where offsets could be modified in this way. Add a simple
test to verify offsets cannot be changed externally.
Reviewers: Bruno Cadonna <bruno@confluent.io>, Guozhang Wang <guozhang@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Prior to this change an NPE is raised when calling AssignedTasks.close
under the following conditions:
1. EOS is enabled
2. The task was in a suspended state
The cause for the NPE is that when a clean close is requested for a
StreamTask the StreamTask tries to commit. However, in the suspended
state there is no producer so ultimately an NPE is thrown for the
contained RecordCollector in flush.
The fix put forth in this commit is to have AssignedTasks call
closeSuspended when it knows the underlying StreamTask is suspended.
Note also that this test is quite involved. I could have just tested
that AssignedTasks calls closeSuspended when appropriate, but that is
testing, IMO, a detail of the implementation and doesn't actually verify
we reproduced the original problem as it was described. I feel much more
confident that we are reproducing the behavior - and we can test exactly
the conditions that lead to it - when testing across AssignedTasks and
StreamTask. I believe this is an additional support for the argument of
eventually consolidating the state split across classes.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
I've spent quite a bit of time on trying to discover the root cause, with no luck so far. I have been able to reproduce it locally by running the following 100 times:
```
./gradlew connect:runtime:clean connect:runtime:test --tests org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest
```
The `testReconfigConnector` test failed 28% of the time and the others failed 0%. This issue and KAFKA-8661 suggest that `testDeleteConnector` and `testStartTwoConnectors` are also flaky, though I've not seen those tests fail locally.
Because this flakiness is causing issues for the rest of the project, I'm going to temporarily ignore several of the flaky ITs while I continue to investigate:
* `RebalanceSourceConnectorsIntegrationTest.testReconfigConnector`
* `RebalanceSourceConnectorsIntegrationTest.testDeleteConnector`
* `RebalanceSourceConnectorsIntegrationTest.testStartTwoConnectors`
**This should be backported to the `2.3` branch, which is when these integration tests were first added.**
Author: Randall Hauch <rhauch@gmail.com>
Reviewers: Ismael Juma
Closes#7237 from rhauch/kafka-8391-temporary
Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.
Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
Move the error code resetting logic from the onPartitionsRevoked callback into the streamthread directly after we've decided to rejoin the group, since onPartitionsRevoked are not guaranteed to be triggered.
Ran system tests on the originally failed StreamsUpgradeTest 10 times and passed.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Jun Rao <junrao@gmail.com>
This is the implementation for [KIP-503](https://cwiki.apache.org/confluence/display/KAFKA/KIP-503%3A+Add+metric+for+number+of+topics+marked+for+deletion)
When deleting a large number of topics, the Controller can get quite bogged down. One problem with this is the lack of visibility into the progress of the Controller. We can look into the ZK path for topics marked for deletion, but in a production environment this is inconvenient. This PR adds a JMX metric `kafka.controller:type=KafkaController,name=TopicsToDeleteCount` to make it easier to see how many topics are being deleted.
Reviewers: Stanislav Kozlovski <stanislav@confluent.io>, Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
Author: asutosh936 <asutosh.pandya@hotmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Vahid Hashemian <vahid.hashemian@gmail.com>, Jason Gustafson <jason@confluent.io>
Closes#7141 from asutosh936/KAFKA-8698
In a KTable context, we should not pass null into a user-supplied serde.
Testing: I verified that the change to the test results in test failures without the patch.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>,
This patch fixes a bug in the handling of MESSAGE_TOO_LARGE errors. The large batch is split, the smaller batches are re-added to the accumulator, and the batch is deallocated, but it was not removed from the list of in-flight batches. When the batch was eventually expired from the in-flight batches, the producer would try to deallocate it a second time, causing an error. This patch changes the behavior to correctly remove the batch from the list of in-flight requests.
Reviewers: Luke Stephenson, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
`testProduceAfterLogDirFailureOnLeader` currently disables producer retries in
order to catch and validate the exception thrown by a failure, and then tries to
produce successfully once the leadership changes. This second produce can
intermittently fail, causing test flakiness. This patch splits these validations
into two tests in order to allow retries for the produce request after the
leadership change.
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Ismael Juma <ismael@juma.me.uk>
If retrieving metadata during `poll(Duration)` repeatedly takes longer than the
poll timeout, we don't make progress and `waitUntilTrue` eventually times out
causing the test to fail. This behaviour differs from the older `poll(long)` that
would block until metadata retrieval had completed. The flakiness was likely
introduced when we switched from the latter to the former.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Reverts the TreeMap -> ConcurrentSkipListMap change that caused a performance regression in 2.3, and fixes the ConcurrentModificationException by copying (just) the key set to iterate over
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Refactors the DescribeDelegationToken to use the generated RPC classes.
Author: Viktor Somogyi-Vass <viktorsomogyi@gmail.com>
Author: Viktor Somogyi <viktorsomogyi@gmail.com>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7154 from viktorsomogyi/refactor-describe-dt
This patch updates ConsumerGroupCommandTest.scala to use the maximum possible number of AdminClient retries. The test runs will still be bounded by the request timeout. This address flakiness in tests such as testResetOffsetsNotExistingGroup and testResetOffsetsExistingTopic, which was caused by group coordinators being intermittently unavailable.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Update the javadoc of SockerServer#startup(). SocketServer#startProcessors() does not exist any more and it has been replaced by SocketServer#startDataPlaneProcessors() and SocketServer#startControlPlaneProcessor().
Reviewers: Jason Gustafson <jason@confluent.io>
Credit to @lbradstreet for profiling the producer with a large number of partitions.
Cache `topicMetadata`, `brokers` and `controller` in the `MetadataResponse`
the first time it's needed avoid unnecessary recomputation. We were previously
computing`brokersMap` 4 times per partition in one code path that was invoked from
multiple places. This is a regression introduced via a42f16f980 and first released
in 2.3.0.
The `Cluster` constructor became significantly more allocation heavy due to
2c44e77e2f, first released in 2.2.0. Replaced `merge` calls with more verbose,
but more efficient code. Added a test to verify that the returned collections are
unmodifiable.
Add `topicAuthorizedOperations` and `clusterAuthorizedOperations` to
`MetadataResponse` and remove `data()` method.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Lucas Bradstreet <lucasbradstreet@gmail.com>, Colin P. McCabe <cmccabe@confluent.io>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Justine Olshan <jolshan@confluent.io>