The "offset deletion" and "group rebalance" should not be recorded by the same sensor since they are totally different.
The code is introduced by #7276.
Reviewers: David Jacot <djacot@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The rebalance exception withholding is no longer necessary as we have better mechanism for catching and wrapping these exceptions. Throw them directly should be fine and simplify our current error handling.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
With a short timeout, a call in KafkaAdminClient may timeout and the client might disconnect. Currently this can be exposed to the user as either a TimeoutException or a DisconnectException. To be consistent, rather than exposing the underlying retriable error, we handle both cases with a TimeoutException.
Reviewers: Boyang Chen <boyang@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Author: Andras Katona <akatona@cloudera.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7953 from akatona84/kafkarunclass-auto-scala-version
The SaslClientAuthenticator incorrectly negotiates supported SaslHandshakeRequest version and uses the maximum version supported by the broker whether or not the client supports it.
This bug was exposed by a recent version bump in 0a2569e2b9.
This PR rolls back the recent SaslHandshake[Request,Response] bump, fixes the version negotiation, and adds a test to prevent anyone from accidentally bumping the version without a workaround such as a new ApiKey. The existing key will be difficult to support for clients < 2.5 due to the incorrect negotiation.
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>, Colin P. McCabe <cmccabe@apache.org>, Jason Gustafson <jason@confluent.io>
Upfront refactoring for KIP-447.
Introduces `StreamsProducer` that allows to share a producer over multiple tasks and track the TX status.
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Update the site documentation to include the endpoints introduced with KIP-558 and a short paragraph on how this feature is used in Connect.
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Toby Drake <tobydrake7@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>
Closes#8148 from kkonstantine/kip-558-docs
This is a minor fix of a regression introduced in the refactoring PR: in current trunk standbyTask#commitNeeded always return false, which would cause standby tasks to never be committed until closed. To go back to the old behavior we would return true when new data has been applied and offsets being updated.
Reviewers: Boyang Chen <boyang@confluent.io>, John Roesler <john@confluent.io>
If a completed fetch has an error code signifying a _corrupt message_, throw a `KafkaException` that notes the fetch offset and the topic-partition.
Reviewers: Jason Gustafson <jason@confluent.io>
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
Closes#8072 from omkreddy/release-script
This PR is the counterpart of apache/kafka-site#253.
cc/ omkreddy
Author: Lee Dongjin <dongjin@apache.org>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#8149 from dongjinleekr/feature/KAFKA-9586
*More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.*
*Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.*
Author: Ron Dagostino <rdagostino@confluent.io>
Reviewers: Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#8139 from rondagostino/KAFKA-9575
1. Removed task field from TaskMigrated; the only caller that encodes a task id from StreamTask actually do not throw so we only log it. To handle it on StreamThread we just always enforce rebalance (and we would call onPartitionsLost to remove all tasks as dirty).
2. Added TaskCorruptedException with a set of task-ids. The first scenario of this is the restoreConsumer.poll which throws InvalidOffset indicating that the logs are truncated / compacted. To handle it on StreamThread we first close the corresponding tasks as dirty (if EOS is enabled we would also wipe out the state stores), and then revive them into the CREATED state.
3. Also fixed a bug while investigating KAFKA-9572: when suspending / closing a restoring task we should not commit the new offsets but only updating the checkpoint file.
4. Re-enabled the unit test.
This fixes two issues which together caused the soak to crash/some test to fail occasionally.
What happened was: In the main StreamThread loop we initialized a new task in TaskManager#checkForCompletedRestoration which includes registering, but not initializing, its changelogs. We then complete the loop and call poll, which resulted in a rebalance that revoked the newly-initialized task. In TaskManager#handleAssignment we then closed the task cleanly and go to remove the changelogs from the StoreChangelogReader only to get an IllegalStateException because the changelog partitions were not in the restore consumer's assignment (due to being uninitialized).
This by itself should^ be a recoverable error, as we catch exceptions here and retry closing the task as unclean. Of course the task actually was successfully closed (clean) so we now get an unexpected exception Illegal state CLOSED while closing active task
The fix(es) I'd propose are:
1. Keep the restore consumer's assignment in sync with the registered changelogs, ie the set ChangelogReader#changelogs but pause them until they are initialized edit: since the consumer does still perform some actions (gg fetches) on paused partitions, we should avoid adding uninitialized changelogs to the restore consumer's assignment. Instead, we should just skip them when removing.
2. Move the StoreChangelogReader#remove call to before the task.closeClean so that the task is only marked as closed if everything was successful. We should do so regardless, as we should (attempt to) remove the changelogs even if the clean close failed and we must do unclean.
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The current EOS example mixes fatal and non-fatal error handling. This patch fixes this problem and simplifies the example.
Reviewers: Jason Gustafson <jason@confluent.io>
Fixes a bug where KStream#transformValues would forward null values from the provided ValueTransform#transform operation.
A test was added for verification KStreamTransformValuesTest#shouldEmitNoRecordIfTransformReturnsNull. A parallel test for non-key ValueTransformer was not added, as they share the same code path.
Reviewers: Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
There is a race condition with the backoff sleep in the test case and setting the next allowed send time in the AdminClient. To fix it, we allow the test case to do the backoff sleep multiple times if needed.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
This PR is to fix the retry logic for `getListOffsetsCalls`. Previously, if there were partitions with errors, it would only pass in the current call object to retry after a metadata refresh. However, if there's a leader change, the call object never gets updated with the correct leader node to query. This PR fixes this by making another call to `getListOffsetsCalls` with only the error topic partitions as the next calls to be made after the metadata refresh. In addition there is an additional test to test the scenario where a leader change occurs.
Reviewers: Jason Gustafson <jason@confluent.io>
Previously, checkpointed offsets for a log were only updated if the log was chosen for cleaning once the cleaning job completes. This caused issues in cases where logs with invalid checkpointed offsets would repeatedly emit warnings if the log with an invalid cleaning checkpoint wasn't chosen for cleaning.
Proposed fix is to update the checkpointed offset for logs with invalid checkpoints regardless of whether it gets chosen for cleaning.
Reviewers: Anna Povzner <anna@confluent.io>, Jun Rao <junrao@gmail.com>
Newer versions of Java have added checks to ensure that trust anchors are CA certificates and contain proper extensions. This PR adds Basic Constraints extension with the CA field set to true for system tests.
Reviewers: ajini Sivaram <rajinisivaram@googlemail.com>
This PR fixes two bugs related to stream refactoring:
1. The subscribed topics are not updated correctly when topic gets removed from broker.
2. The remainingPartitions computation doesn't account the case when one task has a pattern subscription of multiple topics. Then the input partition change will not be assumed as containsAll
The bugs are exposed from integration test testRegexMatchesTopicsAWhenDeleted and could be used to verify the fix works.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Today if we attempt to list offsets with a fenced leader epoch, consumer will retry without updating the metadata until the timeout is reached. This affects synchronous APIs such as `offsetsForTimes`, `beginningOffsets`, and `endOffsets`. The fix in this patch is to trigger the metadata update call whenever we see a retriable error before additional attempts.
Reviewers: Jason Gustafson <jason@confluent.io>
This change is the client-side part of KIP-360. It identifies cases where it is safe to abort a transaction, bump the producer epoch, and allow the application to continue without closing the producer. In these cases, when KafkaProducer.abortTransaction() is called, the producer sends an InitProducerId following the transaction abort, which causes the producer epoch to be bumped. The application can then start a new transaction and continue processing.
For recoverable errors in the idempotent producer, the epoch is bumped locally. In-flight requests for partitions with an error are rewritten to reflect the new epoch, and in-flights of all other partitions are allowed to complete using the old epoch.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
1. Delay the initialization (producer.initTxn) from construction to maybeInitialize; if it times out we just swallow and retry in the next iteration.
2. If completeRestoration (consumer.committed) times out, just swallow and retry in the next iteration.
3. For other calls (producer.partitionsFor, producer.commitTxn, consumer.commit), treat the timeout exception as fatal.
Reviewers: Matthias J. Sax <matthias@confluent.io>
Correct the Connect worker logic to properly disable the new topic status (KIP-558) feature when `topic.tracking.enable=false`, and fix automatic topic status reset after a connector is deleted.
Also adds new `ConnectorTopicsIntegrationTest` and expanded unit tests.
Reviewers: Randall Hauch <rhauch@gmail.com>
Correct the process-rate (and total) sensor to measure throughput (and total record processing count).
Reviewers: Guozhang Wang <guozhang@confluent.io>
- Added additional synchronization and increased timeouts to handle flakiness
- Added some pre-cautionary retries when trying to obtain lag map
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Test cases in `ConsumerPerformanceTest` were failing and causing a system exit rather than throwing the expected exception following #8023. We didn't catch this because the tests were marked as skipped and not failed.
Reviewers: Guozhang Wang <guozhang@confluent.io>
This change updates ConsoleProducer, ConsumerPerformance, VerifiableProducer, and VerifiableConsumer classes to add and prefer the --bootstrap-server flag for defining the connection point of the Kafka cluster. This change is part of KIP-499: https://cwiki.apache.org/confluence/display/KAFKA/KIP-499+-+Unify+connection+name+flag+for+command+line+tool.
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
The KafkaProducer code would set infinite retries (MAX_INT) if the producer was configured with idempotence and no retries were configured by the user. This is superfluous because KIP-91 changed the retry functionality to both be time-based and the default retries config to be MAX_INT.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch adds logic to the test case to ensure that consumer groups are in a valid state prior to attempting offset reset.
Reviewers: Jason Gustafson <jason@confluent.io>
This change mainly have 2 components:
1. extend the existing transactions_test.py to also try out new sendTxnOffsets(groupMetadata) API to make sure we are not introducing any regression or compatibility issue
a. We shrink the time window to 10 seconds for the txn timeout scheduler on broker so that we could trigger expiration earlier than later
2. create a completely new system test class called group_mode_transactions_test which is more complicated than the existing system test, as we are taking rebalance into consideration and using multiple partitions instead of one. For further breakdown:
a. The message count was done on partition level, instead of global as we need to visualize
the per partition order throughout the test. For this sake, we extend ConsoleConsumer to print out the data partition as well to help message copier interpret the per partition data.
b. The progress count includes the time for completing the pending txn offset expiration
c. More visibility and feature improvements on TransactionMessageCopier to better work under either standalone or group mode.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This PR speeds up the deletion process by doing the following:
- Batch whenever possible to minimize the number of requests sent out to other brokers;
- Refactor `onPartitionDeletion` to remove the usage of `allLiveReplicas`.
Reviewers: Jason Gustafson <jason@confluent.io>