This is now possible since `InterBrokerSend` was moved from `core` to `server-common`.
Also rewrite/move `KafkaNetworkChannelTest`.
The Scala version of `KafkaNetworkChannelTest` passed with the changes here (before I
deleted it).
Reviewers: Justine Olshan <jolshan@confluent.io>, José Armando García Sancio <jsancio@users.noreply.github.com>
Some users complained that they have no way to determine whether a problem is in the RSM plug-in they implemented or in Kafka itself. Also, for users who just want to try out the tiered storage feature before implementing anything, it would be good to have a default RSM implementation.
Per the discussion in the KIP, there will be no default RSM implementation in Kafka, but we can use the `LocalTieredStorage` implementation written for integration tests to resolve the issues above.
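As a quick illustration of how this could be tried out (a sketch only: the property names are from KIP-405 and the class names are the ones used by the integration tests, so double-check them against the tiered storage docs):

```java
import java.util.Properties;

// Hedged sketch: enable tiered storage on a broker and plug in the
// test-oriented LocalTieredStorage as the RemoteStorageManager implementation.
Properties brokerProps = new Properties();
brokerProps.put("remote.log.storage.system.enable", "true");
brokerProps.put("remote.log.storage.manager.class.name",
        "org.apache.kafka.server.log.remote.storage.LocalTieredStorage");
brokerProps.put("remote.log.metadata.manager.class.name",
        "org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager");
```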
Reviewers: Christo Lolov <lolovc@amazon.com>, Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Satish Duggana <satishd@apache.org>
Since only the active controller is performing the dual-write to ZK during a migration, it should be the only controller
to report the ZkWriteBehindLag metric.
Currently, if the controller fails over during a migration, the previous active controller will incorrectly report its last
value for ZkWriteBehindLag forever. Instead, it should report zero.
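A minimal sketch of the intended behavior (all names here are illustrative, not the actual migration driver code):

```java
// Only the active controller reports a real lag value; after a failover the
// previous active controller reports zero instead of its stale last value.
class ZkMigrationMetricsSketch {
    private volatile boolean isActiveController;
    private volatile long lastCommittedOffset;
    private volatile long lastZkWriteOffset;

    long zkWriteBehindLag() {
        return isActiveController ? lastCommittedOffset - lastZkWriteOffset : 0L;
    }
}
```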
Reviewers: Colin P. McCabe <cmccabe@apache.org>, David Arthur <mumrah@gmail.com>
Do not return fenced brokers from metadataCache.getPartitionReplicaEndpoints, since that could lead to
them getting used as preferred read replicas.
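A hedged sketch of the filtering (the helpers `replicasOf` and `brokerRegistration` are hypothetical; `BrokerRegistration#fenced` is the real flag):

```java
// Skip fenced brokers so they are never handed out as preferred read replicas.
Map<Integer, Node> getPartitionReplicaEndpoints(TopicPartition tp, ListenerName listener) {
    Map<Integer, Node> result = new HashMap<>();
    for (int replicaId : replicasOf(tp)) {                          // hypothetical helper
        BrokerRegistration broker = brokerRegistration(replicaId);  // hypothetical helper
        if (broker != null && !broker.fenced()) {
            broker.node(listener.value()).ifPresent(node -> result.put(replicaId, node));
        }
    }
    return result;
}
```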
Reviewers: Colin P. McCabe <cmccabe@apache.org>
A few notes:
* Delete a few methods from `UnifiedLog` that were simply invoking the related method in `LogFileUtils`
* Fix `CoreUtils.swallow` to use the passed-in `logging`
* Fix `LogCleanerParameterizedIntegrationTest` to close `log` before reopening
* Minor tweaks in `LogSegment` for readability
For broader context on this change, please check:
* KAFKA-14470: Move log layer to storage module
Reviewers: Divij Vaidya <diviv@amazon.com>, Satish Duggana <satishd@apache.org>
We embrace immutability and thus should return a new object instead of
`this`, similar to other config classes we use in the DSL.
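A minimal sketch of the pattern (a hypothetical config class, modeled on the DSL's existing config objects):

```java
// "with" methods copy the object instead of mutating and returning `this`.
public final class NamedConfigSketch {
    private final String name;

    private NamedConfigSketch(final String name) {
        this.name = name;
    }

    public static NamedConfigSketch as(final String name) {
        return new NamedConfigSketch(name);
    }

    public NamedConfigSketch withName(final String name) {
        return new NamedConfigSketch(name);  // new object, not `this`
    }
}
```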
Also cleans up the JavaDocs for a bunch of classes.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This change adds the upgrade documentation for 3.6.0 and fixes the position of the notable changes in 3.5.0.
In previous releases, notable changes have always come after the upgrade instructions.
Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
The current KafkaConsumer offsetsForTimes fails with IllegalArgumentException if negative target timestamps are provided as arguments. This change adds the same validation and tests to the new consumer implementation (and improves some comments for updateFetchPositions).
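A sketch of the validation being mirrored (simplified from the legacy KafkaConsumer; `timestampsToSearch` is the method's parameter and the exact message may differ):

```java
// Reject negative target timestamps before sending any ListOffsets request.
for (Map.Entry<TopicPartition, Long> entry : timestampsToSearch.entrySet()) {
    if (entry.getValue() < 0) {
        throw new IllegalArgumentException("The target time for partition " +
                entry.getKey() + " is " + entry.getValue() +
                ". The target time cannot be negative.");
    }
}
```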
Reviewer: Lucas Brutschy <lbrutschy@confluent.io>
Spotbugs was temporarily disabled as part of KAFKA-15485 to support building Kafka with JDK 21. This PR upgrades the spotbugs version to 4.8.0, which adds support for JDK 21, and re-enables its usage in the build.
Reviewers: Divij Vaidya <diviv@amazon.com>
When tasks are found corrupted, Kafka Streams tries to commit
the non-corrupted tasks before closing and reviving the corrupted
active tasks. Besides active running tasks, Kafka Streams tries
to commit restoring active tasks and standby tasks. However,
restoring active tasks do not need to be committed since they
do not have offsets to commit and the current code does not
write a checkpoint. Furthermore, trying to commit restoring
active tasks with the state updater enabled results in the
following error:
java.lang.UnsupportedOperationException: This task is read-only
at org.apache.kafka.streams.processor.internals.ReadOnlyTask.commitNeeded(ReadOnlyTask.java:209)
...
since commitNeeded() is not a read-only method for active tasks.
In the future, we can consider writing a checkpoint for active
restoring tasks in this situation. Additionally, we should
fix commitNeeded() in active tasks to be read-only.
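An illustrative sketch of the fix described above (not the actual TaskManager code; `Task.State` is the Streams-internal task state and `commitTasks` is a hypothetical helper):

```java
// When handling corrupted tasks, only commit non-corrupted tasks that are
// RUNNING; active tasks still RESTORING have no offsets to commit. Filtering
// on state first also avoids calling commitNeeded() on read-only tasks.
Set<Task> tasksToCommit = nonCorruptedTasks.stream()
        .filter(task -> task.state() == Task.State.RUNNING)
        .filter(Task::commitNeeded)
        .collect(Collectors.toSet());
commitTasks(tasksToCommit);
```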
Reviewers: Matthias J. Sax <matthias@confluent.io>, Lucas Brutschy <lbrutschy@confluent.io>
This patch implements the groups and offsets expiration in the new group coordinator.
Reviewers: Ritika Reddy <rreddy@confluent.io>, David Jacot <djacot@confluent.io>
In KIP-778 we introduced the "unsafe" (lossy) downgrade for the case where the metadata has changed in one of the versions between target and current, as defined in MetadataVersion.
The documentation says it is possible:
"Note that the cluster metadata version cannot be downgraded to a pre-production 3.0.x, 3.1.x, or 3.2.x version once it has been upgraded. However, it is possible to downgrade to production versions such as 3.3-IV0, 3.3-IV1, etc."
The command line tool shows that this doesn't work:
bin/kafka-features.sh --bootstrap-server :9092 downgrade --metadata 3.4 --unsafe
Could not downgrade metadata.version to 8. Invalid metadata.version 8. Unsafe metadata downgrade is not supported in this version.
1 out of 1 operation(s) failed.
In addition to unsafe downgrades, safe metadata downgrades are also not supported in practice. For example, when you upgrade to 3.5, you land on 3.5-IV2 as the metadata version, which has metadata changes and won't let you downgrade. This is true for every other release at the moment.
This change fixes the documentation to reflect that, and improves the error messages.
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Reviewers: Luke Chen <showuon@gmail.com>, Jakub Scholz <github@scholzj.com>
Fixes logging for KafkaStreams#streamThreadLeaveConsumerGroup.
To avoid losing the stack trace of the exception, the pre-formatted message is passed as a string in the first argument and the Exception e is passed as the second argument.
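A before/after sketch (the message text and variable names are abbreviated placeholders, not the actual log line):

```java
// Before: the exception was folded into the formatted string, so only its
// toString() was logged and the stack trace was lost.
log.error(String.format("Failed to remove member %s: %s", memberId, e));

// After: pre-formatted message as the first argument, exception as the
// second, so the logger prints the full stack trace.
log.error(String.format("Failed to remove member %s", memberId), e);
```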
Reviewers: Anna Sophie Blee-Goldman <sophie@responsive.dev>
With https://issues.apache.org/jira/browse/KAFKA-10575, StateRestoreListener#onRestoreSuspended was added. But local tests show that it is never called because DelegatingStateRestoreListener was not updated to call the new method.
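A sketch of the missing delegation (simplified; the surrounding class wiring is omitted and the field name is hypothetical):

```java
// DelegatingStateRestoreListener must forward the new callback to the
// user-provided listener, as it already does for the other callbacks.
@Override
public void onRestoreSuspended(final TopicPartition topicPartition,
                               final String storeName,
                               final long totalRestored) {
    final StateRestoreListener userListener = userStateRestoreListener;  // hypothetical field
    if (userListener != null) {
        userListener.onRestoreSuspended(topicPartition, storeName, totalRestored);
    }
}
```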
Reviewers: Anna Sophie Blee-Goldman <sophie@responsive.dev>, Bruno Cadonna <cadonna@confluent.io>
We introduced a bunch of flaky tests in #14295; they pass when run locally but always fail in CI. Let's roll them back unless we find the cause before the end of today.
Reviewers: Luke Chen <showuon@gmail.com>, Justine Olshan <jolshan@confluent.io>
A race can happen in the following sequence.
1. Stale Fetch triggers the ISR expansion.
2. The first time, we check whether the replica is eligible. Caught up? Yes. Broker epoch match? Yes (the metadata cache update has not happened yet).
3. Metadata cache update happens.
4. We check the eligibility a second time:
a. Caught up? Yes.
b. A new fetch request comes in. It clears the replica's caught-up state and updates the broker epoch.
c. Broker epoch match? Yes, because the new fetch epoch now equals the new metadata cache epoch.
5. Send an AlterPartition request with the new broker epoch.
----------------
The solution is to make sure that steps 4.a, 4.c, and 5 all use the same replica state (see the sketch below).
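A hedged sketch of that idea (names are illustrative, not the actual Partition code):

```java
// Read the replica state once and reuse that single snapshot for the
// caught-up check, the broker-epoch check, and the AlterPartition request,
// so a concurrent fetch cannot slip in between the steps.
ReplicaState state = replica.stateSnapshot();
if (isCaughtUp(state) && state.brokerEpoch() == cachedBrokerEpoch(replicaId)) {
    sendAlterPartition(partition, replicaId, state.brokerEpoch());
}
```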
Reviewers: David Mao <47232755+splett2@users.noreply.github.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
AdminClient will throw IllegalStateException instead of TimeoutException if it receives new calls while closing down. This is more consistent with how the Consumer and Producer clients handle new calls after being closed.
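A usage sketch of the new behavior (whether the exception surfaces synchronously or through the returned future is an implementation detail; the point is the exception type):

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;

public class ClosedAdminSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        Admin admin = Admin.create(props);
        admin.close();
        admin.listTopics();  // now fails with IllegalStateException, not TimeoutException
    }
}
```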
Reviewers: Luke Chen <showuon@gmail.com>, Kirk True <kirk@kirktrue.pro>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, vamossagar12 <sagarmeansocean@gmail.com>
Test Cases Covered
1. Index files already exist on disk but not in the cache, i.e. RemoteIndexCache should not call RemoteStorageManager to fetch them; instead it should populate the cache from the local index files already present.
2. RSM returns a corrupted index file, i.e. RemoteIndexCache should throw CorruptedIndexException instead of completing successfully.
3. Index files with the deleted suffix are already present on disk, i.e. if the cleaner thread is slow, there is a chance that deleted index files remain on disk while the same index entry is invalidated in parallel. To understand more, refer to https://issues.apache.org/jira/browse/KAFKA-15169
Reviewers: Divij Vaidya <diviv@amazon.com>, Luke Chen <showuon@gmail.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
HeartbeatRequestManager is responsible for handling the ConsumerGroupHeartbeat request and response. The manager has the following responsibilities:
1. Sending the request to the GroupCoordinator when it is possible and necessary.
2. Handling the response and updating the `MembershipManagerImpl` based on the error/response it receives.
3. Handling request retries and fatal failures.
For a successful heartbeat response:
- Updates the MembershipManager
For failure handling (see the sketch below):
- Retriable errors: back off and retry in the next poll
- Fenced: transition to a fenced state, reset the epoch, and retry in the next poll
- Fatal: propagate the error to the user and fail the state machine
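An illustrative sketch of that handling (not the actual HeartbeatRequestManager code; the helper methods are hypothetical, while the `Errors` codes are the ones I'd expect from the ConsumerGroupHeartbeat API):

```java
void onHeartbeatResponse(final ConsumerGroupHeartbeatResponseData data) {
    final Errors error = Errors.forCode(data.errorCode());
    if (error == Errors.NONE) {
        membershipManager.updateState(data);      // success: update the MembershipManager
    } else if (error.exception() instanceof RetriableException) {
        backoffAndRetryOnNextPoll();              // retriable: back off and retry
    } else if (error == Errors.FENCED_MEMBER_EPOCH || error == Errors.UNKNOWN_MEMBER_ID) {
        membershipManager.transitionToFenced();   // fenced: reset epoch, rejoin on next poll
    } else {
        propagateFatalError(error.exception());   // fatal: fail the state machine
    }
}
```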
Reviewers: Kirk True <ktrue@confluent.io>, Lianet Magrans <lianetmr@gmail.com>, David Jacot <djacot@confluent.io>
This patch fixes a problem where we migrate the current producer ID batch to KRaft instead of the next producer ID batch. Since KRaft stores the next batch in the log, we end up serving up a duplicate batch to the first caller of AllocateProducerIds once the KRaft controller has taken over.
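A small sketch of the intended computation (the write helper is hypothetical; `ProducerIdsBlock` exposes the block boundaries):

```java
// KRaft stores the *next* producer ID block in the log, so the migration must
// hand over the ID that follows the current ZK block, not the block itself;
// otherwise the first AllocateProducerIds call after the KRaft takeover would
// be served a duplicate batch.
long nextProducerId = currentBlock.firstProducerId() + currentBlock.size();
writeProducerIdsRecord(nextProducerId);  // hypothetical KRaft write helper
```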
Reviewers: Colin P. McCabe <cmccabe@apache.org>
When a producer batch is being retried and a new leader is known for the partition (compared to the leader used in the last attempt), it is worthwhile to retry immediately against this new leader. A partition leader is considered newer if its epoch has advanced.
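A hedged sketch of the decision (all accessor and helper names are hypothetical):

```java
// A leader is considered newer only if its epoch advanced since the last
// attempt; in that case the batch skips the usual retry backoff.
int lastEpoch = batch.attemptedLeaderEpoch();    // epoch used for the previous send
int currentEpoch = metadata.leaderEpochFor(tp);  // epoch known now
if (currentEpoch > lastEpoch) {
    retryImmediately(batch);   // new leader: retry right away
} else {
    retryAfterBackoff(batch);  // same leader: normal backoff path
}
```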
Reviewers: Walker Carlson <wcarlson@apache.org>, Kirk True <kirk@kirktrue.pro>, Andrew Schofield <andrew_schofield@uk.ibm.com>
I've added details for VerificationFailureRate and VerificationTimeMs.
I considered adding documentation for the AddPartitionsToTxnVerification metrics, but I noticed that all the request metrics simply list Produce|FetchConsumer|FetchFollower. Since we don't already report the AddPartitionsToTxn request metrics in this file (or those of any of the other APIs), it doesn't make sense to add the verification variant.
Filed a follow-up JIRA in case we want to redo that whole section.
Reviewers: Divij Vaidya <diviv@amazon.com>