KAFKA-7283 enabled lazy mmap on index files by initializing indices
on-demand rather than performing costly disk/memory operations when
creating all indices on broker startup. This helped reducing the startup
time of brokers. However, segment indices are still created on closing
segments, regardless of whether they need to be closed or not.
This is a cleaned up version of #7900, which was submitted by @efeg. It
eliminates unnecessary disk accesses and memory map operations while
deleting, renaming or closing offset and time indexes.
In a cluster with 31 brokers, where each broker has 13K to 20K segments,
@efeg and team observed up to 2 orders of magnitude faster LogManager
shutdown times - i.e. dropping the LogManager shutdown time of each
broker from 10s of seconds to 100s of milliseconds.
To avoid confusion between `renameTo` and `setFile`, I replaced the
latter with the more restricted updateParentDir` (it turns out that's
all we need).
Reviewers: Jun Rao <junrao@gmail.com>, Andrew Choi <a24choi@edu.uwaterloo.ca>
Co-authored-by: Adem Efe Gencer <agencer@linkedin.com>
Co-authored-by: Ismael Juma <ismael@juma.me.uk>
This PR works to improve high watermark checkpointing performance.
`ReplicaManager.checkpointHighWatermarks()` was found to be a major contributor to GC pressure, especially on Kafka clusters with high partition counts and low throughput.
Added a JMH benchmark for `checkpointHighWatermarks` which establishes a
performance baseline. The parameterized benchmark was run with 100, 1000 and
2000 topics.
Modified `ReplicaManager.checkpointHighWatermarks()` to avoid extra copies and cached
the Log parent directory Sting to avoid frequent allocations when calculating
`File.getParent()`.
A few clean-ups:
* Changed all usages of Log.dir.getParent to Log.parentDir and Log.dir.getParentFile to
Log.parentDirFile.
* Only expose public accessor for `Log.dir` (consistent with `Log.parentDir`)
* Removed unused parameters in `Partition.makeLeader`, `Partition.makeFollower` and `Partition.createLogIfNotExists`.
Benchmark results:
| Topic Count | Ops/ms | MB/sec allocated |
|-------------|---------|------------------|
| 100 | + 51% | - 91% |
| 1000 | + 143% | - 49% |
| 2000 | + 149% | - 50% |
Reviewers: Lucas Bradstreet <lucas@confluent.io>. Ismael Juma <ismael@juma.me.uk>
Co-authored-by: Gardner Vickers <gardner@vickers.me>
Co-authored-by: Ismael Juma <ismael@juma.me.uk>
Startup scripts for the early version of Kafka contain removed JVM options like `-XX:+PrintGCDateStamps` or `-XX:UseParNewGC`.
When system tests run on JVM that doesn't support these options we should set up
environment variables with correct options.
Reviewers: Guozhang Wang <guozhang@confluent.io>, Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <ismael@juma.me.uk
Older versions of the JoinGroup rely on a new member timeout to keep the group from growing indefinitely in the case of client disconnects and retrying. The logic for resetting the heartbeat expiration task following completion of the rebalance failed to account for an implicit expectation that shouldKeepAlive would return false the first time it is invoked when a heartbeat expiration is scheduled. This patch fixes the issue by making heartbeat satisfaction logic explicit.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
There are cases where `currentEstimation` is less than
`COMPRESSION_RATIO_IMPROVING_STEP` causing
`estimatedCompressionRatio` to be negative. This, in turn,
may result in `MESSAGE_TOO_LARGE`.
Reviewers: Ismael Juma <ismael@juma.me.uk>
SSLEngine#beginHandshake is possible to throw authentication failures (for example, no suitable cipher suites) so we ought to catch SSLException and then convert it to SslAuthenticationException so as to process authentication failures correctly.
Reviewers: Jun Rao <junrao@gmail.com>
* The Herder interface is extended with a default method that allows choosing whether to log all the connector configurations during connector validation or not.
* The `PUT /connector-plugins/{connector-type}/config/validate` is modified to stop logging the connector's configurations when a validation request is hitting this endpoint. Validations during creation or reconfiguration of a connector are still logging all the connector configurations at the INFO level, which is useful in general and during troubleshooting in particular.
Co-authored-by: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Ismael Juma <github@juma.me.uk>
This algorithm assigns tasks to clients and tries to
- balance the distribution of the partitions of the
same input topic over stream threads and clients,
i.e., data parallel workload balance
- balance the distribution of work over stream threads.
The algorithm does not take into account potentially existing states
on the client.
The assignment is considered balanced when the difference in
assigned tasks between the stream thread with the most tasks and
the stream thread with the least tasks does not exceed a given
balance factor.
The algorithm prioritizes balance over stream threads
higher than balance over clients.
Reviewers: John Roesler <vvcephei@apache.org>
When handling a WriteTxnResponse, the TransactionMarkerRequestCompletionHandler throws an IllegalStateException when the remote broker responds with a KAFKA_STORAGE_ERROR and does not retry the request. This leaves the transaction state stuck in PendingAbort or PendingCommit, with no way to change that state other than restarting the broker, because both EndTxnRequest and InitProducerIdRequest return CONCURRENT_TRANSACTIONS if the state is PendingAbort or PendingCommit. This patch changes the error handling behavior in TransactionMarkerRequestCompletionHandler to retry KAFKA_STORAGE_ERRORs. This matches the existing client behavior, and makes sense because KAFKA_STORAGE_ERROR causes the host broker to shut down, meaning that the partition being written to will move its leadership to a new, healthy broker.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
And hence restore / global consumers should never expect FencedInstanceIdException.
When such exception is thrown, it means there's another instance with the same instance.id taken over, and hence we should treat it as fatal and let this instance to close out instead of handling as task-migrated.
Reviewers: Boyang Chen <boyang@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Simple fix to return in all cases an immutable list from poll in SchemaSourceTask.
Co-authored-by: David Mollitor <dmollitor@apache.org>
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Ismael Juma <github@juma.me.uk>, Konstantine Karantasis <konstantine@confluent.io>
Relax the requirement that tasks' reported offsetSum is less than the endOffsetSum for those
tasks. This was surfaced by a test for corrupted tasks, but it can happen with real corrupted
tasks. Rather than throw an exception on the leader, we now de-prioritize the corrupted task.
Ideally, that instance will not get assigned the task and the stateDirCleaner will make
the problem "go away". If it does get assigned the task, then it will detect the corruption and
delete the task directory before recovering the entire changelog. Thus, the estimate we provide
accurately reflects the amount of lag such a corrupted task would have to recover (the whole log).
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <guozhang@confluent.io>, Bruno Cadonna <bruno@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
In Streams `StreamsMetadataState.getMetadataWithKey`, we should use the inferred max topic partitions passed in directly from the caller than relying on cluster to contain its topic-partition information.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Kafka Connect main doc required a fix to distinguish between worker level producer and consumer overrides and per-connector level producer and consumer overrides, after the latter were introduced with KIP-458.
* [KAFKA-9563] Fix Kafka connect consumer and producer override documentation
Co-authored-by: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
- KIP-447
- add new configs to enable producer per thread EOS
- updates TaskManager to use single shared producer for eos-beta
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <guozhang@confluent.io>
In Kafka Connect, a ConfigProvider instance can be used concurrently (e.g. via a PUT request to the `/connector-plugins/{connectorType}/config/validate` REST endpoint), but there is no mention of concurrent usage in the Javadocs of the ConfigProvider interface.
It's worth calling out that implementations need to be thread safe.
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
When incorrect connector configuration is detected, the returned exception message suggests to check the connector's configuration against the `{connectorType}/config/validate` endpoint.
Changing the error message to refer to the exact REST endpoint which is `/connector-plugins/{connectorType}/config/validate`
This aligns the exception message with the documentation at: https://kafka.apache.org/documentation/#connect_rest
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
This improvement fixes several linking errors to classes and methods from within javadocs.
Related to #8291
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
In step 7 of the QuickStart guide, "Writing data from the console and writing it back to the console" should be "Reading data from the console and writing it back to the console".
Co-authored-by: Alaa Zbair <alaa.zbair@grenoble-inp.org>
Reviewer: Konstantine Karantasis <konstantine@confluent.io>
Once we have encoded the offset sums per task for each client, we can compute the overall lag during assign by fetching the end offsets for all changelog and subtracting.
If the listOffsets request fails, we simply return a "completely sticky" assignment, ie all active tasks are given to previous owners regardless of balance.
Builds (but does not yet use) the statefulTasksToRankedCandidates map with the ranking:
Rank -1: active running task
Rank 0: standby or restoring task whose overall lag is within acceptableRecoveryLag
Rank 1: tasks whose lag is unknown (eg during version probing)
Rank 1+: all other tasks are ranked according to their actual total lag
Implements: KIP-441
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <vvcephei@apache.org>
We will currently retry if you run gradle test, but not unitTest or
integrationTest, which are used directly in Jenkins. This means that we
have not been achieving the expected retry behavior.
Reviewers: Ismael Juma <ismael@juma.me.uk>
If partitions are revoked, an application may want to commit the current offsets.
Using transactions, committing offsets would be done via the producer passing in the current ConsumerGroupMetadata. If the metadata is not updates before the callback, the call to commitTransaction(...) fails as and old generationId would be used.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This bug was incurred by #7994 with a too-strong consistency check. It is because a reset generation operation could be called in between the joinGroupRequest -> joinGroupResponse -> SyncGroupRequest -> SyncGroupResponse sequence of events, if user calls unsubscribe in the middle of consumer#poll().
Proper fix is to avoid the protocol name check when the generation is invalid.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Exit.exit needs to be used in code instead of System.exit.
Particularly in integration tests using System.exit is disrupting because it exits the jvm process and does not just fail the test correctly. Integration tests override procedures in Exit to protect against such cases.
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Randall Hauch <rhauch@gmail.com>
This PR fixes three things:
* the state should be closed when standby task is restoring as well
* the EOS standby task should also wipe out state under dirty close
* the changelog reader should check for null as well
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Adds a currently unused component of the new Streams assignment algorithm.
Implements: KIP-441
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <vvcephei@apache.org>
Consolidate ChangelogReader state management inside of StreamThread to avoid having to reason about all execution paths in both StreamThread and TaskManager.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This patch modifies the loading time metric to account for time spent waiting for the loading time task to be scheduled.
Reviewers: Jason Gustafson <jason@confluent.io>
Rewrite ReassignPartitionsCommand to use the KIP-455 API when possible, rather
than direct communication with ZooKeeper. Direct ZK access is still supported,
but deprecated, as described in KIP-455.
As specified in KIP-455, the tool has several new flags. --cancel stops
an assignment which is in progress. --preserve-throttle causes the
--verify and --cancel commands to leave the throttles alone.
--additional allows users to execute another partition assignment even
if there is already one in progress. Finally, --show displays all of
the current partition reassignments.
Reorganize the reassignment code and tests somewhat to rely more on unit
testing using the MockAdminClient and less on integration testing. Each
integration test where we bring up a cluster seems to take about 5 seconds, so
it's good when we can get similar coverage from unit tests. To enable this,
MockAdminClient now supports incrementalAlterConfigs, alterReplicaLogDirs,
describeReplicaLogDirs, and some other APIs. MockAdminClient is also now
thread-safe, to match the real AdminClient implementation.
In DeleteTopicTest, use the KIP-455 API rather than invoking the reassignment
command.
Currently when there is a leader change with a log dir reassignment in progress, we do not update the leader epoch in the partition state maintained by `ReplicaAlterLogDirsThread`. This can lead to a FENCED_LEADER_EPOCH error, which results in the partition being marked as failed, which is a permanent failure until the broker is restarted. This patch fixes the problem by updating the epoch in `ReplicaAlterLogDirsThread` after receiving a new LeaderAndIsr request from the controller.
Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
- part of KIP-447
- commit all tasks at once using non-eos (and eos-beta in follow up work)
- unified commit logic into TaskManager
- split existing methods of Task interface in pre/post parts
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Add 4 new assignor configs in preparation for the new assignment algorithm:
1. acceptable.recovery.lag
2. balance.factor
3. max.warmup.replicas
4. probing.rebalance.interval.ms
Implements: KIP-441
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <vvcephei@apache.org>
Since the assignment info includes a map with all member's host info, we can just check the received map to make sure our endpoint is contained. If not, we need to force the group to rebalance and get our updated endpoint info.
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The issue itself has been fixed a while ago on the producer side, so we can just remove this TODO marker now (we've removed the isZombie flag already anyways).
Reviewers: John Roesler <vvcephei@apache.org>