When the LogManager resumes cleaning, it logs that compaction is resumed; however, the topic in question is not necessarily a compacted one.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Gwen Shapira, Chia-Ping Tsai
Closes #8466 from lbradstreet/bad-cleaning-message
This fixes a version pinning issue where a transitive dependency had a
major version upgrade that a dependency did not account for, breaking
the build.
Reviewers: Andrew Egelhofer <aegelhofer@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This is a follow-up to #8077. The bug exposed a testing gap in how we group partitions. This patch adds a test case which reproduces the reported problem.
Reviewers: David Arthur <mumrah@gmail.com>
The previous code did not use the collection produced by `takeWhile()`.
It only used the length of that collection to select the next element.
Reviewers: Ismael Juma <ismael@juma.me.uk>
We were hitting an `IllegalStateException: There is already a changelog registered for ...` in trunk-eos because we failed to call TaskManager#cleanup on unrevoked tasks that we end up closing in handleAssignment after failing to batch commit.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Change TimeoutException to BufferExhaustedException when no memory can be allocated for a record within max.block.ms
Refactored BufferExhaustedException to be a subclass of TimeoutException so existing code that catches TimeoutExceptions keeps working.
Added handling to count these Exceptions in the metric "buffer-exhausted-records".
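A minimal sketch of why making the new exception a subclass keeps existing callers working. The class names below are hypothetical stand-ins, not the real Kafka classes, and the real producer waits up to max.block.ms before failing, which this sketch omits.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the existing timeout exception.
class SketchTimeoutException extends RuntimeException {
    SketchTimeoutException(String message) { super(message); }
}

// Hypothetical stand-in for the more specific exception; it extends the
// timeout type so that existing catch blocks still match it.
class SketchBufferExhaustedException extends SketchTimeoutException {
    SketchBufferExhaustedException(String message) { super(message); }
}

final class BufferExhaustedSketch {
    // Stand-in for the "buffer-exhausted-records" metric.
    static final AtomicLong bufferExhaustedRecords = new AtomicLong();

    // Simplified allocation: fails immediately instead of blocking.
    static void allocate(long availableBytes, int requestedBytes) {
        if (requestedBytes > availableBytes) {
            bufferExhaustedRecords.incrementAndGet();
            throw new SketchBufferExhaustedException(
                "Failed to allocate " + requestedBytes + " bytes");
        }
    }

    // A caller written before the change: it only knows the timeout type.
    static String legacyCaller(long availableBytes, int requestedBytes) {
        try {
            allocate(availableBytes, requestedBytes);
            return "allocated";
        } catch (SketchTimeoutException e) {
            return "timeout";
        }
    }
}
```

The legacy caller still lands in its existing catch block, while newer code can catch the subclass to tell buffer exhaustion apart from other timeouts.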
Test Strategy
There were existing test cases to check this behavior, which I refactored.
I then added an extra case to check whether the expected Exception is actually thrown, which was not covered by current tests.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Remove the restriction in the protocol generation code that a structure
field needs to be part of an array.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
Currently a `LeaderAndIsr` request with a stale leader epoch for some partition may still result in the starting of the log dir fetcher for that partition (if the future log exists). I am not sure if this causes any correctness problem since we don't use any state from the request to start the fetcher, but it seems unnecessary to rely on this side effect.
Reviewers: Jun Rao <junrao@gmail.com>
In `validateOffsetsAsync` in the consumer, we group the requests by leader node for efficiency. The list of topic-partitions is grouped from `partitionsToValidate` (all partitions) to `node` => `fetchPositions` (partitions by node). However, when actually sending the request with `OffsetsForLeaderEpochClient`, we use `partitionsToValidate`, which is the list of all topic-partitions passed into `validateOffsetsAsync`. This results in extra partitions being included in the request sent to brokers that are potentially not the leader for those partitions.
This PR fixes the issue by using `fetchPositions`, which is the proper list of partitions that we should send in the request. Additionally, a small typo of API name in `OffsetsForLeaderEpochClient` is corrected (it originally referenced `LisfOffsets` as the API name).
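The grouping described above can be sketched as follows. This is an illustration only: partitions are plain strings, leaders are integer node ids, and `ValidateOffsetsSketch` and its methods are hypothetical names rather than the consumer's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ValidateOffsetsSketch {
    // Group partitions by the node that currently leads them.
    static Map<Integer, List<String>> groupByLeader(Map<String, Integer> leaderByPartition) {
        Map<Integer, List<String>> byNode = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : leaderByPartition.entrySet()) {
            byNode.computeIfAbsent(entry.getValue(), node -> new ArrayList<>())
                  .add(entry.getKey());
        }
        return byNode;
    }

    // Build one request per node containing only that node's partitions.
    // The bug described above corresponds to putting the full partition
    // list into every request instead of the per-node list.
    static Map<Integer, List<String>> buildRequests(Map<String, Integer> leaderByPartition) {
        Map<Integer, List<String>> requests = new LinkedHashMap<>();
        groupByLeader(leaderByPartition).forEach(
            (node, fetchPositions) -> requests.put(node, fetchPositions));
        return requests;
    }
}
```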
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>
A read from the end of the log interleaved with a concurrent write can result in reading data above the expected read limit. In particular, this would allow a read above the high watermark. The root of the problem is consecutive calls to `sizeInBytes` in `FileRecords.slice` which do not account for an increase in size due to a concurrent write. This patch fixes the problem by using a single call to `sizeInBytes` and caching the result.
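A simplified sketch of the double-read race, assuming the slice end is clamped against the current size. `SliceSketch` and both methods are hypothetical and only model the arithmetic; the size is supplied through an `IntSupplier` so that a concurrent append can be simulated.

```java
import java.util.function.IntSupplier;

final class SliceSketch {
    // Racy variant: the size is read once for the bounds check and again
    // for the clamp. If the file grows in between, the returned end can
    // exceed the end the caller asked for.
    static int sliceEndRacy(IntSupplier sizeInBytes, int start, int position, int size) {
        int end = start + position + size;
        if (end < 0 || end >= start + sizeInBytes.getAsInt()) {
            end = start + sizeInBytes.getAsInt();
        }
        return end;
    }

    // Fixed variant: read the size once and reuse the cached value, so the
    // check and the clamp agree with each other.
    static int sliceEndFixed(IntSupplier sizeInBytes, int start, int position, int size) {
        int currentSize = sizeInBytes.getAsInt();
        int end = start + position + size;
        if (end < 0 || end >= start + currentSize) {
            end = start + currentSize;
        }
        return end;
    }
}
```

With a size that grows from 100 to 150 between the two reads, a request for bytes 0 to 100 yields an end of 150 in the racy variant and 100 in the fixed one.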
Reviewers: Ismael Juma <ismael@juma.me.uk>
Fix the direct cause of the observed issue on the client side: when the heartbeat gets an error and resets the generation, we only need to set the state to UNJOINED if it was not already REBALANCING; otherwise, the join-group handler would throw the retriable UnjoinedGroupException and force the consumer to re-send the join group request unnecessarily.
Fix the root cause of the issue on the broker side: we should still trigger a rebalance when a static member joins in the CompletingRebalance phase; otherwise the member.id would change by the time the assignment is received from the leader, causing the new member.id's assignment to be empty.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
Invoke `waitForQuotaUpdate` after the quotas are removed. It also changes
the default request quota to `Long.MaxValue`.
Reviewers: Anna Povzner <anna@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Add type bounds to the ProcessorContext, which bounds the types that can be forwarded to child nodes.
Reviewers: Matthias J. Sax <matthias@confluent.io>
90bbeedf52 introduced a regression resulting in passing an action per resource
name to the `Authorizer` instead of passing one per unique resource name. Refactor
the signatures of both `filterAuthorized` and `authorize` to make them easier to test
and add a test for each.
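A rough sketch of the "one action per unique resource name" idea. The authorizer is modelled as a function from a batch of names to a list of allow/deny results; `FilterAuthorizedSketch` and this signature are assumptions for illustration, not the actual `Authorizer` interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class FilterAuthorizedSketch {
    // Deduplicate the resource names before calling the authorizer, so it
    // sees one entry per unique name rather than one per occurrence.
    static Set<String> filterAuthorized(
            List<String> resourceNames,
            Function<List<String>, List<Boolean>> authorizer) {
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(resourceNames));
        List<Boolean> results = authorizer.apply(unique);
        Set<String> authorized = new LinkedHashSet<>();
        for (int i = 0; i < unique.size(); i++) {
            if (results.get(i)) {
                authorized.add(unique.get(i));
            }
        }
        return authorized;
    }
}
```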
Reviewers: Ismael Juma <ismael@juma.me.uk>
Some tasks were closed inside handleAssignment but were not removed from the task manager's bookkeeping list. The next time around they would be closed again, which is an illegal state.
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The upper limit offset is displayed incorrectly in the log cleaner summary message. For example:
```
Log cleaner thread 0 cleaned log __consumer_offsets-47 (dirty section = [358800359, 358800359])
```
We should be using the next dirty offset as the upper limit.
Reviewers: David Arthur <mumrah@gmail.com>
On metadata change for assigned topics, we trigger rebalance, revoke partitions and send JoinGroup. If metadata reverts to the original value and JoinGroup fails, we don't resend JoinGroup because we don't set `rejoinNeeded`. This PR sets `rejoinNeeded=true` when rebalance is triggered due to metadata change to ensure that we retry on failure.
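A toy model of the flag-based retry described above. It assumes the rejoin decision is otherwise derived by comparing the current metadata with a snapshot taken at the last assignment; `RejoinSketch`, its fields and its methods are hypothetical and much simpler than the real coordinator.

```java
final class RejoinSketch {
    private String assignedSnapshot; // metadata seen at the last successful join
    private boolean rejoinNeeded;

    RejoinSketch(String initialMetadata) {
        this.assignedSnapshot = initialMetadata;
    }

    // Called on every poll. Remembering the trigger in rejoinNeeded means a
    // later revert of the metadata does not cancel a pending rejoin.
    boolean rejoinNeededOrPending(String currentMetadata) {
        if (!currentMetadata.equals(assignedSnapshot)) {
            rejoinNeeded = true;
        }
        return rejoinNeeded;
    }

    // Only a successful join clears the flag and updates the snapshot.
    void onJoinSuccess(String metadataAtJoin) {
        assignedSnapshot = metadataAtJoin;
        rejoinNeeded = false;
    }
}
```

Without the flag, a failed JoinGroup followed by the metadata reverting to the snapshot value would make the comparison report that no rejoin is needed.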
Reviewers: Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
Instance-level:
* number of alive stream threads
Thread-level:
* avg / max number of records polled from the consumer per runOnce, INFO
* avg / max number of records processed by the task manager (i.e. across all tasks) per runOnce, INFO
Task-level:
* number of current buffered records at the moment (i.e. it is just a dynamic gauge), DEBUG.
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <john@confluent.io>
As the title suggests, we would like to broaden this check so that we don't fail to close a task that is doomed to be cleaned up.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
As the title suggests, consumers first do an OffsetFetch before starting normal processing. It makes sense to add it to the concurrent test suite to verify whether it causes any blocking behavior.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
https://issues.apache.org/jira/browse/KAFKA-8889 attempted to fill in the missing stacktrace in the log message when handling errors in FetchSessionHandler#handleError,
but that fix is not effective without KAFKA-7016.
The current fix removes the redundant pair of braces {} at the end of the log message. If and when the Throwable passed as an argument to this method has a stacktrace, the log message will include it. Currently it does not, because the Throwable argument does not have a stacktrace.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
If the high-watermark is updated in the middle of a read with the `read_committed` isolation level, it is possible to return data above the LSO. In the worst case, this can lead to the read of an aborted transaction. The root cause is that the logic depends on reading the high-watermark twice. We fix the problem by reading it once and caching the value.
Reviewers: David Arthur <mumrah@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Document the supported endpoint at the top-level (root) REST API resource and the information that it returns when a request is made to a Connect worker.
Fixes an omission in the documentation after KAFKA-2369 and KAFKA-6311 (KIP-238).
Reviewers: Toby Drake <tobydrake7@gmail.com>, Soenke Liebau <soenke.liebau@opencore.com>
First set of cleanup pushed to followup PR after KIP-441 Pt. 5. Main changes are:
1. Moved `RankedClient` and the static `buildClientRankingsByTask` to a new file
2. Moved `Movement` and the static `getMovements` to a new file (also renamed to `TaskMovement`)
3. Consolidated the many common variables throughout the assignment tests to the new `AssignmentTestUtils`
4. New utility to generate comparable/predictable UUIDs for tests, and removed the generic from `TaskAssignor` and all related classes
Reviewers: John Roesler <vvcephei@apache.org>, Andrew Choi <a24choi@edu.uwaterloo.ca>
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access to the same partition state lock.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>
Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
The runtime type of Metric.metricValue() is not always a Double,
for example when it is a gauge from IntGaugeSuite.
Because non-double values cannot be formatted with three-decimal precision,
an IllegalFormatConversionException resulted.
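A small sketch of the failure and one possible guard, using only `java.util` formatting. `MetricValueFormatSketch` and its methods are hypothetical names; the actual patch may handle non-double values differently.

```java
import java.util.IllegalFormatConversionException;
import java.util.Locale;

final class MetricValueFormatSketch {
    // Format doubles with three decimals; fall back to the plain string
    // form for any other runtime type (for example an Integer gauge).
    static String format(Object metricValue) {
        if (metricValue instanceof Double) {
            return String.format(Locale.ROOT, "%.3f", (Double) metricValue);
        }
        return String.valueOf(metricValue);
    }

    // Demonstrates the original failure mode: applying %.3f to a value
    // that is not a floating-point type throws at format time.
    static boolean naiveFormatFails(Object metricValue) {
        try {
            String.format(Locale.ROOT, "%.3f", metricValue);
            return false;
        } catch (IllegalFormatConversionException e) {
            return true;
        }
    }
}
```

`Locale.ROOT` is used here only to keep the decimal separator stable across environments.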
Author: Tom Bentley <tbentley@redhat.com>
Author: Tom Bentley <tombentley@users.noreply.github.com>
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #8373 from tombentley/KAFKA-9775-IllegalFormatConversionException