Follow on to #6731, this PR adds broker-side support for [KIP-392](https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica) (fetch from followers).
Changes:
* All brokers will handle FetchRequest regardless of leadership
* Leaders can compute a preferred replica to return to the client
* New ReplicaSelector interface for determining the preferred replica
* Incremental fetches will include partitions with no records if the preferred replica has been computed
* Adds new JMX to expose the current preferred read replica of a partition in the consumer
Two new conditions were added for completing a delayed fetch. They both relate to communicating the high watermark to followers without waiting for a timeout:
* For regular fetches, if the high watermark changes within a single fetch request
* For incremental fetch sessions, if the follower's high watermark is lower than the leader
A new JMX attribute `preferred-read-replica` was added to the `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=some-consumer,topic=my-topic,partition=0` object. This was added to support the new system test which verifies that the fetch from follower behavior works end-to-end. This attribute could also be useful in the future when debugging problems with the consumer.
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
The patch clarifies the exception message for unknown key versions when loading from the group metadata topic. The patch also makes a trivial change in `KafkaAdminClient` to use `Map.computeIfAbsent`.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Jason Gustafson <jason@confluent.io>
Include group.instance.id in the describe group result for better visibility.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The purpose here is to leverage static membership information during round robin consumer assignment, because persistent member id could help make the assignment remain the same during rebalance.
The comparison logic is changed to:
1. If member A and member B both have group.instance.id, then compare their group.instance.id
2. If member A has group.instance.id, while member B doesn't, then A < B
3. If both member A and B don't have group.instance.id, compare their member.id
In round robin assignor, we use ephemeral member.id to sort the members in order for assignment. This semantic is not stable and could trigger unnecessary shuffle of tasks. By leveraging group.instance.id the static member assignment shall be persist when satisfying following conditions:
1. number of members remain the same across generation
2. static members' identities persist across generation
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
1. In ConsumerCoordinator, select the protocol as the common protocol from all configured assignor instances' supported protocols with the highest number.
1.b. In onJoinPrepare: only call onPartitionRevoked with EAGER.
1.a. In onJoinComplete: call onPartitionAssigned with EAGER; call onPartitionRevoked following onPartitionAssigned with COOPERATIVE, and then request re-join if the error indicates so.
1.c. In performAssignment: update the user's assignor returned assignments by excluding all partitions that are still owned by some other members.
2. I've refactored the Subscription / Assignment such that: assigned partitions, error codes, and group instance id are not-final anymore, instead they can be updated. For the last one, it is directly related to the logic of this PR but I felt it is more convienent to go with other fields.
3. Testing: primarily in ConsumerCoordinatorTest, make it parameterized with protocol, and add necessary scenarios for COOPERATIVE protocol.
I intentionally omitted the documentation change since there are some behavioral updates that needs to be finalized in later PRs, and hence I will also only add the docs in later PRs.
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Boyang Chen <boyang@confluent.io>, Sophie Blee-Goldman <sophie@confluent.io>
The existing implementation triggers warnings in Java 9+ and relies
on internal classes that vary depending on the JDK provider. The proposed
implementation fixes these issues and it's more concise.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Guozhang Wang <wangguoz@gmail.com>
* KAFKA-8106:Reducing the allocation and copying of ByteBuffer when logValidator do validation.
* KAFKA-8106:Reducing the allocation and copying of ByteBuffer when logValidator do validation.
* github comments
* use batch.skipKeyValueIterator
* cleanups
* no need to skip kv for uncompressed iterator
* checkstyle fixes
* fix findbugs
* adding unit tests
* reuse decompression buffer; and using streaming iterator
* checkstyle
* add unit tests
* remove reusing buffer supplier
* fix unit tests
* add unit tests
* use streaming iterator
* minor refactoring
* rename
* github comments
* github comments
* reuse buffer at DefaultRecord caller
* some further optimization
* major refactoring
* further refactoring
* update comment
* github comments
* minor fix
* add jmh benchmarks
* update jmh
* github comments
* minor fix
* github comments
When the log contains out of order message formats (for example v2 message followed by v1 message) and consists of compressed batches typically greater than 1kB in size, it is possible for down-conversion to fail. With compressed batches, we estimate the size of down-converted batches using:
```
private static int estimateCompressedSizeInBytes(int size, CompressionType compressionType) {
return compressionType == CompressionType.NONE ? size : Math.min(Math.max(size / 2, 1024), 1 << 16);
}
```
This almost always underestimates size of down-converted records if the batch is between 1kB-64kB in size. In general, this means we may under estimate the total size required for compressed batches.
Because of an implicit assumption in the code that messages with a lower message format appear before any with a higher message format, we do not grow the buffer we copy the down converted records into when we see a message <= the target message format. This assumption becomes incorrect when the log contains out of order message formats, for example because of leaders flapping while upgrading the message format.
Reviewers: Jason Gustafson <jason@confluent.io>
Static members never leave the group, so potentially we could log a flooding number of warning messages in the hb thread. The solution is to only log as warning when we are on dynamic membership.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This was caused by back-to-back merging of #6854 (which removed the Optional import) and #6936 (which needed the import).
Reviewers: Jason Gustafson <jason@confluent.io>
An attempt to refactor current coordinator logic.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Konstantine Karantasis <konstantine@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
1. Add new fields of subscription / assignment and bump up consumer protocol to v2.
2. Update tests to make sure old versioned protocol can be successfully deserialized, and new versioned protocol can be deserialized by old byte code.
Reviewers: Boyang Chen <boyang@confluent.io>, Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
The OffsetFetch requires Topic Describe permission. If a client does not have this, we return TOPIC_AUTHORIZATION_FAILED at the partition level. Currently the consumer does not handle this error explicitly, but raises it as a generic `KafkaException`. For consistency with other APIs and to fix transient test failures in `PlaintextEndToEndAuthorizationTest`, we should raise `TopicAuthorizationFailedException` instead.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The idempotent producer attempts to detect spurious UNKNOWN_PRODUCER_ID errors and handle them by reassigning sequence numbers to the inflight batches. The inflight batches are tracked in a PriorityQueue. The problem is that the reassignment of sequence numbers depends on the iteration order of PriorityQueue, which does not guarantee any ordering. So this can result in sequence numbers being assigned in the wrong order. This patch fixes the problem by using a sorted set instead of a priority queue so that the iteration order preserves the sequence order. Note that resetting sequence numbers is an exceptional case.
This patch also fixes KAFKA-8484, which can cause an IllegalStateException when the producerId is reset while there are pending produce requests inflight. The solution is to ensure that sequence numbers are only reset if the producerId of a failed batch corresponds to the current producerId.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
As title suggested, we boost 3 stream instances stream job with one minute session timeout, and once the group is stable, doing couple of rolling bounces for the entire cluster. Every rejoin based on restart should have no generation bump on the client side.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>
This commit makes three changes:
- Adds a constructor for NewTopic(String, Optional<Integer>, Optional<Short>)
which allows users to specify Optional.empty() for numPartitions or
replicationFactor in order to use the broker default.
- Changes AdminManager to accept -1 as valid options for replication
factor and numPartitions (resolving to broker defaults).
- Makes --partitions and --replication-factor optional arguments when creating
topics using kafka-topics.sh.
- Adds a dependency on scalaJava8Compat library to make it simpler to
convert Scala Option to Java Optional
Reviewers: Ismael Juma <ismael@juma.me.uk>, Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>
The goals for this small diff are:
1. Give user guidance if they want to relax commit timeout threshold
2. Indicate the code path where timeout exception was caught
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Temporarily restore the SslFactory.sslContext() function, which some connectors use. This function is not a public API and it will be removed eventually. For now, we will mark it as deprecated.
According to KIP-297 a parameter is passed to ConfigProvider with syntax "config.providers.{name}.param.{param-name}". Currently AbstractConfig allows parameters of the format "config.providers.{name}.{param-name}". With this fix AbstractConfig will be consistent with KIP-297 syntax.
Reviewers: Robert Yokota <rayokota@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
Since the originals map passed to AbstractConfig constructor may be immutable, avoid updating this map while resolving indirect config variables. Instead a new ResolvingMap instance is now used to store resolved configs.
Reviewers: Randall Hauch <rhauch@gmail.com>, Boyang Chen <bchen11@outlook.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
It is possible for the offset of a partition to be changed while we are in the middle of validation. If the OffsetForLeaderEpoch request is in-flight and the offset changes, we need to redo the validation after it returns. We had a check for this situation previously, but it was only checking if the current leader epoch had changed. This patch fixes this and moves the validation in `SubscriptionState` where it can be protected with a lock.
Additionally, this patch adds test cases for the SubscriptionState validation API. We fix a small bug handling broker downgrades. Basically we should skip validation if the latest metadata does not include leader epoch information.
Reviewers: David Arthur <mumrah@gmail.com>
When poll is called which resets the offsets to the beginning, followed by a seekToEnd and a position, it could happen that the "reset to earliest" call in poll overrides the "reset to latest" initiated by seekToEnd in a very delicate way:
1. both request has been issued and returned to the client side (listOffsetResponse has happened)
2. in Fetcher.resetOffsetIfNeeded(TopicPartition, Long, OffsetData) the thread scheduler could prefer the heartbeat thread with the "reset to earliest" call, overriding the offset to the earliest and setting the SubscriptionState with that position.
3. The thread scheduler continues execution of the thread (application thread) with the "reset to latest" call and discards it as the "reset to earliest" already set the position - the wrong one.
4. The blocking position call returns with the earliest offset instead of the latest, despite it wasn't expected.
The fix makes SubscriptionState synchronized so that we can verify that the reset is expected while holding the lock.
Reviewers: Jason Gustafson <jason@confluent.io>
As title suggests, this unit test is just a double check. No need to push in 2.3
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>
As we are planning to add on more supporting features for rebalancing under static membership, we need to make sure the behavior for `group.instance.id` is consistent throughout the whole stack. This patch ensures that the default value is null in the JoinGroup response.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
Authorized operations must be null when talking to a pre-KIP-430 broker.
If we present this as the empty set instead, it is impossible for clients
to know if they have no permissions, or are talking to an old broker.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
The consumer should await api version information before determining whether the broker supports offset validation. In KAFKA-8422, we skip the validation if we don't have api version information, which means we always skip validation the first time we connect to a node. This bug was detected by the failing system test `tests/client/truncation_test.py`. The test passes again with this fix.
Reviewers: Ismael Juma <ismael@juma.me.uk>
In the olden days, OffsetForLeaderEpoch was exclusively an inter-broker protocol and
required Cluster level permission. With KIP-320, clients can use this API as well and
so we lowered the required permission to Topic Describe. The only way the client can
be sure that the new permissions are in use is to require version 3 of the protocol
which was bumped for 2.3. If the broker does not support this version, we skip the
validation and revert to the old behavior.
Additionally, this patch fixes a problem with the newly added replicaId field when
parsed from older versions which did not have it. If the field was not present, then
we used the consumer's sentinel value, but this would limit the range of visible
offsets by the high watermark. To get around this problem, this patch adds a
separate "debug" sentinel similar to APIs like Fetch and ListOffsets.
Reviewers: Ismael Juma <ismael@juma.me.uk>
An API call for consumer groups must send a FindCoordinatorRequest to find the consumer group coordinator, and then send a follow-up request to that node. But the coordinator might move after the FindCoordinatorRequest but before the follow-up request is sent. In that case we currently fail.
This change fixes that by detecting this error and then retrying. This fixes listConsumerGroupOffsets, deleteConsumerGroups, and describeConsumerGroups.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Boyang Chen <bchen11@outlook.com>
When consumer coordinator realize the subscription may have changed, today we check again against the joinedSubscription within handleAssignmentMismatch. This checking however is a bit fishy and over-kill as well. It's better just simplifying it to always request re-join.
The joinedSubscription object itself however still need to be maintained for potential augment to avoid extra re-joining the group.
Since testOutdatedCoordinatorAssignment already cover the normal case we also remove the other invalidAssignment test case.
Reviewers: Jason Gustafson <jason@confluent.io>
As title states. We plan to merge this to both trunk and 2.3 if it could fix the stream system tests globally.
Reference implementation: #6673
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>
KIP-345 and KIP-392 introduced a couple breaking changes for old versions of bumped protocols. This patch fixes them.
Reviewers: Colin Patrick McCabe <cmccabe@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Boyang Chen <bchen11@outlook.com>, Guozhang Wang <wangguoz@gmail.com>