KAFKA-6432: Make index lookup more cache friendly
For each topic-partition, Kafka broker maintains two indices: one for message offset, one for message timestamp. By default, a new index entry is appended to each index for every 4KB messages. The lookup of the indices is a simple binary search. The indices are mmaped files, and cached by Linux page cache.
Both consumer fetch and follower fetch have to do an offset lookup, before accessing the actual message data. The simple binary search algorithm used for looking up the index is not cache friendly, and may cause page faults even on high QPS topic-partitions.
In a normal Kafka broker, all the follower fetch requests, and most consumer fetch requests should only look up the last few entries of the index. We can make the index lookup more cache friendly, by searching in the last one or two pages of the index first.
Reviewers: Colin Patrick McCabe <colin@cmccabe.xyz>, Guozhang Wang <wangguoz@gmail.com>, Ted Yu <yuzhihong@gmail.com>, Ismael Juma <github@juma.me.uk>, Sriharsha Chintalapani <sriharsha@apache.org>
This has always been an issue, but the recent upgrade to ZooKeeper
3.4.13 means it is also an issue when an unresolvable ZK
address is used, causing some tests to leak threads.
The change in behaviour in ZK 3.4.13 is that no exception is thrown
from the ZooKeeper constructor in case of an unresolvable address.
Instead, ZooKeeper tries to re-resolve the address hoping it becomes
resolvable again. We eventually throw a
`ZooKeeperClientTimeoutException`, which is similar to the case
where the the address is resolvable but ZooKeeper is not
reachable.
Reviewers: Ismael Juma <ismael@juma.me.uk>
When there are many inactive partitions in the cluster, we observed constant churn of URP in the cluster even if follower can catch up with leader's byte-in-rate because leader broker frequently moves replicas of inactive partitions out of ISR. This PR mitigates this issue by not moving replica out of ISR if follower's LEO == leader's LEO.
Author: Zhanxiang (Patrick) Huang <hzxa21@hotmail.com>
Reviewers: Dong Lin <lindong28@gmail.com>
Closes#5412 from hzxa21/KAFKA-7152
After successful completion of KafkaProducer#close, it is possible that an application calls KafkaProducer#send. If the send is invoked for a topic for which we do not have any metadata, the producer will block until `max.block.ms` elapses - we do not expect to receive any metadata update in this case because Sender (and NetworkClient) has already exited. It is only when RecordAccumulator#append is invoked that we notice that the producer has already been closed and throw an exception. If `max.block.ms` is set to Long.MaxValue (or a sufficiently high value in general), the producer could block awaiting metadata indefinitely.
This patch makes sure `Metadata#awaitUpdate` periodically checks if the network client has been closed, and if so bails out as soon as possible.
And change KafkaController to use the newly introduced method.
Also remove redundant `InZk` postfixes from `registerBrokerInZk` and
`updateBrokerInfoInZk`.
As `checkedEphemeralCreate` is not used outside of `KafkaZkClient`
any longer, reduce its visibility.
ControllerIntegrationTest already covers this functionality well, it validates the
refactor.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Currently, if a consumer group never commits offsets, ConsumerGroupCommand will not include it in the describe output even if the member assignment is valid. Instead, the tool should be able to describe the group information showing empty current_offset and LAG.
Reviewers: Sriharsha Chintalapani <sriharsha@apache.org>, Vahid Hashemian <vahidhashemian@us.ibm.com>, Jason Gustafson <jason@confluent.io>
- Replace adminZkClient.createOrUpdateTopicPartitionAssignmentPathInZK calls with TestUtils.createTopic wherever applicable
- Replace adminZkClient.createTopic calls with TestUtils.createTopic wherever applicable
- Move non-deprecated tests to other test classes and deprecate AdminTest.scala
- Remove duplicate tests between AdminTest and AdminZkClientTest
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Dong Lin <lindong28@gmail.com>
Closes#5303 from omkreddy/topiccreate
This includes a fix for ZOOKEEPER-2184 (Zookeeper Client
should re-resolve hosts when connection attempts fail), which
fixes KAFKA-4041.
Updated a couple of tests as unresolvable addresses are now
retried until the connection timeout. Cleaned up tests a little.
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
This setting allows specifying a chroot so we documented it.
Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
Co-authored-by: Katherine Farmer <kfarme3@uk.ibm.com>
Reviewers: Manikumar Reddy O <manikumar.reddy@gmail.com>, Jason Gustafson <jason@confluent.io>
If inter.broker.protocol.version is 2.0-IV1 or newer. Also fixed ListOffsetRequest
so that v2 is used, if applicable.
Added a unit test which verifies that we use the latest version of the various
requests by default. Included a few minor tweaks to make testing easier.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
ZooKeeper listener for change notifications should be created before loading the ACL cache to avoid timing window if acls are modified when broker is starting up.
Reviewers: Jun Rao <junrao@gmail.com>, Ismael Juma <ismael@confluent.io>
Use KafkaPrincipal objects for authorization in `SimpleAclAuthorizer` so that comparison with super.users and ACLs instantiated from Strings work. Previously, it would compare two different classes `KafkaPrincipal` and the custom class, which would always return false because of the implementation of `KafkaPrincipal#equals`.
Do not update LogReadResult after it is initially populated when returning fetches immediately (i.e. without hitting the purgatory). This was done in #3954 as an optimization so that the followers get the potentially updated high watermark. However, since many things can happen (like deleting old segments and advancing log start offset) between initial creation of LogReadResult and the update, we can hit issues like log start offset in fetch response being higher than the last offset in fetched records.
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This patch removes the need to build up producer state when the log is using V0 / V1 message format where we did not have idempotent and transactional producers yet.
Also fixes a small issue where we incorrectly reported the offset index corrupt if the last offset in the index is equal to the base offset of the segment.
They rely on finalizers (before Java 11), which create
unnecessary GC load. The alternatives are as easy to
use and don't have this issue.
Also use FileChannel directly instead of retrieving
it from RandomAccessFile whenever possible
since the indirection is unnecessary.
Finally, add a few try/finally blocks.
Reviewers: Colin Patrick McCabe <colin@cmccabe.xyz>, Rajini Sivaram <rajinisivaram@googlemail.com>
NoSuchElementException will be thrown if ReplicaAlterDirThread replaces the current replica with future replica right before the request handler thread executes `futureReplica.log.get.dir.getParent` in the ReplicaManager.alterReplicaLogDirs(). The solution is to grab the partition lock when request handler thread attempts to check the destination log directory of the future replica.
Author: Dong Lin <lindong28@gmail.com>
Reviewers: Jun Rao <junrao@gmail.com>
Closes#5081 from lindong28/KAFKA-6949
Currently the check whether a user as a super user in `SimpleAclAuthorizer` is performed only after all other ACLs have been evaluated. Since all requests from a super user are granted we don't really need to apply the ACLs. This patch returns true if the user is a super user before checking ACLs, thus bypassing the needless evaluation effort.
Reviewers: Andy Coates <big-andy-coates@users.noreply.github.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Jason Gustafson <jason@confluent.io>
Wait for produce to fail before updating listener to avoid send succeeding after the listener update. Also use different topics in tests with connection failures where one is expected to fail and the other is expected to succeed.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Leftover threads doing network I/O can interfere with subsequent tests. Add missing shutdown in tests and include admin client in the check for leftover threads.
Reviewers: Anna Povzner <anna@confluent.io>, Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Manikumar Reddy O <manikumar.reddy@gmail.com>
This patch contains the improved offset expiration semantics proposed in KIP-211. Committed offsets will not be expired as long as a group is active. Once all members have left the group, then offsets will be expired after the timeout configured by `offsets.retention.minutes`. Note that the optimization for early expiration of unsubscribed topics will be implemented in a separate patch.
This is an unexpected exception so `UnknownServerException`
is thrown back to the client.
This is a minimal change to make the behaviour match `ZkUtils`.
This is better, but one could argue that it's not perfect. A more
sophisticated approach can be tackled separately.
Added a concurrent test that fails without this change.
Reviewers: Jun Rao <junrao@gmail.com>
For metadata request version 6 and above, use a different error code to indicate missing listener on leader broker to enable diagnosis of listener configuration issues.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Also:
- Remove exceptions in `kafka.common` that are no longer used.
- Keep `kafka.common.KafkaException` as it's still used by `ZkUtils`,
`kafka.admin.AdminClient` and `kafka.security.auth` classes and
we would like to maintain compatibility for now.
- Add deprecated annotation to `kafka.admin.AdminClient`. The scaladoc
stated that the class is deprecated, but the annotation was missing.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
A broker with multiple log dirs will die on startup if dir.getCanonicalPath() throws
IOException for one of the log dirs. We should mark such log directory as offline
instead and the broker should start if there is a healthy log dir.
Reviewers: Ismael Juma <ismael@juma.me.uk>
This patch fixes the following issues in the log splitting logic added to address KAFKA-6264:
1. We were not handling the case when all messages in the segment overflowed the index. In this case, there is only one resulting segment following the split.
2. There was an off-by-one error in the recovery logic when completing a swap operation which caused an unintended segment deletion.
Additionally, this patch factors out of `splitOverflowedSegment` a method to write to a segment using from with an instance of `FileRecords`. This allows for future reuse and isolated testing.
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
- Removed Scala consumers (`SimpleConsumer` and `ZooKeeperConsumerConnector`)
and their tests.
- Removed Scala request/response/message classes.
- Removed any mention of new consumer or new producer in the code
with the exception of MirrorMaker where the new.consumer option was
never deprecated so we have to keep it for now. The non-code
documentation has not been updated either, that will be done
separately.
- Removed a number of tools that only made sense in the context
of the Scala consumers (see upgrade notes).
- Updated some tools that worked with both Scala and Java consumers
so that they only support the latter (see upgrade notes).
- Removed `BaseConsumer` and related classes apart from `BaseRecord`
which is used in `MirrorMakerMessageHandler`. The latter is a pluggable
interface so effectively public API.
- Removed `ZkUtils` methods that were only used by the old consumers.
- Removed `ZkUtils.registerBroker` and `ZKCheckedEphemeral` since
the broker now uses the methods in `KafkaZkClient` and no-one else
should be using that method.
- Updated system tests so that they don't use the Scala consumers except
for multi-version tests.
- Updated LogDirFailureTest so that the consumer offsets topic would
continue to be available after all the failures. This was necessary for it
to work with the Java consumer.
- Some multi-version system tests had not been updated to include
recently released Kafka versions, fixed it.
- Updated findBugs and checkstyle configs not to refer to deleted
classes and packages.
Reviewers: Dong Lin <lindong28@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Support configuration of dynamic broker configs in ZooKeeper before starting brokers using ConfigCommand. This will allow password configs to be encrypted and stored in ZooKeeper, without requiring clear passwords in server.properties to bootstrap the broker first.
Reviewers: Jason Gustafson <jason@confluent.io>
The initial PR for KIP-290 #5117 added a new `ResourceNameType`, which was initially a field on `Resource` and `ResourceFilter`. However, follow on PRs have now moved the name type fields to new `ResourcePattern` and `ResourcePatternFilter` classes. This means the old name is no longer valid and may be confusing. The PR looks to rename the class to a more intuitive `resource.PatternType`.
@cmccabe also requested that the current `ANY` value for this class be renamed to avoid confusion. `PatternType.ANY` currently causes `ResourcePatternFilter` to bring back all ACLs that would affect the supplied resource, i.e. it brings back literal, wildcard ACLs, and also does pattern matching to work out which prefix acls would affect the resource. This is very different from the behaviour of `ResourceType.ANY`, which just means the filter ignores the type of resources.
`ANY` is to be renamed to `MATCH` to disambiguate it from other `ANY` filter types. A new `ANY` will be added that works in the same way as others, i.e. it will cause the filter to ignore the pattern type, (but won't do any pattern matching).
Reviewers: Colin Patrick McCabe <colin@cmccabe.xyz>, Jun Rao <junrao@gmail.com>
This does the minimal amount of work so that the tool
relies on public non-deprecated APIs (i.e. it no longer
relies on Scala clients code).
Additional improvements (not included here) have
been proposed via KIP-308.
There are a few other PRs that touch this class with
overlapping goals:
- https://github.com/apache/kafka/pull/2891
- https://github.com/apache/kafka/pull/3051
- https://github.com/apache/kafka/pull/3320
One of them remains relevant in the context of KIP-308, but
the others have been superseded. I included the authors of
the 3 PRs as co-authors.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Vahid Hashemian <vahidhashemian@us.ibm.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Co-authored-by: Arseniy Tashoyan <tashoyan@gmail.com>
Co-authored-by: Vahid Hashemian <vahidhashemian@us.ibm.com>
Co-authored-by: Mohammed Amine GARMES
Co-authored-by: Ismael Juma <ismael@juma.me.uk>
It is possible that log start offset may fall in the middle of the batch after AdminClient#deleteRecords(). This will cause a follower starting from log start offset to fail fetching (all records). Use-cases when a follower will start fetching from log start offset includes: 1) new replica due to partition re-assignment; 2) new local replica created as a result of AdminClient#AlterReplicaLogDirs(); 3) broker that was down for some time while AdminClient#deleteRecords() move log start offset beyond its HW.
Added two integration tests:
1) Produce and then AdminClient#deleteRecords() while one of the followers is down, and then restart of the follower requires fetching from log start offset;
2) AdminClient#AlterReplicaLogDirs() after AdminClient#deleteRecords()
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
For tests that are not testing the old consumers functionality. As part of this,
consolidate `TopicMetadataTest` into `MetadataRequestTest`. Finally,
remove `ProducerBounceTest` which has no tests left in it.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch changes the default `request.timeout.ms` of the consumer to 30 seconds. Additionally, it adds logic to `NetworkClient` and related to components to support timeouts at the request level. We use this to handle the special case of the JoinGroup request, which may block for as long as the value configured by `max.poll.interval.ms`.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <guozhang@confluent.io>
- Update some tests to use the Java consumer.
- Remove ignored `ProducerBounceTest.testBrokerFailure`. This test
is flaky and it has been superseded by `TransactionBounceTest`.
- Use non-blocking poll for consumption methods in `TestUtils`.
This is a step on the road to remove the old consumers.
A bug in the original KIP-277 submission was caught during code review,
but it was not detected by the tests. Fix that gap.
Co-authored-by: Edoardo Comar <ecomar@uk.ibm.com>
Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
Keep Literal ACLs on the old paths, using the old formats, to maintain backwards compatibility.
Have Prefixed, and any latter types, go on new paths, using JSON, (old brokers are not aware of them)
Add checks to reject any adminClient requests to add prefixed acls before the cluster is fully upgraded.
Colin Patrick McCabe <colin@cmccabe.xyz>, Jun Rao <junrao@gmail.com>