This patch fixes a bug in `SocketServer` in the expiration of connections which have not re-authenticated quickly enough. Previously these connections were left hanging, but now they are properly closed and cleaned up. This was one cause of the flaky test failures in `EndToEndAuthorizationTest.testNoDescribeProduceOrConsumeWithoutTopicDescribeAcl`.
Reviewers: Jason Gustafson<jason@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
KAFKA-8448 fixes problem with similar leak. The Log objects are being
held in ScheduledExecutor PeriodicProducerExpirationCheck callback. The
fix in KAFKA-8448 was to change the policy of ScheduledExecutor to
remove the scheduled task when it gets canceled (by calling
setRemoveOnCancelPolicy(true)).
This works when a log is closed using close() method. But when a log is
deleted either when the topic gets deleted or when the rebalancing
operation moves the replica away from broker, the delete() operation is
invoked. Log.delete() doesn't close the pending scheduled task and that
leaks Log instance.
Fix is to close the scheduled task in the Log.delete() method too.
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Race condition in concurrent `get` method invocation of lazy indexes might lead
to multiple `OffsetIndex`/`TimeIndex` objects being concurrently created. When
that happens position of `MappedByteBuffer` in `AbstractIndex` is advanced to
the last entry which in turn leads to a critical `BufferOverflowException` being
thrown whenever leader or replica tries to append a record to the segment.
Moreover, `file_=` setter is seemingly also vulnerable to the race, since multiple
threads can attempt to set a new file reference as well as create new
Time/OffsetIndex objects at the same time. This might lead to the discrepant
File references being held by e.g. LazyTimeIndex and its TimeIndex counterpart.
This patch attempts to fix the issue by making sure that index objects are
atomically constructed when loaded lazily via `get` method. Additionally, `file`
reference modifications are also made atomic and thread safe.
Note that the `Lazy*Index` mutation operations are executed with a lock held by
the callers, but `get` can be called without a lock (e.g. from `Log.read`).
Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>, Shilin Lu, Ismael Juma <ismael@juma.me.uk>
https://issues.apache.org/jira/browse/KAFKA-9069
Make `getTopicMetadata` in AdminClientIntegrationTest always read metadata from controller to get a consistent view.
*More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.*
*Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.*
Author: huxihx <huxi_2b@hotmail.com>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, José Armando García Sancio <jsancio@gmail.com>
Closes#7619 from huxihx/KAFKA-9069
If a field is not marked as ignorable, we should raise an exception if it has been set to a non-default value. This check already exists in `Message.write`, so this patch adds it to `Message.toStruct`. Additionally, we fix several fields which should have been marked ignorable and we fix some related test assertions.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Manikumar Reddy <manikumar.reddy@gmail.com>, Colin Patrick McCabe <cmccabe@apache.org>
This test was taking more than 5 minutes because the producer writes were timing out. The problem was that broker 100 was being shutdown before broker 101, which meant that the partition was still offline after broker 100 was restarted. The producer timeouts were not detected because the produce future was not checked. After the fix, test time drops to about 15s.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Fixes#7717, which did not actually achieve its intended effect. The event manager failed to process the new event because we disabled the rate metric, which it expected to be present.
Reviewers: Ismael Juma <ismael@juma.me.uk
Create a controller event for handling UpdateMetadata responses and log a message when a response contains an error.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
The latter was previously hardcoding logDirCount instead of using the
method defined in the superclass since it was unnecessarily duplicating
logic.
Also tweak IntegrationTestHarness and remove unnecessary method
override from SaslPlainPlaintextConsumerTest.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Force completion of delayed operations when receiving a StopReplica request. In the case of a partition reassignment, we cannot rely on receiving a LeaderAndIsr request in order to complete these operations because the leader may no longer be a replica. Previously when this happened, the delayed operations were left to eventually timeout.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
Co-Authored-By: Kun Du <kidkun@users.noreply.github.com>
While investigating KAFKA-9180, I noticed that we had no
unit test coverage. It turns out that the behavior was
correct, so we just fix the test coverage issue.
Also updated .gitignore with jmh-benchmarks/generated.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
When we roll a new segment, the log offset metadata tied to the high watermark may
need to be updated. This is needed when the high watermark is equal to the log end
offset at the time of the roll. Otherwise, we risk exposing uncommitted data early.
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This patch creates a `BaseAdminIntegrationTest` to be the root for all integration test extensions. Most of the existing tests will only be tested in `PlaintextAdminIntegrationTest`, which extends from `BaseAdminIntegrationTest`. This should cut off about 30 minutes from the overall build time.
Reviewers: David Arthur <mumrah@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Fixes `java.util.concurrent.ExecutionException: java.lang.AssertionError: Close finished too quickly 5999`.
The close test sets a close duration in milliseconds, but measures the time taken in nanoseconds. This leads to small error due to the resolution in each, where the close is deemed to have taken too little time.
When I measured the start and end with nanoTime, I found the time taken to close was `5999641566 ns (5999.6ms)` which seems close enough to be a resolution error. I've run the test 50 times and have not hit the "Close finished too quickly" issue again, whereas previously I hit a failure pretty quickly.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#7683 from lbradstreet/flaky-consumer-bounce-test
This patch implements the broker-side changes for KIP-360. It adds two new fields to InitProducerId: lastEpoch and producerId. Passing these values allows the TransactionCoordinator to safely bump a producer's epoch after some failures (such as UNKNOWN_PRODUCER_ID and INVALID_PRODUCER_ID_MAPPING). When a producer calls InitProducerId after a failure, the coordinator first checks the producer ID from the request to make sure no other producer has been started using the same transactional ID. If it is safe to continue, the coordinator checks the epoch from the request; if it matches the existing epoch, the epoch is bumped and the producer can safely continue. If it matches the previous epoch, the the current epoch is returned without bumping. Otherwise, the producer is fenced.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
This change fixes a performance regression due to follower last seen highwatermark
handling introduced in 23beeea. maybeUpdateHwAndSendResponse is expensive for
brokers with high partition counts, as it requires a partition and a replica lookup for every
partition being fetched. This refactor moves the last seen watermark update into the follower
fetch state update where we have already looked up the partition and replica.
I've seen cases where maybeUpdateHwAndSendResponse is responsible 8% of CPU usage, not including the responseCallback call that is part of it.
I have benchmarked this change with `UpdateFollowerFetchStateBenchmark` and it adds 5ns
of overhead to Partition.updateFollowerFetchState, which is a rounding error compared to the
current overhead of maybeUpdateHwAndSendResponse.
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully.
Reviewers: Jun Rao <junrao@gmail.com>
Co-Authored-By: Tim Van Laer <timvlaer@users.noreply.github.com>
Author: uttpal <kumar.uttpal@oyorooms.com>
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Dong Lin <lindong28@gmail.com>
Closes#7563 from uttpal/KAFKA-9016
Within KafkaConsumer.poll, we have an optimization to try to send the next fetch request before returning the data in order to pipelining the fetch requests; however, this pollNoWakeup should NOT throw any exceptions, since at this point the fetch position has been updated. If an exception is thrown and the callers decide to capture and continue, those records would never be returned again, causing data loss.
Also fix the flaky test itself.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
Background:
Currently, whenever a batch is dropped because ofInvalidRecordException or InvalidTimestampException, only the culprit record appears in ProduceResponse.PartitionResponse.recordErrors. However, after users try to resend that batch excluding the rejected message, the latter records are not guaranteed to be free of problems.
Changes:
To address this issue, I changed the function signature of validateKey, validateRecord and validateTimestamp to return a Scala's Option object. Specifically, this object will hold the reason/message the current record in iteration fails and leaves to the callers (convertAndAssignOffsetsNonCompressed, assignOffsetsNonCompressed, validateMessagesAndAssignOffsetsCompressed) to gathered all troubling records into one place. Then, all these records will be returned along with the PartitionResponse object. As a result, if a batch contains more than one record errors, users see exactly which records cause the failure. PartitionResponse.recordErrors is a list of RecordError objects introduced by #7167 which include batchIndex denoting the relative position of a record in a batch and message indicating the reason of failure.
Gotchas:
Things are particularly tricky when a batch has records rejected because of both InvalidRecordException and InvalidTimestampException. In this case, the InvalidTimestampException takes higher precedence. Therefore, the Error field in PartitionResponse will be encoded with INVALID_TIMESTAMP.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
In #7361, we inadvertently reverted a change to enforce leader only fetching for old versions of the protocol. This patch fixes the problem and adds a new test case to cover fetches which hit purgatory.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, David Arthur <mumrah@gmail.com>
Fix a bug where the lastUsedMs value in the FetchSessionCache was not getting correctly updated, resulting in spurious evictions.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch removes the NewPartitionReassignment#of() method in favor of a simple constructor. Said method was confusing due to breaking two conventions - always returning a non-empty Optional and thus not being used as a static factory method.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
1. Avoid a buffer allocation and a buffer copy per file read.
2. Ensure we flush `netWriteBuffer` successfully before reading from
disk to avoid wasted disk reads.
3. 32k reads instead of 8k reads to reduce the number of disk reads
(improves efficiency for magnetic drives and reduces the number of
system calls).
4. Update SslTransportLayer.write(ByteBuffer) to loop until the socket
buffer is full or the src buffer has no remaining bytes.
5. Renamed `MappedByteBuffers` to `ByteBufferUnmapper` since it's also
applicable for direct byte buffers.
6. Skip empty `RecordsSend`
7. Some minor clean-ups for readability.
I ran a simple consumer perf benchmark on a 6 partition topic (large
enough not to fit into page cache) starting from the beginning of the
log with TLS enabled on my 6 core MacBook Pro as a sanity check.
This laptop has fast SSDs so it benefits less from the larger reads
than the case where magnetic disks are used. Consumer throughput
was ~260 MB/s before the changes and ~300 MB/s after
(~15% improvement).
Credit to @junrao for pointing out that this code could be more efficient.
Reviewers: Jun Rao <junrao@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
#7167 added a check for non-incremental offsets in `assignOffsetsNonCompressed`, which is not applicable for message format V0 and V1. Therefore, I added a condition to disable the check if the record version precedes V2.
Author: Tu Tran <tu@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7628 from tuvtran/KAFKA-9080
We would mistakenly increment the `OffsetCommits` metric instead
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: David Jacot <djacot@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7624 from stanislavkozlovski/minor-fix-group-coordinator-offset-expiry-metric
In 18d4e57f6e (diff-394389922df5210adf43a8b7064cc4ffL61) we unintentionally renamed the metric with the previous changes to reassignments
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7606 from stanislavkozlovski/minor-partition-reassignment-metric
The truncateHead method was removed from ProducerStateManager by github.com/apache/kafka/commit/c49775b. This meant that snapshots were no longer removed when the log start offset increased, even though the intent of that change was to remove snapshots but preserve the in-memory mapping. This patch adds the required functionality back.
Reviewers: Jason Gustafson <jason@confluent.io>
This PR fixes the inconsistency involved in the `removeMembersFromGroup` admin API calls:
1. Fail the `all()` request when there is sub level error (either partition or member)
2. Change getMembers() to members()
3. Hide the actual Errors from user
4. Do not expose generated MemberIdentity type
5. Use more consistent naming for Options and Result types
Reviewers: Guozhang Wang <wangguoz@gmail.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
Prior to this patch, partition creation would not be allowed for any topic while a reassignment is in progress. The PR makes this a topic level check. As long as a particular topic is not being reassigned, we allow partitions to be increased.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
The purpose of this patch is to make the reassignment algorithm simpler and more resilient to unexpected errors. Specifically, it has the following improvements:
1. Remove `ReassignedPartitionContext`. We no longer need to track the previous reassignment through the context and we now use the assignment state as the single source of truth for the target replicas in a reassignment.
2. Remove the intermediate assignment state when overriding a previous reassignment. Instead, an overriding reassignment directly updates the assignment state and shuts down any unneeded replicas. Reassignments are _always_ persisted in Zookeeper before being updated in the controller context.
3. To fix race conditions with concurrent submissions, reassignment completion for a partition always checks for a zk partition reassignment to be removed. This means the controller no longer needs to track the source of the reassignment.
4. Api reassignments explicitly remove reassignment state from zk prior to beginning the new reassignment. This fixes an inconsistency in precedence. Upon controller failover, zookeeper reassignments always take precedence over any active reassignment. So if we do not have the logic to remove the zk reassignment when an api reassignment is triggered, then we can revert to the older zk reassignment.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jun Rao <junrao@gmail.com>
Re-implement NewPartitionReassignment#of. It now takes a list rather than a variable-length list of arguments.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Vikas Singh <vikas@confluent.io>
Currently we only record completed sends and receives in the selector metrics. If there is a disconnect in the middle of the respective operation, then it is not counted. The metrics will be more accurate if we take into account partial sends and receives.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com
Do not allow an empty replica set to be passed into the reassignment API.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@gmail.com>
Aims to fix the flaky LogCleanerIntegrationTest#testIsThreadFailed by changing how metrics are cleaned.
Reviewers: Jason Gustafson <jason@confluent.io>
Allow routing of `AdminClient#describeTopics` to any broker in the cluster than just the controller, so that we don't create a hotspot for this API call. `AdminClient#describeTopics` uses the broker's metadata cache which is asynchronously maintained, so routing to brokers other than the controller is not expected to have a significant difference in terms of metadata consistency; all metadata requests are eventually consistent.
This patch also fixes a few flaky test failures.
Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>