While investigating KAFKA-9180, I noticed that we had no
unit test coverage. It turns out that the behavior was
correct, so this patch just adds the missing test coverage.
Also updated .gitignore with jmh-benchmarks/generated.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
When we roll a new segment, the log offset metadata tied to the high watermark may
need to be updated. This is needed when the high watermark is equal to the log end
offset at the time of the roll. Otherwise, we risk exposing uncommitted data early.
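As a rough illustration of that invariant (toy types only, not the actual `Log` internals):

```scala
// Minimal sketch: when the high watermark is caught up to the log end offset at
// roll time, its offset metadata must be re-anchored to the new active segment,
// otherwise stale segment/position data could expose uncommitted data early.
case class LogOffsetMetadata(messageOffset: Long, segmentBaseOffset: Long, relativePositionInSegment: Int)

class ToyLog(var logEndOffset: LogOffsetMetadata, var highWatermark: LogOffsetMetadata) {
  def roll(newSegmentBaseOffset: Long): Unit = {
    // The log end offset now lives at position 0 of the freshly rolled segment.
    logEndOffset = LogOffsetMetadata(logEndOffset.messageOffset, newSegmentBaseOffset, 0)
    if (highWatermark.messageOffset == logEndOffset.messageOffset)
      highWatermark = logEndOffset
  }
}
```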
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This patch creates a `BaseAdminIntegrationTest` to be the root for all integration test extensions. Most of the existing tests will only run in `PlaintextAdminIntegrationTest`, which extends `BaseAdminIntegrationTest`. This should cut about 30 minutes from the overall build time.
Reviewers: David Arthur <mumrah@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Fixes `java.util.concurrent.ExecutionException: java.lang.AssertionError: Close finished too quickly 5999`.
The close test sets a close duration in milliseconds, but measures the time taken in nanoseconds. This leads to a small error due to the difference in resolution, causing the close to be deemed to have taken too little time.
When I measured the start and end with nanoTime, I found the time taken to close was `5999641566 ns (5999.6ms)` which seems close enough to be a resolution error. I've run the test 50 times and have not hit the "Close finished too quickly" issue again, whereas previously I hit a failure pretty quickly.
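For illustration, the arithmetic behind the failure looks roughly like this (a toy reconstruction, not the actual test code):

```scala
// The close bound is expressed in milliseconds, but the elapsed time comes from
// System.nanoTime and is truncated on conversion, so a close that really took
// 5999641566 ns (~5999.6 ms) is reported as 5999 ms and trips a ">= 6000" check.
val closeBoundMs = 6000L
val elapsedNs = 5999641566L            // as measured with System.nanoTime
val elapsedMs = elapsedNs / 1000000L   // integer division truncates to 5999
assert(elapsedMs < closeBoundMs)       // holds, which is how "Close finished too quickly" fires
```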
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #7683 from lbradstreet/flaky-consumer-bounce-test
This patch implements the broker-side changes for KIP-360. It adds two new fields to InitProducerId: lastEpoch and producerId. Passing these values allows the TransactionCoordinator to safely bump a producer's epoch after some failures (such as UNKNOWN_PRODUCER_ID and INVALID_PRODUCER_ID_MAPPING). When a producer calls InitProducerId after a failure, the coordinator first checks the producer ID from the request to make sure no other producer has been started using the same transactional ID. If it is safe to continue, the coordinator checks the epoch from the request; if it matches the existing epoch, the epoch is bumped and the producer can safely continue. If it matches the previous epoch, the current epoch is returned without bumping. Otherwise, the producer is fenced.
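A simplified sketch of that decision logic (toy types; the real TransactionCoordinator handles more states and error codes):

```scala
sealed trait InitPidDecision
case object BumpEpoch extends InitPidDecision           // request epoch matches current epoch
case object ReturnCurrentEpoch extends InitPidDecision  // request epoch matches previous epoch (retry)
case object FenceProducer extends InitPidDecision

def decide(requestProducerId: Long, requestEpoch: Short,
           currentProducerId: Long, currentEpoch: Short, previousEpoch: Short): InitPidDecision = {
  if (requestProducerId != currentProducerId)
    FenceProducer          // another producer has claimed this transactional id
  else if (requestEpoch == currentEpoch)
    BumpEpoch              // safe to bump the epoch and let the producer continue
  else if (requestEpoch == previousEpoch)
    ReturnCurrentEpoch     // the bump already happened; return the current epoch as-is
  else
    FenceProducer
}
```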
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
This change fixes a performance regression due to follower last seen highwatermark
handling introduced in 23beeea. maybeUpdateHwAndSendResponse is expensive for
brokers with high partition counts, as it requires a partition and a replica lookup for every
partition being fetched. This refactor moves the last seen watermark update into the follower
fetch state update where we have already looked up the partition and replica.
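Conceptually, the shape of the change is something like the following (hypothetical names, not the actual Partition code):

```scala
class ToyReplica(var lastSentHighWatermark: Long = -1L)

// Called from the follower fetch path where the partition and replica have already
// been resolved, so recording the last seen high watermark needs no extra lookup.
def updateFollowerFetchState(replica: ToyReplica, lastSentHighWatermark: Long): Unit = {
  // ... existing follower fetch-state bookkeeping elided ...
  replica.lastSentHighWatermark = lastSentHighWatermark
}
```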
I've seen cases where maybeUpdateHwAndSendResponse is responsible for 8% of CPU usage, not including the responseCallback call that is part of it.
I have benchmarked this change with `UpdateFollowerFetchStateBenchmark` and it adds 5ns
of overhead to Partition.updateFollowerFetchState, which is a rounding error compared to the
current overhead of maybeUpdateHwAndSendResponse.
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully.
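One way to picture the kind of guard involved (a toy sketch with assumed names, not the actual cleaner code):

```scala
// The first dirty offset must never run past the active segment's base offset,
// even if DeleteRecords has pushed the log start offset beyond it.
def firstDirtyOffset(checkpointedOffset: Option[Long],
                     logStartOffset: Long,
                     activeSegmentBaseOffset: Long): Long = {
  val candidate = math.max(checkpointedOffset.getOrElse(logStartOffset), logStartOffset)
  math.min(candidate, activeSegmentBaseOffset)  // clamp gracefully instead of failing
}
```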
Reviewers: Jun Rao <junrao@gmail.com>
Co-Authored-By: Tim Van Laer <timvlaer@users.noreply.github.com>
Author: uttpal <kumar.uttpal@oyorooms.com>
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Dong Lin <lindong28@gmail.com>
Closes #7563 from uttpal/KAFKA-9016
Within KafkaConsumer.poll, we have an optimization to try to send the next fetch request before returning the data in order to pipeline the fetch requests; however, this pollNoWakeup should NOT throw any exceptions, since at this point the fetch position has been updated. If an exception is thrown and the callers decide to catch it and continue, those records would never be returned again, causing data loss.
Also fix the flaky test itself.
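An illustrative sketch of the intent (not the KafkaConsumer source):

```scala
// Once the fetch positions have been advanced and records are about to be returned,
// the eager "pipeline the next fetch" call must not let an exception escape, or a
// caller that catches and keeps polling would silently lose those records.
def returnFetchedRecords[R](records: R, sendNextFetch: () => Unit): R = {
  try sendNextFetch()  // best-effort pipelining of the next fetch request
  catch {
    case e: Exception =>
      println(s"Failed to pipeline the next fetch, will retry on the next poll: $e")
  }
  records
}
```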
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
Background:
Currently, whenever a batch is dropped because of InvalidRecordException or InvalidTimestampException, only the culprit record appears in ProduceResponse.PartitionResponse.recordErrors. However, after users resend that batch excluding the rejected message, the remaining records are not guaranteed to be free of problems.
Changes:
To address this issue, I changed the signatures of validateKey, validateRecord and validateTimestamp to return a Scala Option. Specifically, the Option holds the reason the current record in the iteration failed, and it is left to the callers (convertAndAssignOffsetsNonCompressed, assignOffsetsNonCompressed, validateMessagesAndAssignOffsetsCompressed) to gather all the problematic records in one place. These records are then returned along with the PartitionResponse object. As a result, if a batch contains more than one record error, users see exactly which records caused the failure. PartitionResponse.recordErrors is a list of RecordError objects introduced by #7167, each of which includes a batchIndex denoting the relative position of the record in the batch and a message indicating the reason for the failure.
Gotchas:
Things are particularly tricky when a batch has records rejected because of both InvalidRecordException and InvalidTimestampException. In this case, InvalidTimestampException takes precedence, so the Error field in PartitionResponse will be encoded with INVALID_TIMESTAMP.
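A toy sketch of the validation flow described above (illustrative types, not the actual LogValidator code):

```scala
case class RecordError(batchIndex: Int, message: String)

// Each per-record check reports its failure as an Option...
def validateRecord(index: Int, key: Array[Byte], timestamp: Long, now: Long): Option[RecordError] = {
  if (key == null)
    Some(RecordError(index, "Compacted topic cannot accept a record without a key"))
  else if (math.abs(now - timestamp) > 60 * 60 * 1000L)
    Some(RecordError(index, s"Timestamp $timestamp is outside the acceptable range"))
  else
    None
}

// ...and the caller accumulates all failures so the whole batch's problems surface at once.
def validateBatch(records: Seq[(Array[Byte], Long)], now: Long): Seq[RecordError] =
  records.zipWithIndex.flatMap { case ((key, ts), i) => validateRecord(i, key, ts, now) }
```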
Reviewers: Guozhang Wang <wangguoz@gmail.com>
In #7361, we inadvertently reverted a change to enforce leader only fetching for old versions of the protocol. This patch fixes the problem and adds a new test case to cover fetches which hit purgatory.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, David Arthur <mumrah@gmail.com>
Fix a bug where the lastUsedMs value in the FetchSessionCache was not getting correctly updated, resulting in spurious evictions.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch removes the NewPartitionReassignment#of() method in favor of a simple constructor. The method was confusing because it broke convention: it always returned a non-empty Optional, so it was not really being used as a static factory method.
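A usage sketch with the plain constructor (assuming the 2.4 `Admin#alterPartitionReassignments` API shape):

```scala
import java.util.{Arrays, Collections, Optional}
import org.apache.kafka.clients.admin.{Admin, NewPartitionReassignment}
import org.apache.kafka.common.TopicPartition

def moveToBrokers(admin: Admin): Unit = {
  val target = new NewPartitionReassignment(Arrays.asList(Int.box(1), Int.box(2), Int.box(3)))
  admin.alterPartitionReassignments(
    Collections.singletonMap(new TopicPartition("my-topic", 0), Optional.of(target))
  ).all().get()  // Optional.empty() would instead cancel an in-progress reassignment
}
```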
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
1. Avoid a buffer allocation and a buffer copy per file read.
2. Ensure we flush `netWriteBuffer` successfully before reading from
disk to avoid wasted disk reads.
3. 32k reads instead of 8k reads to reduce the number of disk reads
(improves efficiency for magnetic drives and reduces the number of
system calls).
4. Update SslTransportLayer.write(ByteBuffer) to loop until the socket
buffer is full or the src buffer has no remaining bytes (see the sketch after this list).
5. Renamed `MappedByteBuffers` to `ByteBufferUnmapper` since it's also
applicable for direct byte buffers.
6. Skip empty `RecordsSend`.
7. Some minor clean-ups for readability.
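An illustrative sketch of the looping write from item 4 (not the actual SslTransportLayer code):

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Keep writing until either the source buffer is drained or the socket buffer
// stops accepting bytes (a write of 0 means it is full; retry on the next poll).
def writeFully(channel: WritableByteChannel, src: ByteBuffer): Int = {
  var written = 0
  var lastWrite = -1
  while (src.hasRemaining && lastWrite != 0) {
    lastWrite = channel.write(src)
    written += lastWrite
  }
  written
}
```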
I ran a simple consumer perf benchmark on a 6 partition topic (large
enough not to fit into page cache) starting from the beginning of the
log with TLS enabled on my 6 core MacBook Pro as a sanity check.
This laptop has fast SSDs so it benefits less from the larger reads
than the case where magnetic disks are used. Consumer throughput
was ~260 MB/s before the changes and ~300 MB/s after
(~15% improvement).
Credit to @junrao for pointing out that this code could be more efficient.
Reviewers: Jun Rao <junrao@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
#7167 added a check for non-incremental offsets in `assignOffsetsNonCompressed`, which is not applicable for message format V0 and V1. Therefore, I added a condition to disable the check if the record version precedes V2.
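A toy sketch of the version guard (not the actual LogValidator code):

```scala
import org.apache.kafka.common.record.RecordBatch

// V0/V1 batches may legitimately carry non-incremental offsets, so the
// monotonicity check only applies from message format V2 onwards.
def shouldCheckIncrementalOffsets(magic: Byte): Boolean =
  magic >= RecordBatch.MAGIC_VALUE_V2
```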
Author: Tu Tran <tu@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #7628 from tuvtran/KAFKA-9080
We would mistakenly increment the `OffsetCommits` metric instead of the offset expiry metric.
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: David Jacot <djacot@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #7624 from stanislavkozlovski/minor-fix-group-coordinator-offset-expiry-metric
In 18d4e57f6e (diff-394389922df5210adf43a8b7064cc4ffL61), we unintentionally renamed the metric as part of the previous changes to reassignments.
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #7606 from stanislavkozlovski/minor-partition-reassignment-metric
The truncateHead method was removed from ProducerStateManager by github.com/apache/kafka/commit/c49775b. This meant that snapshots were no longer removed when the log start offset increased, even though the intent of that change was to remove snapshots but preserve the in-memory mapping. This patch adds the required functionality back.
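An illustrative sketch of the restored behavior (hypothetical helpers, not the real ProducerStateManager):

```scala
import java.io.File
import java.nio.file.Files

// Snapshot files at or below the new log start offset are deleted, while the
// in-memory producer state mapping is left untouched.
def removeOldSnapshots(logDir: File, newLogStartOffset: Long, offsetFromFile: File => Long): Unit = {
  Option(logDir.listFiles()).getOrElse(Array.empty[File])
    .filter(f => f.getName.endsWith(".snapshot") && offsetFromFile(f) <= newLogStartOffset)
    .foreach(f => Files.deleteIfExists(f.toPath))
}
```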
Reviewers: Jason Gustafson <jason@confluent.io>
This PR fixes the inconsistency involved in the `removeMembersFromGroup` admin API calls:
1. Fail the `all()` request when there is a sub-level error (either partition or member); see the usage sketch after this list
2. Change getMembers() to members()
3. Hide the actual Errors from user
4. Do not expose generated MemberIdentity type
5. Use more consistent naming for Options and Result types
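A usage sketch of the resulting API shape (assuming the `Admin#removeMembersFromConsumerGroup` call and its option/result types):

```scala
import java.util.Collections
import org.apache.kafka.clients.admin.{Admin, MemberToRemove, RemoveMembersFromConsumerGroupOptions}

def removeStaticMember(admin: Admin, groupId: String, instanceId: String): Unit = {
  val options = new RemoveMembersFromConsumerGroupOptions(
    Collections.singleton(new MemberToRemove(instanceId)))
  val result = admin.removeMembersFromConsumerGroup(groupId, options)
  // all() now fails if any member-level removal failed instead of hiding sub-level errors
  result.all().get()
}
```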
Reviewers: Guozhang Wang <wangguoz@gmail.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
Prior to this patch, partition creation would not be allowed for any topic while a reassignment is in progress. This PR makes it a topic-level check: as long as a particular topic is not being reassigned, we allow its partition count to be increased.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
The purpose of this patch is to make the reassignment algorithm simpler and more resilient to unexpected errors. Specifically, it has the following improvements:
1. Remove `ReassignedPartitionContext`. We no longer need to track the previous reassignment through the context and we now use the assignment state as the single source of truth for the target replicas in a reassignment.
2. Remove the intermediate assignment state when overriding a previous reassignment. Instead, an overriding reassignment directly updates the assignment state and shuts down any unneeded replicas. Reassignments are _always_ persisted in Zookeeper before being updated in the controller context.
3. To fix race conditions with concurrent submissions, reassignment completion for a partition always checks whether there is a zk partition reassignment that needs to be removed. This means the controller no longer needs to track the source of the reassignment.
4. Api reassignments explicitly remove reassignment state from zk prior to beginning the new reassignment. This fixes an inconsistency in precedence. Upon controller failover, zookeeper reassignments always take precedence over any active reassignment. So if we do not have the logic to remove the zk reassignment when an api reassignment is triggered, then we can revert to the older zk reassignment.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jun Rao <junrao@gmail.com>
Re-implement NewPartitionReassignment#of. It now takes a list rather than a variable-length argument list.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Vikas Singh <vikas@confluent.io>
Currently we only record completed sends and receives in the selector metrics. If there is a disconnect in the middle of the respective operation, then it is not counted. The metrics will be more accurate if we take into account partial sends and receives.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Do not allow an empty replica set to be passed into the reassignment API.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@gmail.com>
Aims to fix the flaky LogCleanerIntegrationTest#testIsThreadFailed by changing how metrics are cleaned.
Reviewers: Jason Gustafson <jason@confluent.io>
Allow routing of `AdminClient#describeTopics` to any broker in the cluster rather than just the controller, so that we don't create a hotspot for this API call. `AdminClient#describeTopics` uses the broker's metadata cache, which is maintained asynchronously, so routing to brokers other than the controller is not expected to make a significant difference in terms of metadata consistency; all metadata requests are eventually consistent.
This patch also fixes a few flaky test failures.
Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
KIP-352 aims to add several new metrics in order to track reassignments much better. We will be able to measure bytes in/out rate and the count of partitions under active reassignment.
We also change the semantics of the UnderReplicatedPartitions metric to cater better for reassignment. Currently it reports partitions as under-replicated when extra replicas are added during reassignment as part of the process, but this PR changes it to always take the original replica set into account when computing the URP metric.
The newly added metrics will be:
- kafka.server:type=ReplicaManager,name=ReassigningPartitions
- kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
- kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec
The changed URP metric:
- kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
With KIP-392, we allow consumers to fetch from followers. This capability is enabled when a replica selector has been provided in the configuration. When not in use, the intent is to preserve current behavior of fetching only from leader. The leader epoch is the mechanism that keeps us honest. When there is a leader change, the epoch gets bumped, consumer fetches fail due to the fenced epoch, and we find the new leader.
However, for old consumers, there is no similar protection. The leader epoch was not available to clients until recently. If there is a preferred leader election (for example), the old consumer will happily continue fetching from the demoted leader until a periodic metadata fetch causes it to discover the new leader. This does not create any problems from a correctness perspective (fetches are still bound by the high watermark), but it is unexpected and may lead to surprising performance characteristics.
This patch fixes this problem by enforcing leader-only fetching for older versions of the fetch request.
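Conceptually (a sketch only, with the protocol version that introduced follower fetching left as a parameter rather than hard-coded):

```scala
// A replica fetch always goes to the leader; a consumer fetch may be served by a
// follower only when the request version is new enough to support fetch-from-follower.
def fetchOnlyFromLeader(requestVersion: Short,
                        isFromFollowerReplica: Boolean,
                        firstFollowerFetchVersion: Short): Boolean =
  isFromFollowerReplica || requestVersion < firstFollowerFetchVersion
```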
Reviewers: Jason Gustafson <jason@confluent.io>
AbstractRequestResponse should be an interface, since it has no concrete elements or implementation. Move AbstractRequestResponse#serialize to RequestUtils#serialize and make it package-private, since it doesn't need to be public.
Reviewers: Ismael Juma <ismael@juma.me.uk>
A partition log is initialized in the following steps:
1. Fetch log config from ZK
2. Call LogManager.getOrCreateLog, which creates the Log object, then
3. Register the Log object
Step #3 enables the configuration update thread to deliver configuration
updates to the log. But if any update arrives between steps #1 and #3,
that update is missed. This breaks the following use case:
1. Create a topic with default configuration, and immediately after that
2. Update the configuration of the topic
There is a race condition here, and in some cases the update made in the
second step gets dropped.
This change fixes it by tracking updates that arrive between steps #1 and #3.
Once a Partition is done initializing the log, it checks whether it has missed
any update. If so, the configuration is read from ZK again.
Added unit tests to make sure a dirty configuration is refreshed. Tested
on a local cluster to make sure that topic configuration and updates are
handled correctly.
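A toy sketch of the fix (hypothetical names, not the real Partition/LogManager code):

```scala
// Remember whether a config update arrived during initialization, and if so
// re-read the config from ZK once the Log has been registered.
class ToyPartitionLogInit(fetchConfigFromZk: () => Map[String, String]) {
  @volatile private var dirty = false

  // Called by the config update thread; before registration the update cannot be
  // applied to the Log, so we only mark the cached config as dirty.
  def onConfigUpdate(): Unit = dirty = true

  def initialize(): Map[String, String] = {
    var config = fetchConfigFromZk()   // step 1
    // step 2: create the Log; step 3: register it (elided)
    if (dirty)                         // an update raced with steps 1-3
      config = fetchConfigFromZk()     // refresh to pick up the missed update
    config
  }
}
```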
Reviewers: Jason Gustafson <jason@confluent.io>
KAFKA-7215 improved the log cleaner error handling to mitigate thread death but missed one case. Exceptions in grabFilthiestCompactedLog still cause the thread to die.
This patch improves handling to ensure that errors in that function still mark a partition as uncleanable and do not crash the thread.
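An illustrative sketch of the hardening (toy types, not the actual LogCleanerManager code):

```scala
case class ToyTopicPartition(topic: String, partition: Int)

// Failures while scoring a candidate log no longer propagate: the offending
// partition is marked uncleanable and the cleaner thread keeps running.
def grabFilthiestLogSafely(candidates: Seq[ToyTopicPartition],
                           filthiness: ToyTopicPartition => Double,
                           markUncleanable: ToyTopicPartition => Unit): Option[ToyTopicPartition] = {
  val scored = candidates.flatMap { tp =>
    try Some(tp -> filthiness(tp))
    catch {
      case _: Exception =>
        markUncleanable(tp)
        None
    }
  }
  if (scored.isEmpty) None else Some(scored.maxBy(_._2)._1)
}
```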
Reviewers: Jason Gustafson <jason@confluent.io>
All the changes are in ReplicaManager.appendToLocalLog and ReplicaManager.appendRecords. Also, replaced LogAppendInfo.unknownLogAppendInfoWithLogStartOffset with LogAppendInfo.unknownLogAppendInfoWithAdditionalInfo to include those 2 new fields.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
The scalac optimizer is able to inline methods to avoid lambda allocations, eliminating
the runtime cost of higher order functions in many cases. The compilation parameters
we are using here were introduced in 2.12.x, so we don't enable them for Scala 2.11.
Also, we enable a more aggressive inlining policy for the `core` project since it's
not meant to be used as a library.
See https://www.lightbend.com/blog/scala-inliner-optimizer for more information about
the optimizer.
I verified that the lambda allocation in the code below (from LogCleaner.scala) went away
after this change with Scala 2.12 and 2.13.
```scala
private def consumeAbortedTxnsUpTo(offset: Long): Unit = {
  while (abortedTransactions.headOption.exists(_.firstOffset <= offset)) {
    val abortedTxn = abortedTransactions.dequeue()
    ongoingAbortedTxns.getOrElseUpdate(abortedTxn.producerId, new AbortedTransactionMetadata(abortedTxn))
  }
}
```
The relevant part of the bytecode when compiled with Scala 2.13 looks like:
```text
private void consumeAbortedTxnsUpTo(long);
Code:
0: aload_0
1: invokespecial #54 // Method abortedTransactions:()Lscala/collection/mutable/PriorityQueue;
4: invokevirtual #175 // Method scala/collection/mutable/PriorityQueue.headOption:()Lscala/Option;
7: dup
8: ifnonnull 13
11: aconst_null
12: athrow
13: astore 4
15: aload 4
17: invokevirtual #145 // Method scala/Option.isEmpty:()Z
20: ifne 48
23: aload 4
25: invokevirtual #148 // Method scala/Option.get:()Ljava/lang/Object;
28: checkcast #177 // class kafka/log/AbortedTxn
```
The increased inlining causes some spurious spotBugs warnings, so I added a few suppressions
and fixed one warning by avoiding unnecessary boxing.
Reviewers: Guozhang Wang <wangguoz@gmail.com>