Fixes `java.util.concurrent.ExecutionException: java.lang.AssertionError: Close finished too quickly 5999`.
The close test sets a close duration in milliseconds, but measures the time taken in nanoseconds. This leads to small error due to the resolution in each, where the close is deemed to have taken too little time.
When I measured the start and end with nanoTime, I found the time taken to close was `5999641566 ns (5999.6ms)` which seems close enough to be a resolution error. I've run the test 50 times and have not hit the "Close finished too quickly" issue again, whereas previously I hit a failure pretty quickly.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#7683 from lbradstreet/flaky-consumer-bounce-test
Instead of caching the checkpoint map during StandbyTask
initialization, use the latest checkpoints (which would have
been updated during suspend).
Reviewers: Bill Bejeck <bill@confluent.io>
This patch implements the broker-side changes for KIP-360. It adds two new fields to InitProducerId: lastEpoch and producerId. Passing these values allows the TransactionCoordinator to safely bump a producer's epoch after some failures (such as UNKNOWN_PRODUCER_ID and INVALID_PRODUCER_ID_MAPPING). When a producer calls InitProducerId after a failure, the coordinator first checks the producer ID from the request to make sure no other producer has been started using the same transactional ID. If it is safe to continue, the coordinator checks the epoch from the request; if it matches the existing epoch, the epoch is bumped and the producer can safely continue. If it matches the previous epoch, the the current epoch is returned without bumping. Otherwise, the producer is fenced.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
This change fixes a performance regression due to follower last seen highwatermark
handling introduced in 23beeea. maybeUpdateHwAndSendResponse is expensive for
brokers with high partition counts, as it requires a partition and a replica lookup for every
partition being fetched. This refactor moves the last seen watermark update into the follower
fetch state update where we have already looked up the partition and replica.
I've seen cases where maybeUpdateHwAndSendResponse is responsible 8% of CPU usage, not including the responseCallback call that is part of it.
I have benchmarked this change with `UpdateFollowerFetchStateBenchmark` and it adds 5ns
of overhead to Partition.updateFollowerFetchState, which is a rounding error compared to the
current overhead of maybeUpdateHwAndSendResponse.
Reviewers: David Arthur <mumrah@gmail.com>, Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>
7.3.1.1 JAAS configuration for Kafka brokers was followed by
7.3.1.4 JAAS configuration for Kafka clients instead of 7.3.1.2.
Reviewers: Mickael Maison <mickael.maison@gmail.com>
ZK upgrade from 3.4.X to 3.5.6 fails with "java.io.IOException: No snapshot found" if there are no snapshot files. This was discussed in https://issues.apache.org/jira/browse/ZOOKEEPER-3056
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#7625 from omkreddy/zk-upgrade
The `Epoch` field description was copy of the `SessionId` field. This
change updates it to describe `Epoch` instead.
No code change, only description changes. Code compiles.
Reviewers: Jason Gustafson <jason@confluent.io>
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#7477 from cmccabe/KAFKA-8984
This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully.
Reviewers: Jun Rao <junrao@gmail.com>
Co-Authored-By: Tim Van Laer <timvlaer@users.noreply.github.com>
When users specify a name for a repartition topic, we should use the same name for the repartition filter, source, and sink nodes. With the addition of KIP-307 if users go to the effort of naming every node in the topology having processor nodes with generated names is inconsistent behavior.
Updated tests in the streams test suite.
Reviewers: John Roesler <john@confluent.io>, Christopher Pettitt <cpettitt@confluent.io>
Author: uttpal <kumar.uttpal@oyorooms.com>
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Dong Lin <lindong28@gmail.com>
Closes#7563 from uttpal/KAFKA-9016
Within KafkaConsumer.poll, we have an optimization to try to send the next fetch request before returning the data in order to pipelining the fetch requests; however, this pollNoWakeup should NOT throw any exceptions, since at this point the fetch position has been updated. If an exception is thrown and the callers decide to capture and continue, those records would never be returned again, causing data loss.
Also fix the flaky test itself.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
This patch fixes the test utility `testAllMessageRoundTripsFromVersion` in `MessageTest` which was unintentionally excluding the highest version.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Added one test and some comments to clarify how EOS is "enabled" for some of the tests.
Ran all streams tests.
Reviewers: Matthias J. Sax <mjsax@apache.org>
Background:
Currently, whenever a batch is dropped because ofInvalidRecordException or InvalidTimestampException, only the culprit record appears in ProduceResponse.PartitionResponse.recordErrors. However, after users try to resend that batch excluding the rejected message, the latter records are not guaranteed to be free of problems.
Changes:
To address this issue, I changed the function signature of validateKey, validateRecord and validateTimestamp to return a Scala's Option object. Specifically, this object will hold the reason/message the current record in iteration fails and leaves to the callers (convertAndAssignOffsetsNonCompressed, assignOffsetsNonCompressed, validateMessagesAndAssignOffsetsCompressed) to gathered all troubling records into one place. Then, all these records will be returned along with the PartitionResponse object. As a result, if a batch contains more than one record errors, users see exactly which records cause the failure. PartitionResponse.recordErrors is a list of RecordError objects introduced by #7167 which include batchIndex denoting the relative position of a record in a batch and message indicating the reason of failure.
Gotchas:
Things are particularly tricky when a batch has records rejected because of both InvalidRecordException and InvalidTimestampException. In this case, the InvalidTimestampException takes higher precedence. Therefore, the Error field in PartitionResponse will be encoded with INVALID_TIMESTAMP.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
On-the-side cleanups extracted from the PR for KAFKA-9103, so that the actual PR can be as small as possible.
Reviewers: Christopher Pettitt <cpettitt@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Otherwise the join-group would not be resend and we'd just fall into the endless loop.
Reviewers: Jason Gustafson <jason@confluent.io>, Boyang Chen <boyang@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
In #7361, we inadvertently reverted a change to enforce leader only fetching for old versions of the protocol. This patch fixes the problem and adds a new test case to cover fetches which hit purgatory.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, David Arthur <mumrah@gmail.com>
The consumer's `committed` API does not return an entry in the response map for a requested partition if there is no committed offset. The transactional message copier, which is used in the transaction system test, did not account for this. If the first transaction attempted by the copier was randomly aborted, then we would not seek to the beginning as expected, which means we would fail to copy some of the records.
This patch fixes the problem by iterating over the assignment rather than the result of `committed` when resetting offsets. It also adds enables additional logging in the transaction message copier service to make finding problems easier in the future.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7653 from hachikuji/fix-transaction-system-test
Fix a bug where the lastUsedMs value in the FetchSessionCache was not getting correctly updated, resulting in spurious evictions.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch removes the NewPartitionReassignment#of() method in favor of a simple constructor. Said method was confusing due to breaking two conventions - always returning a non-empty Optional and thus not being used as a static factory method.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
1. Avoid a buffer allocation and a buffer copy per file read.
2. Ensure we flush `netWriteBuffer` successfully before reading from
disk to avoid wasted disk reads.
3. 32k reads instead of 8k reads to reduce the number of disk reads
(improves efficiency for magnetic drives and reduces the number of
system calls).
4. Update SslTransportLayer.write(ByteBuffer) to loop until the socket
buffer is full or the src buffer has no remaining bytes.
5. Renamed `MappedByteBuffers` to `ByteBufferUnmapper` since it's also
applicable for direct byte buffers.
6. Skip empty `RecordsSend`
7. Some minor clean-ups for readability.
I ran a simple consumer perf benchmark on a 6 partition topic (large
enough not to fit into page cache) starting from the beginning of the
log with TLS enabled on my 6 core MacBook Pro as a sanity check.
This laptop has fast SSDs so it benefits less from the larger reads
than the case where magnetic disks are used. Consumer throughput
was ~260 MB/s before the changes and ~300 MB/s after
(~15% improvement).
Credit to @junrao for pointing out that this code could be more efficient.
Reviewers: Jun Rao <junrao@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
Minor follow up to #7608: For some reason the AssignedStreamTasks#updateRestored method only updates the restoring and restoredPartitions data structures, but there is a third map holding restored tasks & partitions: restoringByPartitions
Also improves the TaskManager#closeLostTasks logging, by separating by case and logging the specific failure before throwing.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Guozhang Wang <wangguoz@gmail.com>
If there are pending bytes in the transport layer, we may
complete a send even if no bytes were recorded as written.
We assume bytes are written when they are in the netWriteBuffer,
but we only consider the send as completed when it's in
the socket channel buffer.
This fixes a regression introduced via 0971f66ff5. The impact is
that we would sometimes throw the following exception in
`MultiRecordsSend.writeTo`:
```java
if (completed())
throw new KafkaException("This operation cannot be invoked on a complete request.");
```
Added unit test verifying the bug fix. While in the area, I simplified one of the
`SslSelectorTest` methods.
Reviewers: Jun Rao <junrao@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
#7167 added a check for non-incremental offsets in `assignOffsetsNonCompressed`, which is not applicable for message format V0 and V1. Therefore, I added a condition to disable the check if the record version precedes V2.
Author: Tu Tran <tu@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7628 from tuvtran/KAFKA-9080
Currently when we identify version probing we return early from onAssignment and never get to updating the TaskManager and general state with the new assignment. Since we do actually give out "real" assignments even during version probing, a StreamThread should take real ownership of its tasks/partitions including cleaning them up in onPartitionsRevoked which gets invoked when we call onLeavePrepare as part of triggering the follow-up rebalance.
Every member will always get an assignment encoded with the lowest common version, so there should be no problem decoding a VP assignment. We should just allow onAssignment to proceed as usual so that the TaskManager is in a consistent state, and knows what all its tasks/partitions are when the first rebalance completes and the next one is triggered.
Reviewers: Boyang Chen <boyang@confluent.io>, Matthias J. Sax <mjsax@apache.org>, Guozhang Wang <wangguoz@gmail.com>
We should not use StreamsMetricsImpl. threadLevelSensor directly which would only retrieve the sensor but would not add any metrics to the sensor. Generally speaking we should always use the corresponding-level Metrics class (e.g. ThreadMetrics) to get the sensors which are populated with metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>, Matthias J. Sax <matthias@confluent.io>
Rather than maintain hand coded protocol serialization code, Streams could use the same code-generation framework as Clients/Core.
There isn't a perfect match, since the code generation framework includes an assumption that you're generating "protocol messages", rather than just arbitrary blobs, but I think it's close enough to justify using it, and improving it over time.
Using the code generation allows us to drop a lot of detail-oriented, brittle, and hard-to-maintain serialization logic in favor of a schema spec.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Boyang Chen <boyang@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
We would mistakenly increment the `OffsetCommits` metric instead
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: David Jacot <djacot@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7624 from stanislavkozlovski/minor-fix-group-coordinator-offset-expiry-metric
`log` in KafkaConsumer does not get initialized if an invalid value for group.intance.id is given during consumer construction. In this case we should skip the catch block's close procedure since no internal objects have been initialized yet.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Refactors metrics according to KIP-444
Introduces StateStoreMetrics as a central provider for state store metrics
Adds metric scope (a.k.a. store type) to the in-memory suppression buffer
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>
In 18d4e57f6e (diff-394389922df5210adf43a8b7064cc4ffL61) we unintentionally renamed the metric with the previous changes to reassignments
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7606 from stanislavkozlovski/minor-partition-reassignment-metric
Third bugfix for the failing broker bounce system test with cooperative rebalancing:
tl;dr We need to remove everything associated with a task when it is closed, but in some cases (eg AssignedTasks#commit) on a TaskMigratedExceptionwe would close it as a zombie and then (only) remove the taskId from therunning` map. This left its partitions, restorers, state stores, etc around and in an undefined state, causing exceptions when closing and/or opening the stores again.
Longer explanation:
In AssignedTasks (the abstract class from which the standby and active task variations extend) a commit failure (even due to broker down/unavailable) is treated as a TaskMigratedException after which the failed task is closed as a zombie and removed from running -- the remaining tasks (ie those still in running are then also closed as zombies in the subsequent onPartitionsLost
However we do not remove the closed task from runningByPartition nor do we remove the corresponding changelogs, if restoring, from the StoreChangelogReader since that applies only to active tasks, and AssignedTasks is generic/abstract. The changelog reader then retains a mapping from the closed task's changelog partition to its CompositeRestoreListener (and does not replace this when the new one comes along after the rebalance). The restore listener has a reference to a specific RocksDBStore instance, one which was closed when the task was closed as a zombie, so it accidentally tries to restore to the "old" RocksDBStore instance rather than the new one that was just opened.
Although technically this bug existed before KIP-429, it was only uncovered now that we remove tasks and clear their state/partitions/etc one at a time. We don't technically need to cherrypick the fix back earlier as before we just blindly clear all data structures entirely during an eager rebalance.
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
With KIP-444 the metrics definitions are refactored. Thus, Streams' SimpleBenchmark needs to be updated to correctly access the refactored metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>