The `Epoch` field description was copy of the `SessionId` field. This
change updates it to describe `Epoch` instead.
No code change, only description changes. Code compiles.
Reviewers: Jason Gustafson <jason@confluent.io>
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#7477 from cmccabe/KAFKA-8984
This was a regression in 2.3.1. In the case of a DeleteRecords call, the log start offset may be higher than the active segment base offset. The cleaner should allow for this case gracefully.
Reviewers: Jun Rao <junrao@gmail.com>
Co-Authored-By: Tim Van Laer <timvlaer@users.noreply.github.com>
When users specify a name for a repartition topic, we should use the same name for the repartition filter, source, and sink nodes. With the addition of KIP-307 if users go to the effort of naming every node in the topology having processor nodes with generated names is inconsistent behavior.
Updated tests in the streams test suite.
Reviewers: John Roesler <john@confluent.io>, Christopher Pettitt <cpettitt@confluent.io>
Author: uttpal <kumar.uttpal@oyorooms.com>
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Dong Lin <lindong28@gmail.com>
Closes#7563 from uttpal/KAFKA-9016
Within KafkaConsumer.poll, we have an optimization to try to send the next fetch request before returning the data in order to pipelining the fetch requests; however, this pollNoWakeup should NOT throw any exceptions, since at this point the fetch position has been updated. If an exception is thrown and the callers decide to capture and continue, those records would never be returned again, causing data loss.
Also fix the flaky test itself.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
This patch fixes the test utility `testAllMessageRoundTripsFromVersion` in `MessageTest` which was unintentionally excluding the highest version.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Added one test and some comments to clarify how EOS is "enabled" for some of the tests.
Ran all streams tests.
Reviewers: Matthias J. Sax <mjsax@apache.org>
Background:
Currently, whenever a batch is dropped because ofInvalidRecordException or InvalidTimestampException, only the culprit record appears in ProduceResponse.PartitionResponse.recordErrors. However, after users try to resend that batch excluding the rejected message, the latter records are not guaranteed to be free of problems.
Changes:
To address this issue, I changed the function signature of validateKey, validateRecord and validateTimestamp to return a Scala's Option object. Specifically, this object will hold the reason/message the current record in iteration fails and leaves to the callers (convertAndAssignOffsetsNonCompressed, assignOffsetsNonCompressed, validateMessagesAndAssignOffsetsCompressed) to gathered all troubling records into one place. Then, all these records will be returned along with the PartitionResponse object. As a result, if a batch contains more than one record errors, users see exactly which records cause the failure. PartitionResponse.recordErrors is a list of RecordError objects introduced by #7167 which include batchIndex denoting the relative position of a record in a batch and message indicating the reason of failure.
Gotchas:
Things are particularly tricky when a batch has records rejected because of both InvalidRecordException and InvalidTimestampException. In this case, the InvalidTimestampException takes higher precedence. Therefore, the Error field in PartitionResponse will be encoded with INVALID_TIMESTAMP.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
On-the-side cleanups extracted from the PR for KAFKA-9103, so that the actual PR can be as small as possible.
Reviewers: Christopher Pettitt <cpettitt@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Otherwise the join-group would not be resend and we'd just fall into the endless loop.
Reviewers: Jason Gustafson <jason@confluent.io>, Boyang Chen <boyang@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
In #7361, we inadvertently reverted a change to enforce leader only fetching for old versions of the protocol. This patch fixes the problem and adds a new test case to cover fetches which hit purgatory.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, David Arthur <mumrah@gmail.com>
The consumer's `committed` API does not return an entry in the response map for a requested partition if there is no committed offset. The transactional message copier, which is used in the transaction system test, did not account for this. If the first transaction attempted by the copier was randomly aborted, then we would not seek to the beginning as expected, which means we would fail to copy some of the records.
This patch fixes the problem by iterating over the assignment rather than the result of `committed` when resetting offsets. It also adds enables additional logging in the transaction message copier service to make finding problems easier in the future.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7653 from hachikuji/fix-transaction-system-test
Fix a bug where the lastUsedMs value in the FetchSessionCache was not getting correctly updated, resulting in spurious evictions.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
This patch removes the NewPartitionReassignment#of() method in favor of a simple constructor. Said method was confusing due to breaking two conventions - always returning a non-empty Optional and thus not being used as a static factory method.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
1. Avoid a buffer allocation and a buffer copy per file read.
2. Ensure we flush `netWriteBuffer` successfully before reading from
disk to avoid wasted disk reads.
3. 32k reads instead of 8k reads to reduce the number of disk reads
(improves efficiency for magnetic drives and reduces the number of
system calls).
4. Update SslTransportLayer.write(ByteBuffer) to loop until the socket
buffer is full or the src buffer has no remaining bytes.
5. Renamed `MappedByteBuffers` to `ByteBufferUnmapper` since it's also
applicable for direct byte buffers.
6. Skip empty `RecordsSend`
7. Some minor clean-ups for readability.
I ran a simple consumer perf benchmark on a 6 partition topic (large
enough not to fit into page cache) starting from the beginning of the
log with TLS enabled on my 6 core MacBook Pro as a sanity check.
This laptop has fast SSDs so it benefits less from the larger reads
than the case where magnetic disks are used. Consumer throughput
was ~260 MB/s before the changes and ~300 MB/s after
(~15% improvement).
Credit to @junrao for pointing out that this code could be more efficient.
Reviewers: Jun Rao <junrao@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
Minor follow up to #7608: For some reason the AssignedStreamTasks#updateRestored method only updates the restoring and restoredPartitions data structures, but there is a third map holding restored tasks & partitions: restoringByPartitions
Also improves the TaskManager#closeLostTasks logging, by separating by case and logging the specific failure before throwing.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Guozhang Wang <wangguoz@gmail.com>
If there are pending bytes in the transport layer, we may
complete a send even if no bytes were recorded as written.
We assume bytes are written when they are in the netWriteBuffer,
but we only consider the send as completed when it's in
the socket channel buffer.
This fixes a regression introduced via 0971f66ff546. The impact is
that we would sometimes throw the following exception in
`MultiRecordsSend.writeTo`:
```java
if (completed())
throw new KafkaException("This operation cannot be invoked on a complete request.");
```
Added unit test verifying the bug fix. While in the area, I simplified one of the
`SslSelectorTest` methods.
Reviewers: Jun Rao <junrao@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
#7167 added a check for non-incremental offsets in `assignOffsetsNonCompressed`, which is not applicable for message format V0 and V1. Therefore, I added a condition to disable the check if the record version precedes V2.
Author: Tu Tran <tu@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7628 from tuvtran/KAFKA-9080
Currently when we identify version probing we return early from onAssignment and never get to updating the TaskManager and general state with the new assignment. Since we do actually give out "real" assignments even during version probing, a StreamThread should take real ownership of its tasks/partitions including cleaning them up in onPartitionsRevoked which gets invoked when we call onLeavePrepare as part of triggering the follow-up rebalance.
Every member will always get an assignment encoded with the lowest common version, so there should be no problem decoding a VP assignment. We should just allow onAssignment to proceed as usual so that the TaskManager is in a consistent state, and knows what all its tasks/partitions are when the first rebalance completes and the next one is triggered.
Reviewers: Boyang Chen <boyang@confluent.io>, Matthias J. Sax <mjsax@apache.org>, Guozhang Wang <wangguoz@gmail.com>
We should not use StreamsMetricsImpl. threadLevelSensor directly which would only retrieve the sensor but would not add any metrics to the sensor. Generally speaking we should always use the corresponding-level Metrics class (e.g. ThreadMetrics) to get the sensors which are populated with metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>, Matthias J. Sax <matthias@confluent.io>
Rather than maintain hand coded protocol serialization code, Streams could use the same code-generation framework as Clients/Core.
There isn't a perfect match, since the code generation framework includes an assumption that you're generating "protocol messages", rather than just arbitrary blobs, but I think it's close enough to justify using it, and improving it over time.
Using the code generation allows us to drop a lot of detail-oriented, brittle, and hard-to-maintain serialization logic in favor of a schema spec.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Boyang Chen <boyang@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
We would mistakenly increment the `OffsetCommits` metric instead
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: David Jacot <djacot@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7624 from stanislavkozlovski/minor-fix-group-coordinator-offset-expiry-metric
`log` in KafkaConsumer does not get initialized if an invalid value for group.intance.id is given during consumer construction. In this case we should skip the catch block's close procedure since no internal objects have been initialized yet.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Refactors metrics according to KIP-444
Introduces StateStoreMetrics as a central provider for state store metrics
Adds metric scope (a.k.a. store type) to the in-memory suppression buffer
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>
In 18d4e57f6e (diff-394389922df5210adf43a8b7064cc4ffL61) we unintentionally renamed the metric with the previous changes to reassignments
Author: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#7606 from stanislavkozlovski/minor-partition-reassignment-metric
Third bugfix for the failing broker bounce system test with cooperative rebalancing:
tl;dr We need to remove everything associated with a task when it is closed, but in some cases (eg AssignedTasks#commit) on a TaskMigratedExceptionwe would close it as a zombie and then (only) remove the taskId from therunning` map. This left its partitions, restorers, state stores, etc around and in an undefined state, causing exceptions when closing and/or opening the stores again.
Longer explanation:
In AssignedTasks (the abstract class from which the standby and active task variations extend) a commit failure (even due to broker down/unavailable) is treated as a TaskMigratedException after which the failed task is closed as a zombie and removed from running -- the remaining tasks (ie those still in running are then also closed as zombies in the subsequent onPartitionsLost
However we do not remove the closed task from runningByPartition nor do we remove the corresponding changelogs, if restoring, from the StoreChangelogReader since that applies only to active tasks, and AssignedTasks is generic/abstract. The changelog reader then retains a mapping from the closed task's changelog partition to its CompositeRestoreListener (and does not replace this when the new one comes along after the rebalance). The restore listener has a reference to a specific RocksDBStore instance, one which was closed when the task was closed as a zombie, so it accidentally tries to restore to the "old" RocksDBStore instance rather than the new one that was just opened.
Although technically this bug existed before KIP-429, it was only uncovered now that we remove tasks and clear their state/partitions/etc one at a time. We don't technically need to cherrypick the fix back earlier as before we just blindly clear all data structures entirely during an eager rebalance.
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
With KIP-444 the metrics definitions are refactored. Thus, Streams' SimpleBenchmark needs to be updated to correctly access the refactored metrics.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bill Bejeck <bbejeck@gmail.com>
Inside onLeavePrepare we would look into the assignment and try to revoke the owned tasks and notify users via RebalanceListener#onPartitionsRevoked, and then clear the assignment.
However, the subscription's assignment is already cleared in this.subscriptions.unsubscribe(); which means user's rebalance listener would never be triggered. In other words, from consumer client's pov nothing is owned after unsubscribe, but from the user caller's pov the partitions are not revoked yet. For callers like Kafka Streams which rely on the rebalance listener to maintain their internal state, this leads to inconsistent state management and failure cases.
Before KIP-429 this issue is hidden away since every time the consumer re-joins the group later, it would still revoke everything anyways regardless of the passed-in parameters of the rebalance listener; with KIP-429 this is easier to reproduce now.
Our fixes are following:
• Inside unsubscribe, first do onLeavePrepare / maybeLeaveGroup and then subscription.unsubscribe. This we we are guaranteed that the streams' tasks are all closed as revoked by then.
• [Optimization] If the generation is reset due to fatal error from join / hb response etc, then we know that all partitions are lost, and we should not trigger onPartitionRevoked, but instead just onPartitionsLost inside onLeavePrepare. This is because we don't want to commit for lost tracks during rebalance which is doomed to fail as we don't have any generation info.
Reviewers: Matthias J. Sax <matthias@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
The truncateHead method was removed from ProducerStateManager by github.com/apache/kafka/commit/c49775b. This meant that snapshots were no longer removed when the log start offset increased, even though the intent of that change was to remove snapshots but preserve the in-memory mapping. This patch adds the required functionality back.
Reviewers: Jason Gustafson <jason@confluent.io>
Introduces TaskMetrics class
Introduces dropped-records
Replaces skipped-records with dropped-records with latest built-in
metrics version
Does not add standby-process-ratio and active-process-ratio
Does not refactor parent sensors for processor node metrics
Reviewers: Guozhang Wang <wangguoz@gmail.com>, John Roesler <john@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
This PR fixes the inconsistency involved in the `removeMembersFromGroup` admin API calls:
1. Fail the `all()` request when there is sub level error (either partition or member)
2. Change getMembers() to members()
3. Hide the actual Errors from user
4. Do not expose generated MemberIdentity type
5. Use more consistent naming for Options and Result types
Reviewers: Guozhang Wang <wangguoz@gmail.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
Prior to this patch, partition creation would not be allowed for any topic while a reassignment is in progress. The PR makes this a topic level check. As long as a particular topic is not being reassigned, we allow partitions to be increased.
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
In the case of unclean close we still need to make sure all the stores are flushed before closing any.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Boyang Chen <boyang@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Add a paragraph explaining the producer caller's expected behavior change on record validation failure scenarios that are improved by KIP-467.
Reviewers: Tu V. Tran <tu@confluent.io>, Jason Gustafson <jason@confluent.io>
The purpose of this patch is to make the reassignment algorithm simpler and more resilient to unexpected errors. Specifically, it has the following improvements:
1. Remove `ReassignedPartitionContext`. We no longer need to track the previous reassignment through the context and we now use the assignment state as the single source of truth for the target replicas in a reassignment.
2. Remove the intermediate assignment state when overriding a previous reassignment. Instead, an overriding reassignment directly updates the assignment state and shuts down any unneeded replicas. Reassignments are _always_ persisted in Zookeeper before being updated in the controller context.
3. To fix race conditions with concurrent submissions, reassignment completion for a partition always checks for a zk partition reassignment to be removed. This means the controller no longer needs to track the source of the reassignment.
4. Api reassignments explicitly remove reassignment state from zk prior to beginning the new reassignment. This fixes an inconsistency in precedence. Upon controller failover, zookeeper reassignments always take precedence over any active reassignment. So if we do not have the logic to remove the zk reassignment when an api reassignment is triggered, then we can revert to the older zk reassignment.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jun Rao <junrao@gmail.com>
Re-implement NewPartitionReassignment#of. It now takes a list rather than a variable-length list of arguments.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Vikas Singh <vikas@confluent.io>