This patch removes the NewPartitionReassignment#of() method in favor of a simple constructor. Said method was confusing due to breaking two conventions - always returning a non-empty Optional and thus not being used as a static factory method.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. McCabe <cmccabe@apache.org>
1. Avoid a buffer allocation and a buffer copy per file read.
2. Ensure we flush `netWriteBuffer` successfully before reading from
disk to avoid wasted disk reads.
3. 32k reads instead of 8k reads to reduce the number of disk reads
(improves efficiency for magnetic drives and reduces the number of
system calls).
4. Update SslTransportLayer.write(ByteBuffer) to loop until the socket
buffer is full or the src buffer has no remaining bytes.
5. Renamed `MappedByteBuffers` to `ByteBufferUnmapper` since it's also
applicable for direct byte buffers.
6. Skip empty `RecordsSend`
7. Some minor clean-ups for readability.
I ran a simple consumer perf benchmark on a 6 partition topic (large
enough not to fit into page cache) starting from the beginning of the
log with TLS enabled on my 6 core MacBook Pro as a sanity check.
This laptop has fast SSDs so it benefits less from the larger reads
than the case where magnetic disks are used. Consumer throughput
was ~260 MB/s before the changes and ~300 MB/s after
(~15% improvement).
Credit to @junrao for pointing out that this code could be more efficient.
Reviewers: Jun Rao <junrao@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
If there are pending bytes in the transport layer, we may
complete a send even if no bytes were recorded as written.
We assume bytes are written when they are in the netWriteBuffer,
but we only consider the send as completed when it's in
the socket channel buffer.
This fixes a regression introduced via 0971f66ff546. The impact is
that we would sometimes throw the following exception in
`MultiRecordsSend.writeTo`:
```java
if (completed())
throw new KafkaException("This operation cannot be invoked on a complete request.");
```
Added unit test verifying the bug fix. While in the area, I simplified one of the
`SslSelectorTest` methods.
Reviewers: Jun Rao <junrao@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
Currently when we identify version probing we return early from onAssignment and never get to updating the TaskManager and general state with the new assignment. Since we do actually give out "real" assignments even during version probing, a StreamThread should take real ownership of its tasks/partitions including cleaning them up in onPartitionsRevoked which gets invoked when we call onLeavePrepare as part of triggering the follow-up rebalance.
Every member will always get an assignment encoded with the lowest common version, so there should be no problem decoding a VP assignment. We should just allow onAssignment to proceed as usual so that the TaskManager is in a consistent state, and knows what all its tasks/partitions are when the first rebalance completes and the next one is triggered.
Reviewers: Boyang Chen <boyang@confluent.io>, Matthias J. Sax <mjsax@apache.org>, Guozhang Wang <wangguoz@gmail.com>
Rather than maintain hand coded protocol serialization code, Streams could use the same code-generation framework as Clients/Core.
There isn't a perfect match, since the code generation framework includes an assumption that you're generating "protocol messages", rather than just arbitrary blobs, but I think it's close enough to justify using it, and improving it over time.
Using the code generation allows us to drop a lot of detail-oriented, brittle, and hard-to-maintain serialization logic in favor of a schema spec.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Boyang Chen <boyang@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
`log` in KafkaConsumer does not get initialized if an invalid value for group.intance.id is given during consumer construction. In this case we should skip the catch block's close procedure since no internal objects have been initialized yet.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Inside onLeavePrepare we would look into the assignment and try to revoke the owned tasks and notify users via RebalanceListener#onPartitionsRevoked, and then clear the assignment.
However, the subscription's assignment is already cleared in this.subscriptions.unsubscribe(); which means user's rebalance listener would never be triggered. In other words, from consumer client's pov nothing is owned after unsubscribe, but from the user caller's pov the partitions are not revoked yet. For callers like Kafka Streams which rely on the rebalance listener to maintain their internal state, this leads to inconsistent state management and failure cases.
Before KIP-429 this issue is hidden away since every time the consumer re-joins the group later, it would still revoke everything anyways regardless of the passed-in parameters of the rebalance listener; with KIP-429 this is easier to reproduce now.
Our fixes are following:
• Inside unsubscribe, first do onLeavePrepare / maybeLeaveGroup and then subscription.unsubscribe. This we we are guaranteed that the streams' tasks are all closed as revoked by then.
• [Optimization] If the generation is reset due to fatal error from join / hb response etc, then we know that all partitions are lost, and we should not trigger onPartitionRevoked, but instead just onPartitionsLost inside onLeavePrepare. This is because we don't want to commit for lost tracks during rebalance which is doomed to fail as we don't have any generation info.
Reviewers: Matthias J. Sax <matthias@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
This PR fixes the inconsistency involved in the `removeMembersFromGroup` admin API calls:
1. Fail the `all()` request when there is sub level error (either partition or member)
2. Change getMembers() to members()
3. Hide the actual Errors from user
4. Do not expose generated MemberIdentity type
5. Use more consistent naming for Options and Result types
Reviewers: Guozhang Wang <wangguoz@gmail.com>, David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
Re-implement NewPartitionReassignment#of. It now takes a list rather than a variable-length list of arguments.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Vikas Singh <vikas@confluent.io>
This is not guaranteed to actually fix queryOnRebalance, since the
failure could never be reproduced locally. I did not bump timeouts
because it looks like that has been done in the past for this test
without success. Instead this change makes the following improvements:
It waits for the application to be in a RUNNING state before
proceeding with the test.
It waits for the remaining instance to return to RUNNING state
within a timeout after rebalance. I observed once that we were able to
do the KV queries but the instance was still in REBALANCING, so this
should reduce some opportunity for flakiness.
The meat of this change: we now iterate over all keys in one shot
(vs. one at a time with a timeout) and collect various failures, all of
which are reported at the end. This should help us to narrow down the
cause of flakiness if it shows up again.
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
The RPC code generator should support using the ByteBuffer class in addition to byte arrays. By using the ByteBuffer class, we can avoid performing a copy in many situations. Also modify TestByteBufferDataTest to test the new feature.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Guozhang Wang <wangguoz@gmail.com>
Currently we only record completed sends and receives in the selector metrics. If there is a disconnect in the middle of the respective operation, then it is not counted. The metrics will be more accurate if we take into account partial sends and receives.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com
This patch adds a `toString()` implementation to `PartitionReassignment`. It also makes the `ListPartitionReassignmentsResult` constructor use default access, which is the standard for the admin client *Result classes.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Reviewers: Jason Gustafson <jason@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>, Vikas Singh <vikas@confluent.io>, Colin P. McCabe <cmccabe@apache.org>
Allow routing of `AdminClient#describeTopics` to any broker in the cluster than just the controller, so that we don't create a hotspot for this API call. `AdminClient#describeTopics` uses the broker's metadata cache which is asynchronously maintained, so routing to brokers other than the controller is not expected to have a significant difference in terms of metadata consistency; all metadata requests are eventually consistent.
This patch also fixes a few flaky test failures.
Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
AbstractRequestResponse should be an interface, since it has no concrete elements or implementation. Move AbstractRequestResponse#serialize to RequestUtils#serialize and make it package-private, since it doesn't need to be public.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The KafkaConsumer Fetcher can sometimes get into an invalid state where it believes that there are ongoing fetch requests, but in fact there are none. This may be caused by the heartbeat thread concurrently handling a disconnection event just after the fetcher thread submits a request which would cause the Fetcher to enter an invalid state where it believes it has ongoing requests to the disconnected node but in fact it does not. This is due to a thread safety issue in the Fetcher where it was possible for the ordering of the modifications to the nodesWithPendingFetchRequests to be incorrect - the Fetcher was adding it after the listener had already been invoked, which would mean that pending node never gets removed again.
This PR addresses that thread safety issue by ensuring that the pending node is added to the nodesWithPendingFetchRequests before the listener is added to the future, ensuring the finally block is called after the node is added.
Reviewers: Tom Lee, Jason Gustafson <jason@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
This PR contains the fix of race condition bug between "consumer thread" and "consumer coordinator heartbeat thread". It reproduces in many production environments.
Condition for reproducing:
1. Consumer thread initiates rejoin to the group because of commit timeout. Call of AbstractCoordinator#joinGroupIfNeeded which leads to sendJoinGroupRequest.
2. JoinGroupResponseHandler writes to the AbstractCoordinator.this.generation new generation data and leaves the synchronized section.
3. Heartbeat thread executes mabeLeaveGroup and clears generation data via resetGenerationOnLeaveGroup.
4. Consumer thread executes onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment); with the cleared generation data. This leads to the corresponding exception.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
By default, if the user does not configure a `client.id`, then we use a very generic identifier, such as `consumer-15`. It is more useful to include identifying information when available such as `group.id` for the consumer and `transactional.id` for the producer.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
One of the sticky assignor tests involves a random change in subscriptions that the current assignor algorithm struggles to react to and in cooperative mode ends up requiring more than one followup rebalance.
Apparently, in rare cases it can also require more than 2. Bumping the "allowed subsequent rebalances" to 4 (increase of 2) to allow some breathing room and reduce flakiness (technically any number is "correct", but if it turns out to ever require more than 4 we should revisit and improve the algorithm because that would be excessive (see KAFKA-8767)
Reviewers: Guozhang Wang <wangguoz@gmail.com>
All the changes are in ReplicaManager.appendToLocalLog and ReplicaManager.appendRecords. Also, replaced LogAppendInfo.unknownLogAppendInfoWithLogStartOffset with LogAppendInfo.unknownLogAppendInfoWithAdditionalInfo to include those 2 new fields.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
Since generation is private in AbstractCoordinator, I need to modify the generation() to let it return the object directly.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bill@confluent.io>
The deleteRecords API in the AdminClient groups records to be sent by the partition leaders. If one of these requests fails, we currently fail all futures, including those tied to requests sent to other leaders. It would be better to fail only those partitions included in the failed request.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Key improvements with this PR:
* tasks will remain available for IQ during a rebalance (but not during restore)
* continue restoring and processing standby tasks during a rebalance
* continue processing active tasks during rebalance until the RecordQueue is empty*
* only revoked tasks must suspended/closed
* StreamsPartitionAssignor tries to return tasks to their previous consumers within a client
* but do not try to commit, for now (pending KAFKA-7312)
Reviewers: John Roesler <john@confluent.io>, Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This patch adds flexible version support for the following inter-broker APIs: ControlledShutdown, LeaderAndIsr, UpdateMetadata, and StopReplica. Version checks have been removed from `getErrorResponse` methods since they were redundant given the checks in `AbstractRequest` and the respective`*Data` types.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Minor tweak to gracefully handle a possible IllegalStateException while checking a metric value.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
https://issues.apache.org/jira/browse/KAFKA-3705
Allows for a KTable to map its value to a given foreign key and join on another KTable keyed on that foreign key. Applies the joiner, then returns the tuples keyed on the original key. This supports updates from both sides of the join.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>, Boyang Chen <boyang@confluent.io>, Christopher Pettitt <cpettitt@confluent.io>, Bill Bejeck <bbejeck@gmail.com>, Jan Filipiak <Jan.Filipiak@trivago.com>, pgwhalen, Alexei Daniline
I tried to add a test for this, but it's actually pretty hard to verify what we want to
verify. I could add a test that checks the correlation field after the connection
has been established, but it would not catch this kind of bug where the issue is not
the value we store, but the value we create the request header with.
I have another PR that avoids intermediate structs during serialization/deserialization,
which has a test that fails without this change. So we'll get coverage that way.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Implemented KIP-481 by adding support for deserializing Connect DECIMAL values encoded in JSON as numbers, in addition to raw byte array (base64) format used previously.
Author: Almog Gavra <almog@confluent.io>
Reviewers: Chris Egerton <chrise@confluent.io>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>
As noted in the KIP-467, the updated ProduceResponse is
```
Produce Response (Version: 8) => [responses] throttle_time_ms
responses => topic [partition_responses]
topic => STRING
partition_responses => partition error_code base_offset log_append_time log_start_offset
partition => INT32
error_code => INT16
base_offset => INT64
log_append_time => INT64
log_start_offset => INT64
error_records => [INT32] // new field, encodes the relative offset of the records that caused error
error_message => STRING // new field, encodes the error message that client can use to log itself
throttle_time_ms => INT32
with a new error code:
```
INVALID_RECORD(86, "Some record has failed the validation on broker and hence be rejected.", InvalidRecordException::new);
Reviewers: Jason Gustafson <jason@confluent.io>, Magnus Edenhill <magnus@edenhill.se>, Guozhang Wang <wangguoz@gmail.com>
In addition to the existing metrics added in KAFKA-8609, add the total (cumulative) time spent during rebalances.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Replaced UpdateMetadata{Request, Response}, LeaderAndIsr{Request, Response}
and StopReplica{Request, Response} with the automated protocol classes.
Updated the JSON schema for the 3 request types to be more consistent and
less strict (if needed to avoid duplication).
The general approach is to avoid generating new collections in the request
classes. Normalization happens in the constructor to make this possible. Builders
still have to group by topic to maintain the external ungrouped view.
Introduced new tests for LeaderAndIsrRequest and UpdateMetadataRequest to
verify that the new logic is correct.
A few other clean-ups/fixes in code that was touched due to these changes:
* KAFKA-8956: Refactor DelayedCreatePartitions#updateWaiting to avoid modifying
collection in foreach.
* Avoid unnecessary allocation for state change trace logging if trace logging is not enabled
* Use `toBuffer` instead of `toList`, `toIndexedSeq` or `toSeq` as it generally performs
better and it matches the performance characteristics of `java.util.ArrayList`. This is
particularly important when passing such instances to Java code.
* Minor refactoring for clarity and readability.
* Removed usage of deprecated `/:`, unused imports and unnecessary `var`s.
* Include exception in `AdminClientIntegrationTest` failure message.
* Move StopReplicaRequest verification in `AuthorizerIntegrationTest` to the end
to match the comment.
Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
Profiling while benchmarking shows unnecessary calls to
`responseDataToLogString` in FetchSessionHandler when logging was set to
INFO level. This leads to 1.47% of the JVM CPU time going to this method.
Fix it by checking if debug logging is enabled.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Previous KafkaStreamsTest takes 2min20s on my local laptop, because lots of its integration test which is producing / consuming records, and checking state directory file system takes lots of time. On the other hand, these tests should be well simplified with mocks.
This test reduces the test from a clumsy integration test class into a unit tests with mocks of its internal modules. And some other test functions should not be in KafkaStreamsTest actually and have been moved to other modular test classes. Now it takes 2s.
Also it helps removing the potential flakiness of the following (some of them are claimed resolved only because we have not seen them recently, but after looking at the test code I can verify they are still flaky):
* KAFKA-5818 (the original JIRA ticket indeed exposed a real issue that has been fixed, but the test itself remains flaky)
* KAFKA-6215
* KAFKA-7921
* KAFKA-7990
* KAFKA-8319
* KAFKA-8427
Reviewers: Bill Bejeck <bill@confluent.io>, John Roesler <john@confluent.io>, Bruno Cadonna <bruno@confluent.io>