This patch fixes a bug in the sending of transactional requests. We need to call `KafkaClient.send` with an updated current time. Failing to do so can result in an `IllegalStateExcepton` which leaves the producer effectively dead since the in-flight correlation id has been set, but no request has been sent. To avoid the same problem in the future, we update the in flight correlationId only after sending the request.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Apurva Mehta <apurva@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
When shutting down the ReplicaFetcher thread, we may fail to unregister sensors in selector.close(). When that happened, we will fail to start up the ReplicaFetcherThread with the same fetch id again because of the IllegalArgumentException in sensor registration. This issue will cause constant URPs in the cluster because the ReplicaFetchterThread is gone.
This patch addresses this issue by introducing a try-finally block in selector.close() so that we will always unregister the sensors in shutting down ReplicaFetcherThreads.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Jason Gustafson <jason@confluent.io>
`ConfigDef.embeddedValidator` should return an Anonymous Object instead of lambda so that we can have a useful `toString()` for methods such as `toRst`.
Reviewers: Jason Gustafson <jason@confluent.io>
In NetworkClient.leastLoadedNode, we invoke `isReady` to check if an established connection exists for the given node. `isReady` checks whether metadata needs to be updated also which wants to make metadata request first priority. However, if the to-be-sent request is metadata request, then we do not have to check this otherwise the loop in `leastLoadedNode` will do a complete iteration until the final node is selected.
Reviewers: Jason Gustafson <jason@confluent.io>
This is the first diff for the implementation of JoinGroup logic for static membership. The goal of this diff contains:
* Add group.instance.id to be unique identifier for consumer instances, provided by end user;
Modify group coordinator to accept JoinGroupRequest with/without static membership, refactor the logic for readability and code reusability.
* Add client side support for incorporating static membership changes, including new config for group.instance.id, apply stream thread client id by default, and new join group exception handling.
* Increase max session timeout to 30 min for more user flexibility if they are inclined to tolerate partial unavailability than burdening rebalance.
* Unit tests for each module changes, especially on the group coordinator logic. Crossing the possibilities like:
6.1 Dynamic/Static member
6.2 Known/Unknown member id
6.3 Group stable/unstable
6.4 Leader/Follower
The rest of the 345 change will be broken down to 4 separate diffs:
* Avoid kicking out members through rebalance.timeout, only do the kick out through session timeout.
* Changes around LeaveGroup logic, including version bumping, broker logic, client logic, etc.
* Admin client changes to add ability to batch remove static members
* Deprecate group.initial.rebalance.delay
Reviewers: Liquan Pei <liquanpei@gmail.com>, Stanislav Kozlovski <familyguyuser192@windowslive.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
In the KafkaProducer#close method we have a debug log statement Kafka producer has been closed then a few lines later a KafkaException can occur.
This could be confusing to users, so this PR simply moves the log statement to after the possible exception to avoid confusing information in the logs.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Streams previously flushed stores in the order of their registration, which is arbitrary. Because stores may forward values upon flush (as in cached state stores), we must flush stores in topological order.
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
Several issues have come to light since the 2.2.0 release:
upon restore, suppress incorrectly set the record metadata using the changelog record, instead of preserving the original metadata
restoring a tombstone incorrectly didn't update the buffer size and min-timestamp
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Reviewers: Colin P. McCabe <cmccabe@apache.org>, Viktor Somogyi <viktorsomogyi@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk>
Currently close() only awaits completion of pending produce requests. If there is a transaction ongoing, it may be dropped. For example, if one thread is calling commitTransaction() and another calls close(), then the commit may never happen even if the caller is willing to wait for it (by using a long timeout). What's more, the thread blocking in commitTransaction() will be stuck since the result will not be completed once the producer has shutdown.
This patch ensures that 1) completing transactions are awaited, 2) ongoing transactions are aborted, and 3) pending callbacks are completed before close() returns.
Reviewers: Jason Gustafson <jason@confluent.io>
We have not had great experience with listeners. They make the code harder to understand because they result in indirectly maintained circular dependencies. Often this leads to tricky deadlocks when we try to introduce locking. We were able to remove the Metadata listener in KAFKA-7831. Here we do the same for the listener in SubscriptionState.
Reviewers: Viktor Somogyi-Vass <viktorsomogyi@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
This patch updates the InitProducerId request API to use the generated sources. It also fixes a small bug in the DescribeAclsRequest class where we were using the wrong api key.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Colin McCabe <cmccabe@apache.org>
Protocol compatibility can be facilitated if a Struct, that has been defined as an extension of a previous Struct by adding fields at the end of the older version, can read a message of an older version by ignoring the absence of the missing new fields. Reading the missing fields should be allowed by the definition of these fields (they have to be nullable) when supported by the schema.
Reviewers: David Arthur <mumrah@gmail.com>, Randall Hauch <rhauch@gmail.com>, Jason Gustafson <jason@confluent.io>
Ensure that modification time is checked against the file used to create the SSLContext that is in-use so that SSLContext is updated whenever file is modified and a config update request is received.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
ToString functions must not get a NullPointException. read() functions
must properly translate a negative array length to a null field.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
The goal of this task is to implement an integration test for the kafka stream metrics.
We have to check 2 things:
1. After streams application are started, all metrics from different levels (thread, task, processor, store, cache) are correctly created and displaying recorded values.
2. When streams application are shutdown, all metrics are correctly de-registered and removed.
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
TopicDescription and ConsumerGroupDescription in org.apache.kafka.clients.admin. are part of the public API, so we should retain the existing public constructor. Changed the new constructor with authorized operations to be package-private to avoid maintaining more public constructors since we only expect admin client to use this.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Adds a new listener config `max.connections` to limit the number of active connections on each listener. The config may be prefixed with listener prefix. This limit may be dynamically reconfigured without restarting the broker.
This is one of the PRs for KIP-402 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-402%3A+Improve+fairness+in+SocketServer+processors). Note that this is currently built on top of PR #6022
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Gwen Shapira <cshapi@gmail.com>
Closes#6034 from rajinisivaram/KAFKA-7730-max-connections
Users have reported (KAFKA-7565) that when consumer poll wake up is used,
it is possible to receive fetch responses that don't match the copied topic
partitions collection for the session when the fetch request was created.
This commit improves the error handling here by throwing an
IllegalStateException instead of a NullPointerException. And by
generating a message for the exception that includes a bit of more
information.
Reviewers: Jason Gustafson <jason@confluent.io>
There is a small timing window where ```time.sleep(retryBackoff)``` will get executed before adminClient adds retry request to the queue. Added a check to wait until the retry call added to the queue in AdminClient.
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#6418 from omkreddy/KAFKA-7939
Metadata may be updated from the background thread, so we need to protect access to SubscriptionState. This patch restructures the metadata handling so that we only check pattern subscriptions in the foreground. Additionally, it improves the following:
1. SubscriptionState is now the source of truth for the topics that will be fetched. We had a lot of messy logic previously to try and keep the the topic set in Metadata consistent with the subscription, so this simplifies the logic.
2. The metadata needs for the producer and consumer are quite different, so it made sense to separate the custom logic into separate extensions of Metadata. For example, only the producer requires topic expiration.
3. We've always had an edge case in which a metadata change with an inflight request may cause us to effectively miss an expected update. This patch implements a separate version inside Metadata which is bumped when the needed topics changes.
4. This patch removes the MetadataListener, which was the cause of https://issues.apache.org/jira/browse/KAFKA-7764.
Reviewers: David Arthur <mumrah@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
In KAFKA-5503, we have added a check for `running` flag in the loop inside maybeWaitForProducerId. This is to handle concurrent call to Sender close(), while we attempt to get the ProducerId. This avoids blocking indefinitely when the producer is shutting down.
This created a corner case, where Sender thread gets blocked, if we had concurrent producerId reset and call to Sender thread close. The fix here is to check the `forceClose` flag in the loop inside maybeWaitForProducerId instead of the `running` flag.
Reviewers: Jason Gustafson <jason@confluent.io>
SelectorTest.testCloseConnectionInClosingState sends and receives messages to get the channel into a state with staged receives and then waits for idle timeout to close the channel. When run with SSL, the channel may have buffered bytes that prevent the channel from being closed. Updated the test to wait until there are no buffered bytes as well.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Test uses 100ms as connections.max.reauth.ms and checks that a second reauthentication doesn't occur within the hard-coded 1 second minimum interval. But since the interval is small, we cannot guarantee that the time between the two checks is not higher than 1 second. Change the test to use MockTime so that we can control the time.
Reviewers: Ron Dagostino <rndgstn@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting
* Resolve host only when currentAddress() is called
Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.
* Add Javadoc to ClusterConnectionStates.connecting()