Prevent exception thrown by metric reporters to impact request processing and other reporters.
Co-authored-by: Mickael Maison <mickael.maison@gmail.com>
Co-authored-by: Edoardo Comar <ecomar@uk.ibm.com>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Added unit tests for ReplicaAlterLogDirsThread. Mostly focused on unit tests for truncating logic.
Fixed ReplicaAlterLogDirsThread.buildLeaderEpochRequest() to use future replica's latest epoch (not the latest epoch of replica it is fetching from). This follows the logic that offset for leader epoch request should be based on leader epoch of the follower (in this case it's the future local replica).
Also fixed PartitionFetchState constructor that takes offset and delay. The code ignored the delay parameter and used 0 for the delay. This constructor is used only by another constructor which passes delay = 0, which luckily works.
Author: Anna Povzner <anna@confluent.io>
Reviewers: Dong Lin <lindong28@gmail.com>
Closes#4918 from apovzner/kafka-6795
Currently if the client sends a produce request or a fetch request to a broker which isn't a replica,
we return UNKNOWN_TOPIC_OR_PARTITION. This is a bit surprising to see when the topic actually
exists. It would be better to return NOT_LEADER to avoid confusion. Clients typically handle both errors by refreshing metadata and retrying, so changing this should not cause any change in behavior on the client. This case can be hit following a partition reassignment after the leader is moved and the local replica is deleted.
To validate the current behavior and the fix, I've added integration tests for the fetch and produce APIs.
Start processing client connections only after completing KafkaServer initialization to ensure that credentials are loaded from ZK into cache before authentications are processed. Acceptors are started earlier so that bound port is known for registering in ZK.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Jun Rao <junrao@gmail.com>, Ismael Juma <ismael@juma.me.uk>
A partially deleted topic can end up with some partitions having no leadership info.
For the partially deleted topic, a new controller should be able to finish the topic deletion
by transitioning the rogue partition's replicas to OfflineReplica state.
This patch adds logic to transition replicas to OfflineReplica state whose partitions have
no leadership info.
Added a new test method to cover the partially deleted topic case.
Reviewers: Jun Rao <junrao@gmail.com>
Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
We had a regression in #4788 which caused the offset commit/fetch/describe APIs to fail if the groupId was empty. This should be allowed for backwards compatibility. Additionally, I have modified DeleteGroups to allow removal of the empty group, which was missed in the initial implementation. I've added a test case to ensure that we do not miss this again in the future.
Reviewers: Ismael Juma <ismael@juma.me.uk>
AsyncProducerTest gets an error about an incorrect mock when the logging
level is turned up. Instead of usIng a mock, just create a real
SyncProducerConfig object, since the object is simple to create.
Reviewers: Ismael Juma <ismael@juma.me.uk>
Move creation of quota callback instance out of KafkaConfig constructor to QuotaFactory.instantiate to avoid creating a callback instance for every KafkaConfig since we create temporary KafkaConfigs during dynamic config updates.
Reviewers: Jun Rao <junrao@gmail.com>
Add a validation check to make sure max.connections.per.ip.overrides is configured when max.connections.per.ip is set zero. Also clean up the config description.
Implementation of KIP-86. Client, server and login callback handlers have been made configurable for both brokers and clients.
Reviewers: Jun Rao <junrao@gmail.com>, Ron Dagostino <rndgstn@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
In the group coordinator, we currently check whether the partition is owned before checking whether it is loading. Since loading is a prerequisite for partition ownership, it means that it is not actually possible to see the COORDINATOR_LOAD_IN_PROGRESS error. The impact is mostly harmless: while loading the group, the client may send unnecessary FindCoordinator requests to rediscover the coordinator. I've fixed the bug and restructured the code to enable testing.
In the process of fixing this bug, the following improvements have been made:
1. We now verify valid groupId in all request handlers.
2. Currently if the coordinator is loading when a SyncGroup is received, we'll return NOT_COORDINATOR. I've changed this to return REBALANCE_IN_PROGRESS since the rebalance state will have been lost on coordinator failover. This effectively forces the consumer to rejoin the group, which seems preferable over unnecessarily rediscovering the coordinator.
3. I added a check for the COORDINATOR_LOAD_IN_PROGRESS handler in SyncGroup. Although we do not currently return this error, it seems reasonable that we might want to some day, so it seems better to get the check in now.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Prior to this, there have been instances where invalid data was allowed to be persisted in
ZooKeeper, which causes ClassCastExceptions when a broker is restarted and reads this
type-unsafe data.
Adds basic structural and type validation for the reassignment JSON via
introduction of Scala case classes that map to the expected JSON
structure. Also use the Scala case classes to deserialize the JSON
to avoid duplication.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Viktor Somogyi <viktor.somogyi@cloudera.com>, Ismael Juma <ismael@juma.me.uk>
Currently the `initTransactions()` API blocks indefinitely if the broker cannot be reached. This patch changes the behavior to raise a `TimeoutException` after waiting for `max.block.ms`.
Reviewers: Apurva Mehta <apurva@confluent.io>, Jason Gustafson <jason@confluent.io>
This resolves the issue found when running the brokers on Windows which prevents the coordinator from sending WriteTxnMarkers requests to complete a transaction.
`Thread.sleep` in `LogManager.deleteLogs` potentially blocks a scheduler thread for up to `log.segment.delete.delay.ms` with a default value of a minute. To avoid this, `deleteLogs` now deletes the logs for which `currentDefaultConfig.fileDeleteDelayMs` has elapsed after the delete was scheduled. Logs for which this interval has not yet elapsed are considered for deletion in the next iteration of `deleteLogs`, which is scheduled sooner if required.
Reviewers: Jun Rao <junrao@gmail.com>, Dong Lin <lindong28@gmail.com>, Ted Yu <yuzhihong@gmail.com>
Require a minimum value of 1 for `segment.ms` to avoid division by zero when computing random jitter.
Reviewers: Manikumar Reddy O <manikumar.reddy@gmail.com>, Jason Gustafson <jason@confluent.io>
This patch adds logic in handling the PartitionModifications event, so that if the partition count is increased when a topic deletion is still in progress, the controller will restore the data of the path /brokers/topics/"topic" to remove the added partitions.
Testing done:
Added a new test method to cover the bug
Author: Lucas Wang <luwang@linkedin.com>
Reviewers: Jiangjie (Becket) Qin <becket.qin@gmail.com>
Closes#4666 from gitlw/prevent_increasing_partition_count_during_topic_deletion
DynamicBrokerReconfigurationTest currently assumes that passwords encoded with one secret will fail with an exception if decoded with another secret and configures an old.secret in setUp. This could potentially cause test failures if a password was incorrectly decoded with the wrong secret, since the test writes passwords encoded with the new secret directly to ZooKeeper. Since old.secret is only used in one test for verifying secret rotation, this config can be moved to that test to avoid transient failures.
Reviewers: Ismael Juma <ismael@juma.me.uk>
We were unintentionally mutating the cached queue of batches prior to appending to the log. This could have several bad consequences if the append ultimately failed or was truncated. In the reporter's case, it caused the snapshot to be invalid after a segment roll. The snapshot contained producer state at offsets higher than the snapshot offset. If we ever had to load from that snapshot, the state was left inconsistent, which led to an error that ultimately crashed the replica fetcher.
The fix required some refactoring to avoid sharing the same underlying queue inside ProducerAppendInfo. I have added test cases which reproduce the invalid snapshot state. I have also made an effort to clean up logging since it was not easy to track this problem down.
One final note: I have removed the duplicate check inside ProducerStateManager since it was both redundant and incorrect. The redundancy was in the checking of the cached batches: we already check these in Log.analyzeAndValidateProducerState. The incorrectness was the handling of sequence number overflow: we were only handling one very specific case of overflow, but others would have resulted in an invalid assertion. Instead, we now throw OutOfOrderSequenceException.
Reviewers: Apurva Mehta <apurva@confluent.io>, Jun Rao <junrao@gmail.com>
1. Use JmxMixin for SimpleBenchmark (will remove the self reporting in #4744), only when loading phase is false (i.e. we are in fact starting the streams app).
2. Reported the full jmx reported metrics in log files, and in the returned data only return the max values: this is because we want to skip the warming up and cooling down periods that will have lower rate numbers, while max represents the actual rate at full speed.
3. Incorporates two other improves to JMXTool: #1241 and #2950
Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Rohan Desai <desai.p.rohan@gmail.com>
Don't return config values from `describeConfigs` if the config type cannot be determined. Obtain config types correctly for listener configs for `describeConfigs` and password encryption.
Reviewers: Jason Gustafson <jason@confluent.io>
TestUtils#produceMessages should always close the KafkaProducer, even
when there is an exception. Otherwise, the test will leak threads when
there is an error.
TestUtils#createNewProducer should create a producer with a
requestTimeoutMs of 30 seconds by default, not around 10 seconds.
This should avoid tests that flake when the load on Jenkins climbs.
Fix two cases where a very short timeout of 2 seconds was getting set.
Reviewers: Ismael Juma <ismael@juma.me.uk>
- Fix kafkaConfig initialization if there are no dynamic configs defined in ZK.
- Update DynamicListenerConfig.validateReconfiguration() to check new Listeners must be subset of listener map
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>