In PR #8962 we introduced a sentinel UNKNOWN_OFFSET to mark unknown offsets in checkpoint files. The sentinel was set to -2 which is the same value used for the sentinel LATEST_OFFSET that is used in subscriptions to signal that state stores have been used by an active task. Unfortunately, we missed to skip UNKNOWN_OFFSET when we compute the sum of the changelog offsets.
If a task had only one state store and it did not restore anything before the next rebalance, the stream thread wrote -2 (i.e., UNKNOWN_OFFSET) into the subscription as sum of the changelog offsets. During assignment, the leader interpreted the -2 as if the stream run the task as active although it might have run it as standby. This misinterpretation of the sentinel value resulted in unexpected task assignments.
Ports: KAFKA-10287 / #9066
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, John Roesler <vvcephei@apache.org>, Matthias J. Sax <mjsax@apache.org>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, David Jacot <djacot@confluent.io>, Lee Dongjin <dongjin@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>
The start() function for global stream thread only checks whether the thread is not running, as it needs to block until it finishes the initialization. This PR fixes this behavior by adding a check whether the thread is already in error state as well.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, John Roesler <vvcephei@apache.org>
Modified KafkaProducer.sendOffsetsToTransaction() to be affected with max.block.ms, and added timeout test for blocking methods
Reviewers: Boyang Chen <boyang@confluent.io>, Guozhang Wang <wangguoz@gmail.com>, Xi Hu <huxi_2b@hotmail.com>
Add a broker to controller channel manager for use cases such as redirection and AlterIsr.
Reviewers: David Arthur <mumrah@gmail.com>, Colin P. McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
Co-authored-by: Viktor Somogyi <viktorsomogyi@gmail.com>
Co-authored-by: Boyang Chen <boyang@confluent.io>
Update jersey license from CDDL to EPLv2
Author: Rens Groothuijsen <l.groothuijsen@alumni.maastrichtuniversity.nl>
Reviewer: Randall Hauch <rhauch@gmail.com>
A system test failed with the following error: global name 'self' is not defined
The reason was that `self` was accessed to log a message in a static method. This commit makes the method an instance method.
Reviewer: Matthias J. Sax <matthias@confluent.io>
I left out updates that could be risky. Preliminary testing indicates
we can build (including spotBugs) and run tests with Java 15 with
these changes. I will do more thorough testing once Java 15 reaches
release candidate stage in a few weeks.
Minor updates with mostly bug fixes:
- Scala: 2.12.11 -> 2.12.12 (compiler and collection performance improvements)
- Bouncy castle: 1.64 -> 1.66 (several bug fixes)
- HttpClient: 4.5.11 -> 4.5.12 (small number of bug fixes)
- Mockito: 3.3.3 -> 3.4.4 (several bug fixes and Java 15 support)
- Netty: 4.5.10 -> 4.5.11 (several bug fixes)
- Snappy: 1.1.7.3 -> 1.1.7.6 (small number of bug fixes)
- Zstd: 1.4.5-2 -> 1.4.5-6 (small number of bug fixes)
Gradle plugin and library upgrades:
- Gradle version plugins: 0.28.0 -> 0.29.0 (small number of bug fixes)
- Git: 4.0.1 -> 4.0.2 (small number of bug fixes)
- Scoverage plugin: 4.0.1 -> 4.0.2 (small number of bug fixes)
- Shadow plugin: 5.2.0 -> 6.0.0 (Java 15 support and require Gradle 6.0)
- Test Retry plugin: 1.1.5 -> 1.1.6 (small number of bug fixes)
- Spotless plugin: 4.4.4 -> 5.1.0 (several internal changes that should not matter to us)
- Spotbugs: 4.0.3 -> 4.0.6 (small number of bug fixes)
- Spotbugs plugin: 4.2.4 -> 4.4.4 (small number of bug fixes)
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
While debugging a rebalance scenario I found that inside rejoinNeededOrPending when we trigger rebalance due to metadata or subscription changes it is not logged, and hence it's actually a bit tricky to find out the reason of the triggered rebalance. I'm adding two INFO log4j entries to fill in the gap.
Other requestRejoin() calls are already covered.
Reviewers: Boyang Chen <boyang@confluent.io>
Java 11 has been recommended for a while, but the ops section had not been updated.
Also added `-XX:+ExplicitGCInvokesConcurrent` which has been in `kafka-run-class`
for a while. Finally, tweaked the text slightly to read better.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Set `replica.fetch.max.bytes` to `1` and produce multiple record batches to allow
for throttling to take place. This helps avoid a race condition where the
reassignment would complete more quickly than expected causing an
assertion to fail some times.
Reviewers: Lucas Bradstreet <lucas@confluent.io>, Jason Gustafson <jason@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Ismael Juma <ismael@juma.me.uk>
AbstractProcessorContext topic() throws NullPointerException when modifying a state store within the DSL from a punctuator. Reorder the check to avoid the NPE.
Co-authored-by: Ashish Roy <v-ashish.r@turvo.com>
Reviewers: Boyang Chen <boyang@confluent.io>
https://issues.apache.org/jira/browse/KAFKA-10305
When `kafka-consumer-perf-test.sh` is executed without required options or no options at all, only the error message is displayed. It's better off showing the usage as well like what we did for kafka-console-producer.sh.
We would previously update the map by adding the new replicas to the map and then removing the old ones.
During a recent refactoring, we changed the logic to first clear the map and then add all the replicas to it.
While this is done in a write lock, not all callers that access the map structure use a lock. It is safer to revert to
the previous behavior of showing the intermediate state of the map with extra replicas, rather than an
intermediate state of the map with no replicas.
Reviewers: Ismael Juma <ismael@juma.me.uk>
* KAFKA-10268: dynamic config like "--delete-config log.retention.ms" doesn't work
https://issues.apache.org/jira/browse/KAFKA-10268
Currently, ConfigCommand --delete-config API does not restore the config to default value, no matter at broker-level or broker-default level. Besides, Admin.incrementalAlterConfigs API also runs into this problem. This patch fixes it by removing the corresponding config from the newConfig properties when reconfiguring dynamic broker config.
This flaky test exists for a long time, and it happened more frequently recently. (also happened in my PR testing!! :( ) In KAFKA-8264 and KAFKA-8460, it described the issue for this test is that
> Timed out before consuming expected 2700 records. The number consumed was xxxx
I did some investigation. This test is to test:
> we consume all partitions if fetch max bytes and max.partition.fetch.bytes are low.
And what it did, is to create 3 topics and 30 partitions for each. And then, iterate through all 90 partitions to send 30 records for each. Finally, verify the we can consume all the records successfully.
What the error message saying is that it cannot consume all the records in time (might be the busy system) So, we can actually decrease the record size to avoid it. I checked all the error messages we collected in KAFKA-8264 and KAFKA-8460, the failed cases can always consume at least 1440 up (total is 2700). So, I set the records half size of the original setting, it'll become 1350 records in total. It should make this test more stable.
Author: Luke Chen <showuon@gmail.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#8885 from showuon/KAFKA-8264
- part of KIP-216
- adds new sub-classes of InvalidStateStoreException
Reviewers: Navinder Pal Singh Brar <navinder_brar@yahoo.com>, Matthias J. Sax <matthias@confluent.io>
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at last as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
This PR implements the broker side changes of KIP-599, except the changes of the Rate implementation which will be addressed separately. The PR changes/introduces the following:
- It introduces the protocol changes.
- It introduces a new quota manager ControllerMutationQuotaManager which is another specialization of the ClientQuotaManager.
- It enforces the quota in the KafkaApis and in the AdminManager. This part handles new and old clients as described in the KIP.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
- part of KIP-572
- deprecates producer config `retries` (still in use)
- deprecates admin config `retries` (still in use)
- deprecates Kafka Streams config `retries` (will be ignored)
- adds new Kafka Streams config `task.timeout.ms` (follow up PRs will leverage this new config)
Reviewers: John Roesler <john@confluent.io>, Jason Gustafson <jason@confluent.io>, Randall Hauch <randall@confluent.io>
- After #8312, older brokers are returning empty configs, with latest `adminClient.describeConfigs`. Old brokers are receiving empty configNames in `AdminManageer.describeConfigs()` method. Older brokers does not handle empty configKeys. Due to this old brokers are filtering all the configs.
- Update ClientCompatibilityTest to verify describe configs
- Add test case to test describe configs with empty configuration Keys
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#9046 from omkreddy/KAFKA-9432
add a timeout for event queue time histogram;
reset eventQueueTimeHist when the controller event queue is empty;
Reviewers: José Armando García Sancio <jsancio@users.noreply.github.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Igor Soarez <i@soarez.me>, Jun Rao <junrao@gmail.com>
Add docs for KIP-441 and KIP-613.
Fixed some miscellaneous unrelated issues in the docs:
* Adds some missing configs to the Streams config docs: max.task.idle.ms,topology.optimization, default.windowed.key.serde.inner.class, and default.windowed.value.serde.inner.class
* Defines the previously-undefined default windowed serde class configs, including choosing a default (null) and giving them a doc string, so the yshould nwo show up in the auto-generated general Kafka config docs
* Adds a note to warn users about the rocksDB bug that prevents setting a strict capacity limit and counting write buffer memory against the block cache
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <vvcephei@apache.org>
Currently, the system tests `connect_distributed_test` and `connect_rest_test` only wait for the REST api to come up.
The startup of the worker includes an asynchronous process for joining the worker group and syncing with other workers.
There are some situations in which this sync takes an unusually long time, and the test continues without all workers up.
This leads to flakey test failures, as worker joins are not given sufficient time to timeout and retry without waiting explicitly.
This changes the `ConnectDistributedTest` to wait for the Joined group message to be printed to the logs before continuing with tests. I've activated this behavior by default, as it's a superset of the checks that were performed by default before.
This log message is present in every version of DistributedHerder that I could find, in slightly different forms, but always with `Joined group` at the beginning of the log message. This change should be safe to backport to any branch.
Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
Brokers currently return NOT_LEADER_FOR_PARTITION to producers and REPLICA_NOT_AVAILABLE to consumers if a replica is not available on the broker during reassignments. Non-Java clients treat REPLICA_NOT_AVAILABLE as a non-retriable exception, Java consumers handle this error by explicitly matching the error code even though it is not an InvalidMetadataException. This PR renames NOT_LEADER_FOR_PARTITION to NOT_LEADER_OR_FOLLOWER and uses the same error for producers and consumers. This is compatible with both Java and non-Java clients since all clients handle this error code (6) as retriable exception. The PR also makes ReplicaNotAvailableException a subclass of InvalidMetadataException.
- ALTER_REPLICA_LOG_DIRS continues to return REPLICA_NOT_AVAILABLE. Retained this for compatibility since this request never returned NOT_LEADER_FOR_PARTITION earlier.
- MetadataRequest version 0 also returns REPLICA_NOT_AVAILABLE as topic-level error code for compatibility. Newer versions filter these out and return Errors.NONE, so didn't change this.
- Partition responses in MetadataRequest return REPLICA_NOT_AVAILABLE to indicate that one of the replicas is not available. Did not change this since NOT_LEADER_FOR_PARTITION is not suitable in this case.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>, Bob Barrett <bob.barrett@confluent.io>
Add log4j entry summarizing the assignment (previous owned and assigned) at the consumer level.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>
The test case `OffsetValidationTest.test_fencing_static_consumer` fails periodically due to this error:
```
Traceback (most recent call last):
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 134, in run
data = self.run_test()
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 192, in run_test
return self.test_context.function(self.test)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/mark/_mark.py", line 429, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/home/jenkins/workspace/system-test-kafka_2.6/kafka/tests/kafkatest/tests/client/consumer_test.py", line 257, in test_fencing_static_consumer
assert len(consumer.dead_nodes()) == num_conflict_consumers
AssertionError
```
When a consumer stops, there is some latency between when the shutdown is observed by the service and when the node is added to the dead nodes. This patch fixes the problem by giving some time for the assertion to be satisfied.
Reviewers: Boyang Chen <boyang@confluent.io>
When fencing producers, we currently blindly bump the epoch by 1 and write an abort marker to the transaction log. If the log is unavailable (for example, because the number of in-sync replicas is less than min.in.sync.replicas), we will roll back the attempted write of the abort marker, but still increment the epoch in the transaction metadata cache. During periods of prolonged log unavailability, producer retires of InitProducerId calls can cause the epoch to be increased to the point of exhaustion, at which point further InitProducerId calls fail because the producer can no longer be fenced. With this patch, we track whenever we have failed to write the bumped epoch, and when that has happened, we don't bump the epoch any further when attempting to fence. This is safe because the in-memory epoch is still causes old producers to be fenced.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
security_rolling_upgrade_test may change the security listener and then restart Kafka servers. has_sasl and has_ssl get out-of-date due to cached _security_config. This PR offers a simple fix that we always check the changes of port mapping and then update the sasl/ssl flag.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>