For static members join/rejoin, we encode the current timestamp in the new member.id. The format looks like group.instance.id-timestamp.
During consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we shall check the whether group.instance.id is known on group. If yes, we shall match the member.id stored on static membership map with the request member.id. If mismatching, this indicates a conflict consumer has used same group.instance.id, and it will receive a fatal exception to shut down.
Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed.
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Corrects the system tests to check for either a 404 or a 409 error and sleeping until the Connect REST API becomes available. This corrects a previous change to how REST extensions are initialized (#6651), which added the ability of Connect throwing a 404 if the resources are not yet started. The integration tests were already looking for 409.
Author: Magesh Nandakumar <magesh.n.kumar@gmail.com>
Reviewer: Randall Hauch <rhauch@gmail.com>
This is the first diff for the implementation of JoinGroup logic for static membership. The goal of this diff contains:
* Add group.instance.id to be unique identifier for consumer instances, provided by end user;
Modify group coordinator to accept JoinGroupRequest with/without static membership, refactor the logic for readability and code reusability.
* Add client side support for incorporating static membership changes, including new config for group.instance.id, apply stream thread client id by default, and new join group exception handling.
* Increase max session timeout to 30 min for more user flexibility if they are inclined to tolerate partial unavailability than burdening rebalance.
* Unit tests for each module changes, especially on the group coordinator logic. Crossing the possibilities like:
6.1 Dynamic/Static member
6.2 Known/Unknown member id
6.3 Group stable/unstable
6.4 Leader/Follower
The rest of the 345 change will be broken down to 4 separate diffs:
* Avoid kicking out members through rebalance.timeout, only do the kick out through session timeout.
* Changes around LeaveGroup logic, including version bumping, broker logic, client logic, etc.
* Admin client changes to add ability to batch remove static members
* Deprecate group.initial.rebalance.delay
Reviewers: Liquan Pei <liquanpei@gmail.com>, Stanislav Kozlovski <familyguyuser192@windowslive.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
As titled, this PR changed the default reset policy to latest accidentally for system tests, which in fact was earliest.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
`kafka.list_topics(...)` should not require a topic parameter
Author: Brian Bushree <bbushree@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6367 from brianbushree/list-topics-no-topic
* add a normal windowed suppress with short windows and a short grace
period
* improve the smoke test so that it actually verifies the intended
conditions
See https://issues.apache.org/jira/browse/KAFKA-7944
Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
For consumers using SSL, this timeout includes the time to create and copy keystores and truststores and sometime it takes longer than 10s to complete the security setup before starting the consumer process.
Reviewers: Ismael Juma <ismael@juma.me.uk>
The StreamsBrokerBounceTest.test_broker_type_bounce experienced what looked like a transient failure. After looking over this test and failure, it seems like it is vulnerable to timing error that streams will start before the kafka service creates all topics.
Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>
* Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests
* Collect the heap dump from the predefined location as part of the result logs for each service
* Change Connect service to delete the whole root directory instead of individual expected files
* Tested by running the full suite of system tests
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6158 from kkonstantine/KAFKA-7834
[KIP-297](https://cwiki.apache.org/confluence/display/KAFKA/KIP-297%3A+Externalizing+Secrets+for+Connect+Configurations#KIP-297:ExternalizingSecretsforConnectConfigurations-PublicInterfaces) introduced the `ConfigProvider` mechanism, which was primarily intended for externalizing secrets provided in connector configurations. However, when querying the Connect REST API for the configuration of a connector or its tasks, those secrets are still exposed. The changes here prevent the Connect REST API from ever exposing resolved configurations in order to address that. rhauch has given a more thorough writeup of the thinking behind this in [KAFKA-5117](https://issues.apache.org/jira/browse/KAFKA-5117)
Tested and verified manually. If these changes are approved unit tests can be added to prevent a regression.
Author: Chris Egerton <chrise@confluent.io>
Reviewers: Robert Yokota <rayokota@gmail.com>, Randall Hauch <rhauch@gmail.com, Ewen Cheslack-Postava <ewen@confluent.io>
Closes#6129 from C0urante/hide-provided-connect-configs
When I originally refactored the streams_upgrade_test#upgrade_downgrade_brokers test I removed the wait call which would wait for up 24 minutes for the SmokeTestDriver class to publish and verify all of its records.
Since most of the tests run in two minutes or less I set the monitor_log timeout to three minutes. However, the SmokeTestDriver#verify method allows up to six minutes to consume all records before verifying the monitor_log timeout needs to be greater than 6 minutes. I've set the timeout to 8 minutes.
Additionally, the steps needed to update the streams_upgrade_test should be documented as there are several components that need to get updated. So I've documented those steps here on the test as a giant comment.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This commit creates an EndToEndTest base class which relies on the verifiable consumer. This will ultimately replace ProduceConsumeValidateTest which depends on the console consumer. The advantage is that the verifiable consumer exposes more information to use for validation. It also allows for a nicer shutdown pattern. Rather than relying on the console consumer idle timeout, which requires a minimum wait time, we can halt consumption after we have reached the last acked offsets. This should be more reliable and faster. The downside is that the verifiable consumer only works with the new consumer, so we cannot yet convert the upgrade tests. This commit converts only the replication tests and a flaky security test to use EndToEndTest.
The StreamsUpgradeTest::test_upgrade_downgrade_brokers used sleep calls in the test which led to flaky test performance and as a result, we placed an @ignore annotation on the test. This PR uses log events instead of the sleep calls hence we can now remove the @ignore setting.
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Some Streams system tests have failed during the setup phase
due to the producer having retries disabled and getting some
transient error from the broker.
This patch adds a retries parameter to the VerifiableProducer
(default unchanged), and sets retries to 10 for Streams tests.
It also sets acks equal to the number of brokers for Streams tests.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>
Previous PR #6043 reduced throughput for VerifiableProducer in base class, but the streams_standby_replica_test needs higher throughput for consumer to complete verification in 60 seconds
Update system test and kicked off branch builder with 25 repeats https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2201/
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This PR addresses a few issues with this system test flakiness. This PR is a cherry-picked duplicate of #6041 but for the trunk branch, hence I won't repeat the inline comments here.
1. Need to grab the monitor before a given operation to observe logs for signal
2. Relied too much on a timely rebalance and only sent a handful of messages.
I've updated the test and ran it here https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2143/ parameterized for 15 repeats all passed.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This is the error message we're after:
"You may not specify an explicit partition assignment when using multiple consumers in the same group."
We apparently changed it midway through #5810 and forgot to update the test.
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>
This is a system test for doing a rolling upgrade of a topology with a named repartition topic.
1. An initial Kafka Streams application is started on 3 nodes. The topology has one operation forcing a repartition and the repartition topic is explicitly named.
2. Each node is started and processing of data is validated
3. Then one node is stopped (full stop is verified)
4. A property is set signaling the node to add operations to the topology before the repartition node which forces a renumbering of all operators (except repartition node)
5. Restart the node and confirm processing records
6. Repeat the steps for the other 2 nodes completing the rolling upgrade
I ran two runs of the system test with 25 repeats in each run for a total of 50 test runs.
All test runs passed
Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This change adds some basic system tests for delegation token based authentication:
- basic delegation token creation
- producing with a delegation token
- consuming with a delegation token
- expiring a delegation token
- producing with an expired delegation token
New files:
- delegation_tokens.py: a wrapper around kafka-delegation-tokens.sh - executed in container where a secure Broker is running (taking advantage of automatic cleanup)
- delegation_tokens_test.py: basic test to validate the lifecycle of a delegation token
Changes were made in the following file to extend their functionality:
- config_property was updated to be able to configure Kafka brokers with delegation token related settings
- jaas.conf template because a broker needs to support multiple login modules when delegation tokens are used
- consule-consumer and verifiable_producer to override KAFKA_OPTS (to specify custom jaas.conf) and the client properties (to authenticate with delegation token).
Author: Attila Sasvari <asasvari@apache.org>
Reviewers: Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Andras Katona <41361962+akatona84@users.noreply.github.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#5660 from asasvari/KAFKA-4544
This is a new system test testing for optimizing an existing topology. This test takes the following steps
1. Start a Kafka Streams application that uses a selectKey then performs 3 groupByKey() operations and 1 join creating four repartition topics
2. Verify all instances start and process data
3. Stop all instances and verify stopped
4. For each stopped instance update the config for TOPOLOGY_OPTIMIZATION to all then restart the instance and verify the instance has started successfully also verifying Kafka Streams reduced the number of repartition topics from 4 to 1
5. Verify that each instance is processing data from the aggregation, reduce, and join operation
Stop all instances and verify the shut down is complete.
6. For testing I ran two passes of the system test with 25 repeats for a total of 50 test runs.
All test runs passed
Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
KAFKA-7597: Add configurable transaction support to ProduceBenchWorker. In order to get support for serializing Optional<> types to JSON, add a new library: jackson-datatype-jdk8. Once Jackson 3 comes out, this library will not be needed.
Reviewers: Colin McCabe <cmccabe@apache.org>, Ismael Juma <ismael@juma.me.uk>
The Kafka Streams system tests fail with some regularity due to a timeout starting the broker.
The initial start is quite quick, but many of our tests involve stopping and restarting nodes with data already loaded, and also while processing is ongoing.
Under these conditions, it seems to be normal for the broker to take about 25 seconds to start, which makes the 30 second timeout pretty close for comfort.
I have seen many test failures in which the broker successfully started within a couple of seconds after the tests timed out and already initiated the failure/shut-down sequence.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
The function `setup_producer_and_consumer` is unused in the system test
framework, which incorrectly suggests subclasses should implement
it. It is not required or even referenced by the framework, so
the requirement should be removed.
Reviewers: Viktor Somogyi <viktorsomogyi@gmail.com>, Jason Gustafson <jason@confluent.io>
Add threads with separate consumers to ConsumeBenchWorker. Update the Trogdor test scripts and documentation with the new functionality.
Reviewers: Colin McCabe <cmccabe@apache.org>
Currently, the startup timeout is hardcoded to be 60 seconds in Connect's test service. Modifying it to be passable via init. This can safely be backported as well.
Reviewers: Randall Hauch <rhauch@gmail.com>, Jason Gustafson <jason@confluent.io>
This fixes the Connect standalone system tests. See branch builder: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2021/
This should be backported to the 2.0 branch, since that's when the tests were first
modified to use the external property file.
Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Corrects an error in the system tests:
```
07:55:45 [ERROR:2018-10-23 07:55:45,738]: Failed to import kafkatest.tests.connect.connect_test, which may indicate a broken test that cannot be loaded: NameError: name 'EXTERNAL_CONFIGS_FILE' is not defined
```
The constant is defined in the [services/connect.py](https://github.com/apache/kafka/blob/trunk/tests/kafkatest/services/connect.py#L43) file in the `ConnectServiceBase` class, but the problem is in the [tests/connect/connect_test.py](https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/connect/connect_test.py#L50) `ConnectStandaloneFileTest`, which does *not* extend the `ConnectServiceBase class`. Suggestions welcome to be able to reuse that variable without duplicating the literal (as in this PR).
System test run with this PR: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2004/
If approved, this should be merged as far back as the `2.0` branch.
Author: Randall Hauch <rhauch@gmail.com>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes#5832 from rhauch/fix-connect-externals-tests
1. In test_upgrade_downgrade_brokers, allow duplicates to happen.
2. In test_version_probing_upgrade, grep the generation numbers from brokers at the end, and check if they can ever be synchronized.
Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>
PR #2267 Introduced support for Zstandard compression. The relevant test expects values for `num_nodes` and `num_producers` based on the (now-incremented) count of compression types.
Passed the affected, previously-failing test:
`ducker-ak test tests/kafkatest/tests/client/compression_test.py`
Reviewers: Jason Gustafson <jason@confluent.io>
In some tests, the check monitoring the JMX tool log output doesn’t quite wait long enough before failing. Increasing the timeout from 10 to 20 seconds.
Removed ignore annotations from the upgrade tests. This PR includes the following changes for updating the upgrade tests:
* Uploaded new versions 0.10.2.2, 0.11.0.3, 1.0.2, 1.1.1, and 2.0.0 (in the associated scala versions) to kafka-packages
* Update versions in version.py, Dockerfile, base.sh
* Added new versions to StreamsUpgradeTest.test_upgrade_downgrade_brokers including version 2.0.0
* Added new versions StreamsUpgradeTest.test_simple_upgrade_downgrade test excluding version 2.0.0
* Version 2.0.0 is excluded from the streams upgrade/downgrade test as StreamsConfig needs an update for the new version, requiring a KIP. Once the community votes the KIP in, a minor follow-up PR can be pushed to add the 2.0.0 version to the upgrade test.
* Fixed minor bug in kafka-run-class.sh for classpath in upgrade/downgrade tests across versions.
* Follow on PRs for 0.10.2x, 0.11.0x, 1.0.x, 1.1.x, and 2.0.x will be pushed soon with the same updates required for the specific version.
Reviewers: Eno Thereska <eno.thereska@gmail.com>, John Roesler <vvcephei@users.noreply.github.com>, Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <matthias@confluent.io>