src-kafka

Author	SHA1	Message	Date
Boyang Chen	e981b82601	KAFKA-8500; Static member rejoin should always update member.id (#6899 ) This PR fixes a bug in static group membership. Previously we limited the `member.id` replacement in JoinGroup to cases where the group is Stable. This is error-prone and could allow duplicate consumers to read from the same topic. For example, imagine a case where two unknown members join in the `PrepareRebalance` stage at the same time. The PR fixes the following things: 1. Replace `member.id` any time a known static member rejoins the group with an unknown member.id. 2. Immediately fence any ongoing join/sync group callback to terminate the duplicate member early. 3. Clearly handle Dead/Empty cases as exceptional. 4. Return the old leader id upon static member leader rejoin to avoid triggering a trivial member assignment. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>	6 years ago
Jason Gustafson	2feb44ebc8	MINOR: Fix race condition on shutdown of verifiable producer We've seen `ReplicaVerificationToolTest.test_replica_lags` fail occasionally due to errors such as the following: ``` RemoteCommandError: ubuntuworker7: Command 'kill -15 2896' returned non-zero exit status 1. Remote error message: bash: line 0: kill: (2896) - No such process ``` The problem seems to be a shutdown race condition when using `max_messages` with the producer. The process may already be gone which will cause the signal to fail. Author: Jason Gustafson <jason@confluent.io> Reviewers: Gwen Shapira Closes #6906 from hachikuji/fix-failing-replicat-verification-test	6 years ago
Jason Gustafson	c7c310beff	MINOR: Lower producer throughput in flaky upgrade system test We see the upgrade test failing from time to time. I looked into it and found that the root cause is basically that the test throughput can be too high for the 0.9 producer to make progress. Eventually it reaches a point where it has a huge backlog of timed out requests in the accumulator which all have to be expired. We see a long run of messages like this in the output: ``` {"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386132,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335160","key":null} {"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386132,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335163","key":null} {"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386133,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335166","key":null} {"exception":"class org.apache.kafka.common.errors.TimeoutException","time_ms":1559907386133,"name":"producer_send_error","topic":"test_topic","message":"Batch Expired","class":"class org.apache.kafka.tools.VerifiableProducer","value":"335169","key":null} ``` This can continue for a long time (I have observed up to 1 min) and prevents the producer from successfully writing any new data. While it is busy expiring the batches, no data is getting delivered to the consumer, which causes it to eventually raise a timeout. 
``` kafka.consumer.ConsumerTimeoutException at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:50) at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:109) at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:69) at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:47) at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) ``` The fix here is to reduce the throughput, which seems reasonable since the purpose of the test is to verify the upgrade, which does not demand heavy load. Note that I investigated several failing instances of this test going back to 1.0 and saw a similar pattern, so there does not appear to be a regression. Author: Jason Gustafson <jason@confluent.io> Reviewers: Gwen Shapira Closes #6907 from hachikuji/lower-throughput-for-upgrade-test	6 years ago
Lucas Bradstreet	677713baf3	KAFKA-8499: ensure java is in PATH for ducker system tests (#6898 ) Reviewers: Colin P. McCabe <cmccabe@apache.org>	6 years ago
Boyang Chen	cca05cace4	KAFKA-8331: stream static membership system test (#6877 ) As the title suggests, we boot a Streams job with 3 instances and a one-minute session timeout and, once the group is stable, do a couple of rolling bounces of the entire cluster. No rejoin caused by a restart should bump the generation on the client side. Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bill Bejeck <bbejeck@gmail.com>	6 years ago
Stanislav Kozlovski	58aa04f91e	MINOR: Improve Trogdor external command worker docs (#6438 ) Reviewers: Colin McCabe <cmccabe@apache.org>, Xi Yang <xi@confluent.io>	6 years ago
Matthias J. Sax	ba3dc49437	KAFKA-8155: Add 2.2.0 release to system tests (#6597 ) Reviewers: Bill Bejeck <bill@confluent.io>, Boyang Chen <boyang@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Guozhang Wang <guozhang@confluent.io>	6 years ago
Konstantine Karantasis	55d07e717e	KAFKA-8473: Adjust Connect system tests for incremental cooperative rebalancing (#6872 ) Author: Konstantine Karantasis <konstantine@confluent.io> Reviewer: Randall Hauch <rhauch@gmail.com>	6 years ago
Matthias J. Sax	55bfea1378	KAFKA-8155: Add 2.1.1 release to system tests (#6596 ) Reviewers: Bill Bejeck <bill@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <guozhang@confluent.io>	6 years ago
Alex Diachenko	77a9a108ff	KAFKA-8418: Wait until REST resources are loaded when starting a Connect Worker. (#6840 ) Author: Alex Diachenko <sansanichfb@gmail.com> Reviewers: Arjun Satish <arjun@confluent.io>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>	6 years ago
Alex Diachenko	4838855ea7	MINOR: Fix red herring when ConnectDistributedTest.test_bounce fails. (#6838 ) Author: Alex Diachenko <sansanichfb@gmail.com> Reviewer: Randall Hauch <rhauch@gmail.com>	6 years ago
Bill Bejeck	f249956390	MINOR: Account for different versions in upgrade (#6835 ) Reviewers: Guozhang Wang <wangguoz@gmail.com>, Bruno Cadonna <bruno@confluent.io>	6 years ago
Matthias J. Sax	d286051a21	MINOR: fix Streams version-probing system test (#6764 ) Reviewers: John Roesler <john@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>, Boyang Chen <boyang@confluent.io>	6 years ago
Ismael Juma	c823f32070	MINOR: Add 2.0, 2.1 and 2.2 to broker and client compat tests These are important to ensure we don't break compatibility. Author: Ismael Juma <ismael@juma.me.uk> Reviewers: Gwen Shapira Closes #6794 from ijuma/update-version-compat-tests	6 years ago
Konstantine Karantasis	c6d083d7fc	KAFKA-8417: Remove redundant network definition --net=host when starting testing docker containers (#6797 ) Reviewers: Colin P. McCabe <cmccabe@apache.org>	6 years ago
Manikumar Reddy	5ca6a2ee94	MINOR: Use `jps` cmd to find out the pid of TransactionalMessageCopier Author: Manikumar Reddy <manikumar.reddy@gmail.com> Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com> Closes #6787 from omkreddy/transaction_test	6 years ago
Colin Patrick McCabe	87ff83a82e	MINOR: Bump version to 2.4.0-SNAPSHOT (#6774 ) Reviewers: Jason Gustafson <jason@confluent.io>	6 years ago
Jason Gustafson	b52170372b	MINOR: Increase security test timeouts for transient failures (#6760 ) Reviewers: Ismael Juma <ismael@juma.me.uk>	6 years ago
Boyang Chen	9fa331b811	KAFKA-8225 & KIP-345 part-2: fencing static member instances with conflicting group.instance.id (#6650 ) When a static member joins or rejoins, we encode the current timestamp in the new member.id. The format looks like group.instance.id-timestamp. During consumer/broker interaction logic (Join, Sync, Heartbeat, Commit), we check whether the group.instance.id is known to the group. If so, we match the member.id stored in the static membership map against the member.id in the request. A mismatch indicates that a conflicting consumer has used the same group.instance.id, and it will receive a fatal exception and shut down. Right now the only missing part is the system test. Will work on it offline while getting the major logic changes reviewed. Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>	6 years ago
Vahid Hashemian	16ece15fb3	MINOR: Include StickyAssignor in system tests (#5223 ) Reviewers: Jason Gustafson <jason@confluent.io>	6 years ago
Magesh Nandakumar	40d5c9fac9	KAFKA-8352 : Fix Connect System test failure 404 Not Found (#6713 ) Corrects the system tests to check for either a 404 or a 409 error and sleeping until the Connect REST API becomes available. This corrects a previous change to how REST extensions are initialized (#6651), which added the ability of Connect throwing a 404 if the resources are not yet started. The integration tests were already looking for 409. Author: Magesh Nandakumar <magesh.n.kumar@gmail.com> Reviewer: Randall Hauch <rhauch@gmail.com>	6 years ago
Boyang Chen	0f995ba6be	KAFKA-7862 & KIP-345 part-one: Add static membership logic to JoinGroup protocol (#6177 ) This is the first diff for the implementation of JoinGroup logic for static membership. The goal of this diff contains: * Add group.instance.id to be unique identifier for consumer instances, provided by end user; Modify group coordinator to accept JoinGroupRequest with/without static membership, refactor the logic for readability and code reusability. * Add client side support for incorporating static membership changes, including new config for group.instance.id, apply stream thread client id by default, and new join group exception handling. * Increase max session timeout to 30 min for more user flexibility if they are inclined to tolerate partial unavailability than burdening rebalance. * Unit tests for each module changes, especially on the group coordinator logic. Crossing the possibilities like: 6.1 Dynamic/Static member 6.2 Known/Unknown member id 6.3 Group stable/unstable 6.4 Leader/Follower The rest of the 345 change will be broken down to 4 separate diffs: * Avoid kicking out members through rebalance.timeout, only do the kick out through session timeout. * Changes around LeaveGroup logic, including version bumping, broker logic, client logic, etc. * Admin client changes to add ability to batch remove static members * Deprecate group.initial.rebalance.delay Reviewers: Liquan Pei <liquanpei@gmail.com>, Stanislav Kozlovski <familyguyuser192@windowslive.com>, Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>	6 years ago
Boyang Chen	847957cbea	KAFKA-8291 : System test fix (#6637 ) As titled, this PR changed the default reset policy to latest accidentally for system tests, which in fact was earliest. Reviewers: Guozhang Wang <wangguoz@gmail.com>	6 years ago
Ismael Juma	7d9e93ac6d	MINOR: Use https instead of http in links (#6477 ) Verified that the https links work. I didn't update the license header in this PR since that touches so many files. Will file a separate one for that. Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>	6 years ago
David Arthur	409fabc561	KAFKA-7747; Check for truncation after leader changes [KIP-320] (#6371 ) After the client detects a leader change we need to check the offset of the current leader for truncation. These changes were part of KIP-320: https://cwiki.apache.org/confluence/display/KAFKA/KIP-320%3A+Allow+fetchers+to+detect+and+handle+log+truncation. Reviewers: Jason Gustafson <jason@confluent.io>	6 years ago
Guozhang Wang	4aa2cfe467	MINOR: Tighten up metadata upgrade test (#6531 ) Reviewers: Bill Bejeck <bbejeck@gmail.com>	6 years ago
Stanislav Kozlovski	0d55f0f3ec	KAFKA-8102: Add an interval-based Trogdor transaction generator (#6444 ) This patch adds a TimeIntervalTransactionsGenerator class which enables the Trogdor ProduceBench worker to commit transactions based on a configurable millisecond time interval. Also, we now handle 409 create task responses in the coordinator command-line client by printing a more informative message. Reviewers: Colin P. McCabe <cmccabe@apache.org>	6 years ago
Brian Bushree	860e957999	MINOR: list-topics should not require topic param `kafka.list_topics(...)` should not require a topic parameter Author: Brian Bushree <bbushree@confluent.io> Reviewers: Ewen Cheslack-Postava <ewen@confluent.io> Closes #6367 from brianbushree/list-topics-no-topic	6 years ago
Stanislav Kozlovski	f20f3c1a97	MINOR: Update Trogdor ConnectionStressWorker status at the end of execution (#6445 ) Reviewers: Colin P. McCabe <cmccabe@apache.org>	6 years ago
John Roesler	8e97540071	KAFKA-7944: Improve Suppress test coverage (#6382 ) * add a normal windowed suppress with short windows and a short grace period * improve the smoke test so that it actually verifies the intended conditions See https://issues.apache.org/jira/browse/KAFKA-7944 Reviewers: Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>	6 years ago
Rajini Sivaram	460b3a6259	KAFKA-8070: Increase consumer startup timeout in system tests (#6405 ) For consumers using SSL, this timeout includes the time to create and copy keystores and truststores, and sometimes it takes longer than 10s to complete the security setup before starting the consumer process. Reviewers: Ismael Juma <ismael@juma.me.uk>	6 years ago
Guozhang Wang	057bb35f24	HOTFIX: add ignore import to streams_upgrade_test	6 years ago
Guozhang Wang	dfae20ecee	MINOR: disable Streams system test for broker upgrade/downgrade (#6341 ) Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>	6 years ago
Bill Bejeck	071f62a356	MINOR: refactor topic check to make sure all topics exist by name vs doing a topic count (#6271 ) Reviewers: John Roesler <john@confluent.io>, Matthias J. Sax <matthias@confluent.io>	6 years ago
Bill Bejeck	be76560011	MINOR: Add all topics created check streams broker bounce test (trunk) (#6243 ) The StreamsBrokerBounceTest.test_broker_type_bounce experienced what looked like a transient failure. After looking over this test and the failure, it seems vulnerable to a timing error in which Streams starts before the Kafka service has created all topics. Reviewers: Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>	6 years ago
Matthias J. Sax	d2575f03a3	MINOR: Bump version to 2.3.0-SNAPSHOT (#6226 ) * MINOR: Bump version to 2.3.0-SNAPSHOT * Github comment	6 years ago
Colin Patrick McCabe	4be68c58da	KAFKA-7828: Add ExternalCommandWorker to Trogdor (#6219 ) Allow the Trogdor agent to execute external commands. The agent communicates with the external commands via stdin, stdout, and stderr. Based on a patch by Xi Yang <xi@confluent.io> Reviewers: David Arthur <mumrah@gmail.com>	6 years ago
Konstantine Karantasis	83c435f3ba	KAFKA-7834: Extend collected logs in system test services to include heap dumps * Enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file.bin> in the major services in system tests * Collect the heap dump from the predefined location as part of the result logs for each service * Change Connect service to delete the whole root directory instead of individual expected files * Tested by running the full suite of system tests Author: Konstantine Karantasis <konstantine@confluent.io> Reviewers: Ewen Cheslack-Postava <ewen@confluent.io> Closes #6158 from kkonstantine/KAFKA-7834	6 years ago
Konstantine Karantasis	0dbb064963	MINOR: Upgrade ducktape to 0.7.5 (#6197 ) Reviewed-by: Colin P. McCabe <cmccabe@apache.org>	6 years ago
Colin Patrick McCabe	a79d6dcdb6	KAFKA-7793: Improve the Trogdor command line. (#6133 ) * Allow the Trogdor agent to be started in "exec mode", where it simply runs a single task and exits after it is complete. * For AgentClient and CoordinatorClient, allow the user to pass the path to a file containing JSON, instead of specifying the JSON object in the command-line text itself. This means that we can get rid of the bash scripts whose only function was to load task specs into a bash string and run a Trogdor command. * Print dates and times in a human-readable way, rather than as numbers of milliseconds. * When listing tasks or workers, output human-readable tables of information. * Allow the user to filter on task ID name, task ID pattern, or task state. * Support a --json flag to provide raw JSON output if desired. Reviewed-by: David Arthur <mumrah@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>	6 years ago
Kan Li	f8e8d62f56	MINOR: ducker-ak: add down -f, avoid using a terminal in ducker test When using ./ducker-ak test on Jenkins, the script complains that there is no TTY. To fix this, we should skip passing -t to docker exec. We do not need a pseudo-TTY to run the tests. Similarly, we should skip passing -i, since we do not need to keep stdin open. The down command should have a force option, specified as -f or --force. Reviewed-by: Colin P. McCabe <cmccabe@apache.org>	6 years ago
Chris Egerton	743607af5a	KAFKA-5117: Stop resolving externalized configs in Connect REST API [KIP-297](https://cwiki.apache.org/confluence/display/KAFKA/KIP-297%3A+Externalizing+Secrets+for+Connect+Configurations#KIP-297:ExternalizingSecretsforConnectConfigurations-PublicInterfaces) introduced the `ConfigProvider` mechanism, which was primarily intended for externalizing secrets provided in connector configurations. However, when querying the Connect REST API for the configuration of a connector or its tasks, those secrets are still exposed. The changes here prevent the Connect REST API from ever exposing resolved configurations in order to address that. rhauch has given a more thorough writeup of the thinking behind this in [KAFKA-5117](https://issues.apache.org/jira/browse/KAFKA-5117) Tested and verified manually. If these changes are approved unit tests can be added to prevent a regression. Author: Chris Egerton <chrise@confluent.io> Reviewers: Robert Yokota <rayokota@gmail.com>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io> Closes #6129 from C0urante/hide-provided-connect-configs	6 years ago
Xi Yang	cc33511e9a	MINOR: Support choosing different JVMs when running integration tests + Add a parameter to ducker-ak to control the OpenJDK base image. + Fix a few issues of using OpenJDK:11 as the base image. Author: Xi Yang <xi@confluent.io> Reviewers: Ewen Cheslack-Postava <ewen@confluent.io> Closes #6071 from yangxi/ducktape-jdk	6 years ago
Bill Bejeck	3746bf2c84	MINOR: Need to have same wait as verify timeout broker upgrade downgrade (#6127 ) When I originally refactored the streams_upgrade_test#upgrade_downgrade_brokers test I removed the wait call which would wait for up to 24 minutes for the SmokeTestDriver class to publish and verify all of its records. Since most of the tests run in two minutes or less I set the monitor_log timeout to three minutes. However, the SmokeTestDriver#verify method allows up to six minutes to consume all records before verifying, so the monitor_log timeout needs to be greater than 6 minutes. I've set the timeout to 8 minutes. Additionally, the steps needed to update the streams_upgrade_test should be documented as there are several components that need to get updated. So I've documented those steps here on the test as a giant comment. Reviewers: Guozhang Wang <wangguoz@gmail.com>	6 years ago
Bill Bejeck	b1b792d9a8	MINOR: Add 2.1 version metadata upgrade (#6111 ) Updated the test_metadata_upgrade test. To enable using the 2.1 version I needed to add a config change to the StreamsUpgradeTestJobRunnerService to ensure that ducktape passes the proper args when starting the StreamsUpgradeTest. For testing, I ran the test_metadata_upgrade test and all versions now pass http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2019-01-09--001.1547049873--bbejeck--MINOR_add_2_1_version_metadata_upgrade--a450c68/report.html Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>	6 years ago
Bill Bejeck	515e680c71	MINOR: Put states in proper order, increase timeout for starting (#6105 ) Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>	6 years ago
Jason Gustafson	f9a22f42a8	KAFKA-7773; Add end to end system test relying on verifiable consumer (#6070 ) This commit creates an EndToEndTest base class which relies on the verifiable consumer. This will ultimately replace ProduceConsumeValidateTest which depends on the console consumer. The advantage is that the verifiable consumer exposes more information to use for validation. It also allows for a nicer shutdown pattern. Rather than relying on the console consumer idle timeout, which requires a minimum wait time, we can halt consumption after we have reached the last acked offsets. This should be more reliable and faster. The downside is that the verifiable consumer only works with the new consumer, so we cannot yet convert the upgrade tests. This commit converts only the replication tests and a flaky security test to use EndToEndTest.	6 years ago
Bill Bejeck	404bdef08d	MINOR: Remove sleep calls and ignore annotation from streams upgrade test (#6046 ) The StreamsUpgradeTest::test_upgrade_downgrade_brokers used sleep calls in the test which led to flaky test performance and as a result, we placed an @ignore annotation on the test. This PR uses log events instead of the sleep calls hence we can now remove the @ignore setting. Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>	6 years ago
John Roesler	ef9204dc58	MINOR: improve resilience of Streams test producers (#6028 ) Some Streams system tests have failed during the setup phase due to the producer having retries disabled and getting some transient error from the broker. This patch adds a retries parameter to the VerifiableProducer (default unchanged), and sets retries to 10 for Streams tests. It also sets acks equal to the number of brokers for Streams tests. Reviewers: Matthias J. Sax <matthias@confluent.io>, Bill Bejeck <bill@confluent.io>, Guozhang Wang <guozhang@confluent.io>	6 years ago
Bill Bejeck	639427a38f	MINOR: Increase throughput for VerifiableProducer in test (#6060 ) Previous PR #6043 reduced throughput for VerifiableProducer in base class, but the streams_standby_replica_test needs higher throughput for consumer to complete verification in 60 seconds Update system test and kicked off branch builder with 25 repeats https://jenkins.confluent.io/job/system-test-kafka-branch-builder/2201/ Reviewers: Guozhang Wang <wangguoz@gmail.com>	6 years ago
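
Note on 40d5c9fac9 (KAFKA-8352): the message describes treating both 404 and 409 as "not ready yet" and sleeping until the Connect REST API becomes available. A minimal Python sketch of that retry pattern follows. It is not the code from the commit: `call_with_retries`, `request_fn`, and the retry and backoff defaults are hypothetical names and values chosen for illustration.

```python
import time

# Statuses treated as "try again later". The commit message names 404 and 409;
# the per-status comments below are an interpretation, not taken from the commit.
RETRIABLE_STATUSES = {404, 409}  # 404: REST resources not loaded yet; 409: worker busy


def call_with_retries(request_fn, retries=5, backoff_s=0.1, sleep=time.sleep):
    """Call request_fn() until it returns a non-retriable status.

    request_fn is a hypothetical callable returning (status_code, body).
    Raises RuntimeError if every attempt returns a retriable status.
    """
    status = None
    for attempt in range(retries + 1):
        status, body = request_fn()
        if status not in RETRIABLE_STATUSES:
            return status, body
        if attempt < retries:
            sleep(backoff_s)  # wait before the next attempt
    raise RuntimeError("still getting HTTP %d after %d retries" % (status, retries))
```

Passing `sleep` as a parameter keeps the helper testable without real delays; a real test would wrap an HTTP call in `request_fn`.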

428 Commits (93bf96589471acadfb90e57ebfecbd91f679f77b)
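
Note on 9fa331b811 (KAFKA-8225): the message describes a member.id of the form group.instance.id-timestamp and a check of the request's member.id against the one stored in the static membership map. The Python sketch below models that check only. The broker logic itself is not Python, and `StaticMembers`, `FencedInstanceError`, and the millisecond timestamp are hypothetical choices made for illustration.

```python
import time


class FencedInstanceError(Exception):
    """Hypothetical stand-in for the fatal error sent to a conflicting consumer."""


class StaticMembers:
    """Toy model of a static membership map: group.instance.id -> member.id."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._member_ids = {}

    def join(self, instance_id):
        # Encode the current timestamp in the new member.id; a rejoin under the
        # same group.instance.id replaces the stored member.id.
        member_id = "%s-%d" % (instance_id, int(self._clock() * 1000))
        self._member_ids[instance_id] = member_id
        return member_id

    def validate(self, instance_id, member_id):
        # Used on later requests (sync, heartbeat, commit): a known instance id
        # with a different member.id means another consumer took it over.
        known = self._member_ids.get(instance_id)
        if known is not None and known != member_id:
            raise FencedInstanceError(
                "%s is fenced: expected %s, got %s" % (instance_id, known, member_id))
```

In this model, the consumer that joined first is the one that gets fenced once a second consumer joins with the same group.instance.id, because its member.id no longer matches the stored one.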