This PR does the following:
* Remove the StreamsRepeatingIntegerKeyProducerService and the associated Java class
* Add a parameter to VerifiableProducer.java to enable sending keys when specified
* Update the corresponding Python file verifiable_producer.py to support the new parameter.
Reviewers: Matthias J Sax <matthias@confluentio>, Guozhang Wang <wangguoz@gmail.com>
Implement destroying tasks and workers. This means erasing all record of them on the Coordinator and the Agent.
Workers should be identified by unique 64-bit worker IDs, rather than by the names of the tasks they are implementing. This ensures that when a task is destroyed and re-created with the same task ID, the old workers will be not be treated as part of the new task instance.
Fix some return results from RPCs. In some cases RPCs were returning values that were never used. Attempting to re-create the same task ID with different arguments should fail. Add RequestConflictException to represent HTTP error code 409 (CONFLICT) for this scenario.
If only one worker in a task stops, don't stop all the other workers for that task, unless the worker that stopped had an error.
Reviewers: Anna Povzner <anna@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
Added consumer only workload to Trogdor. The topics must already be pre-populated. The spec lets the user request topic pattern and range of partitions to assign to [startPartition, endPartition].
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
Currently, WorkerUtils will be able to create topics when there is no security. To be able to work with secure kafka, WorkerUtils.createTopic() needs to be able to take security configs. This PR adds commonClientConf field to both producer bench and roundtrip workload specs so that users can specify security and other common configs once for producer/consumer and adminClient. Also added adminClientConf field to workload specs so that users can specify adminClient specific configs if they want to. For completeness, added consumerConf and producerConf to roundtrip workload spec.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin P. Mccabe <cmccabe@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
Added configs to ProducerBenchSpec:
topicPrefix: name of topics will be of format topicPrefix + topic index. If not provided, default is "produceBenchTopic".
partitionsPerTopic: number of partitions per topic. If not provided, default is 1.
replicationFactor: replication factor per topic. If not provided, default is 3.
The behavior of producer bench is changed such that if some or all topics already exist (with topic names = topicPrefix + topic index), and they have the same number of partitions as requested, the worker uses those topics and does not fail. The producer bench fails if one or more existing topics has number of partitions that is different from expected number of partitions.
Added unit test for WorkerUtils -- for existing methods and new methods.
Fixed bug in MockAdminClient, where createTopics() would over-write existing topic's replication factor and number of partitions while correctly completing the appropriate futures exceptionally with TopicExistsException.
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
Make PayloadGenerator an interface which can have multiple implementations: constant, uniform random, sequential.
Allow different payload generators to be used for keys and values.
This change fixes RoundTripWorkload. Previously RoundTripWorkload was unable to get the sequence number of the keys that it produced.
AgentClient and CoordinatorClient should have the option of logging failures to custom log4j objects. There should also be builders for these objects, to make them easier to extend in the future.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
It generates the producer payload (key and value) and makes sure that the values are
populated to target a realistic compression rate (0.3 - 0.4) if compression is used.
The generated payload is deterministic and can be replayed from a given position.
For now, all generated values are constant size, and key types can be configured
to be either null or 8 bytes.
Added messageSize parameter to producer spec, that specifies produced
key + message size.
Changing KafkaFuture.Future and KafkaFuture.BiConsumer into an interface makes
them a functional interface. This makes them Java 8 lambda compatible.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Author: Steven Aerts <steven.aerts@gmail.com>
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Xavier Léauté <xl+github@xvrl.net>, Tom Bentley <tbentley@redhat.com>, Ewen Cheslack-Postava <ewen@confluent.io>
Closes#4033 from steven-aerts/KAFKA-6018
Handle InvalidTypeIdException as NOT_IMPLEMENTED and add unit tests for all exceptions.
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Rajini Sivaram <rajinisivaram@googlemail.com>
When a log entry is appended to a Kafka topic using `KafkaLog4jAppender`, the producer.send operation may block waiting for metadata. This can result in deadlocks in a couple of scenarios if a log entry from the producer network thread is also at a log level that results in the entry being appended to a Kafka topic.
1. Producer's network thread will attempt to send data to a Kafka topic and this is unsafe since producer.send may block waiting for metadata, causing a deadlock since the thread will not process the metadata request/response.
2. `KafkaLog4jAppender#append` is invoked while holding the lock of the logger. So the thread waiting for metadata in the initial send will be holding the logger lock. If the producer network thread has.a log entry that needs to be appended, it will attempt to acquire the logger lock and deadlock.
This is a temporary workaround to avoid deadlocks in system tests by setting log level to WARN for `Metadata` in `VerifiableLog4jAppender`. The fix has been verified using the system tests log4j_appender_test.py which started failing when the info-level log entry was introduced.
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Satish Duggana <satish.duggana@gmail.com>, tedyu <yuzhihong@gmail.com>
Closes#4375 from rajinisivaram/KAFKA-6415-log4jappender
* Implement process stop faults via SIGSTOP / SIGCONT
* Implement RoundTripWorkload, which both sends messages, and confirms that they are received at least once.
* Allow Trogdor tasks to block until other Trogdor tasks are complete.
* Add CreateTopicsWorker, which can be a building block for a lot of tests.
* Simplify how TaskSpec subclasses in ducktape serialize themselves to JSON.
* Implement some fault injection tests in round_trip_workload_test.py
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4323 from cmccabe/KAFKA-5849
For ducktape: add Kibosh to the testing Dockerfile.
Create files_unreadable_fault_spec.py.
For trogdor: create FilesUnreadableFaultSpec.java.
Add a unit test of using the Kibosh service.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#4195 from cmccabe/KAFKA-5811
This is follow up to #4072 which added the PushHttpMetricsReporter and converted some services to use it. We somehow missed some compatibility issues that made the ProducerPerformance tool fail when using a newer tools jar with older common/clients jar, which we do with some system tests so we have all the features we need in the tool but can build compatibility tests for older releases.
This just adjusts some API usage to make the tool compatible with all previous releases.
I have a full run of the tests starting [here](https://jenkins.confluent.io/job/system-test-kafka-branch-builder/1122/)
Author: Ewen Cheslack-Postava <me@ewencp.org>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#4214 from ewencp/fix-compatibility-sanity-check-tests
Previously, Trogdor only handled "Faults." Now, Trogdor can handle
"Tasks" which may be either faults, or workloads to execute in the
background.
The Agent and Coordinator have been refactored from a
mutexes-and-condition-variables paradigm into a message passing
paradigm. No locks are necessary, because only one thread can access
the task state or worker state. This makes them a lot easier to reason
about.
The MockTime class can now handle mocking deferred message passing
(adding a message to an ExecutorService with a delay). I added a
MockTimeTest.
MiniTrogdorCluster now starts up Agent and Coordinator classes in
paralle in order to minimize junit test time.
RPC messages now inherit from a common Message.java class. This class
handles implementing serialization, equals, hashCode, etc.
Remove FaultSet, since it is no longer necessary.
Previously, if CoordinatorClient or AgentClient hit a networking
problem, they would throw an exception. They now retry several times
before giving up. Additionally, the REST RPCs to the Coordinator and
Agent have been changed to be idempotent. If a response is lost, and
the request is resent, no harm will be done.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#4073 from cmccabe/KAFKA-6060
Author: Tom Bentley <tbentley@redhat.com>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4157 from tombentley/KAFKA-6130-verifiable-consumer-max-messages
Multiple inflights means that when there are rolling bounces or other cluster instability, there is an increased likelihood of having previously tried batch expire in the accumulator. This is a fatal error
for a transactional producer, causing the `TransactionalMessageCopier` to exit. To work around this, we bump the request timeout. We can get rid of this when KIP-91 is merged.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4039 from apurvam/MINOR-bump-request-timeout-in-transactional-message-copier
With these changes, we are ensuring that the partitions being reassigned are from non-zero offsets. We also ensure that every message in the log has producerId and sequence number.
This means that it successfully reproduces https://issues.apache.org/jira/browse/KAFKA-6003.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#4029 from apurvam/KAFKA-6016-add-idempotent-producer-to-reassign-partitions
Since we removed the unused `TRACE` option from `SecurityProtocol`, it now seems safer to expose it from `AuthenticationContext`. Additionally this patch exposes javadocs under security.auth and relocates the `Login` and `AuthCallbackHandler` to a non-public package.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes#3863 from hachikuji/use-security-protocol-in-auth-context
Adds new metrics to support health checks:
1. Error rates for each request type, per-error code
2. Request size and temporary memory size
3. Message conversion rate and time
4. Successful and failed authentication rates
5. ZooKeeper latency and status
6. Client version
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#3705 from rajinisivaram/KAFKA-5746-new-metrics
Here we introduce client and broker changes to support multiple inflight requests while still guaranteeing idempotence. Two major problems to be solved:
1. Sequence number management on the client when there are request failures. When a batch fails, future inflight batches will also fail with `OutOfOrderSequenceException`. This must be handled on the client with intelligent sequence reassignment. We must also deal with the fatal failure of some batch: the future batches must get different sequence numbers when the come back.
2. On the broker, when we have multiple inflights, we can get duplicates of multiple old batches. With this patch, we retain the record metadata for 5 older batches.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#3743 from apurvam/KAFKA-5494-increase-max-in-flight-for-idempotent-producer
In the coordinator, we should check that 'shutdown' is not true before going to sleep waiting for the condition.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Apurva Mehta <apurva@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#3755 from cmccabe/KAFKA-5806
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Rajini Sivaram <rajinisivaram@googlemail.com>
Closes#3699 from cmccabe/trogdor-review
Clean up includes:
- Switching try-catch-finally blocks to try-with-resources when possible
- Removing some seemingly unnecessary `SuppressWarnings` annotations
- Resolving some Java warnings
- Closing unclosed Closable objects
- Removing unused code
Author: Vahid Hashemian <vahidhashemian@us.ibm.com>
Reviewers: Balint Molnar <balintmolnar91@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <matthias@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Closes#3222 from vahidhashemian/minor/code_cleanup_1706
With this patch, the `ProducePerfomance` tool can create transactions of differing durations.
This patch was used to to collect the initial set of benchmarks for transaction performance, documented here: https://docs.google.com/spreadsheets/d/1dHY6M7qCiX-NFvsgvaE0YoVdNq26uA8608XIh_DUpI4/edit#gid=282787170
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jun Rao <junrao@gmail.com>
Closes#3400 from apurvam/MINOR-add-transaction-size-to-producre-perf
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Colin P. Mccabe <cmccabe@confluent.io>, Jason Gustafson <jason@confluent.io>
Closes#3339 from ijuma/kafka-5275-admin-client-api-consistency
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Apurva Mehta <apurva@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Closes#3340 from hachikuji/add-random-aborts-to-system-test
Before this patch, we would call `producerBatch.done` directly from the accumulator when expiring batches. This meant that we would not transition to the `ABORTABLE_ERROR` state in the transaction manager, allowing other transactional requests (including Commits!) to go through, even though the produce failed.
This patch modifies the logic so that we call `Sender.failBatch` on every expired batch, thus ensuring that the transaction state is accurate.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
Closes#3252 from apurvam/KAFKA-5385-fail-transaction-if-batches-expire
This currently fails in multiple ways. One of which is most likely KAFKA-5355, where the concurrent consumer reads duplicates.
During broker bounces, the concurrent consumer misses messages completely. This is another bug.
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes#3217 from apurvam/KAFKA-5366-add-concurrent-reads-to-transactions-system-test
It avoids the need to handle protocol downgrades and it's safe (i.e. it will never cause
the auto creation of topics).
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes#3220 from ijuma/kafka-5374-admin-client-metadata