1. In both the RocksDBMetrics and Metrics integration tests, we do not need to wait for the consumer to consume records from the output topics, since the sensors / metrics are registered upon task creation.
2. Merged the two RocksDB test cases into one app that creates two state stores (non-segmented and segmented).
With these two changes, the local runtime of these two tests dropped from 2+ and 3+ minutes, respectively, to under a minute.
Reviewers: Bruno Cadonna <bruno@confluent.io>, Matthias J. Sax <matthias@confluent.io>
One of the new RocksDB unit tests creates a non-temporary RocksDB directory in whatever directory the test is run from, leaving some RocksDB files behind after the test(s) are done. We should use the tempDirectory helper for this test instead.
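A minimal sketch of the intended pattern, assuming the test module's TestUtils.tempDirectory() helper and the RocksDB JNI API:

```java
import java.io.File;

import org.apache.kafka.test.TestUtils;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDBTempDirExample {
    public static void main(final String[] args) throws RocksDBException {
        // Open the DB under a managed temp directory (cleaned up after the
        // test) instead of whatever directory the test is run from.
        final File dbDir = TestUtils.tempDirectory();
        try (final Options options = new Options().setCreateIfMissing(true);
             final RocksDB db = RocksDB.open(options, dbDir.getAbsolutePath())) {
            db.put("key".getBytes(), "value".getBytes());
        }
    }
}
```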
Reviewers: Guozhang Wang <wangguoz@gmail.com>
The integration test RocksDBMetricsIntegrationTest takes pretty long to complete.
Most of the runtime is spent in the two tests that verify whether the RocksDB
metrics get actual measurements from RocksDB. Those tests need to wait for the thread
that collects the measurements of the RocksDB metrics to trigger the first recordings
of the metrics.
This PR adds a unit test that verifies whether the Kafka Streams metrics get the
measurements from RocksDB and removes the two integration tests that verified it
before. The verification of the creation and scheduling of the RocksDB metrics
recording trigger thread is already contained in KafkaStreamsTest and consequently
it is not part of this PR.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
When it comes to actually closing a task we now treat all states exactly the same, and call StateManagerUtil#closeStateManager regardless of whether the task is in CREATED, RESTORING, or RUNNING.
Unfortunately, StateManagerUtil does not check that we actually own the lock for the task's state. During a dirty close with EOS enabled we wipe the state, but in some cases this means deleting the state out from under another StreamThread that is still in the process of revoking this task.
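A minimal sketch of the missing guard; StateDirectory and the lock()/unlock() signatures below are simplified stand-ins for the Streams internals, not the exact fix:

```java
import java.io.IOException;

public class SafeStateWipe {

    // Stand-in for the Streams StateDirectory; signatures are assumptions.
    interface StateDirectory {
        boolean lock(String taskId) throws IOException;
        void unlock(String taskId) throws IOException;
    }

    static void maybeWipeState(final StateDirectory dir, final String taskId) throws IOException {
        // Only wipe if we actually own the state lock; otherwise another
        // StreamThread may still be revoking this task.
        if (!dir.lock(taskId)) {
            return; // someone else holds the lock: skip the wipe
        }
        try {
            wipeStateStores(taskId);
        } finally {
            dir.unlock(taskId);
        }
    }

    private static void wipeStateStores(final String taskId) {
        // delete local state for the task (elided)
    }
}
```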
Reviewers: Guozhang Wang <wangguoz@gmail.com>
We were hitting an IllegalStateException: There is already a changelog registered for ... in trunk-eos because we failed to call TaskManager#cleanup on unrevoked tasks that we end up closing in handleAssignment after a failed batch commit.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Add type bounds to the ProcessorContext, which bounds the types that can be forwarded to child nodes.
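As an illustration of the idea (the shapes below are simplified, not the exact Kafka Streams signatures): once the context carries forward-type bounds, forwarding a mismatched type becomes a compile-time error.

```java
public class TypedContextSketch {

    // Stand-in for a context whose forward types are bounded.
    interface ProcessorContext<KForward, VForward> {
        void forward(KForward key, VForward value);
    }

    static class UpperCaseProcessor {
        void process(final ProcessorContext<String, String> context,
                     final String key, final String value) {
            // Compiles: the forwarded types match the declared bounds.
            context.forward(key, value.toUpperCase());
            // context.forward(key, 42); // would not compile: Integer does not match the value bound
        }
    }
}
```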
Reviewers: Matthias J. Sax <matthias@confluent.io>
Some tasks get closed inside handleAssignment but are not removed from the task manager's bookkeeping list. The next time around they would be closed again, which is an illegal state.
Reviewers: John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Instance-level:
* number of alive stream threads
Thread-level:
* avg / max number of records polled from the consumer per runOnce, INFO
* avg / max number of records processed by the task manager (i.e. across all tasks) per runOnce, INFO
Task-level:
* current number of buffered records (i.e., just a dynamic gauge), DEBUG.
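A minimal sketch of how an application could observe these metrics; the broker address is a placeholder, and exact metric names/groups are assumptions here:

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MetricsPeek {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-peek");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // DEBUG recording level enables the task-level gauge listed above.
        props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input").to("output");

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Dump all registered metrics; the new instance-, thread-, and
        // task-level sensors show up here alongside the existing ones.
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            System.out.println(entry.getKey().group() + " / "
                + entry.getKey().name() + " = " + entry.getValue().metricValue());
        }

        streams.close();
    }
}
```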
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <john@confluent.io>
As the title suggests, we would like to broaden this check so that we don't fail to close a doomed-to-cleanup task.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
First set of cleanup, pushed to a follow-up PR after KIP-441 Pt. 5. Main changes are:
1. Moved `RankedClient` and the static `buildClientRankingsByTask` to a new file
2. Moved `Movement` and the static `getMovements` to a new file (also renamed to `TaskMovement`)
3. Consolidated the many common variables throughout the assignment tests to the new `AssignmentTestUtils`
4. New utility to generate comparable/predictable UUIDs for tests, and removed the generic from `TaskAssignor` and all related classes
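For item 4, a minimal sketch of the predictable-UUID idea (the helper name is illustrative): derive the UUID from a small integer so tests can order clients deterministically.

```java
import java.util.UUID;

public class PredictableUuids {
    static UUID uuidForInt(final int n) {
        // Most-significant bits fixed at zero; least-significant bits carry n.
        return new UUID(0L, n);
    }

    public static void main(final String[] args) {
        final UUID client1 = uuidForInt(1);
        final UUID client2 = uuidForInt(2);
        // UUIDs sort in the same order as the ints they encode.
        System.out.println(client1.compareTo(client2) < 0); // true
    }
}
```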
Reviewers: John Roesler <vvcephei@apache.org>, Andrew Choi <a24choi@edu.uwaterloo.ca>
For some context, when building a streams application, the optimizer keeps track of the key-changing operations and any repartition nodes that are descendants of the key-changer. During the optimization phase (if enabled), any repartition nodes are logically collapsed into one. The optimizer updates the graph by inserting the single repartition node between the key-changing node and its first child node. This graph update process is done by searching for a node that has the key-changing node as one of its direct parents, and the search starts from the repartition node, going up in the parent hierarchy.
The one exception to this rule is when a merge node is a descendant of the key-changing node: in that case, during the optimization phase, the map tracking key-changers to repartition nodes is updated to use the merge node as the key, and the optimization process places the single repartition node between the merge node and its first child node.
The error in KAFKA-9739 occurred because there was an assumption that the repartition nodes are children of the merge node. But in the topology from KAFKA-9739, the repartition node was a parent of the merge node. So when attempting to find the first child of the merge node, nothing was found (obviously), resulting in a StreamsException (Found a null keyChangingChild node for ..).
This PR fixes this bug by first checking that all repartition nodes for optimization are children of the merge node.
This PR includes a test with the topology from KAFKA-9739.
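For context, a hedged sketch of a topology with the shape described above (a repartition node upstream of the merge); topic names and operations are illustrative, not the exact KAFKA-9739 topology:

```java
import java.time.Duration;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class MergeAfterRepartition {
    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();

        final KStream<String, String> left = builder.<String, String>stream("left")
            .selectKey((k, v) -> v);                 // key-changing operation
        final KStream<String, String> right = builder.stream("right");

        // The join forces a repartition of the key-changed stream, so the
        // repartition node sits *above* the merge that follows.
        final KStream<String, String> joined = left.join(
            right,
            (lv, rv) -> lv + rv,
            JoinWindows.of(Duration.ofMinutes(5)));

        joined.merge(builder.stream("third")).to("output");

        System.out.println(builder.build().describe());
    }
}
```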
Reviewers: John Roesler <john@confluent.io>
Revert the decision for the sendOffsetsToTransaction(groupMetadata) API to fail against older brokers, for the sake of making applications easier to adapt between versions. This PR silently downgrades the TxnOffsetCommit API when the built request version is smaller than 3.
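A usage sketch of the API in question; with this change the same call also works against older brokers, since the client silently downgrades the request. Broker address, topic, and group id are placeholders:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerGroupMetadata;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;

public class TxnOffsetsExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-txn-id");

        try (final KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            producer.initTransactions();
            producer.beginTransaction();

            final Map<TopicPartition, OffsetAndMetadata> offsets = Collections.singletonMap(
                new TopicPartition("input", 0), new OffsetAndMetadata(42L));
            // The overload taking the full group metadata (normally obtained
            // from consumer.groupMetadata()).
            producer.sendOffsetsToTransaction(offsets, new ConsumerGroupMetadata("my-group"));

            producer.commitTransaction();
        }
    }
}
```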
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
As documented in the KIP:
We shall set `transaction.timeout.ms` default to 10000 ms (10 seconds) on Kafka Streams.
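A minimal sketch of overriding the new default, should an application need a longer transaction timeout (values shown are illustrative):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class TxnTimeoutOverride {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // Streams now defaults transaction.timeout.ms to 10000 ms; override
        // it on the embedded producer via the producer prefix.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG),
                  60_000);

        System.out.println(props);
    }
}
```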
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Adds a new TaskAssignor implementation, currently hidden behind an internal feature flag, that implements the high availability algorithm of KIP-441.
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <vvcephei@apache.org>
Once Scala 2.13.2 is officially released, I will submit a follow-up PR
that enables `-Xfatal-warnings` with the necessary warning exclusions.
Compiler warning exclusions were only introduced in 2.13.2, which is why
we have to wait for that release. I used a snapshot build to test it in
the meantime.
Changes:
* Remove Deprecated annotation from internal request classes
* Class.newInstance is deprecated in favor of
Class.getConstructor().newInstance
* Replace deprecated JavaConversions with CollectionConverters
* Remove unused kafka.cluster.Cluster
* Don't use Map and Set methods deprecated in 2.13:
- collection.Map +, ++, -, --, mapValues, filterKeys, retain
- collection.Set +, ++, -, --
* Add scala-collection-compat dependency to streams-scala and
update version to 2.1.4.
* Replace usages of deprecated Either.get and Either.right
* Replace usage of deprecated Integer(String) constructor
* `import scala.language.implicitConversions` is not needed in Scala 2.13
* Replace usage of deprecated `toIterator`, `Traversable`, `seq`,
`reverseMap`, `hasDefiniteSize`
* Replace usage of deprecated alterConfigs with incrementalAlterConfigs
where possible
* Fix implicit widening conversions from Long/Int to Double/Float
* Avoid implicit conversions to String
* Eliminate usage of deprecated procedure syntax
* Remove `println` in `LogValidatorTest` instead of fixing the compiler
warning, since tests should not `println`.
* Eliminate implicit conversion from Array to Seq
* Remove unnecessary usage of 3 argument assertEquals
* Replace `toStream` with `iterator`
* Do not use deprecated SaslConfigs.DEFAULT_SASL_ENABLED_MECHANISMS
* Replace StringBuilder.newBuilder with new StringBuilder
* Rename AclBuffers to AclSeqs and remove usage of `filterKeys`
* More consistent usage of Set/Map in Controller classes; this also fixes
deprecation warnings with Scala 2.13
* Add spotBugs exclusion for inliner artifact in KafkaApis with Scala 2.12.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Measure the percentage of time the stream thread spends processing each task among all assigned active tasks (KIP-444). Also add unit tests to cover the metrics added in this PR and the previous #8358, and try to fix the flaky test reported in KAFKA-5842.
Co-authored-by: John Roesler <vvcephei@apache.org>
Reviewers: Bruno Cadonna <bruno@confluent.io>, John Roesler <vvcephei@apache.org>
When a caching state store is closed it calls its flush() method.
If flush() throws an exception the underlying state store is not closed.
This commit ensures that the state store underlying a wrapped state store
is closed even when preceding operations in the close method throw.
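A minimal sketch of the shape of the fix, with the store interfaces as simplified stand-ins for the Streams StateStore hierarchy:

```java
public class SafeClose {

    // Stand-in for the Streams StateStore interface.
    interface StateStore {
        void flush();
        void close();
    }

    static class CachingStore implements StateStore {
        private final StateStore wrapped;

        CachingStore(final StateStore wrapped) {
            this.wrapped = wrapped;
        }

        @Override
        public void flush() {
            // evict dirty cache entries into the wrapped store (elided)
            wrapped.flush();
        }

        @Override
        public void close() {
            try {
                flush();         // may throw
            } finally {
                wrapped.close(); // runs regardless of a flush failure
            }
        }
    }
}
```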
Co-authored-by: John Roesler <vvcephei@apache.org>
Reviewers: John Roesler <vvcephei@apache.org>, Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <matthias@confluent.io>
* delete topics before tearing down multi-node clusters to avoid leader elections during shutdown
* tear down all nodes concurrently instead of sequentially
Reviewers: Matthias J. Sax <matthias@confluent.io>
1. Within a single while loop, process the tasks in AAABBBCCC order instead of ABCABCABC (see the sketch after this list). This also helps the follow-up PR that times the per-task processing ratio, since it needs to record time less often, hence less overhead.
2. Add thread-level process / punctuate / poll / commit ratio metrics.
3. Fixed a few issues discovered (inline commented).
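A sketch of the iteration-order change from item 1; Task here is a stand-in for the Streams internal type:

```java
import java.util.List;

public class ProcessingOrder {

    interface Task {
        boolean process(); // returns false when no buffered records remain
    }

    static int processOnce(final List<Task> tasks, final int maxPerTask) {
        int processed = 0;
        for (final Task task : tasks) {            // AAABBBCCC: drain a batch from one task...
            for (int i = 0; i < maxPerTask; i++) { // ...before moving on to the next task
                if (!task.process()) {
                    break;
                }
                processed++;
            }
        }
        return processed;
    }
}
```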
Reviewers: John Roesler <vvcephei@apache.org>
This algorithm assigns tasks to clients and tries to
- balance the distribution of the partitions of the
same input topic over stream threads and clients,
i.e., data parallel workload balance
- balance the distribution of work over stream threads.
The algorithm does not take into account potentially existing states
on the client.
The assignment is considered balanced when the difference in
assigned tasks between the stream thread with the most tasks and
the stream thread with the least tasks does not exceed a given
balance factor.
The algorithm prioritizes balance across stream threads
over balance across clients.
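A minimal sketch of the balance condition described above:

```java
import java.util.Collection;

public class BalanceCheck {
    // Balanced when max - min assigned tasks across stream threads does not
    // exceed the balance factor.
    static boolean isBalanced(final Collection<Integer> tasksPerThread, final int balanceFactor) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (final int n : tasksPerThread) {
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        return max - min <= balanceFactor;
    }
}
```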
Reviewers: John Roesler <vvcephei@apache.org>
Hence, restore / global consumers should never expect a FencedInstanceIdException.
When such an exception is thrown, it means another instance with the same group.instance.id has taken over, so we should treat it as fatal and let this instance close out instead of handling it as task-migrated.
Reviewers: Boyang Chen <boyang@confluent.io>, Matthias J. Sax <matthias@confluent.io>