1. Delay the initialization (producer.initTxn) from construction to maybeInitialize; if it times out we just swallow and retry in the next iteration.
2. If completeRestoration (consumer.committed) times out, just swallow and retry in the next iteration.
3. For other calls (producer.partitionsFor, producer.commitTxn, consumer.commit), treat the timeout exception as fatal.
Reviewers: Matthias J. Sax <matthias@confluent.io>
Correct the Connect worker logic to properly disable the new topic status (KIP-558) feature when `topic.tracking.enable=false`, and fix automatic topic status reset after a connector is deleted.
Also adds new `ConnectorTopicsIntegrationTest` and expanded unit tests.
Reviewers: Randall Hauch <rhauch@gmail.com>
Correct the process-rate (and total) sensor to measure throughput (and total record processing count).
Reviewers: Guozhang Wang <guozhang@confluent.io>
- Added additional synchronization and increased timeouts to handle flakiness
- Added some pre-cautionary retries when trying to obtain lag map
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Test cases in `ConsumerPerformanceTest` were failing and causing a system exit rather than throwing the expected exception following #8023. We didn't catch this because the tests were marked as skipped and not failed.
Reviewers: Guozhang Wang <guozhang@confluent.io>
This change updates ConsoleProducer, ConsumerPerformance, VerifiableProducer, and VerifiableConsumer classes to add and prefer the --bootstrap-server flag for defining the connection point of the Kafka cluster. This change is part of KIP-499: https://cwiki.apache.org/confluence/display/KAFKA/KIP-499+-+Unify+connection+name+flag+for+command+line+tool.
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
The KafkaProducer code would set infinite retries (MAX_INT) if the producer was configured with idempotence and no retries were configured by the user. This is superfluous because KIP-91 changed the retry functionality to both be time-based and the default retries config to be MAX_INT.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch adds logic to the test case to ensure that consumer groups are in a valid state prior to attempting offset reset.
Reviewers: Jason Gustafson <jason@confluent.io>
This change mainly have 2 components:
1. extend the existing transactions_test.py to also try out new sendTxnOffsets(groupMetadata) API to make sure we are not introducing any regression or compatibility issue
a. We shrink the time window to 10 seconds for the txn timeout scheduler on broker so that we could trigger expiration earlier than later
2. create a completely new system test class called group_mode_transactions_test which is more complicated than the existing system test, as we are taking rebalance into consideration and using multiple partitions instead of one. For further breakdown:
a. The message count was done on partition level, instead of global as we need to visualize
the per partition order throughout the test. For this sake, we extend ConsoleConsumer to print out the data partition as well to help message copier interpret the per partition data.
b. The progress count includes the time for completing the pending txn offset expiration
c. More visibility and feature improvements on TransactionMessageCopier to better work under either standalone or group mode.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This PR speeds up the deletion process by doing the following:
- Batch whenever possible to minimize the number of requests sent out to other brokers;
- Refactor `onPartitionDeletion` to remove the usage of `allLiveReplicas`.
Reviewers: Jason Gustafson <jason@confluent.io>
The recent increase in the flakiness of one of the offset reset tests (KAFKA-9538) traces back to https://github.com/apache/kafka/pull/7941. After investigation, we found that following this patch, the consumer was sending an additional metadata request prior to performing the group assignment. This slight timing difference was enough to trigger the test failures. The problem turned out to be due to a bug in `SubscriptionState.groupSubscribe`, which no longer counted the local subscription when determining if there were new topics to fetch metadata for. Hence the extra metadata update. This patch restores the old logic.
Without the fix, we saw 30-50% test failures locally. With it, I could no longer reproduce the failure. However, #6561 is probably still needed to improve the resilience of this test.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Corrects a flaw leading to an exception while building topologies that include both:
* A foreign-key join with the result not explicitly materialized
* An operation after the join that requires source materialization
Also corrects a flaw in TopologyTestDriver leading to output records being enqueued in the wrong order under some (presumably rare) circumstances.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Migrates TopologyTestDriver processing to be closer to the same processing/ordering
semantics as KafkaStreams. This corrects the output order for recursive topologies
as reported in KAFKA-9503, and also works similarly in the case of task idling.
* Added init() method to RocksDBMetricsRecorder
* Added call to init() of RocksDBMetricsRecorder to init() of RocksDB store
* Added call to init() of RocksDBMetricsRecorder to openExisting() of segmented state stores
* Adapted unit tests
* Added integration test that reproduces the situation in which the bug occurred
Reviewers: Guozhang Wang <wangguoz@gmail.com>
During the discussion for KIP-213, we decided to pass "pseudo-topics"
to the internal serdes we use to construct the wrapper serdes for
CombinedKey and hashing the left-hand-side value. However, during
the implementation, this strategy wasn't fully implemented, and we wound
up using the same topic name for a few different data types.
Reviewers: Guozhang Wang <guozhang@confluent.io>
Changed `EmbeddedConnectCluster` to add utility methods that return `Response`, throw `ConnectException` instead of `IOException` for failures, and deprecate the old methods that returned primitive types rather than `Response`.
Also introduce common assertions for embedded clusters under `EmbeddedConnectClusterAssertions`.
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
During the KIP-213 implementation and verification, we neglected to test the
code path for falling back to default serdes if none are given in the topology.
Reviewer: Bill Bejeck <bbejeck@gmail.com>
Relying on integration test to catch an algorithm bug introduces more flakiness, reduce the test into a unit test to reduce the flakiness until we upgrade Java/Scala libs.
Checked the test shall fail with older version of StreamsPartitionAssignor.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Found this bug from the repeated flaky runs of system tests, it seems to be long lurking but also would only happen if there are frequent rebalances / topic creation within a short time, which is exactly the case in some of our smoke system tests.
Also added a unit test.
Reviewers: Boyang Chen <boyang@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>, Matthias J. Sax <matthias@confluent.io>
Author: Manikumar Reddy <manikumar.reddy@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Colin Patrick McCabe <cmccabe@apache.org>
Closes#8075 from omkreddy/KAFKA-9026-Fix
Follows up on the original PR for KAFKA-9445 to address a final round of feedback
Reviewers: John Roesler <vvcephei@apache.org>, Matthias J. Sax <matthias@confluent.io>
When the producer encouteres new topic(s), it now only fetches the metadata for the new topics. For cases where a producer interacts with a lot of topics, this reduces the cost for the topic being evicted from the cache, and during startup when populating the topic cache.
Additionally adds a new producer configuration variable 'metadata.max.idle.ms', which controls how long topic metadata may be idle (i.e. not produced to) before it's finally discarded from the metadata cache.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, dengziming <dengziming1993@gmail.com>
Follow up to original PR #7985 for KIP-523 (adding `KStream#toTable()` operator)
- improve JavaDocs
- add more unit tests
- fix bug for auto-repartitioning
- some code cleanup
Reviewers: High Lee <yello1109@daum.net>, John Roesler <john@confluent.io>
Addresses exception being thrown by `AdminClient` when `listConsumerGroupOffsets` returns a negative offset. A negative offset indicates the absence of a committed offset for a requested partition, and should result in a null in the returned offset map.
Reviewers: Anna Povzner <anna@confluent.io>, Jason Gustafson <jason@confluent.io>
The test case `org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication` has shown to be increasingly flaky recently. This PR aims to make this test more deterministic. Specifically, the flakiness was due to a timing issue between the tasks not starting up in time for the test to start running. This PR remediates that by introducing a status check after every connector is started up. These status checks include that the connector is found on the connect cluster as well as there are tasks created and up and running for that connector. These checks are introduced before the test starts running so that there is a confidence that the connectors and tasks are started up correctly before the test runs.
Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Jason Gustafson <jason@confluent.io>
1. StoreChangelogReaderTest.shouldRequestCommittedOffsetsAndHandleTimeoutException[1] This is due to stricter ternary operator type casting
2. KStreamImplTest.shouldSupportTriggerMaterializedWithKTableFromKStream
This is added recently where String typed values for <String, Integer>, in J8 it is allowed but in J11 it is not allowed.
Reviewers: John Roesler <john@confluent.io>
If transactional.id is set without setting enable.idempotence, the producer will set enable.idempotence to true implicitly. The docs should reflect this.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Followup to KAFKA-7317 and KAFKA-9113, there's some additional cleanup we can do in InternalTopologyBuilder. Mostly refactors the subscription code to make the initialization more explicit and reduce some duplicated code in the update logic.
Also some minor cleanup of the build method.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Instead of always try to update committed offset limits as long as there are buffered records for standby tasks, we leverage on the commit interval to reduce our consumer.committed frequency.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, John Roesler <john@confluent.io>
With the improvement of 447, we are now offering developers a better experience on writing their customized EOS apps with group subscription, instead of manual assignments. With the demo, user should be able to get started more quickly on writing their own EOS app, and understand the processing logic much better.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Disabled by default, but enabled for Jenkins PR builds (maximum of 1 retry per
test with up to 5 retries for the test run).
Reviewers: Ismael Juma <ismael@juma.me.uk>
The client caches metadata fetched from Metadata requests. Previously, each metadata response overwrote all of the metadata from the previous one, so we could rely on the expectation that the broker only returned the leaderId for a partition if it had connection information available. This behavior changed with KIP-320 since having the leader epoch allows the client to filter out partition metadata which is known to be stale. However, because of this, we can no longer rely on the request-level guarantee of leader availability. There is no mechanism similar to the leader epoch to track the staleness of broker metadata, so we still overwrite all of the broker metadata from each response, which means that the partition metadata can get out of sync with the broker metadata in the client's cache. Hence it is no longer safe to validate inside the `Cluster` constructor that each leader has an associated `Node`
Fixing this issue was unfortunately not straightforward because the cache was built to maintain references to broker metadata through the `Node` object at the partition level. In order to keep the state consistent, each `Node` reference would need to be updated based on the new broker metadata. Instead of doing that, this patch changes the cache so that it is structured more closely with the Metadata response schema. Broker node information is maintained at the top level in a single collection and cached partition metadata only references the id of the broker. To accommodate this, we have removed `PartitionInfoAndEpoch` and we have altered `MetadataResponse.PartitionMetadata` to eliminate its `Node` references.
Note that one of the side benefits of the refactor here is that we virtually eliminate one of the hotspots in Metadata request handling in `MetadataCache.getEndpoints` (which was renamed to `maybeFilterAliveReplicas`). The only reason this was expensive was because we had to build a new collection for the `Node` representations of each of the replica lists. This information was doomed to just get discarded on serialization, so the whole effort was wasteful. Now, we work with the lower level id lists and no copy of the replicas is needed (at least for all versions other than 0).
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Ismael Juma <ismael@juma.me.uk>
A few references to KIP-559 in the schema definitions needed to be fixed.
Reviewers: Brajesh Kumar <bristy@users.noreply.github.com>, Ron Dagostino <rdagostino@confluent.io>, Jason Gustafson <jason@confluent.io>