Simple logging additions at TRACE level that should help when the worker can't get caught up to the end of an internal topic.
Reviewers: Gwen Shapira <cshapi@gmail.com>, Aakash Shah <ashah@confluent.io>, Konstantine Karantasis <konstantine@confluent.io>
The href attribute missed the starting double quote, so the hyperlink is interpreted as https://docs.oracle.com/.../agent.html", with a redundant tailing double quote.
Add the missing starting double quote back to fix this issue.
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
After KIP-219, responses are sent immediately and we rely on a combination
of clients and muting of the channel to throttle. The result of this is that
we need to track `apiThrottleTimeMs` as an explicit value instead of
inferring it. On the other hand, we no longer need
`apiRemoteCompleteTimeNanos`.
Extend `BaseQuotaTest` to verify that throttle time in the request channel
metrics are being set. Given the nature of the throttling numbers, the test
is not particularly precise.
I included a few clean-ups:
* Pass KafkaMetric to QuotaViolationException so that the caller doesn't
have to retrieve it from the metrics registry.
* Inline Supplier in SocketServer (use SAM).
* Reduce redundant `time.milliseconds` and `time.nanoseconds`calls.
* Use monotonic clock in ThrottledChannel and simplify `compareTo` method.
* Simplify `TimerTaskList.compareTo`.
* Consolidate the number of places where we update `apiLocalCompleteTimeNanos`
and `responseCompleteTimeNanos`.
* Added `toString` to ByteBufferSend` and `MultiRecordsSend`.
* Restrict access to methods in `QuotaTestClients` to expose only what we need
to.
Reviewers: Jun Rao <junrao@gmail.com>
Class kafka.examples.SimpleConsumerDemo was removed. But the java-simple-consumer-demo.sh was not removed and README was not updated.
This commit removes java-simple-consumer-demo.sh and updates the demo instructions in the examples README.
Author: Jiamei Xie <jiamei.xie@arm.com>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
* The DeadLetterQueueReporter has a KafkaProducer that it must close to clean up resources
* Currently, the producer and its threads are leaked every time a task is stopped
* Responsibility for cleaning up ErrorReporters is transitively assigned to the
ProcessingContext, RetryWithToleranceOperator, and WorkerSinkTask/WorkerSinkTask classes
* One new unit test in ErrorReporterTest asserts that the producer is closed by the dlq reporter
Reviewers: Arjun Satish <arjun@confluent.io>, Chris Egerton <chrise@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Konstantine Karantasis <konstantine@confluent.io>
If a broker contains 8k replicas, we would previously issue 8k ZK calls to retrieve topic
configs when processing the first LeaderAndIsr request. That should translate to 0 after
these changes.
Credit to @junrao for identifying the problem.
Reviewers: Jun Rao <junrao@gmail.com>
The downgrade test does not currently support 2.4 and 2.5. When you enable them, it fails as a result of consumer group static membership. This PR makes the downgrade test work with all of our released versions again.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Boyang Chen, Gwen Shapira
Closes#8518 from lbradstreet/downgrade-test-2.4-2.5
This PR fixes and improves two major issues:
1. When calling KafkaStreams#store we can always get an InvalidStateStoreException, and even waiting for Streams state to become RUNNING is not sufficient (this is also how OptimizedKTableIntegrationTest failed). So I wrapped all the function with a Util wrapper that captures and retries on that exception.
2. While trouble-shooting this issue, I also realized a potential bug in test-util's produceKeyValuesSynchronously, which creates a new producer for each of the record to send in that batch --- i.e. if you are sending N records with a single call, within that call it will create N producers used to send one record each, which is very slow and costly.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, John Roesler <john@confluent.io>
* add a config to set the TaskAssignor
* set the default assignor to HighAvailabilityTaskAssignor
* fix broken tests (with some TODOs in the system tests)
Implements: KIP-441
Reviewers: Bruno Cadonna <bruno@confluent.io>, A. Sophie Blee-Goldman <sophie@confluent.io>
These two options are essentially incompatible, as caching will do nothing to reduce downstream traffic and writes when it has to allow non-unique keys (skipping records where the value is also the same is a separate issue, see KIP-557). But enabling caching on a store that's configured to retain duplicates is actually more than just ineffective, and currently causes incorrect results.
We should just log a warning and disable caching whenever a store is retaining duplicates to avoid introducing a regression. Maybe when 3.0 comes around we should consider throwing an exception instead to alert the user more aggressively.
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>, John Roesler <john@confluent.io>
Fixed typos in two MM2 configs that define the replication factor for internal Connect topics.
Only a single test was affected.
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Konstantine Karantasis <konstantine@confluent.io>
Makes the main thread wait for workers to be ready to test the
desired functionality before proceeding.
Reviewers: Ted Yu <yuzhihong@gmail.com>, Ismael Juma <ismael@juma.me.uk>
This PR updates the algorithm which limits the number of members within a group (`group.max.size`) to fix the following two issues:
1. As described in KAFKA-9885, we found out that multiple members of a group can be evicted if the leader of the consumer offset partition changes before the group is persisted. This happens because the current eviction logic always evict the first member rejoining the group.
2. We also found out that dynamic members, when required to have a known member id, are not always limited. The caveat is that the current logic only considers unknown members and uses the group size, which does not include the so called pending members, to accept or reject a member. In this case, when they rejoins, they are not unknown member anymore and thus could bypass the limit. See `testDynamicMembersJoinGroupWithMaxSizeAndRequiredKnownMember` for the whole scenario.
This PR changes the logic to address the above two issues and extends the tests coverage to cover all the member types.
Reviewers: Jason Gustafson <jason@confluent.io>
In this commit we made sure that the auto leader election only happens after the newly starter broker is in the isr.
No accompany tests are added due to the fact that:
this is a change to the private method and no public facing change is made
it is hard to create tests for this change without considerable effort
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jun Rao <junrao@gmail.com>
A broker throws IllegalStateException if the broker epoch in the LeaderAndIsr/UpdateMetadataRequest/StopReplicaRequest is larger than its current broker epoch. However, there is no guarantee that the broker would receive the latest broker epoch before the controller: when the broker registers with ZK, there are few more instructions to process before this broker "knows" about its epoch, while the controller may already get notified and send UPDATE_METADATA request (as an example) with the new epoch. This will result in clients getting stale metadata from this broker.
With this PR, a broker accepts LeaderAndIsr/UpdateMetadataRequest/StopReplicaRequest if the broker epoch is newer than the current epoch.
Reviewers: David Jacot <djacot@confluent.io>, Jason Gustafson <jason@confluent.io>
Add an option to kafka-configs.sh `--add-config-file` that adds the configs from a properties file.
Testing: Added new tests to ConfigCommandTest.scala
Author: Aneel Nazareth <aneel@confluent.io>
Reviewers: David Jacot <djacot@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#8184 from WanderingStar/KAFKA-9612
A partition is under reassignment if the either the set of adding
replicas or set removing replicas is non-empty.
Fix the test assertion such that it prints stdout on failure.
Reviewers: Ismael Juma <ismael@juma.me.uk>
This patch fixes a race condition in the join group request handling which sometimes results in not enforcing the maximum number of members allowed in a group.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
For join / sync / commit / heartbeat request, we would remember the sent generation in the created handler object, and then upon getting the error code, we could check whether the sent generation still matches the current generation. If not, it means that the member has already reset its generation or has participated in a new rebalance already. This means:
1. For join / sync-group request, we do not need to call reset-generation any more for illegal-generation / unknown-member. But we would still set the error since at a given time only one join/sync round-trip would be in flight, and hence we should not be participating in a new rebalance. Also for fenced instance error we still treat it as fatal since we should not be participating in a new rebalance, so this is still not expected.
2. For commit request, we do not set the corresponding error for illegal-generation / unknown-member / fenced-instance but raise rebalance-in-progress. For commit-sync it would be still thrown to user, while for commit-async it would be logged and swallowed.
3. For heartbeat request, we do not treat illegal-generation / unknown-member / fenced-instance errors and just consider it as succeeded since this should be a stale heartbeat which can be ignored.
Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
When the Connect worker forwards a REST API request to the leader, it might get back a `RequestTargetException` that suggests the worker should forward the request to a different worker. This can happen when the leader changes, and the worker that receives the original request forwards the request to the worker that it thinks is the current leader, but that worker is not the current leader. In this case. In most cases, the worker that received the forwarded request includes the URL of the current leader, but it is possible (albeit rare) that the worker doesn’t know the current leader and will include a null leader URL in the resulting `RequestTargetException`.
When this rare case happens, the user gets a null pointer exception in their response and the NPE is logged. Instead, the worker should catch this condition and provide a more useful error message that is similar to other existing error messages that might occur.
Added a unit test that verifies this corner case is caught and this particular NPE does not occur.
Author: Randall Hauch <rhauch@gmail.com>
Reviewer: Konstantine Karantasis <konstantine@confluent.io>
Scala 2.13.2 introduces support for suppressing warnings,
which makes it possible to enable fatal warnings. This is
useful enough from a development perspective to justify
this change.
In addition, Scala 2.13.2 also has a Vector implementation
with significant performance improvements and encoding
of String matches to switches.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
* Minor: Some html fixes Streams DSL documentation:
- duplicate id
- duplicate opening <span> element
- surplus closing </div> tag
* Replaced opening/closing quotation mark codes with " (they caused w3c validation to complain).
* Replaced right arrow that wasn't rendered correctly with →
Reviewers: Mickael Maison <mickael.maison@gmail.com>
* Upgrade to Scala 2.13.2 which introduces the ability to suppress warnings.
* Upgrade to scala-collection-compat 2.1.6 as it introduces the
@nowarn annotation for Scala 2.12.
* While at it, also update scala-java8-compat to 0.9.1.
* Fix compiler warnings and add @nowarn for the unfixed ones.
Scala 2.13.2 highlights (besides @nowarn):
* Rewrite Vector (using "radix-balanced finger tree vectors"),
for performance. Small vectors are now more compactly
represented. Some operations are now drastically faster on
large vectors. A few operations may be a little slower.
* Matching strings makes switches in bytecode.
https://github.com/scala/scala/releases/tag/v2.13.2
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Adjusted one assert condition on a testcase as the success was dependant
on thread runtimes and the much lower tolerances due to the reduced time
broke this test.
Reviewers: Ismael Juma <ismael@juma.me.uk>
I added a change to the upgrade test a while back that would make it wait for
ISR rejoin before rolls. This prevents incompatible brokers charging through a
bad roll and disguising a downgrade problem.
We now also check for protocol errors in the broker logs.
Reviewers: Boyang Chen <boyang@confluent.io>, Ismael Juma <ismael@juma.me.uk>
* Add AbstractResponse#errorCounts(Stream) to avoid having to call
AbstractResponse#errorCounts(Collection) with a computed collection.
* A microbenchmark showed that using errorCounts(Stream) was
around 7.5 times faster than errorCounts(Collection). Using forEach()
loops with updateErrorCounts() is slightly faster, but is usually more
code.
* Use updateErrorMap() consistently.
* Replace for statements with forEach() for consistency.
* Use singleton errorMap() consistently.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Ismael Juma <ismael@juma.me.uk>
When debugging KAFKA-9388, I found the reason that the second test method test takes much longer (10s) than the previous one (~500ms) is because they used the same app.id. When the previous clients are shutdown, they would not send leave-group and hence we are still depending on the session timeout (10s) for the members to be removed out of the group.
When the second test is triggered, they will join the same group because of the same application id, and the prepare-rebalance phase would would for the full rebalance timeout before it kicks out the previous members.
Setting different application ids could resolve such issues for integration tests --- I did a quick search and found some other integration tests have the same issue. And after this PR my local unit test runtime reduced from about 14min to 7min.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, John Roesler <john@confluent.io>