* Add AbstractResponse#errorCounts(Stream) to avoid having to call
AbstractResponse#errorCounts(Collection) with a computed collection.
* A microbenchmark showed that using errorCounts(Stream) was
around 7.5 times faster than errorCounts(Collection). Using forEach()
loops with updateErrorCounts() is slightly faster, but is usually more
code.
* Use updateErrorMap() consistently.
* Replace for statements with forEach() for consistency.
* Use singleton errorMap() consistently.
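A minimal sketch of the idea behind errorCounts(Stream), with simplified types (String stands in for Errors) and no assumptions about the actual AbstractResponse internals: counting directly from the stream avoids materializing an intermediate collection that is only iterated once.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}
import java.util.stream.{Stream => JStream}

object ErrorCountsSketch {
  // Accumulate counts straight from the stream, updateErrorCounts-style.
  def errorCounts(errors: JStream[String]): JMap[String, Integer] = {
    val counts = new JHashMap[String, Integer]()
    errors.forEach(e => counts.put(e, counts.getOrDefault(e, 0) + 1))
    counts
  }

  def main(args: Array[String]): Unit =
    println(errorCounts(JStream.of("NONE", "NONE", "NOT_LEADER_FOR_PARTITION")))
}
```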
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Ismael Juma <ismael@juma.me.uk>
While debugging KAFKA-9388, I found that the reason the second test method takes much longer (10s) than the previous one (~500ms) is that they use the same app.id. When the previous clients are shut down, they do not send a leave-group request, so we still depend on the session timeout (10s) for those members to be removed from the group.
When the second test is triggered, its clients join the same group because of the shared application id, and the prepare-rebalance phase waits for the full rebalance timeout before it kicks out the previous members.
Setting different application ids resolves such issues for integration tests. A quick search found some other integration tests with the same issue. After this PR, my local unit test runtime dropped from about 14 min to 7 min.
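As an illustration of the fix (a hypothetical helper, not the actual test code), each integration test can build its Streams config with a unique application.id so its consumer group never overlaps with a previous test's group:

```scala
import java.util.Properties

object UniqueAppIdSketch {
  // "application.id" and "bootstrap.servers" are the standard Kafka Streams config keys;
  // the timestamp suffix keeps every test (and test run) in its own consumer group.
  def streamsConfig(testName: String, bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put("application.id", s"$testName-${System.currentTimeMillis()}")
    props.put("bootstrap.servers", bootstrapServers)
    props
  }
}
```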
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, John Roesler <john@confluent.io>
Also:
* Remove deprecated `=` in resolutionStrategy.
* Replace `AES/GCM/PKCS5Padding` with `AES/GCM/NoPadding`
in `PasswordEncoderTest`. The former is invalid and JDK 14 rejects it,
see https://bugs.openjdk.java.net/browse/JDK-8229043.
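For reference, a minimal check of the valid transformation name (not part of the patch itself): GCM is used without padding, and JDK 14 rejects the PKCS5Padding variant.

```scala
import javax.crypto.Cipher

object GcmTransformationCheck {
  def main(args: Array[String]): Unit = {
    // Valid on all supported JDKs; "AES/GCM/PKCS5Padding" is rejected on JDK 14.
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    println(cipher.getAlgorithm)
  }
}
```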
With these changes, the build works with Java 14 and Scala 2.12. The
same will apply to Scala 2.13 when Scala 2.13.2 is released (should
happen within 1-2 weeks).
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>, Matthias J. Sax <matthias@confluent.io>
1. In both the RocksDBMetrics and Metrics integration tests, we do not need to wait for the consumer to consume records from the output topics, since the sensors / metrics are registered upon task creation.
2. Merged the two RocksDB test cases into one app that creates two state stores (non-segmented and segmented).
With these two changes, the local runtime of these two tests dropped from 2+ and 3+ minutes to under a minute.
Reviewers: Bruno Cadonna <bruno@confluent.io>, Matthias J. Sax <matthias@confluent.io>
* Introduce `gradlewAll` script to replace the `*All` tasks, since the approach
used by the latter has not worked since Gradle 6.0 and it's unclear when,
if ever, it will work again (see https://github.com/gradle/gradle/issues/11301).
* Update release script and README given the above.
* Update zinc to 1.3.5.
* Update gradle-versions-plugin to 0.28.0.
The major improvements in Gradle 6.0 to 6.3 are:
- Improved incremental compilation for Scala.
- Support for Java 14 (although some Gradle plugins like spotBugs may need to be updated or disabled; will do that separately).
- Improved scalac reporting: warnings are clearly marked as such, which is very helpful.
Tested `gradlewAll` manually for the commands listed in the README
and release script. For `uploadArchive`, I tested it with a local Maven
repository.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
One of the new RocksDB unit tests creates a non-temporary RocksDB directory wherever the test is run from, leaving some RocksDB files behind after the test(s) are done. We should use a temporary directory (tempDirectory) for this test.
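A minimal sketch of the intended pattern, assuming nothing about the specific test (the real fix presumably uses the existing tempDirectory test helper): create the RocksDB state directory under a JVM-managed temp location so nothing is left in the working directory.

```scala
import java.nio.file.Files

object TempStateDirSketch {
  def main(args: Array[String]): Unit = {
    // Created under java.io.tmpdir rather than the directory the test is run from.
    val stateDir = Files.createTempDirectory("rocksdb-test-").toFile
    stateDir.deleteOnExit()
    println(s"RocksDB test state dir: ${stateDir.getAbsolutePath}")
  }
}
```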
Reviewers: Guozhang Wang <wangguoz@gmail.com>
The impact of trace logging is normally small, on the order of 40ns per getEffectiveLevel check; however, this adds up when trace is called multiple times per partition in the replica fetch hot path.
This PR removes some trace logs that are not very useful and reduces cases where the level is checked over and over for one fetch request.
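An illustrative pattern (not the actual fetcher code) of hoisting the level check out of the per-partition loop: the check happens once per fetch request instead of once per partition.

```scala
import org.slf4j.LoggerFactory

object HoistedLevelCheckSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical fetch handler: the trace level is evaluated once per request
  // rather than once (or more) per partition inside the loop.
  def handleFetch(partitions: Seq[String]): Unit = {
    val traceEnabled = log.isTraceEnabled
    partitions.foreach { tp =>
      if (traceEnabled)
        log.trace(s"handling fetch for partition $tp")
      // ... per-partition fetch work ...
    }
  }
}
```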
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
The integration test RocksDBMetricsIntegrationTest takes pretty long to complete.
Most of the runtime is spent in the two tests that verify whether the RocksDB
metrics get actual measurements from RocksDB. Those tests need to wait for the thread
that collects the measurements of the RocksDB metrics to trigger the first recordings
of the metrics.
This PR adds a unit test that verifies whether the Kafka Streams metrics get the
measurements from RocksDB and removes the two integration tests that verified it
before. The verification of the creation and scheduling of the RocksDB metrics
recording trigger thread is already contained in KafkaStreamsTest and consequently
it is not part of this PR.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Adding a dynamically updatable log config is currently error prone, as it is easy to
set it up as a val rather than a def, which results in a dynamically updated
broker default not applying to a LogConfig after broker restart.
This PR adds a guard against introducing these issues by ensuring that all log
configs are exhaustively checked via a test.
For example, if the following line were a val and not a def, there would be a
problem with dynamically updating broker defaults for the config.
4bde9bb3cc/core/src/main/scala/kafka/server/KafkaConfig.scala (L1216)
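A minimal Scala illustration of the val-versus-def distinction described above, using made-up names rather than the real KafkaConfig fields:

```scala
// Simplified stand-in for a broker config holder; not the actual KafkaConfig code.
class BrokerDefaults {
  @volatile var messageMaxBytes: Int = 1048588 // dynamically updatable broker default

  val capturedOnce: Int = messageMaxBytes      // a val freezes the default at construction time
  def readEachTime: Int = messageMaxBytes      // a def re-reads the current default on every access
}

object ValVsDefDemo {
  def main(args: Array[String]): Unit = {
    val defaults = new BrokerDefaults
    defaults.messageMaxBytes = 2000000         // simulate a dynamic broker config update
    println(defaults.capturedOnce)             // still 1048588: the update is never observed
    println(defaults.readEachTime)             // 2000000: the update is picked up
  }
}
```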
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Currently, tumbling windows are defined as "a special case of hopping time windows" in the streams docs, but hopping windows are only explained in a subsequent section.
I think it would make sense to switch the order of these paragraphs around. To me this also makes more sense semantically.
Testing: built the site and checked that everything looks OK and the HTML is valid (or at least does not contain any new warnings caused by this change).
Reviewers: Bill Bejeck <bbejeck@apache.org>
This commit reworks the SocketServer to always start the acceptor threads after the processor threads and to always stop the acceptor threads before the processor threads. It ensures that acceptor shutdown is not blocked waiting on the processors to be fully shut down, by decoupling the shutdown signal from the awaiting. It also ensures that each processor thread drains its newConnection queue to unblock acceptors that may be waiting. The acceptors still bind during startup; only the processing of new connections and requests is further delayed.
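A hedged sketch of the ordering described above, with hypothetical Worker/SocketServerSketch types rather than the real SocketServer classes:

```scala
trait Worker {
  def start(): Unit
  def beginShutdown(): Unit // signal only; must not block
  def awaitShutdown(): Unit // wait for the thread to exit
}

class SocketServerSketch(acceptor: Worker, processors: Seq[Worker]) {
  def startup(): Unit = {
    processors.foreach(_.start())          // processors first, so accepted connections can be handled
    acceptor.start()                       // acceptor last
  }

  def shutdown(): Unit = {
    acceptor.beginShutdown()               // stop accepting new connections first
    processors.foreach(_.beginShutdown())  // signal all processors without waiting for them
    acceptor.awaitShutdown()               // acceptor is not blocked on fully shut down processors
    processors.foreach(_.awaitShutdown())
  }
}
```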
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
When building a release candidate with release.py, if it's not the first RC, we need to drop the previous RC's artifacts from the staging repository before closing the new ones. This patch adds a log message to remind the release manager of that step.
The patch adds a new test case for validating concurrent read/write behavior in the `Log` implementation. In the process of verifying this, we found a race condition in `read`. The previous logic checks whether the start offset is equal to the end offset before collecting the high watermark. It is possible that the log is truncated in between these two conditions which could cause the high watermark to be equal to the log end offset. When this happens, `LogSegment.read` fails because it is unable to find the starting position to read from.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
The split method uses too many resources and can cause an OutOfMemoryError
when the bigBatch is huge. Call closeForRecordAppends() to free up resources
such as compression buffers.
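A generic sketch of the pattern, with a made-up builder type rather than Kafka's MemoryRecordsBuilder: once a split-off batch has received all of its records, close it for further appends so its compression buffer is released immediately instead of living until the whole split finishes.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

final class SplitBatchBuilder {
  private val sink = new ByteArrayOutputStream()
  private var compressor = new GZIPOutputStream(sink) // holds sizeable internal buffers

  def append(record: Array[Byte]): Unit = compressor.write(record)

  // Analogue of closeForRecordAppends(): finish the compressed stream and drop
  // its buffers while keeping the bytes already written to the sink.
  def closeForAppends(): Unit = {
    compressor.finish()
    compressor = null
  }

  def buffer: Array[Byte] = sink.toByteArray
}

object SplitSketch {
  def main(args: Array[String]): Unit = {
    val records = Seq.fill(4)(Array.fill[Byte](1024)(1))
    // Split into batches of two records; each builder is closed for appends as
    // soon as it is full, so at most one compressor holds buffers at a time.
    val batches = records.grouped(2).map { group =>
      val builder = new SplitBatchBuilder
      group.foreach(builder.append)
      builder.closeForAppends()
      builder
    }.toList
    println(batches.map(_.buffer.length))
  }
}
```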
Change-Id: Iac6519fcc2e432330b8af2d9f68a8d4d4a07646b
Signed-off-by: Jiamei Xie <jiamei.xie@arm.com>
Author: Jiamei Xie <jiamei.xie@arm.com>
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jiangjie (Becket) Qin <becket.qin@gmail.com>
Closes #8286 from jiameixie/outOfMemory
`KafkaApis#handleOffsetDeleteRequest` does not build the response correctly because `topics.add` is not in the correct loop. Fortunately, due to how the response is processed by the admin client, it works but sends redundant information on the wire.
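Illustrative only, with simplified case classes rather than the real OffsetDelete response types: the per-topic entry is added once in the outer (per-topic) loop; adding it in the inner (per-partition) loop would repeat the topic entry once per partition.

```scala
final case class PartitionResponse(partition: Int, error: String)
final case class TopicResponse(topic: String, partitions: Seq[PartitionResponse])

object ResponseBuildingSketch {
  def buildResponse(request: Map[String, Seq[Int]]): Seq[TopicResponse] =
    request.map { case (topic, partitions) =>
      val partitionResponses = partitions.map(p => PartitionResponse(p, "NONE"))
      TopicResponse(topic, partitionResponses) // once per topic, outside the partition loop
    }.toSeq

  def main(args: Array[String]): Unit =
    println(buildResponse(Map("foo" -> Seq(0, 1, 2))))
}
```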
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
When it comes to actually closing a task, we now treat all states exactly the same and call StateManagerUtil#closeStateManager regardless of whether the task is in CREATED, RESTORING, or RUNNING.
Unfortunately, StateManagerUtil does not check that we actually own the lock for this task's state. During a dirty close with eos enabled, we wipe the state, but in some cases this means deleting the state out from under another StreamThread that is still in the process of revoking this task.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This patch moves the state change logger logs for handling a LeaderAndIsr/StopReplica request inside the replicaStateChangeLock in order to serialize the logs. This helps to tell apart per-partition actions of concurrent LAIR/StopReplica requests in cases where requests pile up waiting on the lock.
Reviewer: Jun Rao <junrao@gmail.com>
QuotaViolationException generates an exception message via String.format in the constructor
even though the message is often not used, e.g. https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ClientQuotaManager.scala#L258. We now override `toString` instead.
It also generates an unnecessary stack trace, which is now avoided using the same pattern as in ApiException.
I have also avoided use of QuotaViolationException for control flow in
ReplicationQuotaManager which is another hotspot that we have seen in practice.
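A hedged sketch of the two techniques, using a stand-in class rather than the real QuotaViolationException: the message is only formatted when something actually renders the exception, and no stack trace is captured.

```scala
final class QuotaViolationSketch(val value: Double, val bound: Double) extends RuntimeException {
  // Formatted lazily, only when the exception is actually rendered (e.g. logged).
  override def toString: String = s"QuotaViolationSketch(value=$value, bound=$bound)"

  // Same idea as the ApiException pattern: the exception is used as a signal,
  // so skip the relatively expensive stack trace capture.
  override def fillInStackTrace(): Throwable = this
}

object QuotaViolationDemo {
  def main(args: Array[String]): Unit = {
    val e = new QuotaViolationSketch(150.0, 100.0)
    println(e)                      // triggers toString; no formatting cost before this point
    println(e.getStackTrace.length) // 0: no stack trace was recorded
  }
}
```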
Reviewers: Gwen Shapira <gwen@confluent.io>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
There are two cases in the fetch path where a partition is unnecessarily looked up
from the partition pool when one is already accessible. This is a fairly minor
improvement on high partition count clusters, but could be worth about 1% based on some
profiles I have seen.
More importantly, the code is cleaner this way.
Reviewers: Ismael Juma <ismael@juma.me.uk>
FetchRequest.PartitionData.equals unnecessarily uses Objects.equals, generating a lot of allocations due to boxing even though primitives are being compared. This is shown in the allocation profile below. Note that the CPU overhead is negligible.

![image](https://user-images.githubusercontent.com/252189/79079019-46686300-7cc1-11ea-9bc9-44fd17bae888.png)
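An illustrative sketch (simplified fields, not the actual PartitionData class) of why the boxing happens and what the allocation-free comparison looks like:

```scala
import java.util.Objects

final class PartitionDataSketch(val fetchOffset: Long, val maxBytes: Int) {
  // Allocation-heavy: Objects.equals takes Objects, so both primitives are boxed on every call.
  def equalsBoxed(other: PartitionDataSketch): Boolean =
    Objects.equals(fetchOffset, other.fetchOffset) && Objects.equals(maxBytes, other.maxBytes)

  // Allocation-free: plain primitive comparison.
  def equalsPrimitive(other: PartitionDataSketch): Boolean =
    fetchOffset == other.fetchOffset && maxBytes == other.maxBytes
}
```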
Author: Lucas Bradstreet <lucasbradstreet@gmail.com>
Reviewers: Chia-Ping Tsai, Gwen Shapira
Closes #8473 from lbradstreet/avoid-boxing-partition-data-equals
In KAFKA-9826, a log whose first dirty offset was past the start of the active segment and past the last cleaned point resulted in an endless cycle of picking the segment to clean and discarding it. Though this didn't interfere with cleaning other log segments, it kept the log cleaner thread continuously busy (potentially wasting CPU and impacting other running threads) and filled the logs with lots of extraneous messages.
This was determined to be because the active segment was mistakenly picked for cleaning, and because the logSegments code handles (start == end) cases only when (start, end) fall on a segment boundary: the intent is to return an empty list, but if they are not on a segment boundary, the routine returns the containing segment.
This fix has two parts:
1. It changes logSegments to handle start == end by always returning an empty List.
2. It changes the definition of calculateCleanableBytes to not consider anything past the firstUncleanableOffset; previously, it would potentially shift the firstUncleanableOffset to match the firstDirtyOffset even if the firstDirtyOffset was past the firstUncleanableOffset. This has no real effect now, given the fix for (1), but it makes the code read more like the model it is attempting to follow.
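A simplified model of fix (1), not the actual Log.logSegments code: an empty (start == end) range now selects no segments, even when the offsets do not fall on a segment boundary.

```scala
object SegmentRangeSketch {
  // baseOffsets: sorted base offsets of the log's segments.
  def segmentsInRange(baseOffsets: Seq[Long], from: Long, to: Long): Seq[Long] = {
    if (from == to) {
      Seq.empty // previously, a non-boundary start == end could still return the containing segment
    } else {
      val floor = baseOffsets.filter(_ <= from).lastOption.getOrElse(baseOffsets.head)
      baseOffsets.filter(base => base >= floor && base < to)
    }
  }

  def main(args: Array[String]): Unit = {
    val segments = Seq(0L, 100L, 200L)             // 200 is the active segment's base offset
    println(segmentsInRange(segments, 250L, 250L)) // List(): nothing to clean
  }
}
```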
These changes require modifications to a few test cases that exercised this particular situation; they were introduced in the context of KAFKA-8764. Those situations are now handled elsewhere in the code, but the tests themselves allowed a dirty offset in the active segment and expected the active segment to be selected for cleaning.
Reviewer: Jun Rao <junrao@gmail.com>
When the LogManager resumes cleaning, it logs that compaction has resumed; however, the topic in question is not necessarily a compacted one.
Author: Lucas Bradstreet <lucas@confluent.io>
Reviewers: Gwen Shapira, Chia-Ping Tsai
Closes #8466 from lbradstreet/bad-cleaning-message
This fixes a version pinning issue where a transitive dependency had a
major version upgrade that a dependency did not account for, breaking
the build.
Reviewers: Andrew Egelhofer <aegelhofer@confluent.io>, Matthias J. Sax <matthias@confluent.io>
This is a follow-up to #8077. The bug exposed a testing gap in how we group partitions. This patch adds a test case which reproduces the reported problem.
Reviewers: David Arthur <mumrah@gmail.com>
The previous code did not use the collection produced by `takeWhile()`.
It only used the length of that collection to select the next element.
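A small illustration of the pattern described above, with made-up data rather than the original call site: if only the length of the takeWhile result is needed to pick the next element, indexWhere yields the same index without building the intermediate collection.

```scala
object TakeWhileSketch {
  def main(args: Array[String]): Unit = {
    val offsets = Vector(1, 3, 5, 8, 9)

    val viaTakeWhile = offsets(offsets.takeWhile(_ < 8).length) // builds Vector(1, 3, 5) only to count it
    val viaIndexWhere = offsets(offsets.indexWhere(_ >= 8))     // same element, no intermediate collection

    println((viaTakeWhile, viaIndexWhere)) // (8,8)
  }
}
```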
Reviewers: Ismael Juma <ismael@juma.me.uk>
We were hitting an IllegalStateException ("There is already a changelog registered for ...") in trunk-eos due to failing to call TaskManager#cleanup on unrevoked tasks that we end up closing in handleAssignment after a batch commit failure.
Reviewers: Guozhang Wang <wangguoz@gmail.com>