Currently we only record completed sends and receives in the selector metrics. If there is a disconnect in the middle of the respective operation, then it is not counted. The metrics will be more accurate if we take into account partial sends and receives.
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>
Do not allow an empty replica set to be passed into the reassignment API.
Reviewers: Colin P. McCabe <cmccabe@apache.org>, José Armando García Sancio <jsancio@gmail.com>
Aims to fix the flaky LogCleanerIntegrationTest#testIsThreadFailed by changing how metrics are cleaned.
Reviewers: Jason Gustafson <jason@confluent.io>
Allow routing of `AdminClient#describeTopics` to any broker in the cluster rather than just the controller, so that we don't create a hotspot for this API call. `AdminClient#describeTopics` uses the broker's metadata cache, which is maintained asynchronously, so routing to brokers other than the controller is not expected to make a significant difference in terms of metadata consistency; all metadata requests are eventually consistent.
This patch also fixes a few flaky test failures.
Reviewers: Ismael Juma <ismael@juma.me.uk>, José Armando García Sancio <jsancio@gmail.com>, Jason Gustafson <jason@confluent.io>
KIP-352 adds several new metrics to improve tracking of reassignments. We will be able to measure the bytes in/out rate and the count of partitions under active reassignment.
We also change the semantics of the UnderReplicatedPartitions metric to better account for reassignment. Currently it reports a partition as under-replicated when extra replicas are added during reassignment as part of the process; this PR changes it to always take the original replica set into account when computing the URP metric.
The newly added metrics will be:
- kafka.server:type=ReplicaManager,name=ReassigningPartitions
- kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec
- kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec
The changed URP metric:
- kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
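The changed URP semantics can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `UrpSketch` and `isUnderReplicated` are hypothetical names, and comparing the ISR size against the size of the original replica set is an assumption about how the check is expressed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the changed URP semantics; not Kafka's actual code.
final class UrpSketch {
    // replicas: full assignment, including replicas an in-progress reassignment is adding.
    // addingReplicas: replicas being added by the reassignment (empty if none).
    // isr: current in-sync replica set.
    static boolean isUnderReplicated(Set<Integer> replicas, Set<Integer> addingReplicas, Set<Integer> isr) {
        // Drop the replicas that only exist because of the reassignment.
        Set<Integer> original = new HashSet<>(replicas);
        original.removeAll(addingReplicas);
        // Assumption: the ISR size is compared against the original replica count only.
        return isr.size() < original.size();
    }

    public static void main(String[] args) {
        Set<Integer> replicas = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> adding = new HashSet<>(Arrays.asList(4, 5));
        Set<Integer> isr = new HashSet<>(Arrays.asList(1, 2, 3));
        System.out.println(isUnderReplicated(replicas, adding, isr)); // prints "false"
    }
}
```

With this rule, a partition whose new replicas are still catching up is not reported as under-replicated as long as the original replicas remain in sync.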
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
With KIP-392, we allow consumers to fetch from followers. This capability is enabled when a replica selector has been provided in the configuration. When not in use, the intent is to preserve current behavior of fetching only from leader. The leader epoch is the mechanism that keeps us honest. When there is a leader change, the epoch gets bumped, consumer fetches fail due to the fenced epoch, and we find the new leader.
However, for old consumers, there is no similar protection. The leader epoch was not available to clients until recently. If there is a preferred leader election (for example), the old consumer will happily continue fetching from the demoted leader until a periodic metadata fetch causes us to discover the new leader. This does not create any problems from a correctness perspective, since fetches are still bound by the high watermark, but it is surprising and may cause unexpected performance characteristics.
This patch fixes this problem by enforcing leader-only fetching for older versions of the fetch request.
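The rule can be sketched as a simple version gate. This is an illustration under stated assumptions, not the broker code: `FetchSourceSketch` and its methods are hypothetical names, and the constant assumes that fetch request version 11 is the first version carrying the KIP-392 fields.

```java
// Hypothetical sketch of leader-only enforcement for old fetch versions; not Kafka's actual code.
final class FetchSourceSketch {
    // Assumption: fetch request v11 is the first version that supports follower fetching.
    static final short MIN_FOLLOWER_FETCH_VERSION = 11;

    // A consumer may read from a follower only if a replica selector is configured
    // and the request version is new enough to carry the leader epoch protections.
    static boolean mayFetchFromFollower(short fetchVersion, boolean replicaSelectorConfigured) {
        return replicaSelectorConfigured && fetchVersion >= MIN_FOLLOWER_FETCH_VERSION;
    }

    // A fetch that lands on a non-leader replica is rejected unless follower fetching is allowed.
    static boolean shouldRejectFetch(short fetchVersion, boolean isLeader, boolean replicaSelectorConfigured) {
        return !isLeader && !mayFetchFromFollower(fetchVersion, replicaSelectorConfigured);
    }

    public static void main(String[] args) {
        // An old consumer (v10) hitting a demoted leader is rejected even with a selector configured.
        System.out.println(shouldRejectFetch((short) 10, false, true)); // prints "true"
    }
}
```

Rejecting the fetch forces the old consumer to refresh metadata and find the new leader immediately instead of waiting for the periodic metadata fetch.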
Reviewers: Jason Gustafson <jason@confluent.io>
AbstractRequestResponse should be an interface, since it has no concrete elements or implementation. Move AbstractRequestResponse#serialize to RequestUtils#serialize and make it package-private, since it doesn't need to be public.
Reviewers: Ismael Juma <ismael@juma.me.uk>
A partition log is initialized in the following steps:
1. Fetch log config from ZK
2. Call LogManager.getOrCreateLog, which creates the Log object
3. Register the Log object
Step #3 enables the configuration update thread to deliver configuration
updates to the log. But if any update arrives between step #1 and #3,
that update is missed. This breaks the following use case:
1. Create a topic with default configuration, and immediately after that
2. Update the configuration of the topic
There is a race condition here, and in some cases the update made in the
second step will get dropped.
This change fixes it by tracking updates arriving between step #1 and #3.
Once a Partition is done initializing the log, it checks whether it has missed any
update. If so, the configuration is read from ZK again.
Added unit tests to make sure a dirty configuration is refreshed. Tested
on local cluster to make sure that topic configuration and updates are
handled correctly.
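The tracking described above can be sketched with a dirty flag. This is a minimal illustration, not the patch itself: `LogInitSketch`, `markConfigDirty`, and `initializeLog` are hypothetical names, and the ZK read and the log creation are passed in as plain callbacks.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of tracking config updates that arrive during log initialization.
final class LogInitSketch {
    private final AtomicBoolean configDirty = new AtomicBoolean(false);

    // Called by the configuration update thread whenever a config change is observed.
    void markConfigDirty() {
        configDirty.set(true);
    }

    // fetchConfig stands in for reading the log config from ZK (step 1);
    // createAndRegisterLog stands in for creating and registering the Log (steps 2 and 3).
    <C> C initializeLog(Supplier<C> fetchConfig, Runnable createAndRegisterLog) {
        configDirty.set(false);
        C config = fetchConfig.get();
        createAndRegisterLog.run();
        // An update arrived between step 1 and step 3: the fetched config is stale, so read it again.
        if (configDirty.getAndSet(false)) {
            config = fetchConfig.get();
        }
        return config;
    }

    public static void main(String[] args) {
        LogInitSketch sketch = new LogInitSketch();
        AtomicInteger configVersion = new AtomicInteger(1);
        // Simulate a config update racing with initialization.
        Integer seen = sketch.initializeLog(configVersion::get, () -> {
            configVersion.set(2);
            sketch.markConfigDirty();
        });
        System.out.println(seen); // prints "2"
    }
}
```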
Reviewers: Jason Gustafson <jason@confluent.io>
KAFKA-7215 improved the log cleaner error handling to mitigate thread death but missed one case. Exceptions in grabFilthiestCompactedLog still cause the thread to die.
This patch improves handling to ensure that errors in that function still mark a partition as uncleanable and do not crash the thread.
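The intended behavior can be sketched as follows. This is an illustration only: `CleanerSelectionSketch` is a hypothetical class, partitions are plain strings, and the dirty-ratio callback stands in for whatever per-partition inspection may throw.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: a failure while inspecting one partition marks it uncleanable
// instead of propagating and killing the cleaner thread.
final class CleanerSelectionSketch {
    final Set<String> uncleanable = new HashSet<>();

    // Returns the partition with the highest dirty ratio, skipping partitions already
    // marked uncleanable and marking any partition whose inspection throws.
    Optional<String> grabFilthiest(List<String> partitions, ToDoubleFunction<String> dirtyRatio) {
        String best = null;
        double bestRatio = -1.0;
        for (String partition : partitions) {
            if (uncleanable.contains(partition)) {
                continue;
            }
            try {
                double ratio = dirtyRatio.applyAsDouble(partition);
                if (ratio > bestRatio) {
                    best = partition;
                    bestRatio = ratio;
                }
            } catch (RuntimeException e) {
                // Remember the failure and keep going with the remaining partitions.
                uncleanable.add(partition);
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        CleanerSelectionSketch cleaner = new CleanerSelectionSketch();
        Optional<String> chosen = cleaner.grabFilthiest(Arrays.asList("a-0", "b-0"), p -> {
            if (p.equals("b-0")) throw new IllegalStateException("simulated failure");
            return 0.4;
        });
        System.out.println(chosen.get() + " " + cleaner.uncleanable); // prints "a-0 [b-0]"
    }
}
```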
Reviewers: Jason Gustafson <jason@confluent.io>
All the changes are in ReplicaManager.appendToLocalLog and ReplicaManager.appendRecords. Also, replaced LogAppendInfo.unknownLogAppendInfoWithLogStartOffset with LogAppendInfo.unknownLogAppendInfoWithAdditionalInfo to include those 2 new fields.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
The scalac optimizer is able to inline methods to avoid lambda allocations, eliminating
the runtime cost of higher order functions in many cases. The compilation parameters
we are using here were introduced in 2.12.x, so we don't enable them for Scala 2.11.
Also, we enable a more aggressive inlining policy for the `core` project since it's
not meant to be used as a library.
See https://www.lightbend.com/blog/scala-inliner-optimizer for more information about
the optimizer.
I verified that the lambda allocation in the code below (from LogCleaner.scala) went away
after this change with Scala 2.12 and 2.13.
```scala
private def consumeAbortedTxnsUpTo(offset: Long): Unit = {
  while (abortedTransactions.headOption.exists(_.firstOffset <= offset)) {
    val abortedTxn = abortedTransactions.dequeue()
    ongoingAbortedTxns.getOrElseUpdate(abortedTxn.producerId, new AbortedTransactionMetadata(abortedTxn))
  }
}
```
The relevant part of the bytecode when compiled with Scala 2.13 looks like:
```text
private void consumeAbortedTxnsUpTo(long);
Code:
0: aload_0
1: invokespecial #54 // Method abortedTransactions:()Lscala/collection/mutable/PriorityQueue;
4: invokevirtual #175 // Method scala/collection/mutable/PriorityQueue.headOption:()Lscala/Option;
7: dup
8: ifnonnull 13
11: aconst_null
12: athrow
13: astore 4
15: aload 4
17: invokevirtual #145 // Method scala/Option.isEmpty:()Z
20: ifne 48
23: aload 4
25: invokevirtual #148 // Method scala/Option.get:()Ljava/lang/Object;
28: checkcast #177 // class kafka/log/AbortedTxn
```
The increased inlining causes some spurious spotBugs warnings. I added a few suppressions
and fixed one warning by avoiding unnecessary boxing.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
This patch changes the way topic existence is checked in the DeleteOffset API. Previously, it was relying on the committed offsets. Now, it relies on the metadata cache which is better.
Reviewers: Jason Gustafson <jason@confluent.io>
As described in KIP-360, this patch changes producer state retention so that producer state remains cached even after it is removed from the log. Producer state will only be removed now when the transactional id expiration time has passed. This is intended to reduce the incidence of UNKNOWN_PRODUCER_ID errors for producers when records are deleted or when a topic has a short retention time. Tested with unit tests.
Reviewers: Jason Gustafson <jason@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
This patch adds flexible version support for the following inter-broker APIs: ControlledShutdown, LeaderAndIsr, UpdateMetadata, and StopReplica. Version checks have been removed from `getErrorResponse` methods since they were redundant given the checks in `AbstractRequest` and the respective `*Data` types.
Reviewers: Ismael Juma <ismael@juma.me.uk>
https://issues.apache.org/jira/browse/KAFKA-3705
Allows for a KTable to map its value to a given foreign key and join on another KTable keyed on that foreign key. Applies the joiner, then returns the tuples keyed on the original key. This supports updates from both sides of the join.
Reviewers: Guozhang Wang <wangguoz@gmail.com>, Matthias J. Sax <mjsax@apache.org>, John Roesler <john@confluent.io>, Boyang Chen <boyang@confluent.io>, Christopher Pettitt <cpettitt@confluent.io>, Bill Bejeck <bbejeck@gmail.com>, Jan Filipiak <Jan.Filipiak@trivago.com>, pgwhalen, Alexei Daniline
Previously we would log the following on each controller startup:
```
[2019-09-13 22:40:10,272] INFO [Controller id=2] DEPRECATED: Partitions being reassigned through ZooKeeper: Map()
```
This patch only logs the message if the map is non-empty.
Reviewers: Jason Gustafson <jason@confluent.io>
It adds support for deleting offsets to the `kafka-consumer-groups` command.
Author: David Jacot <djacot@confluent.io>
Reviewers: Gwen Shapira
Closes #7362 from dajac/KAFKA-8901-delete-offsets-command
As noted in KIP-467, the updated ProduceResponse is
```
Produce Response (Version: 8) => [responses] throttle_time_ms
responses => topic [partition_responses]
topic => STRING
partition_responses => partition error_code base_offset log_append_time log_start_offset
partition => INT32
error_code => INT16
base_offset => INT64
log_append_time => INT64
log_start_offset => INT64
error_records => [INT32] // new field, encodes the relative offset of the records that caused error
error_message => STRING // new field, encodes the error message that client can use to log itself
throttle_time_ms => INT32
```
with a new error code:
```
INVALID_RECORD(86, "Some record has failed the validation on broker and hence be rejected.", InvalidRecordException::new);
```
Reviewers: Jason Gustafson <jason@confluent.io>, Magnus Edenhill <magnus@edenhill.se>, Guozhang Wang <wangguoz@gmail.com>
Replaced UpdateMetadata{Request, Response}, LeaderAndIsr{Request, Response}
and StopReplica{Request, Response} with the automated protocol classes.
Updated the JSON schema for the 3 request types to be more consistent and
less strict (if needed to avoid duplication).
The general approach is to avoid generating new collections in the request
classes. Normalization happens in the constructor to make this possible. Builders
still have to group by topic to maintain the external ungrouped view.
Introduced new tests for LeaderAndIsrRequest and UpdateMetadataRequest to
verify that the new logic is correct.
A few other clean-ups/fixes in code that was touched due to these changes:
* KAFKA-8956: Refactor DelayedCreatePartitions#updateWaiting to avoid modifying
collection in foreach.
* Avoid unnecessary allocation for state change trace logging if trace logging is not enabled
* Use `toBuffer` instead of `toList`, `toIndexedSeq` or `toSeq` as it generally performs
better and it matches the performance characteristics of `java.util.ArrayList`. This is
particularly important when passing such instances to Java code.
* Minor refactoring for clarity and readability.
* Removed usage of deprecated `/:`, unused imports and unnecessary `var`s.
* Include exception in `AdminClientIntegrationTest` failure message.
* Move StopReplicaRequest verification in `AuthorizerIntegrationTest` to the end
to match the comment.
Reviewers: Colin Patrick McCabe <cmccabe@apache.org>
Add a version number to request and response headers. The header
version is determined by the first two 16 bit fields read (API key and
API version). For now, ControlledShutdown v0 has header version 0, and
all other requests have v1. Once KIP-482 is implemented, there will be
a v2 of the header which supports tagged fields.
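The lookup described above can be sketched as follows. This is a minimal illustration, not the generated protocol code: `HeaderVersionSketch` and its methods are hypothetical names, and the value 7 for the ControlledShutdown API key is stated here as an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of deriving the request header version from the first two 16-bit fields.
final class HeaderVersionSketch {
    // Assumption: 7 is the API key of ControlledShutdown.
    static final short CONTROLLED_SHUTDOWN_API_KEY = 7;

    // ControlledShutdown v0 uses header v0; every other request uses header v1.
    static short requestHeaderVersion(short apiKey, short apiVersion) {
        if (apiKey == CONTROLLED_SHUTDOWN_API_KEY && apiVersion == 0) {
            return 0;
        }
        return 1;
    }

    // Reads the API key and API version with absolute gets, so the buffer position
    // is left untouched for the actual header parser.
    static short peekHeaderVersion(ByteBuffer buffer) {
        short apiKey = buffer.getShort(buffer.position());
        short apiVersion = buffer.getShort(buffer.position() + 2);
        return requestHeaderVersion(apiKey, apiVersion);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(4);
        buffer.putShort(CONTROLLED_SHUTDOWN_API_KEY);
        buffer.putShort((short) 0);
        buffer.flip();
        System.out.println(peekHeaderVersion(buffer)); // prints "0"
    }
}
```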
1. Add the overloaded functions.
2. Update the code in Streams to use the batch API for better latency (this applies both to the active StreamsTask when initializing the offsets and to the StandbyTasks when updating offset limits).
3. Also update all unit tests to replace the deprecated APIs.
Reviewers: Christopher Pettitt <cpettitt@confluent.io>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>, Bill Bejeck <bill@confluent.io>
I realized that some flaky tests failed during setup or in calls that try to create the offsets topic, and I think using one partition and one replica would be sufficient in these cases.
Reviewers: Bill Bejeck <bill@confluent.io>
KIP-455 (18d4e57f6e) bumped the LeaderAndIsr version to 3 but did not change the Controller code to actually send the new version. The ControllerChannelManagerTest had a bug that made it assert incorrectly, which is why it did not catch the problem. This patch fixes said test.
Because the new fields in LeaderAndIsr are not used yet, the gap was not caught by integration tests either.
Reviewers: Jason Gustafson <jason@confluent.io>
When the cleaner runs, it's useful to know what the last modified time
of the segment and the deletion horizon are. The current log message
only allows you to infer that one is greater than the other.
Reviewers: Jun Rao <junrao@gmail.com>
This PR makes two changes to code in the ReplicaManager.updateFollowerFetchState path, which is in the hot path for follower fetches. Although calling ReplicaManager.updateFollowerFetchState is inexpensive on its own, it is called once for each partition every time a follower fetch occurs.
1. updateFollowerFetchState no longer calls maybeExpandIsr when the follower is already in the ISR. This avoids repeated expansion checks.
2. Partition.maybeIncrementLeaderHW is also in the hot path for ReplicaManager.updateFollowerFetchState. Partition.maybeIncrementLeaderHW calls Partition.remoteReplicas four times each iteration, and it performs a toSet conversion. maybeIncrementLeaderHW now avoids generating any intermediate collections when updating the HWM.
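The second change can be illustrated with a single pass over the replicas that keeps a running minimum instead of building filtered or converted collections. This is a simplified sketch, not the Partition code: `HighWatermarkSketch` and `Replica` are hypothetical types, and treating only in-sync replicas as candidates is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of computing a new high watermark without intermediate collections.
final class HighWatermarkSketch {
    // Minimal stand-in for a remote replica's state.
    static final class Replica {
        final int brokerId;
        final long logEndOffset;
        final boolean inSync;

        Replica(int brokerId, long logEndOffset, boolean inSync) {
            this.brokerId = brokerId;
            this.logEndOffset = logEndOffset;
            this.inSync = inSync;
        }
    }

    // The candidate high watermark is the smallest log end offset among the leader and the
    // in-sync remote replicas; a running minimum avoids any filter/map/toSet allocation.
    static long newHighWatermark(long leaderLogEndOffset, Iterable<Replica> remoteReplicas) {
        long min = leaderLogEndOffset;
        for (Replica replica : remoteReplicas) {
            if (replica.inSync && replica.logEndOffset < min) {
                min = replica.logEndOffset;
            }
        }
        return min;
    }

    public static void main(String[] args) {
        List<Replica> remotes = Arrays.asList(
            new Replica(1, 90L, true),
            new Replica(2, 50L, false),
            new Replica(3, 95L, true));
        System.out.println(newHighWatermark(100L, remotes)); // prints "90"
    }
}
```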
**Benchmark results for Partition.updateFollowerFetchState on a r5.xlarge:**
Old:
```
1288.633 ±(99.9%) 1.170 ns/op [Average]
(min, avg, max) = (1287.343, 1288.633, 1290.398), stdev = 1.037
CI (99.9%): [1287.463, 1289.802] (assumes normal distribution)
```
New (when follower fetch offset is updated):
```
261.727 ±(99.9%) 0.122 ns/op [Average]
(min, avg, max) = (261.565, 261.727, 261.937), stdev = 0.114
CI (99.9%): [261.605, 261.848] (assumes normal distribution)
```
New (when follower fetch offset is the same):
```
68.484 ±(99.9%) 0.025 ns/op [Average]
(min, avg, max) = (68.446, 68.484, 68.520), stdev = 0.023
CI (99.9%): [68.460, 68.509] (assumes normal distribution)
```
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
This adds an administrative API to delete consumer offsets of a group and extends the mechanism that expires offsets of consumer groups.
It makes the group coordinator aware of the set of topics a consumer group (protocol type == 'consumer') is actively subscribed to, allowing offsets of topics which are not actively subscribed to by the group to be either expired or administratively deleted. The expiration rules remain the same.
For the other groups (non-consumer), the API allows offsets to be deleted when the group is empty, and the expiration remains the same.
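The deletion rule described above can be sketched as a single predicate. This is an illustration only: `OffsetDeleteSketch` and `canDeleteOffsets` are hypothetical names, and the group state is reduced to a protocol type, an emptiness flag, and a set of subscribed topics.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of when a group's offsets for a topic may be administratively deleted.
final class OffsetDeleteSketch {
    static boolean canDeleteOffsets(String protocolType, boolean groupEmpty,
                                    Set<String> subscribedTopics, String topic) {
        if ("consumer".equals(protocolType)) {
            // Consumer groups: only topics the group is not actively subscribed to.
            return !subscribedTopics.contains(topic);
        }
        // Other groups: subscriptions are unknown, so the group must be empty.
        return groupEmpty;
    }

    public static void main(String[] args) {
        Set<String> subscribed = new HashSet<>(Arrays.asList("orders"));
        System.out.println(canDeleteOffsets("consumer", false, subscribed, "orders"));   // prints "false"
        System.out.println(canDeleteOffsets("consumer", false, subscribed, "payments")); // prints "true"
    }
}
```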
Reviewers: Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Jason Gustafson <jason@confluent.io>
Replace the `<table>` elements with `<ul>` so the full page width can be used for the configuration descriptions instead of only a very narrow column. I moved the other fields (Type, Default Value, etc.) below each entry.
Reviewers: Boyang Chen <boyang@confluent.io>, Jason Gustafson <jason@confluent.io>
If the topic already exists, `handleCreateTopicsRequest` should return TopicExistsException even given an invalid config (replication factor for instance).
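The ordering of the checks can be sketched as follows. This is an illustration, not the broker code: `CreateTopicCheckSketch`, its `Result` enum, and the replication factor rule are hypothetical stand-ins for the real validation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the existence check runs before any config validation.
final class CreateTopicCheckSketch {
    enum Result { OK, TOPIC_ALREADY_EXISTS, INVALID_REPLICATION_FACTOR }

    static Result validate(String topic, int replicationFactor, Set<String> existingTopics, int liveBrokers) {
        // Check existence first so an existing topic is always reported as such.
        if (existingTopics.contains(topic)) {
            return Result.TOPIC_ALREADY_EXISTS;
        }
        // Assumption for illustration: the factor must be between 1 and the live broker count.
        if (replicationFactor < 1 || replicationFactor > liveBrokers) {
            return Result.INVALID_REPLICATION_FACTOR;
        }
        return Result.OK;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("orders"));
        // Invalid replication factor, but the topic exists, so that error wins.
        System.out.println(validate("orders", 10, existing, 3)); // prints "TOPIC_ALREADY_EXISTS"
    }
}
```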
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Jason Gustafson <jason@confluent.io>