When configuring RLMM, the configs passed into the configure method come from RemoteLogManagerConfig. But RemoteLogManagerConfig has no configs related to remote.log.metadata.*, e.g. remote.log.metadata.topic.replication.factor. So, even if users set the config on the broker, it is never applied.
This PR fixes the issue by letting users set the RLMM prefix via remote.log.metadata.manager.impl.prefix (default: rlmm.config.) and then append the desired remote.log.metadata.* configs; those entries are passed into RLMM, including configs under the remote.log.metadata.common.client. / remote.log.metadata.producer. / remote.log.metadata.consumer. prefixes.
Ex:
# default value
# remote.log.storage.manager.impl.prefix=rsm.config.
# remote.log.metadata.manager.impl.prefix=rlmm.config.
rlmm.config.remote.log.metadata.topic.num.partitions=50
rlmm.config.remote.log.metadata.topic.replication.factor=4
rsm.config.test=value
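As a rough sketch of the forwarding mechanism (illustrative code, not the exact RemoteLogManager implementation; Kafka's AbstractConfig offers originalsWithPrefix for the same purpose), the broker can strip the configured prefix and hand only the matching entries to the RLMM's configure():
```java
import java.util.HashMap;
import java.util.Map;

public class RlmmConfigExample {
    // Strip a configurable prefix (default "rlmm.config.") from the broker properties and
    // collect the remaining keys so they can be passed to RemoteLogMetadataManager#configure.
    static Map<String, Object> extractPrefixed(Map<String, Object> brokerProps, String prefix) {
        Map<String, Object> rlmmProps = new HashMap<>();
        brokerProps.forEach((key, value) -> {
            if (key.startsWith(prefix)) {
                rlmmProps.put(key.substring(prefix.length()), value);
            }
        });
        return rlmmProps;
    }

    public static void main(String[] args) {
        Map<String, Object> brokerProps = new HashMap<>();
        brokerProps.put("rlmm.config.remote.log.metadata.topic.replication.factor", "4");
        brokerProps.put("rsm.config.test", "value");
        // Only the rlmm.config.* entry is forwarded, with its prefix removed.
        System.out.println(extractPrefixed(brokerProps, "rlmm.config."));
    }
}
```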
Reviewers: Christo Lolov <christololov@gmail.com>, Kamal Chandraprakash <kchandraprakash@uber.com>, Divij Vaidya <diviv@amazon.com>
This adds comments to the ConsumerRebalanceListener overrides, to briefly explain why we override these methods, when they are called, and what you can or cannot do in them. onPartitionsLost in particular can create some confusion, given its default implementation.
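A minimal illustrative listener (not the code touched by this PR) showing the kind of guidance such comments give for each override:
```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class CommentedRebalanceListener implements ConsumerRebalanceListener {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before a rebalance takes away partitions we currently own;
        // a good place to commit offsets or flush state for those partitions.
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after the rebalance completes with the partitions we now own;
        // safe to seek or (re)initialize per-partition state here.
    }

    @Override
    public void onPartitionsLost(Collection<TopicPartition> partitions) {
        // Called when ownership was lost without an orderly revocation (e.g. the
        // session timed out), so the partitions may already belong to another
        // consumer; do not commit offsets for them here. Note that the default
        // implementation simply delegates to onPartitionsRevoked.
    }
}
```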
Reviewers: Luke Chen <showuon@gmail.com>, David Jacot <djacot@confluent.io>
getAliveBrokerNode returns fenced brokers as alive, which is inconsistent with methods like
getAliveBrokerNodes. Add a filter so fenced brokers are not returned, and add a test to validate
that the two methods are consistent.
Reviewers: Colin P. McCabe <cmccabe@apache.org>
On the controller, move publishing acls to the Authorizer into a dedicated MetadataPublisher,
AclPublisher. This publisher listens for notifications from MetadataLoader, and receives only
committed data. This brings the controller side in line with how the broker has always worked. It
also avoids some ugly code related to publishing directly from the QuorumController. Most important
of all, it clears the way to implement metadata transactions without worrying about Authorizer
state (since it will be handled by the MetadataLoader, along with other metadata image state).
In AclsDelta, we can remove isSnapshotDelta. We always know when the MetadataLoader is giving us a
snapshot. Also bring AclsDelta in line with the other delta classes, where completeSnapshot
calculates the diff between the previous image and the next one. We don't use this delta (since we
just apply the image directly to the authorizer) but we should have it, for consistency.
Finally, change MockAclMutator to avoid the need to subclass AclControlManager.
Reviewers: David Arthur <mumrah@gmail.com>
* MINOR: Add test for describe topic with ID
Add a simple test to verify topic description with topic IDs.
Reviewers: Divij Vaidya <diviv@amazon.com>, dengziming <dengziming1993@gmail.com>
Part of KIP-925.
- Add more tests for HighAvailabilityTaskAssignor
- Remove null and optional check for RackAwareTaskAssignor
- Pass rack aware assignor configs to getMainConsumerConfigs so that they can be picked up in rebalance protocol
- Change STATELESS_NON_OVERLAP_COST to 0. Setting it to 1 was a mistake; stateless tasks should be movable without incurring this cost.
- Update existing tests
Reviewers: Matthias J. Sax <matthias@confluent.io>
The KRaft client uses an expiration service to complete FETCH requests that have timed out. This expiration service uses a different thread from the KRaft polling thread. This means that it is unsafe for the expiration service thread to call tryCompleteFetchRequest, which reads and updates a lot of state that is assumed to only be read and updated from the polling thread.
The KRaft client now does not call tryCompleteFetchRequest when the FETCH request has expired. Instead, it sends the FETCH response that was computed when the FETCH request was first handled.
This change also fixes a bug where the KRaft client was not sending the FETCH response immediately, if the response contained a diverging epoch or snapshot id.
Reviewers: Jason Gustafson <jason@confluent.io>
A HistoricalIterator at epoch N is supposed to only reveal elements at epoch N or earlier. However,
due to a bug, we sometimes will reveal elements which are at a newer epoch than N. The bug does
not affect elements that are in the latest epoch (aka topTier). It only affects elements that are
newer than N, but which do not persist until the latest epoch. This PR fixes the bug and adds a
unit test for this case.
Reviewers: David Arthur <mumrah@gmail.com>
On ext4 file systems we have seen snapshots with zero-length files. This is possible if
the file is closed and moved before forcing the channel to write to disk.
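A minimal sketch of the safer ordering with plain NIO (illustrative, not the actual snapshot writer): force the channel to disk before closing and moving the temporary file into place.
```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class SnapshotWriteExample {
    // Write the snapshot bytes to a temporary file, fsync it, and only then move it
    // to its final name; moving before force() can leave a zero-length file on ext4.
    static void writeAndMove(Path tmp, Path dest, ByteBuffer data) throws IOException {
        try (FileChannel channel = FileChannel.open(tmp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            channel.write(data);
            channel.force(true); // flush file content and metadata to disk
        }
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```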
Reviewers: Ron Dagostino <rndgstn@gmail.com>, Alok Thatikunta <athatikunta@confluent.io>
Change in response to KIP-941.
New PR due to merge issue.
Changes line 57 in the RangeQuery class file from:
public static <K, V> RangeQuery<K, V> withRange(final K lower, final K upper) {
    return new RangeQuery<>(Optional.of(lower), Optional.of(upper));
}
to
public static <K, V> RangeQuery<K, V> withRange(final K lower, final K upper) {
    return new RangeQuery<>(Optional.ofNullable(lower), Optional.ofNullable(upper));
}
Testing strategy:
Since null bounds can now be passed to RangeQuery in order to obtain full scans, I changed the query-construction logic starting at line 1085 in IQv2StoreIntegrationTest.java from:
final RangeQuery<Integer, V> query;
if (lower.isPresent() && upper.isPresent()) {
    query = RangeQuery.withRange(lower.get(), upper.get());
} else if (lower.isPresent()) {
    query = RangeQuery.withLowerBound(lower.get());
} else if (upper.isPresent()) {
    query = RangeQuery.withUpperBound(upper.get());
} else {
    query = RangeQuery.withNoBounds();
}
to
query = RangeQuery.withRange(lower.orElse(null), upper.orElse(null));
because handling the different combinations of isPresent() on the bounds is no longer necessary.
Reviewers: John Roesler <vvcephei@apache.org>, Bill Bejeck <bbejeck@apache.org>
Part of KIP-925.
- Change TaskAssignor interface to take RackAwareTaskAssignor
- Integrate RackAwareTaskAssignor to StreamsPartitionAssignor and HighAvailabilityTaskAssignor
- Update HAAssignor tests
Reviewers: Anna Sophie Blee-Goldman <ableegoldman@apache.org>, Matthias J. Sax <matthias@confluent.io>
Part of KIP-925.
- Add configs for rack aware assignor
- Update standby optimizer in RackAwareTaskAssignor to have more rounds
- Refactor some methods in RackAwareTaskAssignorTest into test utils so that they can also be used in HighAvailabilityTaskAssignorTest and other tests
Reviewers: Matthias J. Sax <matthias@confluent.io>
Kafka system tests with Java 17 are failing with this issue:
```python
TimeoutError("MiniKdc didn't finish startup",)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 186, in _do_run
data = self.run_test()
File "/usr/local/lib/python3.6/site-packages/ducktape/tests/runner_client.py", line 246, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.6/site-packages/ducktape/mark/_mark.py", line 433, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/opt/kafka-dev/tests/kafkatest/sanity_checks/test_verifiable_producer.py", line 74, in test_simple_run
self.kafka.start()
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 635, in start
self.start_minikdc_if_necessary(add_principals)
File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 596, in start_minikdc_if_necessary
self.minikdc.start()
File "/usr/local/lib/python3.6/site-packages/ducktape/services/service.py", line 265, in start
self.start_node(node, **kwargs)
File "/opt/kafka-dev/tests/kafkatest/services/security/minikdc.py", line 114, in start_node
monitor.wait_until("MiniKdc Running", timeout_sec=60, backoff_sec=1, err_msg="MiniKdc didn't finish startup")
File "/usr/local/lib/python3.6/site-packages/ducktape/cluster/remoteaccount.py", line 754, in wait_until
allow_fail=True) == 0, **kwargs)
File "/usr/local/lib/python3.6/site-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: MiniKdc didn't finish startup
```
Specifically, when one runs the test cases and looks at the logs of the MiniKdc:
```java
Exception in thread "main" java.lang.IllegalAccessException: class kafka.security.minikdc.MiniKdc cannot access class sun.security.krb5.Config (in module java.security.jgss) because module java.security.jgss does not export sun.security.krb5 to unnamed module @24959ca4
at java.base/jdk.internal.reflect.Reflection.newIllegalAccessException(Reflection.java:392)
at java.base/java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:674)
at java.base/java.lang.reflect.Method.invoke(Method.java:560)
at kafka.security.minikdc.MiniKdc.refreshJvmKerberosConfig(MiniKdc.scala:268)
at kafka.security.minikdc.MiniKdc.initJvmKerberosConfig(MiniKdc.scala:245)
at kafka.security.minikdc.MiniKdc.start(MiniKdc.scala:123)
at kafka.security.minikdc.MiniKdc$.start(MiniKdc.scala:375)
at kafka.security.minikdc.MiniKdc$.main(MiniKdc.scala:366)
at kafka.security.minikdc.MiniKdc.main(MiniKdc.scala)
```
This error is caused by the fact that the sun.security internals are strongly encapsulated and no longer accessible by default in Java 16 and higher; see [1].
There are two ways to solve it, and I present one of them. The second way would be to export the environment variable during the deployment of the containers using Ducktape in [2].
[1] - https://openjdk.org/jeps/396
[2] - https://github.com/apache/kafka/blob/trunk/tests/docker/ducker-ak#L308
Reviewers: Ismael Juma <ismael@juma.me.uk>, Luke Chen <showuon@gmail.com>
Only initialize remote topic metrics when remote storage is enabled system-wide, to avoid impacting performance for existing brokers. Also add tests.
Reviewers: Divij Vaidya <diviv@amazon.com>, Kamal Chandraprakash <kamal.chandraprakash@gmail.com>
* KAFKA-15107: Support custom metadata for remote log segment
This commit implements the changes discussed in KIP-917. Mainly, it changes the `RemoteStorageManager` interface to return `CustomMetadata` and then ensures this custom metadata is stored, propagated, and (de-)serialized correctly along with the standard metadata throughout the whole lifecycle. It introduces the `remote.log.metadata.custom.metadata.max.size` config to limit the custom metadata size acceptable by the broker and to stop uploading in case a piece of metadata exceeds this limit.
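As a rough sketch of the size check (illustrative names, not the exact broker code), the serialized custom metadata returned by the plugin can be measured against the configured limit before the upload proceeds:
```java
import java.util.Optional;

public class CustomMetadataLimitExample {
    // Hypothetical stand-in for the custom metadata returned by a RemoteStorageManager plugin.
    record CustomMetadata(byte[] value) { }

    // Reject (and let the caller stop the upload) when the serialized custom metadata
    // is larger than remote.log.metadata.custom.metadata.max.size.
    static void validateSize(Optional<CustomMetadata> customMetadata, int maxSizeBytes) {
        customMetadata.ifPresent(metadata -> {
            if (metadata.value().length > maxSizeBytes) {
                throw new IllegalArgumentException(
                    "Custom metadata size " + metadata.value().length
                        + " exceeds the configured limit of " + maxSizeBytes + " bytes");
            }
        });
    }
}
```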
On testing:
1. `RemoteLogManagerTest` checks the case when a piece of custom metadata is larger than the configured limit.
2. `RemoteLogSegmentMetadataTest` checks if `createWithUpdates` works correctly, including custom metadata.
3. `RemoteLogSegmentMetadataTransformTest`, `RemoteLogSegmentMetadataSnapshotTransformTest`, and `RemoteLogSegmentMetadataUpdateTransformTest` were added to test the corresponding class (de-)serialization, including custom metadata.
4. `FileBasedRemoteLogMetadataCacheTest` checks if custom metadata are being correctly saved and loaded to a file (indirectly, via `equals`).
5. `RemoteLogManagerConfigTest` checks if the configuration setting is handled correctly.
Reviewers: Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>, Divij Vaidya <diviv@amazon.com>
A stream thread should only change to RUNNING if there are no
active tasks in restoration in the state updater and if there
are no pending tasks to recycle.
There are situations in which a stream thread might only have
standby tasks that are recycled to active tasks after a rebalance.
In such situations, the stream thread might be faster in checking
for active tasks in restoration than the state updater is in removing
the standby tasks to recycle from the state updater. If that happens,
the stream thread changes to RUNNING although it should wait until
the standby tasks are recycled into active tasks and restored.
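A minimal sketch of the stricter guard, with illustrative names rather than the actual StreamThread internals:
```java
public class RunningTransitionSketch {
    // The thread may only advance to RUNNING once the state updater has no restoring
    // active tasks AND no standby tasks are still waiting to be recycled into active
    // tasks; otherwise the race described above lets it transit too early.
    static boolean canTransitToRunning(int restoringActiveTasks, int standbysPendingRecycle) {
        return restoringActiveTasks == 0 && standbysPendingRecycle == 0;
    }
}
```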
Reviewers: Walker Carlson <wcarlson@confluent.io>, Matthias J. Sax <matthias@confluent.io>
In 3.5.0, AbstractStickyAssignor may run a useless loop in performReassignments because isBalanced has a trivial mistake, resulting in a rebalance timeout in some situations.
Co-authored-by: lixy <lixy@tuya.com>
Reviewers: Ritika Reddy <rreddy@confluent.io>, Philip Nee <pnee@confluent.io>, Kirk True <kirk@mustardgrain.com>, Guozhang Wang <wangguoz@gmail.com>
In tiered storage, a segment is eligible for deletion from local disk when it gets uploaded to the remote storage.
If the topic's active segment contains some messages and there are no new incoming messages, then the active segment gets rolled to a passive segment after the configured log.roll.ms timeout.
The logic that finds candidate segments in RemoteLogManager does not consider the recently rolled passive segment as eligible for upload to remote storage, so that segment won't be removed even after it breaches the retention time/size, i.e. the topic won't become empty once it goes stale.
Added a unit test covering the scenario, which fails without this patch.
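As a rough illustration of the corrected candidate selection (hypothetical types and names, not the actual RemoteLogManager code): every passive segment whose data lies at or below the last-stable offset is eligible, including one that was rolled out of the active position just before the check.
```java
import java.util.ArrayList;
import java.util.List;

public class CandidateSegmentSketch {
    // Hypothetical segment view: base offset inclusive, next offset exclusive.
    record Segment(long baseOffset, long nextOffset) { }

    // A segment is a candidate for upload if it is no longer the active segment and
    // all of its data is at or below the last-stable offset.
    static List<Segment> candidateSegments(List<Segment> segments,
                                           long activeBaseOffset,
                                           long lastStableOffset) {
        List<Segment> candidates = new ArrayList<>();
        for (Segment segment : segments) {
            boolean isActive = segment.baseOffset() == activeBaseOffset;
            if (!isActive && segment.nextOffset() <= lastStableOffset) {
                candidates.add(segment);
            }
        }
        return candidates;
    }
}
```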
Reviewers: Christo Lolov <lolovc@amazon.com>, Luke Chen <showuon@gmail.com>, Satish Duggana <satishd@apache.org>
Part of KIP-925.
For rack-aware standby task assignment, we can either use the already existing "rack tags" or, as a fallback, the newly added "rack.id". This PR unifies both without the need to change the actual standby task assignment logic.
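For illustration only (the tag configs below already exist for rack-aware standby assignment, and client.rack is the regular consumer config passed through the consumer. prefix; how this PR wires them together internally is not shown here), an application can provide either source of rack information:
# Option 1: rack tags
rack.aware.assignment.tags=zone
client.tag.zone=eu-west-1a
# Option 2 (fallback): rack id reported by the embedded consumer
consumer.client.rack=eu-west-1a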
Reviewers: Matthias J. Sax <matthias@confluent.io>
This patch does a few code cleanups in the group-coordinator module.
It renames Coordinator to CoordinatorShard;
It renames ReplicatedGroupCoordinator to GroupCoordinatorShard. I was never really happy with this name. The new name makes more sense to me;
It removes TopicPartition from the GroupMetadataManager. It was only used in log messages. The log context already includes it so we don't have to log it again.
It renames assignors to consumerGroupAssignors.
Reviewers: Jeff Kim <jeff.kim@confluent.io>, Justine Olshan <jolshan@confluent.io>