This focuses on the currently failing test, #9315 is a more complete fix
that we should also review and merge.
Reviewers: David Jacot <djacot@confluent.io>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address.
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Satish Duggana <satishd@apache.org>, David Jacot <djacot@confluent.io>
ducktape 0.7.11 fixes a bug where a unicode exception message would cause test runner to hang up and never finish.
This change should be backported to all the branches using ducktape 0.7.10
Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
Ducktape version 0.7.10 pinned paramiko to version 2.3.2 to deal with random SSHExceptions confluent had been seeing since ducktape was updated to a later version of paramiko.
The idea is that we can backport ducktape 0.7.10 change as far back as possible, while 2.7 and trunk can update to 0.8.0 and python3 separately.
Tested:
In progress, but unlikely to affect anything, since the only difference between ducktape 0.7.9 and 0.7.10 is paramiko version downgrade.
Author: Stanislav Vodetskyi <stan@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#9490 from stan-confluent/ducktape-710-26
(cherry picked from commit 1cbc4da0c9)
Signed-off-by: Manikumar Reddy <manikumar.reddy@gmail.com>
The `org.apache.kafka.connect.data.Values#parse` method parses integers, which are larger than `Long.MAX_VALUE` as `double` with `Schema.FLOAT64_SCHEMA`.
That means we are losing precision for these larger integers.
For example:
`SchemaAndValue schemaAndValue = Values.parseString("9223372036854775808");`
returns:
`SchemaAndValue{schema=Schema{FLOAT64}, value=9.223372036854776E18}`
Also, this method parses values that can be parsed as `FLOAT32` to `FLOAT64`.
This PR changes parsing logic, to use `FLOAT32`/`FLOAT64` for numbers that don't have fraction part(`decimal.scale()!=0`) only, and use an arbitrary-precision `org.apache.kafka.connect.data.Decimal` otherwise.
Also, it updates the method to parse numbers, that can be represented as `float` to `FLOAT64`.
Added unit tests, that cover parsing `BigInteger`, `Byte`, `Short`, `Integer`, `Long`, `Float`, `Double` types.
Reviewers: Konstantine Karantasis <k.karantasis@gmail.com>
Fixes a regression introduced in `JsonConverter` with previous upgrades from Jackson Databind 2.9.x to 2.10.x. Jackson Databind version 2.10.0 included a backward-incompatible behavioral change to use `JsonNodeType.MISSING` (and `MissingNode`, the subclass of `JsonNode` that has a type of `MISSING`) instead of `JsonNodeType.NULL` / `NullNode`. See https://github.com/FasterXML/jackson-databind/issues/2211 for details of this change.
This change makes recovers the older `JsonConverter` behavior of returning null on empty input.
Added two unit tests for this change. Both unit tests were independently tested with earlier released versions and passed on all versions that used Jackson 2.9.x and earlier, and failed on all versions that used 2.10.x and that did not have the fixed included in the PR. Both of the new unit tests pass with this fix to `JsonConverter`.
Author: Shaik Zakir Hussain <zhussain@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
ducktape diff: https://github.com/confluentinc/ducktape/compare/v0.7.8...v0.7.9
- bcrypt (a dependency of ducktape) dropped Python2.7 support.
ducktape-0.7.9 now pins bcrypt to a Python2.7-supported version.
Author: Andrew Egelhofer <aegelhofer@confluent.io>
Reviewers: Dhruvil Shah <dhruvil@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes#9192 from andrewegel/trunk
This is a backport of (#7687) for 2.3.
Existing producer state expiration uses timestamps from data records only and not from transaction markers. This can cause premature producer expiration when the coordinator times out a transaction because we drop the state from existing batches. This in turn can allow the coordinator epoch to revert to a previous value, which can lead to validation failures during log recovery. This patch fixes the problem by also leveraging the timestamp from transaction markers.
We also change the validation logic so that coordinator epoch is verified only for new marker appends. When replicating from the leader and when recovering the log, we only log a warning if we notice that the coordinator epoch has gone backwards. This allows recovery from previous occurrences of this bug.
Finally, this patch fixes one minor issue when loading producer state from the snapshot file. When the only record for a given producer is a control record, the "last offset" field will be set to -1 in the snapshot. We should check for this case when loading to be sure we recover the state consistently.
Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>
Recently, system tests test_rebalance_[simple|complex] failed
repeatedly with a verfication error. The cause was most probably
the missing clean-up of a state directory of one of the processors.
A node is cleaned up when a service on that node is started and when
a test is torn down.
If the clean-up flag clean_node_enabled of a EOS Streams service is
unset, the clean-up of the node is skipped.
The clean-up flag of processor1 in the EOS tests should stay set before
its first start, so that the node is cleaned before the service is started.
Afterwards for the multiple restarts of processor1 the cleans-up flag should
be unset to re-use the local state.
After the multiple restarts are done, the clean-up flag of processor1 should
again be set to trigger node clean-up during the test teardown.
A dirty node can lead to test failures when tests from Streams EOS tests are
scheduled on the same node, because the state store would not start empty
since it reads the local state that was not cleaned up.
Reviewers: Matthias J. Sax <mjsax@apache.org>, Andrew Choi <andchoi@linkedin.com>, Bill Bejeck <bbejeck@gmail.com>
This PR contains the fix of race condition bug between "consumer thread" and "consumer coordinator heartbeat thread". It reproduces in many production environments.
Condition for reproducing:
1. Consumer thread initiates rejoin to the group because of commit timeout. Call of AbstractCoordinator#joinGroupIfNeeded which leads to sendJoinGroupRequest.
2. JoinGroupResponseHandler writes to the AbstractCoordinator.this.generation new generation data and leaves the synchronized section.
3. Heartbeat thread executes mabeLeaveGroup and clears generation data via resetGenerationOnLeaveGroup.
4. Consumer thread executes onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment); with the cleared generation data. This leads to the corresponding exception.
Note this PR is a backport of #7460 for 2.2. It also contains part of #7451.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Co-authored-by: Nikolay <nizhikov@apache.org>
This fixes a version pinning issue where a transitive dependency had a
major version upgrade that a dependency did not account for, breaking
the build.
Reviewers: Andrew Egelhofer <aegelhofer@confluent.io>, Matthias J. Sax <matthias@confluent.io>
There is a race on receiving a LeaderAndIsr request for a replica with an active log dir reassignment. If the reassignment completes just before the LeaderAndIsr handler updates epoch information, it can lead to an illegal state error since no future log dir exists. This patch fixes the problem by ensuring that the future log dir exists when the fetcher is started. Removal cannot happen concurrently because it requires access the same partition state lock.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, David Arthur <mumrah@gmail.com>
Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
Currently when there is a leader change with a log dir reassignment in progress, we do not update the leader epoch in the partition state maintained by `ReplicaAlterLogDirsThread`. This can lead to a FENCED_LEADER_EPOCH error, which results in the partition being marked as failed, which is a permanent failure until the broker is restarted. This patch fixes the problem by updating the epoch in `ReplicaAlterLogDirsThread` after receiving a new LeaderAndIsr request from the controller.
Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
A port of #8400 for 2.3. The process of sorting source and sink nodes changed in 2.4, so we can't cherry-pick the PR directly as we need to update the expected topology to what it would be in the 2.3 version.
Reviewers: John Roesler <john@confluent.io>, Andrew Choi <a24choi@edu.uwaterloo.ca>
Document the supported endpoint at the top-level (root) REST API resource and the information that it returns when a request is made to a Connect worker.
Fixes an omission in documentation after KAFKA-2369 and KAFKA-6311 (KIP-238)
Reviewers: Toby Drake <tobydrake7@gmail.com>, Soenke Liebau <soenke.liebau@opencore.com>
* Fixed DataException thrown when handling tombstone events with null value
* Passes through original record when finding a null key when it's configured for keys or a null value when it's configured for values.
* Added unit tests for schema and schemaless data
In case of an error while flattening a record with schema, the Flatten transformation was reporting an error about a record without schema, as follows:
```
org.apache.kafka.connect.errors.DataException: Flatten transformation does not support ARRAY for record without schemas (for field ...)
```
The expected behaviour would be an error message specifying "with schemas".
This looks like a simple copy/paste typo from the schemaless equivalent methods, in the same file
Reviewers: Ewen Cheslack-Postava <me@ewencp.org>, Konstantine Karantasis <konstantine@confluent.io>
Simple doc fix in a code snippet in connect.html
Co-authored-by: Scott Ferguson <smferguson@gmail.com>
Reviewers: Ewen Cheslack-Postava <me@ewencp.org>, Konstantine Karantasis <konstantine@confluent.io>
* KAFKA-9707: Fix InsertField.Key not applying to tombstone events
* Fix typo that hardcoded .value() instead of abstract operatingValue
* Add test for Key transform that was previously not tested
Signed-off-by: Greg Harris <gregh@confluent.io>
* Add null value assertion to tombstone test
* Remove mis-named function and add test for passing-through a null-keyed record.
Signed-off-by: Greg Harris <gregh@confluent.io>
* Simplify unchanged record assertion
Signed-off-by: Greg Harris <gregh@confluent.io>
* Replace assertEquals with assertSame
Signed-off-by: Greg Harris <gregh@confluent.io>
* Fix checkstyleTest indent issue
Signed-off-by: Greg Harris <gregh@confluent.io>
Older versions of the JoinGroup rely on a new member timeout to keep the group from growing indefinitely in the case of client disconnects and retrying. The logic for resetting the heartbeat expiration task following completion of the rebalance failed to account for an implicit expectation that shouldKeepAlive would return false the first time it is invoked when a heartbeat expiration is scheduled. This patch fixes the issue by making heartbeat satisfaction logic explicit.
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Rajini Sivaram <rajinisivaram@googlemail.com>
Backport of #7297 and #7715 to allow per-node broker overrides and extra JVM args
Co-authored-by: David Arthur <mumrah@gmail.com>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
In Kafka Connect, a ConfigProvider instance can be used concurrently (e.g. via a PUT request to the `/connector-plugins/{connectorType}/config/validate` REST endpoint), but there is no mention of concurrent usage in the Javadocs of the ConfigProvider interface.
It's worth calling out that implementations need to be thread safe.
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
When incorrect connector configuration is detected, the returned exception message suggests to check the connector's configuration against the `{connectorType}/config/validate` endpoint.
Changing the error message to refer to the exact REST endpoint which is `/connector-plugins/{connectorType}/config/validate`
This aligns the exception message with the documentation at: https://kafka.apache.org/documentation/#connect_rest
Reviewers: Konstantine Karantasis <konstantine@confluent.io>
Change unit tests to make sure the consumer group is in Stable state (i.e. consumers have completed joining the group)
(cherry picked from commit 350dce865a)
Reviewers: Ismael Juma <ismael@juma.me.uk>
Co-authored-by: Chia-Ping Tsai <chia7712@gmail.com>
Fixes a bug where KStream#transformValues would forward null values from the provided ValueTransform#transform operation.
A test was added for verification KStreamTransformValuesTest#shouldEmitNoRecordIfTransformReturnsNull. A parallel test for non-key ValueTransformer was not added, as they share the same code path.
Reviewers: Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
Changed `EmbeddedConnectCluster` to add utility methods that return `Response`, throw `ConnectException` instead of `IOException` for failures, and deprecate the old methods that returned primitive types rather than `Response`.
Also introduce common assertions for embedded clusters under `EmbeddedConnectClusterAssertions`.
Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
(#7023) exposed an incompatibility between Kafka <=0.9 and Connect >0.9,
in which the broker does not recognize a request for ApiVersions.
For trunk and 2.4, this test case was removed rather than the issue addressed.
This effectively backports the other half of (#7023) which was left out of (#7791).
Signed-off-by: Greg Harris <gregh@confluent.io>
Author: Greg Harris <gregh@confluent.io>
Reviewers: Randall Hauch <rhauch@gmail.com>, Andrew Choi <andchoi@linkedin.com>
This fix makes the LogCleaner tolerant of gaps in the offset sequence. Previously, this could lead to endless loops of cleaning which required manual intervention.
Reviewers: Jun Rao <junrao@gmail.com>, David Arthur <mumrah@gmail.com>
Currently, when a dynamic change is made to the broker-level default log configuration, existing log configs will be recreated with an empty overridden configs. In such case, when updating dynamic broker configs a second round, the topic-level configs are lost. This can cause unexpected data loss, for example, if the cleanup policy changes from "compact" to "delete."
Reviewers: Rajini Sivaram <rajinisivaram@googlemail.com>, Jason Gustafson <jason@confluent.io>
This commit makes `DistributedHerder` log that some error has happened during task reconfiguration only when it actually has happened.
Author: Ivan Yurchenko <ivan0yurchenko@gmail.com>
Reviewer: Randall Hauch <rhauch@gmail.com>
`EmbeddedConnectCluster` has the ability to mask system exits to avoid killing the jvm. It appears that the default was intended to be `true`, but is actually `false`. The `maskExitProcedures` method on `EmbeddedConnectCluster.Builder` documents the parameter as:
```
* @param mask if false, exit and halt procedures remain unchanged; true is the default.
```
Because this is not enabled by default as intended, we are seeing some build failures which exit abruptly:
```
17:29:11 Execution failed for task ':connect:runtime:integrationTest'.
17:29:11 > Process 'Gradle Test Executor 25' finished with non-zero exit value 1
```
The culprit often appears to be `ExampleConnectIntegrationTest`, which indeed does not override the default value of `maskExitProcedures`.
Reviewers: Ewen Cheslack-Postava <me@ewencp.org>
The v3 JoinGroup logic does not properly complete the initial heartbeat for new members, which then expires after the static 5 minute timeout if the member does not rejoin. The core issue is in the `shouldKeepAlive` method, which returns false when it should return true because of an inconsistent timeout check.
Reviewers: Jason Gustafson <jason@confluent.io>
Allow transaction metadata to be reloaded, even if it already exists as of a previous epoch. This helps with cases where a previous become-follower transition failed to unload corresponding metadata.
Reviewers: Jun Rao <junrao@gmail.com>, Jason Gustafson <jason@confluent.io>
Bumped kafkatest/version.py to 2.2.3-SNAPSHOT
Updated versions following instructions in streams_upgrade_test.py
Reviewers: Bill Bejeck <bbejeck@gmail.com>