Improved the exception messages that are thrown to indicate whether it was a key or value conversion problem.
Author: Mario Molina <mmolimar@gmail.com>
Reviewer: Randall Hauch <rhauch@gmail.com>
Added a section about error reporting in Connect documentation, and another about how to safely use the new errant record reporter in SinkTask implementations.
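For illustration, a minimal sketch of the safe-usage pattern that documentation describes (the connector class and its process() helper are hypothetical): obtain the reporter defensively in start() and fall back to failing the task when it is unavailable.

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.ErrantRecordReporter;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MySinkTask extends SinkTask {
    private ErrantRecordReporter reporter;

    @Override
    public void start(Map<String, String> props) {
        try {
            // May be null if the connector is not configured with an error reporter.
            reporter = context.errantRecordReporter();
        } catch (NoSuchMethodError | NoClassDefFoundError e) {
            // Running against an older Connect runtime that predates the reporter API.
            reporter = null;
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                process(record); // hypothetical connector-specific processing
            } catch (Exception e) {
                if (reporter != null) {
                    reporter.report(record, e); // report the errant record asynchronously
                } else {
                    throw new ConnectException("Failed to process record", e);
                }
            }
        }
    }

    private void process(SinkRecord record) { /* connector-specific logic */ }

    @Override
    public void stop() { }

    @Override
    public String version() { return "1.0"; }
}
```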
Author: Aakash Shah <ashah@confluent.io>
Reviewer: Randall Hauch <rhauch@gmail.com>
After commit 3661f981fff2653aaf1d5ee0b6dde3410b5498db, security_config is cached. Hence, later changes to the security flag cannot affect the security_config used by later tests.
issue: https://issues.apache.org/jira/browse/KAFKA-10214
Author: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Ron Dagostino <rdagostino@confluent.io>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #8949 from chia7712/KAFKA-10214
During Streams' system tests the PIDs of the Streams
clients are collected. The method that collects the PIDs
swallows any exception that might be thrown by the
ssh_capture() function. Swallowing exceptions
makes the investigation of failures harder,
because no information about what happened is recorded.
Reviewers: John Roesler <vvcephei@apache.org>
Fix findbugs multithreaded correctness warnings for Streams; updated variables to be thread-safe.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>, John Roesler <vvcephei@apache.org>
It's currently not possible to unit-test custom processors that use windowed stores,
because the provided windowed store implementations cast the context to
InternalProcessorContext.
This change adds a public API example using windowed stores, and fixes the
internal casts that previously made that example fail.
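A sketch of the kind of test this change enables (store name, serdes, and values are illustrative; MockProcessorContext comes from the kafka-streams-test-utils artifact):

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.MockProcessorContext;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.junit.Test;

public class WindowedStoreUnitTest {

    @Test
    public void shouldInitWindowStoreWithMockContext() {
        final Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "unit-test");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");
        final MockProcessorContext context = new MockProcessorContext(config);

        // Build a windowed store the same way a custom processor's topology would.
        final WindowStore<String, Long> store = Stores.windowStoreBuilder(
                Stores.inMemoryWindowStore("counts", Duration.ofHours(1), Duration.ofMinutes(5), false),
                Serdes.String(), Serdes.Long())
            .withLoggingDisabled() // no changelog needed in a unit test
            .build();

        // Previously this failed because the store implementation cast the
        // context to InternalProcessorContext; with this change it works.
        store.init(context, store);
        context.register(store, null);

        store.put("key", 1L, 0L);
    }
}
```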
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>
Call Producer#flush to make sure all records are indeed sent "synchronously" when EOS is not enabled in the OptimizedKTableIntegrationTest#shouldApplyUpdatesToStandbyStore.
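For context, a minimal sketch of the pattern (broker address and topic are placeholders): without EOS there is no transaction commit forcing delivery, so the test relies on Producer#flush to block until all buffered records have completed.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FlushExample {
    public static void main(String[] args) {
        final Properties config = new Properties();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (Producer<String, String> producer = new KafkaProducer<>(config)) {
            for (final String key : Arrays.asList("a", "b", "c")) {
                producer.send(new ProducerRecord<>("input-topic", key, "value"));
            }
            // flush() blocks until every buffered record has been sent and acknowledged,
            // which is what makes the sends effectively "synchronous" without EOS.
            producer.flush();
        }
    }
}
```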
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Boyang Chen <boyang@confluent.io>
Since https://issues.apache.org/jira/browse/KAFKA-8834, describing topics with the TopicCommand requires privileges to use ListPartitionReassignments or fails to describe the topics with the following error:
> Error while executing topic command : Cluster authorization failed.
This is quite a hard restriction, as most secure clusters do not authorize non-admin users to access ListPartitionReassignments.
This patch catches the `ClusterAuthorizationException` exception and gracefully falls back. We already do this when the API is not available, so it remains consistent.
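The change itself is in the Scala TopicCommand; a rough Java sketch of the same fallback pattern against the Admin client (method and class names here are illustrative):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.ClusterAuthorizationException;
import org.apache.kafka.common.errors.UnsupportedVersionException;

public class ReassignmentLookup {
    // Treat "not authorized" the same as "API not available": describe the topics
    // without any active-reassignment information instead of failing.
    static Map<TopicPartition, PartitionReassignment> activeReassignments(Admin admin, Set<TopicPartition> partitions) {
        try {
            return admin.listPartitionReassignments(partitions).reassignments().get();
        } catch (ExecutionException e) {
            if (e.getCause() instanceof ClusterAuthorizationException
                    || e.getCause() instanceof UnsupportedVersionException) {
                return Collections.emptyMap();
            }
            throw new RuntimeException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```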
Author: David Jacot <djacot@confluent.io>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #8947 from dajac/KAFKA-10212
A simple increase in the timeout of the consumer that verifies that records have been replicated seems to fix the integration tests in `MirrorConnectorsIntegrationTest` that have been failing more often recently.
Reviewers: Ryanne Dolan <ryannedolan@gmail.com>, Sanjana Kaundinya <skaundinya@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>, Konstantine Karantasis <konstantine@confluent.io>
Recently, commit 492306a updated both jetty to version 9.4.27.v20200227 and jersey to version 2.31.
However, in the latest versions of jetty the renaming of the method `Response#closeOutput` to `Response#completeOutput` has been reverted, with the latest version again using `Response#closeOutput`.
Jersey has not released a recent version in which `Response#closeOutput` is called directly. In its currently latest version (2.31) `Response#closeOutput` will be called if `Response#completeOutput` throws a `NoSuchMethodError` exception. Given that, this version combination is compatible. Jersey should be upgraded once a new version that uses `Response#closeOutput` directly is out.
Reviewers: Ismael Juma <ismael@juma.me.uk>
We inadvertently changed the binary schema of the suppress buffer changelog
in 2.4.0 without bumping the schema version number. As a result, it is impossible
to upgrade from 2.3.x to 2.4+ if you are using suppression.
* Refactor the schema compatibility test to use serialized data from older versions
as a more foolproof compatibility test.
* Refactor the upgrade system test to use the smoke test application so that we
actually exercise a significant portion of the Streams API during upgrade testing
* Add more recent versions to the upgrade system test matrix
* Fix the compatibility bug by bumping the schema version to 3
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Guozhang Wang <wangguoz@gmail.com>
This should address at least some of the excessive TaskCorruptedExceptions we've been seeing lately. Basically, at the moment we only commit tasks if commitNeeded is true -- this seems obvious by definition. But the problem is we do some essential cleanup in postCommit that should always be done before a task is closed:
* clear the PartitionGroup
* write the checkpoint
The second is actually fine to skip when commitNeeded = false with ALOS, as we will have already written a checkpoint during the last commit. But for EOS, we only write the checkpoint before a close -- so even if there is no new pending data since the last commit, we have to write the current offsets. If we don't, the task will be assumed dirty and we will run into our friend the TaskCorruptedException during (re)initialization.
To fix this, we should just always call prepareCommit and postCommit at the TaskManager level. Within the task, it can decide whether or not to actually do something in those methods based on commitNeeded.
One subtle issue is that we still need to avoid checkpointing a task that was still in CREATED, to avoid potentially overwriting an existing checkpoint with uninitialized empty offsets. Unfortunately we always suspend a task before closing and committing, so we lose the information about whether the task was in CREATED or RUNNING/RESTORING by the time we get to the checkpoint. For this we introduce a special flag to keep track of whether a suspended task should actually be checkpointed or not.
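A simplified sketch of the resulting control flow (illustrative only, not the actual TaskManager/StreamTask code):

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Illustrative sketch: the TaskManager always drives prepareCommit/postCommit,
// and the task itself decides (based on commitNeeded) whether anything must be done.
interface TaskSketch {
    boolean commitNeeded();
    Map<TopicPartition, OffsetAndMetadata> prepareCommit(); // empty if nothing to commit
    void postCommit();                                      // still writes the EOS checkpoint when required
}

class TaskManagerSketch {
    void commitTasks(final Collection<TaskSketch> tasks) {
        for (final TaskSketch task : tasks) {
            final Map<TopicPartition, OffsetAndMetadata> offsets = task.prepareCommit();
            if (!offsets.isEmpty()) {
                commitOffsets(offsets); // hypothetical helper: commit via consumer or producer txn
            }
            // postCommit clears the PartitionGroup and writes the checkpoint even when
            // commitNeeded was false, so an EOS task is not later treated as corrupted.
            task.postCommit();
        }
    }

    private void commitOffsets(final Map<TopicPartition, OffsetAndMetadata> offsets) {
        // commit the consumed offsets (details omitted)
    }
}
```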
Reviewers: Guozhang Wang <wangguoz@gmail.com>
The enum `State` is private, so it is fine to fix the typo without breaking compatibility.
Author: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #8932 from chia7712/MINOR-8932
I had to fix several compiler errors due to deprecation of auto application of `()`. A related
Xlint config (`-Xlint:nullary-override`) is no longer valid in 2.13, so we now only enable it
for 2.12. The compiler flagged two new inliner warnings that required suppression and
the semantics of `&` in `@nowarn` annotations changed, requiring a small change in
one of the warning suppressions.
I also removed the deprecation of a number of methods in `KafkaZkClient` as
they should not have been deprecated in the first place since `KafkaZkClient` is an
internal class and we still use these methods in the Controller and so on. This
became visible because the Scala compiler now respects Java's `@Deprecated`
annotation.
Finally, I included a few minor clean-ups (eg using `toBuffer` instead `toList`) when fixing
the compilation warnings.
Noteworthy bug fixes in Scala 2.13.3:
* Fix 2.13-only bug in Java collection converters that caused some operations to perform an extra pass
* Fix 2.13.2 performance regression in Vector: restore special cases for small operands in appendedAll and prependedAll
* Increase laziness of #:: for LazyList
* Fixes related to annotation parsing of @Deprecated from Java sources in mixed compilation
Full release notes:
https://github.com/scala/scala/releases/tag/v2.13.3
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
In order to let Kafka consumer and Streams applications migrate from the source to the target cluster
transparently and conveniently, e.g. in the event of a source cluster failure, a background job is proposed
to periodically sync the consumer offsets from the source to the target cluster, so that when the
consumer and Streams applications switch to the target cluster, they will resume consuming from
where they left off on the source cluster.
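Assuming the feature is enabled through new checkpoint-connector settings (the property names below are an assumption for illustration and may differ from what this patch introduces), a MirrorMaker 2 configuration might look like:

```properties
# connect-mirror-maker.properties (sketch)
clusters = source, target
source->target.enabled = true
# periodically translate and sync committed consumer group offsets to the target cluster
source->target.sync.group.offsets.enabled = true
source->target.sync.group.offsets.interval.seconds = 60
```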
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ryanne Dolan <ryannedolan@gmail.com>, Thiago Pinto, Srinivas Boga
Minor cleanup on streams internal classes, with diamond class removal and long function signature breakdown.
Reviewers: Boyang Chen <boyang@confluent.io>, John Roesler <john@confluent.io>
We just needed to add the check from StreamTask#closeClean to closeAndRecycleState as well. I also renamed closeAndRecycleState to closeCleanAndRecycleState to drive this point home: it needs to be clean.
This should be cherry-picked back to the 2.6 branch
Reviewers: Matthias J. Sax <matthias@confluent.io>, John Roesler <john@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
If there's any pending data and we haven't flushed the producer when we abort a transaction, a KafkaException is returned for the previous send. This is a bit misleading, since the situation is not an unrecoverable error and so the KafkaException is really non-fatal. For now, we should just catch and swallow this in the RecordCollector (see also: KAFKA-10169).
The reason we ended up aborting an un-flushed transaction was due to the combination of
a. always aborting the ongoing transaction when any task is closed/revoked
b. only committing (and flushing) if at least one of the revoked tasks needs to be committed (regardless of whether any non-revoked tasks have data/transaction in flight)
Given the above, we can end up with an ongoing transaction that isn't committed since none of the revoked tasks have any data in the transaction. We then abort the transaction anyway, when those tasks are closed. So in addition to the above (swallowing this exception), we should avoid unnecessarily aborting data for tasks that haven't been revoked.
We can handle this by splitting the RecordCollector's close into a dirty and clean flavor: if dirty, we need to abort the transaction since it may be dirty due to the commit attempt failing. But if clean, we can skip aborting the transaction since we know that either we just committed and thus there is no ongoing transaction to abort, or else the transaction in flight contains no data from the tasks being closed
Note that this means we still abort the transaction any time a task is closed dirty, so we must close/reinitialize any active task with pending data (that was aborted).
In sum:
* hackily check the KafkaException message and swallow
* only abort the transaction during a dirty close
* refactor shutdown to make sure we don't closeClean a task whose data was actually aborted
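A bare-bones sketch of that clean/dirty split (illustrative; not the actual RecordCollectorImpl):

```java
import org.apache.kafka.clients.producer.Producer;

// Illustrative sketch: only a dirty close aborts the ongoing transaction.
class RecordCollectorSketch {
    private final Producer<byte[], byte[]> producer;

    RecordCollectorSketch(final Producer<byte[], byte[]> producer) {
        this.producer = producer;
    }

    void closeClean() {
        // Either the last commit already ended the transaction, or the open
        // transaction holds no data from the tasks being closed, so nothing is aborted.
        producer.close();
    }

    void closeDirty() {
        // The commit attempt may have failed, so the transaction must be aborted
        // before closing; any task whose data was aborted must then be closed dirty too.
        producer.abortTransaction();
        producer.close();
    }
}
```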
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
ConsumerPerformance has not implemented the numThreadsOpt and numFetchersOpt options so far.
This patch adds a warning message when these options are used, per the comments on
https://issues.apache.org/jira/browse/KAFKA-10126. Once these two options are implemented,
this warning message should be removed.
Reviewers: Boyang Chen <boyang@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
Add unit tests for KafkaProducer.close(), KafkaProducer.abortTransaction(), and KafkaProducer.flush() in the KafkaProducerTest.
This increases KafkaProducer test code coverage from 82% of methods and 82% of lines to 86% of methods and 87% of lines.
Reviewers: Boyang Chen <boyang@confluent.io>
Looks like a typo: the actual key is supposed to be #replicaFetchWaitMaxTimeMs (replica.fetch.wait.max.ms), whereas the docs currently have #replicaMaxWaitTimeMs.
Author: sbellapu <sbellapu@visa.com>
Author: sbellapu <satishbabumsc@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>, Manikumar Reddy <manikumar.reddy@gmail.com>
Closes #8877 from satishbellapu/trunk
Ports the test from #8886 to trunk -- this should be merged to the 2.6 branch.
One open question: in 2.6 and trunk we rely on the active task to wipe out the store if it crashes. However, if there is a hard JVM crash and we never call closeDirty(), the store would not be wiped out. Thus, I am wondering whether we would need to fix this (for both active and standby tasks) and do a check on startup whether a local store must be wiped out?
The current test passes, as we do a proper cleanup after the exception is thrown.
Reviewers: Guozhang Wang <wangguoz@gmail.com>
Most builds don't require test coverage output, so it's wasteful
to spend cycles tracking coverage information for each method
invoked.
I ran a quick test on a fast desktop machine; the absolute
difference will be larger on a slower machine. The tests were
executed after `./gradlew clean` and with a gradle daemon
that was started just before the test (and mildly warmed up
by running `./gradlew clean` again).
`./gradlew unitTest --continue --profile`:
* With coverage enabled: 6m32s
* With coverage disabled: 5m47s
I ran the same test twice and the results were within 2s of
each other, so reasonably consistent.
A roughly 11% reduction (6m32s to 5m47s) in the time taken to run the
unit tests is a reasonable gain with little downside, so I think this is a
good change.
Reviewers: Manikumar Reddy <manikumar.reddy@gmail.com>
It is helpful to include as much information as possible when deleting log segments. This patch introduces log messages that give more specific details as to why the log segment was deleted and the specific metadata regarding that log segment.
Reviewers: Jason Gustafson <jason@confluent.io>
This patch fixes a bug in the constructor of `LogTruncationException`. We were passing the divergent offsets to the super constructor as the fetch offsets. There is no way to fix this without breaking compatibility, but the harm is probably minimal since this exception was not getting raised properly until KAFKA-9840 anyway.
Note that I have also moved the check for unknown offset and epoch into `SubscriptionState`, which ensures that the partition is still awaiting validation and that the fetch offset hasn't changed. Finally, I made some minor improvements to the logging and exception messages to ensure that we always have the fetch offset and epoch as well as the divergent offset and epoch included.
Reviewers: Boyang Chen <boyang@confluent.io>, David Arthur <mumrah@gmail.com>
Since the admin client allows us to use a flexible offset-spec, we can always use READ_UNCOMMITTED regardless of the EOS config.
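For illustration, a sketch of such a read-uncommitted end-offset lookup via the Admin client (class and method names are made up):

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsOptions;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.IsolationLevel;
import org.apache.kafka.common.TopicPartition;

public class EndOffsetsLookup {
    // Look up the log-end offset with READ_UNCOMMITTED, independent of the application's EOS config.
    static long endOffset(final Admin admin, final TopicPartition partition)
            throws ExecutionException, InterruptedException {
        final Map<TopicPartition, ListOffsetsResultInfo> result = admin
            .listOffsets(Collections.singletonMap(partition, OffsetSpec.latest()),
                         new ListOffsetsOptions(IsolationLevel.READ_UNCOMMITTED))
            .all()
            .get();
        return result.get(partition).offset();
    }
}
```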
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
## Background
When a partition subscription is initialized it has a `null` position and is in the INITIALIZING state. Depending on the consumer, it will then transition to one of the other states. Typically a consumer will either reset the offset to earliest/latest, or it will provide an offset (with or without offset metadata). For the reset case, we still have no position to act on so fetches should not occur.
Recently we made changes for KAFKA-9724 (#8376) to prevent clients from entering the AWAIT_VALIDATION state when targeting older brokers. New logic to bypass offset validation as part of this change exposed this new issue.
## Bug and Fix
In the partition subscriptions, the AWAIT_RESET state was incorrectly reporting that it had a position. In some cases a position might actually exist (e.g., if we were resetting offsets during a fetch after a truncation), but in the initialization case no position had been set. We saw this issue in system tests where there is a race between the offset reset completing and the first fetch request being issued.
Since AWAIT_RESET#hasPosition was incorrectly returning `true`, the new logic to bypass offset validation was transitioning the subscription to FETCHING (even though no position existed).
The fix is simply to have AWAIT_RESET#hasPosition return `false`, which should have been the case from the start. Additionally, this fix includes some guards against NPEs when reading the position from the subscription.
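A stripped-down sketch of the corrected behavior (illustrative; the real logic lives in the consumer's SubscriptionState fetch states):

```java
// Illustrative sketch: a partition that is awaiting an offset reset must not
// report a position, otherwise a fetch can be issued before the reset completes.
enum FetchStateSketch {
    INITIALIZING(false),
    AWAIT_RESET(false),      // previously (incorrectly) reported having a position
    AWAIT_VALIDATION(true),
    FETCHING(true);

    private final boolean hasPosition;

    FetchStateSketch(final boolean hasPosition) {
        this.hasPosition = hasPosition;
    }

    boolean hasPosition() {
        return hasPosition;
    }
}
```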
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
Upgrade jetty to 9.4.27.v20200227 and jersey to 2.31
Also remove the workaround used on previous versions from Connect's SSLUtils.
(Reverts KAFKA-9771 - commit ee832d7d)
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chris Egerton <chrise@confluent.io>, Konstantine Karantasis <konstantine@confluent.io>
Reduce test data set from 1000 records to 500.
Some recent test failures indicate that the Jenkins runners aren't
able to process all 1000 records in two minutes.
Also add a sanity check that all the test data are readable from the
input topic.
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>
* Remove problematic Percentiles measurements until the implementation is fixed
* Fix leaking e2e metrics when task is closed
* Fix leaking metrics when tasks are recycled
Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>
* KAFKA-10150: always transition to SUSPENDED during suspend, no matter the current state; only call prepareCommit before closing if task.commitNeeded is true
* Don't commit any consumed offsets during handleAssignment -- revoked active tasks (and any others that need committing) will be committed during handleRevocation so we only need to worry about cleaning them up in handleAssignment
* KAFKA-10152: when recycling a task we should always commit consumed offsets (if any), but don't need to write the checkpoint (since changelog offsets are preserved across task transitions)
* Make sure we close all tasks during shutdown, even if an exception is thrown during commit
Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>