We should catch `InvalidTopicException` and not just
`NoOffsetForPartitionException`. Also, we need to step through
all partitions that might be affected and reset those.
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Bill Bejeck <bbejeck@gmail.com>, Eno Thereska <eno@confluent.io>, Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2747 from mjsax/minor-fix-reset
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Closes #2779 from cmccabe/KAFKA-4993
There should only be a single `KafkaStreams.StreamStateListener` to
ensure synchronization of operations on
`KafkaStreams.StreamStateListener#threadState`.
Author: Armin Braun <me@obrown.io>
Reviewers: Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2801 from original-brownbear/fix-stream-state-listener
Author: Ewen Cheslack-Postava <me@ewencp.org>
Reviewers: Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2660 from ewencp/minor-make-configdef-safer
This fixes:
```
java.lang.AssertionError: expected:<2> but was:<3>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at org.apache.kafka.streams.processor.internals.StateDirectoryTest.shouldCleanUpTaskStateDirectoriesThatAreNotCurrentlyLocked(StateDirectoryTest.java:145)
```
While running the test in an infinite loop, I hit other problems:
- fixed file management (release all locks and close everything)
- increased sleep time for `shouldCleanupStateDirectoriesWhenLastModifiedIsLessThanNowMinusCleanupDelay` too (was flaky as well)
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Eno Thereska <eno@confluent.io>, Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2781 from mjsax/minor-fix-stateDirectoryTest
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Eno Thereska <eno@confluent.io>, Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2777 from mjsax/hotfix-window-serdes-trunk
Several fixes for handling broker failures:
- default replication value for internal topics is now 3 in the test itself (not in streams code; that will require a KIP).
- streams producer waits for acks from all replicas in the test itself (not in streams code; that will require a KIP).
- backoff time for streams client to try again after a failure to contact controller.
- fix bug related to state store locks (this helps in multi-threaded scenarios)
- fix related to catching exceptions properly for network errors.
- system test for all the above
Author: Eno Thereska <eno@confluent.io>
Author: Eno Thereska <eno.thereska@gmail.com>
Reviewers: Matthias J. Sax <matthias@confluent.io>, Damian Guy <damian.guy@gmail.com>, Guozhang Wang <wangguoz@gmail.com>, Dan Norwood <norwood@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Ewen Cheslack-Postava <ewen@confluent.io>
Closes #2719 from enothereska/KAFKA-4916-broker-bounce-test
https://issues.apache.org/jira/browse/KAFKA-4810
> Currently SchemaBuilder is strict when checking that certain fields have not been set yet (e.g. version, name, doc). It just checks that the field is null. This is intended to protect the user from buggy code that overwrites a field with different values, but it's a bit too strict currently. In generic code for converting schemas (e.g. Converters) you will sometimes initialize a builder with these values (e.g. because you get a SchemaBuilder for a logical type, which sets name & version), but then have generic code for setting name & version from the source schema.
Changed the validation method to not only check if a field is null but also to check if the new value that is being set is the same as the current value of the field.
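The relaxed check described above could look roughly like the sketch below. `FieldGuard` and `checkCanSet` are hypothetical names used only for illustration, and the exception type is an assumption; this is not the actual `SchemaBuilder` code.

```java
import java.util.Objects;

// Hypothetical stand-in for the builder's validation; names are illustrative only.
final class FieldGuard {
    private FieldGuard() {}

    // Allow the assignment when the field is still unset, or when the
    // proposed value equals the value that is already stored.
    static void checkCanSet(String fieldName, Object current, Object proposed) {
        if (current != null && !Objects.equals(current, proposed)) {
            throw new IllegalStateException(
                fieldName + " has already been set to a different value");
        }
    }
}
```

With this shape, setting `name` twice to the same value passes, while overwriting it with a different value is still rejected.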
ewencp
Author: Vitaly Pushkar <vitaly.pushkar@gmail.com>
Reviewers: Ewen Cheslack-Postava <ewen@confluent.io>
Closes #2806 from vitaly-pushkar/KAFKA-4810-schema-builder-default-fields-validation
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Konstantine Karantasis <konstantine@confluent.io>, Ewen Cheslack-Postava <ewen@confluent.io>
Closes #2763 from cmccabe/KAFKA-4977
Though MirrorMaker uses the `producer.type` value of the
producer properties, ProducerConfig shows the warning:
`The configuration 'producer.type' was supplied but
isn't a known config.`
Author: Shun Takebayashi <shun@takebayashi.asia>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2676 from takebayashi/suppress-mirrormaker-warning
Of particular importance are compression buffers (64 KB for LZ4, for example).
Author: Apurva Mehta <apurva@confluent.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2796 from apurvam/idempotent-producer-close-data-stream
Addresses https://issues.apache.org/jira/browse/KAFKA-4878
* Adjusted the error message to explicitly state errors and their number
* DRYed up the logic for generating the message, sharing it between standalone and distributed mode
Example: with two config keys misspelled (`namse`, `fisle`) in the file source config:
```
namse=local-file-source
connector.class=FileStreamSource
tasks.max=1
fisle=test.txt
topic=connect-test
```
Produces:
```
[2017-03-22 08:57:11,896] ERROR Stopping after connector error (org.apache.kafka.connect.cli.ConnectStandalone:99)
java.util.concurrent.ExecutionException: org.apache.kafka.connect.runtime.rest.errors.BadRequestException: Connector configuration is invalid and contains the following 2 error(s):
Missing required configuration "file" which has no default value.
Missing required configuration "name" which has no default value.
You can also find the above list of errors at the endpoint `/{connectorType}/config/validate`
```
Author: Armin Braun <me@obrown.io>
Reviewers: Gwen Shapira, Konstantine Karantasis, Ewen Cheslack-Postava
Closes #2722 from original-brownbear/KAFKA-4878
Fixes:
```
java.nio.file.NoSuchFileException: /tmp/test7863510415433793941/topic2-Canonized/topic2-Canonized-197001010000/000015.sst
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:97)
at java.nio.file.Files.readAttributes(Files.java:1686)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:105)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:199)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:199)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:199)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:69)
at java.nio.file.Files.walkFileTree(Files.java:2602)
at java.nio.file.Files.walkFileTree(Files.java:2635)
at org.apache.kafka.common.utils.Utils.delete(Utils.java:555)
at org.apache.kafka.streams.kstream.internals.KStreamWindowAggregateTest.testJoin(KStreamWindowAggregateTest.java:320)
```
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Eno Thereska <eno@confluent.io>, Damian Guy <damian.guy@gmail.com>, Jun Rao <junrao@gmail.com>
Closes #2778 from mjsax/minor-fix-kstreamWindowAggregateTest
The bug meant that the base offset was the same as the batch size instead of 0 so the broker would always recompress batches.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jun Rao <junrao@gmail.com>
Closes #2794 from ijuma/fix-records-builder-construction
Fixes deadlock scenario found during local test run:
The main thread was waiting for the coordinator lock.
The thread performing close() was holding the
coordinator lock and polling to find coordinator.
The test expected close() to time out, but to time
out, the main thread had to update the time, which it
couldn't since it was waiting for the lock. This fix
avoids using coordinator in the main thread during
the close task.
Author: Rajini Sivaram <rajinisivaram@googlemail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2792 from rajinisivaram/MINOR-closetest-deadlock
This may be a reason why we see Jenkins jobs time out at times.
I can reproduce it locally.
With current trunk there is a possibility to run into this:
```sh
"kafka-streams-close-thread" #585 daemon prio=5 os_prio=0 tid=0x00007f66d052d800 nid=0x7e02 waiting for monitor entry [0x00007f66ae2e5000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.kafka.streams.processor.internals.StreamThread.close(StreamThread.java:345)
- waiting to lock <0x000000077d33c538> (a org.apache.kafka.streams.processor.internals.StreamThread)
at org.apache.kafka.streams.KafkaStreams$1.run(KafkaStreams.java:474)
at java.lang.Thread.run(Thread.java:745)
"appId-bd262a91-5155-4a35-bc46-c6432552c2c5-StreamThread-97" #583 prio=5 os_prio=0 tid=0x00007f66d052f000 nid=0x7e01 waiting for monitor entry [0x00007f66ae4e6000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.kafka.streams.KafkaStreams.setState(KafkaStreams.java:219)
- waiting to lock <0x000000077d335760> (a org.apache.kafka.streams.KafkaStreams)
at org.apache.kafka.streams.KafkaStreams.access$100(KafkaStreams.java:117)
at org.apache.kafka.streams.KafkaStreams$StreamStateListener.onChange(KafkaStreams.java:259)
- locked <0x000000077d42f138> (a org.apache.kafka.streams.KafkaStreams$StreamStateListener)
at org.apache.kafka.streams.processor.internals.StreamThread.setState(StreamThread.java:168)
- locked <0x000000077d33c538> (a org.apache.kafka.streams.processor.internals.StreamThread)
at org.apache.kafka.streams.processor.internals.StreamThread.setStateWhenNotInPendingShutdown(StreamThread.java:176)
- locked <0x000000077d33c538> (a org.apache.kafka.streams.processor.internals.StreamThread)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1600(StreamThread.java:70)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsRevoked(StreamThread.java:1321)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinPrepare(ConsumerCoordinator.java:406)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:349)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:296)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1037)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1002)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:531)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:669)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:326)
```
In a nutshell: `KafkaStreams` and `StreamThread` are both
waiting for each other since another intermittent `close`
(e.g. from a test) comes along, also trying to lock on
`KafkaStreams`:
```sh
"main" #1 prio=5 os_prio=0 tid=0x00007f66d000c800 nid=0x78bb in Object.wait() [0x00007f66d7a15000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1249)
- locked <0x000000077d45a590> (a java.lang.Thread)
at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:503)
- locked <0x000000077d335760> (a org.apache.kafka.streams.KafkaStreams)
at org.apache.kafka.streams.KafkaStreams.close(KafkaStreams.java:447)
at org.apache.kafka.streams.KafkaStreamsTest.testCannotStartOnceClosed(KafkaStreamsTest.java:115)
```
=> causing a deadlock.
Fixed this by softer locking on the state change, that guarantees
atomic changes to the state but does not lock on the whole object
(I at least could not find another method that would require more
than atomically-locked access except for `setState`).
Also qualified the state listeners with their outer-class to make
the whole code-flow around this more readable (having two
interfaces with the same naming for interface and method and then
using them between their two outer classes is crazy hard to read
imo :)).
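As a rough illustration of the softer locking on the state change, here is a minimal sketch. `StateHolder`, its `State` enum and the worker thread are hypothetical and much simpler than the real `KafkaStreams` class; the only point is that the state field gets its own lock, so a thread holding the object monitor does not block state changes.

```java
// Hypothetical illustration: guard only the state field with a private lock
// instead of synchronizing on the whole object.
final class StateHolder {
    enum State { CREATED, RUNNING, PENDING_SHUTDOWN, NOT_RUNNING }

    private final Object stateLock = new Object();
    private State state = State.CREATED;

    // Atomic state transition; does not take the monitor of `this`.
    State setState(State newState) {
        synchronized (stateLock) {
            State old = state;
            state = newState;
            return old;
        }
    }

    State state() {
        synchronized (stateLock) {
            return state;
        }
    }

    // Holds the object monitor while joining a worker that changes the state.
    // If setState() were synchronized on `this`, this join would never return.
    synchronized void close() {
        Thread worker = new Thread(() -> setState(State.NOT_RUNNING));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```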
Easy to reproduce yourself by running
`org.apache.kafka.streams.KafkaStreamsTest` in a loop for a bit
(save yourself some time by running 2-4 in parallel :)). Eventually
it will lock on one of the tests (for me this takes less than 1 min
with 4 parallel runs).
Author: Armin Braun <me@obrown.io>
Author: Armin <me@obrown.io>
Reviewers: Eno Thereska <eno@confluent.io>, Damian Guy <damian.guy@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2791 from original-brownbear/fix-streams-deadlock
This is from the KIP-98 proposal.
The main points of discussion surround the correctness logic, particularly the Log class where incoming entries are validated and duplicates are dropped, and also the producer error handling to ensure that the semantics are sound from the user's point of view.
There is some subtlety in the idempotent producer semantics. This patch only guarantees idempotent production up to the point where an error has to be returned to the user. Once we hit such a non-recoverable error, we can no longer guarantee message ordering or idempotence without additional logic at the application level.
In particular, if an application wants guaranteed message order without duplicates, then it needs to do the following in the error callback:
1. Close the producer so that no queued batches are sent. This is important for guaranteeing ordering.
2. Read the tail of the log to inspect the last message committed. This is important for avoiding duplicates.
Author: Apurva Mehta <apurva@confluent.io>
Author: hachikuji <jason@confluent.io>
Author: Apurva Mehta <apurva.1618@gmail.com>
Author: Guozhang Wang <wangguoz@gmail.com>
Author: fpj <fpj@apache.org>
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes #2735 from apurvam/exactly-once-idempotent-producer
See the JIRA for the full details. Essentially the test assertions depend on receiving reliable events from the consumer processes, but this is not generally possible in the presence of a hard failure (i.e. `kill -9`). Until we solve this problem, the hard failure scenarios will be turned off.
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Apurva Mehta <apurva@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Closes #2771 from hachikuji/KAFKA-4689
Fix for adding state stores with regex defined sources
Author: bbejeck <bbejeck@gmail.com>
Reviewers: Matthias J. Sax, Damian Guy, Guozhang Wang
Closes #2618 from bbejeck/KAFKA-4791_unable_to_add_statestore_regex_topics
Author: Jason Gustafson <jason@confluent.io>
Author: Ismael Juma <github@juma.me.uk>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2762 from hachikuji/ensure-decompression-stream-closed
We got the test error `org.apache.kafka.common.errors.TopicExistsException: Topic 'inputTopic' already exists.` in some builds. It can be reproduced reliably on a local machine. The root cause is the asynchronous topic deletion, which might not have finished before the topic gets re-created.
Author: Matthias J. Sax <matthias@confluent.io>
Reviewers: Ismael Juma, Damian Guy, Guozhang Wang
Closes #2757 from mjsax/minor-fix-resetintegrationtest
Note: None of the use cases for offset fetch would lead to a `TOPIC_AUTHORIZATION_FAILED` error (fetching offset of an unauthorized partition would return an `UNKNOWN_TOPIC_OR_PARTITION` error). That is why it is being removed from the `PARTITION_ERRORS` list.
Author: Vahid Hashemian <vahidhashemian@us.ibm.com>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes #2653 from vahidhashemian/minor/update_possible_errors_in_offset_fetch_response
Author: Dong Lin <lindong28@gmail.com>
Reviewers: Jiangjie Qin <becket.qin@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2760 from lindong28/KAFKA-4973
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Jun Rao <junrao@gmail.com>, Apurva Mehta <apurva@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Closes #2691 from cmccabe/KAFKA-4902
This is a minor change but it helps to improve the log readability.
Author: Kamal C <kamal.chandraprakash@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2709 from Kamal15/util
The maxBytes field should be set to DEFAULT_RESPONSE_MAX_BYTES,
the same way as the constructor using the Struct does.
Co-developed with mimaison.
Author: Edoardo Comar <ecomar@uk.ibm.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>
Closes #2694 from edoardocomar/MINOR-FetchRequest
- Improves streams efficiency by more than 200K requests/second (small 100 byte requests)
- Gets streams efficiency very close to pure consumer (see results in https://jenkins.confluent.io/job/system-test-kafka-branch-builder/746/console)
- Maintains same fairness across tasks
- Schedules all records in the queue in-between poll() calls, not just one per task.
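One way to read the last two points is a round-robin pass over the task queues that keeps going until every queue is empty. The sketch below is an assumption about the general idea only; `RoundRobinScheduler` and `drain` are hypothetical names and do not reflect the actual `StreamThread` code.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of fair, exhaustive scheduling between two poll() calls.
final class RoundRobinScheduler {
    private RoundRobinScheduler() {}

    // Drains every queue, taking one record per task per pass so that
    // no task is starved while all buffered records still get processed.
    static List<String> drain(List<Deque<String>> taskQueues) {
        List<String> processed = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Deque<String> queue : taskQueues) {
                String record = queue.poll(); // null when this task has nothing left
                if (record != null) {
                    processed.add(record);
                    progressed = true;
                }
            }
        }
        return processed;
    }
}
```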
Author: Eno Thereska <eno@confluent.io>
Author: Eno Thereska <eno.thereska@gmail.com>
Reviewers: Damian Guy, Matthias J. Sax, Guozhang Wang
Closes #2643 from enothereska/minor-schedule-round-robin
I manually tested that Crc32CTest and AbstractChecksums pass with JDK 9. I also verified that `Java9ChecksumFactory` is used in that case.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes #2739 from ijuma/kafka-1449-crc32c
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Michael G. Noll <michael@confluent.io>, Ismael Juma <ismael@juma.me.uk>
Closes #2727 from cmccabe/KAFKA-4944
The transient failures make it harder to spot real failures and we can live without what is being tested (adding security to ZK via a rolling upgrade) until KIP-101 lands.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Apurva Mehta <apurva@confluent.io>, Jun Rao <junrao@gmail.com>
Closes #2742 from ijuma/disable-zk-upgrade-test
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Apurva Mehta <apurva@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
Closes #2738 from hachikuji/streaming-compressed-iterator
This brought down a cluster by causing continuous controller moves.
ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects that aren't thread-safe:
* Selector
* NetworkClient
* SSLEngine (this was the big one for us. We turn on SSL for interbroker communication).
As per the "Concurrency Notes" section from https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html:
> two threads must not attempt to call the same method (either wrap() or unwrap()) concurrently
SSLEngine.wrap gets called in:
* SslTransportLayer.write
* SslTransportLayer.handshake
* SslTransportLayer.close
It turns out that the ZkEventThread and RequestSendThread can concurrently call SSLEngine.wrap:
* ZkEventThread calls SslTransportLayer.close from ControllerChannelManager.removeExistingBroker
* RequestSendThread can call SslTransportLayer.write or SslTransportLayer.handshake from NetworkClient.poll
Suppose the controller moves for whatever reason. The former controller could have had a RequestSendThread that
was in the middle of sending out messages to the cluster while the ZkEventThread began executing
KafkaController.onControllerResignation, which calls ControllerChannelManager.shutdown, which sequentially cleans
up the controller-to-broker queue and connection for every broker in the cluster. This cleanup includes the call
to ControllerChannelManager.removeExistingBroker as mentioned earlier, causing the concurrent call to SSLEngine.wrap.
This concurrent call throws a BufferOverflowException which ControllerChannelManager.removeExistingBroker catches so
the ControllerChannelManager.shutdown moves onto cleaning up the next controller-to-broker queue and connection,
skipping the cleanup steps such as clearing the queue, stopping the RequestSendThread, and removing the entry from its
brokerStateInfo map.
By failing out of the Selector.close, the sensors corresponding to the broker connection have not been cleaned up. Any
later attempt at initializing an identical Selector will result in a sensor collision and therefore cause Selector
initialization to throw an exception. In other words, any later attempts by this broker to become controller again
will fail on initialization. When controller initialization fails, the controller deletes the /controller znode and
lets another broker take over.
Now suppose the controller moves enough times such that every broker hits the BufferOverflowException concurrency
issue. We're now guaranteed to fail controller initialization due to the sensor collision on every controller
transition, so the controller will move across brokers continuously.
This patch avoids the concurrent use of non-threadsafe classes in ControllerChannelManager.removeExistingBroker
by shutting down the RequestSendThread before closing the NetworkClient.
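The ordering of the fix can be sketched as follows. `BrokerChannel` is a hypothetical stand-in written in Java (the controller code itself is not shown here), and the flags only simulate a client that must not be used after it is closed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: stop the sender thread first, then close the client it was using.
final class BrokerChannel {
    private final AtomicBoolean clientClosed = new AtomicBoolean(false);
    private final AtomicBoolean usedAfterClose = new AtomicBoolean(false);
    private volatile boolean running = true;

    private final Thread sendThread = new Thread(() -> {
        while (running) {
            // Stand-in for polling the network client; this must never overlap with close.
            if (clientClosed.get()) {
                usedAfterClose.set(true);
            }
            Thread.yield();
        }
    });

    void start() {
        sendThread.start();
    }

    void shutdown() {
        running = false;            // 1. ask the sender thread to stop
        try {
            sendThread.join();      // 2. wait until it has actually exited
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        clientClosed.set(true);     // 3. only now close the non-thread-safe client
    }

    boolean usedAfterClose() {
        return usedAfterClose.get();
    }
}
```

Because the join completes before the client is marked closed, the sender can never observe a closed client.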
Author: Onur Karaman <okaraman@linkedin.com>
Reviewers: Joel Koshy <jjkoshy.w@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2746 from onurkaraman/KAFKA-4959
Author: Dong Lin <lindong28@gmail.com>
Reviewers: Jun Rao <junrao@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Jiangjie Qin <becket.qin@gmail.com>
Closes #2476 from lindong28/KAFKA-4586
Author: Armin Braun <me@obrown.io>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Jason Gustafson <jason@confluent.io>
Closes #2699 from original-brownbear/KAFKA-4569
1. Use Initialization-on-demand holder idiom that relies on JVM lazy-loading instead of explicit initialization check.
2. Method handles were designed to be faster than Core Reflection, particularly if the method handle can be stored in a static final field (the JVM can then optimise the call as if it was a regular method call). Since the code is of similar complexity (and simpler if we consider the whole PR), I am treating this as a clean-up instead of a performance improvement (which would require doing benchmarks).
3. Remove unused `ByteBufferReceive`.
4. I removed the snappy library from the classpath and verified that `CompressionTypeTest` (which uses LZ4) still passes. This shows that the right level of laziness is achieved even if we use one of the lazily loaded compression algorithms.
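Points 1 and 2 combine naturally, as in the sketch below. `LazyCodec` is a hypothetical name, and `java.lang.StringBuilder` merely stands in for an optional compression stream class so that the example runs without extra libraries.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch: lazy lookup of a constructor via the holder idiom.
final class LazyCodec {
    private LazyCodec() {}

    // The JVM initializes Holder only on first access to HANDLE, so the
    // class lookup is deferred until an instance is actually requested.
    private static final class Holder {
        static final MethodHandle HANDLE = lookup();

        private static MethodHandle lookup() {
            try {
                // Stand-in for a class that may be missing from the classpath.
                Class<?> clazz = Class.forName("java.lang.StringBuilder");
                return MethodHandles.publicLookup()
                    .findConstructor(clazz, MethodType.methodType(void.class, String.class));
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }
    }

    static Object newInstance(String arg) {
        try {
            return Holder.HANDLE.invoke(arg);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Keeping the handle in a `static final` field is what gives the JVM the chance to optimise the call, as noted in point 2.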
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Jason Gustafson <jason@confluent.io>
Closes #2740 from ijuma/use-method-handles-for-compressed-stream-supplier
Thanks to Dong Lin for finding the lastStableOffset issue.
Author: Ismael Juma <ismael@juma.me.uk>
Reviewers: Dong Lin <lindong28@gmail.com>, Jason Gustafson <jason@confluent.io>
Closes #2737 from ijuma/fix-fetch-response-lso
Author: Jason Gustafson <jason@confluent.io>
Reviewers: Jun Rao <junrao@gmail.com>, Apurva Mehta <apurva@confluent.io>, Guozhang Wang <wangguoz@gmail.com>, Ismael Juma <ismael@juma.me.uk>
Closes #2614 from hachikuji/exactly-once-message-format