Hadoop to Kafka Bridge
======================
What's new?
-----------
* Kafka 0.8 support
* No more ZK-based load balancing (backwards incompatible change)
* Semantic partitioning is now supported in KafkaOutputFormat. Just specify a
  key when writing output from your job. The Pig StoreFunc doesn't support
  semantic partitioning.
* Config parameters are now the same as the Kafka producer's, just prepended with
  kafka.output (e.g., kafka.output.max.message.size). This is a backwards
  incompatible change.
What is it?
-----------
The Hadoop to Kafka bridge is a way to publish data from Hadoop to Kafka. There
are two mechanisms, in increasing order of difficulty: writing a Pig script that
emits messages in Avro format, or rolling your own job using the Kafka
`OutputFormat`.
Note that there are no write-once semantics: any client of the data must handle
messages in an idempotent manner. That is, because of node failures and
Hadoop's failure recovery, it's possible that the same message is published
multiple times in the same push.
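As an illustration, here is a minimal consumer-side sketch of such idempotent
handling. The handler class and the application-level message ID are hypothetical
and not part of the bridge; the point is only that duplicates from a retried push
should be detected and skipped.

    import java.util.HashSet;
    import java.util.Set;

    // Minimal sketch: a failed task attempt can be retried, so the same message
    // may be published more than once and consumers should deduplicate.
    // The application-level message ID used here is a hypothetical example.
    public class DedupingHandler {
        private final Set<String> seen = new HashSet<String>();

        /** Processes a payload only the first time its ID is observed. */
        public void handle(String messageId, byte[] payload) {
            if (!seen.add(messageId)) {
                return; // duplicate from a retried push; skip
            }
            process(payload);
        }

        private void process(byte[] payload) {
            // application-specific handling goes here
        }
    }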
How do I use it?
----------------
With this bridge, Kafka topics are URIs and are specified as URIs of the form
`kafka://<kafka-server>/<kafka-topic>` to connect to a specific Kafka broker.
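For example, `kafka://my-kafka:9092/page_views` would publish to the `page_views`
topic on the broker listening at `my-kafka:9092` (both names here are placeholders).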
### Pig ###
The Pig bridge writes data in binary Avro format, creating one message per input
row. To push data to Kafka, store to the Kafka URI using `AvroKafkaStorage` with
the Avro schema as its first argument. You'll need to register the appropriate
Kafka JARs. Here is what an example Pig script looks like:
    REGISTER hadoop-producer_2.8.0-0.8.0.jar;
    REGISTER avro-1.4.0.jar;
    REGISTER piggybank.jar;
    REGISTER kafka-0.8.0.jar;
    REGISTER jackson-core-asl-1.5.5.jar;
    REGISTER jackson-mapper-asl-1.5.5.jar;
    REGISTER scala-library.jar;

    member_info = LOAD 'member_info.tsv' AS (member_id : int, name : chararray);
    names = FOREACH member_info GENERATE name;
    STORE names INTO 'kafka://my-kafka:9092/member_info' USING kafka.bridge.AvroKafkaStorage('"string"');
That's it! The Pig StoreFunc makes use of AvroStorage in Piggybank to convert
from Pig's data model to the specified Avro schema.
Further, multi-store is possible with AvroKafkaStorage, so you can easily write to
multiple topics and brokers in the same job:
    SPLIT member_info INTO early_adopters IF member_id < 1000, others IF member_id >= 1000;
    STORE early_adopters INTO 'kafka://my-broker:9092/early_adopters' USING AvroKafkaStorage('$schema');
    STORE others INTO 'kafka://my-broker2:9092/others' USING AvroKafkaStorage('$schema');
### KafkaOutputFormat ###
KafkaOutputFormat is a Hadoop OutputFormat for publishing data via Kafka. It
uses the newer 0.20 mapreduce APIs and simply pushes bytes (i.e.,
BytesWritable). This is a lower-level method of publishing data, as it allows
you to precisely control output.
Included is an example that publishes some input text line-by-line to a topic.
With KafkaOutputFormat, the key can be null, in which case it is ignored by the
producer (random partitioning), or it can be any object used for semantic
partitioning of the stream (with an appropriate Kafka partitioner set).
Speculative execution is turned off by the OutputFormat.
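For reference, here is a rough sketch of a map-only job wired up this way. The
output format class name (`kafka.bridge.hadoop.KafkaOutputFormat`), its
`setOutputPath` helper, and the input handling are assumptions for illustration
only; check the bundled example job for the exact API. The URI form and the
`kafka.output` settings are the ones described in this README.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Sketch of a map-only job that publishes each input line to a Kafka topic.
    public class TextToKafka {

        // Emits a null key (ignored by the producer, i.e., random partitioning)
        // and the raw line bytes as the value.
        public static class LineMapper
                extends Mapper<LongWritable, Text, Object, BytesWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                context.write(null, new BytesWritable(line.toString().getBytes("UTF-8")));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Any producer setting can be passed with the kafka.output prefix.
            conf.set("kafka.output.compression.codec", "0"); // no compression

            Job job = new Job(conf, "text-to-kafka");
            job.setJarByClass(TextToKafka.class);
            job.setMapperClass(LineMapper.class);
            job.setNumReduceTasks(0); // map-only publish
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // Assumed class name and helper; the topic is addressed with the
            // kafka://<broker>/<topic> URI form from this README.
            job.setOutputFormatClass(kafka.bridge.hadoop.KafkaOutputFormat.class);
            kafka.bridge.hadoop.KafkaOutputFormat.setOutputPath(
                    job, new Path("kafka://my-kafka:9092/lines"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }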
What can I tune?
----------------
* kafka.output.queue.bytes: Bytes to queue in memory before pushing to the Kafka
  producer (i.e., the batch size). Default is 1,000,000 (1 million) bytes.
Any of Kafka's producer parameters can be changed by prefixing them with
"kafka.output" in one's job configuration. For example, to change the
compression codec, one would add the "kafka.output.compression.codec" parameter
(e.g., "SET kafka.output.compression.codec 0" in one's Pig script for no
compression).
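In a plain MapReduce job the same settings go through the job's `Configuration`.
A small sketch (the producer's `request.required.acks` is just one example of a
parameter picked up via the prefix):

    import org.apache.hadoop.conf.Configuration;

    // Sketch: Kafka producer settings are handed to the bridge by prefixing
    // them with "kafka.output" in the job configuration.
    public class KafkaOutputSettings {
        public static Configuration withKafkaSettings() {
            Configuration conf = new Configuration();
            conf.set("kafka.output.compression.codec", "0");      // no compression
            conf.set("kafka.output.queue.bytes", "1000000");      // in-memory batch size (the default)
            conf.set("kafka.output.request.required.acks", "1");  // producer's request.required.acks
            return conf;
        }
    }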
For easier debugging, the above values as well as the Kafka broker information
(kafka.metadata.broker.list), the topic (kafka.output.topic), and the schema
(kafka.output.schema) are injected into the job's configuration. By default,
the Hadoop producer uses Kafka's sync producer, since asynchronous operation
doesn't make sense in the batch Hadoop case.