Hadoop to Kafka Bridge
======================

What's new?
-----------

* Kafka 0.8 support
* No more ZK-based load balancing (a backwards-incompatible change)
* Semantic partitioning is now supported in KafkaOutputFormat. Just specify a
  key in the output of your job. The Pig StoreFunc doesn't support semantic
  partitioning.
* Config parameters are now the same as the Kafka producer's, just prefixed
  with kafka.output (e.g., kafka.output.max.message.size). This is a
  backwards-incompatible change.

What is it?
-----------

The Hadoop to Kafka bridge is a way to publish data from Hadoop to Kafka. There
are two mechanisms, ranging from easy to more involved: writing a Pig script
that writes messages in Avro format, or rolling your own job using the Kafka
`OutputFormat`.

Note that there are no exactly-once semantics: any client of the data must
handle messages in an idempotent manner. Because of node failures and Hadoop's
failure recovery, the same message may be published multiple times in the same
push.

How do I use it?
----------------

With this bridge, Kafka topics are specified as URIs of the form
`kafka://<kafka-server>/<kafka-topic>`, which identify a specific Kafka broker
and topic.

### Pig ###

The Pig bridge writes data in binary Avro format, creating one message per
input row. To push data to Kafka, store into the Kafka URI using
`AvroKafkaStorage` with the Avro schema as its first argument. You'll need to
register the appropriate Kafka JARs. Here is what an example Pig script looks
like:

    REGISTER hadoop-producer_2.8.0-0.8.0.jar;
    REGISTER avro-1.4.0.jar;
    REGISTER piggybank.jar;
    REGISTER kafka-0.8.0.jar;
    REGISTER jackson-core-asl-1.5.5.jar;
    REGISTER jackson-mapper-asl-1.5.5.jar;
    REGISTER scala-library.jar;

    member_info = LOAD 'member_info.tsv' AS (member_id : int, name : chararray);
    names = FOREACH member_info GENERATE name;
    STORE names INTO 'kafka://my-kafka:9092/member_info' USING kafka.bridge.AvroKafkaStorage('"string"');

That's it! The Pig StoreFunc uses AvroStorage from Piggybank to convert from
Pig's data model to the specified Avro schema.

Further, multi-store is possible with AvroKafkaStorage, so you can easily write
to multiple topics and brokers in the same job:

    SPLIT member_info INTO early_adopters IF member_id < 1000, others IF member_id >= 1000;
    STORE early_adopters INTO 'kafka://my-broker:9092/early_adopters' USING AvroKafkaStorage('$schema');
    STORE others INTO 'kafka://my-broker2:9092/others' USING AvroKafkaStorage('$schema');

### KafkaOutputFormat ###

KafkaOutputFormat is a Hadoop OutputFormat for publishing data via Kafka. It
uses the newer 0.20 mapreduce APIs and simply pushes bytes (i.e.,
BytesWritable). This is a lower-level way to publish data that gives you
precise control over the output.

Included is an example that publishes some input text line-by-line to a topic.
With KafkaOutputFormat, the key can be null, in which case it is ignored by the
producer (random partitioning), or any object used for semantic partitioning of
the stream (with an appropriate Kafka partitioner set). Speculative execution
is turned off by the OutputFormat.
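
To make the shape of such a job concrete, here is a minimal map-only sketch
that publishes each input line to a topic. The class names
(`TextPublisherSketch`, `LineMapper`) are invented for this sketch,
`NullWritable` is used as a stand-in for "no key" (random partitioning), and it
assumes the contrib's OutputFormat lives at `kafka.bridge.hadoop.KafkaOutputFormat`
and exposes a `setOutputPath(Job, Path)` helper for the `kafka://` URI; check
the bundled example's source for the exact API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    import kafka.bridge.hadoop.KafkaOutputFormat;

    public class TextPublisherSketch {

        // Map-only task: each input line becomes one Kafka message (the BytesWritable payload).
        public static class LineMapper
                extends Mapper<LongWritable, Text, NullWritable, BytesWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws java.io.IOException, InterruptedException {
                // No meaningful key here, so the producer partitions randomly (see above).
                context.write(NullWritable.get(), new BytesWritable(line.toString().getBytes("UTF-8")));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "text-to-kafka");   // 0.20-style mapreduce API
            job.setJarByClass(TextPublisherSketch.class);
            job.setMapperClass(LineMapper.class);
            job.setNumReduceTasks(0);                   // publish from the map phase only
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(KafkaOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(BytesWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Assumed helper: point the job at kafka://<kafka-server>/<kafka-topic>.
            KafkaOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }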

What can I tune?
----------------

* kafka.output.queue.bytes: Bytes to queue in memory before pushing to the Kafka
  producer (i.e., the batch size). Default is 1,000,000 (1 million) bytes.

Any of Kafka's producer parameters can be changed by prefixing them with
"kafka.output" in the job configuration. For example, to change the compression
codec, add the "kafka.output.compression.codec" parameter (e.g.,
"SET kafka.output.compression.codec 0" in a Pig script for no compression).
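
The same prefixing rule applies when configuring a KafkaOutputFormat job in
code. Below is a small sketch; the class name is invented, and it assumes the
standard Kafka 0.8 producer property names (compression.codec,
request.required.acks) are the ones your producer version supports:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical helper that applies a few kafka.output.* settings to a job configuration.
    public class KafkaOutputTuning {
        public static Configuration tunedConf() {
            Configuration conf = new Configuration();
            conf.set("kafka.output.queue.bytes", "1000000");      // batch size in bytes (the default)
            conf.set("kafka.output.compression.codec", "0");      // 0 = no compression
            conf.set("kafka.output.request.required.acks", "1");  // wait for the leader's ack
            return conf;
        }
    }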

For easier debugging, the above values, as well as the Kafka broker list
(kafka.metadata.broker.list), the topic (kafka.output.topic), and the schema
(kafka.output.schema), are injected into the job's configuration. By default,
the Hadoop producer uses Kafka's sync producer, since asynchronous operation
doesn't make sense in the batch Hadoop case.