Monday, January 26, 2015

Spark + Kafka + Cassandra (Part 2 - Spark Kafka [mass] Producer)


Welcome to part two of the series 'Spark + Kafka + Cassandra'.

Building on top of part one and preparing for part three, here we'll spin up a little application that does one thing well: it spews out random messages for a number of seconds.  Basically, we wanted a tool that could pump a large quantity of messages into Kafka to:
  • Test a Kafka setup
  • Tinker with throughput
  • Feed a consumer
  • etc.
Also, this example can easily be altered to spew whatever one might like (CSV, JSON, data from a flat file or HDFS, etc.).  Make it yours - spew your own stuff.
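To make that concrete, here is a minimal sketch of what swapping the payload might look like.  The names (MessageSketch, randomAlphanumeric, randomCsvLine) and the two-column CSV layout are made up for illustration and are not taken from the example project.

```scala
import scala.util.Random

// Hypothetical helpers, not the project's code.
object MessageSketch {
  // A random alphanumeric payload of the given length.
  def randomAlphanumeric(length: Int, rng: Random = new Random()): String =
    rng.alphanumeric.take(length).mkString

  // An alternative payload: one CSV line with an invented schema (id,value).
  def randomCsvLine(rng: Random = new Random()): String =
    s"${rng.nextInt(1000)},${randomAlphanumeric(8, rng)}"
}
```

Either function could stand in wherever the message body gets built.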

This second instalment will have you:
  • Run ZooKeeper locally (as part of the Kafka distribution)
  • Run Kafka locally
  • Run a local Spark cluster
  • Create a Kafka Producer wrapped up in a Spark application
  • Submit the application to generate messages sent to a given topic for some number of seconds.


Prerequisites:
  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • ZooKeeper and Kafka running locally
  • git

Set it up...

If you have not walked through Part 1 of this series, go there and walk through it to get ZooKeeper and Kafka up and running on your local machine.  When you've done that, come back here and proceed to the next step.

Clone the Example

Begin by cloning the example project from GitHub (spark-kafka-msg-generator), then cd into the project directory.

[bkarels@rev27 work]$ git clone
Cloning into 'spark-kafka-msg-generator'...
remote: Counting objects: 19, done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 19 (delta 0), reused 15 (delta 0)
Unpacking objects: 100% (19/19), done.
[bkarels@rev27 work]$ cd spark-kafka-msg-generator/
[bkarels@rev27 spark-kafka-msg-generator]$

Prepare the Application

In a terminal, from the project root, fire up SBT and run assembly to create your application jar file.
[bkarels@rev27 spark-kafka-msg-generator]$ sbt
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
[info] Loading project definition from /home/bkarels/work/spark-kafka-msg-generator/project
[info] Updating {file:/home/bkarels/work/spark-kafka-msg-generator/project/}spark-kafka-msg-generator-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to MsgSpewer (in build file:/home/bkarels/work/spark-kafka-msg-generator/)
> assembly
[info] SHA-1: cddec14059d6a435847f8dd4b4b6f15f6899c0c3
[info] Packaging /home/bkarels/work/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar ...
[info] Done packaging.
[success] Total time: 53 s, completed Jan 26, 2015 9:10:16 AM

As per normal, take note of where your jar file gets put (the Packaging line above).  Also, note that the resulting jar file name is set in assembly.sbt; here it is sparkFuProducer.jar.
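For reference, a jar name setting in assembly.sbt might look roughly like the fragment below.  This is a guess at the general shape, not a copy of the project's file, and the key name depends on which sbt-assembly version the build uses.

```scala
// assembly.sbt (fragment). Assumes the sbt-assembly plugin is already
// enabled for the build; check the project's own file for the real setting.

// sbt-assembly 0.12.0 and later:
assemblyJarName in assembly := "sparkFuProducer.jar"

// Older sbt-assembly releases (0.11.x) used this key instead:
// jarName in assembly := "sparkFuProducer.jar"
```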

Spark it up!

If your local Spark cluster is not up and running, do that now.  Go here to see about getting 'r' done.

Make Sparks fly! (i.e. run it)

Having reviewed the code, you will have seen that this application accepts up to three arguments.  By default, the application will publish to topic sparkfu for 10 seconds without printing more than the normal output to the terminal.  You can optionally set the following three arguments:

[topic [duration [verbose]]]
Argument Examples:
Publish to topic sparkfu for 20 seconds and do print additional output:
sparkfu 20 true

Publish to topic filthpig for 3600 seconds and do not print additional output:
filthpig 3600

Publish to topic schweinehund for 10 seconds and do not print additional output:
schweinehund

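If it helps to see the defaults spelled out, positional arguments like these could be handled along the following lines.  This is a sketch under assumptions: ArgsSketch, Config, and parse are hypothetical names, and the project's own argument handling may well differ.

```scala
// Hypothetical sketch of positional-argument handling; not the project's code.
object ArgsSketch {
  // Defaults mirror the ones described above: sparkfu, 10 seconds, not verbose.
  case class Config(topic: String = "sparkfu", seconds: Int = 10, verbose: Boolean = false)

  // Later arguments are only read when the earlier ones are present.
  def parse(args: Array[String]): Config = args match {
    case Array()                   => Config()
    case Array(topic)              => Config(topic)
    case Array(topic, secs)        => Config(topic, secs.toInt)
    case Array(topic, secs, v, _*) => Config(topic, secs.toInt, v.toBoolean)
  }
}
```

With this shape, the three examples above map to ArgsSketch.parse(Array("sparkfu", "20", "true")), ArgsSketch.parse(Array("filthpig", "3600")), and ArgsSketch.parse(Array("schweinehund")).
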
Like Part 1, when this application is submitted to the cluster it will have no clear output.  Here again we can use the local consumer.  Do as below or look back to Part 1 to see about this.
$KAFKA_HOME/bin/ --zookeeper localhost:2181 --topic sparkfu --from-beginning

OK, spark it up:
$SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.RandomMessages --master local[*] ~/dev/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar sparkfu 20 true
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
This will run for: 20s
15/01/26 09:24:42 INFO SyncProducer: Connected to rev27:9092 for producing
5000 * 18 seconds left...
10000 * 17 seconds left...
115000 * 3 seconds left...
120000 * 2 seconds left...
130000 * 1 seconds left...
Produced 134137 msgs in 20s -> 6706.0m/s.
If you were running your local consumer you should see a big ol' pile of alphanumeric randomness stream by.  Just like that you can spew a pretty good chunk of messages to a topic on command. 
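
For the curious, the countdown and final count above can be pictured as a simple timed loop.  The sketch below is an assumption about the general shape, not the project's actual code: TimedSpew, run, and rate are hypothetical names, and send is a stand-in for the real Kafka publish call.

```scala
// Hypothetical sketch; `send` stands in for whatever publishes to Kafka.
object TimedSpew {
  // Calls `send` with freshly made messages until `seconds` have elapsed,
  // printing progress every `reportEvery` messages. Returns the message count.
  def run(seconds: Int, reportEvery: Int = 5000)
         (makeMsg: () => String)(send: String => Unit): Long = {
    val deadline = System.currentTimeMillis() + seconds * 1000L
    var count = 0L
    while (System.currentTimeMillis() < deadline) {
      send(makeMsg())
      count += 1
      if (count % reportEvery == 0) {
        val left = (deadline - System.currentTimeMillis()) / 1000
        println(s"$count * $left seconds left...")
      }
    }
    count
  }

  // Average messages per second over the whole run.
  def rate(count: Long, seconds: Int): Double = count.toDouble / seconds
}
```

Passing a no-op send, as in TimedSpew.run(2)(() => "hello")(_ => ()), is a cheap way to see how fast the loop itself spins before a broker is involved.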

What's next?

On to part three where we'll consume messages out of Kafka with Spark Streaming and write them to Cassandra.  See part three here!

1 comment:

  1. I am getting this error while running assembly. Any idea

    [warn] Merging 'com/esotericsoftware/minlog/Log.class' with strategy 'first'
    [trace] Stack trace suppressed: run last *:assembly for the full output.
    [error] (*:assembly) deduplicate: different file contents found in the following:
    [error] /home/xyz/.ivy2/cache/org.apache.spark/spark-network-common_2.11/jars/spark-network-common_2.11-1.3.1.jar:com/google/common/base/Absent.class
    [error] /home/xyz/.ivy2/cache/
    [error] Total time: 46 s, completed May 9, 2016 4:32:42 PM