Thursday, January 29, 2015

Spark + Kafka + Cassandra (Part 3 - Spark Streaming Kafka Consumer)


Welcome to the part three of the series 'Spark + Kafka + Cassandra'.

Building on top of part one and part two, now it is time to consume a bunch of stuff from Kafka using Spark Streaming and dump it into Cassandra.  There really was no nice way to illustrate consumption without putting the messages somewhere - so why not go straight to c*?  Don't care about c*?  Feel free to write to stdout, HDFS, text file, whatever.

This piece is effectively designed to work with the message generator from part two, but you can put messages into your Kafka topic however you choose.

This instalment will have you:
  • Run ZooKeeper local (as part of Kafka distribution)
  • Run Kafka local
  • Run a local Spark Cluster
  • Create a Kafka Producer wrapped up in a Spark application (Part two)
  • Submit the application to generate messages sent to a given topic for some number of seconds.
  • Have a running Spark Streaming application ready and waiting to consume your topic and dump the results into a Cassandra table.


  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • ZooKeeper and Kafka running local 
  • git

Set it up...

It is not mandatory that you have gone through parts one and two of this series, but it will make this part more seamless to do so.  I recommend doing that and then return here.

Clone the Example

Begin by cloning the example project from github - spark-streaming-kafka-consumer, & cd into the project directory.

[bkarels@rev27 work]$ git clone
Cloning into 'spark-streaming-kafka-consumer'...
remote: Counting objects: 29, done.
remote: Total 29 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (29/29), done.
[bkarels@rev27 work]$ cd spark-streaming-kafka-consumer/
[bkarels@rev27 spark-streaming-kafka-consumer]$

Prepare the Application

In a terminal, from the project root, fire up SBT and assembly the project to create your application jar file.
[bkarels@rev27 spark-streaming-kafka-consumer]$ sbt
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
[info] Loading project definition from /home/bkarels/work/spark-streaming-kafka-consumer/project
[info] Updating {file:/home/bkarels/work/spark-streaming-kafka-consumer/project/}spark-streaming-kafka-consumer-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to Spark Fu Streaming Kafka Consumer (in build file:/home/bkarels/work/spark-streaming-kafka-consumer/)
> assembly
[info] SHA-1: 4597553bcdecc6db6a67cad0883cc56cadb0be03
[info] Packaging /home/bkarels/work/spark-streaming-kafka-consumer/target/scala-2.10/sparkFuStreamingConsumer.jar ...
[info] Done packaging.
[success] Total time: 29 s, completed Jan 29, 2015 9:24:32 AM

Like before, take note of your jar file location (highlighted above).  Also, note that we have set the resulting jar file name in assembly.sbt - here we have set it to sparkFuStreamingConsumer.jar.

Spark it up!

If your local Spark cluster is not up and running, do that now.  Go here to see about getting 'r' done.

Make Sparks fly! (i.e. run it)

Having reviewed the code you have seen that out of the box, this streaming application expects ZooKeeper local on port 2181 & Kafka local with a topic of sparkfu.  Additionally, a local Cassandra instance with keyspace sparkfu and table messages should also exist.  The CQL for this table is at the root of the project.  See my post on local Cassandra if you do not yet have that bit in place.  Once Cassandra is up and running execute the CQL to create the messages table.

You should be able to see the following when done:
cqlsh> use sparkfu;
cqlsh:sparkfu> DESCRIBE TABLE messages;

CREATE TABLE messages (
  key text,
  msg text,
  PRIMARY KEY ((key))
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
...and so we're sure we're starting from zero:
cqlsh:sparkfu> TRUNCATE messages;
cqlsh:sparkfu> SELECT key, msg FROM messages;

(0 rows)

OK, spark it up:
[bkarels@rev27 spark-streaming-kafka-consumer]$ $SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.Consumer --master spark:// /home/bkarels/work/spark-streaming-kafka-consumer/target/scala-2.10/sparkFuStreamingConsumer.jar
Output will flow about here...
If all has gone well, your streaming application is now running on your local Spark cluster waiting for messages to hit your Kafka topic.

You can verify at http://localhost:4040.

If you have just recently completed part two, your application will likely be busy pulling messages from the sparkfu topic.  However, let's assume that's not the case.  So, fire up the message generator from part two and run it for a few seconds to publish some messages to the sparkfu topic.

[bkarels@rev27 spark-kafka-msg-generator]$ $SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.RandomMessages --master local[*] /home/bkarels/dev/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar sparkfu 5 true
15/01/29 09:45:40 INFO SyncProducer: Connected to rev27:9092 for producing
5000 * 3 seconds left...
Produced 9282 msgs in 5s -> 1856.0m/s.
So, we should have 9282 messages in our topic (in this example).  We should also, by the time we look have that same number of messages in our messages table in Cassandra.

cqlsh:sparkfu> SELECT COUNT(*) FROM messages;


(1 rows)

Due to the nature of this example, the data itself will be uninteresting:

 cqlsh:sparkfu> SELECT key,msg FROM messages LIMIT 5;

 key                       | msg
 kUs3nkk0mv5fJ6eGvcLDrkQTd | 1422546343283,AMpOMjZeJozSy3t519QcUHRwl,...
 P6cUTChERoqZ7bOyDa3XjnHNs | 1422546344238,VDPCAhV3k3m5wfaUY0jAB8qB0,...
 KlRKLYnnlZY6NCpbKyQEIKLrF | 1422546343576,hdzvBKR7z2raTsxNYFoTFmeS2,...
 YGrBt2ZI7PPXrpopLsSTAwYrD | 1422546341519,cv0b7MEPdnrK1HuRL0GPDzMMP,...
 YsDWO67wKMuWBzyRpOiRSNpq2 | 1422546344491,RthQLkxPc5es7f2fYjTXRJnNu...

(5 rows)

So there you have it - about the most simple Spark Streaming Kafka consumer you can do.  Shoulders of giants people.  So tweak it, tune it, expand it, use it as your springboard.

What's next?

Still in flight for the next part of this series:
  • Not sure - I may just work custom Kafka encode/decode into this example.
  • May also, model the data in more of time series fashion to make reading from Kafka a more practical exercise.
  • Your thoughts?
 Thanks for checking this out - more to come!


  1. Thanks for this great post! I got almost everything working except the one below:

    foreachRDD at SparkFuConsumer.scala:49
    Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 128, java.lang.Exception: Could not compute split, block input-0-1422993397600 not found
    at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

    Driver stacktrace:

  2. Thank for you interesting tutorial mixing all the technologies I was research the lasts weeks.
    Do you think that this architecture will suit a system for save industry sensor data?

    later on I wanted to query it using spark and apply machine learning for get more in deep knowledge but first step is to store the data.

    Thanks again

    1. I will not say that gathering sensor data is a "classic" use case as Spark has not been around long enough for anything Spark to be "classic" - but yes, the Kafka -> Spark Streaming -> Cassandra stack is very well suited to sensor data. That said, it is not your only option. Consider your use cases carefully and don't be afraid to look at things that can also work well in your enterprise. Spark is sexy and shiny, but Flume, Storm, and HDFS could get the job done just as well in many cases. 0MQ, Rabbit, Kestrel and others also play where Kafka plays - each has pros and cons and should be at least reviewed.

      So, now that I've posted my "not a fan-boy" bits - Kafka -> Spark Streaming -> Cassandra is a wicked awesome stack that I think will become more predominant in the coming year. As Spark matures I suspect we will see projects migrate from Storm. Similarly, as Cassandra gains acceptance we'll see hot data move out of other stores and into C*. A lot of time and money has been spent on traditional map reduce with HDFS and until stake holders can be shown demonstrated ROI on the new stack there will be little movement. While that movement has started, I believe we'll see it accelerate as we move through 2015.

  3. I forget to reply you giving thanks for the info. Greetings from Germany(but I am spanish) you have a free beer waiting :P

  4. I enjoyed all your tutorials and I really thank you for easing the learning curve of Spark, r3volutionary!

    I want to ask you a question regarding a particular use case. I just want to query data from Cassandra, process it and produce an output (for example into Kafka). The data process is expected to be non-tribial, maybe merging the Cassandra data with other systems, consume caches, etc. The process must be almost real time (pseudo-synchronous):
    -Is this stack needed or its too much? There are simpler ways to consume Cassandra data and aggregate/process it which also offers scalability and good performance?
    -What would be the proper workflow? Keep a Spark job alive which consumes user petitions or start a new Spark job for each user petition?

    Thank you! Looking forward to seeing new tutorials!

    1. Albert - sorry it has taken me so long to see this - my paying job has kept me too busy to dedicate more time to Spark-Fu of recent. More stuff is coming, just as soon as free time becomes free...

      I'm afraid I do not fully grok your use case - indeed it does sound non-trivial. I would just advise caution not to create a circular consumption of streams - that could get very ugly, very fast. That is, make sure all of your pipelines have a destination.

      I will say that Kafka and Spark Streaming are wonderful and sexy, but they are not always the final solution. Don't be afraid to look at how some immutable portion of your intermediate results might be kept in HDFS to be reprocessed periodically. Also, if you're executing C* queries, depending on how you're going about it, a standard Spark job (non-streaming) on a schedule (e.g. cron) might fit your requirements. If you have some serious near real time requirements perhaps there are elements of solr that will work for you?

      If you did want to stick to more strict Spark/Kafka/C* stack, recall that you can tune your interval on your streaming jobs. So, if your primary job is outputing intermediate results to various Kafka topics, the streaming jobs that consume those streams may process at slightly longer intervals. Of course, that all depends on the size and speed of your cluster, the volume and rate of your many variables.

      Best of luck to you - let me know how things go!

      Oh, also, remember you can look beyond Kafka to feed Spark Streaming. If it fits your use case HDFS, Kinesis, 0MQ, or even Flume might get you past a technical hurdle.

  5. The information shared are very much useful My sincere thanks for sharing this post Please Continue to share this post
    Hadoop Training in Chennai

    1. Hi, Great.. Tutorial is just awesome..It is really helpful for a newbie like me.. I am a regular follower of your blog. Really very informative post you shared here. Kindly keep blogging. If anyone wants to become a Java developer learn from Java Training in Chennai. or learn thru Java Online Training India . Nowadays Java has tons of job opportunities on various vertical industry.

  6. nice blog has been shared by you. before i read this blog i didn't have any knowledge about this but now i got some knowledge. so keep on sharing such kind of an interesting blogs.
    hadoop training in chennai

  7. Excellent blog. Thank you for sharing with us. The information you shared is very effective for learners and I have got some important suggestions from your blog post. Software Testing Training in Chennai | Big data Analytics Training in Chennai

  8. Hi, I am really happy to found such a helpful and fascinating post that is written in well manner. Thanks for sharing such an informative post..Big Data Hadoop Training in Bangalore | Data Science Training in Bangalore

  9. Well Said, you have provided the right info that will be beneficial to somebody at all time. Thanks for sharing your valuable Ideas to our vision.

    Hadoop Training in Marathallai

    Hadoop Training in BtmLayout

  10. Fantastic blog., The Art Education was superb., Really interesting to read, Thanks for sharing such a nice blog.Cloud Computing Project Center in Chennai | Cloud Computing Projects in Velachery

  11. Your good knowledge and kindness in playing with all the pieces were
    very useful. I don’t know what I would have done if I had not
    encountered such a step like this.

    AWS Training in Bangalore

    AWS Training in Bangalore

  12. Pretty article! I found some useful information in your blog, it was awesome to read, thanks for sharing this great content to my vision, keep sharing.
    Cloud Computing Project Center in Chennai | Cloud Computing Project Center in Velachery

  13. Interesting post! This is really helpful for me. I like it! Thanks for sharing!
    Mobile application developers in Chennai | PHP developers Chennai

  14. Thank you a lot for providing individuals with a very
    spectacular possibility to read critical reviews from this site.

    aws training in bangalore

    aws training in chennai

  15. I believe there are many more pleasurable opportunities ahead for individuals that looked at your site.

    java training in bangalore

  16. I believe there are many more pleasurable opportunities ahead for individuals that looked at your site.
    python training in bangalore

  17. Thanks for one marvelous posting! I enjoyed reading it;Great post.The information was very useful.Keep the good work goin on!!
    Hadoop training in chennai | Mainframe training in chennai | SAP SD training in chennai

  18. Thanks for sharing this niche useful informative post to our knowledge.
    brochure designers in chennai | brochure design company in chennai

  19. Streaming is a relatively recent development, because broadband connection had to run fast enough to show the data in real time. If there is an interruption due to congestion on the internet, for example, the audio or video will drop out or the screen will go blank. Watch Rick and Morty free online


  20. Best Solidworks training institute in noida

    SolidWorks is a solid modeling computer-aided design (CAD) and computer-aided engineering (CAE) computer program that runs on Microsoft Windows. SolidWorks is published by Dassault Systems. Solid Works: well, it is purely a product to design machines. But, of course, there are other applications, like aerospace, automobile, consumer products, etc. Much user friendly than the former one, in terms of modeling, editing designs, creating mechanisms, etc.
    Solid Works is a Middle level, Main stream software with focus on Product development & this software is aimed at Small scale & Middle level Companies whose interest is to have a reasonably priced CAD system which can support their product development needs and at the same time helps them get their product market faster.

    Company Address:
    Phone No: 0120-4330760 ,+91-880-282-0025

  21. 3D Animation Training in Noida

    Best institute for 3d Animation and Multimedia

    Best institute for 3d Animation Course training Classes in Noida- webtrackker Is providing the 3d Animation and Multimedia training in noida with 100% placement supports. for more call - 8802820025.

    3D Animation Training in Noida

    Company Address:

    Webtrackker Technology

    C- 67, Sector- 63, Noida

    Phone: 01204330760, 8802820025



  22. AWS Training in Bangalore - Live Online & Classroom
    myTectra Amazon Web Services (AWS) certification training helps you to gain real time hands on experience on AWS. myTectra offers AWS training in Bangalore using classroom and AWS Online Training globally. AWS Training at myTectra delivered by the experienced professional who has atleast 4 years of relavent AWS experince and overall 8-15 years of IT experience. myTectra Offers AWS Training since 2013 and retained the positions of Top AWS Training Company in Bangalore and India.

    IOT Training in Bangalore - Live Online & Classroom
    IOT Training course observes iot as the platform for networking of different devices on the internet and their inter related communication. Reading data through the sensors and processing it with applications sitting in the cloud and thereafter passing the processed data to generate different kind of output is the motive of the complete curricula. Students are made to understand the type of input devices and communications among the devices in a wireless media.

  23. Very interesting blog which helps me to get the in depth knowledge about the technology, Thanks for sharing such a nice blog..
    Good discussion.
    Six Sigma Training in Abu Dhabi
    Six Sigma Training in Dammam
    Six Sigma Training in Riyadh

  24. Great Article… I love to read your articles because your writing style is too good, its is very very helpful for all of us. Do check Six Sigma Training in Bangalore | Six Sigma Training in Dubai & Get trained by an expert who will enrich you with the latest trends.

  25. Thank for sharing very valuable information.nice article.keep posting.For more information visit
    aws online training
    aws training in hyderabad
    amazon web services(AWS) online training

  26. Sap fico training institute in Noida

    Sap fico training institute in Noida - Webtrackker Technology is IT Company which is providing the web designing, development, mobile application, and sap installation, digital marketing service in Noida, India and out of India. Webtrackker is also providing the sap fico training in Noida with working trainers.

    C - 67, sector- 63, Noida, India.
    F -1 Sector 3 (Near Sector 16 metro station) Noida, India.

    +91 - 8802820025

  27. Sap fico training institute in Noida

    Sap fico training institute in Noida - Webtrackker Technology is IT Company which is providing the web designing, development, mobile application, and sap installation, digital marketing service in Noida, India and out of India. Webtrackker is also providing the sap fico training in Noida with working trainers.

    C - 67, sector- 63, Noida, India.
    F -1 Sector 3 (Near Sector 16 metro station) Noida, India.

    +91 - 8802820025

  28. This comment has been removed by the author.

  29. Generally, video streaming is just taking a video and sound flag at the source and transmitting over the web. This enables you to send any intuitive video stream to any site that can get dynamic iptv service 2019

  30. Even more amazing, many websites offer multiple games for free. What fan wouldn't be happy with that?live sport stream

  31. Thank you so much for posting this. I really appreciate your work. Keep it up. Great work!Best software training company with placement in Hyderabad

  32. Attend The Python Training in Bangalore From ExcelR. Practical Python Training in Bangalore Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Python Training in Bangalore.

  33. Great blog thanks for sharing Looking for the best creative agency to fuel new brand ideas? Adhuntt Media is not just a digital marketing company in chennai. We specialize in revamping your brand identity to drive in best traffic that converts.

  34. Nice blog thanks for sharing Choosing the right place to buy your first plant isn’t that hard of a choice anymore. Presenting the best plant nursery in Chennai - Karuna Nursery Gardens is proud to showcase more than 3000+ plants ready to be chosen from.

  35. Excellent blog thanks for sharing Run your salon business successfully by tying up with the best beauty shop in Chennai - The Pixies Beauty Shop. With tons of prestigious brands to choose from, and amazing offers we’ll have you amazed.

  36. Very useful blog thanks for sharing While choosing your perfect ride for driving, Accord Cars comes with and the best packages for you to pick from. Car rentals for self drive in Chennai are done the easier. Just pick out your plan from hourly, daily, weekly and even monthly plans available.

  37. Interesting blog thanks for sharing With over a three decade of beauty expertise at our fingertips, we believed that everyone has the right to be beautiful. And so began the journey of our very own Pearl’s Beautician course in Chennai.

  38. Information through streaming video can be captured by news stations and individualsBusiness Management Articles, and can be used for personal and professional reasons.moviebox ios

  39. I appreciate your effort and you have done a great job.thanks for the ideas and please add more in future.web design company in velachery

  40. Attend The Data Analytics Courses in Bangalore with Placement From ExcelR. Practical Data Analytics Courses in Bangalore with Placement Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Analytics Courses in Bangalore with Placement.
    ExcelR Data Analytics Courses in Bangalore with Placement

  41. I learned World's Trending Technology from certified experts for free of cost. I Got a job in decent Top MNC Company with handsome 14 LPA salary, I have learned the World's Trending Technology from Data Science Training in btm experts who know advanced concepts which can help to solve any type of Real-time issues in the field of Python. Really worth trying Freelance seo expert in bangalore

  42. Awesome blog thankks for sharing 100% virgin Remy Hair Extension in USA, importing from India. Premium and original human hair without joints and bondings. Available in Wigs, Frontal, Wavy, Closure, Bundle, Curly, straight and customized color hairstyles Extensions.

  43. Very useful blog thanks for sharing IndPac India the German technology Packaging and sealing machines in India is the leading manufacturer and exporter of Packing Machines in India.

  44. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.
    courses on data analytics

  45. Nice blog,I understood the topic very clearly,And want to study more like this.
    Data Scientist Course

  46. I'd like to thank you for the efforts you have put in writing this website. I am hoping to check out the same high-grade blog posts from you later on as well.develop In truth, your creative writing abilities has motivated me to get my very own blog now ;)

  47. Attend The Data Science Courses From ExcelR. Practical Data Science Courses Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Science Courses.
    Data Science Courses
    Data Science Interview Questions

  48. It's very useful article with inforamtive and insightful content and i had good experience with this information.Enroll today to get free access to our live demo session which is a great opportunity to interact with the trainer directly which is a placement based Salesforce training India with job placement and certification . I strongly recommend my friends to join this Salesforce training institutes in hyderabad practical course, great curriculum Salesforce training institutes in Bangalore with real time experienced faculty Salesforce training institutes in Chennai. Never delay to enroll for a free demo at Salesforce training institutes in Mumbai who are popular for Salesforce training institutes in Pune.

  49. I have to search sites with relevant information on given topic and provide them to teacher our opinion and the article.

    Simple Linear Regression

    Correlation vs Covariance

  50. Very interesting to read this article.I would like to thank you for the efforts you had made for writing this awesome article. This article inspired me to read more. keep it up.
    Correlation vs Covariance
    Simple linear regression
    data science interview questions

  51. Great Article
    Cloud Computing Projects

    Networking Projects

    Final Year Projects for CSE

    JavaScript Training in Chennai

    JavaScript Training in Chennai

    The Angular Training covers a wide range of topics including Components, Angular Directives, Angular Services, Pipes, security fundamentals, Routing, and Angular programmability. The new Angular TRaining will lay the foundation you need to specialise in Single Page Application developer. Angular Training

  52. I need to communicate my deference of your composing aptitude and capacity to make perusers read from the earliest starting point as far as possible. I might want to peruse more up to date presents and on share my musings with you.
    360DigiTMG Data Analytics Course

  53. I am another customer of this site so here I saw various articles and posts posted by this site,I curious more energy for some of them trust you will give more information further.
    data science course in malaysia

  54. I looked at some very important and to maintain the length of the strength you are looking for on your website
    iot course in noida


  55. On the off chance that your searching for Online Illinois tag sticker restorations, at that point you have to need to go to the privileged place.
    AI Courses

  56. I have read your blog. Your information is really useful for beginner. information provided here are unique and easy to understand. Thanks for this useful information. This is a great inspiring article. I am pretty much pleased with your good work.
    AWS Training Institute in Chennai | AWS Training Center in Velachery | AWS Certification Training in Velachery

  57. Excellent post... Thank you for sharing such a informative and information blog with us.keep updating such a wonderful post..
    MicorSoft Azure Training Institute in Chennai | Azure Training Center in Chennai | Azure Certification Training in velachery | Online Azure training in Velachery

  58. There are lots of Natural Remedies for Achalasia in market but these are very expensive. A product made by Natural Herbs Clinic is one of the useful and low prices Natural Remedy for Achalasia which works without any risk. It is a low price product and made with herbal ingredients that work without any side effects.

  59. Attend The Data Analyst Course From ExcelR. Practical Data Analyst Course Sessions With Assured Placement Support From Experienced Faculty. ExcelR Offers The Data Analyst Course.
    Data Analyst Course