Thursday, December 4, 2014

Spark 1.1.1 to read & write to Cassandra 2.0.11 - Simple Example


Overview


In my last post we looked at the simplest way to read some data from Apache Cassandra using Apache Spark from your local machine. Taking the next logical step, we will now write some data to Cassandra. The setup for this post is nearly identical to what we did here. Assuming you have done that work, this should only take a couple of minutes.

Fetch the example from GitHub

From GitHub, clone the sbt-spark-cassandra-rw project.

Assembly

From a terminal, cd into the sbt-spark-cassandra-rw project and fire up sbt.  Once ready, call assembly to create your application jar file.
[bkarels@ahimsa simple-rw]$ sbt
[info] Loading project definition from /home/bkarels/dev/simple-rw/project
[info] Set current project to Simple RW Project (in build file:/home/bkarels/dev/simple-rw/)
> assembly
...
[info] Packaging /home/bkarels/dev/simple-rw/target/scala-2.10/simpleSpark-RW.jar ...
[info] Done packaging.
[success] Total time: 21 s, completed Dec 3, 2014 10:52:33 AM
As before, take note of where your jar is written (the Packaging line above).
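For reference, the heart of the project's build definition is just a couple of library dependencies. The sketch below is roughly what to expect; the version numbers are my assumptions for pairing Spark 1.1.1 with Cassandra 2.0.x, so defer to the build.sbt in the cloned project.

// build.sbt (sketch) -- versions are assumptions; the cloned project is the source of truth
name := "Simple RW Project"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.1.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
)

// The assembly task itself comes from the sbt-assembly plugin, declared in project/plugins.sbt.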

Prepare the data

At the sbt-spark-cassandra-rw project root you will find the file things.cql. Using DevCenter or cqlsh, execute this script against your target Cassandra cluster. If you need to set up a local cluster for development, look here.

In this example we will look at a group of things. Things have keys and values. But for a thing to matter, it must have a value greater than one. So we will pull down all the things, filter for the things that matter, and write only the things that matter into their own table (thingsthatmatter).

It is worth noting that the target table must exist for us to write to it.  Unlike some other NoSQL data stores, we must plan ahead a bit more with Cassandra.
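The full source is in the repository, but the core of the job looks roughly like the sketch below. The keyspace name (sparkfu) and the column names (key, value) are assumptions on my part; check things.cql and the project source for the real ones.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Point the connector at the local Cassandra node
    val conf = new SparkConf()
      .setAppName("Simple RW Project")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read every (key, value) pair from the things table
    val things = sc.cassandraTable("sparkfu", "things")
      .map(row => (row.getString("key"), row.getInt("value")))

    // Keep only the things that matter (value greater than one)
    val thingsThatMatter = things.filter { case (_, value) => value > 1 }

    // Print them, then write them to their own table
    thingsThatMatter.collect().foreach(println)
    thingsThatMatter.saveToCassandra("sparkfu", "thingsthatmatter", SomeColumns("key", "value"))

    // The real app also truncates thingsthatmatter just before it exits --
    // see the repository source for that step
    sc.stop()
  }
}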

Spark it up!

If your local Spark cluster is not up and running, start it now. If you need to review how to go about that, you can look here.

Make Sparks fly! (i.e. run it)

This bit is identical to the previous example. With your application assembled, Cassandra up and prepared, and your Spark cluster humming, go to a terminal and submit your job.
[bkarels@ahimsa simple-rw]$ $SPARK_HOME/bin/spark-submit --class com.sparkfu.simple.SimpleApp --master spark://127.0.0.1:7077 /home/bkarels/dev/simple-rw/target/scala-2.10/simpleSpark-RW.jar
...
14/12/03 20:42:45 INFO spark.SparkContext: Job finished: toArray at SimpleApp.scala:27, took 0.641849986 s
(key8,27)
(key7,6)
(key1,28)
(key3,7)
(key4,99)
(key0,42)
(key9,1975)
(key6,100)
If all has gone well, your terminal will spit out a list of all the things that matter, as above. Unaltered, the application will truncate the thingsthatmatter table just before it exits. If you comment out that section in the source, re-assemble, and re-submit the job, you can further confirm this by querying the thingsthatmatter table:

Note how all of the things that matter have a value greater than one.
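If you would rather do that check from Spark than from cqlsh or DevCenter, a quick read-back of the target table does the same job (again, the sparkfu keyspace name is an assumption):

// Read the written rows back out of thingsthatmatter and print them
sc.cassandraTable("sparkfu", "thingsthatmatter")
  .map(row => (row.getString("key"), row.getInt("value")))
  .collect()
  .foreach(println)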

What's next?

To make this work we read our data into a simple tuple, transformed it, and wrote it back. While this works for very simple examples, we will outgrow it very quickly; in fact, we already have. Next we will look at reading data directly into a Scala case class, transforming that data (perhaps into a new class), and writing it back to Cassandra.
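As a small preview, the connector can map rows straight onto a case class, so the next iteration will look something like the sketch below (the Thing class is hypothetical and the sparkfu keyspace name is still an assumption):

import com.datastax.spark.connector._

// Hypothetical case class whose fields line up with the table's columns
case class Thing(key: String, value: Int)

// Read rows directly into Thing instances, keep the ones that matter,
// and write them back -- no manual tuple plumbing required
val things = sc.cassandraTable[Thing]("sparkfu", "things")
val mattering = things.filter(_.value > 1)
mattering.saveToCassandra("sparkfu", "thingsthatmatter")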
