Thursday, March 26, 2015

Coming soon to spark-fu!

It has been a busy 2015 (though you couldn't tell by looking here).  I have a lot of stuff incubating that needs to be pushed out here once all the bugs are dead (and I find the time to write it all up).

Things to look for in the coming weeks:

  1. Cluster install using Ambari
  2. Spark on Yarn introduction
  3. Updated examples with Scala 2.11.x and Kafka 0.8.2.1
  4. Freshening previous stuff with current Spark version
  5. Going current on SBT as well.
So hold tight - this stuff is coming.  Also, just a heads up: some of the source out at github may be updated before the tutorials reflect the changes.  Everything should still work, but you may need to tweak versions of Scala, Spark, Kafka, etc.

Thursday, January 29, 2015

Spark + Kafka + Cassandra (Part 3 - Spark Streaming Kafka Consumer)

Overview

Welcome to part three of the series 'Spark + Kafka + Cassandra'.

Building on top of part one and part two, now it is time to consume a bunch of stuff from Kafka using Spark Streaming and dump it into Cassandra.  There really was no nice way to illustrate consumption without putting the messages somewhere - so why not go straight to C*?  Don't care about C*?  Feel free to write to stdout, HDFS, a text file, whatever.


This piece is effectively designed to work with the message generator from part two, but you can put messages into your Kafka topic however you choose.

This instalment will have you:
  • Run ZooKeeper local (as part of Kafka distribution)
  • Run Kafka local
  • Run a local Spark Cluster
  • Create a Kafka Producer wrapped up in a Spark application (Part two)
  • Submit the application to generate messages sent to a given topic for some number of seconds.
  • Have a running Spark Streaming application ready and waiting to consume your topic and dump the results into a Cassandra table.

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • ZooKeeper and Kafka running local 
  • git


Set it up...

It is not mandatory that you have gone through parts one and two of this series, but it will make this part more seamless to do so.  I recommend doing that and then returning here.



Clone the Example

Begin by cloning the example project from github - spark-streaming-kafka-consumer, & cd into the project directory.

[bkarels@rev27 work]$ git clone https://github.com/bradkarels/spark-streaming-kafka-consumer.git
Cloning into 'spark-streaming-kafka-consumer'...
remote: Counting objects: 29, done.
remote: Total 29 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (29/29), done.
[bkarels@rev27 work]$ cd spark-streaming-kafka-consumer/
[bkarels@rev27 spark-streaming-kafka-consumer]$


Prepare the Application

In a terminal, from the project root, fire up SBT and assembly the project to create your application jar file.
[bkarels@rev27 spark-streaming-kafka-consumer]$ sbt
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
[info] Loading project definition from /home/bkarels/work/spark-streaming-kafka-consumer/project
[info] Updating {file:/home/bkarels/work/spark-streaming-kafka-consumer/project/}spark-streaming-kafka-consumer-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to Spark Fu Streaming Kafka Consumer (in build file:/home/bkarels/work/spark-streaming-kafka-consumer/)
> assembly
...
[info] SHA-1: 4597553bcdecc6db6a67cad0883cc56cadb0be03
[info] Packaging /home/bkarels/work/spark-streaming-kafka-consumer/target/scala-2.10/sparkFuStreamingConsumer.jar ...
[info] Done packaging.
[success] Total time: 29 s, completed Jan 29, 2015 9:24:32 AM
>

Like before, take note of your jar file location (the Packaging line in the output above).  Also, note that we have set the resulting jar file name in assembly.sbt - here we have set it to sparkFuStreamingConsumer.jar.
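
For reference, that name is set with sbt-assembly's jarName key.  Assuming the sbt-assembly 0.11.x syntax of the time (check the project's assembly.sbt for the exact contents), the relevant bits look something like this:

// assembly.sbt (sketch) - sbt-assembly 0.11.x style
import AssemblyKeys._

assemblySettings

jarName in assembly := "sparkFuStreamingConsumer.jar"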

Spark it up!

If your local Spark cluster is not up and running, do that now.  Go here to see about getting 'r' done.

Make Sparks fly! (i.e. run it)

Having reviewed the code, you have seen that, out of the box, this streaming application expects ZooKeeper running locally on port 2181 and Kafka running locally with a topic named sparkfu.  Additionally, a local Cassandra instance with keyspace sparkfu and table messages should also exist.  The CQL for this table is at the root of the project.  See my post on local Cassandra if you do not yet have that bit in place.  Once Cassandra is up and running, execute the CQL to create the messages table.

You should be able to see the following when done:
cqlsh> use sparkfu;
cqlsh:sparkfu> DESCRIBE TABLE messages;

CREATE TABLE messages (
  key text,
  msg text,
  PRIMARY KEY ((key))
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
...
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
...and so we're sure we're starting from zero:
cqlsh:sparkfu> TRUNCATE messages;
cqlsh:sparkfu> SELECT key, msg FROM messages;

(0 rows)
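
With the table in place and empty, it is worth a quick look at roughly what the consumer will be doing once submitted.  This is only a minimal sketch - the object and consumer group names are assumptions and the real source does a bit more - but the shape is: a StreamingContext, a receiver-based Kafka stream of (key, msg) pairs, and a saveToCassandra per batch via the spark-cassandra-connector.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkFuStreamingConsumer")
      .set("spark.cassandra.connection.host", "127.0.0.1") // local Cassandra
    val ssc = new StreamingContext(conf, Seconds(5))        // 5 second batches

    // ZooKeeper local on 2181, topic sparkfu, one receiver thread (group name assumed).
    val messages = KafkaUtils.createStream(ssc, "localhost:2181", "sparkfu-consumer", Map("sparkfu" -> 1))

    // Each Kafka record arrives as a (key, msg) pair - write each batch straight to sparkfu.messages.
    messages.foreachRDD(rdd => rdd.saveToCassandra("sparkfu", "messages", SomeColumns("key", "msg")))

    ssc.start()
    ssc.awaitTermination()
  }
}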

OK, spark it up:
[bkarels@rev27 spark-streaming-kafka-consumer]$ $SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.Consumer --master spark://127.0.0.1:7077 /home/bkarels/work/spark-streaming-kafka-consumer/target/scala-2.10/sparkFuStreamingConsumer.jar
...
Output will flow about here...
If all has gone well, your streaming application is now running on your local Spark cluster waiting for messages to hit your Kafka topic.

You can verify at http://localhost:4040.



If you have just recently completed part two, your application will likely be busy pulling messages from the sparkfu topic.  However, let's assume that's not the case.  So, fire up the message generator from part two and run it for a few seconds to publish some messages to the sparkfu topic.

[bkarels@rev27 spark-kafka-msg-generator]$ $SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.RandomMessages --master local[*] /home/bkarels/dev/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar sparkfu 5 true
...
15/01/29 09:45:40 INFO SyncProducer: Connected to rev27:9092 for producing
5000 * 3 seconds left...
Produced 9282 msgs in 5s -> 1856.0m/s.
So, we should have 9282 messages in our topic (in this example).  We should also, by the time we look, have that same number of messages in our messages table in Cassandra.

cqlsh:sparkfu> SELECT COUNT(*) FROM messages;

 count
-------
  9282

(1 rows)

Due to the nature of this example, the data itself will be uninteresting:

 cqlsh:sparkfu> SELECT key,msg FROM messages LIMIT 5;

 key                       | msg
---------------------------+------------------------------------------------------
 kUs3nkk0mv5fJ6eGvcLDrkQTd | 1422546343283,AMpOMjZeJozSy3t519QcUHRwl,...
 P6cUTChERoqZ7bOyDa3XjnHNs | 1422546344238,VDPCAhV3k3m5wfaUY0jAB8qB0,...
 KlRKLYnnlZY6NCpbKyQEIKLrF | 1422546343576,hdzvBKR7z2raTsxNYFoTFmeS2,...
 YGrBt2ZI7PPXrpopLsSTAwYrD | 1422546341519,cv0b7MEPdnrK1HuRL0GPDzMMP,...
 YsDWO67wKMuWBzyRpOiRSNpq2 | 1422546344491,RthQLkxPc5es7f2fYjTXRJnNu...

(5 rows)


So there you have it - about the simplest Spark Streaming Kafka consumer you can write.  Shoulders of giants, people.  So tweak it, tune it, expand it, and use it as your springboard.


What's next?

Still in flight for the next part of this series:
  • Not sure - I may just work custom Kafka encode/decode into this example.
  • May also model the data in more of a time-series fashion to make reading from Kafka a more practical exercise.
  • Your thoughts?
 Thanks for checking this out - more to come!

Monday, January 26, 2015

Spark + Kafka + Cassandra (Part 2 - Spark Kafka [mass] Producer)

Overview

Welcome to part two of the series 'Spark + Kafka + Cassandra'.

Building on top of part one and preparing for part three, here we'll spin up a little application that does only one thing well: spew out random messages for a number of seconds.  Basically, we wanted to have a tool that could be used to pump a large quantity of messages into Kafka to:
  • Test a Kafka setup
  • Tinker with throughput
  • Feed a consumer
  • etc.
Also, this example can easily be altered to spew whatever one might like (CSV, JSON, data from a flat file or HDFS, etc.).  Make it yours - spew your own stuff.

This second instalment will have you:
  • Run ZooKeeper local (as part of Kafka distribution)
  • Run Kafka local
  • Run a local Spark Cluster
  • Create a Kafka Producer wrapped up in a Spark application
  • Submit the application to generate messages sent to a given topic for some number of seconds.

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • ZooKeeper and Kafka running local 
  • git


Set it up...

If you have not walked through Part 1 of this series, go there and walk through it to get ZooKeeper and Kafka up and running on your local machine.  When you've done that, come back here and proceed to the next step.



Clone the Example

Begin by cloning the example project from github - spark-kafka-msg-generator, & cd into the project directory.

[bkarels@rev27 work]$ git clone https://github.com/bradkarels/spark-kafka-msg-generator.git
Cloning into 'spark-kafka-msg-generator'...
remote: Counting objects: 19, done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 19 (delta 0), reused 15 (delta 0)
Unpacking objects: 100% (19/19), done.
[bkarels@rev27 work]$ cd spark-kafka-msg-generator/
[bkarels@rev27 spark-kafka-msg-generator]$


Prepare the Application

In a terminal, from the project root, fire up SBT and assembly the project to create your application jar file.
[bkarels@rev27 spark-kafka-msg-generator]$ sbt
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
[info] Loading project definition from /home/bkarels/work/spark-kafka-msg-generator/project
[info] Updating {file:/home/bkarels/work/spark-kafka-msg-generator/project/}spark-kafka-msg-generator-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to MsgSpewer (in build file:/home/bkarels/work/spark-kafka-msg-generator/)
> assembly
...
[info] SHA-1: cddec14059d6a435847f8dd4b4b6f15f6899c0c3
[info] Packaging /home/bkarels/work/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar ...
[info] Done packaging.
[success] Total time: 53 s, completed Jan 26, 2015 9:10:16 AM

As per normal, take note of where your jar file gets put (the Packaging line in the output above).  Also, note that we have set the resulting jar file name in assembly.sbt - here we have set it to sparkFuProducer.jar.

Spark it up!

If your local Spark cluster is not up and running, do that now.  Go here to see about getting 'r' done.

Make Sparks fly! (i.e. run it)

Having reviewed the code, you have seen that this application accepts one, two, or three arguments.  By default, the application will publish to topic sparkfu for 10 seconds without printing more than the normal output to the terminal.  You can optionally set the following three arguments:

[topic [duration [verbose]]]
Argument Examples:
Publish to topic sparkfu for 20 seconds and do print additional output:
sparkfu 20 true

Publish to topic filthpig for 3600 seconds and do not print additional output:
filthpig 3600

Publish to topic schweinehund for 10 seconds and do not print additional output:
schweinehund
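
To make the defaults concrete, here is a hedged sketch of the argument handling and the produce loop.  Names are assumptions and the Spark wrapper is omitted - see the project's RandomMessages source for the real implementation - but it uses the Kafka 0.8 Scala producer API, roughly as below.

import java.util.Properties
import scala.util.Random
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object RandomMessagesSketch {
  def main(args: Array[String]): Unit = {
    val topic   = if (args.length > 0) args(0) else "sparkfu"       // default topic
    val seconds = if (args.length > 1) args(1).toInt else 10        // default duration
    val verbose = if (args.length > 2) args(2).toBoolean else false // default: quiet

    val props = new Properties()
    props.put("metadata.broker.list", "localhost:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))

    val stop = System.currentTimeMillis() + seconds * 1000L
    var count = 0
    while (System.currentTimeMillis() < stop) {
      // Random 25-character key, and a message of "timestamp,random string".
      val key = Random.alphanumeric.take(25).mkString
      val msg = s"${System.currentTimeMillis()},${Random.alphanumeric.take(25).mkString}"
      producer.send(new KeyedMessage[String, String](topic, key, msg))
      count += 1
      if (verbose && count % 5000 == 0)
        println(s"$count * ${(stop - System.currentTimeMillis()) / 1000} seconds left...")
    }
    producer.close()
    println(s"Produced $count msgs in ${seconds}s.")
  }
}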


As in Part 1, when this application is submitted to the cluster it will have no clear output.  Here again we can use the local consumer.  Do as below, or look back to Part 1 to see about this.
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic sparkfu --from-beginning


OK, spark it up:
$SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.RandomMessages --master local[*] ~/dev/spark-kafka-msg-generator/target/scala-2.10/sparkFuProducer.jar sparkfu 20 true
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
This will run for: 20s
...
15/01/26 09:24:42 INFO SyncProducer: Connected to rev27:9092 for producing
5000 * 18 seconds left...
10000 * 17 seconds left...
...
115000 * 3 seconds left...
120000 * 2 seconds left...
130000 * 1 seconds left...
Produced 134137 msgs in 20s -> 6706.0m/s.
If you were running your local consumer you should see a big ol' pile of alphanumeric randomness stream by.  Just like that you can spew a pretty good chunk of messages to a topic on command. 


What's next?

On to part three where we'll consume messages out of Kafka with Spark Streaming and write them to Cassandra.  See part three here!

Tuesday, January 20, 2015

Spark + Kafka + Cassandra (Part 1 - Spark Kafka Producer)


Overview

Welcome to part one of the series 'Spark + Kafka + Cassandra'.

In this series we will look to build up a Spark, Kafka, Cassandra stack that can be used as the foundation for real projects on real clusters that do real work.

As is typical here at Spark-Fu! we will start with the most basic case and build on that incrementally.  Keeping with that, this first instalment will have you:
  • Run ZooKeeper local (as part of Kafka distribution)
  • Run Kafka local
  • Run a local Spark Cluster
  • Create a Kafka Producer wrapped up in a Spark application
  • Submit the application and consume the messages in a terminal
In subsequent parts of the series we'll look at consuming messages from Kafka using Spark Streaming, using custom encoders, adding Cassandra into the mix, and hopefully top it off with something that will run against a small cluster.

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • ZooKeeper and Kafka running local 
  • git


Get ZooKeeper & Kafka

So, before we even clone the example application, we have to do a bit of work to get ZooKeeper and Kafka up and running on your machine.

At the time of this post, 0.8.2-beta is the latest release. The current stable version is 0.8.1.1.  So let's use 0.8.1.1 - start by downloading it here.  (NOTE: This is the Scala 2.10.x version!)  Other variants are all available here if you have other needs.

Once downloaded, extract the archive - for this example I'm a fan of putting things at:

~/kafka_2.10-0.8.1.1

The rest of the example will assume this is the case.

Run ZooKeeper Local

To be polite, setting up local ZooKeeper is about as fun as giving a bad-tasting pill to a pissed off cat with claws.  Thankfully, the Kafka folks have done us a solid and bundled a mechanism that makes getting up and running with ZooKeeper as simple as:

[bkarels@rev27 kafka_2.10-0.8.1.1]$ bin/zookeeper-server-start.sh config/zookeeper.properties
Yes, there is an infinite number of ways to get ZK up and running - but the above is as far as I'm going here.  Simple and functional.

Run Kafka Local

The next set of steps is a near-exact copy of the Kafka Quickstart.  I have tuned the bits here to fit the example application, so do look at the Kafka docs, but follow the steps below.

Assuming you have Kafka downloaded and extracted as above, let's start things up, create our topic, and get a list of topics.

Start the server:
[bkarels@rev27 kafka_2.10-0.8.1.1]$ bin/kafka-server-start.sh config/server.properties &
Create topic `sparkfu`:
[bkarels@rev27 kafka_2.10-0.8.1.1]$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic sparkfu &
List our topic(s) as an exercise:
[bkarels@rev27 kafka_2.10-0.8.1.1]$ bin/kafka-topics.sh --list --zookeeper localhost:2181
All of the above commands spit out a goodly amount of "stuff" to your console, but a bit of careful examination should have you feeling confident that everything is up and running correctly.   Most importantly, when you listed your topics, you should have seen sparkfu somewhere in that output (it will be on its own line).

To be certain, we can test your sparkfu topic.  Open two terminals and navigate to ~/kafka_2.10-0.8.1.1.  In one terminal start up a consumer to sit and listen for messages on your topic:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic sparkfu --from-beginning
In your other terminal we'll fire up a producer:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sparkfu
Once your producer is started and staring at you, you can enter any string and press [Enter] to produce the message.  Within a moment you should see your message parroted back to you in the terminal where you started the consumer.  If you crave greater detail, review the Kafka documentation.

Kill your producer, but leave your consumer running - you'll need it in just a bit...

Clone the Example

Begin by cloning the example project from github - spark-kafka-producer, & cd into the project directory.

[bkarels@rev27 work]$ git clone https://github.com/bradkarels/spark-kafka-producer.git
Cloning into 'spark-kafka-producer'...
remote: Counting objects: 19, done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 19 (delta 0), reused 15 (delta 0)
Unpacking objects: 100% (19/19), done.
[bkarels@rev27 work]$ cd spark-kafka-producer/
[bkarels@rev27 spark-kafka-producer]$


Prepare the Application

In a terminal, from the project root, fire up SBT and assembly the project to create your application jar file.
[bkarels@rev27 spark-kafka-producer]$ sbt
Picked up _JAVA_OPTIONS: -Xms1G -Xmx2G -XX:PermSize=512m -XX:MaxPermSize=1G
[info] Loading project definition from /home/bkarels/work/spark-kafka-producer/project
[info] Updating {file:/home/bkarels/work/spark-kafka-producer/project/}spark-kafka-producer-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Set current project to Spark Fu Kafka Producer (in build file:/home/bkarels/work/spark-kafka-producer/)
> assembly
...
[info] SHA-1: 69db758e5dd205ae60875f00388cd5c935955773
[info] Packaging /home/bkarels/work/spark-kafka-producer/target/scala-2.10/sparkFuProducer.jar ...
[info] Done packaging.
[success] Total time: 11 s, completed Jan 20, 2015 12:08:44 PM
>
As per normal, take note of where your jar file gets put (the Packaging line in the output above).  Also, note that we have set the resulting jar file name in assembly.sbt - here we have set it to sparkFuProducer.jar.

Spark it up!

If your local Spark cluster is not up and running, do that now.  It's not!?  Mon Dieu!  Go here to see about remedying that.

Make Sparks fly! (i.e. run it)

Assuming you have poked around in the code, you will have seen that the Messenger within MillaJovovich.scala (a few will see my humor...) will produce four messages for the topic sparkfu and send them.  So, when this application is submitted to the cluster it will have no clear output like our previous examples.  This is where the consumer you have running comes in.  If you stopped it, no worries - just restart it as you did above.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic sparkfu --from-beginning
Truth be told you could start it up after the fact since we are telling it to read all messages from the start of time.  But isn't it more fun to see messages come in as your job is sending them?  Yes, it is.
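
If you would rather not dig through the source just yet, here is roughly what the Messenger does, minus the Spark plumbing.  This is a sketch: the four messages are the ones you will see in the consumer output below, but the object name and producer configuration are assumptions - MillaJovovich.scala has the real code.

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

object MessengerSketch {
  def main(args: Array[String]): Unit = {
    // Kafka 0.8 Scala producer pointed at the local broker.
    val props = new Properties()
    props.put("metadata.broker.list", "localhost:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))

    // Send the four messages to the sparkfu topic and close the producer.
    val msgs = Seq("What the foo?", "What the bar?", "What the baz?", "No! No! NO! What the FU!")
    msgs.foreach(m => producer.send(new KeyedMessage[String, String]("sparkfu", m)))
    producer.close()
  }
}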

OK, spark it up:
[bkarels@rev27 kafka_2.10-0.8.1.1]$ $SPARK_HOME/bin/spark-submit --class com.bradkarels.simple.Messenger --master spark://127.0.0.1:7077 /home/bkarels/dev/spark-kafka-producer/target/scala-2.10/sparkFuProducer.jar
If all has gone well you should see output like the following in your consumer as the application runs:
[2015-01-20 12:28:09,547] INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor)
What the foo?
What the bar?
What the baz?
No! No! NO! What the FU!
[2015-01-20 12:28:10,071] INFO Closing socket connection to /192.168.254.153. (kafka.network.Processor)


That's it.  Well, that's it as far as scraping ever so lightly at the surface of this topic.  In the following parts of this series we will dig a bit deeper.

What's next?

As I alluded to in the overview the next parts in this series will provide examples for:
  • Rudimentary Kafka consumer in Spark
  • Custom Kafka Message encoder
  • Consuming Kafka messages using Spark Streaming
  • A larger example pulling all these bits together
 Thanks for following along - stay tuned!

Wednesday, January 7, 2015

Happy New Year...vacation is over bitches...

With the new year a week old, the Spark-Fu slowdown must come to an end.  Took some much needed time off over the holiday to recharge.  But that's all over with now and I'll be ramping back up with more Apache Spark on Cassandra examples using Scala and Python in the coming days and weeks.

Stay tuned, all.

Happy New Year from Spark-Fu!

Tuesday, December 16, 2014

Simple Python application on Apache Spark Cluster

Overview


As an exercise, I am working on duplicating my previous examples in Python.  It is clear that Python has, and is gaining, traction in the data world.  So, it makes sense to have a working knowledge of it.

As with my other examples, everything will find its way to my GitHub repositories - forking and enhancements welcome.

This effort is about the most simplistic Python-submit-to-Spark-cluster example possible.  But when you move beyond the REPL, you have to start somewhere, right?

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • A Spark cluster (how to do it here.)
  • git


Clone the Example

Begin by cloning the example project from github - super-simple-spark-python-app and cd into the project directory.


[bkarels@ahimsa work]$ git clone git@github.com:bradkarels/super-simple-spark-python-app.git
Initialized empty Git repository in /home/bkarels/work/super-simple-spark-python-app/.git/
remote: Counting objects: 13, done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 6 (delta 1)
Receiving objects: 100% (13/13), done.
Resolving deltas: 100% (3/3), done.
[bkarels@ahimsa work]$ cd super-simple-spark-python-app/


Move the file tenzingyatso.txt to your home directory.
[bkarels@ahimsa super-simple-spark-python-app]$ mv tenzingyatso.txt ~

Modify the path to the sample file in simple.py (then save and close).

file = sc.textFile("/home/bkarels/tenzingyatso.txt")
becomes...
file = sc.textFile("/home/yourUserNameHere/tenzingyatso.txt")
...or some such similar thing.

Spark it up (with python)!

If your local Spark cluster is not up and running, do that now.  If you need to review how to go about that, you can look here.

Make Sparks fly! (i.e. run it)


Since this example does not have a packaged application (e.g. jar, egg, etc.), we can invoke spark-submit with just our simple python file.


[bkarels@ahimsa super-simple-spark-python-app]$ $SPARK_HOME/bin/spark-submit --master spark://127.0.0.1:7077 ./simple.py
Your expected output to the console should be a line count of 7 wrapped in a nice battery of asterisks and the copy from the first line of the example file.  If you see that - this has worked.

Tuesday, December 9, 2014

Writing to Cassandra 2.0.x with Spark 1.1.1 - moving beyond Tuple2

UPDATE! (2014-12-12): Wide tuple write example added.  I won't lie, it is not elegant - but it is worth knowing the basic concepts of how to work with tuples bigger than Tuple2.  Pull the updated project code to check it out.

Overview

The most rudimentary mechanisms for writing Tuple2 to Cassandra are pretty well beaten to a pulp.  In this post I hope to shine light on a few additional, more complex ways to write data to Cassandra, as not everything we do with a Spark/Cassandra stack will be reduceByKey on key/value pairs.

In the first example in this series we will fetch some data from Cassandra directly into a case class.  Then we will transform that RDD into a more compact version of itself and write the resulting collection of case classes back to a different table in our Cassandra cluster.

We will be sticking with our 'human' example from previous posts; however, I have tuned the schema a bit.  To keep your life simple, I recommend executing the updated CQL, which will create a new keyspace and tables for this example.

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • Scala 2.10.4
  • SBT 0.13.x
  • A Spark cluster (how to do it here.)
  • A Cassandra cluster (how to do it here.)
  • git

Clone the Example

Begin by cloning the example project from github - sbt-spark-cassandra-writing, & cd into the project directory.

[bkarels@ahimsa work]$ git clone https://github.com/bradkarels/sbt-spark-cassandra-writing.git simple-writing
Initialized empty Git repository in /home/bkarels/work/simple-writing/.git/
remote: Counting objects: 21, done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 21 (delta 0), reused 17 (delta 0)
Unpacking objects: 100% (21/21), done.
[bkarels@ahimsa work]$ cd simple-writing/
[bkarels@ahimsa simple-writing]$

Prepare the Data

At the root of the project you will see two CQL files: sparkfu.cql and populateHumans.cql.  You will need to execute these two files against your local Cassandra instance from Datastax DevCenter or cqlsh (or some other tool) to set things up.

Begin by executing sparkfu.cql to create your keyspace and tables:

CREATE KEYSPACE sparkfu WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};

CREATE TABLE sparkfu.human (
    id TIMEUUID,
    firstname TEXT,
    lastname TEXT,
    gender TEXT,
    address0 TEXT,
    address1 TEXT,
    city TEXT,
    stateprov TEXT,
    zippostal TEXT,
    country TEXT,
    phone TEXT,
    isgoodperson BOOLEAN,
    PRIMARY KEY(id)
);

// Letting Cassandra use default names for the indexes for ease.
CREATE INDEX ON sparkfu.human ( isgoodperson ); // We want to be able to find good people quickly

CREATE INDEX ON sparkfu.human ( stateprov ); // Maybe we need good people by state?

CREATE INDEX ON sparkfu.human ( firstname ); // Good people tend to be named "Brad" - let's find them fast too!

// Clearly this is a horrible model you would never ever use in production, but this is just a simple example.
CREATE TABLE sparkfu.goodhuman (
    firstname TEXT,
    lastname TEXT,
    PRIMARY KEY(firstname,lastname)
);
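
For reference, the human table above maps naturally onto a Scala case class when read with the spark-cassandra-connector.  This is only a sketch - the field names simply mirror the columns, and I have given the nullable columns Option types; the case class in the project source may differ.

import java.util.UUID

// Assumed mapping of sparkfu.human; TIMEUUID becomes java.util.UUID, nullable text becomes Option[String].
case class Human(
  id:           UUID,
  firstname:    String,
  lastname:     String,
  gender:       String,
  address0:     String,
  address1:     Option[String],
  city:         String,
  stateprov:    String,
  zippostal:    String,
  country:      String,
  phone:        Option[String],
  isgoodperson: Boolean
)
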
Next load up your table with the sample data by executing populateHumans.cql

INSERT INTO sparkfu.human (id, firstname, lastname, gender, address0, address1,city, stateprov, zippostal, country, phone, isgoodperson)
  VALUES (now(),'Pete', 'Jones', 'm', '555 Astor Lane',null,'Minneapolis','MN','55401','USA','6125551212',True);
...
[Some CQL removed for brevity - file in project source. ]
...
INSERT INTO sparkfu.human (id, firstname, lastname, gender, address0, address1,city, stateprov, zippostal, country, phone, isgoodperson)
  VALUES (now(),'Brad', 'Karels', 'm', '123 Nice Guy Blvd.',null,'Minneapolis','MN','55402','USA','6125551212',True); 
INSERT INTO sparkfu.human (id, firstname, lastname, gender, address0, address1,city, stateprov, zippostal, country, phone, isgoodperson)
  VALUES (now(),'Alysia', 'Yeoh', 't', '1 Bat Girl Way',null,'Metropolis','YX','55666','USA','3215551212',True);
Errors?  No.  Good - we are set to proceed.

Prepare the Application

In a terminal, from the project root, fire up SBT and assembly the project to create your application jar file.
[bkarels@ahimsa simple-writing]$ sbt
[info] Loading project definition from /home/bkarels/work/simple-writing/project
[info] Updating {file:/home/bkarels/work/simple-writing/project/}simple-writing-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...                                    
[info] Done updating.                                                                  
[info] Set current project to Sandy Author (in build file:/home/bkarels/work/simple-writing/)
> assembly                                                                                  
[info] Updating {file:/home/bkarels/work/simple-writing/}simple-writing...                  
[info] Resolving org.fusesource.jansi#jansi;1.4 ...                                         
[info] Done updating.                                                                       
[info] Compiling 1 Scala source to /home/bkarels/work/simple-writing/target/scala-2.10/classes...
...
[ a whole bunch of assembly output will be here... ]
...
[info] SHA-1: 0202b523259e5688311e4b2bcb16c63ade4b7067
[info] Packaging /home/bkarels/work/simple-writing/target/scala-2.10/SandyAuthor.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed Dec 10, 2014 4:41:29 PM
>
As per normal, take note of where your jar file gets put (the Packaging line in the output above).  Also, note that we have set the resulting jar file name in assembly.sbt - here we have set it to SandyAuthor.jar (you can only say Cassandra and Writer so many times - having some fun with jar names...).

Spark it up!

If your local Spark cluster is not up and running, do that now.  If you need to review how to go about that, you can look here.

Make Sparks fly! (i.e. run it)

In keeping with our theme of finding good people, we'll do it again.  But this time, as you will see in the project source, we will fetch only good people in the first place, transform them into a simplified RDD, and store them off safely separated from all the not good people.
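
Here is a minimal sketch of that fetch, transform, and write flow.  The class and object names are assumptions for illustration (the project's com.sparkfu.simple.Writer does more, including the wide-tuple example), and the good-person filter is done in Spark here; a where clause against the isgoodperson index would push it down to Cassandra instead.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Trimmed read shape and the compact shape we write back (assumed names).
case class Person(firstname: String, lastname: String, isgoodperson: Boolean)
case class GoodHuman(firstname: String, lastname: String)

object WriterSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SandyAuthor")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Pull just the columns we care about, straight into the Person case class...
    val people = sc.cassandraTable[Person]("sparkfu", "human")
      .select("firstname", "lastname", "isgoodperson")

    // ...keep only the good people and compact them down...
    val good = people.filter(_.isgoodperson).map(p => GoodHuman(p.firstname, p.lastname))

    // ...then write the compact case classes back to the goodhuman table and echo them.
    good.saveToCassandra("sparkfu", "goodhuman")
    good.collect().foreach(g => println(s"${g.firstname} ${g.lastname} is a good, simple person."))

    sc.stop()
  }
}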

From your terminal, using the location of SandyAuthor.jar from above, submit the application to your Spark cluster:

[bkarels@ahimsa simple-writing]$ $SPARK_HOME/bin/spark-submit --class com.sparkfu.simple.Writer --master spark://127.0.0.1:7077 /home/bkarels/dev/simple-writing/target/scala-2.10/SandyAuthor.jar
...
[ lots of Spark output dumped to terminal here...]
...
14/12/10 16:48:51 INFO spark.SparkContext: Job finished: toArray at SimpleWriting.scala:34, took 0.641733148 s
Alysia Yeoh is a good, simple person.
Edward Snowden is a good, simple person.
Fatima Nagossa is a good, simple person.
Pete Jones is a good, simple person.
Brad Karels is a good, simple person.
Mother Theresa is a good, simple person.
Hiro Ryoshi is a good, simple person.
Neil Harris is a good, simple person.
B Real is a good, simple person.
...
[and then the wide tuple example output]
...

Alysia Yeoh is transgender and can be reached at 3215551212.                                                                                                                                                                                
Neil Harris is male and can be reached at 9045551212.                                                                                                                                                                                       
Mother Theresa is female and can be reached at null.                                                                                                                                                                                        
Hiro Ryoshi is female and can be reached at 7155551212.                                                                                                                                                                                     
Brad Karels is male and can be reached at 6125551212.                                                                                                                                                                                       
Pete Jones is male and can be reached at 6125551212.                                                                                                                                                                                        
Edward Snowden is male and can be reached at null.                                                                                                                                                                                          
Fatima Nagossa is female and can be reached at 7895551212.                                                                                                                                                                                  
B Real is male and can be reached at 9995551212.

[bkarels@ahimsa simple-writing]$
If all has gone well you should see output like the above.  Please note, we are also TRUNCATING the goodperson and personfromtuple tables just before the program exits, so if you look in Cassandra those tables will be empty.  Feel free to comment out those lines and validate directly in Cassandra.
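
For the curious, here is a hedged sketch of the two behaviors just mentioned: a write of a tuple wider than Tuple2, and truncation of the output tables via the connector's CassandraConnector.  The personfromtuple column names and the gender/phone handling are assumptions (the real code also translates the single-character gender codes into words, as the output above shows), so check the project's CQL and source for the specifics.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

object WideTupleSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SandyAuthorWide")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // A tuple wider than Tuple2 saves the same way - just line the fields up with SomeColumns.
    val wide = sc.cassandraTable("sparkfu", "human")
      .select("firstname", "lastname", "gender", "phone")
      .map(r => (r.getString("firstname"), r.getString("lastname"),
                 r.getString("gender"), r.getStringOption("phone").orNull))
    wide.saveToCassandra("sparkfu", "personfromtuple",
      SomeColumns("firstname", "lastname", "gender", "phone"))

    // Truncate the output tables before exiting, roughly as the example does (table names assumed).
    CassandraConnector(sc.getConf).withSessionDo { session =>
      session.execute("TRUNCATE sparkfu.goodhuman")
      session.execute("TRUNCATE sparkfu.personfromtuple")
    }
    sc.stop()
  }
}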

Then play around with it.  Extend the case classes, add columns to the schema, add more transformations, etc.  These simple examples will serve you best if you extend them and make them your own.

What's next?

There is likely good cause to start playing with creating and altering table structures dynamically.