Tuesday, November 25, 2014

Using spark-submit to send an application to a local Spark cluster

In my last post (Running a local Apache Spark Cluster)
I went over how to spin up a local Spark cluster for development and prototyping.  Now it is time to build the most basic Spark application to submit to your local cluster.  While this application is heavily based on an existing example, we will tweak a couple of bits to make it just slightly more interesting.

What you should expect:
  1. Pull down and quickly modify the source.
  2. Package the application into a jar file.
  3. Submit the application using spark-submit to your locally running cluster (or any cluster where the sample file exists on all nodes).
  4. View the expected results in your terminal.

The ready-to-consume application can be found at:
https://github.com/bradkarels/super-simple-spark-app

See the README.md file for directions on how to modify the application to run in your environment.

You will need to have Java, Scala, and SBT installed locally.

(From the README.md file)

Step 1:
Move the file tenzingyatso.txt to a known location on your file system (E.g. /tmp/tenzingyatso.txt)

Step 2:
Modify SuperSimple.scala so the path to tenzingyatso.txt is correct for your system.

From:
val compassionFile = "/home/bkarels/tenzingyatso.txt"

To:
val compassionFile = "/tmp/tenzingyatso.txt"

Step 3:
From the root of this project, run package from within SBT:

$sbt
...
> package
...
*** Take note of where the application jar is written ***
[info] Packaging /home/bkarels/dev/super-simple-spark-app/target/scala-2.10/super-simple-spark-app_2.10-0.1.jar ...
[info] Done packaging.

> exit
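
If you are assembling your own project instead of cloning the repo, a minimal build.sbt would look roughly like the sketch below (the Scala patch version and the "provided" scope are assumptions; check the repository's build.sbt for the real thing):

name := "super-simple-spark-app"

version := "0.1"

scalaVersion := "2.10.4"

// Marking Spark as "provided" keeps it out of the packaged jar, since spark-submit supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"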

Step 4:
Since this has been designed to run against a local cluster, navigate to your $SPARK_HOME and use spark-submit to send the application to your cluster:

(example)
[bkarels@ahimsa spark_1.1.0]$ ./bin/spark-submit --class com.bradkarels.spark.simple.SuperSimple --master spark://127.0.0.1:7077 /home/bkarels/dev/super-simple-spark-app/target/scala-2.10/super-simple-spark-app_2.10-0.1.jar
...
Talks of peace: 3
Speaks of love: 2
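
If you would rather poke at the data interactively, you can do roughly the same thing from the Spark shell pointed at the same master (./bin/spark-shell --master spark://127.0.0.1:7077).  The lines below are a sketch using the same assumed filter terms:

val lines = sc.textFile("/tmp/tenzingyatso.txt").cache()
lines.filter(_.contains("peace")).count()
lines.filter(_.contains("love")).count()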

FIN

Running a local Apache Spark cluster

We can't all have a dedicated cluster to play with, and even if we do, having complete control of a disposable environment has its advantages.  Here we will examine the simplest path to setting up a local cluster on your machine.  Remember, one of the great powers of Spark is that the same code you run on your underpowered single machine on a tiny dataset will run on hundreds of nodes and petabytes of data.  So let's make your laptop useful for development and prototyping, shall we?

For this example I am running a CentOS 6.5 virtual machine (VMWare) set to use a single processor with four cores and ~12GB of RAM.  So we'll give one core to our Spark master and one core each to two worker nodes, each with 1GB of memory.  Clearly, you could do more if you have more cores and more memory, but we're not looking to break processing speed records - just to move your Spark knowledge to the next level.

(How fun would it be to do this on a stack of Raspberry Pis?)

The current Apache Spark release, as of this writing, is 1.1.0, and that is what will be used here.

You probably already have a binary Spark distribution downloaded to your machine, but if you do not, do that now.  Apache Spark can be downloaded here.  Once downloaded, extract it to a local directory; mine is at:

/home/bkarels/spark_1.1.0

(this will become $SPARK_HOME)

The Spark developers have done you a huge favour and have added a set of scripts at SPARK_HOME/sbin/ to do most of what we want to accomplish (get a local Spark cluster on a single machine).  Again, in the spirit of getting you up and running without exploring every possibility, here is what you need to do.

You can tip up a Spark master and workers separately using SPARK_HOME/sbin/start-master.sh and SPARK_HOME/sbin/start-slaves.sh.  But we're going to add a couple of environment variables and let SPARK_HOME/sbin/start-all.sh tip up a master and two workers in a single step.

The official Apache Spark documentation for this can be found here if you want to dig deeper.  But for now use your favorite editor to pop open ~/.bashrc.

Add the following items (mind where your SPARK_HOME actually is):

# Spark local environment variables
export SPARK_HOME=/home/bkarels/spark_1.1.0
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_LOCAL_DIRS=$SPARK_HOME/work
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1G
export SPARK_WORKER_INSTANCES=2
export SPARK_DAEMON_MEMORY=384m

SPARK_MASTER_IP, SPARK_MASTER_PORT, & SPARK_MASTER_WEBUI_PORT
Here, these are explicitly set to the defaults.  This is just to illustrate that you could easily customize these values.  The remaining variables are where we get what we are looking for in our local cluster.
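
Before moving on to those, note that together SPARK_MASTER_IP, SPARK_MASTER_PORT, and SPARK_MASTER_WEBUI_PORT give you the master URL (spark://127.0.0.1:7077) and web UI address (localhost:8080) used later in this post.  As a sketch, an application could point straight at this master from a SparkConf (with spark-submit you would normally pass --master on the command line instead):

val conf = new org.apache.spark.SparkConf()
  .setAppName("local-cluster-test")     // hypothetical application name
  .setMaster("spark://127.0.0.1:7077")  // SPARK_MASTER_IP:SPARK_MASTER_PORT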

By default, start-all.sh fires up a single worker that uses all available cores and all of your memory minus 1GB.  (E.g. on a four-core machine with 12GB of RAM, the worker would use all four cores and 11GB of memory.)  If that works for you, great!  If not, let's tune things a bit.

SPARK_WORKER_INSTANCES
Defaults to 1.  If you set this to anything greater than one, be sure that the value multiplied by SPARK_WORKER_CORES is less than or equal to the total number of cores available on your machine.

Likewise, if you set it to a value greater than one, verify that the value multiplied by SPARK_WORKER_MEMORY is less than your total system memory.

On my example machine with four cores and 12GB of memory we could do:
(workers x cores x memory in GB)
4x1x3
2x2x6
1x4x12
...and other combos that do not max things out:
3x1x1
2x1x2
2x1x1 (our example)

SPARK_WORKER_CORES & SPARK_WORKER_MEMORY
Hopefully these are self-explanatory, so I'll let the Apache docs stand.  Just be mindful of the notes above.

It is worth noting that for development and prototyping a small worker memory should be sufficient.  (E.g. a worker memory of 256MB would be more than enough if your sample data set is a 64MB log file or 150MB of emails.)  Remember, failing fast in development is a good thing!

SPARK_DAEMON_MEMORY
By default this value is 512m.  I have turned it down here just to illustrate that it can be tuned.  So if you are getting tight on resources running your IDE, Cassandra, Spark, etc., you could turn this down a bit.  (This and the memory per worker, perhaps.)

Enough with the details, let's run this thing...

First, don't forget to source your .bashrc file:
[bkarels@ahimsa ~]$ . .bashrc

Navigate to SPARK_HOME/sbin:
[bkarels@ahimsa ~]$ cd $SPARK_HOME/sbin
[bkarels@ahimsa sbin]$

Run start-all.sh, entering your password when prompted:
[bkarels@ahimsa sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.master.Master-1-ahimsa.out                                                     
bkarels@localhost's password:                                                                              
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.worker.Worker-1-ahimsa.out

Looks good - let's verify by checking out the master's web UI at localhost:8080.


Voilà!  Your completely customizable, local, disposable Spark cluster is up and running and ready to accept jobs.  (We'll get to that bit soon!)

Lastly - shutting it down.  Here again the Spark engineers have done the work; you need only call stop-all.sh.

[bkarels@ahimsa sbin]$ ./stop-all.sh
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[bkarels@ahimsa sbin]$

FIN

Friday, November 21, 2014

Get your Cassandra on with ccm (Cassandra Cluster Manager)

UPDATE! (2014-12-9)  After a small bit of experimentation, it seems that running the Spark-Fu examples against Cassandra clusters spun up using CCM may be the cause of the performance issues I have been experiencing.  Using a single-node Cassandra "cluster" from the tarball has given the kind of performance I would expect for these simple examples on a laptop.  I followed the great tutorial from DataStax Academy to set this up.  You will need to sign up for the Academy, but it's free and has a lot of great content, so I have no issue recommending it.  That said, using CCM to experiment with Cassandra clusters locally seems to be otherwise wonderful and stable.

To do things with Apache Spark on Cassandra, we first need to have Cassandra.  The best/fastest way (that I know of) to get a Cassandra cluster running locally to prototype with is to use CCM.

What is CCM? 

CCM is the Cassandra Cluster Manager.

What does CCM do?

CCM creates multi-node clusters for development and testing on a local machine.  It is not intended for production use.

How do I get started with all this CCM voodoo?

I am running all of this on CentOS 6.5; directions will vary for other environments.  We also assume you have Java 7 or higher installed.

Step 1: Download & install the epel packages:

See -> https://fedoraproject.org/wiki/EPEL

Download the package (e.g. for CentOS 6.x):

http://mirror.metrocast.net/fedora/epel/6/i386/epel-release-6-8.noarch.rpm

Install:
[bkarels@ahimsa ~]$ sudo rpm -Uvh epel-release-6-8.noarch.rpm

Step 2: Install python-pip:

[bkarels@ahimsa ~]$ sudo yum -y install python-pip

[bkarels@ahimsa ~]$ pip install cql PyYAML

Step 3: Install Apache Ant (CCM depends on ant)

See ant.apache.org for install instructions.

Step 4: Install CCM (Cassandra Cluster Manager):

[bkarels@ahimsa ~]$ git clone https://github.com/pcmanus/ccm.git
[bkarels@ahimsa ~]$ cd ccm/
[bkarels@ahimsa ~]$ sudo ./setup.py install

Step 5 (Optional): Get Help

To get help (this is really a great way to dig into ccm):
[bkarels@ahimsa ~]$  ccm -help
[bkarels@ahimsa ~]$  ccm [command] -help

Step 6: Do some stuff

CCM has two primary types of operations:
  1. Cluster commands
  2. Node commands
Cluster commands take the form:
$ ccm [cluster command] [options]

Node commands take the form:
$ ccm [node name] [node command] [options]

So, let's spin up a three-node local cluster real quick like:

[bkarels@ahimsa ~]$ ccm create cluster0 -v 2.0.11
Downloading http://archive.apache.org/dist/cassandra/2.0.11/apache-cassandra-2.0.11-src.tar.gz to /tmp/ccm-bwFLa4.tar.gz (10.836MB)
  11362079  [100.00%]
Extracting /tmp/ccm-bwFLa4.tar.gz as version 2.0.11 ...
Compiling Cassandra 2.0.11 ...
Current cluster is now: cluster0
[bkarels@ahimsa ~]$ ccm list
 *cluster0
[bkarels@ahimsa ~]$ ccm populate --nodes 3
[bkarels@ahimsa ~]$ ccm start
[bkarels@ahimsa ~]$ ccm status
Cluster: 'cluster0'
-------------------
node1: UP
node3: UP
node2: UP
[bkarels@ahimsa ~]$ ccm node2 stop
[bkarels@ahimsa ~]$ ccm status
Cluster: 'cluster0'
-------------------
node1: UP
node3: UP
node2: DOWN
[bkarels@ahimsa ~]$

And just like that, you have a three-node cluster on your local machine that you can start to play with.  Of course, this set of instructions barely scratches the surface of what is possible, but our focus is Spark, so this is just to give us something we can read from and write to.
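
To give a flavour of where this is headed, here is a rough sketch of reading from this local cluster with Spark and the DataStax spark-cassandra-connector.  The keyspace and table names are hypothetical, the connector jar must be on your classpath, and none of this is a tested recipe - it is orientation only:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("cassandra-sketch")
      .setMaster("spark://127.0.0.1:7077")
      .set("spark.cassandra.connection.host", "127.0.0.1") // ccm node1 listens on 127.0.0.1 by default
    val sc = new SparkContext(conf)
    // "demo" and "users" are placeholders for a keyspace and table you have created.
    val rows = sc.cassandraTable("demo", "users")
    println("Rows found: " + rows.count())
    sc.stop()
  }
}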

FIN