Tuesday, November 25, 2014

Running a local Apache Spark cluster

We can't all have a dedicated cluster to play with, and even if we do, having complete control of a disposable environment has its advantages.  Here we will examine the simplest path to setting up a local cluster on your machine.  Remember, one of the great powers of Spark is that the same code you run on your underpowered single machine on a tiny dataset will run on hundreds of nodes and petabytes of data.  So let's make your laptop useful for development and prototyping, shall we?

For this example I am running a CentOS 6.5 virtual machine (VMware) set to use a single processor with four cores and ~12 GB of RAM.  So we'll give one core to our Spark master and one core each to two worker nodes, each with 1 GB of memory.  Clearly, you could do more if you have more cores and more memory, but we're not looking to break processing speed records - just to move your Spark knowledge to the next level.

(How fun would it be to do this on a stack of Raspberry Pis?)

The latest Apache Spark release, as of this writing, is 1.1.0, and that is what we will use.

You probably have a binary Spark distribution downloaded to your machine, but if you do not, do that now.  Apache Spark can be downloaded here.  Once downloaded, extract it to a local directory - mine is at:


(this will become $SPARK_HOME)

The Spark developers have done you a huge favour and have added a set of scripts at SPARK_HOME/sbin/ to do most of what we want to accomplish (get a local Spark cluster running on a single machine).  Again, in the spirit of getting you up and running without exploring every possibility, here is what you need to do.

You can stand up a Spark master and workers independently using SPARK_HOME/sbin/start-master.sh and SPARK_HOME/sbin/start-slaves.sh together.  But we're going to add a couple of environment variables and let SPARK_HOME/sbin/start-all.sh stand up a master and two workers in a single step.

The official Apache Spark documentation for this can be found here if you want to dig deeper.  But for now use your favorite editor to pop open ~/.bashrc.

Add the following items (mind where your SPARK_HOME actually is):

# Spark local environment variables
export SPARK_HOME=/home/bkarels/spark_1.1.0
Here, these are explicitly set with the defaults.  This is just to illustrate that you could easily customize these values.  The remaining elements are where we get what we are looking for in our local cluster.

By default, start-all.sh fires up a single worker that uses all available cores and all of your available memory minus 1 GB.  (E.g., on a four-core machine with 12 GB of RAM the worker would use all four cores and 11 GB of memory.)  If that works for you, great!  If not, let's tune things a bit.
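If you are curious what that default would come to on your own machine, here is a rough sketch.  It assumes a Linux box (it reads nproc and /proc/meminfo) and simply applies the "all cores, total memory minus 1 GB" rule described above; the function name default_worker_mem_mb is my own, not something Spark ships.

```shell
#!/bin/sh
# Sketch only: estimate what a default single worker would claim.
# Assumption: default worker memory = total RAM minus 1 GB (1024 MB).

# Hypothetical helper: takes total memory in MB, prints the default worker memory in MB.
default_worker_mem_mb() {
    echo $(( $1 - 1024 ))
}

# Read this machine's totals (Linux only).
total_cores=$(nproc)
total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)

echo "Default worker: ${total_cores} cores, $(default_worker_mem_mb "$total_mb") MB"
```

For the example machine, default_worker_mem_mb 12288 prints 11264, which is the 11 GB mentioned above.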

Defaults to 1.  If you set this to anything greater than one, be sure that the value multiplied by the value of SPARK_WORKER_CORES is less than or equal to the total number of cores available on your machine.

Also, if you set it to a value greater than one, verify that the value multiplied by the value of SPARK_WORKER_MEMORY is less than your total system memory.
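The two checks above are easy to get wrong in your head, so here is a small sketch that does the arithmetic.  The function name fits_machine and its argument order are made up for illustration; it takes the number of workers, cores per worker, memory per worker in GB, and then the machine's total cores and total memory in GB.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark):
# fits_machine WORKERS CORES_PER_WORKER MEM_GB_PER_WORKER TOTAL_CORES TOTAL_MEM_GB
# Returns 0 if the layout fits the machine, 1 otherwise.
fits_machine() {
    # workers x cores must be less than or equal to the machine's cores
    [ $(( $1 * $2 )) -le "$4" ] || return 1
    # workers x memory must be strictly less than total system memory
    [ $(( $1 * $3 )) -lt "$5" ] || return 1
    return 0
}

# The example machine: four cores, 12 GB.
if fits_machine 2 1 1 4 12; then
    echo "2 workers x 1 core x 1 GB fits"
fi
if ! fits_machine 4 2 1 4 12; then
    echo "4 workers x 2 cores x 1 GB does not fit (needs 8 cores)"
fi
```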

On my example machine with four cores and 12 GB of memory we could do:
(workers x cores x memory in GB)
...and other combos that do not max things out:
2x1x1 (our example)

Hopefully these are self-explanatory - so I'll let the Apache docs stand.  Just be mindful of the notes above.

It is worth noting that for development and prototyping a small worker memory should be sufficient.  (E.g., a worker memory of 256 MB would be more than enough if your sample data set was a 64 MB log file or 150 MB of emails.)  Remember, failing fast in development is a good thing!

By default this value is 512m.  I have turned it down here just to illustrate that it can be tuned.  So if you are getting tight on resources running your IDE, Cassandra, Spark, and so on, you could turn this down a bit.  (This and memory per worker perhaps.)
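Taken together, the additions to ~/.bashrc for the two-worker example might look something like the sketch below.  The variable names are the standard Spark standalone environment variables, but the values are assumptions pieced together from the description above: localhost for the master, Spark's default ports, two workers with one core and 1g each.  The daemon memory line in particular is a guess, both at which setting is being turned down and at the 256m value.

```shell
# Sketch of the remaining local cluster settings (values are assumptions).

# Master settings, set explicitly for illustration.
export SPARK_MASTER_IP=localhost      # assumed; use whatever address your master should bind to
export SPARK_MASTER_PORT=7077         # Spark's default master port
export SPARK_MASTER_WEBUI_PORT=8080   # Spark's default master web UI port

# Worker settings: two workers, one core and 1g of memory each.
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g

# Memory for the master and worker daemons themselves.
export SPARK_DAEMON_MEMORY=256m       # hypothetical value, turned down from 512m
```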

Enough with the details, let's run this thing...

First, don't forget to source your .bashrc file:
[bkarels@ahimsa ~]$ . .bashrc

Navigate to SPARK_HOME/sbin:
[bkarels@ahimsa ~]$ cd $SPARK_HOME/sbin
[bkarels@ahimsa sbin]$

Run start-all.sh, entering your password when prompted:
[bkarels@ahimsa sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.master.Master-1-ahimsa.out                                                     
bkarels@localhost's password:                                                                              
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.worker.Worker-1-ahimsa.out

Looks good - let's verify by checking out the master's web UI at localhost:8080.
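If you are working over SSH or just prefer the terminal to a browser, a quick check along these lines should tell you whether the web UI is answering.  This is only a sketch: master_ui_url is a made-up helper, and it assumes curl is installed and that the master is on localhost:8080 as above.

```shell
#!/bin/sh
# Hypothetical helper: build the master web UI URL.
# Usage: master_ui_url [HOST] [PORT]   (defaults: localhost, 8080)
master_ui_url() {
    echo "http://${1:-localhost}:${2:-8080}"
}

url=$(master_ui_url)

# -s hides progress output, -f treats HTTP errors as failures,
# --max-time keeps the check from hanging if nothing is listening.
if curl -sf --max-time 5 -o /dev/null "$url"; then
    echo "Master web UI is answering at $url"
else
    echo "No answer from $url - is the master running?"
fi
```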

Voilà!  Your completely customizable, local, disposable Spark cluster is up and running and ready to accept jobs.  (We'll get to that bit soon!)

Lastly - shutting it down.  Here again the Spark engineers have done the work; you need only call stop-all.sh.

[bkarels@ahimsa sbin]$ ./stop-all.sh
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[bkarels@ahimsa sbin]$
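If you want to be sure nothing was left behind, you can look for leftover Spark daemon JVMs.  The sketch below assumes pgrep is available and that the master and worker show up with the same class names seen in the log output above; spark_daemon_count is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: count Spark standalone daemons owned by the current user.
spark_daemon_count() {
    # pgrep -f matches against the full command line of each process;
    # wc -l counts the matching PIDs, tr strips any padding from wc.
    pgrep -u "$(id -un)" -f 'org\.apache\.spark\.deploy\.(master|worker)\.' | wc -l | tr -d ' '
}

count=$(spark_daemon_count)
if [ "$count" -eq 0 ]; then
    echo "No Spark master or worker processes left"
else
    echo "$count Spark daemon process(es) still running"
fi
```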

