Tuesday, November 25, 2014

Running a local Apache Spark cluster

We can't all have a dedicated cluster to play with, and even if we do, having complete control of a disposable environment has its advantages.  Here we will examine the simplest path to setting up a local cluster on your machine.  Remember, one of the great powers of Spark is that the same code you run on a tiny dataset on your underpowered single machine will run on hundreds of nodes and petabytes of data.  So let's make your laptop useful for development and prototyping, shall we...

For this example I am running a CentOS 6.5 virtual machine (VMware) set to use a single processor with four cores and ~12 GB of RAM.  We'll give one core to our Spark master and one core each to two worker nodes, each with 1 GB of memory.  Clearly, you could do more if you have more cores and more memory, but we're not looking to break processing speed records - just to move your Spark knowledge to the next level.

(How fun would it be to do this on a stack of Raspberry Pis?)

The latest Apache Spark release, as of this writing, is 1.1.0, and that is what will be used.

You probably have a binary Spark distribution downloaded to your machine, but if you do not, do that now.  Apache Spark can be downloaded here.  Once downloaded, extract it to a local directory - mine is at:

/home/bkarels/spark_1.1.0

(this will become $SPARK_HOME)

The Spark developers have done you a huge favour and have added a set of scripts at SPARK_HOME/sbin/ to do most of what we want to accomplish (get a local Spark cluster running on a single machine).  Again, in the spirit of getting you up and running without exploring every possibility, here is what you need to do.

You can bring up a Spark master and workers independently using SPARK_HOME/sbin/start-master.sh and SPARK_HOME/sbin/start-slaves.sh.  But we're going to add a few environment variables and let SPARK_HOME/sbin/start-all.sh bring up a master and two workers in a single step.

The official Apache Spark documentation for this can be found here if you want to dig deeper.  But for now use your favorite editor to pop open ~/.bashrc.

Add the following items (mind where your SPARK_HOME actually is):

# Spark local environment variables
export SPARK_HOME=/home/bkarels/spark_1.1.0
export SPARK_MASTER_IP=127.0.0.1
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_LOCAL_DIRS=$SPARK_HOME/work
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1G
export SPARK_WORKER_INSTANCES=2
export SPARK_DAEMON_MEMORY=384m
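
SPARK_MASTER_IP, SPARK_MASTER_PORT, & SPARK_MASTER_WEBUI_PORT
Here, the two ports are explicitly set to their defaults (7077 and 8080), and SPARK_MASTER_IP pins the master to the loopback address; left unset, the start scripts use the machine's hostname.  This is just to illustrate that you could easily customize these values.  The remaining elements are where we get what we are looking for in our local cluster.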
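
Workers and applications find the master through a URL of the form spark://HOST:PORT, built from the first two values above.  As a small sketch (master_url is a hypothetical helper name, and the fallbacks are simply the values used in this post), you can print the URL your settings produce:

```shell
# Hypothetical helper: print the master URL implied by the environment.
# The fallbacks (127.0.0.1 and 7077) are the values used in this post.
master_url() {
  echo "spark://${SPARK_MASTER_IP:-127.0.0.1}:${SPARK_MASTER_PORT:-7077}"
}

master_url
```

With the .bashrc entries above this prints spark://127.0.0.1:7077, which should match the URL shown at the top of the master's web UI.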

By default, start-all.sh fires up a single worker that uses all available cores and all of your memory minus 1 GB. (E.g., on a four-core machine with 12 GB of RAM the worker would use all four cores and 11 GB of memory.)  If that works for you, great!  If not, let's tune things a bit.

SPARK_WORKER_INSTANCES
Defaults to 1.  If you set this to anything greater than one, be sure that the value multiplied by the value of SPARK_WORKER_CORES is less than or equal to the total number of cores available on your machine.

Also, if you set it to a value greater than one, verify that the value multiplied by the value of SPARK_WORKER_MEMORY is less than your total system memory.

On my example machine with four cores and 12 GB of memory we could do:
(workers x cores per worker x memory per worker in GB)
4x1x3
2x2x6
1x4x12
...and other combos that do not max things out:
3x1x1
2x1x2
2x1x1 (our example)
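
If you would rather let the shell do the multiplication, here is a minimal sketch of the two checks above.  The function name layout_fits is made up for illustration, and the machine totals are passed in by hand (on Linux, nproc will report your core count).  It uses "less than or equal" for memory so that the maxed-out combinations in the list still pass.

```shell
# Hypothetical helper: succeeds when a worker layout fits the machine.
# usage: layout_fits WORKERS CORES_PER_WORKER MEM_GB_PER_WORKER TOTAL_CORES TOTAL_MEM_GB
layout_fits() {
  workers=$1; cores=$2; mem_gb=$3; total_cores=$4; total_mem_gb=$5
  [ $((workers * cores)) -le "$total_cores" ] &&
    [ $((workers * mem_gb)) -le "$total_mem_gb" ]
}

# The example machine: four cores and 12 GB of RAM.
for layout in "4 1 3" "2 2 6" "2 1 1" "4 2 3"; do
  # $layout is intentionally unquoted so it splits into three arguments.
  if layout_fits $layout 4 12; then
    echo "$layout: fits"
  else
    echo "$layout: too big"
  fi
done
```

The first three layouts fit; the last one asks for eight cores on a four-core machine and is rejected.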

SPARK_WORKER_CORES & SPARK_WORKER_MEMORY
Hopefully these are self-explanatory, so I'll let the Apache docs stand.  Just be mindful of the notes above.

It is worth noting that for development and prototyping a small worker memory should be sufficient.  (E.g., worker memory of 256 MB would be more than enough if your sample data set were a 64 MB log file or 150 MB of emails.)  Remember, failing fast in development is a good thing!

SPARK_DAEMON_MEMORY
By default this value is 512m.  I have turned it down here just to illustrate that it can be tuned.  So if you are getting tight on resources running your IDE, Cassandra, Spark, and so on, you could turn this down a bit.  (This and memory per worker, perhaps.)

Enough with the details, let's run this thing...

First, don't forget to source your .bashrc file:
[bkarels@ahimsa ~]$ . .bashrc

Navigate to SPARK_HOME/sbin:
[bkarels@ahimsa ~]$ cd $SPARK_HOME/sbin
[bkarels@ahimsa sbin]$

Run start-all.sh, entering your password when prompted:
[bkarels@ahimsa sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.master.Master-1-ahimsa.out
bkarels@localhost's password:
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/bkarels/spark_1.1.0/sbin/../logs/spark-bkarels-org.apache.spark.deploy.worker.Worker-1-ahimsa.out

Looks good - let's verify by checking out the master's web UI at localhost:8080.
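
If the page will not load (or you have no browser in your VM), a quick sketch like the one below can tell you whether anything is listening on the master's ports.  port_open is a hypothetical helper, not part of Spark; it leans on bash's /dev/tcp redirection, so it assumes bash is installed.

```shell
# Hypothetical helper: succeeds if something accepts TCP connections on HOST:PORT.
# Relies on bash's /dev/tcp redirection, so it calls bash explicitly.
port_open() {
  bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Check the master port and the web UI port (falling back to the defaults).
for port in "${SPARK_MASTER_PORT:-7077}" "${SPARK_MASTER_WEBUI_PORT:-8080}"; do
  if port_open 127.0.0.1 "$port"; then
    echo "port $port: listening"
  else
    echo "port $port: nothing listening (check the files under \$SPARK_HOME/logs)"
  fi
done
```

If a port reports nothing listening, the .out files that start-all.sh mentioned in its output are the first place to look.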


Voilà!  Your completely customizable, local, disposable Spark cluster is up and running and ready to accept jobs.  (We'll get to that bit soon!)

Lastly - shutting it down.  Here again the Spark engineers have done the work; you need only call stop-all.sh.

[bkarels@ahimsa sbin]$ ./stop-all.sh
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
bkarels@localhost's password:
localhost: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[bkarels@ahimsa sbin]$
FIN

Comments:

  1. Thanks! This was helpful r3volutionary!

  2. Hey thanks for this !!! I am also checking the Kafka, Spark Cassandra!..really well explained....NAMASTE

  3. I want to know why the web page doesn't work with me the web page says "localhost refused to connect"
