Tuesday, December 16, 2014

Simple Python application on Apache Spark Cluster

Overview


As an exercise, I am working on duplicating my previous examples in Python.  It is clear that Python has, and is gaining, traction in the data world.  So, it makes sense to have a working knowledge of it.

As with my other examples, everything will find its way to my GitHub repositories; forks and enhancements are welcome.

This is about the simplest possible example of submitting a Python application to a Spark cluster.  But when you move beyond the REPL, you have to start somewhere, right?

Prerequisites:

  • Java 1.7+ (Oracle JDK required)
  • A Spark cluster (see how to set one up here)
  • git


Clone the Example

Begin by cloning the example project from GitHub (super-simple-spark-python-app), then cd into the project directory.


[bkarels@ahimsa work]$ git clone git@github.com:bradkarels/super-simple-spark-python-app.git
Initialized empty Git repository in /home/bkarels/work/super-simple-spark-python-app/.git/
remote: Counting objects: 13, done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 6 (delta 1)
Receiving objects: 100% (13/13), done.
Resolving deltas: 100% (3/3), done.
[bkarels@ahimsa work]$ cd super-simple-spark-python-app/


Move the file tenzingyatso.txt to your home directory.
[bkarels@ahimsa super-simple-spark-python-app]$ mv tenzingyatso.txt ~

Edit simple.py so that the path points to the sample file in your home directory (then save and close).

file = sc.textFile("/home/bkarels/tenzingyatso.txt")
becomes...
file = sc.textFile("/home/yourUserNameHere/tenzingyatso.txt")
...or something similar.
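
If you would rather not hard-code a user name, one option is to build the path from the current user's home directory. This is only a sketch and not necessarily what simple.py in the repository does: the helper name sample_path is made up for illustration, and it assumes the driver and the workers all see the same local filesystem (true for a single-machine cluster like the one used here).

```python
import os


def sample_path(filename="tenzingyatso.txt"):
    """Return the absolute path of `filename` in the current user's home directory."""
    # expanduser("~") resolves to something like /home/yourUserNameHere on Linux
    return os.path.join(os.path.expanduser("~"), filename)


# In simple.py this could stand in for the hard-coded string, e.g.:
#   file = sc.textFile(sample_path())
print(sample_path())
```

The printed path should match the location you moved tenzingyatso.txt to in the previous step.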

Spark it up (with Python)!

If your local Spark cluster is not up and running, start it now.  If you need to review how to do that, you can look here.
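
A quick way to check whether anything is listening on the master port before you submit is a plain TCP connect. The sketch below assumes the master is at 127.0.0.1:7077, the address used in the spark-submit command in this post, and is written for Python 3; the function name is hypothetical. A successful connection only shows that the port is open, not that the cluster is healthy.

```python
import socket


def master_is_listening(host="127.0.0.1", port=7077, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts
        return False


if master_is_listening():
    print("Something is listening on 127.0.0.1:7077")
else:
    print("Nothing on 127.0.0.1:7077 - start the Spark master first")
```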

Make Sparks fly! (i.e. run it)


Since this example does not have a packaged application (e.g. a jar or an egg), we can invoke spark-submit with just our simple Python file.


[bkarels@ahimsa super-simple-spark-python-app]$ $SPARK_HOME/bin/spark-submit --master spark://127.0.0.1:7077 ./simple.py
The expected console output is a line count of 7 wrapped in a nice battery of asterisks, along with the text of the first line of the example file.  If you see that, it worked.
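
For orientation, a script producing this kind of output could be structured roughly as follows. This is a sketch under assumptions, not the contents of simple.py from the repository: the function names and the exact banner format are invented here. The Spark calls it relies on (textFile, count, first) are standard RDD methods; creating the SparkContext is left in the commented lines at the bottom so the helpers can be read on their own.

```python
def format_report(line_count, first_line, width=40):
    """Wrap the line count in rows of asterisks, then append the first line."""
    banner = "*" * width
    return "\n".join([banner, "Line count: %d" % line_count, banner, first_line])


def report_on_file(sc, path):
    """Count the lines of a text file with Spark and format a small report.

    `sc` is expected to be a pyspark SparkContext.
    """
    rdd = sc.textFile(path)  # one RDD element per line of the file
    return format_report(rdd.count(), rdd.first())


# Driver section, as it might look when run through spark-submit:
#   from pyspark import SparkContext
#   sc = SparkContext(appName="SimpleApp")  # the master comes from --master
#   print(report_on_file(sc, "/home/yourUserNameHere/tenzingyatso.txt"))
#   sc.stop()
```

Keeping the formatting separate from the Spark calls makes it possible to try the banner logic without a cluster.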
