DISCLAIMER: The following steps might seem like too much to some readers, but this is what worked for me. If you can reduce the steps and still get it working, kindly post your solution so that I can learn from it too. Thank you.
"SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster." (https://amplab-extras.github.io/SparkR-pkg/)
It took me two days to figure out this solution. Its one limitation is that it works only from the command-line interpreter: you can invoke sparkR from the command prompt, but not from a front-end IDE like RStudio. I'm still trying to figure out how to get sparkR working in RStudio.
With that said, let's set it up for a Windows environment. It's no rocket science 🙂 The trick is to ensure that you set the environment variables correctly. I'm using the Windows 7 HP edition, 64-bit OS. The first step is to download Maven and SBT. In my previous post I provided information on where to download these, so if you have not done so yet, check that post out first. The link is here.
Download SparkR from here. If you are on Windows, click on the Zip download; if you are on Linux/Mac OS, click on the Tar download.
Once again, I will provide the environment variables as they are set up on my machine.
- Create a new system variable named JAVA_HOME (in case Java is not installed on your computer, follow these steps first). Set the variable value to the JDK path. In my case it is ‘C:\Program Files\Java\jdk1.7.0_79\’ (please type the path without the single quotes)
- Similarly, create a new system variable named PYTHON_PATH. Set the variable value to the Python path on your computer. In my case it is ‘C:\Python27\’ (please type the path without the single quotes)
- Create a new system variable named HADOOP_HOME. Set the variable value to C:\winutils. (Note: there is no need to install Hadoop. The Spark shell only requires the Hadoop path, which in this case points to winutils and lets us run Spark on a Windows environment.)
- Create a new system variable named SPARK_HOME. Set the variable value to the path of your Spark binary location. In my case it is ‘C:\SPARK\BIN’
- Create a new system variable named SBT_HOME. Set the variable value to the path of your SBT installation. In my case it is ‘C:\PROGRAM FILES (x86)\SBT\’
- Create a new system variable named MAVEN_HOME. Set the variable value to the path of your Maven installation. In my case it is ‘C:\PROGRAM FILES\APACHE MAVEN 3.3.3\’
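As a shortcut, the six variables above can also be set from the command prompt with the built-in setx command instead of clicking through the GUI. This is a sketch using the paths from my machine; adjust them to your own installation locations, and note that the /M switch (which writes to the system rather than the user environment) requires an administrator command prompt:

```shell
REM Persist the environment variables described above.
REM Run from an Administrator command prompt; adjust paths as needed.
setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_79" /M
setx PYTHON_PATH "C:\Python27" /M
setx HADOOP_HOME "C:\winutils" /M
setx SPARK_HOME "C:\SPARK\BIN" /M
setx SBT_HOME "C:\Program Files (x86)\sbt" /M
setx MAVEN_HOME "C:\Program Files\Apache Maven 3.3.3" /M
```

Keep in mind that setx only affects newly opened command prompt windows, so open a fresh prompt before testing.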
Once all these variables have been created, select the “Path” variable under “System variables” and click on the Edit button. A window called “Edit System Variable” will pop up. Leave the variable name “Path” as it is. In the variable value, append the string exactly as given.
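I can't reproduce the exact string from my setup in text here, but as a rough sketch, the entries to append are typically the bin folders of the tools configured above. Treat this list as an assumption and verify it against your own installation:

```shell
REM Hypothetical example of Path additions -- append these to the
REM existing Path value (entries separated by semicolons), never
REM replace the existing value.
%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%;%MAVEN_HOME%\bin;%SBT_HOME%\bin
```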
Click on the OK button to close the environment variables window.
Now open up the terminal (the command prompt window), type sparkR, and press the Enter key. You should see the following screen. In a similar fashion you can also invoke PySpark by typing the command pyspark. If you want to invoke the Scala shell, the command is spark-shell.
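For quick reference, the three interactive shells mentioned above are launched as follows (this assumes the environment variables from the earlier steps are in place and a fresh command prompt has been opened):

```shell
REM R front end (SparkR):
sparkR

REM Python front end (PySpark):
pyspark

REM Scala front end:
spark-shell
```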
The next post will be on how to invoke sparkR from RStudio, along with a sample program.