Working with Apache SparkR on RStudio in Windows OS

Note: This post was originally published on June 18, 2015. At that time the Apache Spark project was still young, and installing a local instance of SparkR or PySpark on Windows was difficult and cumbersome. A lot has changed since then, so I am updating this post to reflect the much easier installation process.

Updated Post on 29-June-2016

Step 1: Download and Install Rtools for Windows

You are required to download and install Rtools for Windows from here. Choose the version that matches your R version; I chose Rtools 3.3 because my R version is 3.3.0. Once downloaded, run the installer and follow the on-screen instructions.

Step 2: Download Spark

Open your web browser and go to http://spark.apache.org/, the official website of the Apache Spark project. You should see a large green button on the right of the page that reads “Download Spark”. Click it.

Follow steps 1 to 3 to create a download link for a Spark package of your choice. Under “2. Choose a package type”, select a pre-built package type from the drop-down list. Since we want to experiment locally on Windows, a package pre-built for Hadoop 2.6 and later will suffice. Under “3. Choose a download type”, select “Direct Download” from the drop-down list. After selecting the download type, a link appears next to “4. Download Spark”. Click this link to download Spark.

Step 3: Unzip Built Package

Unzip the downloaded package and save the files to a directory of your choice. I chose to save them to “C:\Apache\Spark-1.6.2”.

Step 4: Run in Command Prompt

Now start your favorite command shell and change directory to your Spark folder; in my case it is C:\Apache\Spark-1.6.2.

To start SparkR, run the command .\bin\sparkR from the top-level Spark directory (in my case C:\Apache\Spark-1.6.2) and press the Enter key. You will see logs on your screen, and SparkR should take at most about 15 seconds to launch. If everything ran smoothly you should see a welcome message that reads “Welcome to SparkR!”

Running in RStudio

Step 5: Run in RStudio

  • Step 5.1: Set System Environment

Once you have opened RStudio, you first need to set the system environment (you can check this post for details). Next, you have to point your R session to the installed version of SparkR. Use the code shown below, but replace the SPARK_HOME variable with the path to your Spark folder. Mine is “C:\Apache\Spark-1.6.2”.

  • Step 5.2: Set Library Path

Now, set the library path for SparkR as shown below:

# Executing SparkR on RStudio
## Set the system environment variable
Sys.setenv(SPARK_HOME = "C:\\Apache\\Spark-1.6.2")
## Set the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
## Load the SparkR library
library(SparkR)
## Create a Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

If you get no errors, you are all set. Enjoy.
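To verify that the contexts work, here is a minimal sanity check. It is only a sketch, using the built-in faithful data set and the DataFrame functions from the Spark 1.6 SparkR API (createDataFrame, head, printSchema):

## Quick sanity check (sketch, Spark 1.6 SparkR API assumed)
df <- createDataFrame(sqlContext, faithful)   # local R data frame -> Spark DataFrame
head(df)                                      # first rows, computed by Spark
printSchema(df)                               # the schema Spark inferred

If these commands return the first rows of faithful and its schema, the SparkR backend is wired up correctly.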


The original post published on June 18, 2015

Finally, after three days I was able to solve the problem. The problem was that I could not figure out how to integrate Apache SparkR in RStudio on a Windows OS environment.

The existing solutions I found on Google were Linux based, so I will spare you all the trouble that I went through 🙂

 

A characteristic behaviour of Windows OS is that, for any application or system program to be executed, the operating system must know where to find it. So, if you need a non-Windows application to work on Windows, it is important that the application's path is defined in the system environment variables. In a previous post, I have elaborated how this can be done.
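For instance, you can inspect from within R what the current session actually sees; this is just a small sketch using base R:

## Sketch: inspect the environment variables visible to the R session (base R only)
Sys.getenv("PATH")          # the search path Windows uses to locate executables
Sys.getenv("SPARK_HOME")    # returns an empty string if the variable is not set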

Please note that the official SparkR-pkg page, http://amplab-extras.github.io/SparkR-pkg/, suggests installing SparkR from GitHub, but that approach never worked for me. It always resulted in the error “mvn.bat not found”, even though I had set it up correctly in the system environment.

Help came in the form of Google, where I found this post by Optimus.

Solution

The solution is pretty straightforward. You need to create a symbolic link to tell the R interpreter the location of the SparkR directory. On Windows this is achieved with the mklink command; the /D option specifies that the link will be a directory (soft) link.

First of all, I reiterate: ensure that the system environment variables are in place. Next, launch the command prompt as an administrator and execute the command given below:

mklink /D "C:\Program Files\R\R-3.1.3\library\SparkR" "C:\spark-1.4.0\R\lib\SparkR"

where “C:\Program Files\R\R-3.1.3\library\SparkR” is the link location from which you would like SparkR to be loaded (the R library folder that RStudio uses), and “C:\spark-1.4.0\R\lib\SparkR” is the source location where you have SparkR installed. If the command is successful, the command prompt will confirm that the symbolic link was created.

Eureka! Problem solved.

Now, launch RStudio and load the SparkR package at the R prompt:

library(SparkR)

The SparkR package should now be successfully attached.
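If you want to double-check which SparkR build R actually picked up (for example, that the symbolic link points to the right place), here is a quick sketch using base R helpers:

packageVersion("SparkR")   # the version of the SparkR package that R resolved
find.package("SparkR")     # the library location, handy for verifying the symlink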

How to work with SparkR?

  • A Spark program first creates a SparkContext object in order to access a cluster.
  • If you are using PySpark, then Python automatically creates the Spark context for you. When you start pyspark at the terminal/command prompt you will see the message “SparkContext available as sc, HiveContext available as sqlContext”. Please note, PySpark requires Python version 2.6 or higher. In the Python shell, if you type sc and press the Enter key you will see its memory location, which shows that Python automatically created the SparkContext.
  • If you are using Scala or R as the Spark interface, then you will need to create the Spark context yourself. To learn how to create a Spark context in Scala look here, and in R look here.

The next step is to initialise SparkR. This is done with the command sparkR.init(), and we save the resulting Spark local cluster handle in the variable sc, as shown below. Check here for the official SparkR documentation.

> sc <- sparkR.init(master = "local")
# The master URL value "local" means run Spark locally with one thread and no parallelism at all.
# Use master = "local[k]" to run Spark locally with k worker threads (ideally, set k to the number of cores on your machine).

To check how many cores your computer has, see here. The following screenshot shows how to create the Spark context in R. Note that in this Spark context there is no parallelism because the master URL is "local". My computer has 4 cores, so if I want parallelism to speed up data processing I will create the Spark context as follows:

> sc <- sparkR.init(master = "local[4]")   # use all four cores of the CPU
# This can also be written as sc <- sparkR.init("local[4]") or, without parallelism, sc <- sparkR.init("local").
# To stop the Spark context, use sparkR.stop().

[Screenshot: creating the Spark context in SparkR]

Once the Spark local cluster object is created, you can start to analyse your big data. At the time of writing this post, the official Spark documentation had no examples implemented in R, but you can take the ones given in Python, Java or Scala and try them out in their SparkR version. A basic implementation is shown here.

The SparkContext is used to create Resilient Distributed Datasets (RDDs). Note: once created, an RDD is immutable, meaning it cannot be changed.
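For illustration, here is a minimal sketch of creating an RDD from a local R vector. Treat it as an assumption-laden example: in the official Spark 1.4+ releases the low-level RDD functions are internal to the SparkR package (hence the ::: operator below), whereas in the older AMPLab SparkR-pkg they were exported directly.

## Sketch: distribute a local vector as an RDD using SparkR's internal RDD API
rdd <- SparkR:::parallelize(sc, 1:100, 2L)   # split 1..100 into two slices
first_five <- SparkR:::take(rdd, 5)          # action: fetch the first five elements
unlist(first_five)                           # 1 2 3 4 5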

To initialise a new SQLContext, use:

> sqlContext <- sparkRSQL.init(sc)
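With the sqlContext in place you can create DataFrames and query them with SQL. Below is a minimal sketch, again using the built-in faithful data set and the Spark 1.x SparkR API (createDataFrame, registerTempTable, sql):

df <- createDataFrame(sqlContext, faithful)    # local R data frame -> Spark DataFrame
registerTempTable(df, "faithful")              # make it queryable by name
long_waits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")
head(long_waits)                               # results come back as a local R data frame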

By navigating in your browser to localhost:4040 (or 4041, depending on the port), you can monitor jobs and ensure that the calls you are making to the Spark host are actually working.

[Screenshot: the Spark web UI]

And if you click on the Environment tab, you will see that you are running SparkR, as shown below.

[Screenshot: the Environment tab of the Spark web UI showing SparkR]

You may now begin your big data analysis with the massive power of R and Spark. Cheers to that 🙂

To stop the Spark context, type

> sparkR.stop()

I hope this helps.

