Small Data Analysis using SparkR

I chose the title “Small Data” for two reasons: one, the dataset used here is very small and does not really do justice to SparkR’s data-crunching abilities; and two, a small dataset makes it easy to check the basic functionality of SparkR.

The SparkR documentation is available here and its usage is described here.

“The entry point into SparkR is the SparkContext, which connects your R program to a Spark cluster. You can create a SparkContext using sparkR.init and pass in options such as the application name. Further, to work with DataFrames we will need a SQLContext, which can be created from the SparkContext. If you are working from the SparkR shell, the SQLContext and SparkContext should already be created for you.” However, if you are working in an IDE, you need to create the SQLContext and SparkContext yourself.

I’m working in RStudio, so I need to create the SQLContext and SparkContext myself, as shown below.

Starting up SparkR

  • Load SparkR in R as (see the note below if SparkR is not already on your library path)
    >library(SparkR)
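If library(SparkR) fails because the package is not on your library path, you may first need to point R at your Spark installation. A minimal sketch, assuming Spark is installed at /opt/spark (a hypothetical path; adjust it to your own installation):

>Sys.setenv(SPARK_HOME = "/opt/spark")  # hypothetical install location
>.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
>library(SparkR)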

Initialising a Local Spark Context and SqlContext in R

  • Initialising a local Spark context as
>sc = sparkR.init(master="local")
# creates a local Spark context with a single worker thread (no parallelism)
# see the sketch below for passing additional options such as an application name
>sqlContext = sparkRSQL.init(sc)
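As the documentation quoted above notes, sparkR.init also accepts options such as the application name. A small sketch in a fresh R session (the appName value here is just illustrative):

>sc = sparkR.init(master = "local", appName = "SmallDataDemo")
# the appName shows up in the Spark web UI
>sqlContext = sparkRSQL.init(sc)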

Initialising a Spark Context for RDDs in R

  • Initialise a local Spark context with multiple worker threads, across which Resilient Distributed Datasets (RDDs) will be distributed, as
>sc = sparkR.init(master="local[4]")
# where 4 is the number of worker threads, typically one per CPU core

# Note: a Spark context, once created, cannot be replaced within the same session. If you want a new Spark context, you must first stop or destroy the previous one; if you try to create a new Spark context while the old one is still alive, you will get a message like the one below.

>sc_1 = sparkR.init(master="local[2]")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context
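The remedy is exactly what the message suggests: stop the existing context first, then initialise the new one.

>sparkR.stop()
# the old context is gone; a new one can now be created
>sc_1 = sparkR.init(master = "local[2]")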
  • To inspect the local Spark context, type sc
> sc
Java ref type org.apache.spark.api.java.JavaSparkContext id 0

Creating a DataFrame

“With a SQLContext, applications can create DataFrames from a local R data frame, from a Hive table, or from other data sources.”

  • From a local R data frame. The usage is
    >createDataFrame(sqlContext, data)

    where sqlContext is a SQLContext and data is an RDD, list, or data frame.

The full signature of the function createDataFrame is

createDataFrame(sqlContext, data, schema = NULL, samplingRatio = 1)

where sqlContext is a SQLContext; data is an RDD, list, or data.frame; and schema is an optional list of column names or a named list (StructType).

Example:

>data = iris  # where iris is a built-in R dataset
>df = createDataFrame(sqlContext, data)
>df
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
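Printing df shows the column names and types on a single line; printSchema presents the same information as a tree. The output should look like this:

>printSchema(df)
root
 |-- Sepal_Length: double (nullable = true)
 |-- Sepal_Width: double (nullable = true)
 |-- Petal_Length: double (nullable = true)
 |-- Petal_Width: double (nullable = true)
 |-- Species: string (nullable = true)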

View the DataFrame

An R data frame can be viewed with the method View(), but how do you view a Spark DataFrame in R? It can be viewed with showDF, whose usage is

showDF(x, numRows = 20) 
# where numRows is the number of rows to show. Default is 20 rows

So, following our example, let’s see how showDF works on the iris DataFrame

> showDF(df)
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
+------------+-----------+------------+-----------+-------+

DataFrame operations

  • Selecting rows and columns
>head(select(df, df$Species, df$Petal_Width))
  Species Petal_Width
1  setosa         0.2
2  setosa         0.2
3  setosa         0.2
4  setosa         0.2
5  setosa         0.2
6  setosa         0.4

# Note the use of select to read from the data frame: select returns the specified columns, while head() limits the result to the first few rows. Columns can also be selected by name, as sketched below.
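Columns can also be passed to select as character strings instead of the df$ syntax. A quick sketch:

>head(select(df, "Species"))  # equivalent to select(df, df$Species)
  Species
1  setosa
2  setosa
3  setosa
4  setosa
5  setosa
6  setosa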

  • You may use the conventional head() to see the first few rows and columns of the data frame, like
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  • To return the count of all rows in the data frame, use count(x), where x is the SparkSQL DataFrame (a per-group variant is sketched after this example), like
>count(df)
[1] 150
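count can also be combined with groupBy and summarize to count rows per group. A sketch along the lines of the SparkR docs, counting rows per species (group order in the output may vary):

>counts = summarize(groupBy(df, df$Species), count = n(df$Species))
>head(counts)
     Species count
1     setosa    50
2 versicolor    50
3  virginica    50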
Some SparkR Transformations
  • Filter: filters the rows of a data frame according to a given condition; it can also be used to create a new data frame. An example is given below, followed by a sketch of another transformation, withColumn.
>x = filter(df, df$Petal_Length > 1.3)
# keep only the rows with petal length greater than 1.3
>x
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
> head(x)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa
6          4.6         3.4          1.4         0.3  setosa
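Another common transformation is withColumn, which adds a derived column to a data frame. A brief sketch with a hypothetical Petal_Area column:

>y = withColumn(df, "Petal_Area", df$Petal_Length * df$Petal_Width)
# y now has a sixth column, Petal_Area, of type double
>head(select(y, y$Species, y$Petal_Area))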

Finally, to stop the Spark context, type

> sparkR.stop()