The significance of Python’s lambda function

For some time I could not figure out what in the world a “lambda” function in Python is. Of course, I referred to the official documentation, but it did not help much. Not that I am saying it is poorly written. It is very well written, but I needed an easy explanation that my grey cells could easily catch.

So anyway, I have now understood it, and I explain it briefly here so that I can refer back to it whenever I need to.

In a single line: lambdas are quick and dirty functions defined in one expression. Suppose you are too tired to write a full function definition like

> def heart_beat(pulse):
      return pulse * 100
> doctor = heart_beat(20)
> print(doctor)  # Result: 2000

With a lambda, you do not need to define and name the function with def as above. It could rather have been done as follows

> doctor = lambda heart_beat: heart_beat * 100
> doctor  # Shows the function object itself, not a result
> doctor(20)
> # Result: 2000
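
As a rough sketch of where a lambda earns its keep, the block below puts the named function and the lambda side by side and then passes lambdas inline to map and sorted. The readings list and the resting value of 70 are made-up values used only for illustration.

```python
# A named function and an equivalent lambda; both multiply the input by 100.
def heart_beat(pulse):
    return pulse * 100

doctor = lambda pulse: pulse * 100

print(heart_beat(20) == doctor(20))  # True

# Lambdas are handiest inline, as a short argument to another function.
# The readings below are made-up sample values.
readings = [90, 55, 72]
scaled = list(map(lambda pulse: pulse * 100, readings))
closest_to_70 = sorted(readings, key=lambda pulse: abs(pulse - 70))

print(scaled)         # [9000, 5500, 7200]
print(closest_to_70)  # [72, 55, 90]
```

If the function needs more than one expression, or is reused in several places, a regular def is usually the more readable choice.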


Filed under Python

Guide to Data Science Competitions

Originally posted on Happy Endpoints:

Summer is finally here and so are the long form virtual hackathons. Unlike a traditional hackathon, which focuses on what you can build in one place in one limited time span, virtual hackathons typically give you a month or more to work from wherever you like.

And for those of us who love data, we are not left behind. There are a number of data science competitions to choose from this summer. Whether it’s a new Kaggle challenge (which are posted year round) or the data science component of Challenge Post’s Summer Jam Series, there are plenty of opportunities to spend the summer either sharpening or showing off your skills.

The Landscape: Which Competitions are Which?

  • Kaggle
    Kaggle competitions have corporate sponsors that are looking for specific business questions answered with their sample data. In return, winners are rewarded handsomely, but you have to win first.
  • Summer Jam


Filed under Resources

To read multiple files from a directory and save to a data frame

There are various solutions to questions like these, but I will describe the problems that I encountered along with the working solutions that I either found or created on my own.
Question 1: My initial problem was how to read multiple .csv files and store them in a single data frame.
Solution: Use lapply() together with rbind(). One piece of working R code that I found here was provided by Hadley. The code is

# The following code reads multiple csv files into a single data frame
load_data <- function(path) {
  files <- dir(path, pattern = '\\.csv$', full.names = TRUE)  # all files ending in .csv
  tables <- lapply(files, read.csv)  # one data frame per file
  do.call(rbind, tables)  # stack them row-wise
}

And then use the function like

> load_data("D:/User/Temp")
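
For comparison, here is a minimal sketch of the same read-everything-and-stack idea in Python, using only the standard library. It is not a translation of the R code above: it returns a list of row dictionaries rather than a data frame, and the two tiny CSV files in the demo are made up so that the example is self-contained.

```python
import csv
import glob
import os
import tempfile


def load_data(path):
    """Read every .csv file in `path` and return all rows as one list of dicts."""
    rows = []
    for file_name in sorted(glob.glob(os.path.join(path, "*.csv"))):
        with open(file_name, newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo on a throwaway directory holding two small, made-up CSV files.
demo_dir = tempfile.mkdtemp()
samples = [("a.csv", "id,score\n1,10\n2,20\n"), ("b.csv", "id,score\n3,30\n")]
for name, body in samples:
    with open(os.path.join(demo_dir, name), "w", newline="") as handle:
        handle.write(body)

combined = load_data(demo_dir)
print(len(combined))         # 3
print(combined[0]["score"])  # 10
```

As with rbind() in R, this only makes sense when all files share the same columns.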


Filed under data preprocessing, R

Small Data Analysis using SparkR

I chose to title this post “Small Data” for two reasons: first, the dataset is very small and does not really do justice to SparkR’s data crunching; second, by using a small dataset I wanted to check the functionality of SparkR.

The SparkR documentation is available here and its usage is available here

“The entry point into SparkR is the SparkContext, which connects your R program to a Spark cluster. You can create a SparkContext using sparkR.init and pass in options such as the application name etc. Further, to work with DataFrames we will need a SQLContext, which can be created from the SparkContext. If you are working from the SparkR shell, the SQLContext and SparkContext should already be created for you.” However, if you are working in an IDE, then you need to create the SQLContext and SparkContext yourself.

I’m working in RStudio, so I need to create the SQLContext and SparkContext as follows.

Starting up SparkR

  • Load SparkR in R as
     > library (SparkR)

Initialising a Local Spark Context and SqlContext in R

  • Initializing a local SparkContext cluster as
> sc = sparkR.init (master="local")
# for creating a local cluster with no parallelism
> sqlContext = sparkRSQL.init (sc)

Initialising a Spark Context with Multiple Worker Threads in R

  • To run Spark locally with parallelism, initialise the Spark context as
> sc = sparkR.init ("local[4]")
# where 4 is the number of worker threads (ideally the number of CPU cores)

# Note: only one Spark context can be active in an R session. If you want to create a new Spark context, you first have to stop the previous one; if you try to create a new one without stopping the existing one, you will get a message like the one below

> sc_1 = sparkR.init(master="local[2]")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context
  • To check the local cluster sc, type
     > sc # Java ref type id 0 

Creating a DataFrame

“With a SQLContext, applications can create DataFrames from a local R data frame, from a Hive table, or from other data sources.”

  • From a local data frame. The usage of the function createDataFrame is

createDataFrame(sqlContext, data, schema = NULL, samplingRatio = 1)

where sqlContext is a SQLContext, data is an RDD, a list or a data.frame, and schema is a list of column names or a named list (StructType).


> data = iris   # Where iris is a R dataset
> df = createDataFrame (sqlContext, data)
> df
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]

View the DataFrame

An R data frame can be viewed with the View method, but how do you view a SparkR DataFrame? It can be viewed with

showDF(x, numRows = 20) 
# where numRows is the number of rows to show. Default is 20 rows

So, following our example, let’s see how it works on the iris DataFrame

> showDF(df)
| 5.1| 3.5| 1.4| 0.2| setosa|
| 4.9| 3.0| 1.4| 0.2| setosa|
| 4.7| 3.2| 1.3| 0.2| setosa|
| 4.6| 3.1| 1.5| 0.2| setosa|
| 5.0| 3.6| 1.4| 0.2| setosa|
| 5.4| 3.9| 1.7| 0.4| setosa|
| 4.6| 3.4| 1.4| 0.3| setosa|
| 5.0| 3.4| 1.5| 0.2| setosa|
| 4.4| 2.9| 1.4| 0.2| setosa|
| 4.9| 3.1| 1.5| 0.1| setosa|

DataFrame operations

  • Selecting rows and columns
> head( select (df, df$Species, df$Petal_Width))
Species        Petal_Width
1 setosa            0.2
2 setosa            0.2
3 setosa            0.2
4 setosa            0.2
5 setosa            0.2
6 setosa            0.4

# Note the usage of the select statement to read the data frame. The select statement returns the specified columns

  • You may use the conventional head() to see the first few rows of the data frame like
> head(df)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
  • To return the count of all rows in the data frame, use count(x), where x is the SparkSQL DataFrame, like
> count(df)
[1] 150
Some SparkR Transformations
  • Filter - filters the rows of a data frame according to a given condition. It can also be used to create a new data frame. An example is given as;
> x = filter(df, df$Petal_Length > 1.3)
# keep the rows with petal length greater than 1.3
> x
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
> head(x)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
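
To make the filter step concrete without a Spark cluster, here is a small plain-Python sketch of the same predicate applied to the first seven iris rows from the showDF output earlier in this post. It only illustrates the idea of keeping rows that satisfy a condition; it does not use SparkR or any Spark API, and the tuple layout is an assumption made for the example.

```python
# First seven iris rows, copied from the showDF output earlier in this post.
columns = ("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species")
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
    (5.4, 3.9, 1.7, 0.4, "setosa"),
    (4.6, 3.4, 1.4, 0.3, "setosa"),
]

# Keep only the rows whose petal length is strictly greater than 1.3.
petal_length = columns.index("Petal_Length")
kept = list(filter(lambda row: row[petal_length] > 1.3, rows))

print(len(kept))  # 6
print(kept[2])    # (4.6, 3.1, 1.5, 0.2, 'setosa')
```

The third input row (petal length 1.3) is dropped because the comparison is strict, which matches the head(x) output above.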

Finally, to stop the spark context, type

 > sparkR.stop()


Filed under Apache Spark, data analysis, R

Working with Apache SparkR-1.4.0 in RStudio

Finally, after three days, I was able to solve the problem. The problem was that I could not figure out how to integrate Apache SparkR with RStudio in a Windows OS environment.

The existing solutions on Google are Linux based. So, I will spare you all the trouble that I went through :-)

On Windows, the OS locates an application or system program that needs to be executed through the Path environment variable. So, if you need any non-Windows application to work with Windows, it is important that the application path is defined in the system environment variables. In a previous post, I have elaborated on how this can be done.

Please note that the official website suggests installing SparkR from GitHub by running a command on spark-shell, but that never worked for me. It always resulted in the error “mvn.bat not found”, even though I had set it up correctly in the system environment.

Help came in the form of “Google”, where today I found this post by Optimus


The solution is pretty straightforward. You need to create a symbolic link to tell the R interpreter the location of the SparkR directory. On Windows this is achieved with the mklink command and its /D option, which specifies that the given link will be a directory symbolic link (a soft link).

First and foremost, I reiterate: ensure that the system environment variables are in place. Next, launch the command prompt and run it as an administrator. Now execute the command given below;

mklink /D "C:\Program Files\R\R-3.1.3\library\SparkR"  "C:\spark-1.4.0\R\lib\SparkR"

where “C:\Program Files\R\R-3.1.3\library\SparkR” is the destination location (the link) from which you would like SparkR to be loaded, which in this case is the R library used by RStudio, and “C:\spark-1.4.0\R\lib\SparkR” is the source location where you have SparkR installed. If the command is successful, the command prompt reports that the symbolic link was created.
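
To see what a directory symbolic link does, here is a small Python sketch that builds the same kind of link between two throwaway folders. The folder names only mimic the layout above and are hypothetical; nothing here touches a real R or Spark installation. Note that os.symlink takes the existing target first and the new link second, the reverse of mklink, and that on Windows creating symlinks generally needs an elevated prompt.

```python
import os
import tempfile

# Throwaway stand-ins for the two real locations (hypothetical names).
base = tempfile.mkdtemp()
source = os.path.join(base, "spark", "R", "lib", "SparkR")  # where SparkR lives
library = os.path.join(base, "R", "library")                # R's library folder
os.makedirs(source)
os.makedirs(library)
with open(os.path.join(source, "DESCRIPTION"), "w") as handle:
    handle.write("Package: SparkR\n")

# Create the link: existing target first, new link name second.
link = os.path.join(library, "SparkR")
os.symlink(source, link, target_is_directory=True)

# The package files are now reachable through the library folder.
print(os.path.islink(link))                               # True
print(os.path.exists(os.path.join(link, "DESCRIPTION")))  # True
```

This is why library(SparkR) can find the package afterwards: R looks in its library folder and follows the link to the Spark installation.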

Eureka! Problem solved.

Now, launch RStudio and load the SparkR package by executing the following command on the R prompt

> library (SparkR)

The package SparkR is successfully attached.

How to work with SparkR?

  • A Spark program first creates a SparkContext object so as to access a cluster.
  • If you are using pyspark, then Python automatically creates the Spark context for you. When you start pyspark at the terminal/command prompt, you will see the message “SparkContext available as sc, HiveContext available as sqlContext”. Please note that pyspark requires Python version 2.6 or higher. If you type sc in the Python shell and press the Enter key, you will see its memory location. This shows that the SparkContext has already been created for you.
  • If you are using Scala or R as the Spark interface outside their interactive shells, then you will need to create the Spark context yourself. To learn how to create a Spark context in Scala look here, and in R look here

The next step is to initialise SparkR. This can be done with the command sparkR.init(), whose result we save as the local cluster object sc, as given below. Check here for the official SparkR documentation

 > sc = sparkR.init (master="local")
# where the master URL value local means to run Spark
# locally with one worker thread, with no parallelism at all.
# Use master="local[K]" to run Spark locally with K worker threads
# (ideally, set this to the number of cores on your machine).

To check how many cores your computer has, see here. The command above creates a Spark context in R with no parallelism, because the master URL is local. My computer has 4 cores, so if I want parallelism to speed up data processing, then I will create the Spark context as

 > sc = sparkR.init (master="local[4]") # if I want to use all four cores of the CPU

This can also be written as

 > sc = sparkR.init ("local[4]")

or, without parallelism, as

 > sc = sparkR.init ("local")

To stop sc, use

 > sparkR.stop()


Once the Spark local cluster object is created, you can start to analyse your big data. At the time of writing this post, the official SparkR documentation has no example implementations in R. But you can take the ones given in Python, Java or Scala and try them out in SparkR. A basic implementation is shown here.

SparkContext is used to create a Resilient Distributed Dataset (RDD). Note: an RDD, once created, is immutable, meaning it cannot be changed.

To initialize a new SQLContext, use sparkRSQL.init(sc).


By navigating in your browser to localhost:4040 (or 4041, depending on the port), you can monitor jobs and ensure that the calls you are making to the Spark host are actually working.

And if you click on the Environment tab, you will see that you are running SparkR.


You may now begin your big data analysis with the massive power of R and Spark. Cheers to that :-)

To stop Spark Context, type

 > sparkR.stop() 

I hope this helps.


Filed under Environment Setup

How to install SparkR on Windows operating system environment?

DISCLAIMER: The following steps might seem too much to some readers but then this has worked for me. If you are able to reduce the steps and still get it working, kindly post your solution so that I could learn from it too. Thank you.

“SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.”

It took me two days to figure out this solution. Its one limitation is that it works only on the command line interpreter, meaning you can invoke sparkR from the command prompt but not from a front-end IDE like RStudio. I’m still trying to figure out how to get sparkR working in RStudio.

With that said, let’s set it up for the Windows environment. It’s no rocket science :-) The trick is to ensure that you set the environment variables correctly. I’m using Windows 7 HP edition 64 bit OS. The first step is to download Maven and SBT. In my previous post I have provided information on where to download these from. So if you have not done so, please check that post out first. The link is here.

Download SparkR from here. If you are on Windows, click on the Zip download; if you are on Linux/Mac OS, click on the Tar download.

Once again, I will provide the environment variables as they are set up on my machine.

  • Set the variable name as JAVA_HOME (in case Java is not installed on your computer, follow these steps). Next, set the variable value as the JDK path. In my case it is ‘C:\Program Files\Java\jdk1.7.0_79\’ (please type the path without the single quotes)
  • Similarly, create a new system variable and name it PYTHON_PATH. Set the variable value as the Python path on your computer. In my case it is ‘C:\Python27\’ (please type the path without the single quotes)
  • Create a new system variable and name it HADOOP_HOME. Set the variable value as C:\winutils. (Note: There is no need to install Hadoop. The Spark shell only requires the Hadoop path, which in this case points to winutils, which lets us compile the Spark program in a Windows environment.)
  • Create a new system variable and name it SPARK_HOME. Assign the variable value as the path to your Spark binary location. In my case it is ‘C:\SPARK\BIN’
  • Create a new system variable and name it SBT_HOME. Assign the variable value as the path to your SBT installation. In my case it is ‘C:\PROGRAM FILES (x86)\SBT\’
  • Create a new system variable and name it MAVEN_HOME. Assign the variable value as the path to your Maven installation. In my case it is ‘C:\PROGRAM FILES\APACHE MAVEN 3.3.3\’
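
Since most of the trouble comes from a misspelt or missing variable, a quick check can save time. The following is a minimal sketch, not part of the original setup: a hypothetical helper, check_variables, that reports whether each of the variables listed above is set and points at an existing folder.

```python
import os

# The variable names used in the list above.
REQUIRED = ["JAVA_HOME", "PYTHON_PATH", "HADOOP_HOME",
            "SPARK_HOME", "SBT_HOME", "MAVEN_HOME"]


def check_variables(names, environ=os.environ):
    """Return {name: status}; status is 'ok', 'missing' or 'not a directory'."""
    report = {}
    for name in names:
        value = environ.get(name)
        if not value:
            report[name] = "missing"
        elif not os.path.isdir(value):
            report[name] = "not a directory"
        else:
            report[name] = "ok"
    return report


# Print one line per variable for the current machine.
for name, status in check_variables(REQUIRED).items():
    print(name, status)
```

Run it from a freshly opened command prompt, because windows that were already open do not pick up newly created system variables.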

Once all these variables have been created, next select the “Path” variable under “System variables” and click on the Edit button. A window called “Edit System variable” will pop up. Leave the Variable name “Path” as it is. In the variable value, append the following string as given



Click on the OK button to close the environment variable window.

Now open up the terminal (the command prompt window), type sparkR and press the Enter key. The SparkR shell should start. In a similar fashion, you can also invoke PySpark by typing the command pyspark. If you want to invoke Scala, then the command is spark-shell


Next post will be on how to invoke sparkR from RStudio as well as a sample program


Filed under Environment Setup

A simple program to count the frequency of words using Scala in Spark

Continuing from my previous post on how to install Spark in a Windows environment, the next logical task was to execute a simple hello-world type program. Therefore, in this post I attempt to count the frequency of words in the complete works of William Shakespeare. This is my first attempt to understand how Spark works. So let’s get started.

I downloaded the complete works of William Shakespeare from here as a text file and saved it in the data folder of my Spark installation (‘D:\spark-1.2.1\data’). The size of the text file is 5.4 MB

I then started the Spark shell by opening up the command prompt, navigating to the bin directory and executing the command ‘spark-shell’

At the scala prompt I begin by creating three variables filePath, countLove, countBlind as
scala> val filePath = sc.textFile("D:\\spark\\data\\william.txt")
scala> val countLove = filePath.filter(line => line.contains("love"))
scala> val countBlind = filePath.filter(line => line.contains("blind"))
Now, I will use the count function to count the lines that contain each of these two words in the whole text like
scala> countLove.count()
res0: Long = 2484  // 2484 lines containing the word 'love' found in the text
scala> countBlind.count()
res1: Long = 100   // 100 lines containing the word 'blind' found in the text
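
One thing worth keeping in mind: filter followed by count gives the number of lines that contain the word, so a line with the word twice is counted once, and the match is case-sensitive. The plain-Python sketch below (no Spark involved) shows how the different ways of counting diverge on a few sample lines, which are stand-ins chosen for the example and not the actual text file.

```python
import re

# A few sample lines standing in for the text file.
lines = [
    "Love looks not with the eyes, but with the mind",
    "And therefore is winged Cupid painted blind",
    "If love be blind, love cannot hit the mark",
    "The course of true love never did run smooth",
]

# What filter + count measures: lines containing the substring.
lines_with_love = sum(1 for line in lines if "love" in line)

# Every case-sensitive occurrence of the substring.
occurrences = sum(line.count("love") for line in lines)

# Whole-word matches, ignoring case.
whole_words = sum(len(re.findall(r"\blove\b", line, re.IGNORECASE))
                  for line in lines)

print(lines_with_love)  # 2
print(occurrences)      # 3
print(whole_words)      # 4
```

Which of the three is the right "frequency" depends on the question being asked; for a true word count, splitting each line into words first is the usual approach.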


Filed under Apache Spark, data analysis