How to commit changes to a remote file on a server

Task: To configure the spark-defaults.conf and spark-env.sh files on a remote server using WinSCP

Error: Each time I try to commit changes to the file, I keep getting the error "Cannot overwrite remote file. Permission denied. Error code: 3. Error message from server: Permission denied"

Solution:

In PuTTY, navigate to the directory one level above the actual directory.
Example: I wanted to edit the spark-defaults.conf file located at

/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/spark/conf.dist/spark-defaults.conf

So what I did was cd into /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/spark
Next, type the command sudo chown username:username directoryname -R
Example: sudo chown ashish:ashish conf.dist -R

Now go into WinSCP and edit the file. Notice that when you save the file, the WinSCP screen will flash briefly, which indicates the change has been committed. You can verify this by reopening the file.

Reference: http://stackoverflow.com/questions/25505652/permission-denied-error-code-3-error-message-from-server-permission-denied-fi


Filed under Apache Spark, Big Data

null record error on inserting a new message or append a new message into a topic- Apache Kafka

So, I have been working on Apache Kafka using CDH5.4 with parcels.
Scenario: I have four Linux servers, of which one is the master and the remaining three are slaves.
Task: To configure one of the slaves to act as a Kafka messaging server.
Command: When I execute this command to append a new message to the topic

hadoop jar /opt/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /opt/camus/camus.properties

I get the error "java.lang.RunTimeException job failed nullrecord

camus- null record errorMistake: I overlooked the camus.properties file in Kafka and did not properly configure it which caused this error
Solution:
camus.message.timestamp.field=created_at
camus.message.timestamp.format=ISO-8601
etl.hourly=hourly
etl.daily=daily
(etl.hourly and etl.daily were commented out, i.e. grayed out in the editor; I only enabled them)
etl.default.timezone=Singapore (the default timezone was not set, so I set it to Singapore)

This solved the null record error
Help: camus_etl@googlegroups.com was instrumental in providing the solution

What you need to ensure is that etl.default.timezone matches the timezone you are in, and, most important of all, that camus.message.timestamp.field=created_at is set.

The complete conf.properties file is listed on my GitHub page.

References

1. http://alvincjin.blogspot.com/2014/12/trouble-shooting-for-kafka-camus-example.html

2. http://saurzcode.in/2015/02/integrate-kafka-hdfs-using-camus-twitter-stream-example/


Filed under Big Data

Batch Geo-coding in R

  • “Geocoding (sometimes called forward geocoding) is the process of enriching a description of a location, most typically a postal address or place name, with geographic coordinates from spatial reference data such as building polygons, land parcels, street addresses, postal codes (e.g. ZIP codes, CEDEX) and so on.”

The Google Geocoding API restricts coordinate lookups to 2,500 per IP address per day. So if you have more addresses than this limit, you are forced to search for an alternative solution, which is cumbersome.

The task at hand was to determine the coordinates of a huge number of addresses, to the tune of over 10,000. The question was how to achieve this in R.

Solution

> library(RgoogleMaps)
> DF <- with(caseLoc, data.frame(caseLoc, t(sapply(caseLoc$caseLocation, getGeoCode))))
# caseLoc is the address file and caseLocation is the column header
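
For completeness, here is a minimal end-to-end sketch of the same approach. The file name addresses.csv and the read.csv() step are my assumptions; the original only shows the geocoding call itself.

# A minimal sketch, assuming the addresses sit in a CSV file called
# "addresses.csv" with a column named caseLocation (both names are illustrative)
library(RgoogleMaps)
caseLoc <- read.csv("addresses.csv", stringsAsFactors = FALSE)
# getGeoCode() returns a lat/lon pair for one address string, so
# sapply() + t() builds a two-column matrix of coordinates
coords <- t(sapply(caseLoc$caseLocation, getGeoCode))
DF <- data.frame(caseLoc, coords)   # bind the coordinates back to the addresses
head(DF)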


Filed under data preprocessing, geocoding

The significance of python’s lambda function

For some time I was unable to figure out what in the world the “lambda” function in Python is. Of course, I referred to the official documentation, but it did not help much. Not that I am saying it's poorly written; it is very well written, but the problem is that I needed an easy explanation that my grey cells could easily catch.

So anyway, I have now understood it, and I briefly explain it here so that I can always refer back to it.

In a single-line definition, these are quick, dirty functions: small anonymous functions written inline. Meaning, if you are too tired of writing a full function definition like


> def heart_beat(pulse):
      return pulse * 100
> doctor = heart_beat(20)
> print doctor   # prints 2000

But if you were to use a lambda function, you would not need to define and name the function as above. It could rather be done as follows:

> doctor = lambda pulse: pulse * 100
> doctor        # typing the name alone just shows the function object
> doctor(20)    # 2000, the same result as heart_beat(20)


Filed under Python

Guide to Data Science Competitions

Originally posted on Happy Endpoints:

Summer is finally here and so are the long-form virtual hackathons. Unlike a traditional hackathon, which focuses on what you can build in one place in one limited time span, virtual hackathons typically give you a month or more to work from wherever you like.

And for those of us who love data, we are not left behind. There are a number of data science competitions to choose from this summer. Whether it’s a new Kaggle challenge (which are posted year round) or the data science component of Challenge Post’s Summer Jam Series, there are plenty of opportunities to spend the summer either sharpening or showing off your skills.

The Landscape: Which Competitions are Which?

  • Kaggle
    Kaggle competitions have corporate sponsors that are looking for specific business questions answered with their sample data. In return, winners are rewarded handsomely, but you have to win first.
  • Summer Jam



Filed under Resources

To read multiple files from a directory and save to a data frame

There are various solutions to this question, like these, but I will attempt to describe the problems that I encountered, along with the working solutions that I either found or created on my own.
Question 1: My initial problem was how to read multiple .csv files and store them in a single data frame.
Solution: Use the lapply() function together with rbind(). A working piece of R code, provided by Hadley, is the one I found here. The code is

# The following code reads multiple csv files into a single data frame
load_data <- function(path) {
  files  <- dir(path, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, read.csv)
  do.call(rbind, tables)
}

And then use the function like

> load_data("D:/User/Temp")
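
If you also want to know which file each row came from, the same lapply()/rbind() pattern can be extended. The function name load_data_tagged and the source_file column below are just illustrative names, not part of Hadley's original snippet.

# Variation on the same pattern: tag every row with its source file name
load_data_tagged <- function(path) {
  files  <- dir(path, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, function(f) {
    df <- read.csv(f)
    df$source_file <- basename(f)   # remember which csv the row came from
    df
  })
  do.call(rbind, tables)
}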


Filed under data preprocessing, R

Small Data Analysis using SparkR

I chose to title this post “Small Data” for two reasons: one, the dataset is very small in size and does not really do justice to SparkR's data-crunching ability; and two, by using a small dataset I wanted to check the basic functionality of SparkR.

The SparkR documentation is available here and its usage is described here.

“The entry point into SparkR is the SparkContext, which connects your R program to a Spark cluster. You can create a SparkContext using sparkR.init and pass in options such as the application name etc. Further, to work with DataFrames we will need a SQLContext, which can be created from the SparkContext. If you are working from the SparkR shell, the SQLContext and SparkContext should already be created for you.” However, if you are working in an IDE then you need to create the SQLContext and SparkContext.

I’m working in RStudio, so I need to create the SQLContext and SparkContext myself, which is done as follows.

Starting up SparkR

  • Load SparkR in R as
     > library (SparkR)
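
A practical note for RStudio users (my own addition, not from the SparkR docs quoted above): SparkR 1.4 ships inside the Spark installation rather than on CRAN, so you usually have to point R at it before library(SparkR) will load. The SPARK_HOME path below is only an illustration; use wherever Spark is installed on your machine.

# Make the SparkR package bundled with Spark visible to R (path is illustrative)
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)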

Initialising a Local Spark Context and SqlContext in R

  • Initializing a local SparkContext cluster as
> sc = sparkR.init (master="local")
# for creating a local cluster with no parallelism
> sqlContext = sparkRSQL.init (sc)

Initialising a Resilient Distributed Dataset (RDD) in R

  • Initialising and Creating a Resilient Distributed Dataset (RDD) in R as
> sc = sparkR.init ("local[4]")
# where 4 refers to the number of CPU cores (worker threads) to use

# Note: a Spark context, once created, is immutable. If you want to create a new Spark context, you must first stop (destroy) the previous one; otherwise you will get a message like the one below.

> sc_1 = sparkR.init(master="local[2]")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context
  • To check the local cluster sc, type
     > sc # Java ref type org.apache.spark.api.java.JavaSparkContext id 0 

Creating DataFrame from RDD

“With a SQLContext, applications can create DataFrames from a local R data frame, from a Hive table, or from other data sources.”

  • From a local data frame, the usage is
     createDataFrame(sqlContext, data) 

    where sqlContext is a SQLContext and data is an RDD, a list or a data frame.

The usage of the function createDataFrame is

createDataFrame(sqlContext, data, schema = NULL, samplingRatio = 1)

where sqlContext is a SQLContext, data is an RDD, list or data.frame, and schema is an optional list of column names or a named list (StructType).

Example:

> data = iris   # Where iris is an R dataset
> df = createDataFrame (sqlContext, data)
> df
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]

View the RDD

An R data frame can be viewed with the View() method, but how do you view the RDD (Spark DataFrame) in R? It can be viewed with

showDF(x, numRows = 20) 
# where numRows is the number of rows to show. Default is 20 rows

So following our example let’s see how it works on the iris RDD

> showDF(df)
+------------+-----------+------------+-----------+-------+
|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
+------------+-----------+------------+-----------+-------+

DataFrame operations on RDD

  • Selecting rows and columns
> head(select(df, df$Species, df$Petal_Width))
  Species Petal_Width
1  setosa         0.2
2  setosa         0.2
3  setosa         0.2
4  setosa         0.2
5  setosa         0.2
6  setosa         0.4

# Note the usage of the select statement to read the data frame. The select statement returns the specified columns

  • You may use the conventional head() to see the first few rows of the data frame, like
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
  • To return the count of all rows in the data frame use count (x) where x is the SparkSQL dataframe like
> count(df)
[1] 150

Some SparkR Transformations

  • Filter - filters the rows of a data frame according to a given condition. It can also be used to create a new data frame. An example is given as:
> x = filter(df, df$Petal_Length > 1.3)
# keep only rows with petal length greater than 1.3
> x
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
> head(x)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa
6          4.6         3.4          1.4         0.3  setosa
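
Because select and filter each return a DataFrame, they can also be combined in a single expression. The snippet below is a small sketch of my own, reusing the same iris DataFrame df from above.

# Chain the two transformations shown above: keep rows with petal length
# greater than 1.3, then return only the Species and Petal_Length columns
> head(select(filter(df, df$Petal_Length > 1.3), "Species", "Petal_Length"))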

Finally, to stop the Spark context, type

 > sparkR.stop()


Filed under Apache Spark, data analysis, R