MapReduce Patterns, Algorithms, and Use Cases

Ashish Dutt:

An interesting post on big data processing.

Originally posted on Highly Scalable Blog:

In this article I have digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.

MapReduce Framework

Basic MapReduce Patterns

Counting and Summing

Problem Statement: There are a number of documents, where each document is a set of terms. It is required to calculate the total number of occurrences of each term across all documents. Alternatively, it can be an arbitrary function of the terms. For instance, if there is a log file where each record contains a response time, it is required to calculate the average response time.

Solution:

Let's start with something really simple. The code snippet below shows a Mapper that simply…
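For illustration only, a minimal sketch of this counting pattern, written here as Hadoop Streaming-style Python rather than the snippet from the original post, could look like this:

import sys
from collections import defaultdict

def mapper(lines):
    # Emit a (term, 1) pair for every term in every input line (document)
    for line in lines:
        for term in line.split():
            yield term, 1

def reducer(pairs):
    # Sum the counts emitted by the mapper for each term
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return totals

if __name__ == "__main__":
    for term, total in sorted(reducer(mapper(sys.stdin)).items()):
        print("{}\t{}".format(term, total))

In a real Hadoop job the framework performs the shuffle and sort between the map and reduce phases, and a Combiner with the same summing logic can pre-aggregate counts on the map side to cut down network traffic.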



Filed under Educational Data Mining

PySpark in PyCharm on a remote server

Use Case: I want to use my laptop (running Windows 7 Professional) to connect to the CentOS 6.4 master server using PyCharm.

Objective: To write the code in PyCharm on the laptop and then send the job to the server, which will do the processing and should then return the result back to the laptop or to any other visualization API.

My solution was to get the PyCharm Professional edition (you can download it as a 30-day evaluation version), which let me configure a remote interpreter. In the PyCharm environment, press the key combination "Ctrl+Alt+S", which will open the Settings window. From there, click on the + sign next to Project: [your project name] (in my case the project name is Remote_Server), as shown

[Screenshot: pyspark-config-1]

[Screenshot: 1-pyspark configuration]

Now click on OK and write a sample program to test the connectivity. A sample program is given below:

# Spark master URL of the remote server
SPARK_HOME = "spark://IP_ADDRESS_OF_YOUR_SERVER:7077"

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("PySpark import success")
except ImportError as e:
    print("Error importing Spark modules", e)

try:
    conf = SparkConf()
    conf.setMaster(SPARK_HOME)
    conf.setAppName("First_Remote_Spark_Program")
    sc = SparkContext(conf=conf)
    print("Connection succeeded with Master", conf)
    # Distribute a small list across the cluster as a quick sanity check
    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)
    print(distData)
except Exception as e:
    print("Unable to connect to remote server", e)
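
Note that print(distData) only shows the RDD object. To confirm that the computation really runs on the cluster, a small optional follow-up, sketched here for illustration and continuing the snippet above, is:

# Trigger an action so the work is actually executed on the workers
print(distData.reduce(lambda a, b: a + b))  # should print 15
sc.stop()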

Now, when you run this code, you should see the PySpark interpreter output as shown below.

[Screenshots: pyspark-config-4, pyspark-config-5]


Filed under Apache Spark

How to commit changes to a remote file on a server

Task: To configure the spark-defaults.conf and spark-env.sh files on a remote server using WinSCP.

Error: Each time I try to commit changes to the file, I keep getting the error "cannot overwrite remote file Permission denied Error code: 3 Error message from server: Permission denied".

Solution:

In PuTTY, navigate to the directory one level above the actual directory.
Example: I wanted to edit the spark-defaults.conf file located in

/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/spark/conf.dist/spark-defaults.conf

So what I did was cd into /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/spark
Next, type the command sudo chown username:username directoryname -R
Example: sudo chown ashish:ashish conf.dist -R

Now go into WinSCP and edit the file. Notice that when you save the file, the WinSCP window will flash briefly, which indicates the change has been committed. You can check this by reopening the file.

Reference: http://stackoverflow.com/questions/25505652/permission-denied-error-code-3-error-message-from-server-permission-denied-fi


Filed under Apache Spark, Big Data

Null record error when inserting or appending a new message to a topic - Apache Kafka

So, I have been working on Apache Kafka using CDH5.4 with parcels.
Scenario: I have four Linux servers, of which one is the master and the remaining three are slaves.
Task: To configure one of the slaves to act as a Kafka messaging server.
Command: When I execute this command to append a new message to the topic

hadoop jar /opt/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /opt/camus/camus.properties

I get the error "java.lang.RuntimeException: job failed: null record".

[Screenshot: camus null record error]

Mistake: I overlooked the camus.properties file in the Kafka setup and did not configure it properly, which caused this error.
Solution: I set the following properties in camus.properties:

camus.message.timestamp.field=created_at
camus.message.timestamp.format=ISO-8601
etl.hourly=hourly
etl.daily=daily
etl.default.timezone=Singapore

(etl.hourly and etl.daily were grayed out, i.e. commented out; I only enabled them. The default timezone was not set, so I set it to Singapore.)

This solved the null record error.
Help: The camus_etl@googlegroups.com group was instrumental in providing the solution.

What you need to ensure is the timezone you are in, and most important is that camus.message.timestamp.field=created_at matches the timestamp field in your messages.

The complete conf.properties file is listed on my GitHub page.

References

1. http://alvincjin.blogspot.com/2014/12/trouble-shooting-for-kafka-camus-example.html

2. http://saurzcode.in/2015/02/integrate-kafka-hdfs-using-camus-twitter-stream-example/


Filed under Big Data

Batch Geo-coding in R

  • “Geocoding (sometimes called forward geocoding) is the process of enriching a description of a location, most typically a postal address or place name, with geographic coordinates from spatial reference data such as building polygons, land parcels, street addresses, postal codes (e.g. ZIP codes, CEDEX) and so on.”

The Google Geocoding API restricts coordinate lookups to 2,500 per IP address per day. So if you have more addresses than this limit, you have to search for an alternative solution, which is cumbersome.

The task at hand was to determine the coordinates of a huge number of addresses, to the tune of over 10,000. The question was how to achieve this in R.

Solution

> library(RgoogleMaps)
> # caseLoc is the address file and caseLocation is the column header
> DF <- with(caseLoc, data.frame(caseLoc, t(sapply(caseLoc$caseLocation, getGeoCode))))


Filed under data preprocessing, geocoding

The significance of python’s lambda function

For some time I was unable to figure out what in the world the "lambda" function in Python is. Of course, I referred to the official documentation, but it did not help much. Not that I am saying it's poorly written; it is very well written, but the problem is that I needed an easy explanation that my grey cells could easily catch.

So anyway, I have now understood it, and I briefly explain it here in case I need to refer back to it again. I can always check it here.

In a single-line definition, these are quick and dirty anonymous functions. Meaning, if you are too tired of writing a full function definition like


> def heart_beat(pulse):
      return pulse * 100
> doctor = heart_beat(20)
> print(doctor)  # prints 2000

But if you were to use a lambda function, you would not need to define and name the function as above. It could rather be done as follows:

> doctor = lambda pulse: pulse * 100
> doctor       # typing just the name echoes the function object
> doctor(20)   # returns 2000
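
Where a lambda really pays off is as a short, inline argument to built-ins such as sorted() or map(). A small illustrative example, with made-up pulse values of my own:

> pulses = [20, 35, 15, 28]
> sorted(pulses, key=lambda p: p * 100)   # [15, 20, 28, 35]
> list(map(lambda p: p * 100, pulses))    # [2000, 3500, 1500, 2800]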


Filed under Python

Guide to Data Science Competitions

Originally posted on Happy Endpoints:

Summer is finally here and so are the long-form virtual hackathons. Unlike a traditional hackathon, which focuses on what you can build in one place in one limited time span, virtual hackathons typically give you a month or more to work from wherever you like.

And for those of us who love data, we are not left behind. There are a number of data science competitions to choose from this summer. Whether it's a new Kaggle challenge (these are posted year-round) or the data science component of Challenge Post's Summer Jam Series, there are plenty of opportunities to spend the summer either sharpening or showing off your skills.

The Landscape: Which Competitions are Which?

  • Kaggle
    Kaggle competitions have corporate sponsors that are looking for specific business questions answered with their sample data. In return, winners are rewarded handsomely, but you have to win first.
  • Summer Jam



Filed under Resources