Hadoop & RHadoop

Apache™ Hadoop® is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Hadoop is composed of four core components: Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce, and YARN.

Hadoop Common
A module containing the utilities that support the other Hadoop components.

Hadoop Distributed File System (HDFS)
A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.

MapReduce
A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
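To make the MapReduce model more concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop at all: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and on a real cluster the framework would run the map and reduce steps in parallel on many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (illustration only)."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["Hadoop stores data", "Hadoop processes data in parallel"]
    print(word_count(docs))
```

The key point is that each call to `map_phase` and each call to `reduce_phase` depends only on its own input, which is what lets a framework like MapReduce spread the work across a cluster and rerun a failed piece without redoing everything.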

Yet Another Resource Negotiator (YARN)
The resource management layer introduced with next-generation MapReduce (MapReduce 2). It assigns cluster resources such as CPU and memory to applications running on a Hadoop cluster, and it enables application frameworks other than MapReduce to run on Hadoop.

Useful Resources
1) http://www.coreservlets.com/hadoop-tutorial/
2) http://www.tutorialspoint.com/hadoop/hadoop_quick_guide.htm
3) Hadoop: Beginner’s Guide
4) Hadoop: The Definitive Guide, 4th Edition
5) The Hadoop Ecosystem Table

Understanding Hadoop requires a good knowledge of Linux and Java.

RHadoop is a collection of R packages for connecting R to Hadoop and running R on Hadoop nodes. It allows users to manage and analyze data with Hadoop from within R, including creating MapReduce jobs.

The R packages in the RHadoop toolkit are rmr2, rhdfs, rhbase, plyrmr, and ravro.
rmr2: functions providing Hadoop MapReduce functionality in R.
rhdfs: functions providing file management of the HDFS from within R.
rhbase: functions providing database management for the HBase distributed database from within R.
plyrmr: higher-level, plyr-like data processing for structured data, powered by rmr2.
ravro: functions for reading and writing files in Avro format.
http://projects.revolutionanalytics.com/documents/rhadoop/rhadooppkgs/#rhadooplist

Useful Resources
1) Step by Step Guide to Setting Up an R-Hadoop System
2) RHadoop Tutorial
3) Big Data Analytics with R and Hadoop
4) R and Hadoop Data Analysis – RHadoop
5) Big Data Analysis using RHadoop
6) Hadoop MapReduce Cookbook
