Hadoop: A Solution for Big Data

U-News Staff

Hadoop, an open-source framework that stores large quantities of data from many sources, has become an important piece of the big data puzzle over the past decade. It has grown into a general-purpose computing platform and could become a foundation for the next generation of database applications. Hadoop's biggest motivator in the market is simple: “Before Hadoop, data storage was expensive.”

“I was interested in providing the world with great tools to make search engines,” Doug Cutting, co-creator of Hadoop, told IT World. “I was excited to be able to try to bring this technology to the world through Open Source.”

Facebook, eBay, Yelp, Twitter, Salesforce.com and many others use Hadoop to analyze data about user behavior and their own operations. It also serves as a foundation for newer graph, NoSQL and relational databases.

“Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features, allowing us to harness our incredible data about the professional world for our users,” said Jay Kreps, principal engineer for LinkedIn, at the third annual Hadoop Summit in 2010.

The Apache Software Foundation, which oversees Hadoop, describes it as a framework for processing large datasets across many computers.

Hadoop is “designed to scale up from single servers to thousands of machines, each offering local computation and storage,” according to the Apache website.

The data Hadoop deals with is complex in its volume, velocity and variety. Hadoop is not a single open-source project but an ecosystem of projects that work together to provide common services. It turns commodity hardware, which is affordable but offers no particular reliability, into a coherent service that stores terabytes of data reliably and processes it efficiently.

The important thing about Hadoop is that it is redundant and reliable. If a machine is lost, Hadoop automatically re-replicates its data without the operator having to do anything. Users get full access to the data even though it is partitioned across machines. Every machine can process data, and Hadoop focuses mainly on batch processing.

When a user submits an application, the cluster runs it and returns the result when it is done. This makes it easy to write distributed applications: a program written for one machine can run unchanged across as many as 4,500 machines. That is a lot of power, and it makes developers far more efficient. Users don't need to buy special hardware, because Hadoop runs on commodity machines, and no extra hardware is needed for reliability or redundancy, because reliability is handled in software.

Hadoop consists of two main parts. The first is MapReduce, the processing side of the system: users submit a computation to MapReduce, and it runs the job and returns the result. The second is the Hadoop Distributed File System (HDFS), which stores all of Hadoop's data and holds every file and directory.
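
Putting the two halves together, here is a rough sketch of the classic word-count job written against Hadoop's Java MapReduce API. It is a minimal illustration rather than production code, and the input and output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Placeholder HDFS paths; the cluster handles distribution and retries.
    FileInputFormat.addInputPath(job, new Path("/user/example/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```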

The server responsible for launching MapReduce tasks on each machine is called the TaskTracker. The HDFS server on a machine is called the DataNode; it stores large amounts of data and provides high-bandwidth access to the data on that machine. A cluster is built by running a TaskTracker and a DataNode on every machine, and when more storage or computing power is needed, users simply add another machine.

MapReduce needs a coordinator, known as the JobTracker, which accepts jobs, divides them into smaller tasks and assigns each task to an individual TaskTracker. The TaskTracker runs the task and reports its status back to the JobTracker. If a TaskTracker disappears because of a software or hardware crash, the JobTracker detects the failure and automatically reassigns its tasks to another TaskTracker.

The NameNode is the coordinator on the HDFS side. It tells clients where data is stored, but the data itself never flows through the NameNode. When a DataNode goes down, the NameNode notices and re-replicates its data to other machines.
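
As a small illustration of that division of labor, the sketch below uses Hadoop's Java FileSystem API to write and then read a file in HDFS. The client asks the NameNode where blocks live, while the bytes themselves stream to and from DataNodes; the cluster address and path shown are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/example/notes.txt");

    // Write: the NameNode picks DataNodes, and blocks are replicated automatically.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode, data streams from DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```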

Hadoop includes several other projects. The first is Pig, which works much like a compiler: it lets users write a high-level description of their data processing and then converts it into MapReduce jobs. Developers like working with Pig because of the productivity it brings.
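
As a rough sketch, the example below drives Pig from Java through its PigServer API, assuming a hypothetical tab-separated log of user actions stored in HDFS; Pig compiles the three statements into a MapReduce job.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // Run Pig Latin from Java; Pig turns these statements into MapReduce jobs.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // Hypothetical dataset: one user action per line, tab-separated.
    pig.registerQuery("actions = LOAD '/user/example/actions.tsv' "
        + "AS (user:chararray, action:chararray);");
    pig.registerQuery("grouped = GROUP actions BY action;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(actions);");
    // Triggers the underlying MapReduce job and writes the results to HDFS.
    pig.store("counts", "/user/example/action_counts");
  }
}
```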

For people who are not comfortable with programming but are comfortable with an SQL interface, Hive is a good choice. It takes SQL as input and performs a transformation similar to Pig's. At some installations it is used extensively, driving as much as 90 percent of the computation.
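
Here is a minimal sketch of that workflow from Java, assuming a HiveServer2 endpoint reachable over JDBC and a hypothetical user_actions table; Hive turns the SQL into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSketch {
  public static void main(String[] args) throws Exception {
    // HiveServer2 speaks JDBC; the host and table name here are hypothetical.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive.example.com:10000/default", "", "");
         Statement stmt = conn.createStatement()) {
      // Hive translates this SQL into MapReduce jobs over files in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT action, COUNT(*) AS cnt FROM user_actions GROUP BY action");
      while (rs.next()) {
        System.out.println(rs.getString("action") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```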

Hadoop is a batch processor, but that doesn't cover everything people need; they also need to read and write data in real time. HBase, a top-level Apache project, meets that need. It provides a simple interface to distributed data and allows the required processing. HBase can be accessed from Pig, Hive and MapReduce, and it stores its information in HDFS, so the data is guaranteed to be reliable and durable. HBase is used for applications such as Facebook messaging: when users send a message on Facebook, that message is stored as an entry in an HBase table, and the system scales out with the traffic.
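
The sketch below shows the kind of real-time read and write HBase is built for, using its Java client API against a hypothetical messages table with a hypothetical column family.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         // Hypothetical table: one row per message, keyed by message id.
         Table messages = conn.getTable(TableName.valueOf("messages"))) {

      // Write a message in real time; no MapReduce job is involved.
      Put put = new Put(Bytes.toBytes("msg-001"));
      put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"),
          Bytes.toBytes("Hello from HBase"));
      messages.put(put);

      // Read it straight back.
      Result result = messages.get(new Get(Bytes.toBytes("msg-001")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
    }
  }
}
```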

“We have one of the largest clusters with a total storage disk capacity of more than 20PB and with more than 23,000 cores,” Ashish Thusoo, engineering manager at Facebook, said at the third Hadoop Summit. “We also use Hadoop and Scribe for log collection, bringing in more than 50TB of raw data per day. Hadoop has helped us scale with these tremendous data volumes.”

HBase stores some of its metadata in ZooKeeper, another Apache project, which provides coordination services for distributed servers. When different servers work together, ZooKeeper looks after the coordination issues among them.
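
As a small illustration of that coordination role, the sketch below uses ZooKeeper's Java API to register an ephemeral node that disappears if the server that created it dies, letting its peers notice the failure; the ensemble address is hypothetical.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Hypothetical ZooKeeper ensemble address.
    ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 30000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // An ephemeral, sequentially numbered znode vanishes when this process
    // dies, so other servers can detect the failure and coordinate around it.
    String path = zk.create("/worker-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("Registered as " + path);

    zk.close();
  }
}
```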

Hive has a metadata server, the metastore, that stores information about tables. Users wanted the same information to be available to MapReduce and Pig, so Apache created a new project called HCatalog. HCatalog is essentially the Hive metadata server with enhancements, so the same table definitions can be used by many applications.
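
As a heavily hedged sketch, the fragment below shows how a MapReduce job might be pointed at a metastore table through HCatalog's HCatInputFormat; the database and table names are hypothetical, and the mapper, reducer and output setup are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatalogSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "read a Hive table from MapReduce");

    // Point the job at a table defined in the shared metastore; "default"
    // and "user_actions" are hypothetical database and table names.
    HCatInputFormat.setInput(job, "default", "user_actions");
    job.setInputFormatClass(HCatInputFormat.class);
    job.setJarByClass(HCatalogSketch.class);

    // A real job would also set mapper, reducer and output classes here,
    // then call job.waitForCompletion(true).
  }
}
```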

Hadoop allows users to store as much data as they want, in whatever form they need, simply by adding more servers to a Hadoop cluster. Each new server adds storage and processing power to the whole cluster, which makes storing data far less costly than earlier approaches. Because Hadoop uses a distributed file system for storage, data processing is faster as well.

Hadoop also allows companies to store data as it arrives. This avoids spending money and time reshaping data to fit relational databases and rigid tables, so data can be stored much more cheaply than with RDBMS software.

“Now businesses can add more datasets to their collection,” Cutting told ReadWrite.com.