The role of Hadoop in big data world

Saturday, 20 July 2013

What is Hadoop? 

Hadoop is an open-source platform developed by Apache and written in the Java programming language. It offers new ways of storing vast amounts of data that make managing big data cheap and efficient. Hadoop is not just a data store for archiving massive volumes of data; it also provides a wide variety of fast data processing methods. It has two main parts: the data processing framework and a distributed file system. Beyond making it easier to work with big data, Hadoop lets you scale out to clusters of commodity computers instead of one huge data center by using parallel processing. Hadoop is also flexible: because it is modular, its components can be swapped out for different tools and platforms.
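
To make the "data processing framework" part concrete, here is a minimal sketch of the classic MapReduce word-count job in Java. It assumes the Hadoop MapReduce client libraries are on the classpath; the input and output paths are taken from the command line and are purely illustrative.

// Minimal sketch of a Hadoop MapReduce job: the classic word count.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper runs in parallel on each block of input data and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reducer sums the counts for each word after the framework groups them by key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The framework runs many mapper tasks in parallel, one per input block, and then shuffles the intermediate pairs to the reducers, which is exactly the "spread the work across the cluster" idea described above.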

 

What's the need for Hadoop? 

I'm not going to talk about traditional RDBMS databases and their shortcomings, but the reality is that making changes in an RDBMS is naturally hard and cumbersome. Hadoop, however, can store huge amounts of data, both structured and unstructured, which is why it is a better choice for capturing all kinds of data, at whatever rate it arrives, across different clusters. About 80% of data is unstructured, and managing and analyzing it with a traditional database designed for structured data is very expensive. As I mentioned, Hadoop is not just about storing a massive amount of data; that data can also be analyzed and processed easily and cheaply.

 

How does it work? 

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you load your organization’s data into Hadoop, the software breaks that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy. All you need to do is add nodes to Hadoop to handle the storage and the processing. Cheaper storage and faster processing capabilities, matched with efficient analysis tools like Hadoop, allow large companies to save all of their valuable data.
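
To illustrate the "bust the data into pieces and spread it across servers" idea, here is a small Java sketch using the HDFS FileSystem API: it copies a file into the cluster and then asks the NameNode where the blocks and their replicas ended up. The file paths are hypothetical, and it assumes a reachable HDFS cluster configured via the usual core-site.xml/hdfs-site.xml on the classpath.

// Small sketch: upload a file to HDFS and inspect its block placement.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; HDFS splits it into blocks and
    // replicates each block across different nodes in the cluster.
    Path local = new Path("/tmp/transactions.csv");    // hypothetical local file
    Path remote = new Path("/data/transactions.csv");  // hypothetical HDFS path
    fs.copyFromLocalFile(local, remote);

    // Ask where the blocks of the file were placed.
    FileStatus status = fs.getFileStatus(remote);
    System.out.println("Replication factor: " + status.getReplication());
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on: " + String.join(", ", block.getHosts()));
    }

    fs.close();
  }
}

Because each block is replicated (three copies by default), losing one server does not lose the data: the remaining copies are used to re-replicate the missing blocks automatically.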

 

What are the requirements of Hadoop? 

Hadoop is linearly scalable: you increase your storage and processing power simply by adding nodes. A mid-range processor, up to 32 GB of memory, and a 1 GbE network connection are sufficient for each node.

 

More information

  1. http://en.wikipedia.org/wiki/Apache_Hadoop
  2. http://h30565.www3.hp.com/t5/Feature-Articles/What-IT-Managers-Need-to-Know-About-Hadoop/ba-p/1416
  3. http://www.globalknowledge.com/training/generic.asp?pageid=3438&country=United+States

Category: Data

Tags: Big Data
