Ehsan Ghanbari

Experience, DotNet, Solutions

The role of Hadoop in big data world

What is Hadoop? 

Hadoop is an open-source platform from the Apache Software Foundation, written in Java, that stores and manages vast amounts of data cheaply and efficiently. Hadoop is not just a data store for archiving masses of data; it also provides a wide variety of fast data-processing methods. It has two main parts: a data processing framework (MapReduce) and a distributed file system (HDFS). Besides making it easier to work with big data, Hadoop scales out to clusters of commodity computers instead of one huge data center machine by using parallel processing. Hadoop is also very flexible because it is modular, so individual pieces can be swapped for different tools and platforms.
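To make the two parts more concrete, here is a minimal sketch of a client talking to the distributed file system (HDFS) through Hadoop's Java API. It is only an illustration: the NameNode address (hdfs://namenode-host:9000) and the file path are assumptions, not settings of any real cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode; host and port here are
        // assumptions for illustration only.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path

        // Writing: the client streams bytes; HDFS splits them into blocks and
        // replicates each block across several DataNodes.
        try (OutputStream out = fs.create(file)) {
            out.write("hello hadoop\n".getBytes("UTF-8"));
        }

        // Reading: the NameNode tells the client where the blocks live, and the
        // bytes are read from whichever replica is available.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}

The same file system is also reachable from the command line and from higher-level tools, which is what makes the storage layer usable independently of the processing framework.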

 

What's the need for Hadoop? 

I'm not going to talk about traditional relational databases and their shortcomings, but the reality is that changes in relational databases are naturally hard and cumbersome. Hadoop, however, can store far more data, both structured and unstructured, which is why it's a better choice for capturing all kinds of data arriving at any rate across different clusters. About 80% of data is unstructured, and managing and analyzing it with a traditional database designed for structured data is very expensive. As I mentioned, Hadoop is not just about storing a massive amount of data; the data should also be analyzed and processed easily and cheaply.

 

How does it work? 

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you load all of your organization’s data into Hadoop, the software busts that data into pieces that it then spreads across your different servers. There’s no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies of each piece are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy. All you need to do is add nodes to Hadoop to handle the storage and the processing. Cheaper storage and faster processing capabilities, matched with efficient analysis tools like Hadoop, allow large companies to save all of their valuable data.
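For the processing side, the word-count job below shows how the pieces spread across the cluster are processed in parallel: each mapper works on one local piece of the input and emits (word, 1) pairs, and reducers sum the counts per word. This is a sketch based on the standard Hadoop MapReduce tutorial example; the input and output paths passed on the command line are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each mapper runs next to one piece of the input and emits (word, 1)
    // for every word it sees in its local data.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducers receive all the counts for a given word, wherever the mappers ran,
    // and sum them into the final result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be launched with the hadoop jar command against input and output directories that already live in HDFS, and the framework takes care of scheduling the map and reduce tasks near the data.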

 

What are the requirements of Hadoop? 

Hadoop is linearly scalable: you increase your storage and processing power by adding nodes. A mid-range processor, up to 32 GB of memory, and a 1 GbE network connection are sufficient for each node.

 

More information

  1. http://en.wikipedia.org/wiki/Apache_Hadoop
  2. http://h30565.www3.hp.com/t5/Feature-Articles/What-IT-Managers-Need-to-Know-About-Hadoop/ba-p/1416
  3. http://www.globalknowledge.com/training/generic.asp?pageid=3438&country=United+States



Around big data world explosion

"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions." ~Wikipedia

"Every two days, we create as much information as we did from the dawn of civilization up until 2003." ~ Eric Schmidt, former Google CEO

When I first heard the statement "Google doesn’t just connect us with content. It affects our perception of content", I was puzzled by it. Consider the progress of computing over the last five decades: about 60 years ago nobody could imagine personal computers, and about 20 years ago nobody could imagine reaching any piece of data over the internet. Now we are in the era of petabytes of information, and just as the computers of 40 years ago could not handle the information of 10 years ago, today's large (and smart) data cannot be handled by the technologies of the last decade. The issue of big data arises whenever traditional data management can no longer handle today's complex processes and large volumes of data.

"more than 85% of Fortune 500 organizations will fail to effectively exploit Big Data for competitive advantage." ~ link

Databases come in two types: transactional and analytical. Transactional databases are designed for structured data, while analytical databases handle both structured and unstructured data. Big data is difficult to work with using most relational database management systems: a relational database relies on one huge, powerful database server, while big data relies on massively parallel processing across thousands of servers to reach high processing speed. Understanding big data requires experience across multiple database areas; it is meant for very complex business strategies, and a simple application doesn't need any kind of big data. So you have to have experience working with all types of structured and unstructured data to know what is going on in big data.

As shown in the picture below, big data is a mixture of every kind of data: structured (databases, sensor, clickstream, and location data) and unstructured (text, email, HTML, social data, images, audio, and video).

[Figure: structured and unstructured sources of big data]

Another issue is migration to big data. There is no established good practice for migrating, because big data comprises a large pool of modern databases, platforms, software packages, and tools, and moving to this kind of technology requires training and spending as data grows from gigabytes to petabytes. There are new tools such as SSDs and Hadoop, and techniques such as network virtualization, to help. Extracting actionable intelligence from big data requires handling large amounts of disparate data and processing it very quickly. Considering scale and agility is one of the most important steps in migrating to big data; these issues must be taken into account when choosing your computing tools and your business processes. Maintain agility and flexibility, because the application undergoes more and more changes as it adopts big data, so agility and estimating the scope are valuable in big data work. Just as skilled programmers and modern tools do not guarantee the success of a project, big data also needs agility to be successful.

 

I found these websites useful:

  1. http://en.wikipedia.org/wiki/Big_data
  2. http://www.sas.com/big-data/
  3. http://www.scaledb.com/big-data.php
  4. http://123socialmedia.com/7-big-data-articles-you-should-read/
  5. http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
  6. http://www.mckinsey.com/insights/strategy/are_you_ready_for_the_era_of_big_data
  7. http://www.infoworld.com/d/cloud-computing/big-data-in-the-cloud-its-time-experiment-193152
  8. http://www.fsn.co.uk/channel_bi_bpm_cpm/mastering_big_data_cfo_strategies_to_transform_insight_into_opportunity#.UdsUS_mGpN9


