What is Hadoop?

There are piles of data everywhere these days. There was a time when only companies talked about gigabytes or terabytes of data, but now individual users talk in terabytes, and many of us already have terabytes of our own.

And now the big question is: is all that data managed? Is it organized? Can I get what I need immediately?

I am raising these questions at the user level, but imagine the scale at the business level. Data management has become critical, and new technologies are being developed every day to overcome its challenges.

And Hadoop is one of them.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop was originally conceived on the basis of Google’s MapReduce, in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes of data.
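The map/shuffle/reduce flow described above can be sketched in a few lines of plain Python. This is a local toy model of the idea, not the actual Hadoop Java API: each "fragment" here stands in for a split of the input that would run on a separate cluster node.

```python
from collections import defaultdict

def map_phase(fragment):
    # Emit (word, 1) pairs for one fragment of the input.
    return [(word, 1) for word in fragment.split()]

def shuffle(mapped_pairs):
    # Group intermediate values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# The "large data set", split into fragments that could live on
# different nodes of the cluster.
fragments = ["big data is big", "hadoop processes big data"]

mapped = [pair for f in fragments for pair in map_phase(f)]
result = reduce_phase(shuffle(mapped))
print(result["big"])  # 3
```

The key property mirrored here is that `map_phase` runs on each fragment independently, so the work parallelizes across as many nodes as there are fragments; only the shuffle and reduce steps bring the partial results back together.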

Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as the best way to handle massive amounts of data, including not only structured data but also complex, unstructured data. Its popularity is due in part to its ability to store and process large amounts of data quickly and cost effectively across clusters of commodity hardware.

Apache Hadoop is not actually a single product but instead a collection of several components including the following:

MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.

Hadoop Distributed File System (HDFS) – A reliable and distributed Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
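HDFS achieves its reliability by cutting each file into fixed-size blocks and replicating every block on several servers. The toy model below illustrates that placement scheme; the tiny block size and the round-robin node assignment are simplifications for the example (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
BLOCK_SIZE = 8        # bytes here; HDFS default is 128 MB
REPLICATION = 3       # copies kept of each block
DATANODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the file into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    # Map each block index to the nodes holding a copy of it
    # (simple round-robin; real HDFS placement is rack-aware).
    placement = {}
    for i in range(len(blocks)):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"a very large file stored in HDFS"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
print(len(blocks))   # 4
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, the loss of any single server leaves the file fully readable, and reads of different blocks can be served from different machines in parallel.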

Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.

Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.

HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
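To make "column-oriented" and "random read/write access" concrete, here is a minimal in-memory stand-in for HBase's data model: a table maps a row key to cells addressed by `family:qualifier` columns. This is a sketch of the data model only, not the real HBase client API.

```python
from collections import defaultdict

class ToyColumnStore:
    """In-memory sketch of an HBase-style table."""

    def __init__(self):
        # row key -> {"family:qualifier": value}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Random-access write to a single cell.
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        # Random-access read of a single cell.
        return self.rows[row_key].get(column)

store = ToyColumnStore()
store.put("user42", "profile:name", "Alice")
store.put("user42", "activity:last_login", "2013-05-01")
print(store.get("user42", "profile:name"))  # Alice
```

The point of the model is that any single cell can be fetched or updated directly by its row key and column, without scanning the rest of the data set, which is what distinguishes HBase from the batch-oriented MapReduce layer beneath it.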

ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.

Ambari – An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.

HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.

Apache Hadoop is generally not a direct replacement for enterprise data warehouses, data marts and other data stores that are commonly used to manage structured or transactional data. Instead, it is used to augment enterprise data architectures by providing an efficient and cost-effective means for storing, processing, managing and analyzing the ever-increasing volumes of semi-structured or un-structured data being produced daily.

Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.

Apache Hadoop is widely deployed at organizations around the globe, including many of the world’s leading Internet and social networking businesses. At Yahoo!, Apache Hadoop is literally behind every click, processing and analyzing petabytes of data to better detect spam, predict user interests, target ads, and determine ad effectiveness.

Many organizations are beginning to look at Hadoop as an extension to their environments to tackle the volume, velocity and variety of big data. As a result, Hadoop adoption is likely to grow. In a recent Ventana Research survey of large-scale data users, more than half of the respondents stated that they are considering Hadoop within their environment.

Hadoop does not replace existing systems; instead it augments them by taking on the additional processing of large data volumes so existing systems can focus on what they do best. Data integration plays a key role for organizations that want to combine Hadoop with data from multiple systems, taking advantage of the unique strengths of each technology to realize business insights not otherwise possible and to maximize the performance of the overall environment.

More information is available at:

http://hortonworks.com/

http://www.cloudera.com/

http://hadoop.apache.org/

 
