Hadoop
From Wikipedia, the free encyclopedia
Apache Hadoop is a free Java software framework that supports data-intensive distributed applications running on large clusters of commodity computers.[1] It enables applications to scale out easily to thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top-level Apache project, built and used by a community of contributors from all over the world.[2] Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively in its Web search and advertising businesses.[4] IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.[5]
Hadoop was named after the stuffed toy elephant of the child of its creator, Doug Cutting (now a Yahoo! employee). It was originally developed to support distribution for the Nutch search engine project.[6]
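The programming model Hadoop implements is map/reduce, in which a job is expressed as a map step that emits intermediate key/value pairs and a reduce step that aggregates them. The canonical example is a word counter; the following Java sketch uses the org.apache.hadoop.mapred API of the 0.1x releases and is an illustration only, not code taken from the Hadoop distribution.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative word-count job: the map step emits (word, 1) pairs and the
// reduce step sums the counts for each word.
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);          // emit (word, 1)
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();          // sum the 1s emitted for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The framework splits the input files into blocks, runs one map task per split (on a node holding that block where possible), and shuffles the intermediate pairs to the reduce tasks; failed tasks are automatically re-executed on other nodes.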
Hadoop at Yahoo!
On February 19, 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.[7]
There are multiple Hadoop clusters at Yahoo!, each occupying a single datacentre (or a fraction of one). No HDFS filesystem or MapReduce job is split across multiple datacentres; instead, each datacentre has a separate filesystem and workload. The cluster servers run Linux and are configured on boot using Kickstart; every machine bootstraps the Linux image, including the Hadoop distribution. Cluster configuration is also aided by a program called ZooKeeper. The work the clusters perform is known to include the index calculations for the Yahoo! search engine.
Hadoop on Amazon EC2/S3 services
It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).[8] As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours, at a computation cost of just $240.[9]
There is support for the S3 filesystem in Hadoop distributions, and the Hadoop team generates EC2 machine images after every release. From a pure performance perspective, Hadoop on S3/EC2 is inefficient: the S3 filesystem is remote, and every write operation is delayed until the data is guaranteed not to be lost. This removes the locality advantage of Hadoop, which schedules work near its data to reduce network load. However, as Hadoop on EC2 is the primary mass-market way to run Hadoop without a private cluster of one's own, the performance cost is evidently acceptable to its users.
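When Hadoop is pointed at S3 instead of HDFS, job input and output paths can simply be given as s3:// URIs. The following Java sketch shows the general idea; the property names follow the S3 filesystem support of the 0.1x releases and should be checked against the release in use, and the bucket name and credentials are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: access the Hadoop S3 filesystem through the ordinary
// FileSystem API. "example-bucket" and the credentials are placeholders.
public class S3Check {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.s3.awsAccessKeyId", "ACCESS_KEY");       // placeholder credential
    conf.set("fs.s3.awsSecretAccessKey", "SECRET_KEY");   // placeholder credential
    FileSystem s3 = FileSystem.get(new URI("s3://example-bucket/"), conf);
    Path input = new Path("s3://example-bucket/input/");
    System.out.println(input + " exists: " + s3.exists(input));
  }
}

A MapReduce job configured in the same way can read its input from and write its output to such paths, at the cost of the locality penalty described above.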
Hadoop with Sun Grid Engine
Hadoop can also be used in compute farms and high-performance computing environments. Integration with Sun Grid Engine has been released, and running Hadoop on Sun Grid is possible.[10] Note that, as with EC2/S3, the CPU-time scheduler appears to be unaware of the locality of the data; a key feature of the Hadoop runtime, doing the work on the same server or rack as the data, is therefore lost.
References
- ^ "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview
- ^ Hadoop Users List
- ^ Hadoop Credits Page
- ^ Yahoo! Launches World's Largest Hadoop Production Application
- ^ Google Press Center: Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges
- ^ "Hadoop contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce." About Hadoop
- ^ Yahoo! Launches World's Largest Hadoop Production Application (Hadoop and Distributed Computing at Yahoo!)
- ^ Running Hadoop on Amazon EC2/S3, http://aws.typepad.com/aws/2008/02/taking-massive.html
- ^ Self-service, Prorated Super Computing Fun! - Open - Code - New York Times Blog
- ^ Creating Hadoop pe under SGE. Sun Microsystems (2008-01-16).
See also
- HBase - a Google BigTable-like database and a sub-project of Hadoop.
- Cloud computing
External links
- Hadoop website
- Yahoo's bet on Hadoop, an article about Yahoo's investment in Hadoop from Tim O'Reilly
- Yahoo's Doug Cutting on MapReduce and the Future of Hadoop
- A NYT blog mentioning that Hadoop and EC2 were used to reprocess all of The New York Times archive content
- Mention of Nutch and Hadoop in an article about Google
- IBM MapReduce Tools for Eclipse
- Heritrix Hadoop DFS Writer Processor from Zvents
- Problem Solving on Large Scale Clusters using Hadoop
- Hadoop Auto-deployment module for Enomalism
- Pig, a high-level language over the Hadoop platform.