Book description
Learn the ins and outs of MapReduce: how to structure a cluster, how to design and implement the Hadoop file system, and how to build your first cloud computing tasks using Hadoop. Learn how to let Hadoop take care of distributing and parallelizing your software: you focus on your code, and Hadoop takes care of the rest.
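The programming model the book teaches can be previewed with a minimal, Hadoop-free sketch of the classic word count: a map phase that emits (word, 1) pairs and a reduce phase that sums each group. This is an illustrative in-memory simulation (the `WordCountSketch` class name is ours, not the book's); in a real Hadoop job, these phases would live in `Mapper` and `Reducer` classes configured via `JobConf`, and the framework would handle splitting, shuffling, and sorting across the cluster.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // "Shuffle" and "reduce" phases: group pairs by key, summing each group.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");
        System.out.println(reduce(map(input)));
        // prints {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

The point of the model is that `map` and `reduce` are pure, per-record functions, which is what lets Hadoop distribute them across machines without the author writing any coordination code.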
Table of contents
- Copyright
- About the Author
- About the Technical Reviewer
- Acknowledgments
- Introduction
- 1. Getting Started with Hadoop Core
- 2. The Basics of a MapReduce Job
- 2.1. The Parts of a Hadoop MapReduce Job
- 2.2. Configuring a Job
- 2.3. Running a Job
- 2.4. Creating a Custom Mapper and Reducer
- 2.5. Summary
- 3. The Basics of Multimachine Clusters
- 3.1. The Makeup of a Cluster
- 3.2. Cluster Administration Tools
- 3.3. Cluster Configuration
- 3.3.1. Hadoop Configuration Files
- 3.3.2. Hadoop Core Server Configuration
- 3.3.2.1. Per-Machine Data
- 3.3.2.2. Default Shared File System URI and NameNode Location for HDFS
- 3.3.2.3. JobTracker Host and Port
- 3.3.2.4. Maximum Concurrent Map Tasks per TaskTracker
- 3.3.2.5. Maximum Concurrent Reduce Tasks per TaskTracker
- 3.3.2.6. JVM Options for the Task Virtual Machines
- 3.3.2.7. Enable Job Control Options on the Web Interfaces
- 3.4. A Sample Cluster Configuration
- 3.4.1. Configuration Requirements
- 3.4.2. Configuration Files for the Sample Cluster
- 3.4.3. Distributing the Configuration
- 3.4.4. Verifying the Cluster Configuration
- 3.4.5. Formatting HDFS
- 3.4.6. Starting HDFS
- 3.4.7. Correcting Errors
- 3.4.8. The Web Interface to HDFS
- 3.4.9. Starting MapReduce
- 3.4.10. Running a Test Job on the Cluster
- 3.5. Summary
- 4. HDFS Details for Multimachine Clusters
- 4.1. Configuration Trade-Offs
- 4.2. HDFS Installation for Multimachine Clusters
- 4.3. Tuning Factors
- 4.4. Recovery from Failure
- 4.5. Troubleshooting HDFS Failures
- 4.6. Summary
- 5. MapReduce Details for Multimachine Clusters
- 5.1. Requirements for Successful MapReduce Jobs
- 5.2. Launching MapReduce Jobs
- 5.3. Using Shared Libraries
- 5.4. MapReduce-Specific Configuration for Each Machine in a Cluster
- 5.5. Using the Distributed Cache
- 5.6. Configuring the Hadoop Core Cluster Information
- 5.7. The Mapper Dissected
- 5.8. Partitioners Dissected
- 5.9. The Reducer Dissected
- 5.10. Combiners
- 5.11. File Types for MapReduce Jobs
- 5.12. Compression
- 5.13. Summary
- 6. Tuning Your MapReduce Jobs
- 7. Unit Testing and Debugging
- 7.1. Unit Testing MapReduce Jobs
- 7.2. Running the Debugger on MapReduce Jobs
- 7.3. Summary
- 8. Advanced and Alternate MapReduce Techniques
- 8.1. Streaming: Running Custom MapReduce Jobs from the Command Line
- 8.2. Alternative Methods for Accessing HDFS
- 8.3. Alternate MapReduce Techniques
- 8.3.1. Chaining: Efficiently Connecting Multiple Map and/or Reduce Steps
- 8.3.1.1. Configuring for Chains
- 8.3.1.2. Passing Key/Value Pairs by Value or by Reference
- 8.3.1.3. Type Checking for Chained Keys and Values
- 8.3.1.4. Per Chain Item Job Configuration Objects
- 8.3.1.5. How the close() Method Is Called for Items in a Chain
- 8.3.1.6. Configuring Mapper Tasks to be a Chain
- 8.3.1.7. Configuring the Reducer Tasks to Be Chains
- 8.3.2. Map-side Join: Sequentially Reading Data from Multiple Sorted Inputs
- 8.3.2.1. Examining Join Datasets
- 8.3.2.2. Under the Covers: How a Join Works
- 8.3.2.3. Types of Joins Supported
- 8.3.2.4. Details of a Join Specification
- 8.3.2.5. Handling Duplicate Keys in a Dataset
- 8.3.2.6. Composing a Join Specification
- 8.3.2.6.1. String CompositeInputFormat.compose(Class<? extends InputFormat> inf, String path)
- 8.3.2.6.2. String CompositeInputFormat.compose(String op, Class<? extends InputFormat> inf, String... path)
- 8.3.2.6.3. String CompositeInputFormat.compose(String op, Class<? extends InputFormat> inf, Path... path)
- 8.3.2.7. Building and Running a Join
- 8.3.2.8. The Magic of the TupleWritable in the Mapper.map() Method
- 8.4. Aggregation: A Framework for MapReduce Jobs that Count or Aggregate Data
- 8.5. Handling Acceptable Failure Rates
- 8.6. Capacity Scheduler: Execution Queues and Priorities
- 8.7. Summary
- 9. Solving Problems with Hadoop
- 10. Projects Based On Hadoop and Future Directions
- 10.1. Hadoop Core–Related Projects
- 10.1.1. HBase: HDFS-Based Column-Oriented Table
- 10.1.2. Hive: The Data Warehouse that Facebook Built
- 10.1.3. Pig, the Other Latin: A Scripting Language for Dataset Analysis
- 10.1.4. Mahout: Machine Learning Algorithms
- 10.1.5. Hama: A Parallel Matrix Computation Framework
- 10.1.6. ZooKeeper: A High-Performance Collaboration Service
- 10.1.7. Lucene: The Open Source Search Engine
- 10.1.8. Thrift and Protocol Buffers
- 10.1.9. Cascading: A Map Reduce Framework for Complex Flows
- 10.1.10. CloudStore: A Distributed File System
- 10.1.11. Hypertable: A Distributed Column-Oriented Database
- 10.1.12. Greenplum: An Analytic Engine with SQL
- 10.1.13. CloudBase: Data Warehousing
- 10.2. Hadoop in the Cloud
- 10.3. API Changes in Hadoop 0.20.0
- 10.4. Additional Features in the Example Code
- 10.5. Summary
- A. The JobConf Object in Detail
- A.1. JobConf Object in the Driver and Tasks
- A.2. JobConf Is a Properties Table
- A.3. Variable Expansion
- A.4. Final Values
- A.5. Constructors
- A.6. Methods for Loading Additional Configuration Resources
- A.7. Basic Getters and Setters
- A.7.1. public String get(String name)
- A.7.2. public String getRaw(String name)
- A.7.3. public void set(String name, String value)
- A.7.4. public String get(String name, String defaultValue)
- A.7.5. public int getInt(String name, int defaultValue)
- A.7.6. public void setInt(String name, int value)
- A.7.7. public long getLong(String name, long defaultValue)
- A.7.8. public void setLong(String name, long value)
- A.7.9. public float getFloat(String name, float defaultValue)
- A.7.10. public boolean getBoolean(String name, boolean defaultValue)
- A.7.11. public void setBoolean(String name, boolean value)
- A.7.12. public Configuration.IntegerRanges getRange(String name, String defaultValue)
- A.7.13. public Collection<String> getStringCollection(String name)
- A.7.14. public String[] getStrings(String name)
- A.7.15. public String[] getStrings(String name, String... defaultValue)
- A.7.16. public void setStrings(String name, String... values)
- A.7.17. public Class<?> getClassByName(String name) throws ClassNotFoundException
- A.7.18. public Class<?>[] getClasses(String name, Class<?>... defaultValue)
- A.7.19. public Class<?> getClass(String name, Class<?> defaultValue)
- A.7.20. public <U> Class<? extends U> getClass(String name, Class<? extends U> defaultValue, Class<U> xface)
- A.7.21. public void setClass(String name, Class<?> theClass, Class<?> xface)
- A.8. Getters for Localized and Load Balanced Paths
- A.8.1. public Path getLocalPath(String dirsProp, String pathTrailer) throws IOException
- A.8.2. public File getFile(String dirsProp, String pathTrailer) throws IOException
- A.8.3. public String[] getLocalDirs() throws IOException
- A.8.4. public void deleteLocalFiles() throws IOException
- A.8.5. public void deleteLocalFiles(String subdir) throws IOException
- A.8.6. public Path getLocalPath(String pathString) throws IOException
- A.8.7. public String getJobLocalDir()
- A.9. Methods for Accessing Classpath Resources
- A.10. Methods for Controlling the Task Classpath
- A.11. Methods for Controlling the Task Execution Environment
- A.11.1. public String getUser()
- A.11.2. public void setUser(String user)
- A.11.3. public void setKeepFailedTaskFiles(boolean keep)
- A.11.4. public boolean getKeepFailedTaskFiles()
- A.11.5. public void setKeepTaskFilesPattern(String pattern)
- A.11.6. public String getKeepTaskFilesPattern()
- A.11.7. public void setWorkingDirectory(Path dir)
- A.11.8. public Path getWorkingDirectory()
- A.11.9. public void setNumTasksToExecutePerJvm(int numTasks)
- A.11.10. public int getNumTasksToExecutePerJvm()
- A.12. Methods for Controlling the Input and Output of the Job
- A.12.1. public InputFormat getInputFormat()
- A.12.2. public void setInputFormat(Class<? extends InputFormat> theClass)
- A.12.3. public OutputFormat getOutputFormat()
- A.12.4. public void setOutputFormat(Class<? extends OutputFormat> theClass)
- A.12.5. public OutputCommitter getOutputCommitter()
- A.12.6. public void setOutputCommitter(Class<? extends OutputCommitter> theClass)
- A.12.7. public void setCompressMapOutput(boolean compress)
- A.12.8. public boolean getCompressMapOutput()
- A.12.9. public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass)
- A.12.10. public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue)
- A.12.11. public void setMapOutputKeyClass(Class<?> theClass)
- A.12.12. public Class<?> getMapOutputKeyClass()
- A.12.13. public Class<?> getMapOutputValueClass()
- A.12.14. public void setMapOutputValueClass(Class<?> theClass)
- A.12.15. public Class<?> getOutputKeyClass()
- A.12.16. public void setOutputKeyClass(Class<?> theClass)
- A.12.17. public Class<?> getOutputValueClass()
- A.12.18. public void setOutputValueClass(Class<?> theClass)
- A.13. Methods for Controlling Output Partitioning and Sorting for the Reduce
- A.13.1. public RawComparator getOutputKeyComparator()
- A.13.2. public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass)
- A.13.3. public void setKeyFieldComparatorOptions(String keySpec)
- A.13.4. public String getKeyFieldComparatorOption()
- A.13.5. public Class<? extends Partitioner> getPartitionerClass()
- A.13.6. public void setPartitionerClass(Class<? extends Partitioner> theClass)
- A.13.7. public void setKeyFieldPartitionerOptions(String keySpec)
- A.13.8. public String getKeyFieldPartitionerOption()
- A.13.9. public RawComparator getOutputValueGroupingComparator()
- A.13.10. public void setOutputValueGroupingComparator(Class<? extends RawComparator> theClass)
- A.14. Methods that Control Map and Reduce Tasks
- A.14.1. public Class<? extends Mapper> getMapperClass()
- A.14.2. public void setMapperClass(Class<? extends Mapper> theClass)
- A.14.3. public Class<? extends MapRunnable> getMapRunnerClass()
- A.14.4. public void setMapRunnerClass(Class<? extends MapRunnable> theClass)
- A.14.5. public Class<? extends Reducer> getReducerClass()
- A.14.6. public void setReducerClass(Class<? extends Reducer> theClass)
- A.14.7. public Class<? extends Reducer> getCombinerClass()
- A.14.8. public void setCombinerClass(Class<? extends Reducer> theClass)
- A.14.9. public boolean getSpeculativeExecution()
- A.14.10. public void setSpeculativeExecution(boolean speculativeExecution)
- A.14.11. public boolean getMapSpeculativeExecution()
- A.14.12. public void setMapSpeculativeExecution(boolean speculativeExecution)
- A.14.13. public boolean getReduceSpeculativeExecution()
- A.14.14. public void setReduceSpeculativeExecution(boolean speculativeExecution)
- A.14.15. public int getNumMapTasks()
- A.14.16. public void setNumMapTasks(int n)
- A.14.17. public int getNumReduceTasks()
- A.14.18. public void setNumReduceTasks(int n)
- A.14.19. public int getMaxMapAttempts()
- A.14.20. public void setMaxMapAttempts(int n)
- A.14.21. public int getMaxReduceAttempts()
- A.14.22. public void setMaxReduceAttempts(int n)
- A.14.23. public void setMaxTaskFailuresPerTracker(int noFailures)
- A.14.24. public int getMaxTaskFailuresPerTracker()
- A.14.25. public int getMaxMapTaskFailuresPercent()
- A.14.26. public void setMaxMapTaskFailuresPercent(int percent)
- A.14.27. public int getMaxReduceTaskFailuresPercent()
- A.14.28. public void setMaxReduceTaskFailuresPercent(int percent)
- A.15. Methods Providing Control Over Job Execution and Naming
- A.15.1. public String getJobName()
- A.15.2. public void setJobName(String name)
- A.15.3. public String getSessionId()
- A.15.4. public void setSessionId(String sessionId)
- A.15.5. public JobPriority getJobPriority()
- A.15.6. public void setJobPriority(JobPriority prio)
- A.15.7. public boolean getProfileEnabled()
- A.15.8. public void setProfileEnabled(boolean newValue)
- A.15.9. public String getProfileParams()
- A.15.10. public void setProfileParams(String value)
- A.15.11. public Configuration.IntegerRanges getProfileTaskRange(boolean isMap)
- A.15.12. public void setProfileTaskRange(boolean isMap, String newValue)
- A.15.13. public String getMapDebugScript()
- A.15.14. public void setMapDebugScript(String mDbgScript)
- A.15.15. public String getReduceDebugScript()
- A.15.16. public void setReduceDebugScript(String rDbgScript)
- A.15.17. public String getJobEndNotificationURI()
- A.15.18. public void setJobEndNotificationURI(String uri)
- A.15.19. public String getQueueName()
- A.15.20. public void setQueueName(String queueName)
- A.15.21. long getMaxVirtualMemoryForTask()
- A.15.22. void setMaxVirtualMemoryForTask(long vmem)
- A.16. Convenience Methods
- A.16.1. public int size()
- A.16.2. public void clear()
- A.16.3. public Iterator<Map.Entry<String,String>> iterator()
- A.16.4. public void writeXml(OutputStream out) throws IOException
- A.16.5. public ClassLoader getClassLoader()
- A.16.6. public void setClassLoader(ClassLoader classLoader)
- A.16.7. public String toString()
- A.17. Methods Used to Pass Configurations Through SequenceFiles
Product information
- Title: Pro Hadoop
- Author(s):
- Release date: June 2009
- Publisher(s): Apress
- ISBN: 9781430219422