Pro Hadoop

Book description

You learn the ins and outs of MapReduce: how to structure a cluster, how to design and implement the Hadoop file system, and how to build your first cloud-computing tasks using Hadoop. Learn how to let Hadoop handle distributing and parallelizing your software; you just focus on the code, and Hadoop takes care of the rest.
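
To make that concrete, below is a minimal word-count sketch written against the org.apache.hadoop.mapred API (JobConf, MapReduceBase, OutputCollector) that this edition of the book covers. The class names WordCount, TokenMapper, and SumReducer are illustrative placeholders, not examples taken from the book. Only the map and reduce logic is yours to write; Hadoop splits the input, schedules tasks across the cluster, and shuffles and sorts the intermediate key/value pairs.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line. Hadoop decides
  // which machine runs each map task and which input split it reads.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: the framework has already grouped and sorted the pairs by key,
  // so each call sees one word together with all of its counts.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      output.collect(word, new IntWritable(sum));
    }
  }

  // Driver: a JobConf describes the job; JobClient.runJob() submits it and
  // blocks until the cluster finishes.
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(TokenMapper.class);
    conf.setCombinerClass(SumReducer.class); // local pre-aggregation
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

Packaged into a JAR, a job like this would typically be launched with something like hadoop jar wordcount.jar WordCount <input-dir> <output-dir>; Chapters 1 and 2 walk through running and configuring jobs in detail.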

Table of contents

  1. Copyright
  2. About the Author
  3. About the Technical Reviewer
  4. Acknowledgments
  5. Introduction
    1. Who This Book Is For
    2. How This Book Is Structured
    3. Prerequisites
    4. Downloading the Code
    5. Contacting the Author
  6. 1. Getting Started with Hadoop Core
    1. 1.1. Introducing the MapReduce Model
    2. 1.2. Introducing Hadoop
      1. 1.2.1. Hadoop Core MapReduce
      2. 1.2.2. The Hadoop Distributed File System
    3. 1.3. Installing Hadoop
      1. 1.3.1. The Prerequisites
        1. 1.3.1.1. Hadoop on a Linux System
        2. 1.3.1.2. Hadoop on a Windows System: How To and Common Problems
      2. 1.3.2. Getting Hadoop Running
      3. 1.3.3. Checking Your Environment
    4. 1.4. Running Hadoop Examples and Tests
      1. 1.4.1. Hadoop Examples
        1. 1.4.1.1. Running the Pi Estimator
        2. 1.4.1.2. Examining the Output: Input Splits, Shuffles, Spills, and Sorts
      2. 1.4.2. Hadoop Tests
    5. 1.5. Troubleshooting
    6. 1.6. Summary
  7. 2. The Basics of a MapReduce Job
    1. 2.1. The Parts of a Hadoop MapReduce Job
      1. 2.1.1. Input Splitting
      2. 2.1.2. A Simple Map Function: IdentityMapper
      3. 2.1.3. A Simple Reduce Function: IdentityReducer
    2. 2.2. Configuring a Job
      1. 2.2.1. Specifying Input Formats
      2. 2.2.2. Setting the Output Parameters
      3. 2.2.3. Configuring the Reduce Phase
    3. 2.3. Running a Job
    4. 2.4. Creating a Custom Mapper and Reducer
      1. 2.4.1. Setting Up a Custom Mapper
        1. 2.4.1.1. The Reporter Object
        2. 2.4.1.2. The Counters and Exceptions
      2. 2.4.2. After the Job Finishes
        1. 2.4.2.1. Examining the Counters
        2. 2.4.2.2. Was This Job Really Successful?
      3. 2.4.3. Creating a Custom Reducer
      4. 2.4.4. Why Do the Mapper and Reducer Extend MapReduceBase?
        1. 2.4.4.1. The configure Method
        2. 2.4.4.2. The close Method
      5. 2.4.5. Using a Custom Partitioner
    5. 2.5. Summary
  8. 3. The Basics of Multimachine Clusters
    1. 3.1. The Makeup of a Cluster
    2. 3.2. Cluster Administration Tools
    3. 3.3. Cluster Configuration
      1. 3.3.1. Hadoop Configuration Files
      2. 3.3.2. Hadoop Core Server Configuration
        1. 3.3.2.1. Per-Machine Data
        2. 3.3.2.2. Default Shared File System URI and NameNode Location for HDFS
        3. 3.3.2.3. JobTracker Host and Port
        4. 3.3.2.4. Maximum Concurrent Map Tasks per TaskTracker
        5. 3.3.2.5. Maximum Concurrent Reduce Tasks per TaskTracker
        6. 3.3.2.6. JVM Options for the Task Virtual Machines
        7. 3.3.2.7. Enable Job Control Options on the Web Interfaces
    4. 3.4. A Sample Cluster Configuration
      1. 3.4.1. Configuration Requirements
        1. 3.4.1.1. Network Requirements
        2. 3.4.1.2. Advanced Networking: Support for Multihomed Machines
        3. 3.4.1.3. Machine Configuration Requirements
      2. 3.4.2. Configuration Files for the Sample Cluster
        1. 3.4.2.1. The hadoop-site.xml File
        2. 3.4.2.2. The slaves and masters Files
        3. 3.4.2.3. The hadoop-metrics.properties File
      3. 3.4.3. Distributing the Configuration
      4. 3.4.4. Verifying the Cluster Configuration
      5. 3.4.5. Formatting HDFS
      6. 3.4.6. Starting HDFS
      7. 3.4.7. Correcting Errors
      8. 3.4.8. The Web Interface to HDFS
      9. 3.4.9. Starting MapReduce
      10. 3.4.10. Running a Test Job on the Cluster
    5. 3.5. Summary
  9. 4. HDFS Details for Multimachine Clusters
    1. 4.1. Configuration Trade-Offs
    2. 4.2. HDFS Installation for Multimachine Clusters
      1. 4.2.1. Building the HDFS Configuration
        1. 4.2.1.1. Generating the conf/hadoop-site.xml File
        2. 4.2.1.2. Generating the conf/slaves and conf/masters Files
        3. 4.2.1.3. Customizing the conf/hadoop-env.sh File
      2. 4.2.2. Distributing Your Installation Data
      3. 4.2.3. Formatting Your HDFS
      4. 4.2.4. Starting Your HDFS Installation
      5. 4.2.5. Verifying HDFS Is Running
        1. 4.2.5.1. Checking the NameNodes
        2. 4.2.5.2. Checking the DataNodes
    3. 4.3. Tuning Factors
      1. 4.3.1. File Descriptors
      2. 4.3.2. Block Service Threads
      3. 4.3.3. NameNode Threads
      4. 4.3.4. Server Pending Connections
      5. 4.3.5. Reserved Disk Space
      6. 4.3.6. Storage Allocations
      7. 4.3.7. Disk I/O
        1. 4.3.7.1. Secondary NameNode Disk I/O Tuning
        2. 4.3.7.2. NameNode Disk I/O Tuning
        3. 4.3.7.3. DataNode Disk I/O Tuning
      8. 4.3.8. Network I/O Tuning
    4. 4.4. Recovery from Failure
      1. 4.4.1. NameNode Recovery
      2. 4.4.2. DataNode Recovery and Addition
      3. 4.4.3. DataNode Decommissioning
      4. 4.4.4. Deleted File Recovery
    5. 4.5. Troubleshooting HDFS Failures
      1. 4.5.1. NameNode Failures
        1. 4.5.1.1. Out of Memory
        2. 4.5.1.2. Data Loss or Corruption
        3. 4.5.1.3. No Live Node Contains Block Errors
        4. 4.5.1.4. Write Failed
      2. 4.5.2. DataNode or NameNode Pauses
    6. 4.6. Summary
  10. 5. MapReduce Details for Multimachine Clusters
    1. 5.1. Requirements for Successful MapReduce Jobs
    2. 5.2. Launching MapReduce Jobs
    3. 5.3. Using Shared Libraries
    4. 5.4. MapReduce-Specific Configuration for Each Machine in a Cluster
    5. 5.5. Using the Distributed Cache
      1. 5.5.1. Adding Resources to the Task Classpath
      2. 5.5.2. Distributing Archives and Files to Tasks
        1. 5.5.2.1. Distributing Archives
        2. 5.5.2.2. Distributing Files
      3. 5.5.3. Accessing the DistributedCache Data
        1. 5.5.3.1. Looking Up Names
        2. 5.5.3.2. Looking Up Archives and Files
        3. 5.5.3.3. Finding a File or Archive in the Localized Cache
    6. 5.6. Configuring the Hadoop Core Cluster Information
      1. 5.6.1. Setting the Default File System URI
      2. 5.6.2. Setting the JobTracker Location
    7. 5.7. The Mapper Dissected
      1. 5.7.1. Mapper Methods
        1. 5.7.1.1. The configure() Method
        2. 5.7.1.2. The map() Method
        3. 5.7.1.3. The close() Method
      2. 5.7.2. Mapper Class Declaration and Member Fields
      3. 5.7.3. Initializing the Mapper with Spring
        1. 5.7.3.1. Creating the Spring Application Context
        2. 5.7.3.2. Using Spring to Autowire the Mapper Class
    8. 5.8. Partitioners Dissected
      1. 5.8.1. The HashPartitioner Class
      2. 5.8.2. The TotalOrderPartitioner Class
        1. 5.8.2.1. Building a Range Table
        2. 5.8.2.2. Using the TotalOrderPartitioner
      3. 5.8.3. The KeyFieldBasedPartitioner Class
    9. 5.9. The Reducer Dissected
      1. 5.9.1. A Simple Transforming Reducer
      2. 5.9.2. A Reducer That Uses Three Partitions
    10. 5.10. Combiners
    11. 5.11. File Types for MapReduce Jobs
      1. 5.11.1. Text Files
      2. 5.11.2. Sequence Files
      3. 5.11.3. Map Files
    12. 5.12. Compression
      1. 5.12.1. Codec Specification
      2. 5.12.2. Sequence File Compression
      3. 5.12.3. Map Task Output
      4. 5.12.4. JAR, Zip, and Tar Files
    13. 5.13. Summary
  11. 6. Tuning Your MapReduce Jobs
    1. 6.1. Tunable Items for Cluster and Jobs
      1. 6.1.1. Behind the Scenes: What the Framework Does
        1. 6.1.1.1. On Job Submission
        2. 6.1.1.2. Map Task Submission and Execution
        3. 6.1.1.3. Merge-Sorting
        4. 6.1.1.4. The Reduce Phase
        5. 6.1.1.5. Writing to HDFS
      2. 6.1.2. Cluster-Level Tunable Parameters
        1. 6.1.2.1. Server-Level Parameters
        2. 6.1.2.2. HDFS Tunable Parameters
        3. 6.1.2.3. JobTracker and TaskTracker Tunable Parameters
      3. 6.1.3. Per-Job Tunable Parameters
    2. 6.2. Monitoring Hadoop Core Services
      1. 6.2.1. JMX: Hadoop Core Server and Task State Monitor
      2. 6.2.2. Nagios: A Monitoring and Alert Generation Framework
      3. 6.2.3. Ganglia: A Visual Monitoring Tool with History
      4. 6.2.4. Chukwa: A Monitoring Service
      5. 6.2.5. FailMon: A Hardware Diagnostic Tool
    3. 6.3. Tuning to Improve Job Performance
      1. 6.3.1. Speeding Up the Job and Task Start
      2. 6.3.2. Optimizing a Job's Map Phase
      3. 6.3.3. Tuning the Reduce Task Setup
      4. 6.3.4. Addressing Job-Level Issues
        1. 6.3.4.1. Dealing with the Task Tail
        2. 6.3.4.2. Dealing with the Job Tail
    4. 6.4. Summary
  12. 7. Unit Testing and Debugging
    1. 7.1. Unit Testing MapReduce Jobs
      1. 7.1.1. Requirements for Using ClusterMapReduceTestCase
        1. 7.1.1.1. Troubles with Jetty, the HTTP Server for the Web UI
        2. 7.1.1.2. The Hadoop Core JAR Is Missing or Malformed
        3. 7.1.1.3. The Virtual Cluster Failed to Start
      2. 7.1.2. Simpler Testing and Debugging with ClusterMapReduceDelegate
        1. 7.1.2.1. Core Methods of ClusterMapReduceDelegate
        2. 7.1.2.2. Configuration Parameters for Interacting with Virtual Clusters
      3. 7.1.3. Writing a Test Case: SimpleUnitTest
        1. 7.1.3.1. The TestCase Class Declaration
        2. 7.1.3.2. The Cluster Start Method
        3. 7.1.3.3. The Cluster Stop Method
        4. 7.1.3.4. The Actual Test
        5. 7.1.3.5. A Test Case That Launches a MapReduce Job
    2. 7.2. Running the Debugger on MapReduce Jobs
      1. 7.2.1. Running an Entire MapReduce Job in a Single JVM
      2. 7.2.2. Debugging a Task Running on a Cluster
      3. 7.2.3. Rerunning a Failed Task
        1. 7.2.3.1. Configuring the Job or Cluster to Save the Task Local Working Directory
        2. 7.2.3.2. Determining the Location of the Task Local Working Directory
        3. 7.2.3.3. Running a Job with a Keep Pattern and Debugging via the IsolationRunner
    3. 7.3. Summary
  13. 8. Advanced and Alternate MapReduce Techniques
    1. 8.1. Streaming: Running Custom MapReduce Jobs from the Command Line
      1. 8.1.1. Streaming Command-Line Arguments
        1. 8.1.1.1. Using -inputreader org.apache.hadoop.streaming.StreamXmlRecordReader
      2. 8.1.2. Using Pipes
      3. 8.1.3. Using Counters in Streaming and Pipes Jobs
        1. 8.1.3.1. Using the reporter:counter:group,counter,increment Command
        2. 8.1.3.2. Using the reporter:status:message Command
    2. 8.2. Alternative Methods for Accessing HDFS
      1. 8.2.1. libhdfs
      2. 8.2.2. fuse-dfs
      3. 8.2.3. Mounting an HDFS File System Using fuse_dfs
    3. 8.3. Alternate MapReduce Techniques
      1. 8.3.1. Chaining: Efficiently Connecting Multiple Map and/or Reduce Steps
        1. 8.3.1.1. Configuring for Chains
        2. 8.3.1.2. Passing Key/Value Pairs by Value or by Reference
        3. 8.3.1.3. Type Checking for Chained Keys and Values
        4. 8.3.1.4. Per Chain Item Job Configuration Objects
        5. 8.3.1.5. How the close() Method Is Called for Items in a Chain
        6. 8.3.1.6. Configuring Mapper Tasks to be a Chain
        7. 8.3.1.7. Configuring the Reducer Tasks to Be Chains
      2. 8.3.2. Map-side Join: Sequentially Reading Data from Multiple Sorted Inputs
        1. 8.3.2.1. Examining Join Datasets
        2. 8.3.2.2. Under the Covers: How a Join Works
        3. 8.3.2.3. Types of Joins Supported
          1. 8.3.2.3.1. Inner Join
          2. 8.3.2.3.2. Outer Join
          3. 8.3.2.3.3. Override Join
          4. 8.3.2.3.4. Composing Your Own Join Operators
        4. 8.3.2.4. Details of a Join Specification
        5. 8.3.2.5. Handling Duplicate Keys in a Dataset
        6. 8.3.2.6. Composing a Join Specification
          1. 8.3.2.6.1. String CompositeInputFormat.compose(Class<? extends InputFormat> inf, String path)
          2. 8.3.2.6.2. String CompositeInputFormat.compose(String op, Class<? extends InputFormat> inf, String... path)
          3. 8.3.2.6.3. String CompositeInputFormat.compose(String op, Class<? extends InputFormat> inf, Path... path)
        7. 8.3.2.7. Building and Running a Join
        8. 8.3.2.8. The Magic of the TupleWritable in the Mapper.map() Method
    4. 8.4. Aggregation: A Framework for MapReduce Jobs that Count or Aggregate Data
      1. 8.4.1. Aggregation Using Streaming
      2. 8.4.2. Aggregation Using Java Classes
      3. 8.4.3. Specifying the ValueAggregatorDescriptor Class via Configuration Parameters
      4. 8.4.4. Side Effect Files: Map and Reduce Tasks Can Write Additional Output Files
    5. 8.5. Handling Acceptable Failure Rates
      1. 8.5.1. Dealing with Task Failure
      2. 8.5.2. Skipping Bad Records
    6. 8.6. Capacity Scheduler: Execution Queues and Priorities
      1. 8.6.1. Enabling the Capacity Scheduler
    7. 8.7. Summary
  14. 9. Solving Problems with Hadoop
    1. 9.1. Design Goals
    2. 9.2. Design 1: Brute-Force MapReduce
      1. 9.2.1. A Single Reduce Task
      2. 9.2.2. Key Contents and Comparators
      3. 9.2.3. A Helper Class for Keys
      4. 9.2.4. The Mapper
      5. 9.2.5. The Combiner
      6. 9.2.6. The Reducer
      7. 9.2.7. The Driver
      8. 9.2.8. The Pluses and Minuses of the Brute-Force Design
    3. 9.3. Design 2: Custom Partitioner for Segmenting the Address Space
      1. 9.3.1. The Simple IP Range Partitioner
      2. 9.3.2. Search Space Keys for Each Reduce Task That May Contain Matching Keys
      3. 9.3.3. Helper Class for Keys Modifications
    4. 9.4. Design 3: Future Possibilities
    5. 9.5. Summary
  15. 10. Projects Based On Hadoop and Future Directions
    1. 10.1. Hadoop Core–Related Projects
      1. 10.1.1. HBase: HDFS-Based Column-Oriented Table
      2. 10.1.2. Hive: The Data Warehouse that Facebook Built
        1. 10.1.2.1. Setting Up and Running Hive
      3. 10.1.3. Pig, the Other Latin: A Scripting Language for Dataset Analysis
      4. 10.1.4. Mahout: Machine Learning Algorithms
      5. 10.1.5. Hama: A Parallel Matrix Computation Framework
      6. 10.1.6. ZooKeeper: A High-Performance Collaboration Service
      7. 10.1.7. Lucene: The Open Source Search Engine
        1. 10.1.7.1. SOLR: A Rich Set of Interfaces to Lucene
        2. 10.1.7.2. Katta: A Distributed Lucene Index Server
      8. 10.1.8. Thrift and Protocol Buffers
      9. 10.1.9. Cascading: A MapReduce Framework for Complex Flows
      10. 10.1.10. CloudStore: A Distributed File System
      11. 10.1.11. Hypertable: A Distributed Column-Oriented Database
      12. 10.1.12. Greenplum: An Analytic Engine with SQL
      13. 10.1.13. CloudBase: Data Warehousing
    2. 10.2. Hadoop in the Cloud
      1. 10.2.1. Amazon
      2. 10.2.2. Cloudera
        1. 10.2.2.1. Training
        2. 10.2.2.2. Supported Distribution
        3. 10.2.2.3. Paid Support
      3. 10.2.3. Scale Unlimited
    3. 10.3. API Changes in Hadoop 0.20.0
      1. 10.3.1. Vaidya: A Rule-Based Performance Diagnostic Tool for MapReduce Jobs
      2. 10.3.2. Service Level Authorization (SLA)
      3. 10.3.3. Removal of LZO Compression Codecs and the API Glue
      4. 10.3.4. New MapReduce Context APIs and Deprecation of the Old Parameter Passing APIs
    4. 10.4. Additional Features in the Example Code
      1. 10.4.1. Zero-Configuration, Two-Node Virtual Cluster for Testing
      2. 10.4.2. Eclipse Project for the Example Code
    5. 10.5. Summary
  16. A. The JobConf Object in Detail
    1. A.1. JobConf Object in the Driver and Tasks
    2. A.2. JobConf Is a Properties Table
    3. A.3. Variable Expansion
    4. A.4. Final Values
    5. A.5. Constructors
      1. A.5.1. public JobConf()
      2. A.5.2. public JobConf(Class exampleClass)
      3. A.5.3. public JobConf(Configuration conf)
      4. A.5.4. public JobConf(Configuration conf, Class exampleClass)
      5. A.5.5. public JobConf(String config)
      6. A.5.6. public JobConf(Path config)
      7. A.5.7. public JobConf(boolean loadDefaults)
    6. A.6. Methods for Loading Additional Configuration Resources
      1. A.6.1. public void setQuietMode(boolean quietmode)
      2. A.6.2. public void addResource(String name)
      3. A.6.3. public void addResource(URL url)
      4. A.6.4. public void addResource(Path file)
      5. A.6.5. public void addResource(InputStream in)
      6. A.6.6. public void reloadConfiguration()
    7. A.7. Basic Getters and Setters
      1. A.7.1. public String get(String name)
      2. A.7.2. public String getRaw(String name)
      3. A.7.3. public void set(String name, String value)
      4. A.7.4. public String get(String name, String defaultValue)
      5. A.7.5. public int getInt(String name, int defaultValue)
      6. A.7.6. public void setInt(String name, int value)
      7. A.7.7. public long getLong(String name, long defaultValue)
      8. A.7.8. public void setLong(String name, long value)
      9. A.7.9. public float getFloat(String name, float defaultValue)
      10. A.7.10. public boolean getBoolean(String name, boolean defaultValue)
      11. A.7.11. public void setBoolean(String name, boolean value)
      12. A.7.12. public Configuration.IntegerRanges getRange(String name, String defaultValue)
      13. A.7.13. public Collection<String> getStringCollection(String name)
      14. A.7.14. public String[] getStrings(String name)
      15. A.7.15. public String[] getStrings(String name, String... defaultValue)
      16. A.7.16. public void setStrings(String name, String... values)
      17. A.7.17. public Class<?> getClassByName(String name) throws ClassNotFoundException
      18. A.7.18. public Class<?>[] getClasses(String name, Class<?>... defaultValue)
      19. A.7.19. public Class<?> getClass(String name, Class<?> defaultValue)
      20. A.7.20. public <U> Class<? extends U> getClass(String name, Class<? extends U> defaultValue, Class<U> xface)
      21. A.7.21. public void setClass(String name, Class<?> theClass, Class<?> xface)
    8. A.8. Getters for Localized and Load Balanced Paths
      1. A.8.1. public Path getLocalPath(String dirsProp, String pathTrailer) throws IOException
      2. A.8.2. public File getFile(String dirsProp, String pathTrailer) throws IOException
      3. A.8.3. public String[] getLocalDirs() throws IOException
      4. A.8.4. public void deleteLocalFiles() throws IOException
      5. A.8.5. public void deleteLocalFiles(String subdir) throws IOException
      6. A.8.6. public Path getLocalPath(String pathString) throws IOException
      7. A.8.7. public String getJobLocalDir()
    9. A.9. Methods for Accessing Classpath Resources
      1. A.9.1. public URL getResource(String name)
      2. A.9.2. public InputStream getConfResourceAsInputStream(String name)
      3. A.9.3. public Reader getConfResourceAsReader(String name)
    10. A.10. Methods for Controlling the Task Classpath
      1. A.10.1. public String getJar()
      2. A.10.2. public void setJar(String jar)
      3. A.10.3. public void setJarByClass(Class cls)
    11. A.11. Methods for Controlling the Task Execution Environment
      1. A.11.1. public String getUser()
      2. A.11.2. public void setUser(String user)
      3. A.11.3. public void setKeepFailedTaskFiles(boolean keep)
      4. A.11.4. public boolean getKeepFailedTaskFiles()
      5. A.11.5. public void setKeepTaskFilesPattern(String pattern)
      6. A.11.6. public String getKeepTaskFilesPattern()
      7. A.11.7. public void setWorkingDirectory(Path dir)
      8. A.11.8. public Path getWorkingDirectory()
      9. A.11.9. public void setNumTasksToExecutePerJvm(int numTasks)
      10. A.11.10. public int getNumTasksToExecutePerJvm()
    12. A.12. Methods for Controlling the Input and Output of the Job
      1. A.12.1. public InputFormat getInputFormat()
      2. A.12.2. public void setInputFormat(Class<? extends InputFormat> theClass)
      3. A.12.3. public OutputFormat getOutputFormat()
      4. A.12.4. public void setOutputFormat(Class<? extends OutputFormat> theClass)
      5. A.12.5. public OutputCommitter getOutputCommitter()
      6. A.12.6. public void setOutputCommitter(Class<? extends OutputCommitter> theClass)
      7. A.12.7. public void setCompressMapOutput(boolean compress)
      8. A.12.8. public boolean getCompressMapOutput()
      9. A.12.9. public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass)
      10. A.12.10. public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue)
      11. A.12.11. public void setMapOutputKeyClass(Class<?> theClass)
      12. A.12.12. public Class<?> getMapOutputKeyClass()
      13. A.12.13. public Class<?> getMapOutputValueClass()
      14. A.12.14. public void setMapOutputValueClass(Class<?> theClass)
      15. A.12.15. public Class<?> getOutputKeyClass()
      16. A.12.16. public void setOutputKeyClass(Class<?> theClass)
      17. A.12.17. public Class<?> getOutputValueClass()
      18. A.12.18. public void setOutputValueClass(Class<?> theClass)
    13. A.13. Methods for Controlling Output Partitioning and Sorting for the Reduce
      1. A.13.1. public RawComparator getOutputKeyComparator()
      2. A.13.2. public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass)
      3. A.13.3. public void setKeyFieldComparatorOptions(String keySpec)
      4. A.13.4. public String getKeyFieldComparatorOption()
      5. A.13.5. public Class<? extends Partitioner> getPartitionerClass()
      6. A.13.6. public void setPartitionerClass(Class<? extends Partitioner> theClass)
      7. A.13.7. public void setKeyFieldPartitionerOptions(String keySpec)
      8. A.13.8. public String getKeyFieldPartitionerOption()
      9. A.13.9. public RawComparator getOutputValueGroupingComparator()
      10. A.13.10. public void setOutputValueGroupingComparator(Class<? extends RawComparator> theClass)
    14. A.14. Methods that Control Map and Reduce Tasks
      1. A.14.1. public Class<? extends Mapper> getMapperClass()
      2. A.14.2. public void setMapperClass(Class<? extends Mapper> theClass)
      3. A.14.3. public Class<? extends MapRunnable> getMapRunnerClass()
      4. A.14.4. public void setMapRunnerClass(Class<? extends MapRunnable> theClass)
      5. A.14.5. public Class<? extends Reducer> getReducerClass()
      6. A.14.6. public void setReducerClass(Class<? extends Reducer> theClass)
      7. A.14.7. public Class<? extends Reducer> getCombinerClass()
      8. A.14.8. public void setCombinerClass(Class<? extends Reducer> theClass)
      9. A.14.9. public boolean getSpeculativeExecution()
      10. A.14.10. public void setSpeculativeExecution(boolean speculativeExecution)
      11. A.14.11. public boolean getMapSpeculativeExecution()
      12. A.14.12. public void setMapSpeculativeExecution(boolean speculativeExecution)
      13. A.14.13. public boolean getReduceSpeculativeExecution()
      14. A.14.14. public void setReduceSpeculativeExecution(boolean speculativeExecution)
      15. A.14.15. public int getNumMapTasks()
      16. A.14.16. public void setNumMapTasks(int n)
      17. A.14.17. public int getNumReduceTasks()
      18. A.14.18. public void setNumReduceTasks(int n)
      19. A.14.19. public int getMaxMapAttempts()
      20. A.14.20. public void setMaxMapAttempts(int n)
      21. A.14.21. public int getMaxReduceAttempts()
      22. A.14.22. public void setMaxReduceAttempts(int n)
      23. A.14.23. public void setMaxTaskFailuresPerTracker(int noFailures)
      24. A.14.24. public int getMaxTaskFailuresPerTracker()
      25. A.14.25. public int getMaxMapTaskFailuresPercent()
      26. A.14.26. public void setMaxMapTaskFailuresPercent(int percent)
      27. A.14.27. public int getMaxReduceTaskFailuresPercent()
      28. A.14.28. public void setMaxReduceTaskFailuresPercent(int percent)
    15. A.15. Methods Providing Control Over Job Execution and Naming
      1. A.15.1. public String getJobName()
      2. A.15.2. public void setJobName(String name)
      3. A.15.3. public String getSessionId()
      4. A.15.4. public void setSessionId(String sessionId)
      5. A.15.5. public JobPriority getJobPriority()
      6. A.15.6. public void setJobPriority(JobPriority prio)
      7. A.15.7. public boolean getProfileEnabled()
      8. A.15.8. public void setProfileEnabled(boolean newValue)
      9. A.15.9. public String getProfileParams()
      10. A.15.10. public void setProfileParams(String value)
      11. A.15.11. public Configuration.IntegerRanges getProfileTaskRange(boolean isMap)
      12. A.15.12. public void setProfileTaskRange(boolean isMap, String newValue)
      13. A.15.13. public String getMapDebugScript()
      14. A.15.14. public void setMapDebugScript(String mDbgScript)
      15. A.15.15. public String getReduceDebugScript()
      16. A.15.16. public void setReduceDebugScript(String rDbgScript)
      17. A.15.17. public String getJobEndNotificationURI()
      18. A.15.18. public void setJobEndNotificationURI(String uri)
      19. A.15.19. public String getQueueName()
      20. A.15.20. public void setQueueName(String queueName)
      21. A.15.21. long getMaxVirtualMemoryForTask()
      22. A.15.22. void setMaxVirtualMemoryForTask(long vmem)
    16. A.16. Convenience Methods
      1. A.16.1. public int size()
      2. A.16.2. public void clear()
      3. A.16.3. public Iterator<Map.Entry<String,String>> iterator()
      4. A.16.4. public void writeXml(OutputStream out) throws IOException
      5. A.16.5. public ClassLoader getClassLoader()
      6. A.16.6. public void setClassLoader(ClassLoader classLoader)
      7. A.16.7. public String toString()
    17. A.17. Methods Used to Pass Configurations Through SequenceFiles
      1. A.17.1. public void readFields(DataInput in) throws IOException
      2. A.17.2. public void write(DataOutput out) throws IOException

Product information

  • Title: Pro Hadoop
  • Author(s): Jason Venner
  • Release date: June 2009
  • Publisher(s): Apress
  • ISBN: 9781430219422