IBM Data Engine for Hadoop and Spark

Book description

This IBM® Redbooks® publication helps the technical community take advantage of the resilience, scalability, and performance of the IBM Power Systems™ platform to implement or integrate an IBM Data Engine for Hadoop and Spark solution that accesses, manages, and analyzes data sets to improve business outcomes.

This book demonstrates how to take advantage of the analytics strengths of the IBM POWER8® platform, the IBM analytics software portfolio, and selected third-party tools to help meet customers' data analytics workload requirements. It describes how to plan, prepare, install, integrate, manage, and use the IBM Data Engine for Hadoop and Spark solution to run analytic workloads on IBM POWER8. In addition, this publication complements the documentation for available IBM analytics solutions to help meet your data analytics needs.

This publication strengthens the position of IBM analytics and big data solutions with a well-defined and documented deployment model within an IBM POWER8 virtualized environment so that customers have a planned foundation for security, scaling, capacity, resilience, and optimization for analytics workloads.

This book is targeted at technical professionals (analytics consultants, technical support staff, IT Architects, and IT Specialists) who are responsible for delivering analytics solutions and support on IBM Power Systems.

Table of contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. IBM Redbooks promotions
  4. Preface
    1. Authors
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  5. Chapter 1. Introduction to IBM Data Engine for Hadoop and Spark
    1. 1.1 What is big data
      1. 1.1.1 Structured and unstructured data
      2. 1.1.2 The four Vs of big data
      3. 1.1.3 The traditional data warehouse in relation to big data
    2. 1.2 Big data analytics
    3. 1.3 What is Apache Spark
      1. 1.3.1 Apache Hadoop and MapReduce versus Apache Spark
    4. 1.4 Why use an IBM Big Data and analytics solution
      1. 1.4.1 IBM Spectrum Scale file system as alternative to Hadoop File System
      2. 1.4.2 IBM Spectrum Conductor for Spark
      3. 1.4.3 IBM Open Platform with Apache Hadoop
      4. 1.4.4 IBM Spectrum Symphony
      5. 1.4.5 IBM Platform Cluster Manager
    5. 1.5 Why big data on IBM Power Systems servers
    6. 1.6 IBM Data Engine for Hadoop and Spark
  6. Chapter 2. Solution reference architecture
    1. 2.1 Overview of the solution
    2. 2.2 High-level architecture
    3. 2.3 Hardware components of the solution
      1. 2.3.1 The IBM Power System S812LC server
      2. 2.3.2 Networking
    4. 2.4 Software reference architecture
      1. 2.4.1 IBM Open Platform with Apache Hadoop clusters
      2. 2.4.2 Stand-alone products: IBM Spectrum Scale and IBM Spectrum Symphony
      3. 2.4.3 Cluster management
      4. 2.4.4 Additional analytics software: IBM Spectrum Conductor with Spark
      5. 2.4.5 Software options
    5. 2.5 Solution reference architecture
      1. 2.5.1 Configuration
      2. 2.5.2 Predefined configurations
      3. 2.5.3 Sizing the solution
      4. 2.5.4 Rack, power, and cooling information
  7. Chapter 3. Use case scenario for the IBM Data Engine for Hadoop and Spark
    1. 3.1 When to use IBM Data Engine for Hadoop and Spark
    2. 3.2 When to use Hadoop and what workloads are suitable for it
      1. 3.2.1 Landing Zone
      2. 3.2.2 Data warehouse offloading
    3. 3.3 When to use Apache Spark and what workloads are suitable for it
    4. 3.4 Greater resource utilization by using IBM Spectrum Symphony
    5. 3.5 Comparing Hadoop Distributed File System and IBM Spectrum Scale
    6. 3.6 Using the analytic capabilities of IBM Open Platform
  8. Chapter 4. Operational guidelines
    1. 4.1 Introduction
    2. 4.2 Adding a compute node
      1. 4.2.1 Identifying the networks
      2. 4.2.2 Defining the Central Electronics Complex group
      3. 4.2.3 Updating the server firmware
      4. 4.2.4 Installing the base operating system
      5. 4.2.5 Configuring the host name, users, and groups
      6. 4.2.6 Installing and configuring IBM Spectrum Scale
      7. 4.2.7 Installing software with Ambari
    3. 4.3 Configuring the Apache Spark UI
    4. 4.4 Deployment and operation tools
      1. 4.4.1 List of tools
  9. Chapter 5. Multitenancy
    1. 5.1 Introduction to multitenancy
    2. 5.2 IBM Spectrum Computing resource manager
    3. 5.3 Configuring multitenancy for MapReduce workloads
      1. 5.3.1 Monitoring MapReduce jobs by using IBM Spectrum Symphony
      2. 5.3.2 Creating an application profile
      3. 5.3.3 Adding users or groups to an existing application profile
      4. 5.3.4 Configuring the share ratio between application profiles
      5. 5.3.5 Configuring slot mapping
      6. 5.3.6 Configuring the priority for running jobs
  10. Appendix A. Ordering the solution
    1. Predefined configuration
    2. How to use the IBM Configurator for e-business (e-config)
    3. Services
  11. Appendix B. Script to clone partitions
    1. Clone partitions script
  12. Related publications
    1. IBM Redbooks
    2. Online resources
    3. Help from IBM
  13. Back cover

Product information

  • Title: IBM Data Engine for Hadoop and Spark
  • Author(s): Dino Quintero, Luis Bolinches, Aditya Gandakusuma Sutandyo, Nicolas Joly, Reinaldo Tetsuo Katahira
  • Release date: August 2016
  • Publisher(s): IBM Redbooks
  • ISBN: 9780738441931