Implementing IBM InfoSphere BigInsights on IBM System x

Book description

As world activities become more integrated, the rate of data growth has been increasing exponentially. As a result of this data explosion, current data management methods can become inadequate. People use the term big data (sometimes written as Big Data) to describe this latest industry trend. IBM® is preparing the next generation of technology to meet these data management challenges.

To provide the capability of incorporating big data sources and analytics of these sources, IBM developed a stream-computing product that is based on the open source computing framework Apache Hadoop. Each product in the framework provides unique capabilities to the data management environment, and further enhances the value of your data warehouse investment.

In this IBM Redbooks® publication, we describe the need for big data in an organization. We then introduce IBM InfoSphere® BigInsights™ and explain how it differs from standard Hadoop. BigInsights provides a packaged Hadoop distribution, a greatly simplified installation of Hadoop and corresponding open source tools for application development, data movement, and cluster management. BigInsights also brings more options for data security, and as a component of the IBM big data platform, it provides potential integration points with the other components of the platform.

A new chapter has been added to this edition. Chapter 11 describes IBM Platform Symphony®, a new scheduling product that works with IBM InfoSphere BigInsights and brings low-latency scheduling and multi-tenancy to it.

The book is designed for clients, consultants, and other technical professionals.

Table of contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. Authors
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Summary of changes
    1. June 2013, Second Edition
  5. Chapter 1. A whole new world of big data
    1. 1.1 What is big data
    2. 1.2 The big data challenge
      1. 1.2.1 The traditional data warehouse in relation to big data
      2. 1.2.2 How continual data growth affects data warehouse storage
    3. 1.3 How IBM is answering the big data challenge
      1. 1.3.1 Big data platform
      2. 1.3.2 Big data Enterprise Engines
    4. 1.4 Why you should care
  6. Chapter 2. Why choose BigInsights
    1. 2.1 BigInsights introduction
    2. 2.2 What is Hadoop?
      1. 2.2.1 Hadoop Distributed File System in more detail
      2. 2.2.2 MapReduce in more detail
    3. 2.3 What is BigInsights?
      1. 2.3.1 All-in-one installation
      2. 2.3.2 Integration with existing information architectures
      3. 2.3.3 Enterprise class support
      4. 2.3.4 Enterprise class functionality
      5. 2.3.5 BigSheets
      6. 2.3.6 BigInsights scheduler
      7. 2.3.7 Text analytics
    4. 2.4 BigInsights and the traditional data warehouse
      1. 2.4.1 How can BigInsights complement my data warehouse?
    5. 2.5 Use cases for BigInsights
      1. 2.5.1 Industry-based use cases
      2. 2.5.2 Social Media use case
  7. Chapter 3. BigInsights network architecture
    1. 3.1 Network design overview
    2. 3.2 Logical network planning
      1. 3.2.1 Deciding between 1 Gbps and 10 Gbps
      2. 3.2.2 Switch and Node Adapter redundancy: costs and trade-offs
    3. 3.3 Networking zones
      1. 3.3.1 Corporate Management network
      2. 3.3.2 Corporate Administration network
      3. 3.3.3 Private Data network
      4. 3.3.4 Optional Head Node configuration considerations
    4. 3.4 Network configuration options
      1. 3.4.1 Value configuration
      2. 3.4.2 Performance configuration
      3. 3.4.3 Enterprise option
    5. 3.5 Suggested IBM system networking switches
      1. 3.5.1 Value configuration switches
      2. 3.5.2 Performance configuration switch
    6. 3.6 How to work with multiple racks
      1. 3.6.1 Value configuration
      2. 3.6.2 Performance configuration
      3. 3.6.3 Enterprise option
    7. 3.7 How to improve performance
      1. 3.7.1 Network port bonding
      2. 3.7.2 Extra capacity through more hardware provided for redundancy
      3. 3.7.3 Virtual Link Aggregation Groups for greater multi-rack throughput
    8. 3.8 Physical network planning
      1. 3.8.1 IP address quantities and networking into existing corporate networks
      2. 3.8.2 Power and cooling
  8. Chapter 4. BigInsights hardware architecture
    1. 4.1 Roles of the management and data nodes
      1. 4.1.1 The management node
      2. 4.1.2 The data node
    2. 4.2 Using multiple management nodes
    3. 4.3 Storage and adapters used in the hardware architecture
      1. 4.3.1 RAID versus JBOD
      2. 4.3.2 Disk virtualization
      3. 4.3.3 Compression
    4. 4.4 The IBM hardware portfolio
      1. 4.4.1 The IBM System x3550 M4 as a management node
      2. 4.4.2 The IBM System x3630 M4 as a data node
    5. 4.5 Lead configuration for the BigInsights management node
      1. 4.5.1 Use two E5-2650, 2.0 GHz, 8-core processors in your management node
      2. 4.5.2 Memory for your management node
      3. 4.5.3 Dual power cables per management node
      4. 4.5.4 Two network adapters per management node
      5. 4.5.5 Storage controllers on the management node
      6. 4.5.6 Hard disk drives in the management node
    6. 4.6 Lead configuration for the BigInsights data node
      1. 4.6.1 Processor options for the data node
      2. 4.6.2 Memory considerations for the data node
      3. 4.6.3 Other considerations for the data node
      4. 4.6.4 Data node configuration options
      5. 4.6.5 Pre-defined rack configurations
      6. 4.6.6 Storage considerations
      7. 4.6.7 Basic input/output system tool
  9. Chapter 5. Operating system prerequisites for BigInsights
    1. 5.1 Prerequisite software
      1. 5.1.1 Operating provisioning software
      2. 5.1.2 Yellowdog Updater Modified repository
      3. 5.1.3 Operating system packages
    2. 5.2 Operating system settings related to software
      1. 5.2.1 System clock synchronization
      2. 5.2.2 Services to disable for improved performance
      3. 5.2.3 Raising the ulimits setting to accommodate Hadoop’s data processing within BigInsights
      4. 5.2.4 Optional: set up password-less Secure Shell
    3. 5.3 Optionally configure /etc/hosts
    4. 5.4 Operating system settings related to hardware
      1. 5.4.1 Operating system level settings if optional network cards were added
      2. 5.4.2 Storage configuration
  10. Chapter 6. BigInsights installation
    1. 6.1 Preparing the environment for installation
    2. 6.2 Installing BigInsights using the graphical user interface
    3. 6.3 Silent installation of BigInsights
      1. 6.3.1 Installing BigInsights using the silent installation option
    4. 6.4 How to install the Eclipse plug-in
    5. 6.5 Common installation pitfalls
  11. Chapter 7. Cluster validation
    1. 7.1 Cluster validation
      1. 7.1.1 Initial validation
      2. 7.1.2 Running the built-in health check utility
      3. 7.1.3 Simple applications to run
    2. 7.2 Performance considerations
    3. 7.3 TeraSort scalability and performance test example
    4. 7.4 Other useful scripts
      1. 7.4.1 addnode.sh
      2. 7.4.2 credstore.sh
      3. 7.4.3 synconf.sh
      4. 7.4.4 start.sh, stop.sh, start-all.sh, and stop-all.sh
      5. 7.4.5 status.sh
  12. Chapter 8. BigInsights capabilities
    1. 8.1 Data ingestion
      1. 8.1.1 Loading data from files using the web console
      2. 8.1.2 Loading files from the command line
      3. 8.1.3 Loading data from a data warehouse
      4. 8.1.4 Loading frequently updated files
    2. 8.2 BigSheets
    3. 8.3 Web console
    4. 8.4 Text Analytics
      1. 8.4.1 Text analytics architecture
      2. 8.4.2 Log file processing example
  13. Chapter 9. BigInsights hardware monitoring and alerting
    1. 9.1 BigInsights monitoring
      1. 9.1.1 Workflows and scheduled workflows
      2. 9.1.2 MapReduce jobs
      3. 9.1.3 Job and task counters
    2. 9.2 Nigel's monitor
      1. 9.2.1 nmon within a shell terminal
      2. 9.2.2 Saving nmon output to a file
    3. 9.3 Ganglia
      1. 9.3.1 Ganglia installation (optional)
      2. 9.3.2 Ganglia configuration (if installed)
      3. 9.3.3 Multicast versus unicast
      4. 9.3.4 Large cluster considerations
      5. 9.3.5 BigInsights 1.4 configuration to enable Hadoop metrics with Ganglia
    4. 9.4 Nagios
    5. 9.5 IBM Tivoli OMNIbus and Network Manager
      1. 9.5.1 Tivoli Netcool Configuration Manager
      2. 9.5.2 Highlights of Tivoli Netcool Configuration Manager
      3. 9.5.3 IBM Tivoli Netcool/OMNIbus
      4. 9.5.4 IBM Tivoli Network Manager IP
    6. 9.6 IBM System Networking Element Manager
      1. 9.6.1 Product features
      2. 9.6.2 Software summary
  14. Chapter 10. BigInsights security design
    1. 10.1 BigInsights security overview
    2. 10.2 Authorization
      1. 10.2.1 Roles
    3. 10.3 Authentication
      1. 10.3.1 Flat file
      2. 10.3.2 Lightweight Directory Access Protocol
      3. 10.3.3 Pluggable Authentication Module
    4. 10.4 Secure browser support
  15. Chapter 11. IBM Platform Symphony
    1. 11.1 Overview
    2. 11.2 The changing nature of distributed computing
    3. 11.3 About IBM Platform Symphony
    4. 11.4 IBM Platform Symphony architecture
    5. 11.5 Platform Symphony MapReduce framework
    6. 11.6 Multi-tenancy built in
    7. 11.7 How InfoSphere BigInsights works with Platform Symphony
    8. 11.8 Understanding the Platform Symphony performance benefit
    9. 11.9 Supported applications
    10. 11.10 BigInsights versions supported
    11. 11.11 Summary
  16. Appendix A. M4 reference architecture
    1. The M4 series of servers: Bill of materials
    2. IBM x3630 M4: The data node
  17. Appendix B. Installation values
    1. BigInsights default installation values
    2. Open source technologies and version numbers
    3. Ganglia monitoring options
  18. Appendix C. Checklist
    1. BIOS settings to check
    2. Networking settings to verify operating system
    3. Operating system settings to check
    4. BigInsights configuration changes to consider
  19. Related publications
    1. IBM Redbooks
    2. Other publications
    3. Online resources
    4. Help from IBM
  20. Back cover

Product information

  • Title: Implementing IBM InfoSphere BigInsights on IBM System x
  • Author(s):
  • Release date: June 2013
  • Publisher(s): IBM Redbooks
  • ISBN: None