InfoSphere DataStage Parallel Framework Standard Practices

Book description

In this IBM® Redbooks® publication, we present guidelines for the development of highly efficient and scalable information integration applications with InfoSphere™ DataStage® (DS) parallel jobs.

InfoSphere DataStage is at the core of IBM Information Server, providing components that give developers a high degree of freedom. For any particular problem there might be multiple solutions, typically influenced by personal preference, background, and previous experience. All too often, those solutions yield suboptimal, non-scalable implementations.

This book includes a comprehensive, detailed description of the available components, along with guidance on how to use them to build scalable and efficient solutions for both batch and real-time scenarios.

The advice in this book is the product of the combined, proven experience of a number of expert practitioners in the field of high-performance information integration, refined over several years.

This book is intended for IT architects, Information Management specialists, and Information Integration specialists responsible for delivering cost-effective IBM InfoSphere DataStage performance on all platforms.

Table of contents

  Front cover
  Notices
    Trademarks
  Preface
    The team who wrote this book
    Now you can become a published author, too!
    Comments welcome
    Stay connected to IBM Redbooks
  Chapter 1. Data integration with Information Server and DataStage
    1.1 Information Server 8
      1.1.1 Architecture and information tiers
    1.2 IBM Information Management InfoSphere Services
    1.3 Center of Excellence for Data Integration (CEDI)
    1.4 Workshops for IBM InfoSphere DataStage
  Chapter 2. Data integration overview
    2.1 Job sequences
    2.2 Job types
      2.2.1 Transformation jobs
      2.2.2 Hybrid jobs
      2.2.3 Provisioning jobs
  Chapter 3. Standards
    3.1 Directory structures
      3.1.1 Metadata layer
      3.1.2 Data, install, and project directory structures
      3.1.3 Extending the DataStage project for external entities
      3.1.4 File staging
    3.2 Naming conventions
      3.2.1 Key attributes of the naming convention
      3.2.2 Designer object layout
      3.2.3 Documentation and metadata capture
      3.2.4 Naming conventions by object type
    3.3 Documentation and annotation
    3.4 Working with source code control systems
      3.4.1 Source code control standards
      3.4.2 Using object categorization standards
      3.4.3 Export to source code control system
  Chapter 4. Job parameter and environment variable management
    4.1 DataStage environment variables
      4.1.1 DataStage environment variable scope
      4.1.2 Special values for DataStage environment variables
      4.1.3 Environment variable settings
      4.1.4 Migrating project-level environment variables
    4.2 DataStage job parameters
      4.2.1 When to use parameters
      4.2.2 Parameter standard practices
      4.2.3 Specifying default parameter values
      4.2.4 Parameter sets
  Chapter 5. Development guidelines
    5.1 Modular development
    5.2 Establishing job boundaries
    5.3 Job design templates
    5.4 Default job design
    5.5 Parallel shared containers
    5.6 Error and reject record handling
      5.6.1 Reject handling with the Sequential File stage
      5.6.2 Reject handling with the Lookup stage
      5.6.3 Reject handling with the Transformer stage
      5.6.4 Reject handling with Target Database stages
      5.6.5 Error processing requirements
    5.7 Component usage
      5.7.1 Server Edition components
      5.7.2 Copy stage
      5.7.3 Parallel datasets
      5.7.4 Parallel Transformer stages
      5.7.5 BuildOp stages
    5.8 Job design considerations for usage and impact analysis
      5.8.1 Maintaining the job design:table definition connection
      5.8.2 Verifying the job design:table definition connection
  Chapter 6. Partitioning and collecting
    6.1 Partition types
      6.1.1 Auto partitioning
      6.1.2 Keyless partitioning
      6.1.3 Keyed partitioning
      6.1.4 Hash partitioning
    6.2 Monitoring partitions
    6.3 Partition methodology
    6.4 Partitioning examples
      6.4.1 Partitioning example 1: Optimized partitioning
      6.4.2 Partitioning example 2: Use of Entire partitioning
    6.5 Collector types
      6.5.1 Auto collector
      6.5.2 Round-robin collector
      6.5.3 Ordered collector
      6.5.4 Sort Merge collector
    6.6 Collecting methodology
  Chapter 7. Sorting
    7.1 Partition and sort keys
    7.2 Complete (Total) sort
    7.3 Link sort and Sort stage
      7.3.1 Link sort
      7.3.2 Sort stage
    7.4 Stable sort
    7.5 Subsorts
    7.6 Automatically inserted sorts
    7.7 Sort methodology
    7.8 Tuning sort
      7.8.1 Sorts and variable-length fields
  Chapter 8. File Stage usage
    8.1 Dataset usage
    8.2 Sequential File stages (Import and export)
      8.2.1 Reading from a sequential file in parallel
      8.2.2 Writing to a sequential file in parallel
      8.2.3 Separating I/O from column import
      8.2.4 Partitioning sequential file reads
      8.2.5 Sequential file (Export) buffering
      8.2.6 Parameterized sequential file format
      8.2.7 Reading and writing nullable columns
      8.2.8 Reading from and writing to fixed-length files
      8.2.9 Reading bounded-length VARCHAR columns
      8.2.10 Tuning sequential file performance
    8.3 Complex Flat File stage
      8.3.1 CFF stage data type mapping
    8.4 Filesets
  Chapter 9. Transformation languages
    9.1 Transformer stage
      9.1.1 Transformer NULL handling and reject link
      9.1.2 Parallel Transformer system variables
      9.1.3 Transformer derivation evaluation
      9.1.4 Conditionally aborting jobs
      9.1.5 Using environment variable parameters
      9.1.6 Transformer decimal arithmetic
      9.1.7 Optimizing Transformer expressions and stage variables
    9.2 Modify stage
      9.2.1 Modify and null handling
      9.2.2 Modify and string trim
    9.3 Filter and Switch stages
  Chapter 10. Combining data
    10.1 Lookup versus Join versus Merge
    10.2 Capturing unmatched records from a Join
    10.3 The Aggregator stage
      10.3.1 Aggregation method
      10.3.2 Aggregation data type
      10.3.3 Performing total aggregations
    10.4 Comparison stages
    10.5 Checksum
    10.6 SCD stage
  Chapter 11. Restructuring data
    11.1 Complex data types
      11.1.1 Vectors
      11.1.2 Subrecords
      11.1.3 Tagged fields
    11.2 The Restructure library
      11.2.1 Tagbatch and Tagswitch
      11.2.2 Importing complex record types
    11.3 The Pivot Enterprise stage
  Chapter 12. Performance tuning job designs
    12.1 Designing a job for optimal performance
    12.2 Understanding operator combination
    12.3 Minimizing runtime processes and resource requirements
    12.4 Understanding buffering
      12.4.1 Inter-operator transport buffering
      12.4.2 Deadlock prevention buffering
  Chapter 13. Database stage guidelines
    13.1 Existing database development overview
      13.1.1 Existing database stage types
      13.1.2 Database metadata
      13.1.3 Optimizing select lists
      13.1.4 Testing database connectivity
      13.1.5 Designing for restart
      13.1.6 Database OPEN and CLOSE commands
      13.1.7 Database sparse lookup versus join
      13.1.8 Appropriate use of SQL and DataStage
    13.2 Existing DB2 guidelines
      13.2.1 Existing DB2 stage types
      13.2.2 Connecting to DB2 with the DB2/UDB Enterprise stage
      13.2.3 Configuring DB2 multiple instances in one DataStage job
      13.2.4 DB2/UDB Enterprise stage column names
      13.2.5 DB2/API stage column names
      13.2.6 DB2/UDB Enterprise stage data type mapping
      13.2.7 DB2/UDB Enterprise stage options
      13.2.8 Performance notes
    13.3 Existing Informix database guidelines
      13.3.1 Informix Enterprise stage column names
      13.3.2 Informix Enterprise stage data type mapping
    13.4 ODBC Enterprise guidelines
      13.4.1 ODBC Enterprise stage column names
      13.4.2 ODBC Enterprise stage data type mapping
      13.4.3 Reading ODBC sources in parallel
      13.4.4 Writing to ODBC targets in parallel
    13.5 Oracle database guidelines
      13.5.1 Oracle Enterprise stage column names
      13.5.2 Oracle Enterprise stage data type mapping
      13.5.3 Reading from Oracle in parallel
      13.5.4 Oracle load options
    13.6 Sybase Enterprise guidelines
      13.6.1 Sybase Enterprise stage column names
      13.6.2 Sybase Enterprise stage data type mapping
    13.7 Existing Teradata database guidelines
      13.7.1 Choosing the proper Teradata stage
      13.7.2 Source Teradata stages
      13.7.3 Target Teradata stages
      13.7.4 Teradata Enterprise stage column names
      13.7.5 Teradata Enterprise stage data type mapping
      13.7.6 Specifying Teradata passwords with special characters
      13.7.7 Teradata Enterprise settings
      13.7.8 Improving Teradata Enterprise performance
    13.8 Netezza Enterprise stage
      13.8.1 Netezza write methods
      13.8.2 Limitations of Netezza Write stage
      13.8.3 Netezza Enterprise error logs
  Chapter 14. Connector stage guidelines
    14.1 Connectors and the connector framework
      14.1.1 Connectors in parallel jobs
      14.1.2 Large object (LOB) support
      14.1.3 Reject links
      14.1.4 Schema reconciliation
      14.1.5 Stage editor concepts
      14.1.6 Connection objects
      14.1.7 SQL Builder
      14.1.8 Metadata importation
    14.2 ODBC Connector
    14.3 WebSphere MQ Connector
    14.4 Teradata Connector
      14.4.1 Teradata Connector advantages
      14.4.2 Parallel Synchronization Table
      14.4.3 Parallel Transport operators
      14.4.4 Cleanup after an aborted load or update
      14.4.5 Environment variables for debugging job execution
      14.4.6 Comparison with existing Teradata stages
    14.5 DB2 Connector
      14.5.1 New features
      14.5.2 Using rejects with user-defined SQL
      14.5.3 Using alternate conductor setting
      14.5.4 Comparison with existing DB2 stages
    14.6 Oracle Connector
      14.6.1 New features and improvements
      14.6.2 Comparison with Oracle Enterprise
    14.7 DT stage
    14.8 Salesforce Connector
    14.9 Essbase Connector
    14.10 SWG Connector
  Chapter 15. Batch data flow design
    15.1 High performance batch data flow design goals
      15.1.1 Minimize time required to complete batch processing
      15.1.2 Build scalable jobs
      15.1.3 Minimize the impact of startup time
      15.1.4 Optimize network, I/O, and memory usage
      15.1.5 Plan job concurrency and degrees of parallelism
    15.2 Common bad patterns
      15.2.1 DS server mentality for parallel jobs
      15.2.2 Database sparse lookups
      15.2.3 Processing full source database refreshes
      15.2.4 Extracting much and using little (reference datasets)
      15.2.5 Reference data is too large to fit into physical memory
      15.2.6 Loading and re-extracting the same data
      15.2.7 One sequence run per input/output file
    15.3 Optimal number of stages per job
    15.4 Checkpoint/Restart
    15.5 Balanced optimization
      15.5.1 Transformations inside the database
      15.5.2 Transformations with DataStage
    15.6 Batch data flow patterns
      15.6.1 Restricting incoming data from the source
      15.6.2 A fundamental problem: Reference lookup resolution
      15.6.3 A sample database model
      15.6.4 Restricting the reference lookup dataset
      15.6.5 Correlating data
      15.6.6 Keeping Information Server as the transformation hub
      15.6.7 Accumulating reference data in local datasets
      15.6.8 Minimize number of sequence runs per processing window
      15.6.9 Separating database interfacing and transformation jobs
      15.6.10 Extracting data efficiently
      15.6.11 Uploading data efficiently
  Chapter 16. Real-time data flow design
    16.1 Definition of real-time
    16.2 Mini-batch approach
    16.3 Parallel framework in real-time applications
    16.4 DataStage extensions for real-time applications
      16.4.1 Always-on source stage types
      16.4.2 End-of-wave
      16.4.3 Transaction support
    16.5 Job topologies
      16.5.1 Summary of stage usage guidelines
      16.5.2 ISD batch topologies
    16.6 MQConnector/DTS
      16.6.1 Aspects of DTS application development
      16.6.2 Reference documentation
      16.6.3 A sample basic DTS job
      16.6.4 Design topology rules for DTS jobs
      16.6.5 Transactional processing
      16.6.6 MQ/DTS and the Information Server Framework
      16.6.7 Sample job and basic properties
      16.6.8 Runtime topologies for DTS jobs
      16.6.9 Processing order of input links
      16.6.10 Rejecting messages
      16.6.11 Database contention
      16.6.12 Scalability
      16.6.13 Design patterns to avoid
    16.7 InfoSphere Information Services Director
      16.7.1 The scope of this section
      16.7.2 Design topology rules for always-on ISD jobs
      16.7.3 Scalability
      16.7.4 Synchronizing database stages with ISD output
      16.7.5 ISD with DTS
      16.7.6 ISD with connectors
      16.7.7 Re-partitioning in ISD jobs
      16.7.8 General considerations for using ISD jobs
      16.7.9 Selecting server or EE jobs for publication through ISD
    16.8 Transactional support in message-oriented applications
    16.9 Payload processing
    16.10 Pipeline parallelism challenges
      16.10.1 Key collisions
      16.10.2 Data stubbing
      16.10.3 Parent/Child processing
    16.11 Special custom plug-ins
    16.12 Special considerations for QualityStage
  Appendix A. Runtime topologies for distributed transaction jobs
    A.1 No ordering, no relationships
    A.2 No ordering, with relationships
    A.3 Bypassing work queues
  Appendix B. Standard practices summary
    B.1 Standards
    B.2 Development guidelines
    B.3 Component usage
    B.4 DataStage data types
    B.5 Partitioning data
    B.6 Collecting data
    B.7 Sorting
    B.8 Stage-specific guidelines
    B.9 Database stage guidelines
    B.10 Troubleshooting and monitoring
  Appendix C. DataStage naming reference
  Appendix D. Example job template
  Appendix E. Understanding the parallel job score
    E.1 Viewing the job score
    E.2 Parallel job score components
  Appendix F. Estimating the size of a parallel dataset
  Appendix G. Environment variables reference
  Appendix H. DataStage data types
    H.1 Parallel data types
    H.2 Null handling
    H.3 Runtime column propagation
  Related publications
    IBM Redbooks
    Other publications
    Online resources
    How to get Redbooks
    Help from IBM
  Back cover

Product information

  • Title: InfoSphere DataStage Parallel Framework Standard Practices
  • Release date: July 2010
  • Publisher(s): IBM Redbooks
  • ISBN: 9780738434476