Book description
In this IBM® Redbooks® publication, we present guidelines for the development of highly efficient and scalable information integration applications with InfoSphere™ DataStage® (DS) parallel jobs.
InfoSphere DataStage is at the core of IBM Information Server, providing components that offer a high degree of design freedom. For any particular problem there may be multiple solutions, influenced by personal preferences, background, and previous experience. All too often, those choices yield suboptimal, non-scalable implementations.
This book includes a comprehensive, detailed description of the components available and explains how to use them to build scalable, efficient solutions for both batch and real-time scenarios.
The advice in this document reflects the combined, proven experience of expert practitioners in the field of high-performance information integration, refined over several years.
This book is intended for IT architects, Information Management specialists, and Information Integration specialists responsible for delivering cost-effective IBM InfoSphere DataStage performance on all platforms.
Table of contents
- Front cover
- Notices
- Preface
- Chapter 1. Data integration with Information Server and DataStage
- Chapter 2. Data integration overview
- Chapter 3. Standards
- Chapter 4. Job parameter and environment variable management
- Chapter 5. Development guidelines
- Chapter 6. Partitioning and collecting
- Chapter 7. Sorting
- Chapter 8. File Stage usage
- 8.1 Dataset usage
- 8.2 Sequential File stages (Import and export)
- 8.2.1 Reading from a sequential file in parallel
- 8.2.2 Writing to a sequential file in parallel
- 8.2.3 Separating I/O from column import
- 8.2.4 Partitioning sequential file reads
- 8.2.5 Sequential file (Export) buffering
- 8.2.6 Parameterized sequential file format
- 8.2.7 Reading and writing nullable columns
- 8.2.8 Reading from and writing to fixed-length files
- 8.2.9 Reading bounded-length VARCHAR columns
- 8.2.10 Tuning sequential file performance
- 8.3 Complex Flat File stage
- 8.4 Filesets
- Chapter 9. Transformation languages
- 9.1 Transformer stage
- 9.1.1 Transformer NULL handling and reject link
- 9.1.2 Parallel Transformer system variables
- 9.1.3 Transformer derivation evaluation
- 9.1.4 Conditionally aborting jobs
- 9.1.5 Using environment variable parameters
- 9.1.6 Transformer decimal arithmetic
- 9.1.7 Optimizing Transformer expressions and stage variables
- 9.2 Modify stage
- 9.3 Filter and Switch stages
- Chapter 10. Combining data
- Chapter 11. Restructuring data
- Chapter 12. Performance tuning job designs
- Chapter 13. Database stage guidelines
- 13.1 Existing database development overview
- 13.2 Existing DB2 guidelines
- 13.2.1 Existing DB2 stage types
- 13.2.2 Connecting to DB2 with the DB2/UDB Enterprise stage
- 13.2.3 Configuring DB2 multiple instances in one DataStage job
- 13.2.4 DB2/UDB Enterprise stage column names
- 13.2.5 DB2/API stage column names
- 13.2.6 DB2/UDB Enterprise stage data type mapping
- 13.2.7 DB2/UDB Enterprise stage options
- 13.2.8 Performance notes
- 13.3 Existing Informix database guidelines
- 13.4 ODBC Enterprise guidelines
- 13.5 Oracle database guidelines
- 13.6 Sybase Enterprise guidelines
- 13.7 Existing Teradata database guidelines
- 13.7.1 Choosing the proper Teradata stage
- 13.7.2 Source Teradata stages
- 13.7.3 Target Teradata stages
- 13.7.4 Teradata Enterprise stage column names
- 13.7.5 Teradata Enterprise stage data type mapping
- 13.7.6 Specifying Teradata passwords with special characters
- 13.7.7 Teradata Enterprise settings
- 13.7.8 Improving Teradata Enterprise performance
- 13.8 Netezza Enterprise stage
- Chapter 14. Connector stage guidelines
- Chapter 15. Batch data flow design
- 15.1 High performance batch data flow design goals
- 15.2 Common bad patterns
- 15.2.1 DS server mentality for parallel jobs
- 15.2.2 Database sparse lookups
- 15.2.3 Processing full source database refreshes
- 15.2.4 Extracting much and using little (reference datasets)
- 15.2.5 Reference data is too large to fit into physical memory
- 15.2.6 Loading and re-extracting the same data
- 15.2.7 One sequence run per input/output file
- 15.3 Optimal number of stages per job
- 15.4 Checkpoint/Restart
- 15.5 Balanced optimization
- 15.6 Batch data flow patterns
- 15.6.1 Restricting incoming data from the source
- 15.6.2 A fundamental problem: Reference lookup resolution
- 15.6.3 A sample database model
- 15.6.4 Restricting the reference lookup dataset
- 15.6.5 Correlating data
- 15.6.6 Keeping information server as the transformation hub
- 15.6.7 Accumulating reference data in local datasets
- 15.6.8 Minimize number of sequence runs per processing window
- 15.6.9 Separating database interfacing and transformation jobs
- 15.6.10 Extracting data efficiently
- 15.6.11 Uploading data efficiently
- Chapter 16. Real-time data flow design
- 16.1 Definition of real-time
- 16.2 Mini-batch approach
- 16.3 Parallel framework in real-time applications
- 16.4 DataStage extensions for real-time applications
- 16.5 Job topologies
- 16.6 MQConnector/DTS
- 16.6.1 Aspects of DTS application development
- 16.6.2 Reference documentation
- 16.6.3 A sample basic DTS job
- 16.6.4 Design topology rules for DTS jobs
- 16.6.5 Transactional processing
- 16.6.6 MQ/DTS and the Information Server Framework
- 16.6.7 Sample job and basic properties
- 16.6.8 Runtime Topologies for DTS jobs
- 16.6.9 Processing order of input links
- 16.6.10 Rejecting messages
- 16.6.11 Database contention
- 16.6.12 Scalability
- 16.6.13 Design patterns to avoid
- 16.7 InfoSphere Information Services Director
- 16.7.1 The scope of this section
- 16.7.2 Design topology rules for always-on ISD jobs
- 16.7.3 Scalability
- 16.7.4 Synchronizing database stages with ISD output
- 16.7.5 ISD with DTS
- 16.7.6 ISD with connectors
- 16.7.7 Re-partitioning in ISD jobs
- 16.7.8 General considerations for using ISD jobs
- 16.7.9 Selecting server or EE jobs for publication through ISD
- 16.8 Transactional support in message-oriented applications
- 16.9 Payload processing
- 16.10 Pipeline Parallelism challenges
- 16.11 Special custom plug-ins
- 16.12 Special considerations for QualityStage
- Appendix A. Runtime topologies for distributed transaction jobs
- Appendix B. Standard practices summary
- Appendix C. DataStage naming reference
- Appendix D. Example job template
- Appendix E. Understanding the parallel job score
- Appendix F. Estimating the size of a parallel dataset
- Appendix G. Environment variables reference
- Appendix H. DataStage data types
- Related publications
- Back cover
Product information
- Title: InfoSphere DataStage Parallel Framework Standard Practices
- Author(s):
- Release date: July 2010
- Publisher(s): IBM Redbooks
- ISBN: 9780738434476