You are previewing Microsoft® SQL Server® 2008 Integration Services: Problem-Design-Solution.
O'Reilly logo
Microsoft® SQL Server® 2008 Integration Services: Problem-Design-Solution

Book Description

An authoritative guide to designing effective solutions for data cleansing, ETL, and file management with SQL Server 2008 Integration Services

SQL Server Integration Services (SSIS) is the leading tool in the data warehouse industry, used for performing extraction, transformation, and load operations.

After an overview of SSIS architecture, the authors walk you a series of real-world problems and show various techniques for handling them.

  • Shows you how to design SSIS solutions for data cleansing, ETL and file management

  • Demonstrates how to integrate data from a variety of data sources,

  • Shows how to monitor SSIS performance,

  • Demonstrates how to avoid common pitfalls involved with SSIS deployment

  • Explains how to ensure performance of the deployed solution and effectively handle unexpected system failures and outages

  • The companion Web site provides sample code and database scripts that readers can directly implement

This book shows you how to design, build, deploy, and manage solutions to real-world problems that SSIS administrators and developers face day-to-day.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. About the Authors
  6. Credits
  7. Acknowledgments
  8. Introduction
  9. Chapter 1: SSIS Solution Architecture
    1. Problem
    2. Macro Challenge: Enterprise Data Spaghetti
    3. Micro Challenge: Data-Processing Confusion
    4. Problems with Execution and Troubleshooting
    5. Infrastructure Challenges
    6. Other Challenges
    7. Design
    8. Choosing the Right Tool
    9. Overall Solution Architecture
    10. Data Integration or Consolidation
    11. Data Warehouse ETL
    12. Project Planning and Team Development
    13. Agile Benefits
    14. Agile Cautions and Planning
    15. Data Element Documentation
    16. Package Design Patterns
    17. Modular Packages
    18. Master Packages
    19. Server and Storage Hardware
    20. Server Hardware
    21. Development and Test Servers
    22. ETL Collocation
    23. Storage Hardware
    24. Package Execution Location
    25. Package Storage Location Versus Execution Location
    26. Execute SQL Task and Bulk Insert Task Execution
    27. Package Execution and the Data Flow
    28. Design Review
    29. Solution
    30. Setting the Stage: Management and Deployment
    31. Source Data: Files, Tables, and Data Cleansing
    32. Data Warehouse ETL and Cube Processing
    33. Advanced ETL: Scripting, High Availability, and Performance
    34. Summary
  10. Chapter 2: SSIS Management Framework Design
    1. Problem
    2. Challenges of Not Having a Framework Implementation
    3. Different Development Methods
    4. Changing Metadata
    5. Getting the Right Information to the Right People at the Right Time
    6. Reusability
    7. Framework Implementation
    8. Configuration Management Scheme
    9. Logging and Auditing
    10. Template Package
    11. Framework Benefits and Value
    12. Design
    13. Configuration Management
    14. Overall Design
    15. Environment Variable Configuration
    16. XML Configuration File
    17. SQL Server Configuration
    18. Logging and Auditing Mechanism
    19. Storage
    20. Tables
    21. Stored Procedures
    22. Template Package
    23. Implementation Guidelines
    24. Solution
    25. Configuration Management
    26. Logging and Auditing Mechanism
    27. Storage and Tables
    28. Stored Procedures
    29. Template Package
    30. Development
    31. Installation
    32. Other Considerations
    33. Customizations
    34. ETL Process Framework
    35. Process Owner
    36. Reporting
    37. Summary
  11. Chapter 3: Package Deployment and Storage Decisions
    1. Problem
    2. Standardization
    3. Environment
    4. Application
    5. Desired Implementation
    6. Design
    7. Storage Methods
    8. SSIS Service
    9. File System Storage
    10. SQL Server Storage
    11. Storage Comparison
    12. Deployment Methods
    13. SSIS Deployment Wizard
    14. Manual Deployment of Packages
    15. DTUtil Scripting
    16. Solution
    17. Storage Methodology
    18. Existing Infrastructure
    19. Security
    20. Package Makeup
    21. Back to the Flowchart
    22. Deployment Methodology
    23. Storage Method
    24. Existing Infrastructure
    25. Deployment Needs
    26. Back to the Flowchart
    27. Total Methodology
    28. Summary
  12. Chapter 4: File-Handling and Processing Methods
    1. Problem
    2. Simple File Operations
    3. Remote File Operations
    4. File Criteria
    5. File Iteration
    6. File Notification
    7. Design
    8. File System Task
    9. Source and Destination Connection
    10. Attributes
    11. Overwriting the Destination
    12. Multiple Files
    13. FTP Task
    14. FTP Connection
    15. HTTP
    16. Foreach Loop
    17. Applying Advanced Criteria
    18. File Watcher Task
    19. WMI Event Watcher Task
    20. Solution
    21. Summary
  13. Chapter 5: Data Extraction Best Practices
    1. Problem
    2. Extraction Data Criteria
    3. Source System Impact
    4. Incremental Extraction
    5. Deleted Rows
    6. Staging Database
    7. Data Lineage and Metadata
    8. File Sources
    9. Design
    10. Package Connections and Source Adapters
    11. Package Connections
    12. Source Adapters
    13. Incremental Data Extraction
    14. Incremental Extraction Using a Change Identifier Value
    15. Targeting Change Records through Dynamic Source Queries
    16. Retrieving Incremental Identifier Values and Updating Package Variables
    17. Capturing the Maximum Change Identifier Value
    18. Incremental Extraction from SQL Server without a Trigger
    19. Using SSIS to Handle All Aspects of an Incremental Extraction
    20. Data Staging Method
    21. Tracking Data Lineage Identifiers
    22. Solution
    23. Metadata-Driven Extraction Example
    24. Metadata Tables
    25. Control Flow
    26. Looping through the Sources
    27. Inside the Sources Loop
    28. The Script
    29. Read Variables
    30. Open Connections
    31. Get Max Change Indicator
    32. Extract Changed Source Rows
    33. Close Connections
    34. Summary
  14. Chapter 6: Data-Cleansing Design
    1. Problem
    2. Candidate Key Profiles
    3. Column Length Distribution Profiles
    4. Column Null Ratio Profiles
    5. Column Pattern Profiles
    6. Column Statistics Profiles
    7. Column Value Distribution Profiles
    8. Functional Dependency Profiles
    9. Design
    10. Using the Script Transform for Data Scrubbing
    11. Using the Fuzzy Grouping Transform to De-duplicate Data
    12. Using the Fuzzy Lookup Transform to Cleanse Data
    13. Dealing with Multiple Record Types in a Single File
    14. Using the Raw File
    15. Solution
    16. Summary
  15. Chapter 7: Dimension Table ETL
    1. Problem — Fundamental Dimension ETL
    2. Dimensions: The Basics
    3. Dimension ETL: The Challenge
    4. Design — Fundamental Dimension ETL
    5. Data Preparation
    6. Dimension Change Types
    7. Type 1 (Changing) Attribute: A Closer Look
    8. Type 2 (Historical) Attribute: A Closer Look
    9. Inferred Members
    10. Solution — Fundamental Dimension ETL
    11. Preparing Your Source Data for Dimension ETL
    12. SSIS Slowly Changing Dimension Wizard
    13. Advanced Properties and Additional Outputs of the SCD
    14. Problem — Advanced Dimension ETL
    15. SCD Wizard Advantages and Disadvantages
    16. Dimension Volume and Complexity
    17. Design — Advanced Dimension ETL
    18. Optimizing the Built-in SCD
    19. Index Optimizations
    20. Update Optimizations
    21. Snowflake Dimension Tables
    22. Parent-Child Dimension ETL
    23. Date Dimension ETL
    24. Profile Dimension and Junk Dimension ETL
    25. Creating a Custom Slowly Changing Package
    26. Solution — Advanced Dimension ETL
    27. Snowflake Dimensions
    28. Parent-Child Dimensions
    29. Profile and Junk Dimensions
    30. Date Dimension ETL
    31. Custom Dimension ETL
    32. Determining Dimension Changes
    33. Inserts and Updates
    34. Summary
  16. Chapter 8: Fact Table ETL
    1. Problem
    2. Fact Tables: The Basics
    3. Fact Table ETL: The Challenge
    4. Preparing the Data
    5. Mapping Dimension Keys
    6. Calculating Measures
    7. Adding Metadata
    8. Fact Table Updates
    9. Fact Table Inserts
    10. Design
    11. Data Preparation
    12. Data Preparation with SSIS Transformations
    13. Data Preparation Examples
    14. Acquiring the Dimension Surrogate Key in SSIS
    15. Identifying the Dimension Surrogate Keys with SSIS
    16. Surrogate Key Examples in SSIS
    17. Measure Calculations
    18. Measure Calculation Types
    19. Handling Measure Calculations in SSIS
    20. Managing Fact Table Changes
    21. Approaches to Identifying Changed Fact Records
    22. Fact Update Examples in SSIS
    23. Optimizing Fact Table Inserts
    24. Optimizing Inserts with Fast Load
    25. Optimizing Inserts with Index Management
    26. Solution
    27. Internet and Reseller Sales Fact Table ETL
    28. Fact Internet and Reseller Sales Extraction and Transform Process
    29. Fact Internet and Reseller Sales Load Process
    30. Snapshot Fact Table Example — Call Center Fact Table
    31. Advanced Fact Table ETL Concepts
    32. Handling Missing Dimension Lookups
    33. Handling Late-Arriving Facts
    34. Summary
  17. Chapter 9: SSAS Processing Architecture
    1. Problem
    2. SSAS Object Population
    3. Schedule
    4. Partition Management
    5. The Complete Package
    6. Design
    7. SSAS Objects and Processing Basics
    8. Dimensions
    9. Partitions
    10. Mining Structures
    11. SSIS Tasks and Components
    12. Analysis Services Processing Task
    13. Analysis Services Execute DDL Task
    14. Execute Process Task with ASCMD
    15. Data Flow Destinations for SSAS Objects
    16. Script Task with AMO
    17. Creating and Managing Partitions
    18. Overall Design
    19. Solution
    20. Preparation for SSAS Integration
    21. Process Dimensions Package
    22. Process Task
    23. Parallel XMLA
    24. Process Partitions Package
    25. Storing and Loading Metadata
    26. SSAS Processing
    27. Overall Solution
    28. Summary
  18. Chapter 10: Implementing Scale-Out ETL Process
    1. Problem
    2. Design
    3. Design Components Overview
    4. Central Common Services Server
    5. File Processor and Pre-Aggregation Scale-Out Processes Servers
    6. Design Details
    7. File Management Tasks
    8. Data File Management
    9. Work Allocation Process
    10. Scale-Out Source File Process
    11. Work Reassignment Process
    12. Data Aggregation Tasks
    13. Hourly Data Pre-Aggregation Process
    14. Hourly Data Aggregation Process
    15. Daily Data Aggregation Process
    16. Archival and Clean-up Processes
    17. Data File Archival Process
    18. Stage Table Clean-up Process
    19. Design Conclusions
    20. Solution
    21. Central Server Services
    22. Multiple File Processor and Pre-Aggregator Processes
    23. Database Tables Required on the Central Server
    24. Stored Procedures
    25. Procedures on the Central Server
    26. Procedures on the Staging Servers
    27. SSIS Packages
    28. File-Processing Server Packages
    29. Central Server Packages
    30. Summary
  19. Chapter 11: Scripting Design Patterns
    1. Problem — Advanced File Management
    2. Script Task
    3. Scenario
    4. Design — Advanced File Management
    5. Script Language
    6. Accessing Objects
    7. Custom Assemblies
    8. Scripting
    9. Solution — Advanced File Management
    10. Create an External Assembly
    11. Access the External Assembly
    12. Archive the Files
    13. Summary — Advanced File Management
    14. Problem — Call Center Fact ETL
    15. Reasons to Use Scripting
    16. Scenario
    17. Design — Call Center Fact ETL
    18. Component Type
    19. Output Type
    20. Synchronous Outputs
    21. Asynchronous Outputs
    22. Other Component Types
    23. Design Choice
    24. Overridable Methods
    25. Solution — Call Center Fact ETL
    26. Package Setup
    27. Script Component
    28. Reusable Methods
    29. Row-By-Row Processing
    30. Process Input
    31. Entire Script
    32. Package Wrap-Up
    33. Summary — Call Center Fact ETL
    34. Summary
  20. Chapter 12: SSIS Package Scaling
    1. Problem
    2. Identifying Task Durations
    3. Identifying Data Flow Destination Issues
    4. Identifying Transformation and Memory Bottlenecks
    5. Identifying Data Flow Source Performance Issues
    6. Design
    7. Balancing SQL Operations with the Data Flow
    8. Data Flow Advantages as Compared with SQL
    9. SQL Advantages when Compared with the Data Flow
    10. Applying SQL Code in Your SSIS Packages
    11. SSIS Pipeline Architecture Concepts
    12. Data Flow Design Example
    13. SQL-Centric SSIS Process
    14. Rewritten Data Flow–Centric SSIS Process
    15. Solution
    16. Tuning Your Data Flow
    17. Use Blocking Transformations Sparingly
    18. Limit Row-by-Row Operations
    19. Manage Data Flow Backpressure
    20. Pre-sort Sources as an Alternative to the Sort Transformation
    21. Optimize the Lookup and Managing the Cache Size
    22. Remove Unused Columns from the Data Flow
    23. Be Mindful of the SSIS Logging Impact
    24. Regulate Transactions
    25. Setting Data Flow Properties
    26. Up the EngineThreads Value
    27. Optimize the Temporary Storage Locations
    28. Leave RunInOptimizedMode as True
    29. Tune Buffers
    30. Database and Destination Optimization
    31. Limiting Database Growth
    32. Consider Dropping and Re-creating Indexes on Destination Tables
    33. Using the OLE DB Destination Adapter
    34. Use Advanced Oracle and Teradata Destination Adapters
    35. Handling Partitioned Tables
    36. Summary
  21. Index
  22. Advertisement