O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Data Management Solutions Using SAS Hash Table Operations

Book Description

Hash tables can do a lot more than you might think! Data Management Solutions Using SAS Hash Table Operations: A Business Intelligence Case Study concentrates on solving your challenging data management and analysis problems via the power of the SAS hash object, whose environment and tools make it possible to create complete dynamic solutions. To this end, this book provides an in-depth overview of the hash table as an in-memory database with the CRUD (Create, Retrieve, Update, Delete) cycle rendered by the hash object tools. By using this concept and focusing on real-world problems exemplified by sports data sets and statistics, this book seeks to help you take advantage of the hash object productively, in particular, but not limited to, the following tasks:
  • select proper hash tools to perform hash table operations
  • use proper hash table operations to support specific data management tasks
  • use the dynamic, run-time nature of hash object programming
  • understand the algorithmic principles behind hash table data look-up, retrieval, and aggregation
  • learn how to perform data aggregation, for which the hash object is exceptionally well suited
  • manage the hash table memory footprint, especially when processing big data
  • use hash object techniques for other data processing tasks, such as filtering, combining, splitting, sorting, and unduplicating.
Using this book, you will be able to answer your toughest questions quickly and in the most efficient way possible!

Table of Contents

  1. About This Book
  2. About These Authors
  3. Acknowledgments
  4. Part One—The HOW of the SAS Hash Object
    1. Chapter 1: Hash Object Essentials
      1. 1.1 Introduction
      2. 1.2 Hash Object in a Nutshell
      3. 1.3 Hash Table
      4. 1.4 Hash Table Properties
        1. 1.4.1 Residence and Volatility
        2. 1.4.2 Hash Variables Role Enforcement
        3. 1.4.3 Key Variables
        4. 1.4.4 Program Data Vector (PDV) Host Variables
      5. 1.5 Hash Table Lookup Organization
        1. 1.5.1 Hash Table Versus Indexed SAS Data File
      6. 1.6 Table Operations and Hash Object Tools
        1. 1.6.1 Tasks, Operations, Environment, and Tools Hierarchy
        2. 1.6.2 General Data Table Operations
        3. 1.6.3 Hash Object Tools Classification
        4. 1.6.4 Hash Object Syntax
        5. 1.6.5 Hash Object Nomenclature
      7. 1.7 Peek Under the Hood
        1. 1.7.1 Table Organization and Unindexed Key Search
        2. 1.7.2 Internal Hash Table Structure
        3. 1.7.3 Hashing Scheme
        4. 1.7.4 Hash Function
        5. 1.7.5 Hash Table Structure and Algorithm in Tandem
        6. 1.7.6 The HASHEXP Effect
        7. 1.7.7 What Is in the Name?
    2. Chapter 2: Table-Level Operations
      1. 2.1 Introduction
      2. 2.2 CREATE Operation
        1. 2.2.1 Declaring a Hash Object
        2. 2.2.2 Creating a Hash Object Instance
        3. 2.2.3 Combining Declaration and Instantiation
        4. 2.2.4 Defining Hash Table Variables
        5. 2.2.5 Omitting the DEFINEDATA Method
        6. 2.2.6 Wrapping Up the Create Operation
        7. 2.2.7 PDV Host Variables and Parameter Type Matching
        8. 2.2.8 Other Ways of Hard-Coded Parameter Type Matching
        9. 2.2.9 Dynamic Parameter Type Matching via File Reference
        10. 2.2.10 Parameter Type Matching by Forced File Reference
        11. 2.2.11 Parameter Type Matching by Default File Reference
        12. 2.2.12 Defining Multiple Hash Variables
        13. 2.2.13 Defining Hash Variables as Non-Literal Expressions
        14. 2.2.14 Defining Hash Variables Dynamically One at a Time
        15. 2.2.15 Defining Hash Variables Using Metadata
        16. 2.2.16 Multiple Instances Issue
        17. 2.2.17 Ensuring Single Instance Usage
        18. 2.2.18 Handling Multiple Instances
        19. 2.2.19 Create Operation Hash Tools
      3. 2.3 DELETE (Table) Operation
        1. 2.3.1 The DELETE Method
        2. 2.3.2 DELETE Operation Details
        3. 2.3.3 Delete (Table) Operation Hash Tools
      4. 2.4 CLEAR Operation
        1. 2.4.1 The CLEAR Method
        2. 2.4.2 Clear Operation vs Delete (Table) Operation
        3. 2.4.3 CLEAR Operation Hash Tools
      5. 2.5 OUTPUT Operation
        1. 2.5.1 The OUTPUT Method
        2. 2.5.2 Open-Write-Close Cycle
        3. 2.5.3 Open-Write-Close Cycle Encapsulation
        4. 2.5.4 Avoiding Open File Conflicts
        5. 2.5.5 Output Data Set Member Types
        6. 2.5.6 Creating and Overwriting Output Data Set
        7. 2.5.7 Using Output Data Set Options
        8. 2.5.8 DATASET Argument as Non-Literal Expression
        9. 2.5.9 Output Data Order
        10. 2.5.10 Output Operation Hash Tools
      6. 2.6 DESCRIBE Operation
        1. 2.6.1 The NUM_ITEMS Attribute
        2. 2.6.2 The ITEM_SIZE Attribute
        3. 2.6.3 Describe Operation Hash Tools
    3. Chapter 3: Item-Level Operations: Direct Access
      1. 3.1 Introduction
      2. 3.2 SEARCH (Pure LookUp) Operation
        1. 3.2.1 Implicit Search: No Arguments
        2. 3.2.2 Explicit Search: Using the KEY Argument Tag
        3. 3.2.3 Argument Tag Type Match
        4. 3.2.4 Assigned CHECK Calls
        5. 3.2.5 Unassigned CHECK Calls
        6. 3.2.6 Search Operation Hash Tools
        7. 3.2.7 Search Operation Hash-PDV Interaction
      3. 3.3 INSERT Operation
        1. 3.3.1 Dynamic Memory Acquisition
        2. 3.3.2 Implicit INSERT
        3. 3.3.3 Implicit INSERT: Method Call Mode
        4. 3.3.4 Implicit INSERT: Methods Other Than ADD
        5. 3.3.5 Implicit INSERT: Argument Tag Mode
        6. 3.3.6 Explicit INSERT
        7. 3.3.7 Explicit INSERT Rules
        8. 3.3.8 Implicit vs Explicit INSERT
        9. 3.3.9 Unique Key and Duplicate Key INSERT
        10. 3.3.10 Unique INSERT
        11. 3.3.11 Duplicate INSERT
        12. 3.3.12 Insertion Order
        13. 3.3.13 Insert Operation Hash Tools
        14. 3.3.14 INSERT Operation Hash-PDV Interaction
      4. 3.4 DELETE ALL Operation
        1. 3.4.1 DELETE ALL Implementation
        2. 3.4.2 DELETE ALL and Item Locking
        3. 3.4.3 DELETE ALL Operation Hash Tools
        4. 3.4.4 DELETE ALL Operation Hash-PDV Interaction
      5. 3.5 RETRIEVE Operation
        1. 3.5.1 Direct RETRIEVE
        2. 3.5.2 Successful Direct RETRIEVE
        3. 3.5.3 Unsuccessful Direct RETRIEVE
        4. 3.5.4 Implicit vs Explicit FIND Calls
        5. 3.5.5 RETRIEVE Operation Hash Tools
        6. 3.5.6 RETRIEVE Operation Hash-PDV Interaction
      6. 3.6 UPDATE ALL Operation
        1. 3.6.1 UPDATE ALL Implementation
        2. 3.6.2 Assigned vs Unassigned REPLACE Calls
        3. 3.6.3 Implicit vs Explicit REPLACE Calls
        4. 3.6.4 Selective UPDATE Operation Note
        5. 3.6.5 UPDATE ALL Operation Hash Tools
        6. 3.6.6 UPDATE ALL Operation Hash-PDV Interaction
      7. 3.7 ORDER Operation
        1. 3.7.1 ORDER Operation Invocation
        2. 3.7.2 ORDERED Argument Tag Plasticity
        3. 3.7.3 Hash Items vs Hash Item Groups
        4. 3.7.4 OUTPUT Operation Effects
        5. 3.7.5 General Hash Table Order Principle
        6. 3.7.6 Ordering by Composite Keys
        7. 3.7.7 Setting the SORTEDBY= Option
        8. 3.7.8 ORDER Operation Hash Tools
        9. 3.7.9 ORDER Operation Hash-PDV Interaction
    4. Chapter 4: Item-Level Operations: Enumeration
      1. 4.1 Introduction
      2. 4.2 Enumeration: Basics and Classification
        1. 4.2.1 Enumeration as a Process
        2. 4.2.2 Enumerating a Hash Table
        3. 4.2.3 KeyNumerate (Key Enumerate) Operation
        4. 4.2.4 Enumerate All Operation
      3. 4.3 KEYNUMERATE Operation
        1. 4.3.1 KeyNumerate Operation Mechanics
        2. 4.3.2 FIND_NEXT: Implicit vs Explicit
        3. 4.3.3 Other KeyNumerate Coding Styles
        4. 4.3.4 Version 9.4 Add-On: DO_OVER
        5. 4.3.5 Forward and Backward, In and Out
        6. 4.3.6 Staying within the Item List (Keeping It Set)
        7. 4.3.7 HAS_NEXT and HAS_PREV Peculiarities
        8. 4.3.8 Harvesting Hash Items
        9. 4.3.9 Harvesting Hash Items via Explicit Calls
        10. 4.3.10 Selective DELETE and UPDATE Operations
        11. 4.3.11 Selective DELETE: Single Item
        12. 4.3.12 Selective Delete: Multiple Items
        13. 4.3.13 Selective UPDATE
        14. 4.3.14 Selective DELETE vs Selective UPDATE
        15. 4.3.15 KeyNumerate Operation Hash Tools
        16. 4.3.16 KeyNumerate Operation Hash-PDV Interaction
      4. 4.4 ENUMERATE ALL Operation
        1. 4.4.1 The Hash Iterator Object
        2. 4.4.2 Creating and Linking the Iterator Object
        3. 4.4.3 Hash Iterator Pointer
        4. 4.4.4 Direct Iterator Access: First Item
        5. 4.4.5 Direct Iterator Access: Last Item
        6. 4.4.6 Direct Iterator Access: Key-Item
        7. 4.4.7 Sequential Access
        8. 4.4.8 Enumerating from the End Points
        9. 4.4.9 Iterator Priming Using NEXT and PREV
        10. 4.4.10 FIRST/LAST vs NEXT/PREV
        11. 4.4.11 Keeping the Iterator in the Table
        12. 4.4.12 Enumerating Sequentially from a Key-Item
        13. 4.4.13 Harvesting Same-Key Items from a Key-Item
        14. 4.4.14 The Hash Iterator and Item Locking
        15. 4.4.15 Locking and Unlocking
        16. 4.4.16 Locking Same-Key Item Groups
        17. 4.4.17 Locking the Entire Hash Table
        18. 4.4.18 ENUMERATE ALL Operation Hash Tools
        19. 4.4.19 Hash-PDV Interaction
  5. Part Two—The WHAT and the WHY of the SAS Hash Object
    1. Chapter 5: Bizarro Ball Sample Data
      1. 5.1 Introduction
      2. 5.2 Sample Data Descriptions
        1. 5.2.1 AtBats
        2. 5.2.2 Games
        3. 5.2.3 Leagues
        4. 5.2.4 Pitches
        5. 5.2.5 Player_Candidates
        6. 5.2.6 Runs
        7. 5.2.7 Teams
      3. 5.3 Summary
    2. Chapter 6: Data Tasks Using Hash Table Operations
      1. 6.1 Introduction
      2. 6.2 Subsetting Data
        1. 6.2.1 Two Principal Methods of Subsetting
        2. 6.2.2 Simple Data File Subsetting via a Hash Table
        3. 6.2.3 Why a Hash Table and Not SQL?
        4. 6.2.4 Subsetting with a Twist: Adding a Simple Count
      3. 6.3 Combining Data
        1. 6.3.1 Left / Right Joins
        2. 6.3.2 Merging a Join with an Aggregate
        3. 6.3.3 Inner Joins
        4. 6.3.4 DO_OVER Versus FIND + FIND_NEXT
        5. 6.3.5 Unique-Key Joins
      4. 6.4 Splitting Data
        1. 6.4.1 Hash Data Split - Sorted Input
        2. 6.4.2 Hash Data Split - Unsorted Input
      5. 6.5 Ordering and Grouping Data
        1. 6.5.1 Reordering Split Outputs
        2. 6.5.2 Intrinsic Data Grouping
      6. 6.6 Summary
    3. Chapter 7: Supporting Data Warehouse Star Schemas
      1. 7.1 Introduction
      2. 7.2 Creating and Updating Fact Tables
      3. 7.3 Creating and Updating Slowly Changing Dimension Tables
        1. 7.3.1 Handling Type 0 Dimension Tables
        2. 7.3.2 Handling Type 1 Dimension Tables
        3. 7.3.3 Handling Type 2 Dimension Tables
        4. 7.3.4 Handling Type 3 Dimension Tables
        5. 7.3.5 Handling Type 4 Dimension Tables
        6. 7.3.6 Handling Type 6 Dimension Tables
      4. 7.4 Creating a Bizarro Ball Star Schema Data Warehouse
        1. 7.4.1 Defining the Data Warehouse Tables
        2. 7.4.2 Defining the Fact and Dimension Hash Tables via Metadata
        3. 7.4.3 Creating the Initial Data Structures for a Star Schema
        4. 7.4.4 Updating the Fact and Dimension Tables
      5. 7.5 Summary
    4. Chapter 8: Creating Data Aggregates and Metrics
      1. 8.1 Overview
      2. 8.2 Creating Simple Aggregates
        1. 8.2.1 Getting Variables from Other Tables
        2. 8.2.2 Calculating Unique Counts
        3. 8.2.3 Calculating Medians, Percentiles, Mode, and More
      3. 8.3 Creating Multi-Way Aggregates
        1. 8.3.1 Using Parameter Files to Define Aggregates
      4. 8.4 Summary
  6. Part Three—Expanding the WHAT and the WHY, along with the HOW of the SAS Hash Object
    1. Chapter 9: Hash of Hashes – Looping Through SAS Hash Objects
      1. 9.1 Overview
      2. 9.2 Creating a Hash of Hashes (HoH) Table – Simple Example
      3. 9.3 Calculating Percentiles, Mode, Mean, and More
        1. 9.3.1 Percentiles
        2. 9.3.2 Multiple Medians
        3. 9.3.3 Percentiles, Mode, Median, and More
      4. 9.4 Consecutive Events
      5. 9.5 Multiple Splits
        1. 9.5.1 Adding a Unique Count
        2. 9.5.2 Multiple Split Calculations
      6. 9.6 Summary
    2. Chapter 10: The Hash Object as a Dynamic Data Structure
      1. 10.1 Introduction
      2. 10.2 Stable Unduplication
        1. 10.2.1 Basic Stable Unduplication
        2. 10.2.2 Selective Unduplication
      3. 10.3 Testing Data for Grouping
        1. 10.3.1 Grouped vs Non-Grouped
        2. 10.3.2 Using a Hash Table to Check for Grouping
      4. 10.4 Hash Tables as Other Data Structures
        1. 10.4.1 Stacks and Queues
        2. 10.4.2 Implementing a Stack
        3. 10.4.3 Implementing a Queue
        4. 10.4.4 Using a Hash Stack to Find Consecutive Events
      5. 10.5 Array Sorting
        1. 10.5.1 Using a Hash Table to Sort Arrays
      6. 10.6 Summary
    3. Chapter 11: Hash Object Memory Management
      1. 11.1 Introduction
      2. 11.2 Memory vs. Disk Trade-Off
        1. 11.2.1 General Considerations
        2. 11.2.2 Hash Memory Overload Scenarios and Solutions
      3. 11.3 Making Use of Existing Key Order
        1. 11.3.1 Data Aggregation
        2. 11.3.2 Data Unduplication
        3. 11.3.3 Joining Data
      4. 11.4 MD5 Hash Key Reduction
        1. 11.4.1 The General Concept
        2. 11.4.2 MD5 Key Reduction in Sample Data
        3. 11.4.3 Data Aggregation
        4. 11.4.4 Data Unduplication
        5. 11.4.5 Joining Data
      5. 11.5 Data Portion Offload (Hash Index)
        1. 11.5.1 Joining Data
        2. 11.5.2 Selective Unduplication
      6. 11.6 Uniform Input Split
        1. 11.6.1 Uniform Split Using Key Properties
        2. 11.6.2 Aggregation via Partial Key Split
        3. 11.6.3 Aggregation via Key Byte Split
        4. 11.6.4 Joining via Key Byte Split
      7. 11.7 Uniform MD5 Split On the Fly
        1. 11.7.1 MD5 Split Aggregation On the Fly
        2. 11.7.2 MD5 Split Join On the Fly
      8. 11.8 Uniform Split Using a SAS Index
      9. 11.9 Combining Hash Memory-Saving Techniques
      10. 11.10 MD5 Argument Concatenation Ins and Outs
        1. 11.10.1 MD5 Collisions and SHA256
        2. 11.10.2 Concatenation Length Sizing
        3. 11.10.4 Concatenation Delimiters and Endpoints
        4. 11.10.5 Auto-Formatting and Explicit Formatting
        5. 11.10.6 Concatenation Order and Consistency
      11. 11.11 Summary
  7. Part Four—Wrapping up: Two Case Studies
    1. Chapter 12: Researching Alternative Pitching Metrics
      1. 12.1 Overview
      2. 12.2 The Sample Program
        1. 12.2.1 Adding More Metrics
        2. 12.2.2 One Output Data Set with All the Splits Results
      3. 12.3 Summary
    2. Chapter 13: What If the Count Is 0-2 After the First Two Pitches
      1. 13.1 Overview
      2. 13.2 The Sample Program
      3. 13.3 Summary
  8. Index