You are previewing Data Preparation for Analytics Using SAS.
O'Reilly logo
Data Preparation for Analytics Using SAS

Book Description

Written for anyone involved in the data preparation process for analytics, Gerhard Svolba's Data Preparation for Analytics Using SAS offers practical advice in the form of SAS coding tips and tricks, and provides the reader with a conceptual background on data structures and considerations from a business point of view. The tasks addressed include viewing analytic data preparation in the context of its business environment, identifying the specifics of predictive modeling for data mart creation, understanding the concepts and considerations of data preparation for time series analysis, using various SAS procedures and SAS Enterprise Miner for scoring, creating meaningful derived variables for all data mart types, using powerful SAS macros to make changes among the various data mart structures, and more!

This book is part of the SAS Press program.

Table of Contents

  1. Preface
  2. Part 1 Data Preparation: Business Point of View
    1. Chapter 1 Analytic Business Questions
      1. 1.1 Introduction
      2. 1.2 The Term <i xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" xmlns:ns2="http://www.w3.org/2001/10/synthesis">Business Question</i>
      3. 1.3 Examples of Analytic Business Questions
      4. 1.4 The Analysis Process
      5. 1.5 Challenging an Analytic Business Question
      6. 1.6 Business Point of View Needed
    2. Chapter 2 Characteristics of Analytic Business Questions
      1. 2.1 Introduction
      2. 2.2 Analysis Complexity: Real Analytic or Reporting?
      3. 2.3 Analysis Paradigm: Statistics or Data Mining?
      4. 2.4 Data Preparation Paradigm: As Much Data As Possible or Business Knowledge First?
      5. 2.5 Analysis Method: Supervised or Unsupervised?
      6. 2.6 Scoring Needed: Yes/No?
      7. 2.7 Periodicity of Analysis: One-Shot Analysis or Re-run Analysis?
      8. 2.8 Need for Historic Data: Yes/No?
      9. 2.9 Data Structure: One-Row-per-Subject or Multiple-Rows-per-Subject?
      10. 2.10 Complexity of the Analysis Team
      11. 2.11 Conclusion
    3. Chapter 3 Characteristics of Data Sources
      1. 3.1 Introduction
      2. 3.2 Operational or Dispositive Data Systems?
      3. 3.3 Data Requirement: Periodic Availability
      4. 3.4 Wording: Analysis Table or Analytic Data Mart?
      5. 3.5 Quality of Data Sources for Analytics
    4. Chapter 4 Different Points of View on Analytic Data Preparation
      1. 4.1 Introduction
      2. 4.2 Simon, Daniele and Elias: Three Different Roles in the Analysis Process
      3. 4.3 Simon—The Business Analyst
      4. 4.4 Daniele—The Quantitative Expert
      5. 4.5 Elias—The IT and Data Expert
      6. 4.6 Who Is Right?
      7. 4.7 The Optimal Triangle
  3. Part 2 Data Structures and Data Modeling
    1. Chapter 5 The Origin of Data
      1. 5.1 Introduction
      2. 5.2 Data Origin from a Technical Point of View
      3. 5.3 Application Layer and Data Layer
      4. 5.4 Simple Text Files or Spreadsheets
      5. 5.5 Relational Database Systems
      6. 5.6 Enterprise Resource Planning Systems
      7. 5.7 Hierarchical Databases
      8. 5.8 Large Text Files
      9. 5.9 Where Should Data Be Accessed From?
    2. Chapter 6 Data Models
      1. 6.1 Introduction
      2. 6.2 Relational Model and Entity Relationship Diagrams
      3. 6.3 Logical versus Physical Data Model
      4. 6.4 Star Schema
      5. 6.5 Normalization and De-normalization
    3. Chapter 7 Analysis Subjects and Multiple Observations
      1. 7.1 Introduction
      2. 7.2 Analysis Subject
      3. 7.3 Multiple Observations
      4. 7.4 Data Mart Structures
      5. 7.5 No Analysis Subject Available?
    4. Chapter 8 The One Row-per-Subject Data Mart
      1. 8.1 Introduction
      2. 8.2 The One-Row-per-Subject Paradigm
      3. 8.3 The Technical Point of View
      4. 8.4 The Business Point of View: Transposing or Aggregating Original Data
      5. 8.5 Hierarchies: Aggregating Up and Copying Down
      6. 8.6 Conclusion
    5. Chapter 9 The Multiple-Rows-per-Subject Data Mart
      1. 9.1 Introduction
      2. 9.2 Using Multiple-Rows-per-Subject Data Marts
      3. 9.3 Types of Multiple-Rows-per-Subject Data Marts
      4. 9.4 Multiple Observations per Time Period
      5. 9.5 Relationship to Other Data Mart Structures
    6. Chapter 10 Data Structures for Longitudinal Analysis
      1. 10.1 Introduction
      2. 10.2 Data Relationships in Longitudinal Cases
      3. 10.3 Transactional Data, Finest Granularity, and Most Appropriate Aggregation Level
      4. 10.4 Data Mart Structures for Longitudinal Data Marts
    7. Chapter 11 Considerations for Data Marts
      1. 11.1 Introduction
      2. 11.2 Types and Roles of Variables in a Data Mart
      3. 11.3 Derived Variables
      4. 11.4 Variable Criteria
    8. Chapter 12 Considerations for Predictive Modeling
      1. 12.1 Introduction
      2. 12.2 Target Windows and Observation Windows
      3. 12.3 Multiple Target Windows
      4. 12.4 Overfitting
  4. Part 3 Data Mart Coding and Content
    1. Chapter 13 Accessing Data
      1. 13.1 Introduction
      2. 13.2 Accessing Data from Relational Databases Using SAS/ACCESS Modules
      3. 13.3 Accessing Data from Microsoft Office
      4. 13.4 Accessing Data from Text Files
      5. 13.5 Accessing Data from Hierarchical Text Files
      6. 13.6 Other Access Methods
    2. Chapter 14 Transposing One- and Multiple-Rows-per-Subject Data Structures
      1. 14.1 Introduction
      2. 14.2 Transposing from a Multiple-Rows-per-Subject Data Set to a One-Row-per-Subject Data Set
      3. 14.3 Transposing from a One-Row-per-Subject Data Set to a Multiple-Rows-per-Subject Data Set
      4. 14.4 Transposing a Transactional Table with Categorical Entries
      5. 14.5 Creating Key-Value Tables
    3. Chapter 15 Transposing Longitudinal Data
      1. 15.1 Introduction
      2. 15.2 Standard Scenarios
      3. 15.3 Complex Scenarios
    4. Chapter 16 Transformations of Interval-Scaled Variables
      1. 16.1 Introduction
      2. 16.2 Simple Derived Variables
      3. 16.3 Derived Relative Variables
      4. 16.4 Time Intervals
      5. 16.5 Binning Observations into Groups
      6. 16.6 Transformations of Distributions
      7. 16.7 Replacing Missing Values
      8. 16.8 Conclusion
    5. Chapter 17 Transformations of Categorical Variables
      1. 17.1 Introduction
      2. 17.2 General Considerations for Categorical Variables
      3. 17.3 Derived Variables
      4. 17.4 Combining Categories
      5. 17.5 Dummy Coding of Categorical Variables
      6. 17.6 Multidimensional Categorical Variables
      7. 17.7 Lookup Tables and External Data
    6. Chapter 18 Multiple Interval-Scaled Observations per Subject
      1. 18.1 Introduction
      2. 18.2 Static Aggregation
      3. 18.3 Correlation of Values
      4. 18.4 Concentration of Values
      5. 18.5 Course over Time: Standardization of Values
      6. 18.6 Course over Time: Derived Variables
    7. Chapter 19 Multiple Categorical Observations per Subject
      1. 19.1 Introduction
      2. 19.2 Absolute and Relative Frequencies of Categories
      3. 19.3 Concatenating Absolute and Relative Frequencies
      4. 19.4 Calculating Total and Distinct Counts of the Categories
      5. 19.5 Using ODS to Create Different Percent Variables
      6. 19.6 Business Interpretation of Percentage Variables
      7. 19.7 Other Methods
    8. Chapter 20 Coding for Predictive Modeling
      1. 20.1 Introduction
      2. 20.2 Proportions or Means of the Target Variable
      3. 20.3 Interval Variables and Predictive Modeling
      4. 20.4 Validation Methods
      5. 20.5 Conclusion
    9. Chapter 21 Data Preparation for Multiple-Rows-per-Subject and Longitudinal Data Marts
      1. 21.1 Introduction
      2. 21.2 Data Preparation for Association and Sequence Analysis
      3. 21.3 Enhancing Time Series Data
      4. 21.4 Aggregating at Various Hierarchical Levels
      5. 21.5 Preparing Time Series Data with SAS Functions
      6. 21.6 Using SAS/ETS Procedures for Data Preparation
  5. Part 4 Sampling, Scoring, and Automation
    1. Chapter 22 Sampling
      1. 22.1 Introduction
      2. 22.2 Sampling Methods
      3. 22.3 Simple Sampling and Reaching the Exact Sample Count or Proportion
      4. 22.4 Oversampling
      5. 22.5 Clustered Sampling
      6. 22.6 Conclusion
    2. Chapter 23 Scoring and Automation
      1. 23.1 Introduction
      2. 23.2 Scoring Process
      3. 23.3 Explicitly Calculating the Score Values from Parameters and Input Variables
      4. 23.4 Using the Respective SAS/STAT Procedure for Scoring
      5. 23.5 Scoring with PROC SCORE of SAS/STAT
      6. 23.6 Using the Respective SAS/ETS Procedure for Scoring
      7. 23.7 The Score Code That Can Be Produced in SAS Enterprise Miner
      8. 23.8 The Pre-checks on the Data That Are Useful before Scoring
      9. 23.9 Automation of Data Mart Creation in General
    3. Chapter 24 Do’s and Don’ts When Building Data Marts
      1. 24.1 Introduction
      2. 24.2 Process Do’s and Don’ts
      3. 24.3 Data Mart Handling Do’s and Don’ts
      4. 24.4 Coding Do’s and Don’ts
  6. Part 5 Case Studies
    1. Chapter 25 Case Study 1—Building a Customer Data Mart
      1. 25.1 Introduction
      2. 25.2 The Business Context
      3. 25.3 The Data
      4. 25.4 The Programs
      5. 25.5 The Results and Their Usage
    2. Chapter 26 Case Study 2—Deriving Customer Segmentation Measures from Transactional Data
      1. 26.1 The Business Questions
      2. 26.2 The Data
      3. 26.3 The Programs
      4. 26.4 The Results and Their Usage
    3. Chapter 27 Case Study 3—Preparing Data for Time Series Analysis
      1. 27.1 Introduction
      2. 27.2 The Business Context
      3. 27.3 The Data
      4. 27.4 From Transactional Data to the Most Appropriate Aggregation
      5. 27.5 Comparing PROC SQL, PROC MEANS, and PROC TIMESERIES
      6. 27.6 Additional Aggregations
      7. 27.7 Derived Variables
      8. 27.8 Creating Observations for Future Months
      9. 27.9 The Results and Their Usage
    4. Chapter 28 Case Study 4—Data Preparation in SAS Enterprise Miner
      1. 28.1 Introduction
      2. 28.2 Nodes for Data Preparation
      3. 28.3 Data Definition, Sampling, and Data Partition
      4. 28.4 Data Transformation
      5. 28.5 Scoring in SAS Enterprise Miner
      6. 28.6 Merging Multiple-Rows-per-Subject Data
      7. 28.7 Conclusion
  7. Appendix A Data Structures from a SAS Procedure Point of View
    1. A.1 Introduction
    2. A.2 Relationship between Data Mart Elements and SAS Procedure Statements
    3. A.3 Data Mart Structure Requirements for Selected Base SAS Procedures
    4. A.4 Data Mart Structure Requirements for Selected SAS/STAT Procedures
    5. A.5 Data Mart Structure Requirements forSelected SAS/ETS Procedures
    6. A.6 Data Mart Structure Requirements for Selected SAS/QC Procedures
    7. A.7 Data Mart Structure Requirements for Selected SAS/GRAPH Procedures
    8. A.8 Data Mart Structure Requirements for SAS Enterprise Miner Nodes
  8. Appendix B The Power of SAS for Analytic Data Preparation
    1. B.1 Motivation
    2. B.2 Overview
    3. B.3 Extracting Data from Source Systems
    4. B.4 Changing the Data Mart Structure: Transposing
    5. B.5 Data Management for Longitudinal and Multiple-Rows-per-Subject Data Sets
    6. B.6 Selected Features of the SAS Language for Data Management
    7. B.7 Benefits of the SAS Macro Language
    8. B.8 Matrix Operations with SAS/IML
  9. Appendix C Transposing with DATA Steps
    1. C.1 Transposing and Performance
    2. C.2 From a LONG Data Set to a WIDE Data Set
    3. C.3 From a WIDE Data Set to a LONG Data Set
    4. C.4 Transposing Transactional Categories with a DATA Step
  10. Glossary
  11. Index