You are previewing Cody's Data Cleaning Techniques Using SAS, Second Edition.
O'Reilly logo
Cody's Data Cleaning Techniques Using SAS, Second Edition

Book Description

Thoroughly updated for SAS 9, Cody's Data Cleaning Techniques Using SAS, Second Edition, addresses tasks that nearly every SAS programmer needs to do - that is, make sure that data errors are located and corrected. Written in Ron Cody's signature informal, tutorial style, this book develops and demonstrates data cleaning programs and macros that you can use as written or modify for your own special data cleaning needs. Each topic is developed through specific examples, and every program and macro is explained in detail.

You'll learn how to

  • find and correct errors in character and numeric values

  • develop programming techniques related to dates and missing values

  • use SQL approaches to data cleaning

  • develop techniques for correcting your data errors

  • use integrity constraints and audit trails to prevent errors from being added to a clean data set

Novice and experienced SAS users will discover ways to detect and correct data errors while learning how to apply DATA step programming techniques and SAS procedures.

Table of Contents

  1. List of Programs
  2. Preface
  3. Acknowledgments
  4. 1 Checking Values of Character Variables
    1. Introduction
    2. Using PROC FREQ to List Values
    3. Description of the Raw Data File PATIENTS.TXT
    4. Using a DATA Step to Check for Invalid Values
    5. Describing the VERIFY, TRIM, MISSING, and NOTDIGIT Functions
    6. Using PROC PRINT with a WHERE Statement to List Invalid Values
    7. Using Formats to Check for Invalid Values
    8. Using Informats to Remove Invalid Values
  5. 2 Checking Values of Numeric Variables
    1. Introduction
    2. Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers
    3. Using an ODS SELECT Statement to List Extreme Values
    4. Using PROC UNIVARIATE Options to List More Extreme Observations
    5. Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
    6. Using PROC RANK to Look for Highest and Lowest Values by Percentage
    7. Presenting a Program to List the Highest and Lowest Ten Values
    8. Presenting a Macro to List the Highest and Lowest “n” Values
    9. Using PROC PRINT with a WHERE Statement to List Invalid Data Values
    10. Using a DATA Step to Check for Out-of-Range Values
    11. Identifying Invalid Values versus Missing Values
    12. Listing Invalid (Character) Values in the Error Report
    13. Creating a Macro for Range Checking
    14. Checking Ranges for Several Variables
    15. Using Formats to Check for Invalid Values
    16. Using Informats to Filter Invalid Values
    17. Checking a Range Using an Algorithm Based on Standard Deviation
    18. Detecting Outliers Based on a Trimmed Mean and Standard Deviation
    19. Presenting a Macro Based on Trimmed Statistics
    20. Using the TRIM Option of PROC UNIVARIATE and ODS to Compute Trimmed Statistics
    21. Checking a Range Based on the Interquartile Range
  6. 3 Checking for Missing Values
    1. Introduction
    2. Inspecting the SAS Log
    3. Using PROC MEANS and PROC FREQ to Count Missing Values
    4. Using DATA Step Approaches to Identify and Count Missing Values
    5. Searching for a Specific Numeric Value
    6. Creating a Macro to Search for Specific Numeric Values
  7. 4 Working with Dates
    1. Introduction
    2. Checking Ranges for Dates (Using a DATA Step)
    3. Checking Ranges for Dates (Using PROC PRINT)
    4. Checking for Invalid Dates
    5. Working with Dates in Nonstandard Form
    6. Creating a SAS Date When the Day of the Month Is Missing
    7. Suspending Error Checking for Known Invalid Dates
  8. 5 Looking for Duplicates and “n” Observations per Subject
    1. Introduction
    2. Eliminating Duplicates by Using PROC SORT
    3. Detecting Duplicates by Using DATA Step Approaches
    4. Using PROC FREQ to Detect Duplicate ID’s
    5. Selecting Patients with Duplicate Observations by Using a Macro List and SQL
    6. Identifying Subjects with “n” Observations Each (DATA Step Approach)
    7. Identifying Subjects with “n” Observations Each (Using PROC FREQ)
  9. 6 Working with Multiple Files
    1. Introduction
    2. Checking for an ID in Each of Two Files
    3. Checking for an ID in Each of “n” Files
    4. A Macro for ID Checking
    5. More Complicated Multi-File Rules
    6. Checking That the Dates Are in the Proper Order
  10. 7 Double Entry and Verification (PROC COMPARE)
    1. Introduction
    2. Conducting a Simple Comparison of Two Data Sets
    3. Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
    4. Comparing Two Data Sets When Some Variables Are Not in Both Data Sets
  11. 8 Some PROC SQL Solutions to Data Cleaning
    1. Introduction
    2. A Quick Review of PROC SQL
    3. Checking for Invalid Character Values
    4. Checking for Outliers
    5. Checking a Range Using an Algorithm Based on the Standard Deviation
    6. Checking for Missing Values
    7. Range Checking for Dates
    8. Checking for Duplicates
    9. Identifying Subjects with “n” Observations Each
    10. Checking for an ID in Each of Two Files
    11. More Complicated Multi-File Rules
  12. 9 Correcting Errors
    1. Introduction
    2. Hardcoding Corrections
    3. Describing Named Input
    4. Reviewing the UPDATE Statement
  13. 10 Creating Integrity Constraints and Audit Trails
    1. Introducing SAS Integrity Constraints
    2. Demonstrating General Integrity Constraints
    3. Deleting an Integrity Constraint Using PROC DATASETS
    4. Creating an Audit Trail Data Set
    5. Demonstrating an Integrity Constraint Involving More than One Variable
    6. Demonstrating a Referential Constraint
    7. Attempting to Delete a Primary Key When a Foreign Key Still Exists
    8. Attempting to Add a Name to the Child Data Set
    9. Demonstrating the Cascade Feature of a Referential Constraint
    10. Demonstrating the SET NULL Feature of a Referential Constraint
    11. Demonstrating How to Delete a Referential Constraint
  14. 11 DataFlux and dfPower Studio
    1. Introduction
    2. Examples
  15. Appendix Listing of Raw Data Files and SAS Programs
    1. Programs and Raw Data Files Used in This Book
    2. Description of the Raw Data File PATIENTS.TXT
    3. Layout for the Data File PATIENTS.TXT
    4. Listing of Raw Data File PATIENTS.TXT
    5. Program to Create the SAS Data Set PATIENTS
    6. Listing of Raw Data File PATIENTS2.TXT
    7. Program to Create the SAS Data Set PATIENTS2
    8. Program to Create the SAS Data Set AE (Adverse Events)
    9. Program to Create the SAS Data Set LAB_TEST
    10. Listings of the Data Cleaning Macros Used in This Book
  16. Index