You are previewing Cody’s Data Cleaning Techniques Using SAS® Software.
O'Reilly logo
Cody’s Data Cleaning Techniques Using SAS® Software

Book Description

The key to ensuring accurate data is having clean data. This book develops and describes data cleaning programs and macros. You can use many of the programs and macros that are provided, as is, or you can modify them for your own special data cleaning tasks. Ron has carefully explained and documented each of the programs and macros, thus providing you with SAS programming instruction on an intermediate-to-advanced level. Topics presented include validation checks on character data, numeric data, missing values, and date values; searching for duplicate records; working with multiple files; double entry and verification using the COMPARE procedure; and SQL solutions and using validation data sets. Written in Ron's signature informal, tutorial style, this book gives anyone who manages data thoroughly documented, step-by-step instructions for the development of data cleaning programs and macros.

Supports releases 6.12 and higher of SAS software.

Table of Contents

  1. Copyright
  2. Introduction
  3. Acknowledgments
  4. Checking Values of Character Variables
    1. Introduction
    2. Using PROC FREQ to List Values
    3. Description of the Raw Data File PATIENTS.TXT
    4. Using a DATA Step to Check for Invalid Values
    5. Using PROC PRINT with a WHERE Statement to List Invalid Values
    6. Using Formats to Check for Invalid Values
    7. Using Informats to Check for Invalid Values
  5. Checking Values of Numeric Variables
    1. Introduction
    2. Using PROC MEANS, PROC TABULATE, and PROC UNIVARIATE to Look for Outliers
    3. Using PROC PRINT with a WHERE Statement to List Invalid Data Values
    4. Using a DATA Step to Check for Invalid Values
    5. Creating a Macro for Range Checking
    6. Using Formats to Check for Invalid Values
    7. Using Informats to Check for Invalid Values
    8. Using PROC UNIVARIATE to Look for Highest and Lowest Values by Percentage
    9. Using PROC RANK to Look for Highest and Lowest Values by Percentage
    10. Extending PROC RANK to Look for Highest and Lowest “n” Values
    11. Finding Another Way to Determine Highest and Lowest Values
    12. Checking a Range Using an Algorithm Based on Standard Deviation
    13. Macros Based on the Two Methods of Outlier Detection
    14. Checking a Range Based on the Interquartile Range
    15. Checking Ranges for Several Variables
  6. Checking for Missing Values
    1. Introduction
    2. Inspecting the SAS Log
    3. Using PROC MEANS and PROC FREQ to Count Missing Values
    4. Using DATA Step Approaches to Identify and Count Missing Values
    5. Using PROC TABULATE to Count Missing and Nonmissing Values for Numeric Variables
    6. Using PROC TABULATE to Count Missing and Nonmissing Values for Character Variables
    7. Creating a General Purpose Macro to Count Missing and Nonmissing Values for Both Numeric and Character Variables
    8. Searching for a Specific Numeric Value
  7. Working with Dates
    1. Introduction
    2. Checking Ranges for Dates (Using a DATA Step)
    3. Checking Ranges for Dates (Using PROC PRINT)
    4. Checking for Invalid Dates
    5. Working with Dates in Nonstandard Form
    6. Creating a SAS Date When the Day of the Month Is Missing
    7. Suspending Error Checking for Known Invalid Dates
  8. Looking for Duplicates and “n” Observations per Subject
    1. Introduction
    2. Eliminating Duplicates by Using PROC SORT
    3. Detecting Duplicates by Using DATA Step Approaches
    4. Using PROC FREQ to Detect Duplicate ID’s
    5. Selecting Patients with Duplicate Observations by Using a Macro List and SQL
    6. Identifying Subjects with “n” Observations Each (DATA Step Approach)
    7. Identifying Subjects with “n” Observations Each (Using PROC FREQ)
  9. Working with Multiple Files
    1. Introduction
    2. Checking for an ID in Each of Two Files
    3. Checking for an ID in Each of “n” Files
    4. A Simple Macro to Check ID’s in Multiple Files
    5. A More Complicated Multi-File Macro for ID Checking
    6. More Complicated Multi-File Rules
    7. Checking That the Dates Are in the Proper Order
  10. Double Entry and Verification (PROC COMPARE)
    1. Introduction
    2. Conducting a Simple Comparison of Two Data Sets without an ID Variable
    3. Using PROC COMPARE with an ID Variable
    4. Using PROC COMPARE with Two Data Sets That Have an Unequal Number of Observations
    5. Comparing Two Data Sets When Some Variables Are Not in Both Data Sets
  11. Some SQL Solutions to Data Cleaning
    1. Introduction
    2. A Quick Review of PROC SQL
    3. Checking for Invalid Character Values
    4. Checking for Outliers
    5. Checking a Range Using an Algorithm Based on the Standard Deviation
    6. Checking for Missing Values
    7. Range Checking for Dates
    8. Checking for Duplicates
    9. Identifying Subjects with “n” Observations Each
    10. Checking for an ID in Each of Two Files
    11. More Complicated Multi-File Rules
  12. Using Validation Data Sets
    1. Introduction
    2. A Simple Example of a Validation Data Set
    3. Making the Program More Flexible and Converting It to a Macro
    4. Validating Character Data
    5. Converting Program 9-7 into a General Purpose Macro
    6. Extending the Validation Macro to Include Valid Character Ranges
    7. Combining Numeric and Character Validity Checks in a Single Macro with a Single Validation Data Set
    8. Introducing SAS Integrity Constraints (Versions 7 and Later)
  13. Listing of Raw Data Files and SAS Programs
    1. Description of the Raw Data File PATIENTS.TXT
    2. Layout for the Data File PATIENTS.TXT
    3. Listing of Raw Data File PATIENTS.TXT
    4. Program to Create the SAS Data Set PATIENTS
    5. Listing of Raw Data File PATIENTS2.TXT
    6. Program to Create the SAS Data Set PATIENTS2
    7. Program to Create the SAS Data Set AE (Adverse Events)
    8. Program to Create the SAS Data Set LAB_TEST
  14. Books Available from SAS® Press
    1. JMP® Books
  15. Index