Detecting Duplicates by Using DATA Step Approaches

Let’s explore ways to detect duplicate IDs and duplicate observations in a data set. One very good approach is to use the temporary SAS variables FIRST. and LAST., which are available in a DATA step for each variable named in a BY statement (for example, FIRST.PATNO and LAST.PATNO). To see how this works, look at Program 5-4, which prints all observations that have duplicate patient numbers.

Program 5-4. Identifying Duplicate IDs
PROC SORT DATA=CLEAN.PATIENTS OUT=TMP;  /* 1 */
   BY PATNO;
RUN;

DATA DUP;
   SET TMP;
   BY PATNO;  /* 2 */
   /* A patient number that is both first and last in its BY group is unique */
   IF FIRST.PATNO AND LAST.PATNO THEN DELETE;  /* 3 */
RUN;

PROC PRINT DATA=DUP;
   TITLE "Listing of Duplicates from Data Set CLEAN.PATIENTS";
   ID PATNO;
RUN;

It’s first necessary to sort the data set by the ID variable (1). In the above program, the original data ...
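
To make the FIRST. and LAST. logic in Program 5-4 more concrete, here is a minimal sketch that copies the two temporary variables into ordinary variables so they appear in a listing. The data set names TOY and FLAGS, the VISIT variable, and the data values are made up for illustration; they are not part of CLEAN.PATIENTS.

```sas
/* Hypothetical toy data: patient 002 appears twice */
DATA TOY;
   INPUT PATNO $ VISIT;
DATALINES;
001 1
002 1
002 2
003 1
;
RUN;

PROC SORT DATA=TOY;
   BY PATNO;
RUN;

DATA FLAGS;
   SET TOY;
   BY PATNO;
   /* FIRST. and LAST. variables are not written to the output
      data set, so copy them to ordinary variables */
   IS_FIRST = FIRST.PATNO;
   IS_LAST  = LAST.PATNO;
   /* Same test as Program 5-4, kept as a flag instead of a DELETE */
   IS_DUP   = NOT (FIRST.PATNO AND LAST.PATNO);
RUN;

PROC PRINT DATA=FLAGS;
   TITLE "FIRST. and LAST. Values for Each Observation";
   ID PATNO;
RUN;
```

In this sketch, patients 001 and 003 each have IS_FIRST=1 and IS_LAST=1, so IS_DUP is 0. The two observations for patient 002 have (IS_FIRST=1, IS_LAST=0) and (IS_FIRST=0, IS_LAST=1), so both are flagged with IS_DUP=1. These are the observations that the DELETE statement in Program 5-4 would keep.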
