Sep 1, 2011

Common Programming Mistake with Proc Sort NODUPRECS

Every SAS programmer would know that Proc Sort NODUPRECS would remove
exact duplicates…but often forgets / misuses it and alas gets into
trouble.
It is very important that the programmer should properly code the
variables in the BY statement. For the procedure to actually remove
the duplicates, the duplicate records should be brought together by
sorting the data with ALL the variables in the BY statement.
Let me explain it through an example here…Watch the lines in RED

data prdsal3;
set sashelp.prdsal3;
where product="SOFA"
and date between "01mar1998"d and "30apr1998"d and state="California";
keep State Date Actual;
run;

 
Here the input data has 2 duplicates obs 1 and 5..
When a Programmer mistakenly codes like this..i.e. omitting the
variable actual he get the output like the one below
proc sort data=prdsal3 out=sortedrandomly noduprecs;
by state date;
run;
 
 
 
The output shows the sorted order by state and date…the duplicates
moved to 1 and 3.

In the Output with the proc sort NODUPRECS did not eliminate the
duplicates because the original duplicate records (1 and 5) were NOT
sorted properly for NODUPRECS to eliminate consecutive Duplicate
records…
The SAS documentation says that
NODUPRECS
checks for and eliminates duplicate observations. If you specify this
option, then PROC SORT compares all variable values for each
observation to those for the previous observation that was written to
the output data set. If an exact match is found, then the observation
is not written to the output data set.
Note:   See NODUPKEY for information about eliminating observations
with duplicate BY values.
Alias :        NODUP
Interaction:        When you are removing consecutive duplicate observations
in the output data set with NODUPRECS, the choice of EQUALS or
NOEQUALS can have an effect on which observations are removed.
When the Programmer codes it correctly with all the variables…the
desired result is achieved…
proc sort data=prdsal3 out=prdsal3_sorted noduprecs;
by state actual date;
run;
 
The best possible to achieve this result is to use the _all_ auto
variable in the BY statement instead of specifying all the variables.
proc sort data=prdsal3 out=prdsal3_byall noduprecs;
by _all_;
run;
 

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.