Thursday, 9 June 2011

Accounting for Anonymization Noise in a Binary Target Variable (NLMIXED)


I am usually emphatic that the best way to improve model quality is to GET BETTER DATA! However, yesterday I used advanced statistics (well, advanced for the business world I am in) to address a data quality issue. The situation arises from government restrictions on the use of prescription data placed in the public domain. By the time the data reaches my computer it has been through several hands; on the way some intentional noise is introduced and the relevant information is put into bands to protect the individual GP. The way the anonymization is introduced means it should not be treated as fully random, independent noise.

Usually I just define a binary target variable identifying the top 20% of prescribers and fit a logistic regression explaining who the big prescribers are (or, even better, who the fastest growers are). However, for a drug that has just been launched, the rate of misclassification in the data is uncomfortably high. About 30% of the actual prescribers show up in the data handed to me as non-prescribers – Gahhhh!

Let's say my binary target variable is called Prescibed, where 1 means at least one prescription in the period. It comes in two flavours: the anonymized version that I actually have and the true version that I wish I had. I can define two probabilities:

P1 = P(True Prescibed=1 | Anonymized Prescibed=1)
P0 = P(True Prescibed=0 | Anonymized Prescibed=0)

The less equal these two probabilities are, the more you should worry, even if the amount of noise in each direction is relatively small.
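To make P1 and P0 concrete, here is a minimal simulation sketch (the data set name, sample size, and noise rates are made up for illustration and are not the actual anonymization scheme): generate a true flag, mask some of the true prescribers, and cross-tabulate true against anonymized – the row percentages approximate P1 and P0.

data NoiseDemo;
 call streaminit(2011);
 do id=1 to 10000;
  True_Prescibed=(rand('uniform')<0.20);  *roughly a top 20% cut;
  Anon_Prescibed=True_Prescibed;
  *assumed masking - about 30% of true prescribers reported as non-prescribers;
  if True_Prescibed=1 and rand('uniform')<0.30 then Anon_Prescibed=0;
  output;
 end;
run;

proc freq data=NoiseDemo;
 tables Anon_Prescibed*True_Prescibed / nocol nopercent;
 run;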
If I disregard this issue, and assuming I have only one independent variable, called Var1, I would naively run a logistic regression:

proc logistic data=Staging.Mart;
 model Prescibed(event='1')=Var1;
 run;
Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept     1     -3.2095            0.0278         13351.1260        <.0001
Var1          1      0.0848           0.00257          1087.7234        <.0001

I could use NLMIXED to achieve the same analysis:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 model Prescibed ~ binary(p);
 run;
Parameter Estimates

Parameter    Estimate    Standard Error      DF    t Value    Pr > |t|    Alpha      Lower      Upper    Gradient
b0            -3.2095           0.02778    37E3    -115.55      <.0001     0.05    -3.2639    -3.1551    -0.00003
b1            0.08480          0.002571    37E3      32.98      <.0001     0.05    0.07976    0.08984    -0.00291

The code may be modified slightly to do the same job while allowing further flexibility – a general log-likelihood definition:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 ll=Prescibed*log(p)+(1-Prescibed)*log(1-p);
 model Prescibed ~ general(ll);
 run;

Now I can introduce the noise probabilities P1 and P0:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 P1=0.99; *P(True Prescibed=1|Prescibed=1);
 P0=0.90; *P(True Prescibed=0|Prescibed=0);
 /* average the log likelihood over the two possible true values,
    weighted by their probability given the observed (anonymized) value */
 if Prescibed=1 then ll=(log(p)+0*log(1-p))*P1+
                     (0*log(p)+1*log(1-p))*(1-P1);
             else ll=(log(p)+0*log(1-p))*(1-P0)+
                     (0*log(p)+1*log(1-p))*P0;
 model Prescibed ~ general(ll);
 run;
Parameter Estimates

Parameter    Estimate    Standard Error      DF    t Value    Pr > |t|    Alpha      Lower      Upper    Gradient
b0            -1.8449           0.01555    37E3    -118.61      <.0001     0.05    -1.8753    -1.8144    -0.00004
b1            0.03795          0.001994    37E3      19.03      <.0001     0.05    0.03404    0.04186    -0.00089

The parameter estimates have changed substantially.
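Since P1 and P0 are assumed rather than estimated from the data, it is worth checking how sensitive the estimates are to them. A rough sketch of such a check (the macro name, the output data set names, and the grid of values are made up for illustration):

%macro NoiseSensitivity(p1=, p0=, out=);
 ods output ParameterEstimates=&out.;
 proc nlmixed data=Staging.Mart qpoints=50;
  parms b0=0 b1=0;
  Eta=b0+b1*Var1;
  ExpEta=exp(Eta);
  P=(ExpEta/(1+ExpEta));
  *same weighted log likelihood as above, with P1 and P0 passed as parameters;
  if Prescibed=1 then ll=log(p)*&p1.+log(1-p)*(1-&p1.);
              else ll=log(p)*(1-&p0.)+log(1-p)*&p0.;
  model Prescibed ~ general(ll);
 run;
%mend NoiseSensitivity;

%NoiseSensitivity(p1=0.99, p0=0.85, out=PE_85);
%NoiseSensitivity(p1=0.99, p0=0.90, out=PE_90);
%NoiseSensitivity(p1=0.99, p0=0.95, out=PE_95);

If the ranking of prescribers barely moves across a plausible range of P1 and P0, the exact noise assumptions matter less than they first appear.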

Another fun day at the office.

Monday, 6 June 2011

Cleaning temporary files

Experienced SAS users usually know where to find the work folder and how to clean it manually. Now, at long last, there is an elegant solution for Windows.

Usually, when the SAS session is terminated, the application does some housecleaning. When the session is abandoned abruptly, the work folder and its contents are left on the disk. In time these temporary files may clog up the file storage and slow down the box. A Unix cleanup script has been around for a good decade or so, but on the Windows operating system I used to purge the files manually from time to time. Now, however, there is a utility that not only cleans the work folders but does a little bit more besides.

Note: Another container for unexpected file clutter is the sasuser folder. Mine is C:\Documents and Settings\eli\My Documents\My SAS Files\sasuser. A sloppy programmer will find their SGPLOT output and HTML reports accumulating there rapidly.
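One way to keep that clutter down is to point ODS output at a location you choose rather than the default; a minimal sketch, with an illustrative path and file name:

*route HTML output (and the SGPLOT graph it contains) to a deliberate folder instead of letting it accumulate in sasuser - the path is illustrative;
ods html path="C:\temp\reports" (url=none) body="report.html";
proc sgplot data=sashelp.class;
 scatter x=height y=weight;
 run;
ods html close;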

How do I know the physical path to the work folder?


The standard installation stores all the work subfolders under C:\DOCUME~1\<Windows user id>\LOCALS~1\Temp\SAS Temporary Files. I have a desktop shortcut pointing there and occasionally I delete the subfolders manually.

The simple way is to right-click the Work library and look at its Properties:


Alternatively, submit
proc options option=work; run;

and the result will show in the log window.

A slightly more sophisticated and elegant way is:

%let WorkPath = %sysfunc(getoption(work));
%put <<&WorkPath.>>;
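Building on that, a small convenience sketch (this is my own illustration, not part of the cleanup utility described below): strip off the session-specific subfolder and open the parent "SAS Temporary Files" folder in Windows Explorer, where leftovers from crashed sessions pile up.

*convenience sketch - open the parent temp folder for manual cleanup;
%let WorkPath = %sysfunc(getoption(work));
*drop the last folder level (the current session work subfolder);
%let TempRoot = %substr(&WorkPath, 1, %eval(%length(&WorkPath) - %length(%scan(&WorkPath, -1, \)) - 1));
options noxwait noxsync;
x "explorer ""&TempRoot.""";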

How do I run and schedule the cleanup utility?

This cool utility is available with SAS 9.2 and works only if .NET Framework 2.0 is installed. The cool thing about it is that it is a bolt-on to the Windows 'Disk Cleanup' utility. Once it is installed, all you have to do is call 'Disk Cleanup' (in Explorer, select My Computer, right-click a local physical hard disk, and choose Properties > Disk Cleanup) and then make sure there is a checkmark next to "SAS Temporary Files".


I did not have the utility installed, so I found where to download it in this SAS note: http://support.sas.com/kb/8/786.html.