Thursday 9 June 2011

Accounting for anonymization noise in a binary target variable (NLMIXED)


I am usually emphatic that the best way to improve model quality is to GET BETTER DATA! Yesterday, however, I used advanced statistics (well, advanced for the business world I am in) to address a data quality issue. The situation arises from restrictions the government places on the use of prescription data released into the public domain. By the time the data reaches my computer it has passed through several hands; along the way some intentional noise is introduced and the relevant information is put into bands to protect the individual GP. The way the anonymization is introduced may not be treated as fully random and independent noise.

Usually I just define a binary target variable identifying the top 20% of prescribers and fit a logistic regression explaining who the big prescribers are (or, even better, who the fastest growers are). However, for a drug that has just been launched the rate of misclassification in the data is too uncomfortable. About 30% of the prescribers are masked in the data handed to me as non-prescribers – Gahhhh!

Let's say my binary target variable is called 'Prescibed', where 1 means at least one prescription in the period. It comes in two flavours: the anonymized value that I have and the true value that I wish I had. I can define two probabilities:
P1 = P(True Prescibed=1 | Anonymized Prescibed=1)
P0 = P(True Prescibed=0 | Anonymized Prescibed=0)
The less equal these two probabilities are, the more you should be worried, even if the misclassification rates themselves are relatively small.
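To get a feel for the size of P0, a quick back-of-envelope Bayes calculation helps. The 30% masking rate quoted above is P(Anonymized Prescibed=0 | True Prescibed=1); combine it with a share of true prescribers (the 20% below is purely my illustrative assumption) and assume, for simplicity, no masking in the other direction:

data _null_;
 pi   = 0.20; * assumed share of true prescribers (illustrative only);
 mask = 0.30; * P(Anonymized=0 | True=1), the masking rate quoted above;
 P0   = (1 - pi) / (mask*pi + (1 - pi)); * P(True=0 | Anonymized=0) by Bayes;
 put 'P0 = ' P0 5.3;
 run;

This prints P0 = 0.930, in the same ballpark as the value used in the corrected model further down.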
If I disregard this issue, and assuming I have only one independent variable called Var1, I would naively fit a logistic regression:

proc logistic data=Staging.Mart;
 model Prescibed(event='1')=Var1;
 run;
Analysis of Maximum Likelihood Estimates

                                  Standard         Wald
Parameter    DF    Estimate         Error      Chi-Square    Pr > ChiSq

Intercept     1     -3.2095        0.0278      13351.1260        <.0001
Var1          1      0.0848       0.00257       1087.7234        <.0001

I could use NLMIXED to achieve the same analysis:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 model Prescibed ~ binary(p);
 run;
Parameter Estimates

                        Standard
Parameter   Estimate       Error      DF    t Value    Pr > |t|    Alpha      Lower      Upper    Gradient

b0           -3.2095     0.02778    37E3    -115.55      <.0001     0.05    -3.2639    -3.1551    -0.00003
b1           0.08480    0.002571    37E3      32.98      <.0001     0.05    0.07976    0.08984    -0.00291

The code may be modified slightly to do the same job while allowing further flexibility – a general log-likelihood definition:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 ll=Prescibed*log(p)+(1-Prescibed)*log(1-p);
 model Prescibed ~ general(ll);
 run;

Now I can introduce the noise probabilities P1 and P0 by replacing each observation's log likelihood with its expectation over the true value, given the observed (anonymized) value:

proc nlmixed data=Staging.Mart qpoints=50;
 parms b0=0 b1=0;
 Eta=b0+b1*Var1;
 ExpEta=exp(Eta);
 P=(ExpEta/(1+ExpEta));
 P1=0.99; *P(True Prescibed=1|Prescibed=1);
 P0=0.90; *P(True Prescibed=0|Prescibed=0);
 * weight the log likelihood of each possible true value by its
   probability given the observed (anonymized) value;
 if Prescibed=1 then ll=(log(p)+0*log(1-p))*P1+
                     (0*log(p)+1*log(1-p))*(1-P1);
             else ll=(log(p)+0*log(1-p))*(1-P0)+
                     (0*log(p)+1*log(1-p))*P0;
 model Prescibed ~ general(ll);
 run;
Parameter Estimates

                        Standard
Parameter   Estimate       Error      DF    t Value    Pr > |t|    Alpha      Lower      Upper    Gradient

b0           -1.8449     0.01555    37E3    -118.61      <.0001     0.05    -1.8753    -1.8144    -0.00004
b1           0.03795    0.001994    37E3      19.03      <.0001     0.05    0.03404    0.04186    -0.00089

The parameter estimates have changed substantially.
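To see what that shift means in practice, here is a quick sketch comparing predicted probabilities from the two fits, using the estimates printed above (the range of Var1 values is arbitrary, chosen only for illustration):

data CompareFits;
 do Var1=0 to 20 by 5;
  Eta_naive     = -3.2095 + 0.0848*Var1;   * naive logistic estimates;
  Eta_corrected = -1.8449 + 0.03795*Var1;  * noise-corrected NLMIXED estimates;
  p_naive     = exp(Eta_naive)/(1+exp(Eta_naive));
  p_corrected = exp(Eta_corrected)/(1+exp(Eta_corrected));
  output;
 end;
 run;
proc print data=CompareFits noobs;
 var Var1 p_naive p_corrected;
 format p_naive p_corrected 6.3;
 run;

Keep in mind that the corrected curve estimates the probability of truly prescribing, while the naive one estimates the probability of appearing as a prescriber in the anonymized data.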

 
Another fun day at the office.

Monday 6 June 2011

Cleaning temporary files

Experienced sas users usually know where to find the work folder and how to clean it manually. Now there is an elegant solution for Windows – at long last.

Usually when the sas session is terminated, the application does some housekeeping. When the session is abandoned abruptly, the work folder and its contents are left on the disk. In time these temporary files may clog up the file storage and slow down the box. A Unix cleanup script has been around for a good decade or so, but on the Windows operating system I used to purge the files manually every now and then. Now, however, there is a utility that not only cleans the work folders but does a little bit more.

Note: another container for unexpected file clutter is the sasuser folder. Mine is C:\Documents and Settings\eli\My Documents\My SAS Files\sasuser. A sloppy programmer will find their SGPLOT output and HTML reports accumulating there rapidly.

How do I know the physical path to the work folder?


The standard installation stores all the work subfolders under C:\DOCUME~1\<windows user id>\LOCALS~1\Temp\SAS Temporary Files. I have a desktop shortcut pointing there and occasionally I delete the subfolders manually.

The simple way is to right-click the work library and look at its properties:


Alternatively submit
proc options option=work; run;

and  the result will show in the log window.

A slightly more sophisticated and elegant way is:

%let WorkPath = %sysfunc(getoption(work));
%put <<&WorkPath.>>;
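Once you have the path, a small data step can list whatever is sitting in the parent 'SAS Temporary Files' folder, so stale session folders are easy to spot before deleting them manually. This is only a minimal sketch, assuming the default layout where every session gets its own subfolder under one parent (the current session's folder will show up in the list too):

%let WorkPath = %sysfunc(getoption(work));
data LeftoverWork;
 length Parent Folder $ 260;
 * strip the last path component to get the parent SAS Temporary Files folder;
 Parent = substr("&WorkPath.", 1,
                 length("&WorkPath.") - length(scan("&WorkPath.", -1, '\')) - 1);
 rc  = filename('pdir', Parent);
 did = dopen('pdir');
 if did > 0 then do;
  do i = 1 to dnum(did);
   Folder = dread(did, i);  * one row per leftover work subfolder;
   output;
  end;
  rc = dclose(did);
 end;
 keep Folder;
 run;
proc print data=LeftoverWork; run;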

How do I run and schedule the cleanup utility?

This cool utility is available for sas 9.2 and works only if .NET Framework 2.0 is installed. The cool thing about it is that it is a bolt-on to the Windows 'Disk Cleanup' utility. Once installed, all you have to do is call the 'Disk Cleanup' utility (from Explorer, select My Computer, right-click a local physical hard disk, and select Properties > Disk Cleanup) and then ensure there is a checkmark next to "SAS Temporary Files".


I did not have the utility installed, so I found the download location in this sas note: http://support.sas.com/kb/8/786.html.

Wednesday 18 May 2011

Making the results of the analysis accessible

We have all been there. You have just completed an analysis to be proud of. You cleverly collected data from many information sources. Then, through nifty data management that really shows off your sas skills, you thickened the data mart with meaningful aggregations, transformations and imputations. And to cap it all, you performed some brilliant statistical modelling, pushing your personal boundaries. But when you try to communicate this you encounter glazed looks and you feel your effort is not appreciated. Even worse, you learn that the results of your model (e.g. a segmentation or a scoring) are not really bought into by management and the business.
When I was a youngling I taught introduction to statistics courses at the Open University, along with similar courses that were compulsory as part of a degree in psychology. That was invaluable experience in honing and toning my ability to explain and discuss statistics. However, explaining what a regression is all about is rather different from telling people about an analysis and discussing its results.

When I worked in mainland Europe I developed and adapted a mode of communicating results that was slightly peppered with 'statistical jargon'. The people I worked with, such as marketing managers and back-office managers, had some statistical training in their past. Not only were they comfortable with box plots, lift charts and stepwise variable selection, they expected to hear about them. Moreover, there was an appetite to explore innovative statistical techniques, as this was perceived as essential to the business's survival in the market.

When I started working in the UK I had to change the way I talked and presented. The people I worked with did not want to go beyond discussing basic averages. They still wanted sophisticated and advanced analysis and solutions, but challenged me to communicate them at the 'shop steward' level. To an extent that is because some of the clients grew from the shop floor, so to speak. Moreover, there is a tendency to share the results with the field, which is great. The ultimate challenge was supporting a team preparing for tough negotiations with the trade union, where the spirit was to share the facts and analysis so the discussions could focus on strategy and planning.
The current client I am working for is exploring how to improve the way the analytics team communicates and presents analysis, findings and recommendations. The analysts are asking themselves how to up their game and talk statistics with the business without seeing that glazed look. The challenge is not just communicating with the decision makers but also gaining buy-in from the field. The consensus is that we should not fall into the trap of telling everyone how great the analysis is. Instead the approach should be "you should trust us to do a good job; now let's tell you what we found." The trust in the team's skills and abilities should be earned through the daily interactions with the business. A presentation of results should focus on the business and address its pain. It should not be a navel-gazing exercise.
Taking a step back to basics, the key question is: "What do they really want to know?"
·         What information sources did we explore – did we cover the data they expected, and more than that? This is an important first step in gaining their confidence in you.
·         What are the main findings – they do not want to hear about coefficients and correlations. They would like a high-level summary such as "The number of face-to-face sales visits does not seem to be predictive when accounting for X". Even if they do not like the message, at least they understand it and they know that they should concentrate on X. They might ask for evidence, and you should have the response ready in a format that is appropriate for the audience.
It is a misconception that the sleekness of the communication of results flows from the dichotomy between "academics in their ivory towers" and "in-tune consultants". I witnessed a reputable consulting firm prepare a "deck" of about 300 backup slides for a one-hour presentation. Admittedly, it was their way of getting the team to address questions and document the thought process alongside the findings. However, after weeks of sleepless nights, the result was that no one in the team could remember what it was all about, let alone reproduce the numbers. Moreover, instead of creating more confidence in the analysis it achieved the opposite. Each graph and table needed some time to digest and understand. Many of them showed either no effect or statistically significant differences that were not practically significant. The longer this went on during the meeting, the more the feeling grew: "These clever guys might understand this, but I do not have the time – or are they pulling one over?"
We just finished a few high-profile modelling and targeting projects for keystone products. Our findings dispelled common beliefs and suggested a new strategy. The first presentation that we prepared was the bog-standard 'tell them everything – a graph is worth more than a thousand words'. It did not work. I spent a week reworking the presentation and ended up with 7 slides, mainly bullets, with only two killer graphs that together brought the message home. It worked really well. When designing the presentation and the graphs, I harked back to my student days (the first fish were starting to climb out of the sea just about then), when Professor Benjamini (http://www.math.tau.ac.il/~ybenja/) introduced us to Tufte's work (http://www.edwardtufte.com/tufte/) and discussed other research on the cognitive perception of graphical information (see the high-level overview at http://www.perceptualedge.com/files/GraphDesignIQ.html).
There are a few challenges to keep in mind:
·         Do the correct modelling – an elegant and simple presentation should not be an excuse to discount statistical rigour.
·         It is important to communicate the quality of the modelling. It is not easy for analysts to avoid mentioning correlations, p-values, PPVs, sensitivity, specificity and lift values. However, is it worth spending the time educating the uninterested? It is better to translate into terms they know: "If we use this model we are likely to visit 50% more GPs that will respond positively. Had we applied this last year we probably would have seen an increase in sales of about X pounds." Now that is a challenge that merits a paper of its own.
·         Communicate the margins of error – managers understand worst case, expected case and best case scenarios.
·         Communicate innovation – that is not easy – keep trying until you get it right for your audience.
A good consultant should command the same confidence the public place in their doctors. They are trusted to know and to be experienced. What we want from them is a diagnosis and a solution.
What is the right balance? To bullet or to graph?
The answer to that, I believe, is that the client is always right.

Wednesday 23 February 2011

FCMP: BKf_BH Controlling the FDR


Only once did I see the process of incorporating cutting-edge statistical theory into a new sas procedure while actually knowing the people involved. A long time ago, when fish were just starting to explore the idea of dry-land living, I studied under Yoav Benjamini, who together with Yosi Hochberg suggested (and named) the FDR as an alternative to the FWER. As with many revolutionary concepts, it was not accepted immediately. But once a few articles were written, Yoav was contacted by sas, who asked a few questions. It was some time ago so I am a little hazy on the details, but if my memory serves me right, I saw the actual letter. Subsequently Wolfinger (I think) attended the MCP conference in Tel-Aviv and met Yoav, Yosi and Daniel Yekutieli.

Some time later, I was working at sas/Austria (or was it already sas/Denmark?) when a new proc, MultTest, was introduced. I was excited, as it was relevant to my research and also included measures suggested by people I knew – Yosi Hochberg and Yoav Benjamini. I later discovered that Yosi's measure may also be used in the MEANS statement of proc Anova and GLM (the GT2 option).

One thing jarred with me at the time. The option in proc MultTest for the step-up FDR controlling procedure was called FDR. In my opinion it should have been called BH, for the Benjamini-Hochberg procedure. Apart from giving them their due and being consistent with the naming of options such as Tukey, Dunnett etc., I knew of at least one other procedure at the time that controlled the FDR and expected more to follow. Moreover, the naming confuses the procedure with the measure.

Nowadays the FDR measure is mainstream, especially after its relevance to bioinformatics was recognized. More powerful procedures to control the FDR have been proposed and some are implemented in sas 9.2. But I still like the elegance of the BH procedure.

It was only natural that I chose to explore the FCMP functionality through the prism of the FDR (download).
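The routine itself is in the download; purely to give a flavour of the FCMP code, here is a minimal sketch (my simplified reimplementation, not the downloaded code) of a subroutine that returns Benjamini-Hochberg step-up adjusted p-values, compiled under the same name the test step below calls:

proc fcmp outlib=work.funcs.fdr;
 subroutine BKs_BH(p[*], adj[*]);
  outargs adj;  * adj[] receives the adjusted p-values;
  n = dim(p);
  do i = 1 to n;
   best = 1;  * adjusted p-values are capped at 1;
   do j = 1 to n;
    if p[j] >= p[i] then do;
     * rank of p[j] = how many p-values are no larger than it;
     r = 0;
     do k = 1 to n;
      if p[k] <= p[j] then r = r + 1;
     end;
     best = min(best, n * p[j] / r);
    end;
   end;
   adj[i] = best;
  end;
 endsub;
 run;
options cmplib=work.funcs;  * make the compiled subroutine visible to the data step;

Against the four test rows below this brute-force version reproduces the standard BH adjusted values; the downloaded routine is the one to use in practice.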

To try it run:

%Let n=3;
data test;
 * BKs_BH is an FCMP subroutine, so OPTIONS CMPLIB= must point at the data set where it was compiled;
 * Array parameters to subroutine calls must be temporary arrays;
 array a(&n.) _temporary_;
 array b(&n.);
 array c(&n.) _temporary_;
 array d(&n.);
 input b1 b2 b3;
 do i=1 to &n.; a[i]=b[i]; end;
 call BKs_BH(a,c);
 do i=1 to &n.; d[i]=c[i]; end;
 datalines;
 0.05 0.01 0.95
 5.00 0.10 0.01
 0.05 0.05 0.05
 0.03 0.02 0.01
 run;
proc print;run;

Output
Obs     b1      b2      b3       d1      d2      d3

1     0.05    0.01    0.95    0.075    0.03    0.95
2     5.00    0.10    0.01    1.000    0.15    0.03
3     0.05    0.05    0.05    0.050    0.05    0.05
4     0.03    0.02    0.01    0.030    0.03    0.03


using sas V9.2/Base