I usually am emphatic that the best way to improve model quality is to GET BETTER DATA!. However, Yesterday I used advanced statistics (well advanced for the business world I am in) to address a data quality issue. The situation arises due to restrictions put by government on the use of prescription data put in the public domain. When the data reaches my computer it has been through several hands and on the was some intentional noise is introduced and the relevant information put into bands to protect the individual GP. The way the annonimization is introduced may not be treated as fully random and independent noise.

Usually I just define a binary target variable identifying the top 20% prescribers and fit a logistic regression explaining who are the big prescribers (or even better, who are the fastest growers). However, for a drug that has just been launched the rate of misscalsification in the data is too uncomfortable. About 30% of the prescribers are masked in the data handed to me as non-prescribers – Gahhhh!

Lets say my binary target variable is called ‘Prescribed’ where 1 means at least one prescription in the period. It has two flavours the annonimized that I have and the true value that I wish I had. I can define two probabilities:

P

_{1}=P(True Prescibed=1|Annonimized Prescibed=1)P

_{0}=P(True Prescibed=0|Annonimized Prescibed=0)The less equal these probabilities are the more you should be worried even if both are relatively small.

If I disregard this issue and assuming I have only one independent variable called Var1 then I would naively do a logistic regression:

**proc**

**logistic**data=Staging.Mart;

model Prescibed(event='1')=Var1;

**run**;

Analysis of Maximum Likelihood Estimates | |||||

Parameter | DF | Estimate | StandardError | WaldChi-Square | Pr > ChiSq |

Intercept | 1 | -3.2095 | 0.0278 | 13351.1260 | <.0001 |

Var1 | 1 | 0.0848 | 0.00257 | 1087.7234 | <.0001 |

I could use NLMIXED to achieve the same analysis:

**proc**

**nlmixed**data=Staging.Mart qpoints=

**50**;

parms b0=

**0**b1=**0**; Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(

**1**+ExpEta)); model Prescibed ~ binary(p);

**run**;

Parameter Estimates | |||||||||

Parameter | Estimate | Standard Error | DF | t Value | Pr > |t| | Alpha | Lower | Upper | Gradient |

b0 | -3.2095 | 0.02778 | 37E3 | -115.55 | <.0001 | 0.05 | -3.2639 | -3.1551 | -0.00003 |

b1 | 0.08480 | 0.002571 | 37E3 | 32.98 | <.0001 | 0.05 | 0.07976 | 0.08984 | -0.00291 |

The code may be modified slightly to do the same job but allow for further flexability – a general log likelihood definition:

**proc**

**nlmixed**data=Staging.Mart qpoints=

**50**;

parms b0=

**0**b1=**0**; Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(

**1**+ExpEta)); ll=Prescibed*log(p)+(

**1**-Prescibed)*log(**1**-p); model Prescibed ~ general(ll);

**run**;

Now I can introduce the noise probabilities P1 and P0:

**proc**

**nlmixed**data=Staging.Mart qpoints=

**50**;

parms b0=

**0**b1=**0**; Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(

**1**+ExpEta)); P1=

**0.99**; *P(True Prescibed=1|Prescibed=1); P0=

**0.90**; *P(True Prescibed=0|Prescibed=0); if Prescibed=

**1**then ll=(log(p)+**0***log(**1**-p))*P1+ (

**0***log(p)+**1***log(**1**-p))*(**1**-P1); else ll=(log(p)+

**0***log(**1**-p))*(**1**-P0)+ (

**0***log(p)+**1***log(**1**-p))*P0; model Prescibed ~ general(ll);

**run**;

Parameter Estimates | |||||||||

Parameter | Estimate | Standard Error | DF | t Value | Pr > |t| | Alpha | Lower | Upper | Gradient |

b0 | -1.8449 | 0.01555 | 37E3 | -118.61 | <.0001 | 0.05 | -1.8753 | -1.8144 | -0.00004 |

b1 | 0.03795 | 0.001994 | 37E3 | 19.03 | <.0001 | 0.05 | 0.03404 | 0.04186 | -0.00089 |

The parameter estimates have changed significantly:

Another fun day at the office.