BusinessKen: Accounting for Annonimization noise in Binary Target variable (Nlmixed)

I usually am emphatic that the best way to improve model quality is to GET BETTER DATA!. However, Yesterday I used advanced statistics (well advanced for the business world I am in) to address a data quality issue. The situation arises due to restrictions put by government on the use of prescription data put in the public domain. When the data reaches my computer it has been through several hands and on the was some intentional noise is introduced and the relevant information put into bands to protect the individual GP. The way the annonimization is introduced may not be treated as fully random and independent noise.

Usually I just define a binary target variable identifying the top 20% prescribers and fit a logistic regression explaining who are the big prescribers (or even better, who are the fastest growers). However, for a drug that has just been launched the rate of misscalsification in the data is too uncomfortable. About 30% of the prescribers are masked in the data handed to me as non-prescribers – Gahhhh!

Lets say my binary target variable is called ‘Prescribed’ where 1 means at least one prescription in the period. It has two flavours the annonimized that I have and the true value that I wish I had. I can define two probabilities:

P₁=P(True Prescibed=1|Annonimized Prescibed=1)

P₀=P(True Prescibed=0|Annonimized Prescibed=0)

The less equal these probabilities are the more you should be worried even if both are relatively small.

If I disregard this issue and assuming I have only one independent variable called Var1 then I would naively do a logistic regression:

proc logistic data=Staging.Mart;

model Prescibed(event='1')=Var1;

run;

Analysis of Maximum Likelihood Estimates
Parameter	DF	Estimate	Standard Error	Wald Chi-Square	Pr > ChiSq
Intercept	1	-3.2095	0.0278	13351.1260	<.0001
Var1	1	0.0848	0.00257	1087.7234	<.0001

I could use NLMIXED to achieve the same analysis:

proc nlmixed data=Staging.Mart qpoints=50;

parms b0=0 b1=0;

Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(1+ExpEta));

model Prescibed ~ binary(p);

run;

Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper	Gradient
b0	-3.2095	0.02778	37E3	-115.55	<.0001	0.05	-3.2639	-3.1551	-0.00003
b1	0.08480	0.002571	37E3	32.98	<.0001	0.05	0.07976	0.08984	-0.00291

The code may be modified slightly to do the same job but allow for further flexability – a general log likelihood definition:

proc nlmixed data=Staging.Mart qpoints=50;

parms b0=0 b1=0;

Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(1+ExpEta));

ll=Prescibed*log(p)+(1-Prescibed)*log(1-p);

model Prescibed ~ general(ll);

run;

Now I can introduce the noise probabilities P1 and P0:

proc nlmixed data=Staging.Mart qpoints=50;

parms b0=0 b1=0;

Eta=b0+b1*Var1;

ExpEta=exp(Eta);

P=(ExpEta/(1+ExpEta));

P1=0.99; *P(True Prescibed=1|Prescibed=1);

P0=0.90; *P(True Prescibed=0|Prescibed=0);

if Prescibed=1 then ll=(log(p)+0*log(1-p))*P1+

(0*log(p)+1*log(1-p))*(1-P1);

else ll=(log(p)+0*log(1-p))*(1-P0)+

(0*log(p)+1*log(1-p))*P0;

model Prescibed ~ general(ll);

run;

Parameter Estimates
Parameter	Estimate	Standard Error	DF	t Value	Pr > \|t\|	Alpha	Lower	Upper	Gradient
b0	-1.8449	0.01555	37E3	-118.61	<.0001	0.05	-1.8753	-1.8144	-0.00004
b1	0.03795	0.001994	37E3	19.03	<.0001	0.05	0.03404	0.04186	-0.00089

The parameter estimates have changed significantly:

Another fun day at the office.

Thursday, 9 June 2011

Accounting for Annonimization noise in Binary Target variable (Nlmixed)

No comments:

Post a Comment

Subscribe To

Thursday, 9 June 2011

Accounting for Annonimization noise in Binary Target variable (Nlmixed)

No comments:

Post a Comment