MATLAB: Confidence interval for a positive-valued distribution with a large proportion of zeros

confidence interval, tobit distribution, wald test

Hi, here is my situation: I have a sample of 60 bills drawn from a population of 2500. In that sample, 20 have an error, i.e. they were considered refundable when they were not, and 40 are OK. The bills have different amounts. I need to estimate, with a confidence interval, how much money should be reimbursed to correct for the errors that have been made.
I thought of using the Wald test to build a confidence interval around the probability of error estimated from my sample (1/3). However, I also want to take into account the variability of the amount when there is an error. I also thought of using the Tobit model, because I read that such a model can be used for a variable that is 0 for a nontrivial fraction of the population and roughly continuously distributed over positive values. But since I don't really have an independent variable to put in the model (I only have the distribution of amounts), I don't know what to do.
Thanks for your help!
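
For concreteness, the Wald interval mentioned above can be worked out for the sample error rate alone. The sketch below is in Python and assumes a 95% level (the post does not state one). It ignores the finite-population correction and the variability of the amounts, so it only bounds the number of erroneous bills, not the refund total. The names `wald_interval`, `n_sample`, `n_errors` and `population` are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, level=0.95):
    """Wald (normal-approximation) interval for a binomial proportion."""
    p_hat = successes / n
    # Two-sided critical value of the standard normal for the given level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


n_sample, n_errors, population = 60, 20, 2500  # figures from the question
low, high = wald_interval(n_errors, n_sample)  # 95% level is an assumption

print(f"error rate: {low:.3f} to {high:.3f}")
print(f"erroneous bills in population: "
      f"{low * population:.0f} to {high * population:.0f}")
```

Turning the count of erroneous bills into a refund amount still requires a model for the amounts, which is the open part of the question.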

Best Answer

  • This is going to depend on whether the errors are correlated or not.
    If they are not correlated, then it would seem appropriate to use a Poisson distribution. I have some partial ideas about how one might proceed along this line, but as I am not strong in stats, I am sure there are better ways than I could come up with.
    If, though, as many as 1/3 of the bills had errors, it would seem more plausible to me that the errors are correlated, at least partially. For example, one of the billing agents might tend to make mistakes, or the billing recording mechanism was half fried for a few days. Errors from the same source tend to be similar.
    You should be able to create tables of billing amounts, proper billing amount, dollar value of undercharge, relative percentage of undercharge, and perhaps some similar measures, and you should be able to use a technique such as PCA or clustering to decide what measures the errors are most correlated with. That should help you to more accurately estimate the total billing error.
    There is a natural danger you should watch out for: consumption often does not follow a normal distribution. A topical example is the disputes in Canada and the USA over Internet billing, where it is claimed that the top 2-3% of users use more than half of the capacity at peak times. Depending on the nature of what is being consumed, it might be important to estimate the probability that the peak usages in the sample of 60 are representative of the peak usages in the full 2500. If we estimate that (say) 2.5% are in a "high usage" category, then over 60 bills the expected number in that category would be only 1.5 bills -- not enough for a statistical estimate of how much the high-consumption locations would use. If 1 to 3 of the 60 bills appear to be outliers in billing amount, then they might simply be outliers, or they might be hinting at a multi-modal or heavy-tailed distribution that you need to know about.
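
The "1.5 bills" figure above can be checked, and extended to the chance of drawing few or no high-usage bills, with a short Python sketch. It assumes the 2.5% share used as an example in the text and treats the 60 bills as independent draws (a binomial approximation; a sample drawn without replacement from 2500 would strictly be hypergeometric). The names `binom_pmf`, `p_high` and `n_sample` are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_sample = 60   # bills in the sample
p_high = 0.025  # assumed share of "high usage" bills (example value from the text)

expected_high = n_sample * p_high            # mean number of high-usage bills
p_none = binom_pmf(0, n_sample, p_high)      # sample contains none at all
p_at_most_one = p_none + binom_pmf(1, n_sample, p_high)

print(f"expected high-usage bills in sample: {expected_high:.2f}")
print(f"P(no high-usage bill in sample):     {p_none:.3f}")
print(f"P(at most one high-usage bill):      {p_at_most_one:.3f}")
```

Under these assumptions, a bit over one sample in five would contain no high-usage bill at all, which is why one to three apparent outliers in a sample of 60 say very little about the tail of the full population.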