On Wed, Aug 31, 2011 at 17:07, DG wrote:
I have a stats-related problem I was hoping to get your advice on. Another group here at work has been given a pretty odd data set.
We have data from many hospitals in Canada. For each hospital, the total Market size (in sales) is given for a group of products, and also the total sales of the product in question. Therefore, we know the market share of the product in each hospital.
What we need to determine is a confidence interval for all of Canada, based on the mean Market share. The problem I am facing has three parts: 1) Of the over 100 hospitals in the data set, only 29 have sales information, so really we have only 29 samples. 2) Each hospital is different in terms of sales: some sell hardly anything, some sell quite a lot. 3) If you were to look at the market share distribution, it is basically U-shaped: there is a group where the market share is 0%, a group where it is 100%, and literally 5 or so in between.
My question is, how would you approach this distribution? A colleague of mine simply took the average and standard deviation of the 29 Market share %'s and computed a Confidence Interval using t-scores. I was adamant that you can't do this since we know for a fact the distribution isn't normal at all and that the hospitals don't equally contribute to the market size.
I just wanted to ask if you've ever seen a problem like this and if you had any suggestions of how to approach it.
Thanks so much for your help. I know you're probably busy so any input is appreciated!
David
Reply:
Interesting question. I think the way to approach the problem is to ask why a confidence interval (CI) is being calculated, and then ask what it means.
As for the second question, what a CI means: let's say you have many, many hospitals in Canada, with data. You take a small sample of data (e.g. 29 points) and calculate the mean. Then you take another 29 samples and calculate the mean. Repeat this many times. The means you calculate will be different each time, and so will the CI built around each one. A 95% CI says that 19 times out of 20, the CI calculated from a sample of 29 points will contain the true mean of all the hospitals.
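The repeated-sampling idea above can be sketched as a small simulation. Everything in it is an assumption made for illustration: the U-shaped population (roughly 45% of hospitals at 0%, 45% at 100%, the rest in between), the population size of 5000, the 2000 repeats, and the hard-coded t critical value of 2.048 for 28 degrees of freedom. It is not the real hospital data.

```python
import random
import statistics

T_CRIT_DF28 = 2.048  # two-sided 95% t critical value for 28 degrees of freedom


def make_population(size, rng):
    """Hypothetical U-shaped market shares: mostly 0 or 1, a few in between."""
    shares = []
    for _ in range(size):
        u = rng.random()
        if u < 0.45:
            shares.append(0.0)
        elif u < 0.90:
            shares.append(1.0)
        else:
            shares.append(rng.random())
    return shares


def t_interval(sample, t_crit=T_CRIT_DF28):
    """Mean +/- t * s / sqrt(n); t_crit must match len(sample) - 1."""
    n = len(sample)
    mean = statistics.fmean(sample)
    half_width = t_crit * statistics.stdev(sample) / n ** 0.5
    return mean - half_width, mean + half_width


def simulate(repeats=2000, n=29, seed=0):
    """Draw many samples of n hospitals and count how often the CI covers the truth."""
    rng = random.Random(seed)
    population = make_population(5000, rng)
    true_mean = statistics.fmean(population)
    means, hits = [], 0
    for _ in range(repeats):
        sample = rng.sample(population, n)
        means.append(statistics.fmean(sample))
        low, high = t_interval(sample)
        hits += low <= true_mean <= high
    return true_mean, means, hits / repeats


true_mean, means, coverage = simulate()
print(f"population mean: {true_mean:.3f}")
print(f"sample means range from {min(means):.3f} to {max(means):.3f}")
print(f"fraction of intervals containing the population mean: {coverage:.3f}")
```

The spread of the sample means shows how much a single sample of 29 can move around, and the last line reports how often the interval built from a sample actually contains the population mean.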
You don't need to make any distribution assumption ... yet. You only need to assume a normal distribution when you use the t-values to calculate the CI's low and high values.
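For reference, the usual t-based interval for a mean can be written as below. The symbols are notation introduced here for illustration, not taken from the data set: the sample mean is written as x-bar, s is the sample standard deviation, and n is the sample size. The critical value shown is the standard two-sided 95% value for n = 29.

```latex
\bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad n = 29 \;\Rightarrow\; t_{0.975,\,28} \approx 2.048
```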
In your case you only have 29 points. If you calculate a CI from them using t-values, you're assuming a normal distribution for the values you calculate the CI on (e.g. market share, sales, whatever). But more importantly, what is the interpretation of it?
Let's say it is 45% +/- 36%. Is this helpful to anyone, especially since you know what the underlying distribution looks like? Not really. A much more useful confidence interval in your case is 50% +/- 50%, because that is the range (0% to 100%) within which you expect to find the mean. Except if you told this to someone they'd think you don't know what you are doing.
One use for CIs is to judge whether some new value is significantly bigger or smaller. Again, in this case, you can't do that with your CI. What you might do is calculate a CI for the low values and another CI for the high values, to tell whether a new hospital is better than the low group or worse than the high group.
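A minimal sketch of the two-group idea follows. The 29 shares are invented to mimic the U shape described in the question, the split into a low and a high group is an assumption, and the t critical values are hard-coded for just these two group sizes (14 and 15). The helper names are hypothetical.

```python
import statistics

# Two-sided 95% t critical values, keyed by degrees of freedom.
# Only the two values needed for the group sizes below are listed.
T_CRIT_95 = {13: 2.160, 14: 2.145}

# Hypothetical market shares: 29 hospitals, 5 of them strictly between 0 and 1.
low_group = [0.0] * 11 + [0.05, 0.20, 0.35]   # 14 hospitals
high_group = [1.0] * 13 + [0.65, 0.80]        # 15 hospitals


def t_interval(sample):
    """Return (low, high) for mean +/- t * s / sqrt(n) at the 95% level."""
    n = len(sample)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no t critical value stored for df={df}")
    mean = statistics.fmean(sample)
    half_width = T_CRIT_95[df] * statistics.stdev(sample) / n ** 0.5
    return mean - half_width, mean + half_width


def compare_new_hospital(share, low_ci, high_ci):
    """Report whether a new share sits above the low CI and below the high CI."""
    above_low = share > low_ci[1]
    below_high = share < high_ci[0]
    return above_low, below_high


low_ci = t_interval(low_group)
high_ci = t_interval(high_group)
print(f"low group CI:  {low_ci[0]:.3f} to {low_ci[1]:.3f}")
print(f"high group CI: {high_ci[0]:.3f} to {high_ci[1]:.3f}")
print("new hospital at 50%:", compare_new_hospital(0.50, low_ci, high_ci))
```

Note that a t-based interval on data like this can run below 0% or above 100%, which is another sign the normal assumption fits poorly here.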
Anyway, those are just some thoughts - hope they help. My main point is that the CI is probably being used to fill in some report, and is not being used in the way a CI should be used.