### Wednesday, October 18, 2006

## A statistical tidbit

There are estimates that smoking kills x many people per year. There are estimates that so-and-so many people in the US are overweight or obese. If you flip a coin a million times, and get approximately 50% heads, the coin is assumed to be fair.

How is each of these done?

Well, in each case, you take a random selection of a population.

Even with the coin flipping? Yes. That million flips of the coin is a small subset of the number of coin flips you could conceivably make.
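To see the coin example in action, here's a minimal sketch that simulates flipping a fair coin and reports the fraction of heads for samples of different sizes. The function name `fraction_heads` and the fixed seed are my own choices for illustration, and the simulated coin is fair by construction.

```python
import random


def fraction_heads(num_flips, seed=0):
    """Simulate num_flips tosses of a fair coin and return the share of heads."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips


# Each run is just a sample from the endless supply of flips you could make.
for n in (100, 10_000, 1_000_000):
    print(n, fraction_heads(n))
```

The bigger the sample, the closer the fraction tends to sit to 50%, even though no sample ever covers every flip you could conceivably make.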

With the obesity issue, scientists attempt to gather a perfectly random selection of people and find out whether each one is overweight or obese. If a perfectly random selection of people shows that X% of them are overweight or obese, it's reasonable to assume that the percentage is about the same for the whole population the sample was drawn from, so long as the sample size is large enough.

(Using statistics, it's possible to determine how big the sample size must be for you to be reasonably confident that you have a good picture of the rate of being overweight. The number needed is surprisingly small.)
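For the curious, here's a rough sketch of one common way to get that number: the textbook sample-size formula for estimating a proportion, n = z² · p(1 − p) / e². It assumes a simple random sample, a large population, and the usual normal approximation. The default `z=1.96` (about 95% confidence) and the worst-case guess `p=0.5` are my assumptions, not figures from any particular study.

```python
import math


def sample_size(margin_of_error, z=1.96, p=0.5):
    """Approximate sample size needed to estimate a proportion.

    margin_of_error: half-width of the interval, e.g. 0.03 for +/- 3 points
    z: z-score for the confidence level (1.96 is roughly 95%; an assumption)
    p: guessed true proportion (0.5 is the worst case; an assumption)
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


print(sample_size(0.05))  # within 5 percentage points
print(sample_size(0.03))  # within 3 percentage points
```

Under those assumptions, roughly a thousand people gets you within about three percentage points, and the size of the full population doesn't enter the formula at all.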

With smoking, you have an interesting issue. You try to look at two similar groups of people, one of which smokes while the other doesn't. The difference between the death rates of the two groups shows something that's become a dirty word (well, "phrase") for some: the "excess deaths" caused by smoking. And when it comes to estimating the number of excess deaths due to smoking in the US, no one (except, perhaps, the tobacco companies, and maybe Senators and Representatives from tobacco-growing states) claims that it's impossible to use a relatively small number of deaths to estimate a much larger number of deaths in the entire population.
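The arithmetic behind "excess deaths" can be sketched in a few lines. Every number below is made up purely for illustration; none of it is real data, and the function name `excess_deaths` is hypothetical. The sketch also assumes the two sampled groups are alike in every way except smoking, which is the hard part in practice.

```python
def excess_deaths(deaths_exposed, n_exposed,
                  deaths_unexposed, n_unexposed,
                  exposed_in_population):
    """Scale the death-rate gap between two samples up to a whole population."""
    rate_exposed = deaths_exposed / n_exposed        # e.g. smokers in the sample
    rate_unexposed = deaths_unexposed / n_unexposed  # similar non-smokers
    return (rate_exposed - rate_unexposed) * exposed_in_population


# Hypothetical figures: 120 deaths per 10,000 sampled smokers, 80 deaths per
# 10,000 sampled non-smokers, and 1,000,000 smokers in the whole population.
estimate = excess_deaths(120, 10_000, 80, 10_000, 1_000_000)
print(round(estimate))
```

With these invented figures, a gap of 40 deaths per 10,000 in the sample scales up to about 4,000 excess deaths among a million smokers.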

One might notice that looking at the same population both before and after a war has started is very similar to this method. First you're looking at people who are not in a war zone, and then you're looking at the same people where the only thing that has changed is that they are in a war zone.

In a post in the near future, I'm going to try to distill the basics of probability and statistics to explain why these methods work.
