Power and Sample Size in a Nutshell


Many analytics problems are set up to compare one hypothesis versus another, maybe something like:

The average value is within a certain range
versus
The average value is not within that range

But if you talk to the symbol-happy folk, they'll show it to you more like this:

$H_0: \theta \in \Theta_0$
$H_1: \theta \notin \Theta_0$

where $\theta$ is the parameter we wish to test, $\Theta$ is the set of values $\theta$ may take, and $\Theta_0$ is some particular subset of $\Theta$. $H_0$ is called the null hypothesis, and $H_1$ is called the alternative hypothesis.

For example, suppose we need to decide whether we should apply something called foo to a group of people. This foo may be a medical treatment applied to patients, a marketing tactic applied to customers, or whatever is relevant in your world. Let's represent the average, or mean, response as $\theta$. We know from past experience -- before we had foo -- that the average was some number $\theta_0$, and we want to know whether the foo people have a different average. So, we'll randomly select some people, apply foo to them, record their responses, and then use some analytical method to test which of the following hypotheses is more plausible given our data:

$H_0: \theta = \theta_0$
$H_1: \theta \neq \theta_0$

The question this site is dedicated to is: how many people do we need in our sample? To answer that, we need to know about two things called Type I error and Type II error.

A Type I error happens when we wrongly conclude that the null hypothesis $H_0$ is false when it is actually true.

A Type II error happens when we wrongly conclude that the null hypothesis $H_0$ is true when it is actually false.

Maybe this table helps:

|                            | $H_0$ is actually true | $H_0$ is actually false |
|----------------------------|------------------------|-------------------------|
| We conclude $H_0$ is true  | Correct conclusion     | Type II error           |
| We conclude $H_0$ is false | Type I error           | Correct conclusion      |

Ideally, we'd like the probability of committing each type of error to be zero. In reality, we're not so lucky.

It's a long tradition to represent the probability of committing a Type I error as $\alpha$, and the probability of committing a Type II error as $\beta$. The power of an analytics or statistical procedure is its probability of showing that the null hypothesis $H_0$ is false when it actually is false. Since $\beta$ is the probability of concluding that $H_0$ is true when it's actually false, the power is $1-\beta$: the probability of concluding that $H_0$ is false when it actually is false (the two probabilities sum to 1).

Okay, that's a mouthful. It's one of those things that's hard to explain clearly, but really is a pretty straightforward concept.
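If a concrete demonstration helps, here's a minimal simulation sketch in Python. All the numbers are made up for illustration: a historical mean $\theta_0 = 100$, a known standard deviation of 15, and a two-sided z-test on 25 people at $\alpha = 0.05$. When the true mean really is $\theta_0$, the fraction of simulated studies that reject $H_0$ estimates the Type I error rate; when the true mean is something else, that fraction estimates the power.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=42)

# Made-up numbers for illustration: historical mean theta_0 = 100,
# known standard deviation sigma = 15, sample size n = 25, alpha = 0.05.
theta_0, sigma, n, alpha = 100.0, 15.0, 25, 0.05
z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value, about 1.96

def rejection_rate(true_mean, trials=100_000):
    """Fraction of simulated studies in which the z-test rejects H0."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = (samples.mean(axis=1) - theta_0) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

print(rejection_rate(theta_0))  # H0 true: rejection rate ~ alpha = 0.05
print(rejection_rate(110.0))    # H0 false: rejection rate ~ power = 1 - beta
```

The first rate should land near 0.05, and the second is the simulated power, $1-\beta$, against that particular alternative.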

To answer the question of how many people need to be in our sample, we note that sample size is tied to our desired Type I and Type II error rates. The Type I error rate is usually set by industry standard; in scientific experiments, for example, it's almost always set at 5%, or $\alpha = 0.05$. Rather than focusing on the Type II error rate, $\beta$, it's much more common to set our sample size based on power, $1-\beta$. In simple terms, a small sample gives us little power to reject the null hypothesis, whereas a large sample gives us more statistical power. The level of power we'll be comfortable with is also usually an industry standard. We definitely want power larger than 50%; we probably wouldn't even conduct the study if we had less than a 50-50 chance of getting it right. The usual target is 80% power, though 90% is not uncommon in some areas. So, once we know what values of $\alpha$ and power we're comfortable with, we can calculate the minimum sample size needed to achieve them. We'll also need to know other things, like the value of $\theta_0$, but those details are left for other articles.
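For a flavor of how that calculation works, here's a minimal sketch in Python for the simplest textbook case: a two-sided one-sample z-test with a known standard deviation. The function and its numbers ($\sigma = 15$, a detectable shift of $\delta = 5$ units away from $\theta_0$) are our own illustration, not a prescribed method.

```python
from math import ceil
from scipy.stats import norm

def min_sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Minimum n for a two-sided one-sample z-test to detect a shift of
    size delta away from theta_0, when responses have std dev sigma."""
    z_alpha = norm.ppf(1 - alpha / 2)  # e.g. 1.96 when alpha = 0.05
    z_power = norm.ppf(power)          # e.g. 0.84 when power = 0.80
    n = ((z_alpha + z_power) * sigma / delta) ** 2
    return ceil(n)

# Hypothetical numbers: detect a shift of 5 units when sigma = 15.
print(min_sample_size(delta=5, sigma=15))              # 71 people at 80% power
print(min_sample_size(delta=5, sigma=15, power=0.90))  # 95 people at 90% power
```

Notice that bumping the power from 80% to 90% costs roughly a third more people; that trade-off comes up again below.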

Sometimes we'll know in advance what power we want. For example, in our foo medical experiment we might be required by regulatory standards to conduct the study only if 90% power can be achieved. In that case, we'll determine what sample size is needed to achieve that power level, and then see if our funds are sufficient to collect data on that number of people. If funds are insufficient, then other design parameters will have to be adjusted, such as the smallest departure from $\theta_0$ that we aim to detect.

There are other scenarios where the power level is not so set in stone. For example, when testing our foo marketing tactic, we might be more inclined to contrast the power associated with a small (and cheaper) sample size with that of a large (and more expensive) sample size. We'll then decide whether the extra power is worth the extra cost.
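As a rough sketch of that comparison, reusing the hypothetical z-test numbers from above, we can tabulate the approximate power at a few candidate sample sizes and weigh each against its cost:

```python
from math import sqrt
from scipy.stats import norm

def power_at_n(n, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test at sample size n
    (ignores the negligible chance of rejecting in the wrong tail)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta * sqrt(n) / sigma - z_alpha)

# Hypothetical numbers again: shift of 5 units, sigma = 15.
for n in (25, 50, 100, 200):
    print(f"n = {n:3d}  power = {power_at_n(n, delta=5, sigma=15):.2f}")
```

The diminishing returns are typical: going from 25 to 100 people buys a lot of power (roughly 0.38 to 0.92 here), while going from 100 to 200 buys comparatively little.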

Our hope is that this site provides you with a set of simple, easy-to-use tools to help you make power and sample size decisions. We want to make it easy to tweak your error rates and input parameters so that you can quickly find the study design that's best for your situation. Please feel free to contact us with questions or for assistance.