Power Analysis and Sample Size Calculation in Experimental Design - Sampling Theory and Hypothesis Testing Logic
In most situations in statistical analysis, we do not have access to an entire statistical population of interest, either because the population is too large, its members are unwilling to be measured, or the measurement process is too expensive or time-consuming to allow more than a small segment of the population to be observed. As a result, we often make important decisions about a statistical population on the basis of a relatively small amount of sample data. Typically, we take a sample and compute a quantity called a statistic in order to estimate some characteristic of a population called a parameter.
For example, suppose a politician is interested in the proportion of people who currently favor her position on a particular issue. Her constituency is a large city with a population of about 1,500,000 potential voters. In this case, the parameter of interest, which we might call P, is the proportion of people in the entire population who favor the politician's position. The politician is going to commission an opinion poll, in which a (hopefully) random sample of people will be asked whether or not they favor her position. The number (N) of people to be polled will be quite small, relative to the size of the population. Once these people have been polled, the proportion of them favoring the politician's position will be computed. This proportion, which is a statistic, can be called p.
One thing is virtually certain before the study is ever performed: The population proportion (P) will not be equal to the sample proportion (p). Because the sample proportion (p) involves "the luck of the draw," it will deviate from the population proportion (P). The amount by which the sample proportion (p) is wrong, i.e., the amount by which it deviates from the population proportion (P), is called sampling error.
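Sampling error is easy to see in a small simulation. The Python sketch below is not part of the Power Analysis module; the true proportion P = 0.52 and the poll size N = 100 are purely illustrative assumptions. It draws one random poll from a hypothetical population and compares the resulting statistic p with the parameter P:

```python
import numpy as np

rng = np.random.default_rng(0)

P = 0.52   # assumed true population proportion (unknown in a real poll)
N = 100    # number of people polled

# One simulated poll: each respondent favors the position with probability P
responses = rng.binomial(1, P, size=N)
p = responses.mean()          # the sample proportion (the statistic)

print(f"parameter P          = {P}")
print(f"statistic p          = {p:.3f}")
print(f"sampling error p - P = {p - P:+.3f}")
```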
In any one sample, it is virtually certain that there will be some sampling error (except in some highly unusual circumstances), and we will never know exactly how large this error is. If we knew the amount of the sampling error, this would imply that we also knew the exact value of the parameter, in which case we would not need to be doing the opinion poll in the first place.
In general, the larger the sample size N, the smaller the sampling error tends to be. (One can never be sure what will happen in a particular experiment, of course.) If we are to make accurate decisions about a parameter like P, we need an N large enough that sampling error will tend to be "reasonably small." If N is too small, there is not much point in gathering the data, because the results will tend to be too imprecise to be of much use.
On the other hand, there is also a point of diminishing returns beyond which increasing N provides little benefit. Once N is "large enough" to produce a reasonable level of accuracy, making it larger simply wastes time and money.
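Both points follow from the behavior of the standard error of the sample proportion, which shrinks only in proportion to the square root of N. The short sketch below (the population proportion of 0.52 is again an illustrative assumption) shows the diminishing returns: each time N is quadrupled, the standard error is merely halved:

```python
import numpy as np

P = 0.52  # assumed population proportion, for illustration only

# Standard error of the sample proportion: sqrt(P * (1 - P) / N)
for N in (25, 100, 400, 1600, 6400):
    se = np.sqrt(P * (1 - P) / N)
    print(f"N = {N:5d}   standard error of p = {se:.4f}")
```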
So some key decisions in planning any experiment are, "How precise will my parameter estimates tend to be if I select a particular sample size?" and "How big a sample do I need to attain a desirable level of precision?"
The purpose of the Power Analysis module is to provide you with the statistical methods to answer these questions quickly, easily, and accurately. The module provides simple dialogs for performing power calculations and sample size estimation for many of the classic statistical procedures, and it also provides special noncentral distribution routines to allow the advanced user to perform a variety of additional calculations.
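As a rough illustration of the kind of calculation the module automates, the sketch below solves for N in the politician's one-sided test of H0: P = .50 using a textbook normal approximation rather than the module's own routines. The assumed true proportion (.55), the α level (.05), and the target power (.80) are arbitrary choices made for the example:

```python
import numpy as np
from scipy.stats import norm

p0, p1 = 0.50, 0.55        # null value and assumed true proportion
alpha, power = 0.05, 0.80  # one-sided alpha and desired power

z_alpha = norm.ppf(1 - alpha)   # critical z for the one-sided test
z_beta = norm.ppf(power)        # z corresponding to the target power

# Normal-approximation sample size for a one-sample, one-sided test of a proportion
numerator = z_alpha * np.sqrt(p0 * (1 - p0)) + z_beta * np.sqrt(p1 * (1 - p1))
N = (numerator / (p1 - p0)) ** 2

print(f"required sample size N ≈ {int(np.ceil(N))}")
```

With these particular inputs the approximation calls for a poll of a little over 600 respondents; a smaller assumed true proportion or a higher power target drives N up quickly.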
Suppose that the politician is interested in showing that a majority of people support her position. Her question, in statistical terms, is: "Is P > .50?" Being an optimist, she believes that it is.
In statistics, the following strategy is quite common. State as a "statistical null hypothesis" something that is the logical opposite of what you believe. Call this hypothesis H0. Gather data. Then, using statistical theory, show from the data that it is likely H0 is false, and should be rejected.
By rejecting H0, you support what you actually believe. This kind of situation, which is typical in many fields of research, is called "Reject-Support" (RS) testing, because rejecting the null hypothesis supports the experimenter's theory.
The null hypothesis is either true or false, and the statistical decision process is set up so that there are no "ties": the null hypothesis is either rejected or not rejected. Consequently, before undertaking the experiment, we can be certain that only four possible things can happen. These are summarized in the table below.
| Decision | State of the World: H0 true | State of the World: H1 true |
|----------|-----------------------------|-----------------------------|
| Accept H0 | Correct acceptance | Type II error (β) |
| Reject H0 | Type I error (α) | Correct rejection |
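The two error probabilities can be estimated directly by simulation. The sketch below is illustrative only; the poll size of 200, the α of .05, and the alternative proportion of .58 are assumptions. It repeats the politician's one-sided z test many times, first with H0 true and then with H0 false, and reports how often H0 is rejected in each case:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, alpha, reps = 200, 0.05, 10_000
p0 = 0.50
z_crit = norm.ppf(1 - alpha)      # one-sided critical value

def rejection_rate(true_P):
    """Fraction of simulated polls in which H0: P = .50 is rejected."""
    rejections = 0
    for _ in range(reps):
        p_hat = rng.binomial(N, true_P) / N
        z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / N)
        rejections += z > z_crit
    return rejections / reps

# When H0 is true, rejections are Type I errors; their rate estimates alpha.
print(f"P = .50 (H0 true):  rejection rate ≈ {rejection_rate(0.50):.3f}")
# When H0 is false, non-rejections are Type II errors; the rejection rate estimates power (1 - beta).
print(f"P = .58 (H0 false): rejection rate ≈ {rejection_rate(0.58):.3f}")
```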
The conventions are, of course, much more rigid with respect to α than with respect to β. For example, in the social sciences seldom, if ever, is α allowed to stray above the magical .05 mark. Let's review where that tradition came from.
In the context of significance testing, we can define two basic kinds of situations, reject-support (RS) (discussed above) and accept-support (AS). In RS testing, the null hypothesis is the opposite of what the researcher actually believes, and rejecting it supports the researcher's theory. In a two group RS experiment involving comparison of the means of an experimental and control group, the experimenter believes the treatment has an effect, and seeks to confirm it through a significance test that rejects the null hypothesis.
In the RS situation, a Type I error represents, in a sense, a "false positive" for the researcher's theory. From society's standpoint, such false positives are particularly undesirable. They result in much wasted effort, especially when the false positive is interesting from a theoretical or political standpoint (or both), and as a result stimulates a substantial amount of research. Such follow-up research will usually not replicate the (incorrect) original work, and much confusion and frustration will result.
In RS testing, a Type II error is a tragedy from the researcher's standpoint, because a theory that is true is, by mistake, not confirmed. So, for example, if a drug designed to improve a medical condition is found (incorrectly) not to produce an improvement relative to a control group, a worthwhile therapy will be lost, at least temporarily, and an experimenter's worthwhile idea will be discounted.
As a consequence, in RS testing, society, in the person of journal editors and reviewers, insists on keeping α low. The statistically well-informed researcher makes it a top priority to keep β low as well. Ultimately, of course, everyone benefits if both error probabilities are kept low, but unfortunately there is often, in practice, a trade-off between the two types of error.
The RS situation is by far the more common one, and the conventions relevant to it have come to dominate popular views on statistical testing. As a result, the prevailing views on error rates are that relaxing α beyond a certain level is unthinkable, and that it is up to the researcher to make sure statistical power is adequate. One might argue how appropriate these views are in the context of RS testing, but they are not altogether unreasonable.
In AS testing, the common view on error rates we described above is clearly inappropriate. In AS testing, H0 is what the researcher actually believes, so accepting it supports the researcher's theory. In this case, a Type I error is a false negative for the researcher's theory, and a Type II error constitutes a false positive. Consequently, acting in a way that might be construed as highly virtuous in the RS situation, for example, maintaining a very low Type I error rate like .001, is actually "stacking the deck" in favor of the researcher's theory in AS testing.
In both AS and RS situations, it is easy to find examples where significance testing seems strained and unrealistic. Consider first the RS situation. In some such situations, it is simply not possible to have very large samples. An example that comes to mind is social or clinical psychological field research. Researchers in these fields sometimes spend several days interviewing a single subject. A year's research may only yield valid data from 50 subjects. Correlational tests, in particular, have very low power when samples are that small. In such a case, it probably makes sense to relax α beyond .05, if it means that reasonable power can be achieved.
On the other hand, it is possible, in an important sense, to have power that is too high. For example, one might be testing the hypothesis that two population means are equal (i.e., μ1 = μ2) with sample sizes of a million in each group. In this case, even with trivial differences between groups, the null hypothesis would virtually always be rejected.
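A quick simulation makes the point concrete. In the sketch below, the group means, the common standard deviation, and the difference of one hundredth of a standard deviation are all assumed purely for illustration; a difference no one would care about in practice comes out as overwhelmingly "significant" simply because N is enormous:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
N = 1_000_000   # one million observations per group

# Two populations whose means differ by only 0.01 standard deviations
group1 = rng.normal(loc=100.0, scale=10.0, size=N)
group2 = rng.normal(loc=100.1, scale=10.0, size=N)

t_stat, p_value = ttest_ind(group1, group2)
print(f"observed mean difference = {group2.mean() - group1.mean():.3f}")
print(f"two-sided p-value        = {p_value:.2e}")   # trivially small effect, yet "highly significant"
```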
The situation becomes even more unnatural in AS testing. Here, if N is too high, the researcher almost inevitably decides against the theory, even when it turns out, in an important sense, to be an excellent approximation to the data. It seems paradoxical indeed that in this context experimental precision seems to work against the researcher.
To summarize, in Reject-Support research:
- The researcher wants to reject H0.
- Society wants to control Type I error.
- The researcher must be very concerned about Type II error.
- High sample size works for the researcher.
- If there is "too much power," trivial effects become "highly significant."
In Accept-Support research:
- The researcher wants to accept H0.
- "Society" should be worrying about controlling Type II error, although it sometimes gets confused and retains the conventions applicable to RS testing.
- The researcher must be very careful to control Type I error.
- High sample size works against the researcher.
- If there is "too much power," the researcher's theory can be "rejected" by a significance test even though it fits the data almost perfectly.