Reliability and Item Analysis Introductory Overview - Designing a Reliable Scale

After the discussion so far, it should be clear that the more reliable a scale, the better (e.g., more valid) the scale. As mentioned earlier, one way to make a sum scale more valid is by adding items. Reliability and Item Analysis methods include options that allow you to compute how many items would have to be added in order to achieve a particular reliability, or how reliable the scale would be if a certain number of items were added. In practice, however, the number of items on a questionnaire is usually limited by various other factors (e.g., respondents get tired, overall space is limited, etc.). Let us return to our prejudice example and outline the steps one would generally follow in order to design the scale so that it will be reliable:
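Projections of this kind rest on the Spearman-Brown prophecy formula, which relates a scale's reliability to its length. The sketch below shows both directions of the computation; the function names and the numbers in the example are illustrative, not part of the module:

```python
def spearman_brown(reliability: float, k: float) -> float:
    # Projected reliability of a scale lengthened by factor k,
    # assuming the added items are comparable to the existing ones
    return k * reliability / (1 + (k - 1) * reliability)

def lengthening_factor(current: float, target: float) -> float:
    # Factor by which the scale must be lengthened to reach `target`
    return target * (1 - current) / (current * (1 - target))

# Example: a 10-item scale with reliability .70; items needed for .90
k = lengthening_factor(0.70, 0.90)
print(round(10 * k))  # → 39 items of comparable quality
```

Note how quickly the required length grows: pushing reliability from .70 to .90 almost quadruples the number of items, which is why the practical limits mentioned above (respondent fatigue, space) matter so much.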

Step 1: Generating items
The first step is to write the items. This is essentially a creative process where the researcher makes up as many items as possible that seem to relate to prejudices against foreign-made cars. In theory, one should "sample items" from the domain defined by the concept. In practice, for example in marketing research, focus groups are often utilized to illuminate as many aspects of the concept as possible. For example, we could ask a small group of highly committed American car buyers to express their general thoughts and feelings about foreign-made cars. In educational and psychological testing, one commonly looks at other similar questionnaires at this stage of the scale design, again, in order to gain as wide a perspective on the concept as possible.
Step 2: Choosing items of optimum difficulty
In the first draft of our prejudice questionnaire, we will include as many items as possible (note that the Reliability and Item Analysis module will handle up to 300 items in a single scale). We then administer this questionnaire to an initial sample of typical respondents, and examine the results for each item. First, we would look at various characteristics of the items, for example, in order to identify floor or ceiling effects. If all respondents agree or disagree with an item, then it obviously does not help us discriminate between respondents, and thus, it is useless for the design of a reliable scale. In test construction, the proportion of respondents who agree or disagree with an item, or who answer a test item correctly, is often referred to as the item difficulty. In essence, we would look at the item means and standard deviations and eliminate those items that show extreme means, and zero or nearly zero variances.
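This screening step amounts to inspecting each item's mean and standard deviation and discarding items that nearly everyone answers the same way. A minimal sketch, with a hypothetical data matrix and an illustrative cutoff:

```python
import numpy as np

def flag_uninformative_items(responses, min_sd=0.1):
    # responses: respondents x items matrix of ratings
    # Flag items whose standard deviation is (nearly) zero, i.e.
    # items that virtually all respondents answered the same way
    sds = responses.std(axis=0, ddof=1)
    return [i for i, sd in enumerate(sds) if sd < min_sd]

# Hypothetical data: 5 respondents x 3 items;
# every respondent gave the second item (index 1) a "5"
data = np.array([[4, 5, 2],
                 [3, 5, 4],
                 [5, 5, 1],
                 [4, 5, 3],
                 [2, 5, 5]])
print(flag_uninformative_items(data))  # → [1]
```

The flagged item shows the ceiling effect described above: with zero variance it cannot discriminate between respondents, so it contributes nothing to a reliable scale.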
Step 3: Choosing internally consistent items
Remember that a reliable scale is made up of items that proportionately measure mostly true score; in our example, we would like to select items that measure mostly prejudice against foreign-made cars, and as little as possible of the esoteric aspects we consider random error. To do so, we would look at the following spreadsheet:
STATISTICA RELIABL. ANALYSIS

Summary for scale: Mean=46.1100 Std.Dv.=8.26444 Valid n: 100
Cronbach alpha: .794313   Standardized alpha: .800491
Average inter-item corr.: .297818

Variable   Mean if    Var. if    StDv. if   Itm-Totl   Squared    Alpha if
           deleted    deleted    deleted    Correl.    Multp. R   deleted
ITEM1      41.61000   51.93790   7.206795   .656298    .507160    .752243
ITEM2      41.37000   53.79310   7.334378   .666111    .533015    .754692
ITEM3      41.41000   54.86190   7.406882   .549226    .363895    .766778
ITEM4      41.63000   56.57310   7.521509   .470852    .305573    .776015
ITEM5      41.52000   64.16961   8.010593   .054609    .057399    .824907
ITEM6      41.56000   62.68640   7.917474   .118561    .045653    .817907
ITEM7      41.46000   54.02840   7.350401   .587637    .443563    .762033
ITEM8      41.33000   53.32110   7.302130   .609204    .446298    .758992
ITEM9      41.44000   55.06640   7.420674   .502529    .328149    .772013
ITEM10     41.66000   53.78440   7.333785   .572875    .410561    .763314

Shown above are the results for 10 items, which are discussed in greater detail in the Examples. Of most interest to us are the three right-most columns in this spreadsheet. They show the correlation between the respective item and the total sum score (without the respective item), the squared multiple correlation between the respective item and all other items, and the internal consistency of the scale (coefficient Alpha) if the respective item were deleted. Clearly, items 5 and 6 "stick out," in that they are not consistent with the rest of the scale. Their correlations with the sum scale are .05 and .12, respectively, while all other items correlate at .45 or better. In the right-most column, we can see that the reliability of the scale would be about .82 if either of the two items were deleted. Thus, we would probably delete these two items from the scale.
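Under the classical definition of coefficient Alpha, the item-total correlation and "Alpha if deleted" statistics can be computed directly from a respondents-by-items data matrix. A minimal sketch of those two columns (not the module's own implementation):

```python
import numpy as np

def cronbach_alpha(X):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def item_total_stats(X):
    # For each item: its correlation with the sum of the OTHER items
    # ("Itm-Totl Correl.") and alpha with the item removed ("Alpha if deleted")
    stats = []
    for i in range(X.shape[1]):
        rest = np.delete(X, i, axis=1)
        r = np.corrcoef(X[:, i], rest.sum(axis=1))[0, 1]
        stats.append((r, cronbach_alpha(rest)))
    return stats
```

An item with a low item-total correlation and an "Alpha if deleted" above the scale's overall Alpha is exactly the pattern that items 5 and 6 show in the spreadsheet above.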

Step 4: Returning to Step 1
After deleting all items that are not consistent with the scale, we may not be left with enough items to make up an overall reliable scale (remember that the fewer items, the less reliable the scale). In practice, one often goes through several rounds of generating and eliminating items, until one arrives at a final set that makes up a reliable scale.
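The elimination half of these rounds can be sketched as a greedy loop that repeatedly drops whichever remaining item most improves Alpha and stops when no deletion helps. The stopping rule, the floor on the item count, and the data below are illustrative assumptions, not the module's procedure:

```python
import numpy as np

def cronbach_alpha(X):
    k = X.shape[1]
    return k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum()
                          / X.sum(axis=1).var(ddof=1))

def prune_items(X, min_items=3):
    # Greedily drop the item whose removal most improves alpha;
    # stop when no removal helps or only min_items remain
    keep = list(range(X.shape[1]))
    while len(keep) > min_items:
        current = cronbach_alpha(X[:, keep])
        best_alpha, worst = max(
            (cronbach_alpha(X[:, [j for j in keep if j != i]]), i)
            for i in keep)
        if best_alpha <= current:
            break
        keep.remove(worst)
    return keep

# Hypothetical data: four consistent items plus one inconsistent item
base = np.array([1., 2., 3., 4., 5., 6.])
noise = np.array([6., 1., 5., 2., 4., 3.])
X = np.column_stack([base, base, base, base, noise])
print(prune_items(X))  # → [0, 1, 2, 3]
```

Note that such automated pruning only covers the "eliminating items" half of the cycle; the "generating items" half (Step 1) still requires returning to the item-writing stage.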