
In SPSS, select Analyze, Scale, Reliability Analysis; list your variables; click Statistics; select Item, Scale, Scale if Item Deleted; select Split-half from the Model dropdown list. OK. SPSS will take the first half of the items, as listed in the dialog box, as the first split form, and the second half as the second split form. If there is an odd number of items, the first form will be one item longer than the second form. You can also use the Paste button to call up the Syntax window and alter the /MODEL=SPLIT parameter to be /MODEL=SPLIT(n), where n is the number of items in the second form.
The Pearson correlation of split forms estimates the half-test reliability of an instrument or scale. The Spearman-Brown "prophecy formula" predicts what the full-test reliability would be, based on half-test correlations. This coefficient will be higher than the half-test reliability coefficient. It is usually equal to, and easily hand-calculated as, twice the half-test correlation divided by the quantity 1 plus the half-test correlation. In SPSS, two Spearman-Brown split-half reliability coefficients will appear in the "Reliability Statistics" portion of the output when split-half is selected under the Model button: (1) "Equal length" gives the estimate of the reliability if both halves had equal numbers of items, and (2) "Unequal length" gives the reliability estimate assuming unequal numbers.
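The hand calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reproduction of the SPSS routine: the function names (pearson_r, spearman_brown, split_half) are hypothetical, the split follows the first-half/second-half rule described earlier, and only the equal-length Spearman-Brown formula is shown.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(half_r):
    """Equal-length prophecy formula: 2r / (1 + r)."""
    return 2.0 * half_r / (1.0 + half_r)


def split_half(rows):
    """rows: one list of item scores per respondent.

    The first form takes the first half of the items (one extra item
    when the item count is odd); the second form takes the rest.
    Returns (half-test correlation, Spearman-Brown estimate).
    """
    k = len(rows[0])
    cut = (k + 1) // 2
    first = [sum(row[:cut]) for row in rows]
    second = [sum(row[cut:]) for row in rows]
    r = pearson_r(first, second)
    return r, spearman_brown(r)
```

For instance, a half-test correlation of .60 gives 1.2/1.6 = .75 as the predicted full-test reliability.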
Test-retest methods are disparaged by many researchers as a way of gauging reliability. Among the problems are that short intervals between administrations of the instrument will tend to yield estimates of reliability which are too high. There may be invalidity due to a learning/practice effect (subjects learn from the first administration and adjust their answers on the second). There may be invalidity due to a maturation effect when the interval between administrations is long (the subjects change over time). The bother of having to take a second administration may cause some subjects to drop out of the pool, leading to non-response biases. Note, however, that test-retest designs are still widely used and published and there is support for this. McKelvie (1992), for instance, reports that reliability estimates under test-retest designs are not inflated due to memory effects. Researchers using test-retest reliability must address the special validity concerns, but may decide to go ahead if warranted.
Counts in diagonal cells will reflect interrater agreement and cells off the diagonal will represent disagreements. Kappa is a function of the ratio of agreements to disagreements in relation to expected frequencies. In SPSS, Kappa is not available in the Reliability module. Rather, one must obtain it from the Crosstabs procedure (Kappa is a choice under the Statistics button in Crosstabs; it is not a default option). In SAS, weighted and unweighted kappa are computed by the FREQ procedure.
Interpretation. By convention, a Kappa > .70 is considered acceptable interrater reliability, but this depends highly on the researcher's purpose. Another rule of thumb is that K = 0.40 to 0.59 is moderate interrater reliability, 0.60 to 0.79 substantial, and 0.80 or above outstanding (Landis & Koch, 1977). For interrater reliability of a set of items, such as a scale, one would report mean Kappa.
Manual computation: let a = the sum of counts on the diagonal, reflecting agreements. Let e = the sum of expected counts on the diagonal, where expected is calculated as [(row total * column total)/n], summed for each cell on the diagonal. Let n = the total number of ratings (observations). Kappa then equals the ratio of the surplus of agreements over expected agreements, divided by the number of expected disagreements. This is equivalent to K = (a - e)/(n - e). Fleiss and Cohen (1973) have shown ICC, discussed below, is mathematically equivalent to weighted Kappa.
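The manual computation K = (a - e)/(n - e) can be sketched as follows. This is a minimal illustration, assuming a square table of counts with one rater on the rows and the other on the columns; the function name cohen_kappa and the example counts are hypothetical.

```python
def cohen_kappa(table):
    """Unweighted Kappa from a square agreement table of counts.

    table[i][j] is the number of targets placed in category i by the
    first rater and in category j by the second rater.
    """
    n = sum(sum(row) for row in table)                # total ratings
    a = sum(table[i][i] for i in range(len(table)))   # observed agreements
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # expected diagonal counts: (row total * column total) / n
    e = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (a - e) / (n - e)


# Made-up example: 35 agreements out of 50 ratings, 25 expected by chance
example = [[20, 5],
           [10, 15]]
print(cohen_kappa(example))  # 0.4
```

Here a = 35, e = 15 + 10 = 25, and n = 50, so K = 10/25 = .40.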
Weighted Kappa: For ordinal rankings or better, one can weight each cell in the agreement/disagreement table by a weight between 0 and 1, where 1 corresponds to the row and column categories being the same and 0 corresponds to the categories being maximally dissimilar.
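As a rough sketch of the weighting idea, the code below computes weighted Kappa as (observed weighted agreement - expected weighted agreement)/(1 - expected weighted agreement). The default linear weights, 1 - |i - j|/(k - 1), are an assumption made for illustration; the text above only requires weights between 0 and 1, and other schemes (for example, quadratic) are possible. The function name weighted_kappa is hypothetical.

```python
def weighted_kappa(table, weights=None):
    """Weighted Kappa from a square table of counts (k >= 2 categories).

    weights[i][j] is 1 when categories i and j are the same and 0 when
    they are maximally dissimilar. If omitted, linear weights are
    assumed (an illustrative choice, not the only one).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        weights = [[1 - abs(i - j) / (k - 1) for j in range(k)]
                   for i in range(k)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    # weighted proportion of observed agreement
    observed = sum(weights[i][j] * table[i][j] for i, j in cells) / n
    # weighted proportion of agreement expected from the marginals
    expected = sum(weights[i][j] * row_totals[i] * col_totals[j]
                   for i, j in cells) / n ** 2
    return (observed - expected) / (1 - expected)
```

With identity weights (1 on the diagonal, 0 elsewhere) this reduces to unweighted Kappa, which gives a convenient check on the computation.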
Data setup: In using intraclass correlation for interrater reliability, one constructs a table in which column 1 is the target id (1, 2, ..., n) and subsequent columns are the raters (A, B, C, ...). The row variable is some grouping variable which is the target of the ratings, such as persons (Subject1, Subject2, etc.) or neighborhood (E, W, N, S). The cell entries after the first id column are the raters' ratings of the target on some interval variable or interval-like variable, such as some Likert scale. The purpose of ICC is to assess the interrater (column) effect in relation to the grouping (row) effect, using two-way ANOVA.
Interpretation: ICC is interpreted similarly to Kappa, discussed above. ICC will approach 1.0 when there is no variance within targets (ex., subjects, neighborhoods; that is, for any target, all raters give the same ratings), indicating total variation in measurements on the Likert scale is due solely to the target (ex., subject, neighborhood) variable. That is, ICC will be high when any given row tends to have the same score across the columns (which are the raters). For instance, one may find all raters rate an item the same way for a given target, indicating total variation in the measure of a variable depends solely on the values of the variable being measured; that is, there is perfect interrater reliability. Put another way, ICC may be thought of as the ratio of variance explained by the independent variable divided by total variance, where total variance is the explained variance plus variance due to the raters plus residual variance. ICC is 1.0 only when there is no variance due to the raters and no residual variance to explain.
In SPSS, select Analyze, Scale/Reliability Analysis; select your variables; click Statistics; in the Descriptives group, select Item and select Intraclass correlation coefficient; select a model from the Model dropdown list (ex., two-way mixed); select a type from the Type dropdown list (ex., consistency). Continue. OK. Models and Types are discussed below.
Models: ICC varies depending on whether the judges are all judges of interest or are conceived as a random sample of possible judges, and whether all targets are rated or only a random sample, and whether reliability is to be measured based on individual ratings or mean ratings of all judges. These considerations give rise to six forms of intraclass correlation, described in the classic article by Shrout and Fleiss (1979). In SPSS, these types are selected under the Model button of the Reliability dialog and under the Type dropdown list (3 models times 2 types = the six forms of ICC).
Types: Under the Model button of the SPSS Reliability dialog, the Type dropdown list allows the researcher to specify one of two types of ICC computation:
Single versus average measures: Each model has two versions of the intraclass correlation coefficient:
Average measure reliability requires a reasonable number of judges to form a stable average. The number of judges required is estimated beforehand as nj = ICC*(1 - rl)/[rl(1 - ICC*)], where nj is the number of judges needed; rl is the lower bound from the (1 - α)*100% confidence interval around the ICC, discovered in a pilot study; and ICC* is the minimum level of ICC acceptable to the researcher (ex., .80).
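The required number of judges can be worked out directly from this formula, nj = ICC*(1 - rl)/[rl(1 - ICC*)]. The sketch below is a minimal illustration; the function name judges_needed and the pilot-study lower bound of .50 are hypothetical, and the result is returned unrounded so that any rounding up to a whole number of judges is left to the researcher.

```python
def judges_needed(icc_target, lower_bound):
    """Estimated number of judges for a stable average measure.

    icc_target:  ICC*, the minimum ICC acceptable to the researcher
    lower_bound: rl, the lower confidence limit of the ICC found in
                 a pilot study
    """
    return (icc_target * (1 - lower_bound)) / (lower_bound * (1 - icc_target))


# Hypothetical pilot result: lower bound of .50, target ICC of .80
nj = judges_needed(0.80, 0.50)
print(round(nj, 2))
```

With ICC* = .80 and rl = .50, the formula gives (.80 * .50)/(.50 * .20) = 4 judges.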
Use in other contexts. ICC is sometimes used outside the context of interrater reliability. In general, ICC is a coefficient which approaches 1.0 as the between-groups effect (the row effect) becomes very large relative to the within-groups effect (the column effect), whatever the rows and columns represent. In this way ICC is a measure of homogeneity: it approaches 1.0 when any given row tends to have the same values for all columns. For instance, let columns be survey respondents and let rows be Census block numbers, and let the attribute measured be white=0/nonwhite=1. If blocks are homogeneous by race, any given row will tend to have mostly 0's or mostly 1's, and ICC will be high and positive. As a rule of thumb, when the row variable is some grouping or clustering variable, such as Census areas, ICC will increasingly approach 1.0 as the size of the clusters decreases and becomes more compact (ex., as one goes from metropolitan statistical areas to Census tracts to Census blocks). ICC is 0 when within-groups variance equals between-groups variance, indicative of the grouping variable having no effect. Though less common, note that ICC can become negative when the within-groups variance exceeds the between-groups variance.
If Tukey's test shows multiplicative interaction, any model computing scores for cases based on the scale must include the case main effect, the item main effect, and the case-by-item interaction effect. In a footnote to the Tukey test output, SPSS prints an estimate of the power to which items in a set would need to be raised in order to be additive. (Warning: while transforms may eliminate nonadditivity, raising item scores to too high a power will generate large values for all subjects, obscuring differences among subjects.)
In SPSS, select Analyze, Scale, Reliability Analysis; click Statistics; check Tukey's test of additivity.
The Spearman correction for attenuation of a correlation: let r_{xy}* be the corrected r for the correlation of x and y; let r_{xy} be the uncorrected correlation; then r_{xy}* is a function of the reliabilities of the two variables, r_{xx} and r_{yy}:

r_{xy}* = r_{xy} / \sqrt{r_{xx} r_{yy}}
This formula will result in an estimated true correlation (r_{xy}*) which is higher than the observed correlation (r_{xy}), and all the more so the lower the reliabilities. Corrected r may be greater than 1.0, in which case it is customarily rounded down to 1.0.
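A short sketch of the correction follows, assuming the standard form of Spearman's formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The function name correct_for_attenuation is hypothetical, and the reliabilities in the usage lines are made up for illustration.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman's correction: r_xy / sqrt(r_xx * r_yy).

    r_xy is the observed correlation; r_xx and r_yy are the
    reliabilities of the two measures (both must be positive).
    Values above 1.0 are rounded down to 1.0, following the custom
    described in the text.
    """
    corrected = r_xy / sqrt(r_xx * r_yy)
    return min(corrected, 1.0)


# Lower reliabilities produce a larger upward correction
print(correct_for_attenuation(0.40, 0.90, 0.90))
print(correct_for_attenuation(0.40, 0.60, 0.60))
```

With reliabilities of .90 the observed r of .40 rises only to about .44, while with reliabilities of .60 it rises to about .67.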
Note that use of attenuation-corrected correlation is the subject of controversy (see, for ex., Winne & Belfry, 1982). Moreover, because corrected r will no longer have the same sampling distribution as r, a conservative approach is to take the upper and lower confidence limits of r and compute corrected r for both, giving a range of attenuation-corrected values for r. However, Muchinsky (1996) has noted that attenuation-corrected correlations, not being directly comparable with uncorrected correlations, are therefore not appropriate for use with inferential statistics in hypothesis testing, and this would include taking confidence limits. Still, Muchinsky and others acknowledge that the difference between a correlation and attenuation-corrected correlation may be useful, at least for exploratory purposes, in assessing whether a low correlation is low because of unreliability of the measures or because the measures are actually uncorrelated.
One situation in which negative reliability might occur is when the scale items represent more than one dimension of meaning, and these dimensions are negatively correlated, and one split-half test is more representative of one dimension while the other split half is more representative of another dimension. As Krus & Helmstadter point out, factor analyzing the entire set of items first would reveal if the set of items is plausibly conceptualized as unidimensional.
A second scenario for negative reliability is discussed by Magnusson (1966: 67), who notes that when true reliability approaches zero and sample size is small, random disturbance in the data may yield a small negative reliability coefficient.
In the case of Cronbach's alpha, Nichols (1999) notes that values less than 0 or greater than 1.0 may occur, especially when the number of cases and/or items is small. Negative alpha indicates negative average covariance among items, and when sample size is small, misleading samples and/or measurement error may generate a negative rather than positive average covariance. The more the items measure different rather than the same dimension, the greater the possibility of negative average covariance among items and hence negative alpha.
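To see how negative average covariance produces a negative alpha, the sketch below computes alpha from raw item scores using the standard variance form, alpha = [k/(k - 1)] * [1 - (sum of item variances)/(variance of total scores)]. That formula is not stated in this section and is supplied here as an assumption; the function name cronbach_alpha and the tiny data set are likewise made up for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha from raw scores.

    rows: one list of item scores per case. Uses sample variances of
    the items and of the summed scale score.
    """
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Two items that tend to move in opposite directions across five cases,
# so their covariance is negative
negatively_related = [[1, 4], [2, 5], [3, 1], [4, 3], [5, 2]]
print(cronbach_alpha(negatively_related))
```

In this example each item has variance 2.5 but the total score has variance of only 2.0, so the covariance between the items is negative and alpha works out to -3.0, well outside the usual 0 to 1 range.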
In SPSS, select Analyze, Scale/Reliability; select your items; click Statistics; in the Descriptives area, select Item, Scale, Scale if Deleted; in Summarize, select summary statistics (Means, Variances, Covariances, Correlations); and in the ANOVA table group, select Cochran chi-square. Continue. OK.
Cochran's Q is discussed further in the section on significance tests for more than two dependent samples.
Derivation of the ICC formula, following Ebel (1951: 409-411): Let A be the true variance in subjects' ratings due to the normal expectation that different subjects will have true different scores on the rating variable. Let B be the error variance in subjects' ratings attributable to interrater unreliability. The intent of ICC is to form the ratio, ICC = A/(A + B). That is, intraclass correlation is to be true inter-subject variance as a percent of total variance, where total variance is true variance plus variance attributable to interrater error in classification. B is simply the mean-square estimate of within-subjects variance (variance in the ratings for a given subject by a group of raters), computed in ANOVA. The mean-square estimate of between-subjects variance equals k times A (the true component) plus B (the interrater error component), since each mean contains a true component and an error component.
Given B = ms_{within}, and given ms_{between} = kA + B, so that A = (ms_{between} - ms_{within})/k, substituting these equalities into the intended equation (ICC = A/[A+B]), the equation for ICC reduces to the formula for the most-used version of intraclass correlation (Haggard, 1958: 60):

ICC = (ms_{between} - ms_{within}) / (ms_{between} + (k - 1) ms_{within})
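The derivation can be checked numerically. The sketch below computes the one-way ANOVA mean squares from a table of ratings (rows are subjects, columns are the k raters) and then forms ICC = (ms_between - ms_within)/(ms_between + (k - 1) * ms_within), the result obtained by substituting B = ms_within and A = (ms_between - ms_within)/k into A/(A + B). The function name icc_oneway is hypothetical, and the code assumes every subject is rated by the same number of raters.

```python
def icc_oneway(rows):
    """One-way intraclass correlation from a subjects-by-raters table.

    rows: one list of k ratings per subject (same k for every subject).
    """
    n = len(rows)        # number of subjects
    k = len(rows[0])     # number of raters per subject
    grand_mean = sum(sum(row) for row in rows) / (n * k)
    row_means = [sum(row) / k for row in rows]

    # between-subjects mean square: estimates k*A + B
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ms_between = ss_between / (n - 1)

    # within-subjects mean square: estimates B
    ss_within = sum((x - m) ** 2
                    for row, m in zip(rows, row_means) for x in row)
    ms_within = ss_within / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Three subjects rated by two raters who differ by one point throughout
print(round(icc_oneway([[1, 2], [3, 4], [5, 6]]), 3))
```

In this made-up example ms_between = 8 and ms_within = 0.5, so ICC = 7.5/8.5, or about .88; when all raters give identical ratings to each subject, ms_within is 0 and ICC is 1.0.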
Copyright 1998, 2008, 2009 by G. David Garson.
Last updated 1/28/2009.