As useful as the nonparametric statistics are, they are fairly blunt tools and can easily overlook real differences. Also, they can only tell us that “something is happening,” but cannot describe it in any precise detail.
Parametric statistics assume more about the quality of the data, but in return they can tell us more about what is going on with those data. The most common parametric statistics assume the “General Linear Model”; that is, they assume that the “true,” underlying distribution of the data can be described by a straight line (or one of its variants). We will look particularly at correlation and analysis of variance.
Correlation and Analysis of Variance (ANOVA) are among the most powerful of the parametric statistics. Both test for the presence of a relationship between two characteristics, and are based on an assumption that is called the “general linear model.” This assumption states that the relationship between characteristics (or “variables”), in its ideal form, can be described as a straight line. In other words, a change in one variable always produces a change in other variables and that change is always in a certain direction (greater or less) and at the same strength. For many relationships, this makes sense. The harder you swing a hammer, the bigger the dent you make in the wood. The harder you push on the gas pedal, the faster you go. The harder you work in school, the better the grades you get. And so on. As long as you can reasonably make this assumption (that there is a relationship, and that it is “linear”), then you can apply the linear model. Correlation requires the additional assumption that both variables are “normally” distributed; ANOVA only requires that one of the variables be normally distributed.
The advantage of correlation and ANOVA over, for example, chi-square, is that they can tell you not only whether there is a relationship but also what the relationship looks like and how strong it is. This is a direct result of the linear assumption: The shape of the relationship can be described by the line that best fits it (that is, by the direction and the slope of the line). The strength can be described by how closely the data fit the ideal line which describes the relationship.
Correlation is the statistical tool which most clearly expresses the general linear model. To perform a correlation, you must have observations of two characteristics for each case you wish to include, and both observations must be measured on interval scales. You must further be willing to assume that the distribution underlying the observations is “normal,” or balanced about the mean.
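There are several formulas connected with correlation. The correlation coefficient, r, is calculated as follows (a bar over a letter indicates the mean of that variable):
$$ r = \frac{\sum (y - \bar{y})(x - \bar{x})}{\left[\sum (y - \bar{y})^2\right]^{0.5} \left[\sum (x - \bar{x})^2\right]^{0.5}} $$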
The correlation coefficient is an index number which will always fall between –1.0 and +1.0. The closer the value is to either +1 or –1 (that is, the farther it is from 0), the stronger the relationship. If the sign of the coefficient is positive, it means that the value of one variable increases if the value of the other one increases. If the sign is negative, it means that an increase in the value of one variable is associated with a decline in the value of the other.
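The calculation of r can be sketched in a few lines of Python. This is only an illustration of the formula above, not a replacement for a statistics package; the function name `pearson_r` and the data (hours studied and grade points for five students) are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: the sum of cross-products of deviations,
    divided by the square root of the product of the two sums of squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 cases")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical observations, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
grade_points = [2, 4, 5, 4, 5]

r = pearson_r(hours_studied, grade_points)
print(f"r = {r:.3f}")  # positive sign: grades tend to rise with hours studied

# A perfectly linear, downward-sloping relationship gives r = -1.
r_negative = pearson_r([1, 2, 3], [6, 4, 2])
print(f"r = {r_negative:.1f}")
```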
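The slope (b) and the intercept (a) of the line that best fits the data are:
$$ b = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sum (x - \bar{x})^2} $$
$$ a = \bar{y} - b\bar{x} $$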
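Whether the relationship could be explained by chance is tested with the t statistic, which divides the slope by its standard error:
$$ t = \frac{b}{\sqrt{\dfrac{\sum (y - \hat{y})^2 / (n - 2)}{\sum (x - \bar{x})^2}}} $$
Here ŷ = a + bx is the value of y predicted by the line. The same value of t can be obtained directly from the correlation coefficient:
$$ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} $$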
The obtained value of t is compared to the standard values of t in a table. The degrees of freedom are “n-2.” Since you have no way of knowing whether the true value of the correlation is higher or lower than the obtained coefficient, you are testing both possibilities. This is called a “two-tailed” test. In the tables of t-values, the proper values for a two-tailed test at .05 level of probability are listed in the “0.025” column (the .05 is split between the two sides of the distribution).
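The slope, intercept, and t-test can be sketched the same way. The sketch assumes that t is the slope divided by its standard error (algebraically the same as r√(n–2)/√(1–r²)). The function name `fit_line` and the data are hypothetical. The critical value 3.182 is the two-tailed .05 entry for 3 degrees of freedom in a standard t-table; it is typed in by hand rather than computed.

```python
import math

def fit_line(xs, ys):
    """Return slope b, intercept a, and the t statistic for the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = cross / ss_x
    a = mean_y - b * mean_x
    # Sum of squared distances between each observation and the fitted line.
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope; this is zero if the fit is perfect.
    std_error_b = math.sqrt((ss_resid / (n - 2)) / ss_x)
    t = b / std_error_b
    return b, a, t

# Hypothetical observations, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
grade_points = [2, 4, 5, 4, 5]

b, a, t = fit_line(hours_studied, grade_points)
df = len(hours_studied) - 2          # degrees of freedom are n - 2

# Assumed table value: two-tailed test at the .05 level, 3 degrees of freedom.
T_CRITICAL_DF3 = 3.182
significant = abs(t) > T_CRITICAL_DF3

print(f"b = {b:.2f}, a = {a:.2f}, t = {t:.3f}, df = {df}")
print("significant at .05" if significant else "not significant at .05")
```

With only five cases, even a fairly strong correlation fails the test here, which shows why the degrees of freedom matter when reading the table.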
Analysis of variance applies the general linear model to situations in which one of the variables is measured on an interval scale, but the other variable (the “x,” or causal variable) is membership in a group. For example, a neighborhood group might be complaining that they are not getting their fair share of the city’s park & recreation money. ANOVA will allow you to determine whether there is merit to such a claim. An advantage of ANOVA over correlation is that no assumption need be made that the relationship between the two variables is a straight line. Analysis of variance will work with “U-shaped” or other curvilinear relationships.
In the natural order of things, not every member of a group will behave in exactly the same way. There will be a certain amount of variability within each group. However, if the groups are truly distinct, then it is reasonable to assume that each group also behaves differently from the others. These two sources of variance, within-group variance and between-group variance, should add up to the total difference observable in the entire community. In an idealized situation, all of the variance would be seen between the groups. Conversely, if the “group” designation is not causing the difference in the other characteristic, there would be no variance between groups and there would be similar variance within each group.
Analysis of variance calculates a statistic called E². The formula is:
$$ E^2 = 1 - \frac{\sum (y - \bar{y}_{\text{group}})^2}{\sum (y - \bar{y}_{\text{total}})^2} $$
Here the first mean is the mean of the group to which an observation belongs, and the second is the mean of all observations taken together. E² calculates the proportion of the total variance which is due to group membership (“between group” variance). The formula does not calculate between-group variance directly; it treats it as what is left over after within-group variance is taken into account. The formula calculates the ratio of within-group variance to total variance and subtracts the result from unity (1), arriving at the ratio of between-group variance to total variance.
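As a rough illustration, the following Python sketch computes the statistic from the within-group and total sums of squares, and then checks that the between-group sum of squares really is what is left over. The function name `correlation_ratio` and the spending figures for three neighborhoods are invented for the example.

```python
def correlation_ratio(groups):
    """Return (E squared, within SS, between SS, total SS) for a dict
    that maps each group name to a list of observations."""
    all_values = [y for values in groups.values() for y in values]
    total_mean = sum(all_values) / len(all_values)
    ss_total = sum((y - total_mean) ** 2 for y in all_values)

    ss_within = 0.0
    ss_between = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((y - group_mean) ** 2 for y in values)
        # Each member of the group contributes its group's deviation once.
        ss_between += len(values) * (group_mean - total_mean) ** 2

    e_squared = 1 - ss_within / ss_total
    return e_squared, ss_within, ss_between, ss_total

# Hypothetical park spending per resident, invented for illustration only.
spending = {
    "Northside": [10, 12, 14, 12],
    "Southside": [20, 22, 24, 22],
    "Eastside": [15, 17, 19, 17, 17],
}

e_squared, ss_within, ss_between, ss_total = correlation_ratio(spending)
print(f"within = {ss_within}, between = {ss_between}, total = {ss_total}")
print(f"E squared = {e_squared:.3f}")
```

In this made-up case the within-group and between-group sums of squares add up to the total, and most of the variation lies between the neighborhoods.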
Just as a t-test is used to determine whether the correlation coefficient could be explained by chance variation, there is also a test to determine the significance of analysis of variance. It is similar to t, and is called the “F-test.” For analysis of variance, the F-statistic is calculated as the ratio of the variance between groups to the variance within groups, standardizing each to take the number of groups and the size of the sample into account:
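$$ F = \frac{\sum (\bar{y}_{\text{group}} - \bar{y}_{\text{total}})^2 / (\text{groups} - 1)}{\sum (y - \bar{y}_{\text{group}})^2 / (\text{observations} - \text{groups})} $$
Both sums run over every observation, so each group's mean enters the numerator once for each member of that group.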
The two “standardizing” factors are measures of the “degrees of freedom” in each of the terms. As with the t-statistic, one consults a table of F-values using the two measures of degrees of freedom to determine the significance of the analysis. The following table contains only a few of the possible values of F. If necessary, consult an F-table in any statistics text or in a collection of standard math tables.
Columns: degrees of freedom for Observations (n-c). Rows: degrees of freedom for Groups (c-1).
| df for Groups (c-1) | 1 | 10 | 20 | 30 | 40 | 50 | infinity |
| 1 | 161.0 | 4.96 | 4.35 | 4.17 | 4.08 | 4.03 | 3.84 |
| 2 | 200.0 | 4.10 | 3.49 | 3.32 | 3.23 | 3.18 | 2.99 |
| 3 | 216.0 | 3.71 | 3.10 | 2.92 | 2.84 | 2.79 | 2.60 |
| 4 | 225.0 | 3.48 | 2.87 | 2.69 | 2.61 | 2.56 | 2.37 |
F-Table for .05 Level of Significance
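A minimal Python sketch of the F-test, using made-up data: it computes F from the between-group and within-group sums of squares, each divided by its degrees of freedom, and compares the result with the table. The function name `f_statistic` and the spending figures are hypothetical. The critical values are copied from the row of the table above for 2 degrees of freedom for groups.

```python
# Critical values of F at the .05 level for c - 1 = 2, keyed by n - c.
# These are typed in from the table above, not computed.
F_CRITICAL_DF2 = {1: 200.0, 10: 4.10, 20: 3.49, 30: 3.32, 40: 3.23, 50: 3.18}

def f_statistic(groups):
    """Return (F, df between groups, df within groups) for a dict
    that maps each group name to a list of observations."""
    all_values = [y for values in groups.values() for y in values]
    total_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - total_mean) ** 2
        ss_within += sum((y - group_mean) ** 2 for y in values)

    df_between = len(groups) - 1                # groups - 1
    df_within = len(all_values) - len(groups)   # observations - groups
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical park spending per resident, invented for illustration only.
spending = {
    "Northside": [10, 12, 14, 12],
    "Southside": [20, 22, 24, 22],
    "Eastside": [15, 17, 19, 17, 17],
}

f, df_between, df_within = f_statistic(spending)
critical = F_CRITICAL_DF2[df_within]   # works only for df listed in the table
significant = f > critical

print(f"F = {f:.2f} with {df_between} and {df_within} degrees of freedom")
print(f"table value = {critical}, significant at .05: {significant}")
```

The lookup only works when the degrees of freedom happen to appear in the short table; for other sample sizes a fuller F-table is needed.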
Analysis of variance tells you if there is a relationship, and how strong it is. The statistics do not tell you where the relationship lies. You must determine that by inspecting the pattern of the group means and the within-group deviation from those means.
Now that you know the mathematics behind the tests, the simple truth is that you will almost never calculate these statistics “by hand.” Instead, you can use the “function key” in Microsoft Excel. You can get to it several ways. There is a “Formulas” tab at the top of Excel (alongside “Home,” “File,” and so on, depending on which version of Excel you are using). Or, at the very beginning of the data entry bar (the bar just above the matrix of cells, which shows what you have entered or are entering into each cell) there is an “fx” box (it stands for “function of x”). If you click on the “fx” box, it gives you a slate of pre-loaded formulas (if you don’t find what you need using fx, go up to the Formulas tab and look in the libraries there). The function for correlation is called CORREL. You can find separate functions for the slope (SLOPE), the intercept (INTERCEPT), and the F-test (F.TEST). Unfortunately, there is no Excel function for ANOVA.
Analysis of variance and correlation are complicated enough that an entire course could easily be offered on the variations commonly used with either one of them. For example, this unit has not touched on curve-fitting, multiple regression, causal modeling, factor analysis, canonical correlation, or analysis of covariance. These are very powerful tools and are worth pursuing in much greater detail, if you have the inclination.
The coefficient of determination (the square of the correlation coefficient, r²) is a “PRE” measure: it gives you the “proportional reduction in error” obtained by using the information from the analysis. In other words, if you observe a correlation of 0.5, that means you can reduce your error in predicting an outcome by 25% (the square of 0.5). Similarly, a correlation of 0.3 reduces your error by roughly 10% (0.09) and a correlation of 0.7 by roughly 50% (0.49). One implication of this is that you can find a statistically significant correlation of, say, 0.3 that is nevertheless not very important (only about a 10% reduction in error).
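The arithmetic behind these figures is just squaring, as this short Python sketch shows. The function name is made up for the example. The squares come to 0.09, 0.25, and 0.49, which is where the rough figures of 10%, 25%, and 50% come from.

```python
def proportional_reduction_in_error(r):
    """PRE for a correlation coefficient: simply the square of r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    return r ** 2

pre_values = {r: proportional_reduction_in_error(r) for r in (0.3, 0.5, 0.7)}

for r, pre in pre_values.items():
    print(f"r = {r:.1f}  ->  error reduced by about {pre:.0%}")
```

Note that the sign of r does not matter here: a correlation of –0.5 reduces error by the same 25% as a correlation of +0.5.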
There is another sense in which the discussion here is lacking: I have presented the statistics in their basic form. Almost all of them have corrections which must be applied in special circumstances. Correlation requires that the data be normally distributed; there are some techniques for loosening that requirement. Adjustments like these are the topics of courses in statistics.
Choose one of the following cases. Don’t just “tell me” that there is a difference; prove that the difference you observed is, in fact, a “significant” difference (at the .05 level) by reporting and explaining an appropriate quantitative (statistical) test.
1. Using the US Census online (http://factfinder.census.gov/home/saff/main.html?_lang=en), or any other source available to you, develop a table listing two characteristics of the cities in
2. Compare the average per capita income for
3. Consider the hypotheses that you developed in the unit on research design. Select one or more that (1) can be measured operationally by using Census or other publicly available data series and (2) can be tested using correlation or ANOVA. Develop the data and test your hypothesis. Then test the robustness of your initial test, by repeating it using data from a different time or a different place. What light do the data throw on your hypotheses?
© 2006 A.J.Filipovitch
Revised 6 January 2007