Statistics II:  Definitions & Mathematical Basis


Correlation

Correlation is the statistical tool which most clearly expresses the general linear model.  To perform a correlation, you must have observations of two characteristics for each case you wish to include, and the observations must both be measured on interval scales.  You must further be willing to assume that the distribution underlying the observations is “normal,” or balanced about the mean.

 

There are several formulas connected with correlation:

  1. Correlation coefficient (r):  This statistic provides an estimate of the overall strength of the relationship between the two characteristics.  Each characteristic (variable) is labeled; one will be called “y” and the other “x.”  By convention, the “causal” variable (if you have reason to suspect that one is a cause of the other) is labeled “x.”  The correlation coefficient is the ratio of the covariance (the joint variation of an observation on both the x and y dimensions) to the variation within each dimension.  Since the variation within each dimension is measured in squared units, the product in the denominator is scaled back to the original dimension, as with the standard deviation, by taking its square root:

 

$$ r = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sqrt{\sum (y - \bar{y})^2 \, \sum (x - \bar{x})^2}} $$

       

The correlation coefficient is an index number which will always fall between –1.0 and +1.0.  The closer the value is to –1.0 or +1.0, the stronger the relationship.  If the sign of the coefficient is positive, it means that the value of one variable increases if the value of the other one increases.  If the sign is negative, it means that an increase in the value of one variable is associated with a decline in the value of the other.
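As a minimal sketch, the coefficient can be computed directly from the sums in the formula.  The data and variable names below are made-up illustrations, not drawn from any real dataset:

```python
from math import sqrt

# Hypothetical paired observations; x is the presumed "causal" variable.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.0, 4.5, 5.5, 7.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Covariance term: the sum of joint deviations from the two means.
covar = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))

# Sums of squared deviations within each dimension.
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)

r = covar / sqrt(ss_y * ss_x)
print(f"r = {r:.3f}")  # always falls between -1.0 and +1.0
```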

  2. Slope (b, or “regression coefficient”):  Slope is defined analogously to “slope” in geometry.  There, the slope of a line is the ratio of the change in “y” (the “rise”) to the change in “x” (the “run”).  In the case of the regression line, the slope is the ratio of the covariance to the variance in “x”:

$$ b = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sum (x - \bar{x})^2} $$

  3. Intercept (a, or “regression constant”):  The intercept is the value of “y” when the value of “x” is 0; in other words, it is the place where the regression line crosses the “y” axis if the data were plotted on graph paper.  From geometry, you might remember that the formula for a straight line is y = a + bx, where “b” is the slope of the line, “x” is a variable which you supply, and “a” is the intercept.  From the observed values of the data, we can derive the mean for “x” and for “y,” and we have already seen the formula for calculating the slope.  Therefore, we can rearrange the formula using those values to arrive at a value for the intercept:

$$ a = \bar{y} - b\bar{x} $$
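Continuing the sketch above, the slope and intercept follow from the same sums (again, the data are illustrative assumptions):

```python
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.0, 4.5, 5.5, 7.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

covar = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)

b = covar / ss_x       # slope: covariance over the variance in x
a = y_bar - b * x_bar  # intercept: rearranged from y = a + bx
print(f"y = {a:.3f} + {b:.3f}x")
```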

  4. t-Test of significance:  It is, finally, useful to know whether the obtained correlation could be explained by nothing more than a chance association, rather than a causal one.  For correlation, the ratio of the regression coefficient (b) to its standard error follows a t-distribution.  If the t-value is greater than that expected by chance, one may conclude that the coefficient is measuring a real relationship.  The standard error of the regression coefficient is based on the ratio of the unexplained variation in “y” to the variation in “x”:

$$ t = \frac{b}{\sqrt{\dfrac{\sum (y - \hat{y})^2 / (n - 2)}{\sum (x - \bar{x})^2}}} $$

where $\hat{y} = a + bx$ is the value of y predicted by the regression line.

The obtained value of t is compared to the standard values of t in a table.  The degrees of freedom are “n – 2.”  Since you have no way of knowing in advance whether the true relationship is positive or negative, you are testing both possibilities.  This is called a “two-tailed” test.  In the tables of t-values, the proper values for a two-tailed test at the .05 level of probability are listed in the “0.025” column (the .05 is split between the two sides of the distribution).
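A sketch of the same test, building on the slope and intercept computed above (the data remain illustrative, and the critical value quoted in the comment is the standard two-tailed .05 value for 3 degrees of freedom):

```python
from math import sqrt

x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.5, 3.0, 4.5, 5.5, 7.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar

# Unexplained variation: squared deviations of y from the regression line.
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

se_b = sqrt((ss_resid / (n - 2)) / ss_x)  # standard error of b
t = b / se_b
print(f"t = {t:.2f} with {n - 2} degrees of freedom")
# For n - 2 = 3 degrees of freedom, the tabled two-tailed critical value
# at the .05 level (the "0.025" column) is about 3.18; a larger t means
# the coefficient is measuring a real relationship.
```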

 

Analysis of Variance (ANOVA)

Analysis of variance applies the general linear model to situations in which one of the variables is measured on an interval scale, but the other variable (the “x,” or causal variable) is membership in a group.  For example, a neighborhood group might be complaining that they are not getting their fair share of the city’s park & recreation money.  ANOVA will allow you to determine whether there is merit to such a claim.  An advantage of ANOVA over correlation is that no assumption need be made that the relationship between the two variables is a straight line. Analysis of variance will work with “U-shaped” or other curvilinear relationships.

 

In the natural order of things, every member of a group will not behave in exactly the same way; there will be a certain amount of variability within each group.  However, if the groups are truly distinct, then it is reasonable to assume that each group also behaves differently from the others.  These two sources of variance, within-group variance and between-group variance, should add up to the total variability observable in the entire community.  In an idealized situation, all of the variance would be seen between the groups.  Conversely, if the “group” designation is not causing the difference in the other characteristic, there would be no variance between groups, only similar variance within each group.
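In symbols, this additive relationship is the standard sum-of-squares decomposition, where each sum runs over all of the individual observations and the “group” mean is the mean of the group to which an observation belongs:

$$ \underbrace{\sum (y - \bar{y}_{\text{total}})^2}_{\text{total}} = \underbrace{\sum (y - \bar{y}_{\text{group}})^2}_{\text{within}} + \underbrace{\sum (\bar{y}_{\text{group}} - \bar{y}_{\text{total}})^2}_{\text{between}} $$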

 

Analysis of variance calculates a statistic called E².  The formula is:

 

$$ E^2 = 1 - \frac{\sum (y - \bar{y}_{\text{group}})^2}{\sum (y - \bar{y}_{\text{total}})^2} $$

 

E² calculates the proportion of the total variance which is due to group membership (“between-group” variance).  In this formula, between-group variance is not calculated directly; it is what is left over after within-group variance is taken into account.  The formula calculates the ratio of within-group variance to total variance and subtracts the result from unity (1), arriving at the ratio of between-group variance to total variance.
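A minimal sketch of the calculation, using made-up spending figures for three hypothetical neighborhoods (all names and numbers are assumptions for illustration):

```python
groups = {
    "north": [4.0, 5.0, 6.0],
    "south": [8.0, 9.0, 10.0],
    "east":  [6.0, 7.0, 8.0],
}

all_y = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)

# Within-group variation: deviations from each group's own mean.
ss_within = 0.0
for ys in groups.values():
    g_mean = sum(ys) / len(ys)
    ss_within += sum((y - g_mean) ** 2 for y in ys)

# Total variation: deviations from the overall mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_y)

e_squared = 1 - ss_within / ss_total
print(f"E^2 = {e_squared:.3f}")  # share of variance due to group membership
```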

 

Just as a t-Test is used to determine whether the correlation ratio could be explained by chance variation, there is also a test to determine the significance of Analysis of Variance.  It is similar to t, and is called the “F-test.”  For analysis of variance, the F-statistic is calculated by the ratio of the variance between groups to the variance within groups, standardizing each to take the number of groups and the size of the sample into account:

 

$$ F = \frac{\sum (\bar{y}_{\text{group}} - \bar{y}_{\text{total}})^2 \,/\, (\text{groups} - 1)}{\sum (y - \bar{y}_{\text{group}})^2 \,/\, (\text{observations} - \text{groups})} $$

 

The two “standardizing” factors are measures of the “degrees of freedom” in each of the terms.  As with the t-statistic, one consults a table of F-values using the two measures of degrees of freedom to determine the significance of the analysis.  The following table contains only a few of the possible values of F.  If necessary, consult an F-table in any statistics text or in a collection of standard math tables.

 

 

df for Groups     Degrees of Freedom for Observations (n - c)
   (c - 1)        1       10      20      30      40      50      infinity
      1           161.0   4.96    4.35    4.17    4.08    4.03    3.84
      2           200.0   4.10    3.49    3.32    3.23    3.18    2.99
      3           216.0   3.71    3.10    2.92    2.84    2.79    2.60
      4           225.0   3.48    2.87    2.69    2.61    2.56    2.37

                        F-Table for .05 Level of Significance
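Continuing the E² sketch above, the F-statistic can be computed from the same sums (the data and group names are still illustrative assumptions):

```python
groups = {
    "north": [4.0, 5.0, 6.0],
    "south": [8.0, 9.0, 10.0],
    "east":  [6.0, 7.0, 8.0],
}

all_y = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)
n_obs = len(all_y)
n_groups = len(groups)

ss_within = 0.0
ss_between = 0.0
for ys in groups.values():
    g_mean = sum(ys) / len(ys)
    ss_within += sum((y - g_mean) ** 2 for y in ys)
    ss_between += len(ys) * (g_mean - grand_mean) ** 2

f_stat = (ss_between / (n_groups - 1)) / (ss_within / (n_obs - n_groups))
print(f"F = {f_stat:.2f} with {n_groups - 1} and {n_obs - n_groups} df")
# Look this up at (groups - 1, observations - groups) degrees of freedom;
# (2, 6) is not shown in the excerpt above, so a fuller F-table is needed.
```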

 

Analysis of variance tells you whether there is a relationship, and how strong it is.  The statistics do not tell you where the relationship lies; you must determine that by inspecting the pattern of the group means and the within-group deviations from those means.

 

 

© 1996 A.J.Filipovitch
Revised 11 March 2005