URBS 502:  Regression Analyses


As useful as the nonparametric statistics are, they are fairly blunt tools and can easily overlook real differences.  Also, they can only tell us that “something is happening,” but cannot describe it in any precise detail.

 

Parametric statistics assume more about the quality of the data, but in return they can tell us more about what is going on with those data.  The most common parametric statistics assume the “General Linear Model”—that is, they assume that the “true,” underlying distribution of the data can be described by a straight line (or one of its variants).  We will look particularly at correlation and analysis of variance.

General Description

Correlation and Analysis of Variance (ANOVA) are among the most powerful of the parametric statistics.  Both test for the presence of a relationship between two characteristics, and both are based on an assumption called the “general linear model.”  This assumption states that the relationship between characteristics (or “variables”), in its ideal form, can be described as a straight line.  In other words, a change in one variable always produces a change in the other, and that change is always in the same direction (greater or less) and at a constant rate.  For many relationships, this makes sense.  The harder you swing a hammer, the bigger the dent you make in the wood.  The harder you push on the gas pedal, the faster you go.  The harder you work in school, the better the grades you get.  And so on.  As long as you can reasonably make this assumption (that there is a relationship, and that it is “linear”), then you can apply the linear model.  Correlation requires the additional assumption that both variables are “normally” distributed; ANOVA only requires that one of the variables be normally distributed.

The advantage of correlation and ANOVA over, for example, chi-square, is that they can tell you not only whether there is a relationship but also what the relationship looks like and how strong it is.  This is a direct result of the linear assumption:  The shape of the relationship can be described by the line that best fits it (that is, by the direction and the slope of the line).  The strength can be described by how closely the data fit the ideal line which describes the relationship.

Mathematical Basis

Correlation

Correlation is the statistical tool which most clearly expresses the general linear model.  To perform a correlation, you must have observations of two characteristics for each case you wish to include, and both observations must be measured on interval scales.  You must further be willing to assume that the distribution underlying the observations is “normal,” or balanced about the mean.

 

There are several formulas connected with correlation (a short Python sketch following this list works through each of them):

  1. Correlation coefficient (r):  This statistic provides an estimate of the overall strength of the relationship between the two characteristics.  Each characteristic (variable) is labeled; one will be called “y” and the other “x.”  By convention, the “causal” variable (if you have reason to suspect that one is a cause of the other) is labeled “x.”  The correlation coefficient is the ratio of the covariance (the joint variation of the observations on both the x and y dimensions) to the variation within each dimension.  Because the variation within each dimension enters as a sum of squared deviations, the denominator is scaled back to the original units by taking the square root of the product (just as the standard deviation is the square root of the variance):

 

$$ r = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sqrt{\sum (y - \bar{y})^2 \; \sum (x - \bar{x})^2}} $$

The correlation coefficient is an index number which will always fall between –1.0 and +1.0.  The closer the value is to –1.0 or +1.0, the stronger the relationship.  If the sign of the coefficient is positive, it means that the value of one variable increases as the value of the other one increases.  If the sign is negative, it means that an increase in the value of one variable is associated with a decline in the value of the other.

  2. Slope (b, or “regression coefficient”):  Slope is defined analogously to “slope” in geometry.  There, the slope of a line is the ratio of the change in “y” (the “rise”) to the change in “x” (the “run”).  In the case of the regression line, the slope is the ratio of the covariance to the variance in “x.”

$$ b = \frac{\sum (y - \bar{y})(x - \bar{x})}{\sum (x - \bar{x})^2} $$

  3. Intercept (a, or “regression constant”):  The intercept is the value of “y” when the value of “x” is 0; in other words, it is the place where the regression line crosses the “y” axis if the data were plotted on graph paper.  From geometry, you might remember that the formula for a straight line is “y = a + bx,” where “b” is the slope of the line, “x” is a variable which you supply, and “a” is the intercept.  From the observed values of the data, we can derive the mean for “x” and for “y,” and we have already seen the formula for calculating the slope.  Therefore, we can rearrange the formula, using those values, to arrive at a value for the intercept:

$$ a = \bar{y} - b\bar{x} $$

  4. t-Test of significance:  It is, finally, useful to know whether the obtained correlation could be explained by nothing more than a chance association, rather than a causal one.  For correlation, the ratio of the regression coefficient (b) to its standard error takes the form of a t-distribution.  If the t-value is greater than that expected by chance, one may conclude that the coefficient is measuring a real relationship.  The standard error of the regression coefficient is based on the ratio of the unexplained variation in “y” to the variation in “x”:

$$ t = \frac{b}{\sqrt{\dfrac{\sum (y - \hat{y})^2 / (n - 2)}{\sum (x - \bar{x})^2}}} $$

where ŷ = a + bx is the value of “y” predicted by the regression line.

The obtained value of t is compared to the standard values of t in a table.  The degrees of freedom are “n-2.”  Since you have no way of knowing whether the true value of the correlation is higher or lower than the obtained coefficient, you are testing both possibilities.  This is called a “two-tailed” test.  In the tables of t-values, the proper values for a two-tailed test at .05 level of probability are listed in the “0.025” column (the .05 is split between the two sides of the distribution).
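The four quantities just described can be traced through in a few lines of code.  Below is a minimal sketch in Python, offered only as an illustration (it is not a required part of the course); the x and y values are invented, and the variable names are the sketch’s own.

    # Worked illustration of r, b, a, and t, using invented data.
    from math import sqrt

    x = [2.0, 4.0, 5.0, 7.0, 9.0]      # hypothetical "causal" variable
    y = [3.0, 6.0, 7.0, 10.0, 13.0]    # hypothetical outcome variable
    n = len(x)

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Building blocks: covariance term and the sums of squared deviations
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)

    r = s_xy / sqrt(s_xx * s_yy)       # correlation coefficient
    b = s_xy / s_xx                    # slope (regression coefficient)
    a = y_bar - b * x_bar              # intercept (regression constant)

    # t-test of the slope: unexplained (residual) variation, scaled by (n - 2)
    # and by the variation in x, gives the standard error of b.
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = sqrt((residual_ss / (n - 2)) / s_xx)
    t = b / se_b

    print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}, t = {t:.2f}, df = {n - 2}")

The obtained t would then be compared to the t-table at n − 2 degrees of freedom, exactly as described above.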

Analysis of Variance (ANOVA)

Analysis of variance applies the general linear model to situations in which one of the variables is measured on an interval scale, but the other variable (the “x,” or causal variable) is membership in a group.  For example, a neighborhood group might be complaining that they are not getting their fair share of the city’s park & recreation money.  ANOVA will allow you to determine whether there is merit to such a claim.  An advantage of ANOVA over correlation is that no assumption need be made that the relationship between the two variables is a straight line. Analysis of variance will work with “U-shaped” or other curvilinear relationships.

 

In the natural order of things, every member of a group will not behave in exactly the same way.  There will be a certain amount of variability within each group.  However, if the groups are truly distinct, then it is reasonable to assume that each group also behaves differently from the others.  These two sources of variance, within-group variance and between-group variance, should add up to the total difference observable in the entire community.  In an idealized situation, all of the variance would be seen between the groups.  Conversely, if the “group” designation is not causing the difference in the other characteristic, there would be no variance between groups, only similar variance within each group.

 

Analysis of variance calculates a statistic called E².  The formula is:

 

$$ E^2 = 1 - \frac{\sum (y - \bar{y}_{\text{group}})^2}{\sum (y - \bar{y}_{\text{total}})^2} $$

 

E² measures the proportion of the total variance which is due to group membership (“between-group” variance).  The formula does not calculate between-group variance directly; rather, between-group variance is what is left over after within-group variance is taken into account.  The formula calculates the ratio of within-group variance to total variance and subtracts the result from unity (1), arriving at the ratio of between-group variance to total variance.

 

Just as a t-Test is used to determine whether the correlation ratio could be explained by chance variation, there is also a test to determine the significance of Analysis of Variance.  It is similar to t, and is called the “F-test.”  For analysis of variance, the F-statistic is calculated by the ratio of the variance between groups to the variance within groups, standardizing each to take the number of groups and the size of the sample into account:

 

$$ F = \frac{\sum (\bar{y}_{\text{group}} - \bar{y}_{\text{total}})^2 \,/\, (\text{groups} - 1)}{\sum (y - \bar{y}_{\text{group}})^2 \,/\, (\text{observations} - \text{groups})} $$

 

The two “standardizing” factors are measures of the “degrees of freedom” in each of the terms.  As with the t-statistic, one consults a table of F-values, using the two measures of degrees of freedom, to determine the significance of the analysis.  The following table contains only a few of the possible values of F.  If necessary, consult an F-table in any statistics text or in a collection of standard math tables.

 

 

df for                 Degrees of Freedom for Observations (n-c)
Groups (c-1)      1       10      20      30      40      50       ∞

     1          161.0    4.96    4.35    4.17    4.08    4.03    3.84
     2          200.0    4.10    3.49    3.32    3.23    3.18    2.99
     3          216.0    3.71    3.10    2.92    2.84    2.79    2.60
     4          225.0    3.48    2.87    2.69    2.61    2.56    2.37

                        F-Table for .05 Level of Significance

 

Analysis of variance tells you if there is a relationship, and how strong it is.  The statistics do not tell you where the relationship lies.  You must determine that by inspecting the pattern of the group means and the within-group deviation from those means.
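To see the arithmetic behind E² and F laid out step by step, here is a minimal Python sketch (again, only an illustration); the three groups and their spending figures are entirely hypothetical.

    # Worked illustration of E-squared and F for three hypothetical groups.
    groups = {
        "north": [12.0, 15.0, 14.0, 13.0],
        "south": [22.0, 25.0, 23.0, 26.0],
        "east":  [18.0, 17.0, 19.0, 20.0],
    }

    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(all_values) / len(all_values)

    within_ss = 0.0    # deviations of observations from their own group mean
    between_ss = 0.0   # deviations of each group mean from the grand mean
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        within_ss += sum((v - group_mean) ** 2 for v in vals)
        between_ss += len(vals) * (group_mean - grand_mean) ** 2

    total_ss = sum((v - grand_mean) ** 2 for v in all_values)

    e_squared = 1 - within_ss / total_ss          # proportion of variance due to groups
    df_between = len(groups) - 1                  # groups - 1
    df_within = len(all_values) - len(groups)     # observations - groups
    f_stat = (between_ss / df_between) / (within_ss / df_within)

    print(f"E^2 = {e_squared:.3f}, F({df_between}, {df_within}) = {f_stat:.2f}")

The computed F (here with 2 and 9 degrees of freedom) would then be checked against an F-table at the .05 level, as described above.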

 

Calculating the Statistics

Now that you know the mathematics behind the tests, the simple truth is that you will almost never calculate these statistics “by hand.”  Instead, you can use the function tools in Microsoft Excel.  You can get to them in several ways:  there is a “Formulas” tab at the top of Excel (where “Home,” “File,” etc. are, depending on which version of Excel you are using).  Or, at the very beginning of the data entry bar (just above the matrix of cells, the bar that shows what you have entered or are entering into each cell) there is an “fx” box (it stands for “function of x”).  If you click on the “fx” box, it gives you a slate of pre-loaded formulas (if you don’t find it using fx, go up to the Formulas tab and look in the libraries there).  The function for correlation is called CORREL.  You can find separate functions for the slope (SLOPE), the intercept (INTERCEPT), and the F-test (F.TEST).  Unfortunately, there is no Excel function for ANOVA.
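If you work in Python rather than Excel, comparable results can be obtained from the scipy library.  The snippet below is a sketch, assuming scipy is installed; it reuses the invented data from the earlier sketches.

    # Correlation/regression and one-way ANOVA via scipy (illustrative only).
    from scipy import stats

    x = [2.0, 4.0, 5.0, 7.0, 9.0]
    y = [3.0, 6.0, 7.0, 10.0, 13.0]

    # Rough equivalents of Excel's CORREL, SLOPE, and INTERCEPT, plus a
    # p-value for the t-test of the slope.
    reg = stats.linregress(x, y)
    print(reg.rvalue, reg.slope, reg.intercept, reg.pvalue)

    # One-way analysis of variance: F statistic and its p-value.
    north = [12.0, 15.0, 14.0, 13.0]
    south = [22.0, 25.0, 23.0, 26.0]
    east = [18.0, 17.0, 19.0, 20.0]
    f_stat, p_value = stats.f_oneway(north, south, east)
    print(f_stat, p_value)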

 

 

Interpreting the Results

Analysis of variance and correlation are complicated enough that an entire course could easily be offered on the variations commonly used with either one of them.  For example, this unit has not touched on curve-fitting, multiple regression, causal modeling, factor analysis, canonical correlation, or analysis of covariance.  This is a very powerful tool and is worth pursuing in much greater detail, if you have the inclination.

 

The coefficient of determination (the square of the correlation coefficient, or r²) is a “PRE” measure:  it gives you the “proportional reduction in error” obtained by using the information from the analysis.  In other words, if you observe a correlation of 0.5, that means you can increase your odds of correctly predicting an outcome (reduce your error) by 25% (the square of 0.5).  Similarly, a correlation of 0.3 increases your accuracy by only about 9%, and a correlation of 0.7 by about 49%.  One implication of this is that you can find a statistically significant correlation of, say, 0.3 that is nevertheless not very important (only about a 9% increase in accuracy).

 

There is another sense in which the discussion here is lacking:  I have presented the statistics in their basic form.  Almost all of them have corrections which must be applied in special circumstances.  Correlation requires that the data be normally distributed; there are some techniques for loosening that requirement.  Adjustments like these are the topics of courses in statistics.

 

Cases for Study

 

Choose one of the following cases.  Don’t just “tell me” that there is a difference; prove that the difference you observed is, in fact, a “significant” difference (at the .05 level) by reporting and explaining an appropriate quantitative (statistical) test.

 

1.    Using the US Census online (http://factfinder.census.gov/home/saff/main.html?_lang=en ), or any other source available to you, develop a table listing two characteristics of the cities in Minnesota (choose two that might reasonably be expected to have some relation to each other—population size and poverty, for example).  Does it appear that there is a relationship between them (test this, don’t just inspect the data and reply “yes” or “no”)?  Is this relationship stable over time (i.e., what happens if you run the test using different years)?  Is it stable across place (i.e., what happens if you run the test for different states)?  How might you explain this relationship, or its absence (i.e., what factors might account for it)?  What further information would you need to test your explanation?

 

2.    Compare the average per capita income for Minnesota’s 25 largest cities to the average for the nation and for the West North Central Region (MN, SD, ND, IA, WI).  Are people in Minnesota cities better off or worse off than the nation as a whole?  Than the region?  (Again, test this; don’t just inspect the data.)  Does it make a difference if you distinguish between the Metro and out-State cities?  Was the same relation apparent ten years ago?  What would account for your findings?

 

 

3.    Consider the hypotheses that you developed in the unit on research design.  Select one or more that can be (1) measured operationally by using Census or other publicly available data series and (2) can be tested using correlation or ANOVA.  Develop the data and test your hypothesis.  Then test the robustness of your initial test, by repeating it using data from a different time or a different place.  What light do the data throw on your hypotheses?

 

 

Further Reading

Blalock, H.M., Jr.  1972.  Social Statistics, 2nd Ed.  New York:  McGraw-Hill.

Chemical Rubber Company.  1959.  CRC Standard Mathematical Tables, 12th Ed.  Cleveland, OH:  C.R.C.

Draper, N.R. & H. Smith.  1966.  Applied Regression Analysis.  NY:  John Wiley & Sons.

Krueckeberg, D.A. & A.  Silvers.  1974.  Urban Planning Analysis:  Methods and Models.  NY:  John Wiley & Sons.

Willemain, T.R.  1980.  Statistical Methods for Planners.  Cambridge:  MIT Press.

 


 

© 2006 A.J.Filipovitch
Revised 6 January 2007