In this unit, each of the tools will be considered in turn. The discussion will focus on mathematical formulas and definitions, although each formula will also be described in English.
Descriptive statistics are used to summarize the characteristics of a distribution of data. There are two descriptive statistics which will be explained here: mean and standard deviation.
The mean and standard deviation can be calculated only if the data are measured on an “interval” scale, or can be assumed to have been derived from an underlying interval scale. An interval scale is one in which the numbers represent an ordered relationship (2 is greater than 1, 5 is greater than 4, etc.), and the distance (or “interval”) between each number is the same (the interval between 2 and 1 is the same “length” as the interval between 5 and 4). Interval scales are so common that you are probably wondering what the fuss is all about: population, income, and number of housing units are all measured on an interval scale. Attitude scales (“On a scale of 1 to 5, how do you rate….”), while not strictly interval scales, are often assumed to reflect an underlying interval relationship. But some common measures are not interval, and calculating an average can result in nonsense—like the mythical “0.4” child born to the average family, or finding that the average sex distribution is “1.329” (if “1” means “male,” should that “statistic” be interpreted as saying that the average male is oversexed?). If you cannot assume an underlying interval distribution, use the median or the mode to report the central tendency of the distribution. If you try to calculate a mean, you will get garbage; and if someone catches you at it, you will end up looking silly.
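As a quick sketch of the advice above, Python’s standard library provides both measures of central tendency mentioned; the rating data here are hypothetical:

```python
import statistics

# Hypothetical attitude ratings on a 1-to-5 scale (ordinal, not strictly interval)
ratings = [1, 2, 2, 3, 5]

# For non-interval data, report the median or the mode rather than the mean
print(statistics.median(ratings))  # middle value of the sorted list: 2
print(statistics.mode(ratings))    # most frequent value: 2
```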
The mean is the total value of the observations, divided by the number of observations. The result is a weighted mid-point of the distribution of the data. The formula is:
x̄ = Σx / n

“x̄” (read “X-bar”) is the standard symbol for the mean. “x” is a standard symbol for a single observation; “Σ” (the capital Greek letter sigma) is a standard symbol for “the sum of.” “n” is commonly used in statistics to represent the number of observations.
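The formula translates directly into a few lines of Python; the data values here are made up for illustration:

```python
data = [3, 7, 5, 9, 6]  # hypothetical observations

n = len(data)        # the number of observations, "n"
total = sum(data)    # the sum of the observations, Σx
mean = total / n     # x-bar = Σx / n

print(mean)  # 6.0
```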
The standard deviation is the average deviation of the observations from the mean. It is calculated by:
1. Obtaining the deviation of each observation from the mean, (x - x̄);
2. Squaring each deviation, (x - x̄)², which minimizes the impact of minor deviations and emphasizes the larger ones (“squared deviations” are called “variance”);
3. Calculating the total variance, Σ(x - x̄)²;
4. Calculating the average variance, Σ(x - x̄)² / n;
5. Obtaining the “standard,” or average, deviation by taking the square root of the average variance (you can think of this as scaling the variance back down to the size of the original deviations, or as reversing the squaring of the individual deviations which was done earlier to emphasize larger differences over smaller ones).

The complete formula reads:

σ = √( Σ(x - x̄)² / n )

(“σ” is the lower-case Greek letter sigma, and is the symbol for the standard deviation.)
Both the mean and the standard deviation are given here in the form that describes the distribution of data as it was obtained. When you are using empirical (or observed) values as estimates of the true parameters of an underlying distribution (as you will in the t-Test, correlation, and ANOVA), you have to adjust the formula. The adjustment is to replace the sample size (“n”) with “n-1” to account for the tendency of a sample to underestimate the true values of parameters.
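A short sketch of both versions of the standard deviation, using hypothetical data: dividing by n describes the data in hand, while dividing by n-1 estimates the parameter of the underlying distribution.

```python
import math

data = [3, 7, 5, 9, 6]  # hypothetical observations
n = len(data)
mean = sum(data) / n

# Squared deviations from the mean, (x - x-bar)^2
squared_devs = [(x - mean) ** 2 for x in data]

# Descriptive form: divide the total variance by n
sigma = math.sqrt(sum(squared_devs) / n)

# Inferential form: divide by n - 1 to estimate the population parameter
s = math.sqrt(sum(squared_devs) / (n - 1))

print(sigma)  # 2.0
print(s)      # ≈ 2.236
```

Note that the n-1 version is always slightly larger, offsetting the tendency of a sample to underestimate the true spread.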
Inductive statistics are used to determine whether an observed relationship or an observed difference is significant, all things considered. There are two inductive statistics which will be explained here: Chi-square, which is a nonparametric statistic; and Student’s t-Test, a parametric statistic.
Chi-square is a general test for evaluating whether observed frequencies are significantly different from those expected by chance. It is used whenever the observations can be “cross-classified,” or evaluated for two items of information at the same time (for example, the social class and the owner/renter status of each household in a city). Unlike the other statistics in this unit and the next, Chi-square is not confined to interval data; any means of categorizing the data is acceptable, as long as it is possible to cross-classify the categories.
Cross-classification is usually accomplished through a “contingency table,” like this one:

                        Education
                      Low      High
    Income   High     126        99
             Low       71       162
                                        458
The first step in creating a contingency table is to divide the measurements of each characteristic into categories; here, two categories were used for each characteristic (“High” and “Low”). Any number of categories may be used, and it is not necessary to have the same number of categories in the rows as in the columns. Each cell in the table then represents the number of observations which fit the categories at both the row heading and the column heading. For example, in the table above 71 households were both low income and low education, and 162 households were both low income and high education. Notice that the sum of all the cells (126 + 99 + 71 + 162 = 458) is the same as the total number of observations (“458,” in the lower right corner of the contingency table).
The formula for computing Chi-square has two parts:
· “Expected frequencies”: Each cell of the contingency table contains the “observed frequency” of occurrences under that row-and-column heading. The “expected frequency” for a cell is its proportion of the row (the row total divided by the total number of observations—“r/n”) multiplied by its proportion of the column (the column total divided by the total number of observations—“c/n”), multiplied by the total number of observations (“n”). This expression can be simplified to
fe = (r × c) / n
· Calculating formula: The calculation of Chi-square builds on the ratio of observed to expected frequencies, summed over every cell of the table. It is calculated by:

χ² = Σ (fo - fe)² / fe

with degrees of freedom

df = (r - 1) × (c - 1)
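The two formulas can be applied to the income-by-education table above; this sketch hard-codes the four observed frequencies from that example:

```python
# Observed frequencies from the income-by-education contingency table
observed = [[126, 99],   # Income High: Education Low, Education High
            [71, 162]]   # Income Low:  Education Low, Education High

row_totals = [sum(row) for row in observed]        # "r" for each row
col_totals = [sum(col) for col in zip(*observed)]  # "c" for each column
n = sum(row_totals)                                # 458 observations

chi_square = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n  # fe = (r * c) / n
        chi_square += (fo - fe) ** 2 / fe       # Σ (fo - fe)^2 / fe

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)

print(round(chi_square, 2))  # ≈ 30.43
print(df)                    # 1
```

Since the result far exceeds the tabled critical value for df = 1 at the 0.05 level, the relationship between income and education in this example would be judged significant.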
Critical values of Chi-Square

df      0.05      0.01
1      3.841     6.635
2      5.991     9.210
3      7.815    11.341
4      9.488    13.277
5     11.070    15.086
6     12.592    16.812
10    18.307    23.209
The t-Test is used to test for the significance of the difference between two means. It simply asks whether the observed value of the mean could have been obtained, with random variation due to sampling and chance, from some other group. Or, in other words, is the value obtained from this set of observations sufficiently different that one could conclude that the two distributions are independent of each other?
The formula for the t-Test is fairly simple. The difference between the observed mean (called the “sample mean”) and the criterion mean (called the “population mean”) is divided by the “standard error of the sample,” which is the ratio of the sample standard deviation to the square root of the sample size:
· Standard error:

s.e. = s / √n

· t-Test:

t = (x̄ - Pop. Mean) / s.e.
Remember that the sample standard deviation is calculated using “n-1” instead of “n.” The t-statistic is interpreted much the same as Chi-square: examine the t-value to determine whether it exceeds the critical values listed in a table of t-values. Degrees of freedom are determined by the number of observations less one (df = n - 1).
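A sketch of the calculation, with made-up sample values and an assumed population mean of 10:

```python
import math

sample = [12, 15, 11, 14, 13]  # hypothetical observations
pop_mean = 10                  # assumed population (criterion) mean

n = len(sample)
x_bar = sum(sample) / n        # sample mean

# Sample standard deviation, using n - 1 in the denominator
squared_devs = [(x - x_bar) ** 2 for x in sample]
s = math.sqrt(sum(squared_devs) / (n - 1))

se = s / math.sqrt(n)          # standard error: s / sqrt(n)
t = (x_bar - pop_mean) / se    # t = (x-bar - pop. mean) / s.e.
df = n - 1

print(round(t, 3))  # ≈ 4.243
print(df)           # 4
```

With df = 4, a t-value this large is well beyond the tabled critical values, so the sample mean would be judged significantly different from the assumed population mean.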
Critical values of t

df     0.05    0.025     0.01
1     6.314   12.706   31.821
2     2.920    4.303    6.965
3     2.353    3.182    4.541
4     2.132    2.776    3.747
5     2.015    2.571    3.365
6     1.943    2.447    3.143
10    1.812    2.228    2.764
© 1996 A.J.Filipovitch
Revised 11 March 2005