Statistics I:  Definitions & Mathematical Basis


In this unit, each of the tools will be considered in turn.  The discussion will focus on mathematical formulas and definitions, although each formula will also be described in English.

Descriptive Statistics

Descriptive statistics are used to summarize the characteristics of a distribution of data.  There are two descriptive statistics which will be explained here:  mean and standard deviation.

 

The mean and standard deviation can be calculated only if the data are measured on an “interval” scale, or can be assumed to have been derived from an underlying interval scale.  An interval scale is one in which the numbers represent an ordered relationship (2 is greater than 1, 5 is greater than 4, etc.), and the distance (or “interval”) between each number is the same (the interval between 2 and 1 is the same “length” as the interval between 5 and 4).  Interval scales are so common that you are probably wondering what the fuss is all about:  population, income, and number of housing units are all measured on an interval scale.  Attitude scales (“On a scale of 1 to 5, how do you rate….”), while not strictly interval scales, are often assumed to reflect an underlying interval relationship.  But some common measures are not interval, and calculating an average can result in nonsense—like the mythical “0.4” child born to the average family, or finding that the average sex distribution is “1.329” (if “1” means “male,” should that “statistic” be interpreted as saying that the average male is oversexed?).  If you cannot assume an underlying interval scale, use the median or the mode to report the central tendency of the distribution.  If you try to calculate a mean, you will get garbage; and if someone catches you at it, you will end up looking silly.

 

Mean (Average)

The mean is the total value of the observations, divided by the number of observations.  The result is a weighted mid-point of the distribution of the data.  The formula is:

 

x̄ = Σx / n

 

“x̄” (read, “x-bar”) is the standard symbol for the mean; “x” is the standard symbol for a single observation; “Σ” (the capital Greek letter, sigma) is the standard symbol for “the sum of”; “n” is commonly used in statistics to represent the number of observations.
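A minimal Python sketch of this calculation (the function name and the sample incomes are illustrative, not from the text):

    def mean(observations):
        # x-bar:  the total value of the observations divided by their number
        return sum(observations) / len(observations)

    incomes = [32000, 41000, 28500, 55000, 47250]   # hypothetical interval-scale data
    print(mean(incomes))                            # 40750.0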

 

Standard Deviation

The standard deviation is the average deviation of the observations from the mean.  It is calculated by:

1.    Obtaining the deviation of each observation from the mean:  (x − x̄);

2.    Squaring each deviation, (x − x̄)², which minimizes the impact of minor deviations and emphasizes the larger ones (“squared deviations” are called “variance”);

3.    Calculating the total variance:  Σ(x − x̄)²

4.    Calculating the average variance:  Σ(x − x̄)² / n

5.    Obtaining the “standard,” or average, deviation by taking the square root of the average variance (You can think of this as scaling the variance back down to the size of the sample, or as reversing the process of squaring the individual deviations which was done earlier to emphasize larger differences over smaller ones).  The complete formula reads:

σ = √( Σ(x − x̄)² / n )    (“σ” is the lower-case Greek letter sigma, and is the symbol for the standard deviation)

 

Both the mean and the standard deviation are given here in the form that describes the distribution of data as it was obtained.  When you are using empirical (or observed) values as estimates of the true parameters of an underlying distribution (as you will in the t-Test, correlation, and ANOVA), you have to adjust the standard deviation formula.  The adjustment is to replace the sample size (“n”) with “n − 1” to account for the tendency of a sample to underestimate the true variability; the formula for the mean itself is unchanged.
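A short Python sketch of the five steps above, in both forms (the data are hypothetical; the second function applies the “n − 1” adjustment just described):

    import math

    def std_dev(observations):
        # Population form:  square root of the average squared deviation (divides by n)
        m = sum(observations) / len(observations)
        variance = sum((x - m) ** 2 for x in observations) / len(observations)
        return math.sqrt(variance)

    def sample_std_dev(observations):
        # Sample form:  divides by n - 1 to correct the sample's underestimate
        m = sum(observations) / len(observations)
        variance = sum((x - m) ** 2 for x in observations) / (len(observations) - 1)
        return math.sqrt(variance)

    data = [4, 8, 6, 5, 7]                          # hypothetical observations
    print(std_dev(data))                            # about 1.414
    print(sample_std_dev(data))                     # about 1.581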

                               

Inductive Statistics

Inductive statistics are used to determine whether an observed relationship or an observed difference is statistically significant, or whether it could instead be due to chance.  There are two inductive statistics which will be explained here:  Chi-square, which is a nonparametric statistic; and Student’s t-Test, a parametric statistic.

 

Chi-Square

Chi-square is a general test for evaluating whether observed frequencies are significantly different from those expected by chance.  It is used whenever the observations can be “cross-classified,” or evaluated for two items of information at the same time (for example, the social class and the owner/renter status for each household in a city).  Unlike the other statistics in this unit and the next, Chi-square is not confined to interval data; any means of categorizing the data is acceptable, as long as it is possible to cross-classify the categories.

 

Cross-classification is usually accomplished through a “contingency table,” like this one:

 

                         Education
                       Low       High
    Income    High     126         99
              Low       71        162
                                            458

 

The first step in creating a contingency table is to divide the measurements of each characteristic into categories; here, two categories were used for each characteristic (“High” and “Low”).  Any number of categories may be used, and it is not necessary to have the same number of categories in the rows as there are in the columns.  Each cell in the table then represents the number of observations which fit the category at both the row heading and the column heading.  In the table above, for example, 71 households were both low income and low education, and 162 households were both low income and high education.  Notice that the total of all the cells (126 + 99 + 71 + 162 = 458) is the same as the total number of observations (“458,” in the lower right corner of the contingency table).
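A contingency table can be built directly from raw observations.  In this minimal Python sketch, the household data are hypothetical and assumed to arrive as (income, education) pairs; the Counter simply tallies each pair into a cell:

    from collections import Counter

    # Hypothetical raw observations:  (income, education) for each household
    households = [("High", "Low"), ("Low", "High"), ("High", "High"),
                  ("Low", "Low"), ("Low", "High")]
    cells = Counter(households)
    print(cells[("Low", "High")])    # count of low-income, high-education households: 2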

 

The formula for computing Chi-square has two parts:

·        “Expected frequencies”:  Each cell of the contingency table contains the “observed frequency” of occurrences under that row-and-column heading.  The “expected frequency” for a cell is its proportion of the row (the row total divided by the total number of observations, “r/n”) multiplied by its proportion of the column (the column total divided by the total number of observations, “c/n”), multiplied in turn by the total number of observations (“n”).  The expression (r/n) × (c/n) × n can be simplified to

 

fe = (r × c) / n
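For example, in the contingency table above the “High income” row total is 126 + 99 = 225 and the “Low education” column total is 126 + 71 = 197, so the expected frequency for the high-income/low-education cell is fe = (225 × 197) / 458, or about 96.8, compared with an observed frequency of 126.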

 

·        Calculating formula:  The calculation of Chi-square builds on the ratio of observed to expected frequencies.  It is calculated by:

  1. Computing the difference between the observed and expected frequencies:  fo − fe
  2. Squaring the difference (to emphasize extreme deviations):  (fo − fe)²
  3. Scaling the deviation back down to size by dividing it by the expected frequency:  (fo − fe)² / fe

  4. Adding the value of these scaled deviations across all the cells to obtain the Chi-square value:

χ² = Σ (fo − fe)² / fe

 

  5. Determining the “degrees of freedom,” which is the number of rows minus one multiplied by the number of columns minus one:

df = (r − 1) × (c − 1)

  6. Consulting a standard table of Chi-square values to determine the significance of the result.  The standard of significance is a measure of the likelihood that the results of a statistical test might lead one to an incorrect conclusion—for example, the likelihood that one might conclude that there is a real difference when in fact there is none.  The stricter the standard for avoiding that error (for example, choosing a 0.01 level rather than a 0.05), the greater the likelihood that one might be overlooking a real difference.  In the social sciences, the standard for “significance” is usually 0.05, although on some occasions 0.01 might be used.  Significance will be discussed in more detail in Unit 6, Database Design & Sampling.  Tables of Chi-square values are included in almost every statistics text; the most commonly used values are included in the table below, and a computational sketch of the whole procedure follows the table:

 

 

Chi-Square critical values

    df        0.05       0.01
     1       3.841      6.635
     2       5.991      9.210
     3       7.815     11.341
     4       9.488     13.277
     5      11.070     15.086
     6      12.592     16.812
    10      18.307     23.209
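A minimal Python sketch of the whole procedure, applied to the contingency table above (the function name is illustrative):

    def chi_square(table):
        # table is a list of rows of observed frequencies
        n = sum(sum(row) for row in table)
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        chi2 = 0.0
        for i, row in enumerate(table):
            for j, fo in enumerate(row):
                fe = row_totals[i] * col_totals[j] / n    # expected frequency, (r x c) / n
                chi2 += (fo - fe) ** 2 / fe               # scaled squared deviation
        df = (len(table) - 1) * (len(table[0]) - 1)       # degrees of freedom
        return chi2, df

    observed = [[126, 99],     # high income:  low education, high education
                [71, 162]]     # low income:   low education, high education
    chi2, df = chi_square(observed)
    print(round(chi2, 2), df)  # about 30.43 with df = 1; this exceeds 3.841,
                               # so the result is significant at the 0.05 level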

 

Student’s t-Test

This statistic is used to test for the significance of the difference between two means.  The t-Test asks whether the observed value of the mean could have been obtained, with random variation due to sampling and chance, from some other group.  Or, in other words, is the value obtained from this set of observations sufficiently different that one could conclude that the two distributions are distinct from each other?

 

The formula for the t-Test is fairly simple.  The difference between the observed mean (called the “sample mean”) and the criterion mean (called the “population mean”) is divided by the “standard error of the sample,” which is the ratio of the sample standard deviation to the square root of the sample size:

·        Standard error:  s.e. = s / √n

·        t-Test:  t = (x̄ − population mean) / s.e.

Remember that the sample standard deviation is calculated using “n − 1” instead of “n” (the sample mean itself is still the total divided by “n”).  The t-statistic is interpreted much the same as Chi-square:  examine the t-value to determine if it exceeds the critical values listed in a table of t-values, such as the one below.  Degrees of freedom are determined by the number of observations less one (df = n − 1).  A computational sketch follows the table.

 

 

 

t-Test critical values (one-tailed)

    df        0.05      0.025       0.01
     1       6.314     12.706     31.821
     2       2.920      4.303      6.965
     3       2.353      3.182      4.541
     4       2.132      2.776      3.747
     5       2.015      2.571      3.365
     6       1.943      2.447      3.143
    10       1.812      2.228      2.764
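A minimal Python sketch of the t-Test (the sample scores and the population mean of 100 are hypothetical):

    import math

    def t_test(sample, pop_mean):
        # One-sample t:  (x-bar - population mean) / (s / sqrt(n)), with s using n - 1
        n = len(sample)
        x_bar = sum(sample) / n
        s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)                   # standard error of the sample
        return (x_bar - pop_mean) / se, n - 1   # t-value and degrees of freedom

    scores = [104, 111, 98, 107, 113, 102]      # hypothetical observations
    t, df = t_test(scores, 100)
    print(round(t, 3), df)   # about 2.535 with df = 5; this exceeds the 2.015
                             # critical value at 0.05, so the difference is significant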

 

 


 

© 1996 A.J.Filipovitch
Revised 11 March 2005