Database Design:  Gathering Data


WARNING!!!  You can easily spend too much time gathering data.  Problem-solving should involve 3 stages of almost equal duration:

·        Problem definition:  If you think it through carefully first, you can save yourself a lot of time later in rework.

·        Data gathering:  The point of this section.

·        Data analysis:  The job isn’t finished until the paperwork is done.  Make sure to budget time for it.

 

Types of Data/Measurement

·        Sources of data:  Choosing a source for your data will depend on the tradeoff between resources on hand (cost, but also human resources) and the potential bias which could be introduced.  Most commonly, you will use secondary data whenever it is available.  These are data which have already been gathered (whether inside your agency, such as building permit records, or externally, such as census data).  The alternative is to gather the data yourself (“primary data”).  This is almost always more expensive, but the results are also almost always more tailored to the question you are asking.  Primary data may be obtained from surveys & questionnaires, interviews, or direct inspection.

·        Recording data:  Measurement creates equivalence among objects of diverse origins—it is a form of standardization.  It both hides and highlights reality.  The process of measurement involves assigning a number to objects according to some rule (this rule should be recorded in your data dictionary!).  It simultaneously determines both the amount and what it is an amount of—what is measured and how it is measured are determined jointly.  This is another way of thinking about the issue of “operational definition.”

·        Scales of measurement:  All measurement falls into one of four categories.         

1.      Nominal scale:   Measures represent membership in mutually exclusive categories.  No order is implied between the categories.  For example, “1” for Female, “2” for Male.

2.      Ordinal scale:  Measure represents rank ordering of items.  Numbers represent higher or lower, but carry no expectations about the interval between the numbers.  For example, a preference ranking for political candidates where “1” represents most preferred, “2” next most, etc.  (Note that, for one person, “1” & “2” might be almost a tie, while for another “2” may be some distance from “1,” almost tied with “3.”  This can generate some interesting results when there are three candidates and none achieves a clear majority—as described by Kenneth Arrow and called “Arrow’s Paradox.”)

3.      Interval scale:  Measures represent rank order with a common distance between each rank (number), but the “origin” (zero point) of the number line is arbitrary.  Stock market indices, like the Dow Jones, fit this description—a rise of 100 points is exactly offset by a decline of 100 points, but it does not make a lot of sense to say that the Dow is “twice as valuable” at 2000 as it was at 1000.

4.      Ratio scale:  Measures represent rank order with a common distance between each rank, with the addition of a true “zero.”  Many of the common measures (number of housing starts, dollars of income, population, etc.) are ratio scales.

The choice of measurement scale will limit the kinds of statistical analysis one can perform.  Chi-square, for example, assumes only that the measures are nominal; correlation requires that both sets of measures be at least interval scales; ANOVA requires that the dependent (“y”) variable be at least interval scale, but the independent variable (“x”) can be a nominal scale.
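Just as a rough illustration of how the scale constrains the test, here is a minimal Python sketch (it assumes the scipy library is available, and all of the counts and measurements are invented for the example):

# Illustrative only:  toy data showing which statistical test fits which scale.
# Assumes Python with the scipy library installed; all values are invented.
from scipy import stats

# Nominal vs. nominal:  chi-square test of independence
# (rows = gender category, columns = enrolled / not enrolled).
observed = [[30, 20],
            [25, 25]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("Chi-square =", round(chi2, 2), " p =", round(p, 3))

# Interval vs. interval:  Pearson correlation.
income       = [32, 41, 38, 55, 47, 60]   # $1,000s (invented)
housing_cost = [ 9, 12, 11, 16, 14, 18]   # $1,000s (invented)
r, p = stats.pearsonr(income, housing_cost)
print("Correlation r =", round(r, 2), " p =", round(p, 3))

# Nominal "x" vs. interval "y":  one-way ANOVA.
rural = [12, 15, 14, 13]
urban = [18, 20, 17, 19]
F, p = stats.f_oneway(rural, urban)
print("ANOVA F =", round(F, 2), " p =", round(p, 3))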

 

Issues in Data Collection

There are several issues that should be considered when you are planning to gather data.  At best, failure to control these issues will introduce unnecessary, random variability into the data and weaken the measures of association you might obtain.  At worst, the unnecessary variability will not be random but systematic.  This could mask a true relationship (if the bias goes against your hypothesis) or create a spurious relationship (if the bias goes in favor of your hypothesis).

·        Accuracy:  Avoid shifting the measure over time or across observations.  If it takes a long time to gather the data, it is possible that changes will occur within the group you are measuring simply due to the passage of time, for example.  This is a significant problem for the US Census—while it is supposed to be a snapshot at a single date, the data are actually gathered over a period of months and in that time people may be born or die or may move in or out of an area.  Similarly, if two or more people are gathering the data (again, a problem for the Census), they might make different judgments when applying the measurement criteria; or a single person making multiple observations might be more alert early in the process or might become more adept with the procedure later in the process.

·        Completeness:  All the relevant population segments should be included.  This doesn’t mean everybody should be counted (we will get to “sampling” later in this unit), but it does mean that a representative set of everybody should be included.  Of course, what needs to be “represented” will depend on the purpose to which the data will be put—which gets to one of the problems with using secondary data (they were gathered by somebody else for their purposes, and may not be representative for your purposes).

·        Comparability:  The same definitions have to be applied in similar settings.  For example, “family” (as in “single-family dwelling unit”) would appear to be a fairly straightforward term.  But is it “any group of people living together in the same dwelling unit,” or is it “any group of people related to each other by blood or marriage”?

·        Problem of “the volunteer”:  Often you will gather data using a “convenience sample”—ask whoever is available.  This can range from the clearly biased—calling all your buddies and asking their opinion—to the possibly acceptable—setting up a table in the shopping mall and asking anyone who will stop by.  But even this latter approach may introduce bias.  People who will step forward and volunteer are different from the others, if for no other reason than that they are willing to volunteer.  This is even true of random telephone surveys, when some people volunteer to stay on the line and others hang up.  We assume, in this case, that there are no significant differences between those who respond and those who don’t, but there should always be a trace of doubt, like a gargoyle at your shoulder, when you do that.

 

Sampling

·        Why sample? Most of the time, it is not possible to observe every case of something.  Instead, we select a sample from the total population of cases.  Sampling always introduces some error, or deviation from the true distribution of the population, but it is possible to estimate that error and report it with the results.  The estimated error will depend on the size of the sample (the larger the sample, the smaller the error introduced by sampling)—not the size of the sample in relation to the total population (i.e., the proportion of the population), but the absolute size (whether the total population is large or small, a certain sample size will be needed to obtain a specified “confidence interval”).  The necessary size of the sample will depend on the use to which it will be put.  The greater the variability in the responses or the more risky a false conclusion, the larger the sample must be.

·        Types of samples:  Some sampling schemes are so common that they have their own names (a short Python sketch of several of these schemes follows the list):

o       Population “sample”:  Sample “everyone”

o       Simple random sample:  Cases are randomly selected from the total population.  This is the “put the names in a hat” method.  The modern, quantitative approach is to use a random number generator (BTW, Excel spreadsheets have a function that will do this) to select cases from a list.

o       Systematic sample:  This is the approach of picking the nth name on each page of a phone book, for example.  This will also generate a random sample, if there is no bias in the original list.  It could result in an undersampling of a particular ethnic group if its members are represented by a small set of family names (e.g., a large number of Koreans share the family name “Lee,” and a large number of Hmong share the family name “Vang”).

o       Stratified sample:  One way to ensure that groups, perhaps small in number but significant for other reasons, are included is to “stratify” the sample by subgroup, and then randomly select from within each group.  If one wanted to look at the effect of, say, a proposed education policy on school districts, one might want to select an equal number of rural and urban districts on the assumption that the impact on them might be different.

o       Cluster sample:  Sometimes you have access to a number of groups (say, cohorts composed of people who used your agency’s services, grouped by week for all of last year).  In that case, you might choose to “sample” the groups and use the information from all the members of each group you sample.  This is useful when you have reason to believe that the group characteristics may be important in themselves (perhaps your staff went through a major training exercise midway through the year….).
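The sketch below is illustrative Python only (the case list, the rural/urban split, and the sample sizes are all invented); it shows how each of the schemes above might be drawn from a list of cases:

# Illustrative sketch of four sampling schemes (case names and group labels invented).
import random

population = ["case_%03d" % i for i in range(1, 201)]   # a list of 200 cases

# Simple random sample:  every case has an equal chance of selection.
simple = random.sample(population, 20)

# Systematic sample:  a random start, then every kth case down the list
# (random only if the list itself carries no hidden ordering).
k = len(population) // 20
start = random.randrange(k)
systematic = population[start::k]

# Stratified sample:  split the list into subgroups first,
# then select randomly within each subgroup.
strata = {"rural": population[:80], "urban": population[80:]}
stratified = {name: random.sample(group, 10) for name, group in strata.items()}

# Cluster sample:  randomly pick whole groups, then keep every member of each.
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
chosen = random.sample(clusters, 3)
cluster_sample = [case for group in chosen for case in group]

print(len(simple), len(systematic), len(stratified["rural"]), len(cluster_sample))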

·        Determining Sample Size:  There are different ways to determine sample size, depending on how you will be using the sample (what sort of statistical test you will be using); a short calculation sketch follows the formulas below:

o       For a proportion:

§         n = 0.96 / (p′ – p*)², where p* is the hypothesized proportion and p′ is the obtained proportion.  (The constant 0.96 is (1.96)² × 0.25, using the maximum possible value of p(1 – p).)

§         Note that sample size (n) is determined by the difference you can accept between what you expect (p*) and what you get (p′).

o       For an average:

§         n = (1.96)² · s² / (x* – x̄)², where x* is the hypothesized mean, x̄ is the observed mean, and s is the (estimated) standard deviation of the variable.

§         Note here that sample size (n) is determined not only by the acceptable difference between the hypothesized and obtained mean, but also by the variability around the mean (s).
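As a check on the arithmetic, here is a small Python sketch of both formulas (the hypothesized and observed values, and the standard deviation, are invented for illustration):

# Toy calculation of the two sample-size formulas above (all values invented).
import math

def n_for_proportion(p_hyp, p_obs):
    # n = 0.96 / (p_obs - p_hyp)^2, where 0.96 = (1.96^2)(0.25).
    return math.ceil(0.96 / (p_obs - p_hyp) ** 2)

def n_for_mean(x_hyp, x_obs, s):
    # n = (1.96^2)(s^2) / (x_obs - x_hyp)^2.
    return math.ceil((1.96 ** 2) * s ** 2 / (x_obs - x_hyp) ** 2)

# To detect a 5-percentage-point difference in a proportion:
print(n_for_proportion(0.50, 0.55))      # about 384 cases

# To detect a $2,000 difference in a mean, with s = $8,000:
print(n_for_mean(30000, 32000, 8000))    # about 62 cases

Notice how quickly the required n grows as the acceptable difference shrinks.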

© 1996 A.J.Filipovitch
Revised 11 March 2005