How you record your data can affect your ability to relate the various elements of the data to each other. It is important to think through (in advance!) what type of analysis you intend to do, to ensure that the data are measured and stored in a way that will permit their retrieval later. How the data are gathered can affect their reliability and their validity (two different things). The combined effect of these two is called “sensitivity.”
Even with the advent of microcomputers, there is still an expense attached to obtaining and analyzing data. If one were to measure all of any phenomenon, there would be no need to infer anything—one would only need to observe and report. But since one can never (well, almost never) measure everything, one must select what one will measure so as to get as accurate a picture of the totality as possible. This is called sampling. There are several ways of drawing a sample, and the size of the sample you draw will determine the confidence and the significance of the results you obtain.
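For a concrete sense of how sample size drives confidence, here is a minimal sketch using the standard formula for estimating a proportion (the formula is conventional statistics; the Python wrapping is mine, not part of the original text):

```python
# Required sample size for estimating a proportion p within +/- margin,
# at a confidence level expressed as a z-score: n = z^2 * p(1-p) / e^2
import math

def sample_size(confidence_z: float, margin: float, p: float = 0.5) -> int:
    """p = 0.5 is the conservative (worst-case) assumption."""
    return math.ceil(confidence_z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(1.96, 0.05))   # 95% confidence, +/- 5%: n = 385
print(sample_size(2.58, 0.03))   # 99% confidence, +/- 3%: n = 1849
```

Notice how quickly the required sample grows as you tighten the margin or raise the confidence level; this is the cost tradeoff the paragraph above describes.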
Some years back, I received a request from a nearby city to help them develop data for an ongoing comparison of their city with several other similar cities (nowadays we would call it “benchmarking”). They wanted at least 3 years of data for their city and for 6-12 other cities. Some of the cities were to be of similar population, some of similar tax base, some of similar area, etc. They also wanted the data to feed internal and external peer evaluations of various city operations, including Planning and Budgeting (both capital and operating); Acquisition and use of equipment and materials; Human Resources (selection, training, responsibility, and accountability); Communications (including public relations and customer service); and Evaluation Procedures (findings, recommendations, and implementation).
This is a big project. It is going to take a lot of time and money to assemble all these data, and once it is done it has to be usable by a widely divergent group of people for a number of different purposes. How do you go about doing something like this?
Welcome to the exciting (really, it is!) world of database design. This little project involves a number of conceptual elements: The first is simply figuring out how to manage all that information. How many pieces of data do you need? Where do you store them? How do you interrelate them? And what do you want to do with them, anyhow? The second issue is figuring out how to gather the information to populate all the cells of the database. What type of measures will one use? What type of sample will one draw? How large a sample does one need? The third issue revolves around reporting (and interpreting) the results. How much confidence can you have in the results you will draw from the data?
So, you have a pile of data and you want to bring some order to it. How do you think your way through this problem?
To begin with, there are several important distinctions to keep in mind when you are managing information: whether what you need to store is documents or data, and whether you need physical access to the material itself or merely logical access to its contents.
Having decided whether you really want logical access to data (rather than physical storage of documents), you still have to make some decisions about how you intend to use the data. The more “refined” the task, the more refined the data (and access) must be.
In summary, there are several questions you should ask as you begin to design a database:
· What are the questions you want the database to be able to answer?
· Which questions require documents? Which require data?
· What kind of logical access is required?
· What kind of physical access is required?
The point of all of this is to organize your data in such a way that they are as flexible as possible—in the trade, it’s called “data independence.” But there are several dimensions to independence (conceptual, logical, and physical), and in real life there are often tradeoffs among them which you will have to consider.
It gets interesting when all three groups (those working at the conceptual, logical, and physical levels) are using the same system simultaneously in the development phase of a project! Over time, the physical model fades into the background, but the logical and the conceptual models are always in some degree of tension in any management information system.
One of the ways to keep all this information straight is to write a “data dictionary,” a description of all the relevant characteristics of the data you have gathered. If you look in the front of a printed Census publication, you will find extensive notation about the source and quality of the data used to generate the tables in their publications. This is a data dictionary. But you do not need to be doing anything as complicated as the Census to need a data dictionary. As you will discover (if you haven’t already), even data that you knew intimately a year ago will seem strange and awkward when you come back to them after a year’s absence (for example, what is in that file you saved a year ago labeled “x23ft.xls”?).
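A data dictionary does not require special software. Even a simple structure like the following Python sketch captures the essentials (the field names, codings, and notes here are hypothetical examples, not drawn from any actual database):

```python
# A minimal data dictionary: one entry per variable, recording source,
# scale of measurement, units or codings, and any caveats.
data_dictionary = {
    "population": {
        "source": "US Census annual estimates",
        "scale": "ratio",
        "units": "persons",
        "notes": "City totals; annexations may break comparability across years.",
    },
    "gender": {
        "source": "local survey",
        "scale": "nominal",
        "coding": {1: "Female", 2: "Male"},
        "notes": "Record the measurement rule, per the advice above.",
    },
}

# A year from now, "x23ft.xls" is a mystery; a dictionary entry is not.
for field, meta in data_dictionary.items():
    print(field, "->", meta["source"])
```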
The heart of a Conceptual level analysis is a process called “normalization.” In this case, “normal” is not used in the sense of “usual” but in the root meaning of “norm” (which was the Latin name for a carpenter’s square)—a pattern or a model. By the way, this is also the meaning intended in the old term “Normal School,” which was used for institutions that later were called Teachers’ Colleges.
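To make normalization concrete, here is a minimal Python sketch (the column names and figures are made up for illustration) of a flat, repetitive table and its normalized equivalent:

```python
# Unnormalized: each row repeats the city's fixed attributes (its state)
# alongside the yearly observation, so the same fact is stored many times.
flat = [
    ("Mankato",   "MN", 2009, 37000),
    ("Mankato",   "MN", 2010, 37500),
    ("St. Cloud", "MN", 2009, 66000),
]

# Normalized: fixed facts live in one table, observations in another,
# linked by a key. Each fact is now stored exactly once.
cities = {1: ("Mankato", "MN"), 2: ("St. Cloud", "MN")}   # city_id -> (name, state)
observations = [(1, 2009, 37000), (1, 2010, 37500), (2, 2009, 66000)]

# Reassembling the flat view is a simple join on the key:
for city_id, year, pop in observations:
    name, state = cities[city_id]
    print(name, state, year, pop)
```

The payoff is exactly the “data independence” discussed earlier: a city’s fixed attributes can be corrected in one place rather than on every row.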
The heart of Logical level analysis is the process of describing the data structures. There are essentially three: hierarchical (records arranged as a tree, each child having a single parent), network (records linked in many-to-many relationships), and relational (flat tables of rows, related through shared key values).
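Assuming those three are the classic hierarchical, network, and relational models (the standard triad in database texts of this period), here is a rough Python sketch of each, with made-up city data:

```python
# Hierarchical: a tree, in which every child record has exactly one parent.
hierarchical = {"MN": {"Mankato": {2009: 37000}, "St. Cloud": {2009: 66000}}}

# Network: records may participate in many-to-many links
# (here, cities belong to several overlapping peer groups).
network = {"similar_population": ["Mankato", "Winona"],
           "similar_tax_base":   ["Mankato", "St. Cloud"]}

# Relational: flat tables of rows, related only through shared key values.
cities      = [("Mankato", "MN"), ("St. Cloud", "MN")]
populations = [("Mankato", 2009, 37000), ("St. Cloud", 2009, 66000)]
```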
Finally, you need to give some thought to the way information will be reported from the database:
· Filtering: If a database is going to be shared (and why else would you be going through all this trouble?), it is generally wise to decide beforehand who has a need to know what, and to limit access on that basis. Sometimes the database will be public, and everyone in the world is welcome to look at it (the Census databases are like that). Some databases are extremely private and hardly anyone should be permitted access (personnel files and medical records are like that). Some databases include a mixture of data, such that users might have access to one part of the database but not to others. The lowest level of access is “read only.” The user is permitted to see whatever is in the database, but may not modify any of it. At a higher level of access, a user will have permission to modify the database (“write access”). Finally, the highest level of access allows the user to control who else has access to the database (“administrator access”). By the way, higher levels of access do not necessarily mean higher “status” in the organization—clerical workers regularly need write access, and the section chief often needs only read access (and the administrator is often not in the section at all, but is over in the Information Technology section).
· Key Variables: For any database, there will be a small subset of variables that will be of central interest and most commonly used in reporting. These may or may not be the “primary key” or “secondary key” variables of the database (e.g., social security number may be a primary key variable, but the reports will probably use the person’s name). Generally, reports will be generated in groups, based on the key variables and their dimensions. The selection of the key variables should be determined by the end-users of the reports—no matter how much sense it makes to the database designer, if it doesn’t make sense to the end user it won’t get used.
· Uses of the Reports: There are four broad categories of reporting systems, depending on the most common use of the database:
o Monitoring: These reports will be designed to capture a snapshot in time, and perhaps to present it against previous snapshots. Census reports are an excellent example of this approach. These reports tend to provide a lot of data and take a lot of pages to do it.
o Modeling: These reports are designed to capture a current snapshot and predict or project future (or alternative) events. A population projection is a good example of this approach. These reports tend to capture a lot of data in a few pages.
o Interrogative: This is the “customized” approach to reporting. Rather than pre-formatting the report, the reporting system is designed to allow the user to query the database directly with a unique request (see the sketch following this list). These reports usually hold a lot of data in waiting, but use only a small part of it and produce a simple response.
o Strategic Decision System: This is sometimes called an “executive information system”: a reporting system meant to support top-level strategic decisions, typically combining elements of the other three approaches.
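As a sketch of the interrogative approach described above, the following Python example uses the standard-library sqlite3 module to answer an ad-hoc question grouped by a key variable (the table, columns, and figures are all invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city_stats (city TEXT, year INTEGER, "
            "population INTEGER, budget REAL)")
con.executemany("INSERT INTO city_stats VALUES (?, ?, ?, ?)", [
    ("Mankato",   2009, 37000,  71.5e6),
    ("Mankato",   2010, 37500,  73.0e6),
    ("St. Cloud", 2009, 66000, 120.0e6),
    ("St. Cloud", 2010, 66500, 124.0e6),
])

# Ad-hoc ("interrogative") request: average budget per capita, grouped by
# the key variable the end users actually think in terms of -- the city.
query = """SELECT city, AVG(budget / population) AS budget_per_capita
           FROM city_stats GROUP BY city"""
for city, per_capita in con.execute(query):
    print(city, round(per_capita, 2))
```

Note how much data sits “in waiting” while the response itself is just two lines, which is the pattern described in the Interrogative item above.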
WARNING!!! You can easily spend too much time gathering data. Problem-solving should involve 3 stages of almost equal duration:
· Problem definition: If you think it through carefully first, you can save yourself a lot of time later in rework.
· Data gathering: The point of this section.
· Data analysis: The job isn’t finished until the paperwork is done. Make sure to budget time for it.
· Sources of data: Choosing a source for your data will depend on the tradeoff between resources on hand (cost, but also human resources) and the potential bias which could be introduced. Most commonly, you will use secondary data whenever it is available. These are data which have already been gathered (whether inside your agency, such as building permit records, or externally, such as census data). The alternative is to gather the data yourself (“primary data”). This is almost always more expensive, but the results are also almost always more tailored to the question you are asking. Primary data may be obtained from surveys & questionnaires, interviews, or direct inspection.
· Recording data: Measurement creates equivalence among objects of diverse origins—it is a form of standardization. It both hides and highlights reality. The process of measurement involves assigning a number to objects according to some rule (this rule should be recorded in your data dictionary!). It simultaneously determines both the amount and what it is an amount of—what is measured and how it is measured are determined jointly. This is another way of thinking about the issue of “operational definition.”
· Scales of measurement: All measurement falls into one of four categories (a brief illustration in code follows this list).
1. Nominal scale: Measures represent membership in mutually exclusive categories. No order is implied between the categories. For example, “1” for Female, “2” for Male.
2. Ordinal scale: Measures represent rank ordering of items. Numbers represent higher or lower, but carry no expectation about the interval between them. For example, a preference ranking for political candidates where “1” represents the most preferred, “2” the next most, etc. (Note that, for one person, “1” and “2” might be almost a tie, while for another “2” may be some distance from “1,” almost tied with “3.” This can generate some interesting results when there are three candidates and none achieves a clear majority—as has been described by Kenneth Arrow and is called “Arrow’s Paradox.”)
3. Interval scale: Measures represent rank order with a common distance between each rank (number), but the “origin” (zero point) of the number line is arbitrary. Stock market indices, like the Dow Jones, fit this description—a rise of 100 points is exactly offset by a decline of 100 points, but it does not make a lot of sense to say that the Dow is “twice as valuable” at 2000 as it was at 1000.
4. Ratio scale: Measures represent rank order with a common distance between each rank, with the addition of a true “zero.” Many of the common measures (number of housing starts, dollars of income, population, etc.) are ratio scales.
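Here is the promised illustration (my own, with invented numbers) of which summaries respect each scale, using Python’s standard library:

```python
from statistics import mode, median, mean

gender = [1, 2, 2, 1, 2]              # nominal: only counting/mode make sense
print(mode(gender))                    # the most frequent category

ranks = [1, 2, 3, 1, 2]               # ordinal: order matters, intervals do not
print(median(ranks))                   # median is meaningful; a mean is suspect

dow = [1000, 1100, 1050]              # interval: differences meaningful, zero arbitrary
print(mean(dow), dow[1] - dow[0])      # means and differences OK; ratios are not

population = [37000, 66000]           # ratio: true zero, so ratios are meaningful
print(population[1] / population[0])   # "1.78 times as large" is a valid claim
```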
The choice of measurement scale will limit the kinds of statistical analysis one can perform. Chi-square, for example, assumes only that the measures are nominal; correlation requires that both sets of measures be at least interval scales; ANOVA requires that the dependent (“y”) variable be at least interval scale, but the independent variable (“x”) can be a nominal scale.
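As a sketch of how the scale constrains the test, here is how the three analyses just mentioned would be run with SciPy (assuming SciPy is available; all counts and values are invented):

```python
from scipy import stats

# Chi-square: two nominal variables, as counts in a contingency table
table = [[20, 30], [25, 25]]                  # e.g., gender x yes/no
chi2, p, dof, expected = stats.chi2_contingency(table)

# Pearson correlation: both variables at least interval scale
r, p_r = stats.pearsonr([1000, 1100, 1050, 1200], [5.0, 5.4, 5.1, 5.9])

# One-way ANOVA: interval-scale y, with a nominal x defining the groups
f, p_f = stats.f_oneway([3.1, 3.5, 3.2], [4.0, 4.2, 3.9], [2.8, 3.0, 2.9])

print(chi2, r, f)
```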
There are several issues that should be considered when you are planning to gather data. At best, failure to control these issues will introduce unnecessary, random variability into the data and weaken the measures of association you might obtain. At worst, the unnecessary variability will not be random but systematic. This could mask a true relationship (if the bias goes against your hypothesis) or create a spurious relationship (if the bias goes in favor of your hypothesis).
· Accuracy: Avoid shifting the measure over time or across observations. If it takes a long time to gather the data, it is possible that changes will occur within the group you are measuring simply due to the passage of time, for example. This is a significant problem for the US Census—while it is supposed to be a snapshot at a single date, the data are actually gathered over a period of months and in that time people may be born or die or may move in or out of an area. Similarly, if two or more people are gathering the data (again, a problem for the Census), they might make different judgments when applying the measurement criteria; or a single person making multiple observations might be more alert early in the process or might become more adept with the procedure later in the process.
· Completeness: All the relevant population segments should be included. This doesn’t mean everybody should be counted (we will get to “sampling” later in this unit), but it does mean that a representative set of everybody should be included. Of course, what needs to be “represented” will depend on the purpose to which the data will be put—which gets to one of the problems with using secondary data (they were gathered by somebody else for their purposes, and may not be representative for your purposes).
· Comparability: The same definitions have to be applied in similar settings. For example, “family” (as in “single-family dwelling unit”) would appear to be a fairly straightforward term. But is it “any group of people living together in the same dwelling unit,” or is it “any group of people related to each other by blood or marriage”?
· Problem of “the volunteer”: Often you will gather data using a “convenience sample”—ask whoever is available. This can range from the clearly biased—calling all your buddies and asking their opinion—to the possibly acceptable—setting up a table in the shopping mall and asking anyone who will stop by. But even this latter approach may introduce bias. People who will step forward and volunteer are different from the others, if for no other reason than that they are willing to volunteer. This is even true of random telephone surveys, when some people volunteer to stay on the line and others hang up. We assume, in this case, that there are no significant differences between those who respond and those who don’t, but there should always be a trace of doubt, like a gargoyle at your shoulder, when you do that.
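The volunteer problem can be made vivid with a toy simulation (every number here is invented): if willingness to respond is correlated with the opinion being measured, the volunteers’ answers will misstate the population.

```python
import random

random.seed(1)
# 40% of the population truly favors some proposal
population = [random.random() < 0.40 for _ in range(100_000)]

def responds(favors):
    # Assumed response rates, for illustration: supporters answer more often
    return random.random() < (0.60 if favors else 0.30)

sample = [person for person in population if responds(person)]
print(sum(population) / len(population))   # ~0.40, the true proportion
print(sum(sample) / len(sample))           # ~0.57, the biased volunteer estimate
```

No amount of additional volunteers fixes this; a larger convenience sample just estimates the wrong number more precisely.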
There are always three questions that you should answer in any report of your analysis of data:
· What are the possible relationships?
· What are the actual relationships?
· What are the implications of this?
Problem 1: Develop a database for one (or some set) of the urban issues you found yourself questioning, building it from data already available on the Web (you can use URSI’s “Sources of Data” link from the Resources button on the home page, at http://sbs.mnsu.edu/ursi/research/sources.html ).
This is an extensive database. Explore its structure, using both forward and backward links (i.e., drill deeper into the data and pull back out to see the larger database within which it is embedded).
© 1996 A.J.Filipovitch
Revised 21 February 2010