Database Design


How you record your data can affect your ability to relate various elements in the data to each other.  It is important to think through (in advance!) what type of analysis you intend to do, to ensure that the data are measured and stored in a way that will permit their retrieval later.  How the data are gathered can affect their reliability and their validity (two different things).  The combined effect of these two is called “sensitivity.”

 

Even with the advent of microcomputers, there is still an expense attached to obtaining and analyzing data.  If one were to measure all of any phenomenon, there would be no need to infer anything—one would only need to observe and report.  But since one can never (well, almost never) measure everything, one must select what one will measure so as to get as accurate a picture of the totality as possible.  This is called sampling.  There are several ways of drawing a sample, and the size of the sample you draw will determine the confidence and the significance of the results you obtain.
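
To make that last point concrete, here is a minimal sketch of how sample size, confidence, and margin of error trade off.  It uses the standard formula for estimating a proportion (which is not derived in this section), and the confidence level, margin of error, and population size shown are illustrative assumptions only.

```python
# A minimal sketch of the sample size / confidence / margin-of-error tradeoff,
# using the standard formula for estimating a proportion: n = z^2 * p(1-p) / e^2.
# All the values below are illustrative assumptions, not figures from the text.
import math
from scipy.stats import norm

def sample_size(confidence=0.95, margin_of_error=0.05, p=0.5, population=None):
    """Required sample size for estimating a proportion."""
    z = norm.ppf(1 - (1 - confidence) / 2)       # e.g., about 1.96 for 95% confidence
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:                   # finite population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size())                                        # about 385 for +/-5% at 95%
print(sample_size(margin_of_error=0.03, population=10000))  # tighter error, small city
```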

General Description

Some years back, I received a request from a nearby city to help them develop data for an ongoing comparison of their city with several other similar cities (nowadays, we would call it “benchmarking”).  They wanted at least 3 years of data for their city and 6-12 other cities.  Some of the cities were to be of similar population, some of similar tax base, some of similar area, etc.  They also wanted the data to feed internal and external peer evaluations of various city operations, including Planning and Budgeting (both capital and operating); Acquisition and use of equipment and materials; Human Resources (selection, training, responsibility, and accountability); Communications (including public relations and customer service); and Evaluation Procedures (findings, recommendations, and implementation).

This is a big project.  It is going to take a lot of time and money to assemble all these data, and once it is done it has to be usable by a widely divergent group of people for a number of different purposes.  How do you go about doing something like this?

Welcome to the exciting (really, it is!) world of database design.  This little project involves a number of conceptual elements:  The first is simply figuring out how to manage all that information.  How many pieces of data do you need?  Where do you store them?  How do you interrelate them?  And what do you want to do with them, anyhow?  The second issue is figuring out how to gather the information to populate all the cells of the database.  What type of measures will one use?  What type of sample will one draw?  How large a sample does one need?   The third issue revolves around reporting (and interpreting) the results.  How much confidence can you have in the results you will draw from the data?

So, you have a pile of data and you want to bring some order to it.  How do you think your way through this problem? 

 

Concepts & Initial Issues

To begin with, there are several important distinctions to keep in mind when you are managing information:

  • Storage vs. Access:  Do not confuse storing data with having access to the data (and, usually, it is access in which you are most interested).  For example, you might have (stored in a dusty basement somewhere) 50 boxes of records from 1950 to 1970, in no particular order or sequence.  The data are stored, but they are not particularly accessible.  Similarly, you might have a database with 1500 different data elements, many of which are very similar in name but not in meaning (like “address” and “residential address” and “street address”—the first might include business addresses, the second might include post office addresses, and the third might include business and residential addresses but not post office addresses) and 100,000 records (say, 10,000 transactions a year for the last 10 years).  It is possible to navigate your way through such a tangle to find everyone who lived in the area in 1995, but unless you have thought things through very carefully beforehand it might not be easy.  Generally speaking, storage is easy; access is difficult.  (A small sketch at the end of this list illustrates the difference.)
  • Logical vs. Physical Access:  Again, being able to put your hands on the data is not the same as being able to manipulate them.  To use the previous example, you can gain physical access to the data stored in the dusty basement but there will be no logical search pattern you could use to find a specific record.  Logical access provides a way to efficiently search the database for exactly the records you want.  Given the speed of computers, sometimes it is not worth the effort to craft a maximally efficient logical structure—a spreadsheet program can easily sort through a database of 50 elements and 200 records using brute force (“random access” as it is called in the trade).  But if you are using a database like our other example—1500 data elements for each of 100,000 records—it could take all day (or, more likely, all night) to search the database at random.  Imagine the problems they have at, say, the Census Bureau or the IRS!
  • Data vs. Document Access:  Data are the information stored on documents.  Usually, you don’t need the document, just the information it represents.  There are exceptions, of course—the Courts (for example) often want to see the document, not your notation of what was on it.  The problem, of course, is that you have to have figured out in advance what from the document was worth keeping.
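
Here is a minimal sketch (in Python, with hypothetical field names and records) of the storage-versus-access point: the same pile of records can be searched by brute force, or indexed in advance for the question you expect to ask.

```python
# A minimal sketch of storage vs. logical access. The record layout is hypothetical.
records = [
    {"name": "Smith", "residential_address": "120 Main St", "year": 1995},
    {"name": "Jones", "residential_address": "45 Oak Ave",  "year": 1987},
    # ... imagine 100,000 of these, in no particular order
]

# Brute force ("scan every record"): always works, but slow on a large database.
lived_here_1995 = [r for r in records if r["year"] == 1995]

# Logical access: build an index once, keyed to the question you expect to ask.
by_year = {}
for r in records:
    by_year.setdefault(r["year"], []).append(r)

lived_here_1995 = by_year.get(1995, [])   # now a direct lookup, not a full scan
```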

 

Having decided whether you really want logical access to data (rather than physical storage of documents), you still have to make some decisions about how you intend to use the data.  The more “refined” the task, the more refined the data (and access) must be:

  • Record-keeping:  This is the simplest, least complicated task.  It is also the basis of every other task to follow.  There are even times when this is all that you need or want.  For example, property records (at least in the days before the County Courthouses started to computerize them) were primarily designed to provide a registry of transactions.  There was no particular intention to use the information, but if one needed to get to it one could (although not easily, as any title search can attest).
  • Calculating/Printing:  At the next level of complexity, one might now want to make some simple calculations based on the records and maybe print out the results.  An accountant’s ledgers fit this model.  There is some summary of the records along with storage.
  • Record Searching:  It is still more difficult to sort through a large number of records, either to categorize the records by grouping or to identify a single record.  You are probably familiar with this kind of activity from looking at tables of census data—millions of records are searched and sorted to generate the tables for which planners and other analysts are so grateful.
  • Record Restructuring:  It can be a fairly sophisticated analytical chore to rearrange the records and their relation to each other (usually, to permit greater flexibility of analysis).  To return to the Census Data example, the County and City Databook provides information by census tract on housing density and median income.  Were you to analyze those data, you might notice a negative relationship between the two (as income increases, housing density declines).  But if you were to look at census block data, for some areas of some cities (say, Chicago’s Gold Coast) the opposite might be true.  But you would need to be able to restructure the unit of analysis to discover this.  GIS systems, by the way, are particularly good at these kinds of restructuring.  (A small sketch following this list shows the mechanics.)
  • Sophisticated Analysis:  In some cases (like econometric analysis), the data are transformed by sophisticated mathematical modeling to reveal relationships which are not easily teased out from the background noise and multiple chains of causality.  Such models usually require very large amounts of very carefully measured data.  Economic Base analysis and employment forecasting, for example, require very large matrices of economic transaction data.
  • Process Control:  Finally, the ultimate in database refinement, a stream of data may be gathered, manipulated, and analyzed in real time to provide feedback to modify the behavior of a system.  For example, a water treatment plant may gather data from a series of monitors to continuously adjust the flow in the system so it operates at peak capacity.
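
The sketch below (Python with the pandas library; the numbers and column names are invented for illustration) shows the mechanics of restructuring census-style records from one unit of analysis to another, and how the apparent relationship between density and income can differ between the tract level and the block level, as described under “Record Restructuring” above.

```python
# A sketch of restructuring the unit of analysis. The data are hypothetical.
import pandas as pd

blocks = pd.DataFrame({
    "tract":           ["A", "A", "A", "B", "B", "B"],
    "housing_density": [10, 20, 30, 70, 80, 90],   # units per acre (made up)
    "median_income":   [50, 60, 70, 20, 30, 40],   # $000s (made up)
})

# Tract level: aggregate blocks up to tracts -- here density and income move in
# opposite directions (a negative relationship).
tracts = blocks.groupby("tract")[["housing_density", "median_income"]].mean()
print(tracts["housing_density"].corr(tracts["median_income"]))           # -1.0 here

# Block level, within a single tract: the relationship runs the other way.
gold_coast = blocks[blocks["tract"] == "B"]
print(gold_coast["housing_density"].corr(gold_coast["median_income"]))   # +1.0 here
```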

 

In summary, there are several questions you should ask as you begin to design a database:

·         What are the questions you want the database to be able to answer?

·         Which questions require documents?  Which require data?

·         What kind of logical access is required?

·         What kind of physical access is required?

 

Objective:  Data Independence

The point of all of this is to organize your data in such a way that it is as flexible as possible—in the trade, it’s called “data independence.”  But there are many dimensions to independence, and in real life there are often tradeoffs which you will have to consider:

  • Physical Independence:  Some of the sections of this course were, initially, written on an electric typewriter (yes, I’m that old).  Others were first written on a Macintosh IIe.  This was an improvement—the text became independent of the paper on which it was written.  But when I later shifted to a DOS machine, I found that I could not (at least, not easily or practically) transfer the digital code from Mac to DOS (or back again).  With the development of HTML, we had a format that read the same (well, sort of) on any platform.  But my HTML code stored on 5 ¼” floppies can’t be read on today’s 3 ½” disk drives.  Hardware and software change, and (at least so far) the older versions eventually become unretrievable.  As much as possible, you want to try to find a way to store your data so retrieval will not be tied to a specific technology (that is, despite all its other drawbacks, one of the remaining advantages of the book).
  • Logical Independence:  When I was in High School, they taught us to write term papers by copying our notes onto index cards (do they still use such antique technology?).  The advantage was that you could sort and reorganize your cards many times before you came up with the final structure of your paper.  Each card was “logically” independent of the others.  Similarly, you want to design your database so that information can be mixed and matched in as many ways as possible—and so new information can be added as needed.  Some databases are very flexible with the data that were originally specified, but make it almost impossible to add additional data elements.
  • Data Sharing:  Often, a single database will serve multiple users (with different needs for access and output):
    • Conceptual Model:  This is what the end user is interested in.  At this level, data are identified by “variable name” (like “Housing Units” or “Median Income”).
    • Logical Model:  This is what the applications staff (the “techies” on the staff) are interested in.  They are less interested in what the data are saying than they are in how the data are structured.  They are particularly interested in record layouts and data element relationships.  Their question is whether the logic of the flow of data will provide the results the user wants.
    • Physical Model:  This is what the programmers who designed the original application software are interested in.  They don’t care what you call the variables, and how you lay out your records is your problem.  They are concerned with writing computer code that creates the logical relationships that the applications staff need to use.

It gets interesting when all three groups are using the same system simultaneously in the development phase of a project!  Over time, the physical model fades more to the background, but the logical and the conceptual models are always in some degree of tension in any management information system.

 

Data Dictionary

One of the ways to keep all this information straight is to write a “data dictionary,” a description of all the relevant characteristics of the data you have gathered.  If you look in the front of a printed Census publication, you will find extensive notation about the source and quality of the data used to generate the tables in their publications.  This is a data dictionary.  But you do not need to be doing anything as complicated as the Census to need a data dictionary.  As you will discover (if you haven’t already), even data that you knew intimately a year ago will seem strange and awkward when you come back to them after a year’s absence (for example, what is in that file you saved a year ago labeled “x23ft.xls”?).

  • Contents:  A data dictionary may contain three levels of information (a small sketch of one entry follows this list)—
    • Conceptual level:  What are the “entities” (concepts), their attributes (measures), and their relations (including the source from which they were collected and the uses to which they might be put)?
    • Logical level:  How are the data elements (attributes, at the conceptual level) grouped?  The most common techniques are trees (hierarchical branching), network (multiple-level branching), and tables (fixed matrix).  The data elements should also be cross-referenced to the applications program which is using them.
    • Physical level:  This is a description of each element on each record (think of this as an index card with one or more variables on it—or one line in a file).  It provides the information a programmer would need—length (maximum number of characters permitted), precision (think of it as the number of digits stored after the decimal point), storage medium (disk? Hard drive? Server? Etc.), and security (who has permission to read it? Write it? Delete it?).
  • Process:  There is a simple process for building a data dictionary--
    • Identify the entities and their applications
    • Define the entities, their attributes, and their relations to each other
    • Build the data dictionary (conceptual, logical, and/or physical)
    • Review & evaluate with end users
  • Issues:  There are a number of issues that will arise in the process of developing a data dictionary—
    • Choice of entities:  What rules should be developed for deciding which entities belong in the database and which do not?  This is the opportunity to address criticisms of the current system.
    • Relations among data:  To what extent should current data relationships be carried into a new system?  To what extent should the process be left free to define relationships, even if it means ignoring current usage?
    • Classification of data:  Sometimes natural categories will arise in the process of design which had not been apparent previously, or new substantive applications may suggest themselves.
    • Context:  The best information system, judged by some “logical” criterion, will not be a good design if it does not fit into the context in which it will be used.  This gets to the “soft” side of database design—“user friendliness” (or, “user comfort”).
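
To make the three levels concrete, here is a minimal sketch of what a single data dictionary entry might record, written as a Python dictionary.  The entity, attribute, and values are hypothetical; a real dictionary would have one such entry for every data element.

```python
# A minimal sketch of one data dictionary entry, covering the conceptual, logical,
# and physical levels described above. All names and values are hypothetical.
data_dictionary = {
    "median_income": {
        # Conceptual level: what it means, where it came from, what it is for
        "entity":      "Household",
        "definition":  "Median annual household income, in dollars",
        "source":      "American Community Survey, 5-year estimates",
        "uses":        ["benchmarking", "budget analysis"],
        # Logical level: how it is grouped and which applications use it
        "grouping":    "table",              # tree, network, or table
        "related_to":  ["housing_density", "tract_id"],
        "used_by":     ["annual_benchmark_report"],
        # Physical level: what a programmer needs to know
        "type":        "integer",
        "length":      9,                    # maximum number of characters
        "precision":   0,                    # digits after the decimal point
        "storage":     "server",
        "security":    {"read": "all staff", "write": "planning analysts",
                        "delete": "database administrator"},
    }
}
```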

 

Normalization

The heart of a Conceptual level analysis is a process called “normalization.”  In this case, “normal” is not used in the sense of “usual” but in the root meaning of “norm” (which was the Latin name for a carpenter’s square)—a pattern or a model.  By the way, this is also the meaning intended in the old term, “Normal School,” which was used for institutions that later were called Teachers’ Colleges (Minnesota State University was once “Mankato Normal School,” from whence came the school cheer, “Mankato for the normal!”).  Anyway, normalization is a process for organizing elements in a database to achieve maximum efficiency. 

  • Terms:  Normalization is a technical process, and uses three terms in very specific ways:
    • Attribute:  characteristic of an entity; basic element of analysis
    • Entity:  person, object or concept represented in a database
    • Relationship:  combination of entities.  May be 1-1, 1-many (1-N), or N-N.  One-to-one means one entity is mapped onto one other entity (e.g., name and social security number).  One-to-Many means one entity is mapped onto two or more others (e.g., name to gender and income).  Many-to-Many means multiple entities are related to multiple others (e.g., name and address mapped onto occupation and number of children).
  • Normalization
    • Non-normal data:  simple list of all the elements to be used in all the applications of the database.
    • First Normal Form:  eliminate any duplicate attributes (e.g., “name” is commonly gathered from several sources)
    • Second Normal Form:  define the primary keys and group attributes into entities (e.g., “person” may be all the personal attributes grouped around social security number—so even “name” is secondary, and “residence” may be all the real estate attributes grouped around parcel ID number, etc.)
    • Third Normal Form:  determine any secondary keys from dependencies within entities which are independent of the primary key (e.g., within the person, there may be groupings for education, income, and occupation—or maybe those three will be attributes grouped around a secondary key of socio-economic status.  It depends on what you will be doing with the data.)  A small worked sketch of these steps follows this list.
  • Reasons for normalization:  Normalization is a lot of work, and can sometimes come to resemble scholastic debates about angels and pinheads.  Why do it?
    • Minimize redundancy:  Storage will be more efficient if data is entered once and only once.
    • Minimize update anomalies:  If the same information resides in two locations (think “Phone number”), it is possible (and therefore likely) that sometimes only one of the locations will be updated.  Then how will you know which one is correct?
    • Minimize insertion and deletion anomalies:  Every time a datum is entered, there is the opportunity for error (if only from mistyping).  The fewer opportunities for entry, the fewer opportunities for introducing error and the higher the quality of the data.
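
Here is a small worked sketch of those steps, following the definitions as given above and using hypothetical attributes and keys (social security number for the person, parcel ID for the residence).

```python
# A worked sketch of the normalization steps described above. All data are hypothetical.

# Non-normal data: one flat record listing every element from every source,
# including "name" gathered twice.
non_normal = {
    "name": "Pat Doe", "name_from_permit": "Pat Doe", "ssn": "000-00-0000",
    "address": "120 Main St", "parcel_id": "17-029-24-33-0001",
    "assessed_value": 185000, "income": 52000, "education": "BA",
}

# First Normal Form: eliminate the duplicate attribute.
first_nf = {k: v for k, v in non_normal.items() if k != "name_from_permit"}

# Second Normal Form: group attributes into entities around primary keys.
person = {"ssn": "000-00-0000", "name": "Pat Doe",       # primary key: ssn
          "income": 52000, "education": "BA"}
residence = {"parcel_id": "17-029-24-33-0001",           # primary key: parcel_id
             "address": "120 Main St", "assessed_value": 185000,
             "occupant_ssn": "000-00-0000"}              # 1-to-many link back to person

# Third Normal Form: attributes that depend on each other rather than on the primary
# key get their own secondary key (here, a socio-economic status grouping).
ses = {"ses_id": 3, "income": 52000, "education": "BA"}
person_3nf = {"ssn": "000-00-0000", "name": "Pat Doe", "ses_id": 3}
```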

 

Data Structures

The heart of Logical level analysis is the process of describing the data structures.  There are essentially three (a brief sketch follows the list):

  • Relational Database (Table):  A relational database (sometimes called a “flat file”) by convention arranges attributes across the columns and observations down the rows of a simple matrix.  Each row, however, is a complete unit—a new observation must be generated to change even one attribute in the row.  For example, to update the address one would create an entirely new observation with the new address.  If one wished to retain the old address (say, for archival purposes), then a “date entered” field could be one of the attributes and would create two unique records with essentially the same information except for the address attribute and date entered attribute.
  • Hierarchical Database (Tree):  A hierarchical database can be thought of as an upside-down tree made up of nodes and branches.  The uppermost node is called the “root” (a “root-only” tree is a table).  All logical connections “flow” up the branches to the root and then (if needed) back down the branches.  Access to the lower branches of the tree is through the higher levels, by way of the primary and secondary keys.  For example, if “person” were the root and social security number were the key, one could get to the person’s address by including social security number in the “residence” entity as well as in the “person” entity (such a structure would permit more than one person to reside at the same address).
  • Network:  A network is also a hierarchical structure, but it allows links to be made directly between secondary keys.  The logic no longer flows only through the root, but can flow directly through the branches.  For example, if the “residence” entity included “zip code” and a “business” entity included zip code, one could find all the restaurants in a neighborhood without having to know who lives in that neighborhood.
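
As a brief sketch of the three structures (in Python, with hypothetical records): the table stores complete rows, the tree reaches lower entities through the primary key, and the network links entities directly through a shared secondary key such as zip code.

```python
# A brief sketch of the three data structures described above. Records are hypothetical.

# Table (relational / "flat file"): each row is a complete unit; updating an address
# means adding a new row, with "date_entered" distinguishing old from new.
table = [
    {"ssn": "000-00-0000", "address": "120 Main St", "date_entered": "2008-05-01"},
    {"ssn": "000-00-0000", "address": "45 Oak Ave",  "date_entered": "2010-02-21"},
]

# Tree (hierarchical): "person" is the root; residence is reached through the key.
person    = {"000-00-0000": {"name": "Pat Doe"}}
residence = {"000-00-0000": {"address": "45 Oak Ave", "zip": "56001"}}
address_of_pat = residence["000-00-0000"]["address"]   # access flows through the key

# Network: secondary keys (here, zip code) link entities directly, so you can find
# the restaurants in a neighborhood without going through "person" at all.
business = [
    {"name": "Corner Cafe", "type": "restaurant", "zip": "56001"},
    {"name": "Hardware Co", "type": "retail",     "zip": "56002"},
]
restaurants_nearby = [b for b in business
                      if b["zip"] == residence["000-00-0000"]["zip"]
                      and b["type"] == "restaurant"]
```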

 

Report Formats

Finally, you need to give some thought to the way information will be reported from the database:

·       Filtering:  If a database is going to be shared (and why else would you be going through all this trouble?), it is generally wise to decide beforehand who has a need to know what, and to limit access on that basis.  Sometimes the database will be public, and everyone in the world is welcome to look at it (the Census databases are like that).  Some databases are extremely private and hardly anyone should be permitted access (personnel files and medical records are like that).  Some databases include a mixture of data, such that users might have access to one part of the database but not to others.  The lowest level of access is “read only.”  The user is permitted to see whatever is in the database, but may not modify any of it.  At a higher level of access, a user will have permission to modify the database (“write access”).  Finally, the highest level of access allows the user to control who else has access to the database (“administrator access”).  By the way, higher levels of access do not necessarily mean higher “status” in the organization—clerical workers regularly need write access, and the section chief often needs only read access (and the administrator is often not in the section at all, but is over in the Information Technology section).  A small sketch of such access filtering follows this list.

·       Key Variables:  For any database, there will be a small subset of variables that will be of central interest and most commonly used in reporting.  These may or may not be the “primary key” or “secondary key” variables of the database (e.g., social security number may be a primary key variable, but the reports will probably use the person’s name).  Generally, reports will be generated in groups, based on the key variables and their dimensions.  The selection of the key variables should be determined by the end-users of the reports—no matter how much sense it makes to the database designer, if it doesn’t make sense to the end user it won’t get used.

·       Uses of the Reports: There are four broad categories of reporting systems, depending on the most common use of the database:

o   Monitoring:  These reports will be designed to capture a snapshot in time, and perhaps to present it against previous snapshots.  Census reports are an excellent example of this approach.  These reports tend to provide a lot of data and take a lot of pages to do it.

o   Modeling:  These reports are designed to capture a current snapshot and predict or project future (or alternative) events. A population projection is a good example of this approach.  These reports tend to capture a lot of data in a few pages.

o   Interrogative:  This is the “customized” approach to reporting.  Rather than pre-formatting the report, the reporting system is designed to allow the user to query the database directly in a unique request.  These reports usually hold a lot of data in waiting, but use only a small part of it and produce a simple response.

o   Strategic Decision System:  This is sometimes called an “Information Center.”  It combines all three of the above approaches into a single entity.  In a sense, a library is a strategic decision system.  While often discussed in the literature, they are less commonly found in the real world—they require constant staffing and continual database updating.
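
Here is a small sketch of the access filtering described under “Filtering” above.  The role names and permission rules are hypothetical; real systems usually delegate this checking to the database software itself.

```python
# A small sketch of read / write / administrator access levels. Roles are hypothetical.
PERMISSIONS = {
    "read_only":     {"read"},
    "write":         {"read", "write"},
    "administrator": {"read", "write", "grant_access"},
}

def allowed(role: str, action: str) -> bool:
    """Check whether a user role may perform an action on the database."""
    return action in PERMISSIONS.get(role, set())

print(allowed("read_only", "write"))     # False: the section chief can look but not modify
print(allowed("write", "write"))         # True: clerical staff updating records
print(allowed("write", "grant_access"))  # False: only the administrator grants access
```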

Gathering Data

 

WARNING!!!  You can easily spend too much time gathering data.  Problem-solving should involve 3 stages of almost equal duration:

·       Problem definition:  If you think it through carefully first, you can save yourself a lot of time later in rework.

·       Data gathering:  The point of this section.

·       Data analysis:  The job isn’t finished until the paperwork is done.  Make sure to budget time for it.

 

Types of Data/Measurement

·       Sources of data:  Choosing a source for your data will depend on the tradeoff between resources on hand (cost, but also human resources) and the potential bias which could be introduced.  Most commonly, you will use secondary data whenever it is available.  These are data which have already been gathered (whether inside your agency, such as building permit records, or externally, such as census data).  The alternative is to gather the data yourself (“primary data”).  This is almost always more expensive, but the results are also almost always more tailored to the question you are asking.  Primary data may be obtained from surveys & questionnaires, interviews, or direct inspection.

·       Recording data:  Measurement creates equivalence among objects of diverse origins—it is a form of standardization.  It both hides and highlights reality.  The process of measurement involves assigning a number to objects according to some rule (this rule should be recorded in your data dictionary!).  It simultaneously determines both the amount and what it is an amount of—what is measured and how it is measured are determined jointly.  This is another way of thinking about the issue of “operational definition.”

·       Scales of measurement:  All measurement falls into one of four categories.    

1.      Nominal scale:   Measures represent membership in mutually exclusive categories.  No order is implied between the categories.  For example, “1” for Female, “2” for Male.

2.      Ordinal scale:  Measures represent rank ordering of items.  Numbers represent higher or lower, but carry no expectations about the interval between the numbers.  For example, a preference ranking for political candidates where “1” represents most preferred, “2” next most, etc.  (Note that, for one person, “1” & “2” might be almost a tie, while for another “2” may be some distance from “1,” almost tied with “3.”  This can generate some interesting results when there are three candidates and none achieves a clear majority—as has been described by Kenneth Arrow and is called “Arrow’s Paradox.”)

3.      Interval scale:  Measures represent rank order with a common distance between each rank (number), but the “origin” (zero point) of the number line is arbitrary.  Stock market indices, like the Dow Jones, fit this description—a rise of 100 points is exactly offset by a decline of 100 points, but it does not make a lot of sense to say that the Dow is “twice as valuable” at 2000 as it was at 1000.

4.      Ratio scale:  Measures represent rank order with a common distance between each rank, with the addition of a true “zero.”  Many of the common measures (number of housing starts, dollars of income, population, etc.) are ratio scales.

The choice of measurement scale will limit the kinds of statistical analysis one can perform.  Chi-square, for example, assumes only that the measures are nominal; correlation requires that both sets of measures be at least interval scales; ANOVA requires that the dependent (“y”) variable be at least interval scale, but the independent variable (“x”) can be a nominal scale.
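
The sketch below illustrates those pairings using the scipy library; the data are small hypothetical examples, not real measurements.

```python
# A minimal sketch of matching the statistic to the scale of measurement,
# as noted above. The data below are hypothetical.
from scipy.stats import chi2_contingency, pearsonr, f_oneway

# Nominal x nominal: chi-square on a contingency table (e.g., gender by tenure status).
table = [[30, 20],     # rows and columns are category counts; their order is arbitrary
         [25, 45]]
chi2, p, dof, expected = chi2_contingency(table)

# Interval/ratio x interval/ratio: Pearson correlation (e.g., income and home value).
income     = [28, 35, 41, 52, 67, 80]
home_value = [110, 140, 150, 190, 240, 300]
r, p = pearsonr(income, home_value)

# ANOVA: interval/ratio dependent variable grouped by a nominal independent variable
# (e.g., household income by neighborhood).
north, south, east = [28, 35, 41], [52, 67, 80], [44, 48, 51]
f, p = f_oneway(north, south, east)
```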

 

Issues in Data Collection

There are several issues that should be considered when you are planning to gather data.  At best, failure to control these issues will introduce unnecessary, random variability into the data and weaken the measures of association you might obtain.  At worst, the unnecessary variability will not be random but systematic.  This could mask a true relationship (if the bias goes against your hypothesis) or create a spurious relationship (if the bias goes in favor of your hypothesis).

·       Accuracy:  Avoid shifting the measure over time or across observations.  If it takes a long time to gather the data, it is possible that changes will occur within the group you are measuring simply due to the passage of time, for example.  This is a significant problem for the US Census—while it is supposed to be a snapshot at a single date, the data are actually gathered over a period of months and in that time people may be born or die or may move in or out of an area.  Similarly, if two or more people are gathering the data (again, a problem for the Census), they might make different judgments when applying the measurement criteria; or a single person making multiple observations might be more alert early in the process or might become more adept with the procedure later in the process.

·       Completeness:  All the relevant population segments should be included.  This doesn’t mean everybody should be counted (we will get to “sampling” later in this unit), but it does mean that a representative set of everybody should be included.  Of course, what needs to be “represented” will depend on the purpose to which the data will be put—which gets to one of the problems with using secondary data (they were gathered by somebody else for their purposes, and may not be representative for your purposes).

·       Comparability:  The same definitions have to be applied in similar settings.  For example, “family” (as in “single-family dwelling unit”) would appear to be a fairly straightforward term.  But is it “any group of people living together in the same dwelling unit,” or is it “any group of people related to each other by blood or marriage”?

·       Problem of “the volunteer”:  Often you will gather data using a “convenience sample”—ask whoever is available.  This can range from the clearly biased—calling all your buddies and asking their opinion—to the possibly acceptable—setting up a table in the shopping mall and asking anyone who will stop by.  But even this latter approach may introduce bias.  People who will step forward and volunteer are different from the others, if for no other reason than that they are willing to volunteer.  This is even true of random telephone surveys, when some people volunteer to stay on the line and others hang up.  We assume, in this case, that there are no significant differences between those who respond and those who don’t, but there should always be a trace of doubt, like a gargoyle at your shoulder, when you do that.

Interpreting the Analysis

There are always three questions that you should answer in any report of your analysis of data:

·       What are the possible relationships?

·       What are the actual relationships?

·       What are the implications of this?

 

Cases for Study

Problem 1  Develop a database for one (or a set) of those urban issues you found yourself questioning, building it from data already available on the Web (you can use URSI’s “Sources of Data” link from the Resources button on the home page, at http://sbs.mnsu.edu/ursi/research/sources.html).

  1. Lay out the design of a database that you would use at work to answer the kinds of questions you have been asking these last few weeks.  It should include at least 15 entities (with at least 3 attributes for each entity).  Remember to design the data dictionary as part of this process.
  2. Using some of the data you gathered for the earlier assignments, estimate the sample size of records (observations, cities, people) you would need to ask those sorts of questions.
  3. Consider the sensitivity of your database (no, I don’t mean that you shouldn’t hurt its feelings—remember, in technical writing use technical words only in their technical sense).
  4. What is there to keep you from assembling this database now and using it for your Thesis?

 

Problem 2  Look at the US Census’ “American Fact Finder” website for the American Community Survey (ACS) for the most recent year: (http://factfinder.census.gov )

This is an extensive database.  Explore its structure, using both forward and backward links (i.e., drill deeper into the data and pull back out to see the larger database [!] within which it is embedded).

  1. Lay out the design of a database.  Remember to design the data dictionary as part of this process.
  2. Estimate the sample size of records (comparison cities, city residents, etc.) you would need to get useful information from the database.
  3. Consider the sensitivity of two entities from this database.

 


 


 


 

© 1996 A.J.Filipovitch
Revised 21 February 2010