Database Design: Managing Information
So, you have a pile of data and you want to bring some order
to it. How do you think your way
through this problem?
Concepts & Initial Issues
To begin with, there are several important distinctions to
keep in mind when you are managing information:
- Storage
vs. Access: Do not confuse
storing data with having access to the data (and, usually, it is access
in which you are most interested).
For example, you might have (stored in a dusty basement somewhere)
50 boxes of records from 1950 to 1970, in no particular order or
sequence. The data are stored,
but they are not particularly accessible. Similarly, you might have a
database with 1500 different data elements, many of which are very similar
in name but not in meaning (like “address” and
“residential address” and “street
address”—the first might include business addresses, the
second might include post office addresses, and the third might include
business and residential addresses but not post office addresses) and
100,000 records (say, 10,000 transactions a year for the last 10
years). It is possible
to navigate your way through such a tangle to find everyone who lived in
the area in 1995, but unless you have thought things through very
carefully beforehand it might not be easy. Generally speaking, storage is easy; access is difficult.
- Logical
vs. Physical Access: Again,
being able to put your hands on the data is not the same as being able to
manipulate them. To use the
previous example, you can gain physical access to the data stored in the
dusty basement but there will be no logical search pattern you could use
to find a specific record.
Logical access provides a way to efficiently search the
database for exactly the records you want. Given the speed of computers,
sometimes it is not worth the effort to craft a maximally efficient
logical structure—a spreadsheet program can easily sort through a database of 50 elements and 200 records by brute force (a “sequential search,” as it is called in the trade). But if you are using a database like our other example—1500 data elements for each of 100,000 records—it could take all day (or, more likely, all night) to search the database that way. (A short sketch contrasting brute-force search with an indexed lookup follows this list.)
Imagine the problems they have at, say, the Census Bureau or the
IRS!
- Data
vs. Document Access: Data
are the information stored on documents. Usually, you don’t need the
document, just the information it represents. There are exceptions, of
course—the Courts (for example) often want to see the document, not
your notation of what was on it.
The problem, of course, is that you have to have figured out in advance which parts of the document were worth keeping.
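To make the storage-versus-access distinction concrete, here is a minimal sketch in Python (all field names and records are hypothetical). It contrasts a brute-force scan, which must touch every record, with a simple index that provides a logical access path straight to the records you want.

```python
# A tiny stand-in for a database: a list of records.
records = [
    {"id": 1, "name": "Ada", "year": 1995},
    {"id": 2, "name": "Ben", "year": 1997},
    # ...imagine 100,000 of these
]

def scan(records, year):
    """Brute force: touch every record to find the matches."""
    return [r for r in records if r["year"] == year]

def build_index(records, key):
    """One pass to build a logical access path (an index on one attribute)."""
    index = {}
    for r in records:
        index.setdefault(r[key], []).append(r)
    return index

year_index = build_index(records, "year")
print(scan(records, 1995))        # O(n) work, every single time
print(year_index.get(1995, []))   # near-instant, once the index is built
```

With 200 records the scan is fine; with 100,000 records of 1500 elements each, the index pays for itself many times over.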
Having decided whether you really want logical access
to data (rather than physical storage of documents), you still have to make
some decisions about how you intend to use the data. The more “refined” the task,
the more refined the data (and access) must be:
- Record-keeping: This is the simplest, least
complicated task. It is also
the basis of every other task to follow. There are even times when this is all
that you need or want. For
example, property records (at least in the days before the County Courthouses started to computerize
them) were primarily designed to provide a registry of transactions. There was no particular intention
to use the information, but if one needed to get to it one could (although not easily, as anyone who has done a title search can attest).
- Calculating/Printing: At the next level of complexity,
one might now want to make some simple calculations based on the records
and maybe print out the results.
An accountant’s ledgers fit this model. There is some summary of the
records along with storage.
- Record
Searching: It is still
more difficult to sort through a large number of records, either to
categorize the records by grouping or to identify a single record. You are probably familiar with this
kind of activity from looking at tables of census data—millions of
records are searched and sorted to generate the tables for which planners
and other analysts are so grateful.
- Record
Restructuring: It can be a
fairly sophisticated analytical chore to rearrange the records and their
relation to each other (usually, to permit greater flexibility of
analysis). To return to the
Census Data example, the County and City Data Book
provides information by census tract on housing density and median
income. Were you to analyze
those data, you might notice a negative relationship between the two (as
income increases, housing density declines). But if you were to look at census
block data, for some areas of some cities (say, Chicago’s Gold Coast) the opposite
might be true. But you would
need to be able to restructure the unit of analysis to discover this. GIS systems, by the way, are particularly
good at these kinds of restructuring.
- Sophisticated
Analysis: In some cases
(like econometric analysis), the data are
transformed by sophisticated mathematical modeling to reveal relationships
which are not easily teased out from the background noise and multiple
chains of causality. Such
models usually require very large amounts of very carefully measured
data. Economic Base analysis
and employment forecasting, for example, require very large matrices of
economic transaction data.
- Process
Control: Finally, the
ultimate in database refinement, a stream of data may be gathered,
manipulated, and analyzed in real time to provide feedback to modify the
behavior of a system. For example, a water treatment plant may gather data from a series of monitors to continuously adjust the flow in the system so it operates at peak capacity. (A toy sketch of such a feedback loop follows this list.)
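As a toy illustration of that last task, here is a minimal feedback loop in Python. The sensor, setpoint, and gain are hypothetical stand-ins, not a model of any real treatment plant; the point is only that data are gathered and fed back to adjust the system while it runs.

```python
def read_flow(valve, t):
    """Stand-in for a sensor: flow responds to the valve, plus a disturbance."""
    disturbance = 10 if t % 3 == 0 else -5
    return 100 * valve + disturbance

setpoint = 100.0   # the flow the plant should hold
valve = 1.0        # current valve position (1.0 = nominal)
gain = 0.001       # how aggressively to correct each step

for t in range(6):
    flow = read_flow(valve, t)          # gather the data...
    valve += gain * (setpoint - flow)   # ...and feed it back as an adjustment
    print(f"t={t}: flow={flow:.1f}, valve={valve:.4f}")
```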
In summary, there are several questions you should ask as
you begin to design a database:
· What are the questions you want the database to be able to answer?
· Which questions require documents? Which require data?
· What kind of logical access is required?
· What kind of physical access is required?
Objective: Data Independence
The point of all of this is to organize your data in such a way that they are as flexible as possible—in the trade, it’s called
“data independence.”
But there are many dimensions to independence, and in real life there
are often tradeoffs which you will have to consider:
- Physical
Independence: Some of the sections of this course
were, initially, written on an electric typewriter (yes, I’m that
old). Others were first written on a Macintosh II. This was an improvement—the text became independent of the paper on which it was written. But when I later shifted to a DOS machine, I found that I could not (at least, not easily or practically) transfer the digital code from Mac to DOS (or back again). With the development of HTML, we had software that read the same (well, sort of) on any platform. But my HTML code stored on 5¼” floppies can’t be read on today’s 3½” disk drives. Hardware and software change, and (at least so far) the older versions eventually become unretrievable. As much as possible, you want to find a way to store your data so retrieval will not be tied to a specific technology (which, despite all its other drawbacks, is one of the remaining advantages of the book).
- Logical
Independence: When I was in High School, they
taught us to write term papers by copying our notes onto index cards (do
they still use such antique technology?). The advantage was that you could
sort and reorganize your cards many times before you came up with the
final structure of your paper.
Each card was “logically” independent of the
others. Similarly, you want to
design your database so that information can be mixed and matched in as
many ways as possible—and so new information can be added in as
needed. Some databases are very flexible with the data that were originally specified, but make it almost impossible to add new data elements.
- Data
Sharing: Often, a single
database will serve multiple users (with different needs for access and
output):
- Conceptual
Model: This is what the
end user is interested in. At
this level, data are identified by “variable name” (like
“Housing Units” or “Median Income”).
- Logical
Model: This is what the
applications staff (the “techies” on the
staff) are interested in.
They are less interested in what the data are saying than
they are in how the data are structured. They are particularly interested
in record layouts and data element relationships. Their question is whether the
logic of the flow of data will provide the results the user wants.
- Physical
Model: This is what the
programmers who designed the original application software are interested
in. They don’t care
what you call the variables, and how you lay out your records is your problem. They are concerned with writing
computer code that creates the logical relationships that the applications staff need to use.
It gets interesting when all three
groups are using the same system simultaneously in the development phase of a
project! Over time, the physical
model fades more to the background, but the logical and the conceptual models
are always in some degree of tension in any management information system.
Data Dictionary
One of the ways to keep all this information straight is to
write a “data dictionary,” a description of all the relevant
characteristics of the data you have gathered. If you look in the front of a printed
Census publication, you will find extensive notation about the source and
quality of the data used to generate the tables. This is a data dictionary. But you do not need to be doing anything as complicated as the Census to need a data dictionary. As you will discover (if you haven’t already), even data that you knew intimately a year ago will seem strange and awkward when you come back to them after a year’s absence (for
example, what is in that file you saved a year ago labeled
“x23ft.xls”?).
- Contents: A data dictionary may contain three
levels of information—
- Conceptual
level: What are the
“entities” (concepts), their attributes (measures), and their relations (including the sources from which they were collected and the uses to which they might be put)?
- Logical
level: How are the data
elements (attributes, at the conceptual level) grouped? The most common techniques are trees (hierarchical branching), networks (multiple-level branching), and tables (fixed matrix). The data elements should also be cross-referenced to the applications programs that use them.
- Physical
level: This is a
description of each element on each record (think of this as an index
card with one or more variables on it—or one line in a file). It provides the information a programmer would need—length (maximum number of characters permitted), precision (think of it as the number of digits stored after the decimal point), storage medium (disk? hard drive? server? etc.), and security (who has permission to read it? write it? delete it?). (A small sketch of a data dictionary entry appears at the end of this section.)
- Process: There is a simple process for
building a data dictionary--
- Identify
the entities and their applications
- Define
the entities, their attributes, and their relations to each other
- Build
the data dictionary (conceptual, logical, and/or physical)
- Review
& evaluate with end users
- Issues: There are a number of issues that
will arise in the process of developing a data dictionary—
- Choice
of entities: What rules
should be developed for deciding which entities belong in the database
and which do not? This is the
opportunity to address criticisms of the current system.
- Relations
among data: To what
extent should current data relationships be carried into a new
system? To what extent should
the process be left free to define relationships, even if it means
ignoring current usage?
- Classification
of data: Sometimes
natural categories will arise in the process of design which had not been
apparent previously, or new substantive applications may suggest
themselves.
- Context: The best information system,
judged by some “logical” criterion, will not be a good design
if it does not fit into the context in which it will be used. This gets to the
“soft” side of database design—“user
friendliness” (or, “user comfort”).
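By way of illustration, here is a minimal sketch of what a single data dictionary entry might look like, using Python as the notation. The element, its attributes, and the permission roles are all hypothetical; the point is that the conceptual, logical, and physical facts about each element live together in one documented place.

```python
data_dictionary = {
    "median_income": {
        # Conceptual level: what the element means and where it came from
        "entity": "census_tract",
        "definition": "Median household income (dollars)",
        "source": "decennial census",
        # Logical level: how it is grouped and which applications use it
        "grouping": "table",
        "used_by": ["housing_report"],
        # Physical level: what a programmer would need to know
        "length": 9,        # maximum number of characters
        "precision": 0,     # digits stored after the decimal point
        "medium": "server",
        "permissions": {"read": "all", "write": "clerical", "delete": "admin"},
    },
}

def describe(name):
    """Look up an element so the mystery file never stays a mystery."""
    entry = data_dictionary.get(name)
    if entry is None:
        return f"{name}: not documented -- add it before using it!"
    return f"{name}: {entry['definition']} (source: {entry['source']})"

print(describe("median_income"))
print(describe("x23ft"))   # the year-old file with the cryptic name
```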
Normalization
The heart of a Conceptual level analysis is a process called
“normalization.” In
this case, “normal” is not used in the sense of “usual”
but in the root meaning of “norm” (which was the Latin name for a
carpenter’s square)—a pattern or a model. By the way, this is also the meaning
intended in the old term, “Normal School,” which was used for
institutions that later were called Teachers’ Colleges (Minnesota
State University was once “Mankato
Normal School,” from whence came
the school cheer, “Mankato for the normal!”). Anyway, normalization is a process for
organizing elements in a database to achieve maximum efficiency.
- Terms: Normalization is a technical
process, and uses three terms in very specific ways:
- Attribute: characteristic of an entity; basic
element of analysis
- Entity: person, object or concept
represented in a database
- Relationship: combination of entities. May be 1-1, 1-many (1-N), or
N-N. One-to-one means one
entity is mapped onto one other entity (e.g., name and social security
number). One-to-Many means
one entity is mapped onto two or more others (e.g., name to gender and
income). Many-to-Many means
multiple entities are related to multiple others (e.g., name and address
mapped onto occupation and number of children).
- Normalization
- Non-normal
data: simple list of all the
elements to be used in all the applications of the database.
- First
Normal Form: eliminate any duplicate attributes
(e.g., “name” is commonly gathered from several sources)
- Second
Normal Form: define the primary keys and
group attributes into entities (e.g., “person” may be all the
personal attributes grouped around social security number—so even
“name” is secondary, and “residence” may be all
the real estate attributes grouped around parcel ID number, etc.)
- Third
Normal Form: determine any secondary
keys from dependencies within entities which are independent
of the primary key (e.g., within the person, there may be groupings for
education, income, and occupation—or maybe those three will be
attributes grouped around a secondary key of socio-economic status. It depends on what you will be doing with the data.) The sketch at the end of this section walks these steps in code.
- Reasons
for normalization: Normalization
is a lot of work, and can sometimes come to resemble scholastic debates
about angels and pinheads. Why
do it?
- Minimize
redundancy: Storage will be more efficient if data are entered once and only once.
- Minimize
update anomalies: If the same
information resides in two locations (think “Phone number”),
it is possible (and therefore likely) that sometimes only one of the
locations will be updated.
Then how will you know which one is correct?
- Minimize
insertion and deletion anomalies:
Every time a datum is entered, there is the opportunity for error
(if only from mistyping). The
fewer opportunities for entry, the fewer opportunities for introducing
error and the higher the quality of the data.
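Here is a minimal sketch of those steps in Python, with hypothetical attributes and keys (a social security number for the person, a parcel ID for the residence). It walks a non-normal record into first and second normal form and then shows the update anomaly that normalization avoids.

```python
# Non-normal: one flat list of elements, with "name" captured twice
# (from two sources) and person facts mixed with residence facts.
raw = {
    "name": "A. Jones", "name_from_deed": "A. Jones",
    "ssn": "000-00-0000", "income": 42000,
    "parcel_id": "P-123", "street_address": "12 Elm St", "zip": "56001",
}

# First normal form: eliminate the duplicate attribute.
first_nf = {k: v for k, v in raw.items() if k != "name_from_deed"}

# Second normal form: group attributes into entities around primary keys.
persons = {      # primary key: ssn
    "000-00-0000": {"name": "A. Jones", "income": 42000, "parcel_id": "P-123"},
}
residences = {   # primary key: parcel_id
    "P-123": {"street_address": "12 Elm St", "zip": "56001"},
}
# (Third normal form would go on to pull dependent attributes, such as
# education, income, and occupation, out around a secondary key.)

# The update anomaly avoided: the address now lives in exactly one place,
# so there can never be two disagreeing copies of it.
residences["P-123"]["street_address"] = "14 Elm St"
person = persons["000-00-0000"]
print(person["name"], "lives at",
      residences[person["parcel_id"]]["street_address"])
```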
Data Structures
The heart of Logical level analysis is the process of
describing the data structures.
There are essentially three:
- Relational Database (Table): A relational database arranges attributes across the columns and observations down the rows of a simple matrix (a single such table is sometimes called a “flat file”). Each row, however, is a complete unit—a new observation must be generated to change even one attribute in the row. For example, to update the address one would create an entirely new observation with the new address. If one wished to retain the old address (say, for archival purposes), then a “date entered” field could be one of the attributes, creating two unique records with essentially the same information except for the address and date-entered attributes. (The sketch after this list shows this convention in code.)
- Hierarchical
Database (Tree): A
hierarchical database can be thought of as an upside-down tree made up of
nodes and branches. The
uppermost node is called the “root” (a “root-only”
tree is a table). All logical
connections “flow” up the branches to the root and then (if
needed) back down the branches.
Access to the lower branches of the tree is through the higher
levels, by way of the primary and secondary keys. For example, if
“person” were the root and social security number were the
key, one could get to the person’s address by including social
security number in the “residence” entity as well as in the
“person” entity (such a structure would permit more than one
person to reside at the same address).
- Network: A network is also a hierarchical
structure, but it allows links to be made directly between secondary
keys. The logic no longer
flows only through the root, but can flow directly through the branches. For example, if the
“residence” entity included “zip code” and a
“business” entity included zip code, one could find all the
restaurants in a neighborhood without having to know who lives in that
neighborhood.
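Here is a minimal sketch of two of these structures in Python, with hypothetical field names: first the flat-table archival convention just described (a “date entered” attribute turns an address change into a new row), then the network-style link through a shared zip code.

```python
from datetime import date

# Flat table: an address change appends a new observation rather than
# overwriting the old one.
table = [
    {"ssn": "000-00-0000", "address": "12 Elm St",
     "date_entered": date(1995, 3, 1)},
]

def update_address(table, ssn, new_address):
    """Append an entirely new row; the old row survives for the archive."""
    current = max((r for r in table if r["ssn"] == ssn),
                  key=lambda r: r["date_entered"])
    table.append(dict(current, address=new_address, date_entered=date.today()))

update_address(table, "000-00-0000", "14 Elm St")
print(len(table))   # 2 rows: the old address and the new one

# Network-style link: "residence" and "business" both carry a zip code,
# so one can go from neighborhood to restaurants without routing the
# query through the "person" root.
residences = {"P-123": {"zip": "56001"}}
businesses = [{"name": "Elm St Diner", "type": "restaurant", "zip": "56001"}]

def restaurants_in(zip_code):
    return [b["name"] for b in businesses
            if b["type"] == "restaurant" and b["zip"] == zip_code]

print(restaurants_in("56001"))   # ['Elm St Diner']
```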
Report Formats
Finally, you need to give some thought to the way
information will be reported from the database:
· Filtering: If a database is going to be shared (and
why else would you be going through all this trouble?), it is generally wise to
decide beforehand who has a need to know what, and to limit access on that
basis. Sometimes the database will
be public, and everyone in the world is welcome to look at it (the Census
databases are like that). Some
databases are extremely private and hardly anyone should be permitted access
(personnel files and medical records are like that). Some databases include a mixture of
data, such that users might have access to one part of the database but not to
others. The lowest level of access
is “read only.” The
user is permitted to see whatever is in the database, but may not modify any of
it. At a higher level of access, a
user will have permission to modify the database (“write
access”). Finally, the
highest level of access allows the user to control who else has access to the
database (“administrator access”). By the way, higher levels of access do
not necessarily mean higher “status” in the
organization—clerical workers regularly need write access, and the
section chief often needs only read access (and the administrator is often not in the section at all, but is over in the Information Technology section). A short sketch of these three access levels follows this list.
· Key Variables: For any database, there will be a small
subset of variables that will be of central interest and most commonly used in
reporting. These may or may not be
the “primary key” or “secondary key” variables of the
database (e.g., social security number may be a primary key variable, but the
reports will probably use the person’s name). Generally, reports will be generated in
groups, based on the key variables and their dimensions. The selection of the key variables
should be determined by the end-users of the reports—no matter how much
sense it makes to the database designer, if it doesn’t make sense to the
end user it won’t get used.
· Uses of the Reports: There are four broad
categories of reporting systems, depending on the most common use of the
database:
o Monitoring: These reports will be designed to
capture a snapshot in time, and perhaps to present it against previous
snapshots. Census reports are an
excellent example of this approach.
These reports tend to provide a lot of data and take a lot of pages to
do it.
o Modeling: These reports are designed to capture a
current snapshot and predict or project future (or alternative) events. A
population projection is a good example of this approach. These reports tend to capture a lot of
data in a few pages.
o Interrogative: This is the “customized”
approach to reporting. Rather than
pre-formatting the report, the reporting system is designed to allow the user
to query the database directly in a unique request. These reports usually hold a lot of data
in waiting, but use only a small part of it and produce a simple response.
o Strategic Decision System: This is sometimes called an “Information Center.” It combines all three of the above
approaches into a single entity. In
a sense, a library is a strategic decision system. While often discussed in the literature,
they are less commonly found in the real world—they require constant
staffing and continual database updating.
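As a minimal sketch of the filtering idea, here is how the three access levels might be checked in Python. The role names and the permission table are hypothetical; the point is that access is ranked, and that rank does not follow organizational status.

```python
# Each access level implies all the levels below it.
LEVELS = {"read": 1, "write": 2, "admin": 3}

users = {
    "clerk":         "write",   # clerical workers regularly need write access
    "section_chief": "read",    # the chief often needs only read access
    "it_staff":      "admin",   # the administrator may sit in IT, not the section
}

def allowed(user, action):
    """An action is allowed if the user's level meets or exceeds it."""
    level = users.get(user)
    return level is not None and LEVELS[level] >= LEVELS[action]

print(allowed("clerk", "write"))           # True
print(allowed("section_chief", "write"))   # False: read only
print(allowed("it_staff", "admin"))        # True
```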
© 1996 A.J.Filipovitch
Revised 11 March 2005