Database Design: Managing Information
So, you have a pile of data and you want to bring some order
to it. How do you think your way
through this problem?
Concepts & Initial Issues
To begin with, there are several important distinctions to
keep in mind when you are managing information:
- Storage
vs. Access: Do not confuse
storing data with having access to the data (and, usually, it is access
in which you are most interested).
For example, you might have (stored in a dusty basement somewhere)
50 boxes of records from 1950 to 1970, in no particular order or
sequence. The data are stored,
but they are not particularly accessible. Similarly, you might have a
database with 1500 different data elements, many of which are very similar
in name but not in meaning (like “address” and
“residential address” and “street
address”—the first might include business addresses, the
second might include post office addresses, and the third might include
business and residential addresses but not post office addresses) and
100,000 records (say, 10,000 transactions a year for the last 10
years). It is possible
to navigate your way through such a tangle to find everyone who lived in
the area in 1995, but unless you have thought things through very
carefully beforehand it might not be easy. Generally speaking, storage is easy; access is difficult.
- Logical
vs. Physical Access: Again,
being able to put your hands on the data is not the same as being able to
manipulate them. To use the
previous example, you can gain physical access to the data stored in the
dusty basement but there will be no logical search pattern you could use
to find a specific record.
Logical access provides a way to efficiently search the
database for exactly the records you want. Given the speed of computers,
sometimes it is not worth the effort to craft a maximally efficient
logical structure—a spreadsheet program can easily sort through a database of 50 elements and 200 records by brute force (a “sequential search,” as it is called in the trade). But if you are using a database like our other example—1500 data elements for each of 100,000 records—it could take all day (or, more likely, all night) to search the database that way. (A short sketch contrasting brute-force search with an indexed lookup follows this list.)
Imagine the problems they have at, say, the Census Bureau or the
IRS!
- Data
vs. Document Access: Data
are the information stored on documents. Usually, you don’t need the
document, just the information it represents. There are exceptions, of
course—the Courts (for example) often want to see the document, not
your notation of what was on it.
The problem, of course, is that you have to have figured out in advance which parts of the document were worth keeping.
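To make the storage-versus-access distinction concrete, here is a minimal sketch in Python (all field names and records are hypothetical). It contrasts a brute-force scan, which must touch every record, with a simple index that provides a logical access path straight to the records you want.

```python
# A tiny stand-in for a database: a list of records.
records = [
    {"id": 1, "name": "Ada", "year": 1995},
    {"id": 2, "name": "Ben", "year": 1997},
    # ...imagine 100,000 of these
]

def scan(records, year):
    """Brute force: touch every record to find the matches."""
    return [r for r in records if r["year"] == year]

def build_index(records, key):
    """One pass to build a logical access path (an index on one attribute)."""
    index = {}
    for r in records:
        index.setdefault(r[key], []).append(r)
    return index

year_index = build_index(records, "year")
print(scan(records, 1995))        # O(n) work, every single time
print(year_index.get(1995, []))   # near-instant, once the index is built
```

With 200 records the scan is fine; with 100,000 records of 1500 elements each, the index pays for itself many times over.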
Having decided whether you really want logical access
to data (rather than physical storage of documents), you still have to make
some decisions about how you intend to use the data. The more “refined” the task,
the more refined the data (and access) must be:
- Record-keeping: This is the simplest, least
complicated task. It is also
the basis of every other task to follow. There are even times when this is all
that you need or want. For
example, property records (at least in the days before the County Courthouses started to computerize
them) were primarily designed to provide a registry of transactions. There was no particular intention
to use the information, but if one needed to get to it one could (although not easily, as anyone who has done a title search can attest).
- Calculating/Printing: At the next level of complexity,
one might now want to make some simple calculations based on the records
and maybe print out the results.
An accountant’s ledgers fit this model. There is some summary of the
records along with storage.
- Record
Searching: It is still
more difficult to sort through a large number of records, either to
categorize the records by grouping or to identify a single record. You are probably familiar with this
kind of activity from looking at tables of census data—millions of
records are searched and sorted to generate the tables for which planners
and other analysts are so grateful.
- Record
Restructuring: It can be a
fairly sophisticated analytical chore to rearrange the records and their
relation to each other (usually, to permit greater flexibility of
analysis). To return to the
Census Data example, the County and City Data Book
provides information by census tract on housing density and median
income. Were you to analyze
those data, you might notice a negative relationship between the two (as
income increases, housing density declines). But if you were to look at census
block data, for some areas of some cities (say, Chicago’s Gold Coast) the opposite
might be true. But you would
need to be able to restructure the unit of analysis to discover this. GIS systems, by the way, are particularly
good at these kinds of restructuring.
- Sophisticated
Analysis: In some cases
(like econometric analysis), the data are
transformed by sophisticated mathematical modeling to reveal relationships
which are not easily teased out from the background noise and multiple
chains of causality. Such
models usually require very large amounts of very carefully measured
data. Economic Base analysis
and employment forecasting, for example, require very large matrices of
economic transaction data.
- Process
Control: Finally, the
ultimate in database refinement, a stream of data may be gathered,
manipulated, and analyzed in real time to provide feedback to modify the
behavior of a system. For example, a water treatment plant may gather data from a series of monitors to continuously adjust the flow in the system so it operates at peak capacity. (A toy sketch of such a feedback loop follows this list.)
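As a toy illustration of that last task, here is a minimal feedback loop in Python. The sensor, setpoint, and gain are hypothetical stand-ins, not a model of any real treatment plant; the point is only that data are gathered and fed back to adjust the system while it runs.

```python
def read_flow(valve, t):
    """Stand-in for a sensor: flow responds to the valve, plus a disturbance."""
    disturbance = 10 if t % 3 == 0 else -5
    return 100 * valve + disturbance

setpoint = 100.0   # the flow the plant should hold
valve = 1.0        # current valve position (1.0 = nominal)
gain = 0.001       # how aggressively to correct each step

for t in range(6):
    flow = read_flow(valve, t)          # gather the data...
    valve += gain * (setpoint - flow)   # ...and feed it back as an adjustment
    print(f"t={t}: flow={flow:.1f}, valve={valve:.4f}")
```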
In summary, there are several questions you should ask as
you begin to design a database:
· What are the questions you want the database to be able to answer?
· Which questions require documents? Which require data?
· What kind of logical access is required?
· What kind of physical access is required?
Objective: Data Independence
The point of all of this is to organize your data in such a way that they are as flexible as possible—in the trade, it’s called
“data independence.”
But there are many dimensions to independence, and in real life there
are often tradeoffs which you will have to consider:
- Physical
Independence: Some of the sections of this course
were, initially, written on an electric typewriter (yes, I’m that
old). Others were first written on a Macintosh II. This was an improvement—the text became independent of the paper on which it was written. But when I later shifted to a DOS machine, I found that I could not (at least, not easily or practically) transfer the digital code from Mac to DOS (or back again). With the development of HTML, we had software that read the same (well, sort of) on any platform. But my HTML code stored on 5¼” floppies can’t be read on today’s 3½” disk drives. Hardware and software change, and (at least so far) the older versions eventually become unretrievable. As much as possible, you want to find a way to store your data so retrieval will not be tied to a specific technology (which, despite all its other drawbacks, is one of the remaining advantages of the book).
- Logical
Independence: When I was in High School, they
taught us to write term papers by copying our notes onto index cards (do
they still use such antique technology?). The advantage was that you could
sort and reorganize your cards many times before you came up with the
final structure of your paper.
Each card was “logically” independent of the
others. Similarly, you want to
design your database so that information can be mixed and matched in as
many ways as possible—and so new information can be added in as
needed. Some databases are very flexible with the data that were originally specified, but make it almost impossible to add new data elements.
- Data
Sharing: Often, a single
database will serve multiple users (with different needs for access and
output):
- Conceptual
Model: This is what the
end user is interested in. At
this level, data are identified by “variable name” (like
“Housing Units” or “Median Income”).
- Logical
Model: This is what the
applications staff (the “techies” on the
staff) are interested in.
They are less interested in what the data are saying than
they are in how the data are structured. They are particularly interested
in record layouts and data element relationships. Their question is whether the
logic of the flow of data will provide the results the user wants.
- Physical
Model: This is what the
programmers who designed the original application software are interested
in. They don’t care
what you call the variables, and how you lay out your records is your problem. They are concerned with writing
computer code that creates the logical relationships that the applications staff need to use.
It gets interesting when all three
groups are using the same system simultaneously in the development phase of a
project! Over time, the physical
model fades more to the background, but the logical and the conceptual models
are always in some degree of tension in any management information system.
Data Dictionary
One of the ways to keep all this information straight is to
write a “data dictionary,” a description of all the relevant
characteristics of the data you have gathered. If you look in the front of a printed
Census publication, you will find extensive notation about the source and
quality of the data used to generate the tables. This is a data dictionary. But you do not need to be doing anything as complicated as the Census to need a data dictionary. As you will discover (if you haven’t already), even data that you knew intimately a year ago will seem strange and awkward when you come back to them after a year’s absence (for
example, what is in that file you saved a year ago labeled
“x23ft.xls”?).
- Contents: A data dictionary may contain three
levels of information—
- Conceptual
level: What are the
“entities” (concepts), their attributes (measures), and their relations (including the sources from which they were collected and the uses to which they might be put)?
- Logical
level: How are the data
elements (attributes, at the conceptual level) grouped? The most common techniques are trees (hierarchical branching), networks (multiple-level branching), and tables (fixed matrix). The data elements should also be cross-referenced to the applications programs that use them.
- Physical
level: This is a
description of each element on each record (think of this as an index
card with one or more variables on it—or one line in a file). It provides the information a programmer would need—length (maximum number of characters permitted), precision (think of it as the number of digits stored after the decimal point), storage medium (disk? hard drive? server? etc.), and security (who has permission to read it? write it? delete it?). (A small sketch of a data dictionary entry appears at the end of this section.)
- Process: There is a simple process for
building a data dictionary--
- Identify
the entities and their applications
- Define
the entities, their attributes, and their relations to each other
- Build
the data dictionary (conceptual, logical, and/or physical)
- Review
& evaluate with end users
- Issues: There are a number of issues that
will arise in the process of developing a data dictionary—
- Choice
of entities: What rules
should be developed for deciding which entities belong in the database
and which do not? This is the
opportunity to address criticisms of the current system.
- Relations
among data: To what
extent should current data relationships be carried into a new
system? To what extent should
the process be left free to define relationships, even if it means
ignoring current usage?
- Classification
of data: Sometimes
natural categories will arise in the process of design which had not been
apparent previously, or new substantive applications may suggest
themselves.
- Context: The best information system,
judged by some “logical” criterion, will not be a good design
if it does not fit into the context in which it will be used. This gets to the
“soft” side of database design—“user
friendliness” (or, “user comfort”).
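By way of illustration, here is a minimal sketch of what a single data dictionary entry might look like, using Python as the notation. The element, its attributes, and the permission roles are all hypothetical; the point is that the conceptual, logical, and physical facts about each element live together in one documented place.

```python
data_dictionary = {
    "median_income": {
        # Conceptual level: what the element means and where it came from
        "entity": "census_tract",
        "definition": "Median household income (dollars)",
        "source": "decennial census",
        # Logical level: how it is grouped and which applications use it
        "grouping": "table",
        "used_by": ["housing_report"],
        # Physical level: what a programmer would need to know
        "length": 9,        # maximum number of characters
        "precision": 0,     # digits stored after the decimal point
        "medium": "server",
        "permissions": {"read": "all", "write": "clerical", "delete": "admin"},
    },
}

def describe(name):
    """Look up an element so the mystery file never stays a mystery."""
    entry = data_dictionary.get(name)
    if entry is None:
        return f"{name}: not documented -- add it before using it!"
    return f"{name}: {entry['definition']} (source: {entry['source']})"

print(describe("median_income"))
print(describe("x23ft"))   # the year-old file with the cryptic name
```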
Normalization
The heart of a Conceptual level analysis is a process called
“normalization.” In
this case, “normal” is not used in the sense of “usual”
but in the root meaning of “norm” (which was the Latin name for a
carpenter’s square)—a pattern or a model. By the way, this is also the meaning
intended in the old term, “Normal School,” which was used for
institutions that later were called Teachers’ Colleges (Minnesota
State University was once “Mankato
Normal School,” from whence came
the school cheer, “Mankato for the normal!”). Anyway, normalization is a process for
organizing elements in a database to achieve maximum efficiency.
- Terms: Normalization is a technical
process, and uses three terms in very specific ways:
- Attribute: characteristic of an entity; basic
element of analysis
- Entity: person, object or concept
represented in a database
- Relationship: combination of entities. May be 1-1, 1-many (1-N), or
N-N. One-to-one means one
entity is mapped onto one other entity (e.g., name and social security
number). One-to-Many means
one entity is mapped onto two or more others (e.g., name to gender and
income). Many-to-Many means
multiple entities are related to multiple others (e.g., name and address
mapped onto occupation and number of children).
- Normalization
- Non-normal
data: simple list of all the
elements to be used in all the applications of the database.
- First
Normal Form: eliminate any duplicate attributes
(e.g., “name” is commonly gathered from several sources)
- Second
Normal Form: define the primary keys and
group attributes into entities (e.g., “person” may be all the
personal attributes grouped around social security number—so even
“name” is secondary, and “residence” may be all
the real estate attributes grouped around parcel ID number, etc.)
- Third
Normal Form: determine any secondary
keys from dependencies within entities which are independent
of the primary key (e.g., within the person, there may be groupings for
education, income, and occupation—or maybe those three will be
attributes grouped around a secondary key of socio-economic status. It depends on what you will be doing with the data.) The sketch at the end of this section walks these steps in code.
- Reasons
for normalization: Normalization
is a lot of work, and can sometimes come to resemble scholastic debates
about angels and pinheads. Why
do it?
- Minimize
redundancy: Storage will be more efficient if data are entered once and only once.
- Minimize
update anomalies: If the same
information resides in two locations (think “Phone number”),
it is possible (and therefore likely) that sometimes only one of the
locations will be updated.
Then how will you know which one is correct?
- Minimize
insertion and deletion anomalies:
Every time a datum is entered, there is the opportunity for error
(if only from mistyping). The
fewer opportunities for entry, the fewer opportunities for introducing
error and the higher the quality of the data.
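Here is a minimal sketch of those steps in Python, with hypothetical attributes and keys (a social security number for the person, a parcel ID for the residence). It walks a non-normal record into first and second normal form and then shows the update anomaly that normalization avoids.

```python
# Non-normal: one flat list of elements, with "name" captured twice
# (from two sources) and person facts mixed with residence facts.
raw = {
    "name": "A. Jones", "name_from_deed": "A. Jones",
    "ssn": "000-00-0000", "income": 42000,
    "parcel_id": "P-123", "street_address": "12 Elm St", "zip": "56001",
}

# First normal form: eliminate the duplicate attribute.
first_nf = {k: v for k, v in raw.items() if k != "name_from_deed"}

# Second normal form: group attributes into entities around primary keys.
persons = {      # primary key: ssn
    "000-00-0000": {"name": "A. Jones", "income": 42000, "parcel_id": "P-123"},
}
residences = {   # primary key: parcel_id
    "P-123": {"street_address": "12 Elm St", "zip": "56001"},
}
# (Third normal form would go on to pull dependent attributes, such as
# education, income, and occupation, out around a secondary key.)

# The update anomaly avoided: the address now lives in exactly one place,
# so there can never be two disagreeing copies of it.
residences["P-123"]["street_address"] = "14 Elm St"
person = persons["000-00-0000"]
print(person["name"], "lives at",
      residences[person["parcel_id"]]["street_address"])
```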
Data Structures
The heart of Logical level analysis is the process of
describing the data structures.
There are essentially three:
- Relational Database (Table): A relational database arranges attributes across the columns and observations down the rows of a simple matrix (a single such table is sometimes called a “flat file”). Each row, however, is a complete unit—a new observation must be generated to change even one attribute in the row. For example, to update the address one would create an entirely new observation with the new address. If one wished to retain the old address (say, for archival purposes), then a “date entered” field could be one of the attributes, creating two unique records with essentially the same information except for the address and date-entered attributes. (The sketch after this list shows this convention in code.)
- Hierarchical
Database (Tree): A
hierarchical database can be thought of as an upside-down tree made up of
nodes and branches. The
uppermost node is called the “root” (a “root-only”
tree is a table). All logical
connections “flow” up the branches to the root and then (if
needed) back down the branches.
Access to the lower branches of the tree is through the higher
levels, by way of the primary and secondary keys. For example, if
“person” were the root and social security number were the
key, one could get to the person’s address by including social
security number in the “residence” entity as well as in the
“person” entity (such a structure would permit more than one
person to reside at the same address).
- Network: A network is also a hierarchical
structure, but it allows links to be made directly between secondary
keys. The logic no longer
flows only through the root, but can flow directly through the branches. For example, if the
“residence” entity included “zip code” and a
“business” entity included zip code, one could find all the
restaurants in a neighborhood without having to know who lives in that
neighborhood.
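Here is a minimal sketch of two of these structures in Python, with hypothetical field names: first the flat-table archival convention just described (a “date entered” attribute turns an address change into a new row), then the network-style link through a shared zip code.

```python
from datetime import date

# Flat table: an address change appends a new observation rather than
# overwriting the old one.
table = [
    {"ssn": "000-00-0000", "address": "12 Elm St",
     "date_entered": date(1995, 3, 1)},
]

def update_address(table, ssn, new_address):
    """Append an entirely new row; the old row survives for the archive."""
    current = max((r for r in table if r["ssn"] == ssn),
                  key=lambda r: r["date_entered"])
    table.append(dict(current, address=new_address, date_entered=date.today()))

update_address(table, "000-00-0000", "14 Elm St")
print(len(table))   # 2 rows: the old address and the new one

# Network-style link: "residence" and "business" both carry a zip code,
# so one can go from neighborhood to restaurants without routing the
# query through the "person" root.
residences = {"P-123": {"zip": "56001"}}
businesses = [{"name": "Elm St Diner", "type": "restaurant", "zip": "56001"}]

def restaurants_in(zip_code):
    return [b["name"] for b in businesses
            if b["type"] == "restaurant" and b["zip"] == zip_code]

print(restaurants_in("56001"))   # ['Elm St Diner']
```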
Report Formats
Finally, you need to give some thought to the way
information will be reported from the database:
· Filtering: If a database is going to be shared (and
why else would you be going through all this trouble?), it is generally wise to
decide beforehand who has a need to know what, and to limit access on that
basis. Sometimes the database will
be public, and everyone in the world is welcome to look at it (the Census
databases are like that). Some
databases are extremely private and hardly anyone should be permitted access
(personnel files and medical records are like that). Some databases include a mixture of
data, such that users might have access to one part of the database but not to
others. The lowest level of access
is “read only.” The
user is permitted to see whatever is in the database, but may not modify any of
it. At a higher level of access, a
user will have permission to modify the database (“write
access”). Finally, the
highest level of access allows the user to control who else has access to the
database (“administrator access”). By the way, higher levels of access do
not necessarily mean higher “status” in the
organization—clerical workers regularly need write access, and the
section chief often needs only read access (and the administrator is often not in the section at all, but is over in the Information Technology section). A short sketch of these three access levels follows this list.
· Key Variables: For any database, there will be a small
subset of variables that will be of central interest and most commonly used in
reporting. These may or may not be
the “primary key” or “secondary key” variables of the
database (e.g., social security number may be a primary key variable, but the
reports will probably use the person’s name). Generally, reports will be generated in
groups, based on the key variables and their dimensions. The selection of the key variables
should be determined by the end-users of the reports—no matter how much
sense it makes to the database designer, if it doesn’t make sense to the
end user it won’t get used.
· Uses of the Reports: There are four broad
categories of reporting systems, depending on the most common use of the
database:
o Monitoring: These reports will be designed to
capture a snapshot in time, and perhaps to present it against previous
snapshots. Census reports are an
excellent example of this approach.
These reports tend to provide a lot of data and take a lot of pages to
do it.
o Modeling: These reports are designed to capture a
current snapshot and predict or project future (or alternative) events. A
population projection is a good example of this approach. These reports tend to capture a lot of
data in a few pages.
o Interrogative: This is the “customized”
approach to reporting. Rather than
pre-formatting the report, the reporting system is designed to allow the user
to query the database directly in a unique request. These reports usually hold a lot of data
in waiting, but use only a small part of it and produce a simple response.
o Strategic Decision System: This is sometimes called an “Information Center.” It combines all three of the above
approaches into a single entity. In
a sense, a library is a strategic decision system. While often discussed in the literature,
they are less commonly found in the real world—they require constant
staffing and continual database updating.
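As a minimal sketch of the filtering idea, here is how the three access levels might be checked in Python. The role names and the permission table are hypothetical; the point is that access is ranked, and that rank does not follow organizational status.

```python
# Each access level implies all the levels below it.
LEVELS = {"read": 1, "write": 2, "admin": 3}

users = {
    "clerk":         "write",   # clerical workers regularly need write access
    "section_chief": "read",    # the chief often needs only read access
    "it_staff":      "admin",   # the administrator may sit in IT, not the section
}

def allowed(user, action):
    """An action is allowed if the user's level meets or exceeds it."""
    level = users.get(user)
    return level is not None and LEVELS[level] >= LEVELS[action]

print(allowed("clerk", "write"))           # True
print(allowed("section_chief", "write"))   # False: read only
print(allowed("it_staff", "admin"))        # True
```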
© 1996 A.J.Filipovitch
Revised 11 March 2005