Wednesday, August 30, 2006

Introduction to Database Normalization


Database normalization can essentially be defined as the practice of optimizing table structures. Optimization is accomplished as a result of a thorough investigation of the various pieces of data that will be stored within the database, in particular concentrating upon how this data is interrelated. An analysis of this data and its corresponding relationships is advantageous because it can result both in a substantial improvement in the speed in which the tables are queried, and in decreasing the chance that the database integrity could be compromised due to tedious maintenance procedures.

Database normalization provides table optimization through the investigation of entity relationships. But why is this necessary? In this section, I’ll elaborate a bit upon why normalization is necessary when creating commercial database applications.
Essentially, table optimization is accomplished through the elimination of all instances of data redundancy and unforeseen scaleability issues.

Redundancy

Data redundancy is exactly what you think it is; the repetition of data. One obvious drawback of data repetition is that it consumes more space and resources than is necessary. Consider the following table:

Table 1-1: Poorly defined table

student_id

class_name

time

location

professor_id

999-40-9876

Math 148

MWF 11:30

Rm. 432

prof145

999-43-0987

Physics 113

TR 1:30

Rm. 12

prof143

999-42-9842

Botany 42

F 12:45

Rm. 9

prof167

999-41-9832

Matj 148

MWF 11:30

Rm. 432

prof145


Basically this table is a mapping of various students to the classes found within their schedule. Seems logical enough, right? Actually, there are some serious issues with the choice to store data in this format. First of all, assuming that the only intention of this table is to create student-class mappings, then there really is no need to repeatedly store the class time and professor ID. Just think that if there are 30 students to a class, then the class information would be repeated 30 times over!

Moreover, redundancy introduces the possibility for error. You might have noticed the name of the class found in the final row in the table (Matj 148). Given the name of the class found in the first row, chances are that Matj 148 should actually be Math 148! While this error is easily identifiable when just four rows are present in the table, imagine finding this error within the rows representing the 60,000 enrolled students at my alma mater, The Ohio State University. Chances that you’ll find these errors are unlikely, at best. And the cost of even attempting to find them will always be high.

Unforeseen Scaleability Issues

Unforeseen scaleability issues generally arise due to lack of forethought pertaining to just how large a database might grow. Of course, as a database grows in size, initial design decisions will continue to play a greater role in the speed of and resources allocated to this database. For example, it is typically a very bad idea to limit the potential for expansion of the information that is to be held within the db, even if there are currently no plans to expand. For example, structurally limiting the database to allot space for only three classes per student could prove deadly if next year the school board decides to permit all students to schedule three classes. This also works in the opposite direction; What if the school board subsequently decides to only allow students to schedule two classes? Have you allowed for adequate flexibility in the design so as to easily adapt to these new policies?

The remedy to these problems is through the use of a process known as database normalization. A subject of continued research and debate over the years, several general rules have been formulated that layout the process one should follow in the quest to normalize a database.

Find more

Tags: , , , , , , , , , ,

0 Comments:

Post a Comment

<< Home