RDBMS Normalization Example

The process of RDBMS normalization is an exercise in logic - from a managerial data access and usage perspective. There is nothing complicated or mysterious. It requires only a little thought and examination of how data are most logically used. One good point to remember is that a RDBMS works exactly as you would if you had all of your data stored on 3 X 5 index cards. All you have to do is tell the computer how to lay them out on the table and how to pick them up in a manner that produces the desired data.

First, we have to consider functional dependency. It addresses the concept that certain data fields are dependent upon other data fields in order to uniquely define and access them. Consider, for example, the following situation:

Data on students' last names are stored in a data file. There may be (very probably will be) several students with the last name "Smith." If we want a computer program to retrieve information about student "Smith," we must have some way to specify which specific "Smith" we desire. This is done through the specification of a uniquely assigned student number. John Smith may have student number 11223, while Steve Smith may have 14322, and Tom Smith may have 33215. By specifying the correct student number, we are able to retrieve information about the desired student named "Smith." Retrieving information about the specific "Smith" depends upon the specification of the correct student number. For this reason we say that the last name is functionally dependent upon the student number.

Why not, you may suggest, specify both the last name and the first name. Won't this eliminate any ambiguity when searching for students named "Smith?" The problem is not eliminated because it is possible (actually quite probable) that two or more students have the same last and first names, causing problems when searching for specific information. If we have two students whose names are John Smith, we run into the same problem when searching for specific information. We can see that both the student last name and first name are functionally dependent upon the student number.

The value of an understanding of functional dependency will become apparent as we work through the normalization process.

Now let's turn our attention to a "representative" situation for which database normalization is required. Think, for example, of the process of getting you enrolled in classes. There is quite a bit of information that must be entered into the RDBMS. The normalization process brings logical order and structure to the information gathering process.

When thinking about the logical normalization process we first look at all of the data required to accomplish a task. Consider the following (part of a) report:

Class Enrollment
Class CodeClass DescriptionStudent NumberName
MGT503Mgt Info Systems00001Mast, Rick
00003Earnhardt, Dale
00005Labonte, Terry
MGT540Quant Methods00002Wallace, Rusty
00003Earnhardt, Dale
00004Marlin, Sterling

What is called "repeating groups" exists within this data. Each class code can have any number of students in it, so the students' information constitutes what is called a repeating group. Data cannot be stored or processed in a database when it is in this form. What we must have is one record containing all the data for each student who is enrolled in a class. There can be no "gaps" in the data when stored in a file. The following table (data file) illustrates the data in First Normal Form (1NF)

Class Enrollment
Class CodeClass DescriptionStudent NumberName
MGT503Mgt Info Systems00001Mast, Rick
MGT503Mgt Info Systems00003Earnhardt, Dale
MGT503Mgt Info Systems00005Labonte, Terry
MGT540Quant Methods00002Wallace, Rusty
MGT540Quant Methods00003Earnhardt, Dale
MGT540Quant Methods00004Marlin, Sterling

Converting to 1NF basically requires that we "flatten" the report above so that each row (tuple, record) contains no repeating groups. No more than one entry per field can be entered and no "gaps" exist in the data.

Now consider the following object (data file), named ENROLL, that contains the data fields (attributes) required to enroll you in a class. (note: this object contains all data fields whereas the above examples illustrated only the first four fields)

ENROLL(Class_Code, Class_Description, Student_Number, Student_Name, Address, City, State, Zip, Major_Code, Major_Description, Course_Grade, Class_Start_Date, Class_Start_Time, Class_Room_Number, Building_Number, Building_Name, Building_Address, Professor_Number, Professor_Name)

An object is said to be in 1NF if there are no repeating groups of attributes (fields).

This object is said to be in First Normal Form (1NF) if it is in the format illustrated above with no "gaps" or repeating groups. It is simply a collection of data fields necessary to complete the job of enrolling you in class, with each record in the file containing all data necessary for the enrollment.

The problem with 1NF is that there is redundancy with respect to entering all of the data into a computer for each and every class in which you enroll. For example, your name, address, etc., will have to be entered for each class that you take. If you take four classes, your name will have to be entered four times - not a comforting thought for the data entry person, not to mention the opportunities to incorrectly enter it. Developing a logical method of eliminating the entry of your name four times leads us to the definition of what is called Second Normal Form (2NF).

We must next introduce the concept of a "KEY" field. A key field is one (or more logically joined) field(s) that is used to uniquely identify each record in a data file. For example, the Student_Number field can be used to uniquely identify each student's record in a student data file. However, since one student may be enrolled in more than one class each quarter, the Student_Number field alone is not sufficient to uniquely identify each record in the ENROLL file illustrated above. The combination of the Student_Number field and the Class_Code field forms a unique combination and can therefore be considered as the key field for the ENROLL file. We usually indicate the key fields in the object descriptions by underlining the field name(s) as illustrated below.

A relation is in 2NF if, and only if, it is in 1NF and every nonkey attribute (field) is fully functionally dependent upon the key field. This means that all data attributes (fields) that are not used to uniquely identify records (tuples or rows) in a file (table) should not appear more than once in the entire database and should never have to be entered into the database more than once. Any non-identifying data fields should be placed into separate objects (files). For example, we could remove the name, address, etc. fields into an object named STUDENT and remove them from the ENROLL object. The result will yield two objects (files):

STUDENT(Student_Number, Student_Name, Address, City, State, Zip, Major_Code)
ENROLL(Student_Number, Class_Code, Major_Description, Class_Description, Course_Grade, Class_Start_Date, Class_Start_Time, Class_Room_Number, Building_Number, Building_Name, Building_Address, Professor_Code, Professor_Name)

Here we see that the student name, address, etc., are functionally dependent upon the student number in the STUDENT object (file), and that the class description, start date, building name, etc., are functionally dependent upon the Student Number and the Class Code in the ENROLL object (file).

The relation between these objects (files) is said to be in 2NF. The relation is the logical linkage between the files so that all data necessary to enroll students in classes is available and may be uniquely retrieved when necessary.

While getting the data files into 2NF is better than 1NF, there are still some problems with the form. For example, if the location of the class changes buildings, all records in the ENROLL file for that class will have to be updated. The building name and address are "transitively dependent" upon the building number. Resolving the "transitive dependency" leads us to Third Normal Form (3NF).

A relation is in 3NF if, and only if, it is in 2NF and no nonkey fields are transitively dependent upon the key field(s). That is, no nonkey field can be functionally dependent upon another nonkey field. Our example is clearly not in 3NF since the building name (nonkey field) depends upon the building number (nonkey field). The relation can be resolved into 3NF by dividing it into component relations, each meeting 3NF form. Astute students will also have recognized that the class description, start time, and start date are transitively dependent upon the class code, which is not considered a key field here because it forms only part of the key field for the ENROLL object. They will also recognize that professor name is functionally dependent upon the professor code, which is not a key field. The building code and professor code fields are not key fields because they are not used to uniquely identify each record in the ENROLL object (file).

ENROLL(Student_Number, Class_Code, Course_Grade)
BUILDING(Building_Number, Building_Name, Building_Address)
CLASS(Class_Code, Class_Description, Class_Start_Date, Class_Start_Time, Class_Room_Number, Building_Number, Professor_Code)
PROFESSOR(Professor_Code, Professor_Name, Department_Code, Department_Name)
MAJOR(Major_Code, Major Description)
STUDENT(Student_Number, Student_Name, Address, City, State, Zip, Major_Code)

Note also that the PROFESSOR object is not in 3NF since the Department Name is transitively dependent upon the Department Code. We resolve this into:

PROFESSOR(Professor_Code, Professor_Name, Department_Code)
DEPARTMENT(Department_Code, Department_Name, Chair_Professor_Code, etc.)

This exercise illustrates that you must consider ALL relationships within the organization's database and resolve ALL relations into 3NF. This can take some time and effort, but the rewards are great.

A very important point here is that no data may be lost during the normalization process. We must always be able to reconstruct the original data after the normalization. To lose data will cause problems and will be the result of an invalid normalization process.

Some very good sites that provide further information about normalization can be found at these links:

We are now ready to examine how we managerially uitlize relations and what logical (and physical) operations are permitted when working with data files.

Return to Chapter 6-9