Database Normalization

THIS POST IS A WORK IN PROGRESS. THERE MAY BE HEAVY MODIFICATIONS BETWEEN NOW AND THE TIME IT IS COMPLETED.

I have acquired enough knowledge about databases to get acceptably normalized tables. However, I always forget about the exact distinction between each normalized form. Therefore, I decided to write this post, collecting relevant information from all sources I come across into one place. It will also help those learning about this subject matter for the first time.

Example

Let’s say we have the following table:

StudentID StudentFirstName StudentLastName Semester Courses Grades InstructorID InstructorFirstName InstructorLastName
1 John Doe Spring 2008 1. Math
2. Science
3. Foreign Language
1. A
2. C
3. B
1. 15582
2. 15582
3. 788569
1. Joe
2. Joe
3. Jane
1. Public
2. Public
3. Smith
1 John Doe Fall 2007 1. Foreign Language
2. Geography
3. History
1. F
2. B
3. A
1. 788569
2. 57156
3. 996244
1. Jane
2. Henry
3. Prue
1. Smith
2. Frederick
3. Chang

Now we can see one main problem in this table. Data is repeated in the same column for the same row. For example, the Courses column for the first row has three different courses listed. If we wanted to search for the grade in Science class, we will have a hard time matching grades with their courses. Similarly, instructors will not be matched with courses without some extra work.

The solution seems to be to break up one row into multiple rows. Thus we get:

StudentID StudentFirstName StudentLastName Semester Courses Grades InstructorID InstructorFirstName InstructorLastName
1 John Doe Spring 2008 Math A 15582 Joe Public
1 John Doe Spring 2008 Science C 15582 Joe Public
1 John Doe Spring 2008 Foreign Language B 788569 Jane Smith
1 John Doe Fall 2007 Foreign Language F 788569 Jane Smith
1 John Doe Fall 2007 Geography B 57156 Henry Frederick
1 John Doe Fall 2007 History A 996244 Prue Chang

Now it will be much easier to find the exact information we are looking for if we want to see what grade a student got in, say, Math.

The next step is to find a way to uniquely identify each row in the table. This is where the candidate key comes into play. We have many options. We can use StudentID as a key. If we use it, however, we are unable to get a unique value for the Courses column: for StudentID 1, we get three different courses: Math, Science, and Foreign Language. So StudentID by itself is not enough.

We can discard StudentFirstName and StudentLastName from the choices for a key because they also have the same problem as StudentID. Maybe we can combine StudentID and Semester to form a key? However, this combination also have the same problem.

How about StudentID and Courses? Yes, they look like good candidates. However, again we have a problem. For Foreign Language, we have two different values for the Semester column: Spring 2008 and Fall 2007. As you can see, determining a candidate key is quite a task.

We may eventually combine StudentID, Semester, and Courses to create a candidate key. If we wish to use it as a primary key, then we have to deal with the concept of functional dependency.

Functional Dependency and First Normal Form

The concept is quite simple: once you have chosen a primary key, all columns, other than those in the primary key, should have a unique value for that primary key. For example, if StudentID is 1, Semester is Fall 2007, and Courses is History, you should only get a unique value for StudentFirstName, StudentLastName, Grades, InstructorID, InstructorFirstName, InstructorLastName, which in this case is Jon, Doe, A, 996244, Prue, Chang. Only one row has the values we are looking for.

Another way to look at it is: for every primary key, a unique value in a column in a row should be determined. You may not have the case where two values are found for the same primary key in the same column. When such is the case, we say that the primary key functionally determines the non-key values, or that the non-key values are functionally dependent on the primary key.

Another important thing to remember is that the primary key should be unique. In our example, the primary key is a combination of three columns (values). So in this case, each column may have duplicate values in the table, but when they are combined with each other, the combination must be unique.

For example, StudentID of 1 is repeated in six rows, Semester of Spring 2008 is repeated in three rows and Courses values of Foreign Language is repeated in two rows. However, the combination of StudentID. Semester, and Courses is unique in all six rows. This means the primary key is unique in our table.

If you have reached this situation, you are in the First Normal Form (1NF).

Second Normal Form

Once you have achieved 1NF, you may put the table in the database. However, if you look closely, there are some problems visible.

If John Doe needs to take another course in Spring 2008, you will insert a new row with some duplicate data in some columns, such as StudentID, StudentFirstName, StudentLastName, and Semester. And depending on the course being taken, other columns may also need to have duplicate data. But the biggest problem is, you cannot insert the new course without inserting the other (duplicate) data. This is called Insert Anomaly.

Let’s say you need to update Jane Smith’s last name because she got married and changed her name to Jane Smith-Pitt. You will need to update her last name in all the places it exists. Our example shows only two records (rows) but it would certainly be more if she taught more than one student. This problem is known as Update Anomaly.

If for some reason you need to delete Prue Chang’s data, you will have to delete the entire record wherever it appears. So you will lose any data which appeared alongside hers but was not exactly hers. This is known as Delete Anomaly.

These anomalies point out an important fact. All non-key values are not exactly dependent on the primary key alone. In fact, they are partially dependent on part of the key. For example, StudentFirstName and StudentLastName are dependent on StudentID. In fact, to find out the name of the student, all you need is the StudentID. This property is known as Partial Dependency.

To get our table into Second Normal Form (2NF), we need to remove all partial dependencies from the table. To do so, you need to figure out which non-key values are dependent on other values. We can see two such cases: StudentFirstName and StudentLastName are dependent on StudentID, and InstructorFirstName and InstructorLastName are dependent on InstructorID.

We take all partial dependencies and put them in their own tables. However, the primary key will not change for our main table because the remaining non-key values are still dependent on the primary key. In our example, we will have two more tables:

Student Table

StudentID StudentFirstName StudentLastName
1 John Doe

Instructor Table

InstructorID InstructorFirstName InstructorLastName
15582 Joe Public
788569 Jane Smith
57156 Henry Frederick
996244 Prue Chang

Courses Table

StudentID Semester Courses Grades InstructorID
1 Spring 2008 Math A 15582
1 Spring 2008 Science C 15582
1 Spring 2008 Foreign Language B 788569
1 Fall 2007 Foreign Language F 788569
1 Fall 2007 Geography B 57156
1 Fall 2007 History A 996244

Now that we have split up the main table into three smaller tables, we may say our courses table is in 2NF. But first, we have to make sure that the new tables created are first in 1NF and then also in 2NF.

In the Student table, we have a primary key called StudentID and it functionally determines the non-key values. Since none of the non-key values in partially dependent on any non-key columns, the table is also in 2NF.

Similarly, in the Instructor table, we have a primary key called InstructorID and it functionally determines the non-key values. Since none of the non-key values in partially dependent on any non-key columns, the table is also in 2NF.

In our Courses table, we have a new concept: Foreign Keys. The StudentID and InstructorID columns get their values from the Student and Instructor tables respectively. Therefore, these are called foreign keys. For these two columns, Student and Instructor tables may be thought of as parent tables while the Courses table is their child table. Foreign key columns may only have data which is already present in their parent tables. If the parent table does not have some data, it may not be entered in the foreign key columns of the child tables. On the other hand, any data present in the parent table does not necessarily have to be in its child table.

In our example tables, we only need to insert one record in the Student table if a new student is admitted and then use other tables without having to insert the same data again and again. So we got rid of the insert anomaly. Similar is the case for instructors. If an instructor changes her name, we just need to update the data in one table. Thus, we no longer have the update anomaly. If we delete an instructor, we need not delete student data from the database. Again, delete anomaly has been fixed. And since there are no more partial dependencies, all of our tables are in 2NF.

Advertisements

One Response to Database Normalization

  1. This is very up-to-date info. I’ll share it on Facebook.

%d bloggers like this: