Perfect Database Server

My current favorite to-do project is to come up with definitive answers to the question: how do you set up the perfect database server? I realize that there is not much information in the question to go on. What kind of hardware, operating system, DBMS, network environment, and application requirements are we talking about? Do you want open source or commercial closed-source solutions? What kind of data storage requirements are there? Do you want the ability to scale up (better hardware) or scale out (more hardware, like clustering)? What kind of budget do you have? See, just one question brings out so many more (and I am not finished with this list yet). My goal, therefore, is to answer such questions with authority for as many solutions as I (with or without help from other interested parties) can come up with.

The first thing, I think, is to ask what the application requires. Does it require a basic data store or some fancy “enterprise” solution? Is it read-heavy, write-heavy, or a mix of both? How many users of the application are anticipated?

The next step is to determine the environment in which this database will run, including the people who will be running the server. What is their experience and expertise? How willing are they to venture out of their comfort zones? What kinds of servers are they already managing? Does the environment use solutions from a single vendor, or is it a hybrid? Do you want dedicated servers, or are you willing to go with virtualization? Is the performance hit of virtualization acceptable?

Now we look at the budget. How much can you spend on hardware? If your application requires a cluster, can you afford it? If your application can run on a single database server, can you purchase a better server than absolutely required? Can you afford Oracle, DB2, and/or MS SQL Server? Or do you want a different solution, such as PostgreSQL or MySQL? Do you want the features and support of Solaris, or are you comfortable with Linux?

After thinking a little, I see that not all solutions can be tested by just one person. There isn’t enough time and money to look over all the possibilities. I can start with free and open source software. Once I have exhausted those as best I can, I can move toward proprietary solutions. The trick, however, is to go back and revisit the solutions some time later to keep up with updates and new technology. For example, if I tested Debian with MySQL, then when a new version of either Debian or MySQL comes out, I should go back and see what needs to change in the solution I provided.

I think I will start with the following things (if and when I can):
FreeBSD, Debian, Red Hat Enterprise Linux, Solaris, OpenSolaris
Without virtualization and with virtualization (Xen, KVM, VMware, OpenVZ, Linux-VServer)
MySQL, PostgreSQL, Apache Derby, One$DB, Firebird, Ingres, DB2, Oracle
Standalone and cluster
Local and external data storage (iSCSI, ZFS, UFS, GFS, DRBD, XFS, JFS, ReiserFS, Ext3)

Database Optimization

Tons of digital and traditional ink have been spilled on the subject of optimizing databases. There are many techniques and preferences for going about this activity, and one can always find two people who disagree over approaches to the same goal. I will not even attempt to present many ideas of my own. What I will attempt to do, though, is to gather questions to ask when doing optimization. How anyone answers these questions will depend on circumstances, both known and unforeseen.

I need to point to two resources which prompted me to tackle this issue: 10 Tips for Optimizing for MySQL Queries and 10 Tips for Optimizing MySQL Queries (That don’t Suck).

Normalization

There is never a good reason not to normalize your database. This should always be the first step: whatever else you want to do, normalize first. I cannot stress enough the importance of normalizing before doing anything else. Normalization prevents storing redundant data and provides two big benefits: it saves storage space, and it makes inserts and updates easier to deal with.

Read and Write

Divide your tables into two main categories: frequent reads and frequent writes. For example, a table of movie genres doesn’t change very often but is used by other tables, so it would be in the frequent-reads category. On the other hand, a table keeping log data gets many, many inserts; it goes in the frequent-writes category.

You can use these categories in many ways. One way could be to choose faster disks (say, SAS instead of SATA) to store read-frequently tables and somewhat slower disks for read-less-frequently tables. You can even apply two labels to the same table: read-frequently write-frequently, read-frequently write-less-frequently, read-less-frequently write-frequently, and read-less-frequently write-less-frequently. Then you can decide how to manage these tables.
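
As a rough illustration of the disk idea, here is a minimal sketch using PostgreSQL tablespaces. The directory paths and the genre and access_log tables are assumptions made up for this example, not anything prescribed above.

    -- Assumes /mnt/fast and /mnt/slow exist and are writable by the postgres user.
    CREATE TABLESPACE fast_disk LOCATION '/mnt/fast';
    CREATE TABLESPACE slow_disk LOCATION '/mnt/slow';

    -- A read-frequently lookup table placed on the faster disks.
    CREATE TABLE genre (
        genre_id serial PRIMARY KEY,
        name     text NOT NULL
    ) TABLESPACE fast_disk;

    -- A write-frequently log table placed on the slower disks.
    CREATE TABLE access_log (
        log_id    bigserial PRIMARY KEY,
        genre_id  integer REFERENCES genre (genre_id),
        logged_at timestamp NOT NULL DEFAULT now()
    ) TABLESPACE slow_disk;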

You can further categorize columns in tables using the same criteria. If certain columns of a table are read frequently, and all columns are written only occasionally, then you have to ask a question: would it be alright to break up the table into two or more tables, so that the frequently read columns are in one table and all others in another? If most columns of a table are written to frequently, then you might want to keep them together, so that the DBMS doesn’t have to work too hard finding related columns in disparate tables for each write operation.

Break up Tables

If you have a smaller number of tables and columns which are accessed frequently (read or write), then it is easier for the DBMS to keep them in memory, making things faster. So, when you can, break up tables even further than standard normalization so that there is less data to be kept in memory.

If your table needs a few variable-length columns, then explore the possibility of separating them into other tables so that more of your tables have fixed-length rows. This speeds up access on disk.
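
A small sketch of this idea (the table and column names are invented for illustration): the fixed-length columns stay together, and the large variable-length column moves to a companion table that is joined only when needed.

    -- Fixed-length columns stay in the main table.
    CREATE TABLE person (
        person_id     INT PRIMARY KEY,
        date_of_birth DATE NOT NULL,
        country_code  CHAR(2) NOT NULL
    );

    -- The variable-length column lives in its own table.
    CREATE TABLE person_notes (
        person_id INT PRIMARY KEY,
        notes     TEXT,
        FOREIGN KEY (person_id) REFERENCES person (person_id)
    );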

Indexes

My suggestion is not to build indexes from day one of development. First create the application which will be using this data. As the application takes shape, you will get a good idea of which queries are used more frequently than others, and then you can see which columns and tables those queries require. Once you have a better idea, create indexes.

But before you actually create indexes, profile these queries and gather performance data. Then create the indexes and gather performance data again. Compare the two sets of performance data to see whether the indexes actually helped and how. The golden rule is to create fewer indexes (because they take storage space themselves) but to create necessary and effective ones.
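
A minimal sketch of that workflow, reusing the person table and dateofbirth column from the query example later in this post (which are themselves just placeholders):

    -- Profile the query before adding the index.
    EXPLAIN SELECT name, dateofbirth
    FROM person
    WHERE dateofbirth BETWEEN '1980-01-01' AND '1980-12-31';

    -- Create an index on the column the WHERE clause filters on.
    CREATE INDEX idx_person_dateofbirth ON person (dateofbirth);

    -- Profile again and compare the two plans (and actual timings)
    -- to see whether the index is used and whether it really helps.
    EXPLAIN SELECT name, dateofbirth
    FROM person
    WHERE dateofbirth BETWEEN '1980-01-01' AND '1980-12-31';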

A good resource is “lessons learnt from creating indexes.”

Keys

There are many types of keys; the big three are primary, unique, and foreign. The rule of thumb is to use natural keys, that is, keys that uniquely identify data based on the data itself. For example, a person’s social security number is a natural key for identification: it is unique, and two people (theoretically) cannot have the same number.

But there are situations where the data type of a natural key is variable length. Similarly, a composite key, where two or more columns combine to make a key, can sometimes be cumbersome to manage, especially when this key is used as a foreign key in some other table. In these cases, for faster performance, an artificial integer key can be considered. For example, name and date of birth can be a candidate key for a table about people. Names are variable length, and indexing them becomes cumbersome. So an artificial “id” can be given to each person, implemented as an automatically incrementing integer.

Artificial integer keys also come in handy when using object-relational mappers (ORMs) like sqlobject and Hibernate. They are usually unable to handle composite keys and need a single column as the key. This is where you can use integer keys. In this case, you might still want to put a unique constraint on the columns of the composite candidate key so that, in the future, when your ORM can handle composite keys, your data is already structured as such.
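
A sketch of that arrangement in MySQL syntax (all names and sizes are assumptions for the example): an auto-incrementing surrogate key for the ORM and for joins, plus a unique constraint that still enforces the natural candidate key.

    CREATE TABLE person (
        id            INT AUTO_INCREMENT PRIMARY KEY,
        name          VARCHAR(200) NOT NULL,
        date_of_birth DATE NOT NULL,
        -- The natural candidate key stays enforced even though the
        -- surrogate id is what other tables and the ORM reference.
        UNIQUE KEY uq_person_name_dob (name, date_of_birth)
    );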

Optimize Queries

The first rule in writing queries is to not access columns you don’t need. For example, use select name, dateofbirth from person where dateofbirth between '1980-01-01' and '1980-12-31' instead of select * from person where dateofbirth between '1980-01-01' and '1980-12-31'. The person table could have many columns, and the DBMS will have to access all of them to return them to you, even if you don’t need them. Selecting only the columns you need reduces the work the DBMS has to do, making things faster.

I am a big proponent of sticking as close to the DBMS as possible when it comes to queries. For example, using meaningful views and useful stored procedures/functions is, to me, better than doing the same in application code. I think of it as object-oriented programming, in that a view provides an interface and all the work is done behind the scenes. If tomorrow the underlying structure of the database changes, the query behind the view can be changed without affecting the application. Stored procedures can be used similarly. So the database developers and admins can continue to optimize the database without needing to “optimize” the application code.
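
A small sketch of the view-as-interface idea (the customer and invoice tables and the view name are hypothetical): the application only ever queries the view, and the join behind it can change without touching application code.

    CREATE VIEW customer_invoice_summary AS
    SELECT c.customer_id,
           c.name,
           SUM(i.amount) AS total_invoiced
    FROM customer c
    JOIN invoice  i ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name;

    -- Application code only sees the interface:
    -- SELECT * FROM customer_invoice_summary WHERE customer_id = 42;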

Know Your DBMS

You have to know your DBMS. Know it inside and out. Know its intricate details. Then use its strengths to your advantage and minimize its weaknesses. Take general optimization techniques and apply them within the environment of your DBMS.

Miscellaneous

Regularly defragment indexes and tables so that disk access is faster than it would be with fragmented data. Justin recommends using less complex permissions. Also, update statistics on tables as frequently as required.
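
In MySQL, for example, those maintenance tasks might look like the following sketch (the orders table is a hypothetical name):

    -- Rebuild the table and its indexes to reduce fragmentation.
    OPTIMIZE TABLE orders;

    -- Refresh the statistics the optimizer uses to choose query plans.
    ANALYZE TABLE orders;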

Conclusion

There is a lot of discussion and work on scaling databases. That should be a secondary consideration, after optimizing the database. Extract every ounce of performance from your one server and see how far it takes you. If you still want more, then look into scaling. The list I have touched upon in this post is by no means complete. It is just a starting point to get us all thinking along the lines of optimization.

Collation and Character Set

I have always had difficulty differentiating between collation and character set, so I thought, why not finally find out the difference? I should mention that Python 3 using Unicode by default is what prompted me to investigate this issue now. When I began to think about how to program in Django so that applications can be made for a global audience, the matter of storing data in the database came up.

According to Character Sets and Collations in General, “A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.” In other words, a character set is a collection of characters, such as those of a language. The letters of the English alphabet are a character set. How to compare these letters with each other is collation.

For example, a, b, c, d, …, z are the letters (the character set). Comparing them includes (but is not limited to) sorting them in ascending or descending order. Does an uppercase F precede a lowercase f during sorting? That question would be part of collation.

But how do databases store these character sets? They are stored based on a character encoding. A character encoding maps a character set (or a subset of it) to something else, say a numerical sequence (think ASCII, where 65 is A and 97 is a). Morse code is one type of character encoding. Different character encodings store data differently and also place limits on which character sets can be stored.
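
As a tiny illustration, a MySQL client can show the mapping directly (this is just a throwaway query, not something the databases need):

    -- 65 maps to 'A', and 'a' maps back to code point 97,
    -- under ASCII-compatible encodings.
    SELECT CHAR(65 USING utf8mb4) AS letter, ASCII('a') AS code_point;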

For example, ASCII (a character encoding) covers a much smaller character set than UTF-8 (another character encoding). Depending on what kind of data you are storing, you choose a character encoding, or more generally, a character set.

Once you have chosen how to store (encode) your data, you then choose how to compare (collate) the data. Some examples of collations include latin1_danish_ci (MySQL) and latin2_czech_ci (MySQL).

My suggestion: choose UTF-8 as your character encoding and a related collation. This will allow you to store data in many different languages and gives you a good starting point for deciding how to collate. Of course, choose whatever best suits your needs.
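
In MySQL, for example, that choice might look like the following sketch (the database and table names are hypothetical; utf8mb4 is the full UTF-8 encoding in modern MySQL versions):

    -- A database that stores UTF-8 and compares text with a
    -- Unicode-aware, case-insensitive collation.
    CREATE DATABASE app_data
      CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;

    -- The same clauses can also be applied per table or per column.
    CREATE TABLE message (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
    );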

Since Python prompted me to explore this topic, check out Unicode HOWTO for more information on using unicode in Python.

Stored Procedures in Databases

I read a very interesting article about stored procedures (Whats [sic] up with stored procedures these days?). Some points were raised against using stored procedures, which prompted me to write this post.

I am a big fan of databases. In fact, I am a big fan of stored procedures. The reason is that if I need to manipulate data, it is easier for me to do so on the database itself, mostly (or even only) using SQL. Sure, one can use inline SQL or other, newer, fancier tools within application code, but something like PL/SQL or T-SQL already provides much of what you would like to do. Stored procedures surely aren’t the solution to everything, but for most cases they are great.

I agree that business logic should be separate from data. However, in a well-normalized database, data is mostly stored in conjunction with logic. Let me put it another way: whatever data you store, the data itself embodies a lot of business logic. If an invoice goes with a customer (business logic), you store these two values together in an RDBMS, not separately. So if not all business logic, at least a lot of it lives within the data. Stored procedures take it a step further and solidify these relationships (logic) as a presentable view (not to be confused with a database ‘view’) to the outside world.

Application portability comes up often when it comes to stored procedures. If all your data logic is stored in an application, you can port the application from one DBMS to another, but it is difficult to implement the same application in two or more technologies. For example, you have an application written in Java, and then you need a similar (if not identical) application written in Python, with the same database back-end. Now you have to take the data logic and re-implement it in Python. If the same logic resided in the database as stored procedures, you could just pass parameters to them from any number of applications built in any number of technologies. Your logic becomes portable across applications rather than across databases. If your business logic needs to be tweaked, you make the change in one place and every application is updated without actually changing any application code.
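
A minimal sketch of what that shared data logic could look like, in MySQL syntax (the procedure, table, and column names are made up for the example): any client, written in any language, calls the same procedure with the same parameters.

    DELIMITER //

    CREATE PROCEDURE add_invoice (IN p_customer_id INT, IN p_amount DECIMAL(10,2))
    BEGIN
        -- The data logic lives here; a Java, Python, or any other
        -- application just calls the procedure and passes parameters.
        INSERT INTO invoice (customer_id, amount, created_at)
        VALUES (p_customer_id, p_amount, NOW());
    END //

    DELIMITER ;

    -- From any application: CALL add_invoice(42, 199.99);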

Obviously, this doesn’t solve the problem of portability across databases. But if you are changing the database, you might have to change your application, with all its SQL, anyway (to some extent, at least). This brings me to another point. With much of the SQL or data logic inside the application code, you lose the ability to keep data storage separate from your code. In the case of multiple databases, your application now needs to connect to each of them individually to get its data. Meanwhile, with linked servers and stored procedures, you call a stored procedure without having to know how the database layer has been implemented. You could change that implementation without affecting the application. You could have a cluster or other high-availability or load-sharing mechanisms, and change them around, all the while the application only knows about the stored procedures it needs.

I don’t think either solution is perfect for all situations. Sometimes it may be better to use stored procedures and sometimes not so much. I would rather use stored procedures than not, perhaps because I am more comfortable working with databases. When there is an issue of control in a team, where the DBA and the developer are both fighting to control the solution to their liking, then it is no longer a debate about the merits of stored procedures but more of a human problem.

Disclaimer: this post may ramble a bit, but I wanted to get my initial reaction across. As I think more about this issue, I will refine what I am saying.

Database Normalization

THIS POST IS A WORK IN PROGRESS. THERE MAY BE HEAVY MODIFICATIONS BETWEEN NOW AND THE TIME IT IS COMPLETED.

I have acquired enough knowledge about databases to produce acceptably normalized tables. However, I always forget the exact distinction between each normal form. Therefore, I decided to write this post, collecting relevant information from all the sources I come across into one place. It should also help those learning about this subject for the first time.

Example

Let’s say we have the following table:

StudentID StudentFirstName StudentLastName Semester Courses Grades InstructorID InstructorFirstName InstructorLastName
1 John Doe Spring 2008 Math; Science; Foreign Language A; C; B 15582; 15582; 788569 Joe; Joe; Jane Public; Public; Smith
1 John Doe Fall 2007 Foreign Language; Geography; History F; B; A 788569; 57156; 996244 Jane; Henry; Prue Smith; Frederick; Chang

Now we can see one main problem in this table: multiple values are stored in the same column for the same row. For example, the Courses column for the first row has three different courses listed. If we wanted to search for the grade in the Science class, we would have a hard time matching grades with their courses. Similarly, instructors cannot be matched with courses without some extra work.

The solution seems to be to break up one row into multiple rows. Thus we get:

StudentID StudentFirstName StudentLastName Semester Courses Grades InstructorID InstructorFirstName InstructorLastName
1 John Doe Spring 2008 Math A 15582 Joe Public
1 John Doe Spring 2008 Science C 15582 Joe Public
1 John Doe Spring 2008 Foreign Language B 788569 Jane Smith
1 John Doe Fall 2007 Foreign Language F 788569 Jane Smith
1 John Doe Fall 2007 Geography B 57156 Henry Frederick
1 John Doe Fall 2007 History A 996244 Prue Chang

Now it will be much easier to find the exact information we are looking for if we want to see what grade a student got in, say, Math.

The next step is to find a way to uniquely identify each row in the table. This is where the candidate key comes into play. We have many options. We can use StudentID as a key. If we use it, however, we are unable to get a unique value for the Courses column: for StudentID 1, we get three different courses: Math, Science, and Foreign Language. So StudentID by itself is not enough.

We can discard StudentFirstName and StudentLastName from the choices for a key because they have the same problem as StudentID. Maybe we can combine StudentID and Semester to form a key? However, this combination also has the same problem.

How about StudentID and Courses? Yes, they look like good candidates. However, again we have a problem: for Foreign Language, we have two different values in the Semester column, Spring 2008 and Fall 2007. As you can see, determining a candidate key is quite a task.

We may eventually combine StudentID, Semester, and Courses to create a candidate key. If we wish to use it as a primary key, then we have to deal with the concept of functional dependency.

Functional Dependency and First Normal Form

The concept is quite simple: once you have chosen a primary key, all columns other than those in the primary key should have a unique value for that primary key. For example, if StudentID is 1, Semester is Fall 2007, and Courses is History, you should get only a single value each for StudentFirstName, StudentLastName, Grades, InstructorID, InstructorFirstName, and InstructorLastName, which in this case are John, Doe, A, 996244, Prue, and Chang. Only one row has the values we are looking for.

Another way to look at it is that, for every primary key value, a unique value in each column of the row is determined. You may not have a case where two values are found for the same primary key in the same column. When this is the case, we say that the primary key functionally determines the non-key values, or that the non-key values are functionally dependent on the primary key.

Another important thing to remember is that the primary key itself should be unique. In our example, the primary key is a combination of three columns (values). In this case, each individual column may have duplicate values in the table, but when the columns are combined with each other, the combination must be unique.

For example, a StudentID of 1 is repeated in six rows, a Semester of Spring 2008 is repeated in three rows, and a Courses value of Foreign Language is repeated in two rows. However, the combination of StudentID, Semester, and Courses is unique in all six rows. This means the primary key is unique in our table.
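
One quick way to verify this is to look for duplicate key combinations. In this sketch I assume the flat table above is called grades_flat, a name not used elsewhere in this post:

    -- If this returns no rows, the three-column combination is unique
    -- and can serve as the primary key.
    SELECT StudentID, Semester, Courses, COUNT(*) AS occurrences
    FROM grades_flat
    GROUP BY StudentID, Semester, Courses
    HAVING COUNT(*) > 1;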

If you have reached this situation, you are in the First Normal Form (1NF).

Second Normal Form

Once you have achieved 1NF, you may put the table in the database. However, if you look closely, there are some problems visible.

If John Doe needs to take another course in Spring 2008, you will insert a new row with duplicate data in some columns, such as StudentID, StudentFirstName, StudentLastName, and Semester. And depending on the course being taken, other columns may also end up holding duplicate data. But the biggest problem is that you cannot insert the new course without inserting this other (duplicate) data. This is called an Insert Anomaly.

Let’s say you need to update Jane Smith’s last name because she got married and changed her name to Jane Smith-Pitt. You will need to update her last name in all the places it exists. Our example shows only two records (rows), but there would certainly be more if she taught more than one student. This problem is known as an Update Anomaly.

If for some reason you need to delete Prue Chang’s data, you will have to delete the entire record wherever it appears. So you will lose any data which appeared alongside hers but was not exactly hers. This is known as Delete Anomaly.

These anomalies point out an important fact: not all non-key values are dependent on the primary key alone. In fact, some are dependent on only part of the key. For example, StudentFirstName and StudentLastName are dependent on StudentID; to find out the name of a student, all you need is the StudentID. This property is known as Partial Dependency.

To get our table into Second Normal Form (2NF), we need to remove all partial dependencies from the table. To do so, you need to figure out which non-key values are dependent on only part of the key. We can see two such cases: StudentFirstName and StudentLastName are dependent on StudentID, and InstructorFirstName and InstructorLastName are dependent on InstructorID.

We take all partial dependencies and put them in their own tables. The primary key of our main table does not change, however, because the remaining non-key values are still dependent on it. In our example, we end up with two new tables, plus the reduced main (Courses) table:

Student Table

StudentID StudentFirstName StudentLastName
1 John Doe

Instructor Table

InstructorID InstructorFirstName InstructorLastName
15582 Joe Public
788569 Jane Smith
57156 Henry Frederick
996244 Prue Chang

Courses Table

StudentID Semester Courses Grades InstructorID
1 Spring 2008 Math A 15582
1 Spring 2008 Science C 15582
1 Spring 2008 Foreign Language B 788569
1 Fall 2007 Foreign Language F 788569
1 Fall 2007 Geography B 57156
1 Fall 2007 History A 996244

Now that we have split up the main table into three smaller tables, we may say our Courses table is in 2NF. But first, we have to make sure that the new tables are themselves in 1NF and then also in 2NF.

In the Student table, we have a primary key, StudentID, and it functionally determines the non-key values, so the table is in 1NF. Since the primary key is a single column, no non-key value can be dependent on only part of the key, so the table is also in 2NF.

Similarly, in the Instructor table, we have a primary key, InstructorID, and it functionally determines the non-key values. Again, since the primary key is a single column, there can be no partial dependencies, so the table is also in 2NF.

In our Courses table, we have a new concept: foreign keys. The StudentID and InstructorID columns get their values from the Student and Instructor tables respectively; therefore, they are called foreign keys. For these two columns, the Student and Instructor tables may be thought of as parent tables, while the Courses table is their child table. Foreign key columns may only hold data which is already present in their parent tables. If the parent table does not have a particular value, it may not be entered in the foreign key column of the child table. On the other hand, data present in the parent table does not necessarily have to appear in its child table.
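
As a sketch, the three tables could be declared like this (the data types and sizes are assumptions; the syntax is generic SQL):

    CREATE TABLE student (
        StudentID        INT PRIMARY KEY,
        StudentFirstName VARCHAR(100) NOT NULL,
        StudentLastName  VARCHAR(100) NOT NULL
    );

    CREATE TABLE instructor (
        InstructorID        INT PRIMARY KEY,
        InstructorFirstName VARCHAR(100) NOT NULL,
        InstructorLastName  VARCHAR(100) NOT NULL
    );

    CREATE TABLE courses (
        StudentID    INT NOT NULL,
        Semester     VARCHAR(20) NOT NULL,
        Courses      VARCHAR(100) NOT NULL,
        Grades       CHAR(1),
        InstructorID INT NOT NULL,
        -- The composite primary key from the 1NF discussion.
        PRIMARY KEY (StudentID, Semester, Courses),
        -- Foreign keys pointing at the parent tables.
        FOREIGN KEY (StudentID) REFERENCES student (StudentID),
        FOREIGN KEY (InstructorID) REFERENCES instructor (InstructorID)
    );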

In our example tables, if a new student is admitted, we only need to insert one record in the Student table and can then use the other tables without having to insert the same data again and again. So we got rid of the insert anomaly. The same holds for instructors. If an instructor changes her name, we just need to update the data in one table; thus, we no longer have the update anomaly. If we delete an instructor, we need not delete student data from the database; again, the delete anomaly has been fixed. And since there are no more partial dependencies, all of our tables are in 2NF.