November 15, 2008
Tons of digital and traditional ink has been spent on optimizing databases. There are many techniques and preferences for going about this activity, and one can always find two people who disagree over approaches to the same goal. I will not even attempt to present all of my own ideas. What I will attempt to do, though, is gather questions to ask when doing optimization. How anyone answers those questions will depend on circumstances, both known and unforeseen.
I need to point to two resources which prompted me to tackle this issue: 10 Tips for Optimizing for MySQL Queries and 10 Tips for Optimizing MySQL Queries (That don’t Suck).
Normalize

There is never a good reason not to normalize your database. This should always be the first step; whatever else you want to do, normalize before it. I cannot stress enough the importance of normalizing before doing anything else. Normalization prevents storing redundant data and provides two big benefits: it saves storage space, and it makes inserts and updates easier to deal with.
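As a minimal sketch of that second benefit, here is a normalized pair of tables using Python's built-in sqlite3 module (the movie/genre tables and their columns are made up for illustration; the post's context is MySQL, but the idea is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: genres live in their own table and movies reference them,
# so the genre name is stored once instead of on every movie row,
# and renaming a genre is a single UPDATE.
cur.executescript("""
CREATE TABLE genre (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE movie (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    genre_id INTEGER NOT NULL REFERENCES genre(id)
);
""")
cur.execute("INSERT INTO genre (name) VALUES ('Drama')")
cur.execute("INSERT INTO movie (title, genre_id) VALUES ('Example Film', 1)")

# One update fixes the name everywhere it is used.
cur.execute("UPDATE genre SET name = 'Drama/Romance' WHERE id = 1")
row = cur.execute("""
    SELECT m.title, g.name FROM movie m JOIN genre g ON g.id = m.genre_id
""").fetchone()
print(row)  # ('Example Film', 'Drama/Romance')
```

Had the genre name been stored as a text column on every movie row, that rename would have required touching every affected row instead of one.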
Read and Write
Divide your tables into two main categories: frequent reads and frequent writes. For example, a table for genres of movies doesn’t change that often but is used by other tables, so it would be in the frequent-reads category. On the other hand, a table keeping log data gets many, many inserts; it goes in the frequent-writes category.
You can use these categories in many ways. One way is to put frequently read tables on faster disks (say, SAS instead of SATA) and less frequently read tables on somewhat slower disks. You can even apply both a read label and a write label to the same table: read-frequently write-frequently, read-frequently write-less-frequently, read-less-frequently write-frequently, and read-less-frequently write-less-frequently. Then decide how to manage each of these categories.
You can further categorize the columns in a table using the same criteria. If certain columns of a table are read frequently, and all columns are written only occasionally, then you have to ask a question: would it be alright to break up the table into two or more, so that the frequently read columns are in one table and all the others in other tables? If most columns of a table are written to frequently, then you might want to keep them together, so that the DBMS doesn’t have to work too hard finding related columns in disparate tables for each write operation.
Break up Tables
If you have a smaller number of tables/columns that are accessed frequently (read or write), then it is easier for the DBMS to keep them in memory, making things faster. So when you can, break up tables even beyond standard normalization so that there is less data to be kept in memory.

If your table needs a few variable-length columns, then explore the possibility of separating them out into other tables, so that more of your tables have fixed-length rows. This speeds up access on disk.
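A sketch of that split, again in sqlite3 with made-up table and column names: the fixed-width columns that are scanned constantly stay in the main table, while a long free-text column moves to a side table that is fetched only when actually needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical split: `person` keeps only fixed-length data; the
# variable-length biography lives in a rarely read side table.
cur.executescript("""
CREATE TABLE person (
    id          INTEGER PRIMARY KEY,
    dateofbirth TEXT NOT NULL        -- fixed-length 'YYYY-MM-DD'
);
CREATE TABLE person_bio (
    person_id INTEGER PRIMARY KEY REFERENCES person(id),
    biography TEXT                   -- variable length, read rarely
);
""")
cur.execute("INSERT INTO person (dateofbirth) VALUES ('1980-05-17')")
cur.execute("INSERT INTO person_bio VALUES (1, 'A long biography...')")

# The hot query never touches the wide text column.
dob = cur.execute("SELECT dateofbirth FROM person WHERE id = 1").fetchone()[0]
print(dob)  # 1980-05-17
```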
Indexes

My suggestion is to not build indexes from day one of development. First create the application that will be using this data. As the application takes shape, you will get a good idea of which queries are used more frequently than others, and then you can see which columns and tables those queries require. Once you have a better idea, create indexes.
But before you actually create indexes, profile these queries and gather performance data. Then create the indexes and gather performance data again. Compare the two sets of performance data to see whether the indexes actually helped, and how. The golden rule is to create fewer indexes (because they take storage space themselves) but to create necessary and effective ones.
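One cheap way to see whether an index is actually picked up is to ask the planner directly. A sketch using sqlite3's EXPLAIN QUERY PLAN (MySQL's EXPLAIN serves the same purpose; the table, column, and index names here are made up, and the exact plan text varies by version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, dateofbirth TEXT)")
cur.executemany("INSERT INTO person (dateofbirth) VALUES (?)",
                [(f"19{i % 100:02d}-01-01",) for i in range(1000)])

query = "SELECT id FROM person WHERE dateofbirth = '1980-01-01'"

# Before indexing: the planner can only scan the whole table.
before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

cur.execute("CREATE INDEX idx_person_dob ON person (dateofbirth)")

# After indexing: the same query is answered via the index.
after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][-1])  # e.g. a SCAN over person
print(after[0][-1])   # e.g. a SEARCH using idx_person_dob
```

Pair this with wall-clock timings on realistic data volumes before deciding an index earns its keep.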
A good resource is “lessons learnt from creating indexes.”
Keys

There are many types of keys; the big three are primary, unique, and foreign. The rule of thumb is to use natural keys, that is, keys that uniquely identify data based on the data itself. For example, a person’s social security number is a natural key for identification: it is unique, and two people (theoretically) cannot have the same number.
But there are situations where the data type of a natural key is variable length. Similarly, a composite key, where two or more columns combine to make a key, can sometimes be cumbersome to manage, especially when it is used as a foreign key in some other table. In these cases, for faster performance, an artificial integer key can be considered. For example, name and date of birth can be a candidate key for a table about people. Names are variable length, and indexing them becomes cumbersome. So an artificial “id” can be given to each person, implemented as an automatically incrementing integer.
Artificial integer keys also come in handy when using object-relational mappers (ORMs) like SQLObject and Hibernate. They are usually unable to handle composite keys and need a single column as the key. This is where you could use integer keys. In this case, you might still want to put a unique constraint on the columns of the composite candidate key, so that in the future, when ORMs can handle composite keys, you already have the data structured as such.
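That combination can be sketched as follows in sqlite3 (the `person` table is hypothetical; the same shape works in MySQL with AUTO_INCREMENT): an auto-incrementing integer surrogate key for ORMs and foreign keys, plus a UNIQUE constraint preserving the natural candidate key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Surrogate integer key, with the natural candidate key (name,
# dateofbirth) still protected by a unique constraint.
cur.execute("""
CREATE TABLE person (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    name        TEXT NOT NULL,
    dateofbirth TEXT NOT NULL,
    UNIQUE (name, dateofbirth)
)
""")
cur.execute("INSERT INTO person (name, dateofbirth) VALUES ('Ada', '1815-12-10')")

# The surrogate key is generated automatically...
pid = cur.execute("SELECT id FROM person WHERE name = 'Ada'").fetchone()[0]

# ...while the unique constraint still rejects duplicate natural keys.
try:
    cur.execute("INSERT INTO person (name, dateofbirth) VALUES ('Ada', '1815-12-10')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(pid, duplicate_rejected)  # 1 True
```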
Queries

The first rule in writing queries is to not access columns you don’t need. For example, use

select name, dateofbirth from person where dateofbirth between '1980-01-01' and '1980-12-31'

instead of

select * from person where dateofbirth between '1980-01-01' and '1980-12-31'

The person table could have many columns, and the DBMS would have to fetch all of them to return each row, even the ones you don’t need. Selecting only the columns you need reduces the work the DBMS has to do, making things faster.
I am a big proponent of sticking as close to the DBMS as possible when it comes to queries. For example, using meaningful views and useful stored procedures/functions, to me, is better than doing the same in application code. I think of it as object oriented programming in that a view provides an interface and all the work is being done behind the scenes. Tomorrow if the underlying structure of the database changes, the query behind the view can be changed without affecting the application. Similarly, stored procedures can be used. So the database developers and admins can continue to optimize the database without needing to “optimize” the application code.
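A small sketch of a view acting as that stable interface, in sqlite3 (the view and table names are made up; MySQL views work the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE person (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    dateofbirth TEXT NOT NULL
);
-- The application queries this stable interface; the tables behind it
-- can be restructured later without touching application code.
CREATE VIEW person_1980 AS
    SELECT name, dateofbirth
    FROM person
    WHERE dateofbirth BETWEEN '1980-01-01' AND '1980-12-31';
""")
cur.execute("INSERT INTO person (name, dateofbirth) VALUES ('Sam', '1980-06-15')")
cur.execute("INSERT INTO person (name, dateofbirth) VALUES ('Kim', '1975-03-02')")

rows = cur.execute("SELECT name FROM person_1980").fetchall()
print(rows)  # [('Sam',)]
```

If `person` were later split into several tables, only the view definition would change; every application query against `person_1980` would keep working.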
Know Your DBMS
You have to know your DBMS. Know it inside and out. Know its intricate details. Then use its strengths to your advantage, and minimize its weaknesses with that knowledge. Take general optimization techniques and apply them within the environment of your DBMS.
Regularly defragment indexes and tables; disk access is faster when the data is not fragmented. Justin recommends using less complex permissions. Also, update statistics on tables as frequently as required.
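Updating statistics is usually a one-line command. A sketch in sqlite3, where ANALYZE refreshes the statistics the query planner uses to choose indexes (the `log` table and index are made up; MySQL's equivalent is ANALYZE TABLE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (id INTEGER PRIMARY KEY, level TEXT)")
cur.execute("CREATE INDEX idx_log_level ON log (level)")
cur.executemany("INSERT INTO log (level) VALUES (?)",
                [("info" if i % 10 else "error",) for i in range(1000)])

# ANALYZE gathers index statistics into sqlite_stat1 for the planner.
cur.execute("ANALYZE")
stats = cur.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
print(stats)
```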
There is a lot of discussion and work on scaling databases. That should be a secondary consideration, after optimizing the database. Extract every ounce of performance from your one server and see how far it takes you. If you still want more, then look into scaling. The list I have touched upon in this post is by no means complete. It is just a starting point to get us all thinking along the lines of optimization.