Collation and Character Set

I have always had trouble telling collation and character set apart, so I decided to finally work out the difference. What prompted me to look into it now is that Python 3 uses Unicode by default. When I started thinking about how to write Django applications for a global audience, the question of how to store data in the database came up.

According to Character Sets and Collations in General (from the MySQL manual), “A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.” In other words, a character set is a collection of characters, such as those used to write a language. The letters of the English alphabet form a character set. The rules for comparing those letters with each other are a collation.

For example, a, b, c, d, …, z are the letters (the character set). Comparing them includes, but is not limited to, sorting them in ascending or descending order. Does an uppercase F precede a lowercase f when sorting? That question is answered by the collation.
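To make the uppercase/lowercase question concrete, here is a small Python sketch. It is only an illustration of two possible comparison rules (plain code point order and a case-insensitive order), not how any particular database implements its collations; the variable names are made up for the example.

```python
letters = ["f", "b", "F", "a", "B"]

# Rule 1: compare raw Unicode code points.
# Every uppercase ASCII letter sorts before every lowercase one.
by_code_point = sorted(letters)

# Rule 2: compare case-folded forms, so "f" and "F" are treated as equal.
# Python's sort is stable, so equal items keep their input order.
case_insensitive = sorted(letters, key=str.casefold)

print(by_code_point)     # ['B', 'F', 'a', 'b', 'f']
print(case_insensitive)  # ['a', 'b', 'B', 'f', 'F']
```

Same characters, two different orderings: that difference is what a collation decides.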

But how do databases store these characters? They are stored according to a character encoding. A character encoding maps the characters in a character set (or a subset of it) to something else, such as a sequence of numbers (think ASCII, where 65 is A and 97 is a). Morse code is another kind of character encoding. Different character encodings store data differently, and they also limit which characters can be stored.
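The mapping from characters to numbers, and from there to stored bytes, can be seen directly in Python 3. This is a minimal sketch using only built-ins; the choice of "é" as the sample character is mine.

```python
# A character maps to a number (its code point)...
print(ord("A"), ord("a"))  # 65 97
print(chr(65))             # A

# ...and an encoding turns the character into bytes for storage.
# The same character can become different bytes under different encodings.
e_acute = "\u00e9"  # the character é
ascii_bytes = "A".encode("ascii")
utf8_bytes = e_acute.encode("utf-8")
utf16_bytes = e_acute.encode("utf-16-le")

print(ascii_bytes)  # b'A'
print(utf8_bytes)   # b'\xc3\xa9'
print(utf16_bytes)  # b'\xe9\x00'
```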

For example, ASCII (a character encoding) covers only 128 characters, far fewer than UTF-8 (another character encoding), which can represent every Unicode character. Depending on what kind of data you are storing, you choose a character encoding, or more generally, a character set.
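Here is a quick way to see that limit in Python. The sample text is an assumption chosen because it contains two non-ASCII letters.

```python
text = "na\u00efve caf\u00e9"  # "naïve café"

# ASCII has no code for ï or é, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can store the text; each accented letter takes two bytes.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)                    # False
print(len(text), len(utf8_bytes))  # 10 12
print(utf8_bytes.decode("utf-8") == text)  # True
```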

Once you have chosen how to store (encode) your data, you then choose how to compare (collate) the data. Some examples of collations include latin1_danish_ci (MySQL) and latin2_czech_cs (MySQL), where the _ci and _cs suffixes mark case-insensitive and case-sensitive comparison.
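To see a collation change a database's answers without setting up MySQL, here is a rough sketch using SQLite through Python's standard sqlite3 module. Note the swap: SQLite's built-in BINARY and NOCASE collations stand in for MySQL's named collations, and the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)", [("f",), ("B",), ("F",), ("a",)]
)

# BINARY compares the stored bytes, so uppercase sorts first.
binary_order = [
    row[0]
    for row in conn.execute("SELECT w FROM words ORDER BY w COLLATE BINARY")
]

# NOCASE ignores ASCII case; rowid breaks ties in insertion order.
nocase_order = [
    row[0]
    for row in conn.execute(
        "SELECT w FROM words ORDER BY w COLLATE NOCASE, rowid"
    )
]

# The collation affects equality too, not only sorting.
matches = conn.execute(
    "SELECT COUNT(*) FROM words WHERE w = 'f' COLLATE NOCASE"
).fetchone()[0]

print(binary_order)  # ['B', 'F', 'a', 'f']
print(nocase_order)  # ['a', 'B', 'f', 'F']
print(matches)       # 2
conn.close()
```

The stored data is identical in all three queries; only the comparison rule differs.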

My suggestion: choose UTF-8 as your character encoding and a collation that belongs to it. This lets you store data in many different languages, and it narrows your collation choices to the ones defined for that encoding. Of course, choose whatever best suits your needs.

Since Python prompted me to explore this topic, check out the Unicode HOWTO for more information on using Unicode in Python.

2 Responses to Collation and Character Set

  1. hs says:

    Why is Unicode different from UTF-8? Because Unicode is not UTF.
