March 8, 2013
If you want a great introduction to Unicode in Python, watch (not just read, watch) Ned Batchelder’s presentation, Pragmatic Unicode. The most important thing I took away from it was to decode input (file, network, etc.) to a Unicode string as soon as you receive it, and to encode back to a binary string as late as possible, right before handing it to the user, the system, etc. Another is to know which encoding was used, so you can decode and encode correctly. Both apply to all programming, whether it’s in Python or some other language.
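Here is a minimal sketch of that decode-early, encode-late pattern in Python 3. The function name shout and the sample bytes are my own illustration, not something from the talk, and I am assuming the input really is in the encoding passed in.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: upper-case some text that arrives as bytes."""
    # Decode at the boundary: bytes come in, text (str) from here on.
    text = raw.decode(encoding)
    # All the actual work happens on Unicode text.
    result = text.upper()
    # Encode only at the last moment, when handing the data back out.
    return result.encode(encoding)


incoming = b"caf\xc3\xa9"   # UTF-8 bytes for "café"
outgoing = shout(incoming)  # UTF-8 bytes for "CAFÉ"
```

Note that the encoding is an explicit parameter: the function cannot guess it from the bytes, so the caller has to know it.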
Some more things to remember:
- “\u2119” is a Unicode string containing one code point. A lowercase \u takes exactly four hex digits. For code points above U+FFFF, use an uppercase \U with exactly eight hex digits (for example, \U0001F600).
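- “\xe2” is a binary string containing one byte. (That holds for a plain literal in Python 2; in Python 3 the literal needs a b prefix, otherwise it is a one-character Unicode string.)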
- Always try to use UTF-8.
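The points in the list above can be checked in a few lines. This is a small sketch written for Python 3, so the byte string carries a b prefix; the variable names are only for illustration.

```python
one = "\u2119"         # U+2119, DOUBLE-STRUCK CAPITAL P: \u plus four hex digits
astral = "\U0001F600"  # a code point above U+FFFF: \U plus eight hex digits
raw = b"\xe2"          # a byte string holding a single byte

# Each Unicode string holds exactly one code point, the byte string one byte.
lengths = (len(one), len(astral), len(raw))

# Encoding the one-code-point string to UTF-8 yields three bytes.
encoded = one.encode("utf-8")

# Decoding those bytes with the same encoding gives the original string back.
round_trip = encoded.decode("utf-8")
```

The first byte of the UTF-8 form of U+2119 happens to be 0xE2, the same byte as in the second bullet. On its own that byte is not valid UTF-8, which is one reason you need to know the encoding before decoding.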