A Python and Unicode Ahaa! Moment
March 6, 2012
EDIT (2013-03-08): Watch the presentation Pragmatic Unicode by Ned Batchelder and try to ignore this post. I wrote it when I had a lesser understanding of Unicode. In other words, this post is deprecated.
I get stumped every time I try to work with Unicode in Python. The biggest problems arise when trying to read files with Unicode data in them. Today was again a day when I found out that everything I know about Unicode is either completely misunderstood or forgotten. But after several hours of looking at various tutorials, code snippets, etc., I finally got my eureka moment.
When I write a text file with Unicode data in it, I always use the symbol (e.g. ㇹ) instead of its code (e.g. \u31f9). When I read this file in Python, I usually get some kind of error. I learned today that for my sanity I should use the code and not the symbol when writing Unicode in text files. But which code? I use the \uXXXX escape, which is the character's Unicode code point in hexadecimal rather than its UTF-8 bytes, and Unicode 4.0 / ISO 10646 Plane 0 has a great list of them. Now when I read such a file in Python, it reads without a problem.
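Here is a minimal sketch of that idea, assuming the text file stores the escape as six literal ASCII characters. The file name sample.txt and the variable names are made up for illustration, and decoding with the 'unicode_escape' codec is just one way to turn the escape back into the symbol, not necessarily the best one.

```python
import os
import tempfile

# In Python 3 source code, the symbol and its \u escape are the same string.
symbol = 'ㇹ'
escaped = '\u31f9'
same_string = (symbol == escaped)

# Hypothetical file that stores the escape as plain ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='ascii') as handle:
    handle.write('\\u31f9')

# Always name the encoding instead of relying on the platform default.
with open(path, 'r', encoding='ascii') as handle:
    raw_text = handle.read()

# raw_text is still a backslash, a 'u', and four hex digits,
# so decode the escape sequence to get the actual character.
decoded = raw_text.encode('ascii').decode('unicode_escape')

print(same_string, len(raw_text), decoded == symbol)  # True 6 True
```

Because the file is pure ASCII, it reads the same way no matter what default encoding the operating system uses.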
This ties into JSON as well. In your JSON text, instead of writing symbols as we see them, write the \uXXXX hexadecimal escape for each one. I tried this technique with Python 3 on Windows 7 and Windows Server 2008 R2.
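A small sketch of the JSON case using the standard json module. The key name "letter" is made up for illustration. The parser turns the escape into the real character, and json.dumps writes it back out as an escape by default.

```python
import json

# JSON text containing the escape rather than the symbol itself.
json_text = '{"letter": "\\u31f9"}'

# json.loads converts the \u escape into the real character.
data = json.loads(json_text)

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the output is pure ASCII again.
round_trip = json.dumps(data)

print(data['letter'] == 'ㇹ')  # True
print(round_trip)              # {"letter": "\u31f9"}
```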
If you want to normalize Unicode data, use the unicodedata module. The function to use is normalize. I am still unclear on which supported “form” (‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’) to use in which situations. But through trial and error I have settled on NFC because it keeps each character in its composed form (unlike NFD, which splits it into a base character plus combining marks) and does not replace compatibility characters with their equivalents (unlike NFKC and NFKD). You really do need to read more about the unicodedata module to understand what I mean.
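To make the difference between the forms a little more concrete, here is a rough sketch. The sample characters (é and ⅘) and the variable names are my own picks for illustration.

```python
import unicodedata

composed = '\u00e9'     # é as a single code point
decomposed = 'e\u0301'  # e followed by COMBINING ACUTE ACCENT
fraction = '\u2158'     # ⅘ VULGAR FRACTION FOUR FIFTHS

# NFC composes: base letter plus combining mark becomes one code point.
nfc = unicodedata.normalize('NFC', decomposed)

# NFD decomposes: the single code point is split in two.
nfd = unicodedata.normalize('NFD', composed)

# NFC leaves the compatibility character alone...
nfc_fraction = unicodedata.normalize('NFC', fraction)

# ...while NFKC replaces it with 4, FRACTION SLASH (U+2044), 5.
nfkc_fraction = unicodedata.normalize('NFKC', fraction)

print(nfc == composed, len(nfd), nfc_fraction == fraction, len(nfkc_fraction))
# True 2 True 3
```

This is why NFC felt like the safest default to me: the text still looks the same afterwards, and strings that look identical compare as equal.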
But it’s really that simple. Use \uXXXX hexadecimal escapes when writing text files and use NFC when reading files to normalize data. For example, if your file contains the following data:
Then your Python script should have something like:
import unicodedata
normalized_unicode = unicodedata.normalize('NFC', '\u2158\u31f9')
And when you display the data, it will show up as:
⅘ㇹ