Unicode and Python

If you want a great introduction to Unicode in Python, watch (not just read, watch) Ned Batchelder’s presentation, Pragmatic Unicode. The most important thing I took away from it was to decode to a Unicode string as soon as you have any input (file, network, etc.) and to encode to a binary string as late as possible, when giving output back to the user, system, etc. Another thing is to know what encoding has been used so you can decode and encode as necessary. These two points apply to all programming, whether it’s in Python or some other language.

Some more things to remember:

  • “\u2119” is a Unicode string containing one code point. Use lowercase \u with exactly four hex digits; use uppercase \U with exactly eight hex digits for code points that don’t fit in four (see the sketch after this list).
  • b“\xe2” is a binary (byte) string containing one byte.
  • Always try to use UTF-8.
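To make the first two points concrete, here is a minimal sketch of that decode-early, encode-late “sandwich” (the byte values are just the UTF-8 encoding of “\u2119”):

raw = b'\xe2\x84\x99'        # UTF-8 bytes from a file, socket, etc.
text = raw.decode('utf-8')   # decode as early as possible: now a str, 'ℙ'
assert text == '\u2119'      # lowercase \u takes exactly four hex digits
assert '\U0001F40D' == '🐍'  # uppercase \U takes exactly eight hex digits
data = text.encode('utf-8')  # encode as late as possible, just before output
assert data == raw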

I fixed a bug in yum

Well, it was a tiny syntax error in Fedora 17 alpha. For some reason Presto had a tiny typo where a hyphen/minus “-” was used instead of an underscore “_”. The full traceback gave me the file and line number where the error was. I just visually scanned the surrounding code to see what could be wrong. That’s when I realized that replacing the hyphen with an underscore would solve the problem. I made the change, ran su -c 'yum update' again, and the problem was resolved. So a big win for the technologies and people involved.

A win for Python because I was able to make a change in the source and then run the code without having to know about or go through the linking, compiling, executing process.

A win for open source because I was able to view and modify the source code.

A win for Fedora because they are using Python, which made it easy for me to fix a tiny bug instead of being stuck with broken updates.

I’d also like to add that I needed to do something similar for work. Something was broken and I had to read through some Ruby files. Although the syntax didn’t make too much sense to me, I was able to judge what changes I needed to make. I made them and worked around some bugs until they could be fixed. So hooray for Ruby as well.

A Python and Unicode Aha! Moment

EDIT (2013-03-08): Watch the presentation Pragmatic Unicode by Ned Batchelder and try to ignore this post. I wrote it when I had a lesser understanding of Unicode. In other words, this post is deprecated.

I get stumped every time I try to work with Unicode in Python. The biggest problems arise when trying to read files with Unicode data in them. Today was another day when I discovered that everything I know about Unicode is either completely misunderstood or forgotten. But after several hours of looking at various tutorials, code snippets, etc., I finally had my eureka moment.

When I write a text file with Unicode data in it, I always use the symbol (e.g. ㇹ) instead of its code (e.g. \u31f9). When I read this file in Python, I usually get some kind of error. I learned today that, for my sanity, I should use the code and not the symbol when writing Unicode in text files. But which code? I use the Unicode code points, and Unicode 4.0 / ISO 10646 Plane 0 has a great list of them. Now when I read Unicode from a file in Python, it reads without problem.

This ties into JSON as well. In your JSON text, instead of writing symbols as we see them, write the hexadecimal escape that computers see. I tried this technique with Python 3 on Windows 7 and Windows Server 2008 R2.
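As a quick illustration with Python 3’s json module (the “symbol” key is just a made-up name):

import json

# The six characters \u31f9 in the JSON text become the single character ㇹ.
parsed = json.loads('{"symbol": "\\u31f9"}')
print(parsed['symbol'])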

If you want to normalize Unicode data, use unicodedata. The function to use is normalize. I am still unclear on which supported “form” (‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’) to use in which situations. But through trial and error I have settled on NFC because it retains the actual character (unlike NFD) and does not replace compatibility characters with their equivalents (unlike NFKC and NFKD). You really do need to read more about unicodedata to understand what I mean.
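To see the difference, here is a minimal sketch comparing the forms on “é” (U+00E9), which NFD splits into a base letter plus a combining accent while NFC keeps as one composed character:

import unicodedata

s = '\u00e9'  # 'é' as a single composed code point
print([hex(ord(c)) for c in unicodedata.normalize('NFC', s)])  # ['0xe9']
print([hex(ord(c)) for c in unicodedata.normalize('NFD', s)])  # ['0x65', '0x301']
# NFKC/NFKD additionally replace compatibility characters with their equivalents:
print(unicodedata.normalize('NFKC', '\u2474'))  # the single character ⑴ becomes the three characters (1)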

But it’s really that simple. Use the hexadecimal escape codes when writing text files and use NFC to normalize the data when reading files. For example, if your file contains the following data:

\u2158\u31f9

Then your Python script should have something like:


import unicodedata
# NFC keeps the characters in their composed, canonical form
normalized_unicode = unicodedata.normalize('NFC', '\u2158\u31f9')
print(normalized_unicode)

And when you display the data, it will show up as:

⅘ㇹ

Why JSON?

I have been working on a tiny, lightweight automated testing framework that can be used to test both Linux and Windows environments, applications, etc. I have gotten a portion of it to a working state and it’s helping me get some Hyper-V automation done. The basic idea behind the framework is to use PowerShell on Windows and SSH on Linux to get things done. And yes, it’s built using Python 3. Anyway, the focus of this post is not the framework but JSON and why I chose it.

I needed a way to pass test suites to test controllers, run those tests, gather the results, and then report them back. Since XML is used quite often for passing data between components, I initially thought of using it. There were two main objectives I had in mind: easy for humans to read/write test suites, and easy for programmers to read/write XML in the framework. But as I came to comprehend the scale of the project, XML fell short.

Reading and writing XML is pretty easy for humans, but it can be cumbersome. XML allows a lot of freedom and flexibility in how you design your schema, but it’s too verbose for quickly writing things down. Reading and writing XML using Python, although not too difficult, was not something I had a lot of time to tinker with. Plus, I couldn’t wrap my head around the DOM, etc.

JSON, on the other hand, looked a lot like using list and dict in Python. So I gave it a try and came to love it. The biggest benefit for me was that, unlike with XML, there was almost no translation to do between Python and the data. A list of dicts in Python looks almost exactly the same as a list of dicts in JSON. So while I’m writing JSON, it’s no different from writing Python code. This got me up and running very quickly.

The Python 3 JSON library took me a couple of days to play around with and see how it would all fit in. I tried converting natural Python to JSON and back, as well as reading JSON from a file, writing it to a file, etc. In the end, it was all very simple.
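Here is a minimal sketch of that round trip (the suite structure, field names, and file name are made up for illustration):

import json

# A made-up test suite as plain Python data: a list of dicts maps
# one-to-one onto JSON's arrays and objects.
suite = [
    {'name': 'ping_host', 'target': 'server01', 'timeout': 30},
    {'name': 'check_service', 'target': 'server02', 'service': 'sshd'},
]

text = json.dumps(suite, indent=4)  # Python -> JSON string
assert json.loads(text) == suite    # JSON string -> Python, round-trips cleanly

with open('suite.json', 'w') as f:  # write JSON to a file...
    json.dump(suite, f, indent=4)
with open('suite.json') as f:       # ...and read it back
    assert json.load(f) == suite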

Formatting JSON and just looking at it also shows how easy debugging is. The verbosity of XML can make it difficult to quickly pinpoint where a problem lies. Yes, parsers can point out syntax mistakes, but semantic errors are easier (at least for me) to spot in JSON. It has also been the case that I make fewer mistakes writing JSON than writing XML, so there is even less debugging to do. JSON is very neat to look at when formatted properly.

Of course, for people already used to reading and writing XML, JSON might be a big jump. But I feel that with the simple rules JSON has, it won’t take long for any XML-trained user to pick up JSON.

To conclude, the biggest reasons for me to choose JSON were how easy it is to read, write, and debug. I don’t have to switch context between Python and XML and can simply stay in Python mode even when dealing with JSON. If you are starting a new project, give JSON a try. I’m sure it’ll impress you. But don’t try to shoehorn JSON in where XML would be the better choice (or vice versa). The right tool for the right job.

Introduction to Python subprocess module

First off, head over to subprocess — Subprocess management to get all the details. This post will try to provide a gentle introduction to subprocess and my experience using it. There will be some suggestions here that I *think* are correct but be careful when you implement them in your code. Also remember that I wrote and tested this code using Python 3.1 on Debian Squeeze.

First, I found it better to just use the Popen class and not the convenience functions provided; using it helped me get a better handle on what’s going on. Second, learn the difference between Popen.wait() and Popen.communicate(). wait() sets Popen.returncode but leaves the stdout and stderr pipes as they are. communicate() also sets Popen.returncode, but it returns stdout and stderr and closes the pipes, so you can’t use them again as stdin for another command.

Third, use the shlex module so that you don’t have to fight with the command string while creating the list to feed to args in Popen.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import shlex
import subprocess

# shlex.split() turns the command string into the argument list that
# Popen expects, handling the quoting inside the sed expression.
command_line = "sed -e 's/^import dev as settings_file$/import production as settings_file/' test -i"
command_to_run = shlex.split(command_line)
print(command_to_run)
command_run = subprocess.Popen(command_to_run, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
# communicate() waits for the process to finish, sets returncode, and
# returns (stdout, stderr) as byte strings, closing the pipes afterwards.
command_run_stdout, command_run_stderr = command_run.communicate()
print(command_run.returncode, command_run_stderr.decode('utf-8'))
print(command_run_stdout.decode('utf-8'))

The preceding code sample is pretty self-explanatory. I used shlex to create a list from my command string, a list then used in the Popen class. I set both stdout and stderr to send their output to pipes. command_run is an object representing the command I ran. Using communicate(), I get three things: returncode (set automatically), stdout (returned by communicate), and stderr (returned by communicate). Since command_run_stdout and command_run_stderr are byte strings, I decode them from UTF-8 before printing.

I will modify the preceding code so that I can use the stdout as stdin for another command.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import shlex
import subprocess

command_line = "ls -l"
command_to_run = shlex.split(command_line)
print(command_to_run)
command_run = subprocess.Popen(command_to_run, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
# wait() sets returncode but leaves the pipes open, so stdout can still be
# fed to another command. Beware: if a command writes more output than the
# pipe buffer holds, wait() can block forever; for large output, start the
# consumer first or use communicate().
command_run.wait()
print(command_run.returncode)
command_to_run_2 = ["grep", "-i", "TOTAL"]
command_run_2 = subprocess.Popen(command_to_run_2, stdin=command_run.stdout)
command_run.stdout.close()  # drop the parent's copy of the pipe
command_run_2.wait()        # let grep finish before the script exits

The biggest difference here was that I used wait() instead of communicate() so that I could use stdout as stdin for the second command.

If you are able to understand these things, I believe you are on your way to writing basic scripts that call out to external programs for the tasks they are best suited to do: running commands.

Delete Large List of Files

It all started when I was reading More Elegant Way To Delete Large List of Files? on reddit. Reading comments on the page led me to Perl to the rescue: case study of deleting a large directory. But me being a Python fan, I wasn’t satisfied with a Perl solution. My search led me to meeb’s comment on Quickest way to delete large amounts of files.

To summarize my quest for knowledge:

Using Perl: perl -e 'chdir "BADnew" or die; opendir D, "."; while ($n = readdir D) { unlink $n }'

Using Python:


#!/usr/bin/env python
import shutil
# Recursively delete the directory and everything inside it.
shutil.rmtree('/stuff/i/want/to/delete')

Using Bash:
Step 0: (optional) Create a list of files to delete (source: valadil’s comment and ensuing discussion). This step will help you figure out exactly what will be deleted.

find . -name "log*.xml" -exec echo rm -f {} \; > test_file;

Step 1: find . -type f -name "log*.xml" -print0 | xargs --null -n 100 rm
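And for symmetry, a minimal Python sketch that mirrors the Bash method, deleting only the files matching the pattern (the path is made up):

import glob
import os

# Delete only the files matching the pattern, one unlink per file.
for path in glob.iglob('/stuff/i/want/to/delete/log*.xml'):
    os.remove(path)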

If it were up to me, I would use the Bash method as it’s easier for me to understand.

Extract data from PostgreSQL dump file

After taking a database dump from PostgreSQL using pg_dump, you may want only the schema or only the data. This script was created and tested with Python 2.7 (Linux) and 3.2 (Windows), on a dump file from PostgreSQL 9.0 (Linux).

Usage is simple. Provide the input dump file with the -f flag and the output file with the -o flag; then choose either to extract the data with the -d flag or the schema with the -s flag. If you only want to extract data for certain tables, use the -t flag and provide a comma-separated list of table names. These table names must match exactly what’s in the dump file.
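For example, to pull just the data for two tables (assuming the script is saved as extract_pgsql.py; the file and table names here are made up): python extract_pgsql.py -f mydb.dump -o data.sql -d -t users,orders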

I hope you find this script useful and can modify/extend it to your needs. If you have ideas on how to make this code better, please do not hesitate to share your ideas.


from __future__ import print_function  # so print() behaves the same on Python 2.7 and 3.x
from re import search
import argparse
import codecs
import sys

script_version = '0.0.1'
parser = argparse.ArgumentParser(
    description='From a pgsql dump file, extract only the data to be inserted')
# The version= keyword to ArgumentParser is deprecated (and gone in Python
# 3.2+); add an explicit --version argument instead.
parser.add_argument('--version', action='version', version=script_version)
parser.add_argument('-f', '--file', metavar='in-file', action='store', 
    dest='in_file_name', type=str, required=True, 
    help='Name of pgsql dump file')
parser.add_argument('-o', '--out-file', metavar='out-file', action='store', 
    dest='out_file_name', type=str, required=True, 
    help='Name of output file')
parser.add_argument('-d', '--data-only', action="store_true", default=False, 
    dest='data_only', required=False, 
    help='''Only data is extracted and schema is ignored. 
    If not specified, then -s must be specified.''')
parser.add_argument('-t', '--table-list', metavar='table-name-list', action='store', 
    dest='table_name_list', type=str, required=False, 
    help='''Optional: Comma-separated list of table names to process. 
    Works only with -d flag.''')
parser.add_argument('-s', '--schema-only', action="store_true", default=False, 
    dest='schema_only', required=False, 
    help='''Only schema is extracted and data is ignored.
    If not specified, then -d must be specified.''')
args = parser.parse_args()

if args.data_only and args.schema_only:
    print("Error: You can't provide -d and -s flags at the same time; choose only one")
    sys.exit(1)
elif args.data_only:
    data_only = True
    schema_only = False
    start_copy = False  # flips to True when a COPY block begins
elif args.schema_only:
    data_only = False
    schema_only = True
    start_copy = True   # here start_copy means "currently outside a data block"
else:
    print('Error: Choose one of the -d and -s flags')
    sys.exit(1)

print('Processing File:', args.in_file_name)
input_file_name = args.in_file_name
output_file_name = args.out_file_name
table_name_list = args.table_name_list

if table_name_list:
    table_list = table_name_list.split(',')
else:
    table_list = None

outfile = codecs.open(output_file_name, "w", encoding="utf-8")
with codecs.open(input_file_name, "r", encoding="utf-8") as infile:
    for line in infile:
        if data_only:
            # A data section starts with a "COPY <table> ..." line and ends
            # with a line containing only "\.".
            if (not start_copy) and search('^COPY', line) and table_list:
                for table in table_list:
                    if search(''.join(['^COPY ', table.strip(), ' ']), line):
                        start_copy = True
                        outfile.write(line)
                        break
            elif (not start_copy) and search('^COPY', line) and not table_list:
                start_copy = True
                outfile.write(line)
            elif start_copy and search('^\\\.', line):  # the "\." terminator
                start_copy = False
                outfile.write(line)
            elif start_copy:
                outfile.write(line)
        elif schema_only:
            # Inverted logic: write everything *outside* the COPY blocks.
            if start_copy and search('^COPY', line):
                start_copy = False
            elif (not start_copy) and search('^\\\.', line):
                start_copy = True
            elif start_copy:
                outfile.write(line)
print('Done')
outfile.close()