A Python and Unicode Ahaa! Moment

I get stumped every time I try to work with Unicode in Python. The biggest problems arise when trying to read files with Unicode data in them. Today was again a day when I found out that everything I know about Unicode I have either completely misunderstood or forgotten. But after several hours of looking at various tutorials, code snippets, etc., I finally got my eureka moment.

When I write a text file with Unicode data in it, I always use the symbol (e.g. ㇹ) instead of its code (e.g. \u31f9). When I read this file in Python, I usually get some kind of error. I learned today that for my sanity I should use the code and not the symbol when writing Unicode in text files. But which code? I use the hexadecimal Unicode code point (the number in the \uXXXX escape, which is not the same thing as the UTF-8 byte sequence), and Unicode 4.0 / ISO 10646 Plane 0 has a great list of them. Now when I read Unicode from a file in Python, it reads it without a problem.

This ties into JSON as well. In your JSON text, instead of writing symbols as we see them, write the hexadecimal code that computers see. I tried this technique with Python 3 on Windows 7 and Windows 2008 R2.
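A minimal sketch of the idea in Python 3. It assumes the text holds only ASCII characters plus \uXXXX escapes; the variable names are made up for illustration. The json module writes the escapes by default and converts them back on load, and the unicode_escape codec does the same for plain text that is not JSON.

```python
import json

symbols = '\u2158\u31f9'  # the characters ⅘ and ㇹ

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True)
escaped = json.dumps(symbols)
print(escaped)

# json.loads turns the escapes back into the real characters
restored = json.loads(escaped)

# For plain text (not JSON) that holds literal \uXXXX escapes,
# the unicode_escape codec performs the same conversion.
# Assumption: the text contains only ASCII characters and escapes.
plain_text = r'\u2158\u31f9'
decoded = plain_text.encode('ascii').decode('unicode_escape')
print(restored == decoded)
```

Text read from a file arrives as the literal six-character escapes, so a decoding step like one of these is what turns it back into symbols.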

If you want to normalize Unicode data, use the unicodedata module. The function to use is normalize. I am still unclear on which supported “form” (‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’) to use in which situations. But through trial and error I have settled on NFC because it retains the actual character (unlike NFD, which splits it into a base character plus combining marks) and does not substitute the compatibility character with its equivalent (unlike NFKC and NFKD). You really do need to read more about the unicodedata module to understand what I mean.

But it’s really that simple. Use hexadecimal Unicode escape codes when writing text files and use NFC when reading files to normalize data. For example, if your file contains the following data:

\u2158\u31f9

Then your Python script should have something like:


import unicodedata
normalized_unicode = unicodedata.normalize('NFC', '\u2158\u31f9')

And when you display the data, it will show up as:

⅘ㇹ
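To see how the four forms differ, here is a small sketch that normalizes the same text with each form and prints the resulting code points. The extra é (U+00E9) is my addition to the example, included because it is a character that NFD and NFKD split into a base letter plus a combining accent.

```python
import unicodedata

text = '\u2158\u31f9\u00e9'  # ⅘, ㇹ, and é as a single precomposed character

forms = {form: unicodedata.normalize(form, text)
         for form in ('NFC', 'NFD', 'NFKC', 'NFKD')}

for form, value in forms.items():
    # Show each result as a list of code points so the differences are visible
    print(form, [hex(ord(ch)) for ch in value])
```

NFC leaves the text untouched, NFD splits é into two code points, and the two compatibility forms replace ⅘ with the three characters 4, fraction slash (U+2044), and 5.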

Generate HTML and PDF from DocBook in Fedora

DocBook is a widely-used format for writing documentation, articles, books, etc. For my purposes, I needed to generate XHTML and PDF files from documentation in DocBook format on a Fedora 16 server.

Install

You need to install the following packages.

sudo yum install libxslt docbook5-style-xsl docbook-utils

Convert single DocBook file to XHTML

Now comes the conversion. Run xsltproc as below and it will create an HTML file (mybook.html in this case) in the current directory.

xsltproc -o mybook.html /usr/share/sgml/docbook/xsl-ns-stylesheets/xhtml-1_1/docbook.xsl mydocbook.xml

You can explore the /usr/share/sgml/docbook/xsl-ns-stylesheets/ path for more options.

Convert modular DocBook file to XHTML

You can create a modular DocBook document (a book in my case) by separating out chapters of the book into separate files and including them in the main file. For example, there’s only one chapter in my book so I’ll have two files: docbook.book.xml and docbook.chapter.xml. These two files would look something like the following:

An example of file docbook.book.xml

<?xml version="1.0" encoding="UTF-8"?>
<book xml:id="wikiply_doc" xmlns="http://docbook.org/ns/docbook" version="5.0" xmlns:xi="http://www.w3.org/2001/XInclude">
    <title>Sample Book</title>
    <!-- DocBook 5.0 uses a single info element in place of DocBook 4's bookinfo -->
    <info>
        <author>
            <personname><firstname>Code</firstname><surname>Ghar</surname></personname>
        </author>
        <legalnotice>
            <para>Copyright 2011-2012 Code Ghar. All rights reserved.</para>
            <para>Redistribution and use in source (SGML DocBook) and 'compiled' forms (SGML, HTML, PDF, PostScript, RTF and so forth) with or without modification, are permitted.</para>
        </legalnotice>
        <copyright><year>2012</year><holder>Code Ghar</holder></copyright>
    </info>
    <xi:include href="docbook.chapter.xml" />
</book>

An example of file docbook.chapter.xml

<?xml version="1.0" encoding="UTF-8"?>
<chapter xml:id="installation" xmlns="http://docbook.org/ns/docbook" version="5.0" >
    <title>Sample Chapter</title>
    <section xml:id="sample_chapter">
        <title>Sample Chapter</title>
        <para>This is example text in sample chapter</para>
    </section>
</chapter>

Run xsltproc as below and it will create an HTML file (mybook.html in this case) in the current directory from both files.

xsltproc -xinclude -o mybook.html /usr/share/sgml/docbook/xsl-ns-stylesheets/xhtml-1_1/docbook.xsl docbook.book.xml

Note the use of the -xinclude flag in the command and the xi:include XML tag in the docbook.book.xml file. These two things make the magic of modular DocBook possible.

bash alias

Since I work with a DocBook book often, I have created a bash alias as below:

alias dbtohtml="xsltproc -xinclude -o /home/codeghar/book/mybook.html /usr/share/sgml/docbook/xsl-ns-stylesheets/xhtml-1_1/docbook.xsl /home/codeghar/book/docbook.book.xml; sed -e 's/</\n</g' -e 's/<meta name/\n<meta http-equiv=\"Content-Type\" content=\"text\/html; charset=utf-8\" \/> \n <meta name/g' -i /home/codeghar/book/mybook.html"

The generated file does not have the HTML meta tag that identifies it as UTF-8, so space characters display as a stray character (typically Â, the usual symptom of UTF-8 non-breaking spaces being read as Latin-1) in the web browser. Therefore, sed is used to insert the appropriate meta tag into the file.
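If you would rather do the meta tag fix in Python than in sed, something like the following should work. It is only a rough equivalent of the second sed expression: it inserts the tag once, before the first `<meta name`, and the function name is my own invention.

```python
META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'

def add_charset_meta(html):
    """Insert the charset meta tag before the first <meta name ...> tag."""
    marker = '<meta name'
    # Leave the text alone if the tag is already there or there is no anchor
    if META_TAG in html or marker not in html:
        return html
    return html.replace(marker, META_TAG + '\n' + marker, 1)

sample = '<head><meta name="generator" content="DocBook XSL Stylesheets" /></head>'
fixed = add_charset_meta(sample)
print(fixed)
```

To apply it to the generated file, read mybook.html with encoding="utf-8", pass the text through add_charset_meta, and write it back.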

Convert DocBook to PDF

Using the same example files (docbook.book.xml and docbook.chapter.xml), we will create a PDF instead of an XHTML file.

You need to install Apache FOP.

sudo yum install fop

Next you need to create an intermediate file (mybook.fo) as below.

xsltproc -xinclude -o mybook.fo /usr/share/sgml/docbook/xsl-ns-stylesheets/fo/docbook.xsl docbook.book.xml

Finally, run the following command to create the PDF file:

fop mybook.fo -pdf mybook.pdf
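The two PDF steps can also be chained from a small Python script. This is a sketch that assumes xsltproc and fop are on the PATH and that the stylesheets live in the Fedora location used above; the function names are hypothetical.

```python
import subprocess

# Stylesheet location as installed on Fedora in this post
STYLESHEET_DIR = '/usr/share/sgml/docbook/xsl-ns-stylesheets'

def pdf_commands(book_xml, basename):
    """Return the xsltproc and fop command lists for building basename.pdf."""
    fo_file = basename + '.fo'
    return [
        ['xsltproc', '-xinclude', '-o', fo_file,
         STYLESHEET_DIR + '/fo/docbook.xsl', book_xml],
        ['fop', fo_file, '-pdf', basename + '.pdf'],
    ]

def build_pdf(book_xml, basename):
    """Run both steps in order; check_call raises if either command fails."""
    for command in pdf_commands(book_xml, basename):
        subprocess.check_call(command)

for command in pdf_commands('docbook.book.xml', 'mybook'):
    print(' '.join(command))
```

Calling build_pdf('docbook.book.xml', 'mybook') would then produce mybook.fo and mybook.pdf in the current directory.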

Hat Tips

DocBook Ubuntu Documentation; How to generate pdf from docbook 5.0; Getting Started with Docbook Book Authoring on Ubuntu; Writing Documentation; Playing With DocBook 5.0

Introduction to Python subprocess module

First off, head over to subprocess — Subprocess management to get all the details. This post will try to provide a gentle introduction to subprocess and my experience using it. There will be some suggestions here that I *think* are correct but be careful when you implement them in your code. Also remember that I wrote and tested this code using Python 3.1 on Debian Squeeze.

First, I found it better to just use the Popen class and not the convenience functions provided. Using it helped me get a better handle on what’s going on. Second, learn the difference between Popen.wait() and Popen.communicate(). wait() waits for the command to finish and sets Popen.returncode but leaves the stdout and stderr pipes as they are; if nothing reads those pipes and the command produces a lot of output, wait() can block forever. communicate() also sets Popen.returncode, but it reads and returns stdout and stderr and closes the pipes, so you can’t use them again as stdin for another command.

Third, use the shlex module so that you don’t have to fight with the command while creating a list to feed to args in Popen.
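A quick sketch of what shlex buys you, compared with a plain str.split(). The file name settings.py is only a placeholder.

```python
import shlex

command_line = "sed -e 's/^import dev$/import production/' settings.py -i"

# shlex.split follows shell quoting rules, so the quoted sed
# expression stays together as a single argument
args = shlex.split(command_line)
print(args)

# A plain split breaks the expression apart at every space
naive = command_line.split()
print(naive)
```

The shlex result has five arguments, with the quotes removed from the sed expression; the naive split yields seven pieces, three of them fragments of the expression.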


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import subprocess
import shlex
command_line = "sed -e 's/^import dev as settings_file$/import production as settings_file/' test -i"
# shlex.split follows shell quoting, so the sed expression stays one argument
command_to_run = shlex.split(command_line)
print(command_to_run)
# Send both output streams to pipes so they can be read from Python
command_run = subprocess.Popen(command_to_run, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() waits for the command, sets returncode, and returns both streams as bytes
command_run_stdout, command_run_stderr = command_run.communicate()
print(command_run.returncode, command_run_stderr.decode('utf-8'))
print(command_run_stdout.decode('utf-8'))

The preceding code sample is pretty self-explanatory. I used shlex to create a list from my command string, a list used in the Popen class. I set both stdout and stderr to send their output to pipes. command_run is an object representing the command I ran. Using communicate(), I get three things: returncode (set automatically), stdout (returned by communicate), and stderr (returned by communicate). Since command_run_stdout and command_run_stderr are byte strings, I decode them from UTF-8 before printing.

I will modify the preceding code so that I can use the stdout of one command as stdin for another command.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import subprocess
import shlex
command_line = "ls -l"
command_to_run = shlex.split(command_line)
print(command_to_run)
# Only stdout is piped; an unread stderr pipe could fill up and block the command
command_run = subprocess.Popen(command_to_run, stdout=subprocess.PIPE)
command_to_run_2 = ["grep", "-i", "TOTAL"]
# Start the second command before waiting so that something is reading the pipe
command_run_2 = subprocess.Popen(command_to_run_2, stdin=command_run.stdout)
# Close our copy of the pipe so the first command notices if grep exits early
command_run.stdout.close()
command_run_2.wait()
command_run.wait()
print(command_run.returncode)

The biggest difference here was that I used wait() instead of communicate() so that I could use stdout as stdin for the second command.
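Here is a sketch of the same two-command pipeline wrapped in a function that also captures the final output. To keep it runnable anywhere, the two commands are small Python one-liners standing in for ls -l and grep -i TOTAL; the function name and the stand-in commands are my own.

```python
import subprocess
import sys

def run_pipeline(first_command, second_command):
    """Pipe first_command into second_command; return both return codes and the output."""
    first = subprocess.Popen(first_command, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_command, stdin=first.stdout,
                              stdout=subprocess.PIPE)
    # Close the parent's copy so the first command notices if the second exits early
    first.stdout.close()
    # communicate() is safe here because only the last process in the chain is read
    output, _ = second.communicate()
    first.wait()
    return first.returncode, second.returncode, output.decode('utf-8')

# Stand-ins for "ls -l" and "grep -i TOTAL" that work wherever Python runs
producer = [sys.executable, '-c', "print('total 3'); print('drwxr-xr-x docs')"]
consumer = [sys.executable, '-c',
            "import sys\n"
            "for line in sys.stdin:\n"
            "    if 'total' in line.lower():\n"
            "        sys.stdout.write(line)"]

result = run_pipeline(producer, consumer)
print(result)
```

The first process only ever hands its pipe to the second one, and communicate() is used once, on the last process in the chain.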

If you are able to understand these things, I believe you are on your way to writing basic scripts that call out to the shell to do some tasks it’s best suited to do: run commands.

Delete Large List of Files

It all started when I was reading More Elegant Way To Delete Large List of Files? on reddit. Reading comments on the page led me to Perl to the rescue: case study of deleting a large directory. But me being a Python fan, I wasn’t satisfied with a Perl solution. My search led me to meeb’s comment on Quickest way to delete large amounts of files.

To summarize my quest for knowledge:

Using Perl: perl -e 'chdir "BADnew" or die; opendir D, "."; while ($n = readdir D) { unlink $n }'

Using Python:


#!/usr/bin/env python
import shutil
shutil.rmtree('/stuff/i/want/to/delete')

Using Bash:
Step 0: (optional) Create a list of files to delete (source: valadil’s comment and ensuing discussion). This step will help you figure out exactly what will be deleted.

find . -name "log*.xml" -exec echo rm -f {} \; > test_file;

Step 1: find . -type f -name "log*.xml" -print0 | xargs --null -n 100 rm

If it were up to me, I would use the Bash method as it’s easier for me to understand.
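Note that the Python snippet above removes a whole directory tree, while the Bash steps only remove files matching a pattern. A Python sketch closer to the Bash version, with a dry run that plays the role of Step 0, might look like this. The function name and its defaults are assumptions of mine.

```python
from pathlib import Path

def delete_matching(root, pattern="log*.xml", dry_run=True):
    """Find files under root whose names match pattern; delete them unless dry_run."""
    # rglob searches root and all of its subdirectories, like find does
    matched = [path for path in Path(root).rglob(pattern) if path.is_file()]
    if not dry_run:
        for path in matched:
            path.unlink()
    return matched
```

Call delete_matching('.') first to review the list of files, then delete_matching('.', dry_run=False) to actually remove them.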

Extract data from PostgreSQL dump file

After taking a database dump from PostgreSQL using pg_dump, you may want to only get the schema or only the data. This script has been created and tested using Python versions 2.7 (Linux) and 3.2 (Windows), using a dump file from PostgreSQL version 9.0 (Linux).

Usage is simple. Provide an input dump file with the -f flag; output file with -o flag; and then choose either to extract/export data with -d flag or schema with -s flag. If you only want to extract data for certain tables, use the -t flag and provide a comma-separated list of table names. These table names should match exactly with what’s in the dump file.

I hope you find this script useful and can modify/extend it to your needs. If you have ideas on how to make this code better, please do not hesitate to share your ideas.


from re import search
import argparse
import codecs

script_version='0.0.1'
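parser = argparse.ArgumentParser(
    description='From a pgsql dump file, extract only the data to be inserted')
# The version= keyword of ArgumentParser is not accepted by current Python 3
# releases; an explicit version action works on both 2.7 and 3.x.
parser.add_argument('-v', '--version', action='version', version=script_version)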
parser.add_argument('-f', '--file', metavar='in-file', action='store', 
    dest='in_file_name', type=str, required=True, 
    help='Name of pgsql dump file')
parser.add_argument('-o', '--out-file', metavar='out-file', action='store', 
    dest='out_file_name', type=str, required=True, 
    help='Name of output file')
parser.add_argument('-d', '--data-only', action="store_true", default=False, 
    dest='data_only', required=False, 
    help='''Only data is extracted and schema is ignored. 
    If not specified, then -s must be specified.''')
parser.add_argument('-t', '--table-list', metavar='table-name-list', action='store', 
    dest='table_name_list', type=str, required=False, 
    help='''Optional: Comma-separated list of table names to process. 
    Works only with -d flag.''')
parser.add_argument('-s', '--schema-only', action="store_true", default=False, 
    dest='schema_only', required=False, 
    help='''Only schema is extracted and data is ignored.
    If not specified, then -d must be specified.''')
args = parser.parse_args()

if args.data_only and args.schema_only:
    print ('Error: You can\'t provide -d and -s flags at the same time; choose only one')
    exit()
elif args.data_only:
    data_only = True
    schema_only = False
    start_copy = False
elif args.schema_only:
    data_only = False
    schema_only = True
    start_copy = True
else:
    print ('Error: Choose one of -d and -s flags')
    exit()

print ('Processing File:', args.in_file_name)
input_file_name = args.in_file_name
output_file_name = args.out_file_name
table_name_list = args.table_name_list

if table_name_list:
    table_list = table_name_list.split(',')
else:
    table_list = None

outfile = codecs.open(output_file_name, "w", encoding="utf-8")
with codecs.open(input_file_name, "r", encoding="utf-8") as infile:
    for line in infile:
        if data_only:
            if (not start_copy) and search('^COPY', line) and table_list:
                for table in table_list:
                    if search(''.join(['^COPY ', table.strip(), ' ']), line):
                        start_copy = True
                        outfile.write(line)
                        break
            elif (not start_copy) and search('^COPY', line) and not table_list:
                start_copy = True
                outfile.write(line)
            elif start_copy and search(r'^\\\.', line):  # raw string: a literal backslash and dot
                start_copy = False
                outfile.write(line)
            elif start_copy:
                outfile.write(line)
        elif schema_only:
            if start_copy and search('^COPY', line):
                start_copy = False
            elif (not start_copy) and search(r'^\\\.', line):  # raw string: a literal backslash and dot
                start_copy = True
            elif start_copy:
                outfile.write(line)
print ('Done')
outfile.close()
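The heart of the data-only mode is a small state machine: start copying at a COPY line, stop at the line holding only a backslash and a dot. Here is that logic as a standalone sketch that can be tried on a few lines in memory. The generator name, the sample table names, and the use of a set for the table filter are my own choices, not part of the script above.

```python
import re

COPY_START = re.compile(r'^COPY (\S+) ')
COPY_END = re.compile(r'^\\\.')  # a literal backslash followed by a literal dot

def extract_copy_blocks(lines, tables=None):
    """Yield the lines of each COPY block, optionally only for the named tables."""
    inside = False
    for line in lines:
        if not inside:
            match = COPY_START.match(line)
            if match and (tables is None or match.group(1) in tables):
                inside = True
                yield line
        else:
            yield line
            if COPY_END.match(line):
                inside = False

# A tiny made-up dump; \N is how a COPY block writes NULL
sample = [
    "CREATE TABLE users (id integer, name text);\n",
    "COPY users (id, name) FROM stdin;\n",
    "1\talice\n",
    "\\N\tbob\n",
    "\\.\n",
    "COPY logs (id) FROM stdin;\n",
    "7\n",
    "\\.\n",
]

users_only = list(extract_copy_blocks(sample, tables={"users"}))
print(''.join(users_only))
```

The row that starts with \N shows why the dot in the end-marker pattern has to be escaped: an unescaped dot would treat that row as the end of the block.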

AES Encryption with Python

To get AES encryption working in your Python script, you need to install PyCrypto.

Fedora: sudo yum install python-crypto
Debian: sudo aptitude install python-crypto
openSUSE: sudo zypper install python-crypto

Now the script, which has been created and tested on Python 2.7.


from Crypto.Cipher import AES
from base64 import b64encode, b64decode
import os
from datetime import datetime
from re import sub

# AES is a block cipher with a fixed block size of 16 bytes.
# (16, 24, and 32 are the valid key lengths, not block sizes.)
BLOCK_SIZE = 16

# Your input has to fit into a block of BLOCK_SIZE.
# To make sure the last block to encrypt fits
# in the block, you may need to pad the input.
# This padding must later be removed after decryption so a standard padding would help.
# Based on advice from Using Padding in Encryption,
# the idea is to separate the padding into two concerns: interrupt and then pad
# First you insert an interrupt character and then a padding character
# On decryption, first you remove the padding character until 
# you reach the interrupt character
# and then you remove the interrupt character
INTERRUPT = u'\u0001'
PAD = u'\u0000'

# Since you need to pad your data before encryption, 
# create a padding function as well
# Similarly, create a function to strip off the padding after decryption
def AddPadding(data, interrupt, pad, block_size):
    new_data = ''.join([data, interrupt])
    new_data_len = len(new_data)
    remaining_len = block_size - new_data_len
    to_pad_len = remaining_len % block_size
    pad_string = pad * to_pad_len
    return ''.join([new_data, pad_string])
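def StripPadding(data, interrupt, pad):
    # Remove the pad characters, then exactly one interrupt character,
    # so data that itself ends in the interrupt character is not damaged
    stripped = data.rstrip(pad)
    if stripped.endswith(interrupt):
        stripped = stripped[:-len(interrupt)]
    return stripped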

# AES requires a shared key, which is used to encrypt and decrypt data
# It MUST be of length 16, 24, or 32
# Make sure it is as random as possible 
# (although the example below is certainly not random)
# Based on comments from lighthill,
# you should use os.urandom() or Crypto.Random to generate random secret key
# I also use the GRC Ultra High Security Password Generator to generate a secret key
SECRET_KEY = u'a1b2c3d4e5f6g7h8a1b2c3d4e5f6g7h8'

# Initialization Vector (IV) should also always be provided
# With the same key but different IV, the same data is encrypted differently
# IV is similar to a 'salt' used in hashing
# It MUST be of length 16
# Based on comments from lighthill,
# you should NEVER use the same IV if you use MODE_OFB
# In any case, especially if you are encrypting, say, data to be stored in a database,
# you should try to use a different IV for different data sets,
# even if you use the same secret key
IV = u'12345678abcdefgh'

# Now you must choose a 'mode'. Options are available from Module AES.
# Although the default is MODE_ECB, it's highly recommended not to use it.
# For more information on different modes, read Block cipher modes of operation.
# In this example, I had used MODE_OFB
# But based on comments from lighthill,
# I switched over to MODE_CBC, which seems quite popular

# Let's create our cipher objects (MODE_CBC replaces the MODE_OFB used originally)
cipher_for_encryption = AES.new(SECRET_KEY, AES.MODE_CBC, IV)
cipher_for_decryption = AES.new(SECRET_KEY, AES.MODE_CBC, IV)

# So you now have cipher objects
# Each operation that you perform on these objects alters its state
# So mostly you would want to perform a single operation on it each time
# For encrypting something, create a cipher object and encrypt the data
# For decrypting, create another cipher object and pass it the data to be decrypted
# This is the reason I called the cipher objects 
# 'cipher_for_encryption' and 'cipher_for_decryption'
#
# You will want to create encryption and decryption functions 
# so that it's easier to encrypt and decrypt data
def EncryptWithAES(encrypt_cipher, plaintext_data):
    plaintext_padded = AddPadding(plaintext_data, INTERRUPT, PAD, BLOCK_SIZE)
    encrypted = encrypt_cipher.encrypt(plaintext_padded)
    return b64encode(encrypted)
def DecryptWithAES(decrypt_cipher, encrypted_data):
    decoded_encrypted_data = b64decode(encrypted_data)
    decrypted_data = decrypt_cipher.decrypt(decoded_encrypted_data)
    return StripPadding(decrypted_data, INTERRUPT, PAD)

# We are now ready to encrypt and decrypt our data
our_data_to_encrypt = u'123456789012345678901234567890abc'
encrypted_data = EncryptWithAES(cipher_for_encryption, our_data_to_encrypt)
print ('Encrypted string:', encrypted_data)

# And let's decrypt our data
decrypted_data = DecryptWithAES(cipher_for_decryption, encrypted_data)
print ('Decrypted string:', decrypted_data)
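The interrupt-then-pad scheme from the comments can be tried on its own, without PyCrypto. The following is a Python 3 sketch working on bytes, with the key and IV generated by os.urandom() as lighthill suggested. The lowercase function names are mine, and nothing here performs any encryption.

```python
import os

BLOCK_SIZE = 16       # AES block size in bytes
INTERRUPT = b'\x01'
PAD = b'\x00'

def add_padding(data, block_size=BLOCK_SIZE):
    """Append the interrupt byte, then pad bytes up to a multiple of block_size."""
    padded = data + INTERRUPT
    return padded + PAD * (-len(padded) % block_size)

def strip_padding(data):
    """Remove the pad bytes, then exactly one interrupt byte."""
    stripped = data.rstrip(PAD)
    if stripped.endswith(INTERRUPT):
        stripped = stripped[:-1]
    return stripped

# Random key and IV of the lengths AES expects (32-byte key, 16-byte IV)
secret_key = os.urandom(32)
iv = os.urandom(16)

message = '123456789012345678901234567890abc'.encode('utf-8')
padded = add_padding(message)
print(len(message), len(padded))
```

The 33-byte message grows to 48 bytes, a multiple of the block size, and strip_padding(padded) gives back the original bytes.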

Hat Tips

This post would not have been possible without help from: AES Encryption in Python Using PyCrypto; Block cipher modes of operation; Symmetric Encryption with PyCrypto; AES encryption of files in Python with PyCrypto; Using Padding in Encryption; Strings (Dive into Python 3).
