Extract data from PostgreSQL dump file

After taking a database dump from PostgreSQL using pg_dump, you may want to only get the schema or only the data. This script has been created and tested using Python versions 2.7 (Linux) and 3.2 (Windows), using a dump file from PostgreSQL version 9.0 (Linux).

Usage is simple. Provide an input dump file with the -f flag; output file with -o flag; and then choose either to extract/export data with -d flag or schema with -s flag. If you only want to extract data for certain tables, use the -t flag and provide a comma-separated list of table names. These table names should match exactly with what’s in the dump file.

I hope you find this script useful and can modify/extend it to your needs. If you have ideas on how to make this code better, please do not hesitate to share your ideas.


from re import search
import argparse
import codecs

script_version='0.0.1'
parser = argparse.ArgumentParser(
    description='From a pgsql dump file, extract only the data to be inserted', 
    version=script_version)
parser.add_argument('-f', '--file', metavar='in-file', action='store', 
    dest='in_file_name', type=str, required=True, 
    help='Name of pgsql dump file')
parser.add_argument('-o', '--out-file', metavar='out-file', action='store', 
    dest='out_file_name', type=str, required=True, 
    help='Name of output file')
parser.add_argument('-d', '--data-only', action="store_true", default=False, 
    dest='data_only', required=False, 
    help='''Only data is extracted and schema is ignored. 
    If not specified, then -s must be specified.''')
parser.add_argument('-t', '--table-list', metavar='table-name-list', action='store', 
    dest='table_name_list', type=str, required=False, 
    help='''Optional: Command-separated list of table names to process. 
    Works only with -d flag.''')
parser.add_argument('-s', '--schema-only', action="store_true", default=False, 
    dest='schema_only', required=False, 
    help='''Only schema is extracted and data is ignored.
    If not specified, then -d must be specified.''')
args = parser.parse_args()

if args.data_only and args.schema_only:
    print ('Error: You can\'t provide -d and -s flags at the same time; choose only one')
    exit()
elif args.data_only:
    data_only = True
    schema_only = False
    start_copy = False
elif args.schema_only:
    data_only = False
    schema_only = True
    start_copy = True
else:
    print ('Error: Choose one of -d and -s flags')
    exit()

print ('Processing File:', args.in_file_name)
input_file_name = args.in_file_name
output_file_name = args.out_file_name
table_name_list = args.table_name_list

if table_name_list:
    table_list = table_name_list.split(',')
else:
    table_list = None

outfile = codecs.open(output_file_name, "w", encoding="utf-8")
with codecs.open(input_file_name, "r", encoding="utf-8") as infile:
    for line in infile:
        if data_only:
            if (not start_copy) and search('^COPY', line) and table_list:
                for table in table_list:
                    if search(''.join(['^COPY ', table.strip(), ' ']), line):
                        start_copy = True
                        outfile.write(line)
                        break
            elif (not start_copy) and search('^COPY', line) and not table_list:
                start_copy = True
                outfile.write(line)
            elif start_copy and search('^\\\.', line):
                start_copy = False
                outfile.write(line)
            elif start_copy:
                outfile.write(line)
        elif schema_only:
            if start_copy and search('^COPY', line):
                start_copy = False
            elif (not start_copy) and search('^\\\.', line):
                start_copy = True
            elif start_copy:
                outfile.write(line)
print ('Done')
outfile.close()

AES Encryption with Python

DO NOT USE THIS POST TO LEARN ABOUT AES ENCRYPTION IN PYTHON. I DID NOT HAVE SUFFICIENT EXPERIENCE WITH BYTES, STRINGS, AND ENCRYPTION WHEN I WROTE THIS. I WILL HAVE A NEW POST WITH PYTHON3 (AND HOPEFULLY IT WILL HAVE BETTER INFORMATION).

To get AES encryption working in your Python script, you need to install PyCrypto.

Fedora: sudo yum install python-crypto
Debian: sudo aptitude install python-crypto
openSUSE: sudo zypper install python-crypto

Now the script, which has been created and tested on Python 2.7.


from Crypto.Cipher import AES
from base64 import b64encode, b64decode
import os
from datetime import datetime
from re import sub

# AES is a block cipher so you need to define size of block.
# Valid options are 16, 24, and 32
BLOCK_SIZE = 32

# Your input has to fit into a block of BLOCK_SIZE.
# To make sure the last block to encrypt fits
# in the block, you may need to pad the input.
# This padding must later be removed after decryption so a standard padding would help.
# Based on advice from Using Padding in Encryption,
# the idea is to separate the padding into two concerns: interrupt and then pad
# First you insert an interrupt character and then a padding character
# On decryption, first you remove the padding character until 
# you reach the interrupt character
# and then you remove the interrupt character
INTERRUPT = u'\u0001'
PAD = u'\u0000'

# Since you need to pad your data before encryption, 
# create a padding function as well
# Similarly, create a function to strip off the padding after decryption
def AddPadding(data, interrupt, pad, block_size):
    new_data = ''.join([data, interrupt])
    new_data_len = len(new_data)
    remaining_len = block_size - new_data_len
    to_pad_len = remaining_len % block_size
    pad_string = pad * to_pad_len
    return ''.join([new_data, pad_string])
def StripPadding(data, interrupt, pad):
    return data.rstrip(pad).rstrip(interrupt)

# AES requires a shared key, which is used to encrypt and decrypt data
# It MUST be of length 16, 24, or 32
# Make sure it is as random as possible 
# (although the example below is certainly not random)
# Based on comments from lighthill,
# you should use os.urandom() or Crypto.Random to generate random secret key
# I also use the GRC Ultra High Security Password Generator to generate a secret key
SECRET_KEY = u'a1b2c3d4e5f6g7h8a1b2c3d4e5f6g7h8'

# Initialization Vector (IV) should also always be provided
# With the same key but different IV, the same data is encrypted differently
# IV is similar to a 'salt' used in hashing
# It MUST be of length 16
# Based on comments from lighthill,
# you should NEVER use the same IV if you use MODE_OFB
# In any case, especially if you are encrypting, say data to be store in a database,
# you should try to use a different IV for different data sets,
# even if you use the same secret key
IV = u'12345678abcdefgh'

# Now you must choose a 'mode'. Options are available from Module AES.
# Although the default is MODE_ECB, it's highly recommended not to use it.
# For more information on different modes, read Block cipher modes of operation.
# In this example, I had used MODE_OFB
# But based on comments from lighthill,
# I switched over to MODE_CBC, which seems quite popular

# Let's create our cipher objects
cipher_for_encryption = AES.new(SECRET_KEY, AES.MODE_OFB, IV)
cipher_for_decryption = AES.new(SECRET_KEY, AES.MODE_OFB, IV)
cipher_for_encryption = AES.new(SECRET_KEY, AES.MODE_CBC, IV)
cipher_for_decryption = AES.new(SECRET_KEY, AES.MODE_CBC, IV)

# So you now have cipher objects
# Each operation that you perform on these objects alters its state
# So mostly you would want to perform a single operation on it each time
# For encrypting something, create a cipher object and encrypt the data
# For decrypting, create another cipher object and pass it the data to be decrypted
# This is the reason I called the cipher objects 
# 'cipher_for_encryption' and 'cipher_for_decryption'
#
#
#
# You will want to create encryption and decryption functions 
# so that it's easier to encrypt and decrypt data
def EncryptWithAES(encrypt_cipher, plaintext_data):
    plaintext_padded = AddPadding(plaintext_data, INTERRUPT, PAD, BLOCK_SIZE)
    encrypted = encrypt_cipher.encrypt(plaintext_padded)
    return b64encode(encrypted)
def DecryptWithAES(decrypt_cipher, encrypted_data):
    decoded_encrypted_data = b64decode(encrypted_data)
    decrypted_data = decrypt_cipher.decrypt(decoded_encrypted_data)
    return StripPadding(decrypted_data, INTERRUPT, PAD)

# We are now ready to encrypt and decrypt our data
our_data_to_encrypt = u'123456789012345678901234567890abc'
encrypted_data = EncryptWithAES(cipher_for_encryption, our_data_to_encrypt)
print ('Encrypted string:', encrypted_data)

# And let's decrypt our data
decrypted_data = DecryptWithAES(cipher_for_decryption, encrypted_data)
print ('Decrypted string:', decrypted_data)

Hat Tips

This post would not have been possible without help from: AES Encryption in Python Using PyCrypto; Block cipher modes of operation; Symmetric Encryption with PyCrypto; AES encryption of files in Python with PyCrypto; Using Padding in Encryption; Strings (Dive into Python 3);

Python Script to Calculate Checksum of File

I was in desperate need of a utility to calculate checksums of certain files. I was running Windows, and did not want to install a third-party tool to do this for me. I did have Python3 installed and decided to write a script myself to give me checksum for a file. But the script grew bigger as I wanted more and more features. Eventually, the script does the following: (1) calculate and display checksum (MD5, SHA1, SHA224, SHA256, SHA384, SHA512) for a file; (2) calculate and display size (bytes, KB, MB, GB) in terms of storage (1024 bytes = 1 kilo byte or KB); and (3) calculate and display size (bits, Kb, Mb, Gb) in terms of data transfer over network. The script is as below. Actually, if you don’t want to know the internals of the script, just use it as a utility. You’ll need Python 3 and make sure you have argparse module installed (Debian: sudo aptitude install python-argparse).

# This script reads a file and displays its size and checksums
def get_custom_checksum(input_file_name):
    from datetime import datetime
    starttime = datetime.now()
	# START: Actual checksum calculation
    from hashlib import md5, sha1, sha224, sha384, sha256, sha512
    #chunk_size = 1 # 1 byte -- NOT RECOMENDED -- USE AT LEAST 1KB. When 1KB takes 1 min to run, 1B takes 19 minutes to run
    #chunk_size = 1024 # 1 KB
    chunk_size = 1048576 # 1024 B * 1024 B = 1048576 B = 1 MB
    file_md5_checksum = md5()
    file_sha1_checksum = sha1()
    file_sha224_checksum = sha224()
    file_sha256_checksum = sha256()
    file_sha384_checksum = sha384()
    file_sha512_checksum = sha512()
    try:
        with open(input_file_name, "rb") as f:
            byte = f.read(chunk_size)
            previous_byte = byte
            byte_size = len(byte)
            file_read_iterations = 1
            while byte:
                file_md5_checksum.update(byte)
                file_sha1_checksum.update(byte)
                file_sha224_checksum.update(byte)
                file_sha256_checksum.update(byte)
                file_sha384_checksum.update(byte)
                file_sha512_checksum.update(byte)
                previous_byte = byte
                byte = f.read(chunk_size)
                byte_size += len(byte)
                file_read_iterations += 1
    except IOError:
        print ('File could not be opened: %s' % (input_file_name))
        #exit()
        return
    except:
        raise
	# END: Actual checksum calculation
    # For storage purposes, 1024 bytes = 1 kilobyte
    # For data transfer purposes, 1000 bits = 1 kilobit
    kilo_byte_size = byte_size/1024
    mega_byte_size = kilo_byte_size/1024
    giga_byte_size = mega_byte_size/1024
    bit_size = byte_size*8
    kilo_bit_size = bit_size/1000
    mega_bit_size = kilo_bit_size/1000
    giga_bit_size = mega_bit_size/1000
    last_chunk_size = len(previous_byte)
    stoptime = datetime.now()
    processtime = stoptime-starttime
    custom_checksum_profile = {
        'starttime': starttime,
        'byte_size': byte_size,
        'kilo_byte_size': kilo_byte_size,
        'mega_byte_size': mega_byte_size,
        'giga_byte_size': giga_byte_size,
        'bit_size': bit_size,
        'kilo_bit_size': kilo_bit_size,
        'mega_bit_size': mega_bit_size,
        'giga_bit_size': giga_bit_size,
        'file_read_iterations': file_read_iterations,
        'last_chunk_size': last_chunk_size,
        'md5_checksum': file_md5_checksum.hexdigest(),
        'sha1_checksum': file_sha1_checksum.hexdigest(),
        'sha224_checksum': file_sha224_checksum.hexdigest(),
        'sha256_checksum': file_sha256_checksum.hexdigest(),
        'sha384_checksum': file_sha384_checksum.hexdigest(),
        'sha512_checksum': file_sha512_checksum.hexdigest(),
        'stoptime': stoptime,
        'processtime': processtime,
        }
    return custom_checksum_profile

    # Thanks to the following resources for helping with this function
    # http://stackoverflow.com/questions/1035340/reading-binary-file-in-python
    # http://abstracthack.wordpress.com/2007/10/19/calculating-md5-checksum/
    # http://www.speedguide.net/conversion.php

def print_custom_checksum(input_file_name):
    custom_checksum_profile = get_custom_checksum(input_file_name)
    try:
        print ('Start Time            :', custom_checksum_profile['starttime'])
        print ('File Size (bytes)     :', custom_checksum_profile['byte_size'])
        print ('File Size (kilobytes) :', custom_checksum_profile['kilo_byte_size'])
        print ('File Size (megabytes) :', custom_checksum_profile['mega_byte_size'])
        print ('File Size (gigabytes) :', custom_checksum_profile['giga_byte_size'])
        print ('File Size (bits)      :', custom_checksum_profile['bit_size'])
        print ('File Size (kilobits)  :', custom_checksum_profile['kilo_bit_size'])
        print ('File Size (megabits)  :', custom_checksum_profile['mega_bit_size'])
        print ('File Size (gigabits)  :', custom_checksum_profile['giga_bit_size'])
        print ('File Read Iterations  :', custom_checksum_profile['file_read_iterations'])
        print ('Last Chunk (bytes)    :', custom_checksum_profile['last_chunk_size'])
        print ('MD5                   :', custom_checksum_profile['md5_checksum'])
        print ('SHA1                  :', custom_checksum_profile['sha1_checksum'])
        print ('SHA224                :', custom_checksum_profile['sha224_checksum'])
        print ('SHA256                :', custom_checksum_profile['sha256_checksum'])
        print ('SHA384                :', custom_checksum_profile['sha384_checksum'])
        print ('SHA512                :', custom_checksum_profile['sha512_checksum'])
        print ('Stop Time             :', custom_checksum_profile['stoptime'])
        print ('Processing Time       :', custom_checksum_profile['processtime'])
    except TypeError: #  'NoneType' object is not subscriptable --- basically this should happen when the input file could not be opened
        #raise
        pass
    # csv output

import argparse
script_version='0.0.2'
parser = argparse.ArgumentParser(description='Determine and print various checksums of an input file and its size. Supported checksums are MD5, SHA1, SHA224, SHA256, SHA384, and SHA512.', version=script_version)
parser.add_argument('-f', '--file', metavar='in-file', action='store', dest='file_name', type=str, required=True, help='Name of file for which the checksum needs to be calculated')
args = parser.parse_args()
print ('Processing File       :', args.file_name)
print_custom_checksum(args.file_name)

# Thanks to the following resources for helping with argparse
# http://www.doughellmann.com/PyMOTW/argparse/

Gentle Introduction to Reading and Writing XML using Python

There are many ways to interact with XML using Python. Here I will provide a simple introduction to reading and writing XML using lxml.

Create (Write) XML

Here I will try to create a sample XML similar to how FreeSWITCH creates its extensions/users.

from lxml import etree
root = etree.Element("include")
comment1 = etree.Comment("This is a comment")
root.append(comment1)

First we create the root element, which is the include tag in this case. Then we add a comment to it.


user = etree.SubElement(root, "user")
user.set('id', '1000')

We create a user tag, which is a sub-element of the include tag. Using the set method, we have created a single attribute. The name of this attribute is id and its value is 1000.


params = etree.SubElement(user, "params")

Here we created a sub-element, params, of the user tag. Here params is a tag as well and does not have any attributes.


param = etree.SubElement(params, "param")
param.set('name', 'password')
param.set('value', '$${default_password}')

We create a sub-element of params called param. It has two attributes and their names are name and value. Their values are password and $${default_password} respectively.


param = etree.SubElement(params, "param")
param.set('name', 'vm-password')
param.set('value', '1000')

We create another sub-element of params with different attributes. This is to demonstrate that we can create as many sub-elements of a tag (element or sub-element) as required.

variables = etree.SubElement(user, "variables")

Here we created another sub-element, variables, of the user tag/element, similar to params.


variable = etree.SubElement(variables, "variable")
variable.set('name', 'toll_allow')
variable.set('value', 'domestic,international,local')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'accountcode')
variable.set('value', '1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'user_context')
variable.set('value', 'default')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'effective_caller_id_name')
variable.set('value', 'Extension 1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'effective_caller_id_number')
variable.set('value', '1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'outbound_caller_id_name')
variable.set('value', '$${outbound_caller_name}')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'outbound_caller_id_number')
variable.set('value', '$${outbound_caller_id}')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'callgroup')
variable.set('value', 'techsupport')
variable.text = 'This can contain data'

The above code creates a lot of different sub-elements, each called variable of the variables element/tag. Notice that we set some text in the .text at the end. All other variable tags do not have any “data” while the last one does. This is where I have moved away from the FreeSWITCH file because in it variable contains attributes and no “data”.


root_tree = etree.ElementTree(root)
print etree.tostring(root_tree, pretty_print=True)

Above we use the initial, root tag (include in this case) and traverse it to create a “tree”. All the tags we defined above are now in this tree structure. At the end we simply print the complete tree. The output should be similar to the one below.

<include>
  <!--This is a comment-->
  <user id="1000">
    <params>
      <param name="password" value="$${default_password}"/>
      <param name="vm-password" value="1000"/>
    </params>
    <variables>
      <variable name="toll_allow" value="domestic,international,local"/>
      <variable name="accountcode" value="1000"/>
      <variable name="user_context" value="default"/>
      <variable name="effective_caller_id_name" value="Extension 1000"/>
      <variable name="effective_caller_id_number" value="1000"/>
      <variable name="outbound_caller_id_name" value="$${outbound_caller_name}"/>
      <variable name="outbound_caller_id_number" value="$${outbound_caller_id}"/>
      <variable name="callgroup" value="techsupport">This can contain data</variable>
    </variables>
  </user>
</include>

Parse (Read) XML

Reading XML is very similar to writing it.

from lxml import etree
infile = open("1000.xml", 'r')

In the above code we open the XML file we created above (which we stored in file called 1000.xml in this case) for reading. If you’re running this on Python 3 then open it as read+binary, rb, instead of read-only.

context = etree.iterparse(infile, events=("start", "end"))

It’s a good idea to read an XML file iteratively so that if reading large files we do not store everything in memory at once. This reduces the memory requirements of reading large files. We have created an iterator which will read the file, infile. Since iterparse uses “events”, we are using two main events, namely start and end. “Start” occurs when a tag is encountered for the first time and “end” occurs when the tag is closed.


for event, element in context:
    print 'Event:', event
    print 'Element Tag:', element.tag
    print 'Element Text:', element.text
    print 'Element Items', element.items()
    print 'Previous Element', element.getprevious()
    print 'Parent Element', element.getparent()

In the above code we iterate over the XML file. The context iterator(?) returns two things on every pass: event (start or end in our case) and the element (or tag) read/encountered. The “element” object has some attributes and methods which we have used here:

  • tag contains the tag (include, user, params, variables, etc. in our example)
  • text contains any “data” the element might contain. In our case, the last variable contains data
  • items() returns a list containing attributes. These attributes have a name and a value. For example, each param contains two attributes with names name and value and their respective values
  • getprevious() returns the last element in the “tree”
  • Each element (or tag) in XML has exactly one parent and getparent() returns that tag (or element)

infile.close()

Finally, we close the input file. I will add one more thing: if you are searching for a particular tag (or element), you can provide it to iterparse like so: context = etree.iterparse(infile, events=("start", "end"), tag="param").

By running the above code on 1000.xml input file, you get output similar to the one provided below.

Event: start
Element Tag: include
Element Text:

Element Items []
Previous Element None
Parent Element None
Event: start
Element Tag: user
Element Text:

Element Items [(‘id’, ‘1000’)]
Previous Element <!–This is a comment–>
Parent Element <Element include at b7737784>
Event: start
Element Tag: params
Element Text:

Element Items []
Previous Element None
Parent Element <Element user at b77377ac>
Event: start
Element Tag: param
Element Text: None
Element Items [(‘name’, ‘password’), (‘value’, ‘$${default_password}’)]
Previous Element None
Parent Element <Element params at b77377d4>
Event: end
Element Tag: param
Element Text: None
Element Items [(‘name’, ‘password’), (‘value’, ‘$${default_password}’)]
Previous Element None
Parent Element <Element params at b77377d4>
Event: start
Element Tag: param
Element Text: None
Element Items [(‘name’, ‘vm-password’), (‘value’, ‘1000’)]
Previous Element <Element param at b77377fc>
Parent Element <Element params at b77377d4>
Event: end
Element Tag: param
Element Text: None
Element Items [(‘name’, ‘vm-password’), (‘value’, ‘1000’)]
Previous Element <Element param at b77377fc>
Parent Element <Element params at b77377d4>
Event: end
Element Tag: params
Element Text:

Element Items []
Previous Element None
Parent Element <Element user at b77377ac>
Event: start
Element Tag: variables
Element Text:

Element Items []
Previous Element <Element params at b77377d4>
Parent Element <Element user at b77377ac>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘toll_allow’), (‘value’, ‘domestic,international,local’)]
Previous Element None
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘toll_allow’), (‘value’, ‘domestic,international,local’)]
Previous Element None
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘accountcode’), (‘value’, ‘1000’)]
Previous Element <Element variable at b7737874>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘accountcode’), (‘value’, ‘1000’)]
Previous Element <Element variable at b7737874>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘user_context’), (‘value’, ‘default’)]
Previous Element <Element variable at b773789c>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘user_context’), (‘value’, ‘default’)]
Previous Element <Element variable at b773789c>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘effective_caller_id_name’), (‘value’, ‘Extension 1000’)]
Previous Element <Element variable at b77378c4>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘effective_caller_id_name’), (‘value’, ‘Extension 1000’)]
Previous Element <Element variable at b77378c4>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘effective_caller_id_number’), (‘value’, ‘1000’)]
Previous Element <Element variable at b77378ec>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘effective_caller_id_number’), (‘value’, ‘1000’)]
Previous Element <Element variable at b77378ec>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘outbound_caller_id_name’), (‘value’, ‘$${outbound_caller_name}’)]
Previous Element <Element variable at b7737914>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘outbound_caller_id_name’), (‘value’, ‘$${outbound_caller_name}’)]
Previous Element <Element variable at b7737914>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘outbound_caller_id_number’), (‘value’, ‘$${outbound_caller_id}’)]
Previous Element <Element variable at b773793c>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [(‘name’, ‘outbound_caller_id_number’), (‘value’, ‘$${outbound_caller_id}’)]
Previous Element <Element variable at b773793c>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: This can contain data
Element Items [(‘name’, ‘callgroup’), (‘value’, ‘techsupport’)]
Previous Element <Element variable at b7737964>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: This can contain data
Element Items [(‘name’, ‘callgroup’), (‘value’, ‘techsupport’)]
Previous Element <Element variable at b7737964>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variables
Element Text:

Element Items []
Previous Element <Element params at b77377d4>
Parent Element <Element user at b77377ac>
Event: end
Element Tag: user
Element Text:

Element Items [(‘id’, ‘1000’)]
Previous Element <!–This is a comment–>
Parent Element <Element include at b7737784>
Event: end
Element Tag: include
Element Text:

Element Items []
Previous Element None
Parent Element None

Hat Tips

I strongly recommend that you read up on XML if you are not familiar with it. I could not have written this post without the help of: Parsing XML and HTML with lxml; High-performance XML parsing in Python with lxml; The lxml.etree Tutorial; Write xml file using lxml library in Python; Changing the default indentation of etree.tostring in lxml

Relative Import in Python

I was writing a small application which needed to be separated into modules. For it to work on any machine, I wanted to use relative imports. My directory structure was as below:

mytopdirectory/
—__init__.py
—mymainscript.py
—country/
—__init__.py
——countryfile1
——countryfile2
——countryfile3
—process/
—__init__.py
——processfile1
——processfile2
——processfile3

In mymainscript.py, I had the following imports:

from country.countryfile1 import function1a
from country.countryfile2 import function2f
from process.processfile3 import function3z

In processfile3, I had the following imports:

from .country.countryfile3 import mydictionary

But I got the following error when running the mymainscript.py file: Attempted relative import beyond toplevel package. So I started looking at what the problem was.

I assumed that .country would work for files in process directory. But I learned that that’s not how it’s supposed to work. So I restructured the directory as below:

mytopdirectory/
—mymainscript.py
—lib/
——__init__.py
——country/
———__init__.py
———countryfile1
———countryfile2
———countryfile3
——process/
———__init__.py
———processfile1
———processfile2
———processfile3

I basically added another level between mymainscript.py, and country and process directories. Now mymainscript.py, had the following imports:

from lib.country.countryfile1 import function1a
from lib.country.countryfile2 import function2f
from lib.process.processfile3 import function3z

And processfile3 had the following imports:

from .. country.countryfile3 import mydictionary

In this way, mytopdirectory becomes a package whose main file is mymainscript.py. In this package, processfile1, processfile2, processfile3, countryfile1, countryfile2, and countryfile3 are six modules.

Hat tips: Path question; Modules; Can anyone explain python’s relative imports?;

Sort String Based on a Substring Key in Python

I have a scenario where I need to sort a file based on a substring in each line. The format of each line is fixed, each line has a fixed length (with some exceptions due to error, etc.), and the file has to be sorted based on a substring key. Example lines could be as below:

ABC800555121200000854112520100601002358000048
ABC800555548500000854112520100601070112000041
ABC800555751200000854112520100601125539000077

I tried to use sort in bash but since there was no delimiter to break the lines, the -k flag did not seem to work. It wouldn’t sort on the value of, say -k 22,36, but instead would sort from the first character. I wanted to give awk a try but didn’t have much time to look at it. Since I already work with Python, I thought why not create a function to sort a list of strings based on a substring. Following is the result:

def sort_on_substr(string_list, pos_start, pos_end=None):
    sort_key_list = []
    sort_key_dict = {}
    for one_string in string_list:
        if not pos_end:
            pos_end_here = len(one_string)
        elif pos_end > len(one_string) or pos_end  len(one_string):
            pos_start_here = len(one_string)
            pos_end_here = pos_start_here
        else:
            pos_start_here = pos_start
        sort_key = one_string[pos_start_here:pos_end_here]
        sort_key_list.append(sort_key)
        if sort_key in sort_key_dict:
            templist = sort_key_dict[sort_key]
            templist.append(one_string)
            sort_key_dict[sort_key] = templist
        else:
            templist = [one_string]
            sort_key_dict[sort_key] = templist
    sort_key_list_sorted = sorted(sort_key_list)
    return (sort_key_list_sorted, sort_key_dict)

To use this function you need to give a list of strings to sort, starting position of substring and ending position of substring (optional). It creates and returns two things: (1) a sorted list containing the substring keys; (2) a dictionary containing the string as value with the substring key as a key. Now you can iterate over the sorted list and stick it in the dictionary as a key to get the value. In my use case there could be one key in more than one strings so the dictionary stores the value as a list of strings. For me the order of this list does not matter because if the key is the same then the strings are equal and any one of them could come after the other in a sorted list.

An alternative to this approach is to forgo the sorted list of keys and instead sort the dictionary itself. Let’s leave it as an exercise for you in case you want to implement it that way.

Now all I have to do is supply a list of strings (maybe read from a file) and then give positions for the key. The result is a sorted list which I can then use as required.