Gentle Introduction to Reading and Writing XML using Python

There are many ways to interact with XML using Python. Here I will provide a simple introduction to reading and writing XML using lxml.

Create (Write) XML

Here I will try to create a sample XML similar to how FreeSWITCH creates its extensions/users.

from lxml import etree
root = etree.Element("include")
comment1 = etree.Comment("This is a comment")
root.append(comment1)

First we create the root element, which is the include tag in this case. Then we add a comment to it.


user = etree.SubElement(root, "user")
user.set('id', '1000')

We create a user element as a sub-element of include. Using the set method, we give it a single attribute whose name is id and whose value is 1000. Attribute values must be strings, which is why 1000 is quoted.
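As a side note, attributes do not have to be added one set() call at a time. The sketch below shows two alternatives that SubElement accepts: keyword arguments and a dictionary. The try/except import is my own addition so the snippet also runs where lxml is not installed (the standard library's ElementTree shares this part of the API), and the number-alias attribute name is only a hypothetical example of a name that is not a valid Python identifier.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which shares this API
    import xml.etree.ElementTree as etree

root = etree.Element("include")

# Attribute set after creation, as in the walkthrough.
user_a = etree.SubElement(root, "user")
user_a.set("id", "1000")

# The same attribute passed as a keyword argument at creation time.
user_b = etree.SubElement(root, "user", id="1000")

# Names that are not valid Python identifiers (here a hypothetical
# hyphenated name) can be passed in a dictionary instead.
user_c = etree.SubElement(root, "user", {"number-alias": "1000"})

print(user_a.get("id"), user_b.get("id"), user_c.get("number-alias"))
```

All three forms produce ordinary attributes; which one to use is mostly a matter of taste and of how many attributes an element has.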


params = etree.SubElement(user, "params")

Next we create params as a sub-element of user. It is an ordinary element with no attributes of its own; it only groups the param elements that follow.


param = etree.SubElement(params, "param")
param.set('name', 'password')
param.set('value', '$${default_password}')

We create a sub-element of params called param. It has two attributes and their names are name and value. Their values are password and $${default_password} respectively.


param = etree.SubElement(params, "param")
param.set('name', 'vm-password')
param.set('value', '1000')

We create a second param under params with different attribute values. An element can have as many sub-elements as required, including several with the same tag name. Reusing the Python name param is safe because the first element is already attached to the tree.

variables = etree.SubElement(user, "variables")

Here we created another sub-element, variables, of the user tag/element, similar to params.


variable = etree.SubElement(variables, "variable")
variable.set('name', 'toll_allow')
variable.set('value', 'domestic,international,local')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'accountcode')
variable.set('value', '1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'user_context')
variable.set('value', 'default')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'effective_caller_id_name')
variable.set('value', 'Extension 1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'effective_caller_id_number')
variable.set('value', '1000')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'outbound_caller_id_name')
variable.set('value', '$${outbound_caller_name}')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'outbound_caller_id_number')
variable.set('value', '$${outbound_caller_id}')
variable = etree.SubElement(variables, "variable")
variable.set('name', 'callgroup')
variable.set('value', 'techsupport')
variable.text = 'This can contain data'

The above code creates eight sub-elements of variables, each named variable. Notice that we set some text through .text on the last one. All other variable elements carry only attributes, while the last one also contains “data”. This is where I have moved away from the FreeSWITCH file, because there variable has attributes and no “data”.
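Repeating the same three lines for every variable gets tedious. A data-driven loop is one way to shorten it; the sketch below is only an illustration, and the add_variables helper is a name I made up rather than anything provided by lxml. The import fallback is again my own assumption so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which shares this API
    import xml.etree.ElementTree as etree

# (name, value) pairs taken from the walkthrough above.
VARIABLES = [
    ("toll_allow", "domestic,international,local"),
    ("accountcode", "1000"),
    ("user_context", "default"),
]

def add_variables(parent, pairs):
    """Append one <variable name=... value=.../> child per pair (hypothetical helper)."""
    for name, value in pairs:
        etree.SubElement(parent, "variable", name=name, value=value)
    return parent

variables = add_variables(etree.Element("variables"), VARIABLES)
print(etree.tostring(variables).decode("ascii"))
```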


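root_tree = etree.ElementTree(root)
print(etree.tostring(root_tree, pretty_print=True, encoding='unicode'))

Above we wrap the root element (include in this case) in an ElementTree object, which represents the whole document. All the elements we defined above are already attached beneath root, so serializing the tree prints the complete structure. Passing encoding='unicode' makes tostring return text rather than bytes, so the print call behaves the same on Python 2 and Python 3. The output should be similar to the one below.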

<include>
  <!--This is a comment-->
  <user id="1000">
    <params>
      <param name="password" value="$${default_password}"/>
      <param name="vm-password" value="1000"/>
    </params>
    <variables>
      <variable name="toll_allow" value="domestic,international,local"/>
      <variable name="accountcode" value="1000"/>
      <variable name="user_context" value="default"/>
      <variable name="effective_caller_id_name" value="Extension 1000"/>
      <variable name="effective_caller_id_number" value="1000"/>
      <variable name="outbound_caller_id_name" value="$${outbound_caller_name}"/>
      <variable name="outbound_caller_id_number" value="$${outbound_caller_id}"/>
      <variable name="callgroup" value="techsupport">This can contain data</variable>
    </variables>
  </user>
</include>
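The next section reads this document back from a file called 1000.xml, so it has to be saved somewhere first. One possible way, sketched below under my own assumptions, is ElementTree.write(). I write to a temporary directory so the example cannot overwrite a real file, and I build only a tiny tree to keep it short. With lxml you could also pass pretty_print=True to write(); I leave it out here so the snippet still runs with the standard library fallback.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which shares this API
    import xml.etree.ElementTree as etree

root = etree.Element("include")
user = etree.SubElement(root, "user", id="1000")

# A temporary directory keeps the sketch from touching a real 1000.xml.
path = os.path.join(tempfile.mkdtemp(), "1000.xml")

# ElementTree.write() serializes the whole tree to the given file.
etree.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)

# Parse the file back to confirm the round trip.
reloaded = etree.parse(path).getroot()
print(reloaded.tag, reloaded[0].get("id"))
```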

Parse (Read) XML

Reading XML is very similar to writing it.

from lxml import etree
infile = open("1000.xml", 'rb')

In the above code we open the XML file created in the previous section (saved as 1000.xml in this case) for reading. The file is opened in binary mode, rb, because on Python 3 lxml expects a file object that returns bytes; binary mode works on Python 2 as well.

context = etree.iterparse(infile, events=("start", "end"))

It’s a good idea to read an XML file iteratively so that a large file does not have to be held in memory all at once. iterparse gives us an iterator that parses infile incrementally and reports “events” as it goes. We ask for the two main events, start and end. “Start” fires when an element's opening tag is read (its text and children may not be available yet), and “end” fires when its closing tag is read and the element is complete. Note that the parsed elements still accumulate in a tree, so the memory saving only appears if you discard elements, for example with element.clear(), once you are done with them.


for event, element in context:
    print('Event: {}'.format(event))
    print('Element Tag: {}'.format(element.tag))
    print('Element Text: {}'.format(element.text))
    print('Element Items {}'.format(element.items()))
    print('Previous Element {}'.format(element.getprevious()))
    print('Parent Element {}'.format(element.getparent()))

In the above code we iterate over the XML file. On every pass the context iterator yields a two-item tuple: the event name (start or end in our case) and the element it applies to. The “element” object has several attributes and methods, some of which we have used here:

  • tag contains the element's name (include, user, params, variables, etc. in our example)
  • text contains any text directly inside the element, before its first child. In our case only the last variable contains real “data”; for elements with children it is just the whitespace used for indentation
  • items() returns the attributes as a list of (name, value) tuples. For example, each param yields two tuples, one for name and one for value
  • getprevious() returns the preceding sibling, that is, the node just before this one under the same parent, or None if there is none. Comments count as siblings, which is why the previous node of user is the comment
  • getparent() returns the element's parent. Every element has exactly one parent except the root, for which getparent() returns None
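To make the sibling and parent relationships concrete, here is a small sketch that uses a cut-down version of the document from the first section. It needs lxml itself, because getprevious() and getparent() are lxml additions that the standard library's ElementTree does not offer. The inline XML string is my own shortened stand-in for 1000.xml.

```python
from lxml import etree

root = etree.fromstring(
    "<include><!--This is a comment--><user id='1000'>"
    "<params><param name='password'/><param name='vm-password'/></params>"
    "</user></include>"
)
user = root.find("user")
first, second = user.find("params")  # the two <param> children

print(user.getprevious())               # the comment node that precedes <user>
print(first.getprevious())              # None: no earlier sibling
print(second.getprevious().get("name")) # password
print(second.getparent().tag)           # params
print(root.getparent())                 # None: the root has no parent
print(dict(second.items()))             # {'name': 'vm-password'}
```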

infile.close()

Finally, we close the input file. I will add one more thing: if you are only interested in a particular tag (or element), you can pass it to iterparse, and only events for matching elements are reported: context = etree.iterparse(infile, events=("start", "end"), tag="param").

Running the above code on the 1000.xml input file produces output similar to the listing below. The hexadecimal addresses in the element representations will differ from run to run.

Event: start
Element Tag: include
Element Text:

Element Items []
Previous Element None
Parent Element None
Event: start
Element Tag: user
Element Text:

Element Items [('id', '1000')]
Previous Element <!--This is a comment-->
Parent Element <Element include at b7737784>
Event: start
Element Tag: params
Element Text:

Element Items []
Previous Element None
Parent Element <Element user at b77377ac>
Event: start
Element Tag: param
Element Text: None
Element Items [('name', 'password'), ('value', '$${default_password}')]
Previous Element None
Parent Element <Element params at b77377d4>
Event: end
Element Tag: param
Element Text: None
Element Items [('name', 'password'), ('value', '$${default_password}')]
Previous Element None
Parent Element <Element params at b77377d4>
Event: start
Element Tag: param
Element Text: None
Element Items [('name', 'vm-password'), ('value', '1000')]
Previous Element <Element param at b77377fc>
Parent Element <Element params at b77377d4>
Event: end
Element Tag: param
Element Text: None
Element Items [('name', 'vm-password'), ('value', '1000')]
Previous Element <Element param at b77377fc>
Parent Element <Element params at b77377d4>
Event: end
Element Tag: params
Element Text:

Element Items []
Previous Element None
Parent Element <Element user at b77377ac>
Event: start
Element Tag: variables
Element Text:

Element Items []
Previous Element <Element params at b77377d4>
Parent Element <Element user at b77377ac>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'toll_allow'), ('value', 'domestic,international,local')]
Previous Element None
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'toll_allow'), ('value', 'domestic,international,local')]
Previous Element None
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'accountcode'), ('value', '1000')]
Previous Element <Element variable at b7737874>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'accountcode'), ('value', '1000')]
Previous Element <Element variable at b7737874>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'user_context'), ('value', 'default')]
Previous Element <Element variable at b773789c>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'user_context'), ('value', 'default')]
Previous Element <Element variable at b773789c>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'effective_caller_id_name'), ('value', 'Extension 1000')]
Previous Element <Element variable at b77378c4>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'effective_caller_id_name'), ('value', 'Extension 1000')]
Previous Element <Element variable at b77378c4>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'effective_caller_id_number'), ('value', '1000')]
Previous Element <Element variable at b77378ec>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'effective_caller_id_number'), ('value', '1000')]
Previous Element <Element variable at b77378ec>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'outbound_caller_id_name'), ('value', '$${outbound_caller_name}')]
Previous Element <Element variable at b7737914>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'outbound_caller_id_name'), ('value', '$${outbound_caller_name}')]
Previous Element <Element variable at b7737914>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: None
Element Items [('name', 'outbound_caller_id_number'), ('value', '$${outbound_caller_id}')]
Previous Element <Element variable at b773793c>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: None
Element Items [('name', 'outbound_caller_id_number'), ('value', '$${outbound_caller_id}')]
Previous Element <Element variable at b773793c>
Parent Element <Element variables at b773784c>
Event: start
Element Tag: variable
Element Text: This can contain data
Element Items [('name', 'callgroup'), ('value', 'techsupport')]
Previous Element <Element variable at b7737964>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variable
Element Text: This can contain data
Element Items [('name', 'callgroup'), ('value', 'techsupport')]
Previous Element <Element variable at b7737964>
Parent Element <Element variables at b773784c>
Event: end
Element Tag: variables
Element Text:

Element Items []
Previous Element <Element params at b77377d4>
Parent Element <Element user at b77377ac>
Event: end
Element Tag: user
Element Text:

Element Items [('id', '1000')]
Previous Element <!--This is a comment-->
Parent Element <Element include at b7737784>
Event: end
Element Tag: include
Element Text:

Element Items []
Previous Element None
Parent Element None
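The walkthrough prints every event, which is useful for seeing how iterparse works. In real code you usually care about complete elements only, and you want to release them once processed so that memory use stays low on large files. The sketch below is one common pattern, not the only one: it listens for end events, picks out param elements by comparing tags, and calls clear() afterwards. The in-memory bytes are my own stand-in for 1000.xml, and the import fallback is an assumption so the snippet runs without lxml.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which shares this API
    import xml.etree.ElementTree as etree

# Small in-memory stand-in for 1000.xml (bytes, as iterparse expects).
xml_bytes = (
    b"<include><user id='1000'><params>"
    b"<param name='password' value='$${default_password}'/>"
    b"<param name='vm-password' value='1000'/>"
    b"</params></user></include>"
)

params = {}
# Listen only for "end" events: the element is complete at that point.
for event, element in etree.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if element.tag == "param":
        params[element.get("name")] = element.get("value")
        # Release the element's attributes, text and children once processed.
        element.clear()

print(params)
```

With lxml you could replace the tag comparison with the tag="param" argument mentioned earlier; the manual check is used here only because it also works with the standard library.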

Hat Tips

I strongly recommend that you read up on XML if you are not familiar with it. I could not have written this post without the help of:

  • Parsing XML and HTML with lxml
  • High-performance XML parsing in Python with lxml
  • The lxml.etree Tutorial
  • Write xml file using lxml library in Python
  • Changing the default indentation of etree.tostring in lxml

One Response to Gentle Introduction to Reading and Writing XML using Python

  1. Vorticity says:

    Just a suggestion. If you haven’t looked at it, you should look at lxml.objectify. It’s amazingly useful when you’re dealing with the parsed xml for long periods of time or in large bodies of code.
