silveira neto

carbon-based lifeform. virgo supercluster

Menu Close

Python Fast XML Parsing

Here is a useful tip on Python XML decoding.

I was extending xml.sax.ContentHandler class in a example to decode maps for a Pygame application when my connection went down and I noticed that the program stop working raising a exception regarded a call to urlib (a module for retrieve resources by url). I noticed that the module was getting the remote DTD schema to validate the XML.

<!DOCTYPE map SYSTEM "http://mapeditor.org/dtd/1.0/map.dtd">

This is not a requirement for my applications and it’s a huge performance overhead when works (almost 1 second for each map loaded) and when the applications is running in a environment without Internet it just waits for almost a minute and then fail with the remain decoding. A dirty workaround is open the XML file and get rid of the line containing the DTD reference.

But the correct way to programming XML decoding when we are not concerned on validate a XML schema is just the xml.parsers.expat. Instead of using a interface you just have to set some callback functions with the behaviors we want. This is a example from the documentation:

import xml.parsers.expat
 
# 3 handler functions
def start_element(name, attrs):
    print 'Start element:', name, attrs
def end_element(name):
    print 'End element:', name
def char_data(data):
    print 'Character data:', repr(data)
 
p = xml.parsers.expat.ParserCreate()
 
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
 
p.Parse("""<?xml version="1.0"?>
<parent id="top"><child1 name="paul">Text goes here</child1>
<child2 name="fred">More text</child2>
</parent>""", 1)

The output:

Start element: parent {'id': 'top'}
Start element: child1 {'name': 'paul'}
Character data: 'Text goes here'
End element: child1
Character data: '\n'
Start element: child2 {'name': 'fred'}
Character data: 'More text'
End element: child2
Character data: '\n'
End element: parent

© 2016 silveira neto. All rights reserved.

Theme by Anders Norén.