The etree tail quirk

2009-01-01

How to make a good XML solution for Python better.

Python was late to adopt XML and even then the libraries provided weren't as powerful or easy to use as those available elsewhere. Fortunately, with Python 2.5, a subset of the ElementTree library was incorporated into the standard library as xml.etree. Less fortunately, etree has its own quirks. One of these is to do with the handling of text within nodes.

If you've dealt with XML, you'll know that an XML element may contain text, which some XML parsers will (sensibly) interpret as the value for that element. In this example, the name element has Truman Capote as its value:

<name>Truman Capote</name>

It can also contain other elements, which are treated as children of the parent element. Here name has the child elements first and last:

<name><first>Truman</first><last>Capote</last></name>

It can also do both:

<credit>The novel was written by <name>Truman Capote</name> in
1965</credit>

In this case, what should the "value" of the parent element be? Should it be "The novel was written by"? That is logical, but should it also include the child element? And what about the trailing "in 1965"? That clearly belongs to the root credit element, but it can't be part of the value without the intervening name element. The question of children is also complicated: if name is treated as the only child, it is removed from its context within the surrounding text.

What is the correct way to handle this? In Python, xml.dom.minidom treats text as a child element (actually a "text node") in itself, as does the HTML DOM. This seems logical and orthogonal. The ordering and relationship of children is preserved. A node's "value" can either be left unused (just look at terminal text nodes), or be interpreted as the sum of all child nodes, or apply only when there is a single child that is text. Thus the above XML fragment renders as:

Element "credit"
- Text node "The novel was written by "
- Element "name"
  - Text node "Truman Capote"
- Text node " in 1965"
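The tree above can be checked directly. This is a minimal sketch, assuming Python 3 and with the fragment joined onto a single line; describe is a hypothetical helper name introduced here, not part of minidom.

```python
import xml.dom.minidom as minidom

fragment = (
    "<credit>The novel was written by "
    "<name>Truman Capote</name> in 1965</credit>"
)
credit = minidom.parseString(fragment).documentElement


def describe(node):
    """Return a (kind, value) pair for a text or element node."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.tagName)


# Text and elements sit side by side in childNodes, in document order.
children = [describe(child) for child in credit.childNodes]
for kind, value in children:
    print(kind, repr(value))
```

The text before and after name appears as two ordinary siblings of the name element, so nothing about their position has to be inferred.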

This is a more consistent approach. The nature of a child node shouldn't change depending on the other child nodes (its siblings). Yet this is exactly what happens with etree. Consider two XML fragments:

>>> frag1 = "<tag1>ABCD</tag1>"
>>> frag2 = "<tag1><tag2/>ABCD</tag1>"

When using minidom, in both cases the text ABCD is rendered as a text node:

>>> import xml.dom.minidom as minidom
>>> doc1 = minidom.parseString (frag1)
>>> root1 = doc1.childNodes[0]
>>> root1.childNodes
[<DOM Text node "ABCD">]
>>> doc2 = minidom.parseString (frag2)
>>> root2 = doc2.childNodes[0]
>>> root2.childNodes
[<DOM Element: tag2 at ...>, <DOM Text node "ABCD">]

The respective parse trees are:

Element "tag1"
- Text node "ABCD"

and:

Element "tag1"
- Element "tag2"
- Text node "ABCD"

In contrast, when parsing with etree:

>>> from xml.etree import ElementTree
>>> root3 = ElementTree.fromstring (frag1)
>>> root3[:]
[]
>>> root3.text
'ABCD'
>>> root4 = ElementTree.fromstring (frag2)
>>> root4[:]
[<Element tag2 at ...>]
>>> root4[0].tail
'ABCD'

The respective parse trees are:

Element "tag1" (with text as "ABCD")

and:

Element "tag1"
- Element "tag2" (with tail as "ABCD")
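The same split shows up in the credit fragment from earlier. A minimal sketch, assuming Python 3 and with the fragment joined onto a single line:

```python
from xml.etree import ElementTree

credit = ElementTree.fromstring(
    "<credit>The novel was written by <name>Truman Capote</name> in 1965</credit>"
)
name = credit[0]

# Text before the first child is stored on the parent element ...
print(repr(credit.text))
# ... while text after a child is stored on that child, not on the parent.
print(repr(name.text), repr(name.tail))
```

The phrase " in 1965" belongs to credit in the document, but in the parsed tree it can only be reached through name.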

In detail, an etree Element has text and tail members, for the text stored within and following an element respectively. If text is the sole child of an element, it will be stored in that element's text member. If there are sibling elements, it will be stored either in the parent's text or as the tail of one of its siblings, depending on its relative position: text before the first child element goes in the parent's text, and text after a child element goes in that child's tail. A more detailed example:

>>> tailfrag1 = '<tag1>text1<tag2>text2</tag2>text3</tag1>'
>>> tailfrag2 = '<tag1>text1<tag2/>text2<tag3/>text3</tag1>'

Parsing with xml.dom.minidom produces orthogonal structures:

>>> m = minidom.parseString (tailfrag1)
>>> n = m.childNodes[0] # get the root
>>> n.tagName, n.nodeValue, len (n.childNodes)
(u'tag1', None, 3)
>>> for x in n.childNodes:
...     print x
...
<DOM Text node "text1">
<DOM Element: tag2 at ...>
<DOM Text node "text3">
>>> e = minidom.parseString (tailfrag2)
>>> f = e.childNodes[0] # get the root
>>> f.tagName, f.nodeValue, len (f.childNodes)
(u'tag1', None, 5)
>>> for x in f.childNodes:
...     print x
...
<DOM Text node "text1">
<DOM Element: tag2 at ...>
<DOM Text node "text2">
<DOM Element: tag3 at ...>
<DOM Text node "text3">

The results respectively are:

Element "tag1"
- Text node "text1"
- Element "tag2"
  - Text node "text2"
- Text node "text3"

and:

Element "tag1"
- Text node "text1"
- Element "tag2"
- Text node "text2"
- Element "tag3"
- Text node "text3"

With etree, by contrast, the text shifts between text and tail:

>>> x = ElementTree.fromstring (tailfrag1)
>>> x.tag, x.text, x.tail, len(x)
('tag1', 'text1', None, 1)
>>> y = x[0]
>>> y.tag, y.text, y.tail, len(y)
('tag2', 'text2', 'text3', 0)
>>> a = ElementTree.fromstring (tailfrag2)
>>> a.tag, a.text, a.tail, len(a)
('tag1', 'text1', None, 2)
>>> b = a[0]
>>> b.tag, b.text, b.tail, len(b)
('tag2', None, 'text2', 0)
>>> c = a[1]
>>> c.tag, c.text, c.tail, len(c)
('tag3', None, 'text3', 0)

The results respectively are:

Element "tag1" (with text as "text1")
- Element "tag2" (with text as "text2" and tail as "text3")

and:

Element "tag1" (with text as "text1")
- Element "tag2" (with tail as "text2")
- Element "tag3" (with tail as "text3")

That's inconsistent and just damn untidy.
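The text and tail rule can at least be undone mechanically. As a minimal sketch (Python 3 assumed; mixed_content and labels are hypothetical helper names, not part of etree), the following recovers the ordered list of children that minidom reports for the two fragments:

```python
from xml.etree import ElementTree


def mixed_content(elem):
    """Return elem's text and child elements as one list, in document order."""
    items = []
    if elem.text:
        items.append(elem.text)      # text before the first child
    for child in elem:
        items.append(child)
        if child.tail:
            items.append(child.tail)  # text that follows this child
    return items


def labels(elem):
    """Render mixed_content() as strings so the order is easy to read."""
    return [
        item if isinstance(item, str) else "<%s>" % item.tag
        for item in mixed_content(elem)
    ]


first = labels(ElementTree.fromstring("<tag1>text1<tag2>text2</tag2>text3</tag1>"))
second = labels(ElementTree.fromstring("<tag1>text1<tag2/>text2<tag3/>text3</tag1>"))
print(first)
print(second)
```

Each tail is read from the child but reported at the parent's level, which is where it sits in the document.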

Admittedly, kosher XML should have elements that contain either a single text node or one or more elements. In the real world, this can't be relied upon. Take almost any stretch of HTML or XHTML:

In When Death Comes, Mary Oliver says When it's over, I want to say I have been a bride married to amazement, I've been a bridegroom taking the world into my arms.

With etree, parts of this sentence that logically should be at the same level appear as the text of the root paragraph, or as the text or tails of children of the root. When walking such an XML tree, a program has to look for both text and tail members and interpret them correctly. Perhaps worse, when building an XML tree, code has to keep track of what has preceded so that text can be attached to the correct place. Finally, the "tail" idiom is unlike the treatment of text in the HTML DOM, or in fact any other XML library I know of. That's not a showstopper, but it makes it worth asking whether the benefits of the new approach are real.
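The bookkeeping needed when building a tree can be sketched as follows. This assumes Python 3; append_text is a hypothetical helper written for this illustration, and the cite and q tags and the shortened quotation are assumptions made for the example rather than the markup of any particular page.

```python
from xml.etree import ElementTree


def append_text(parent, text):
    """Append text at the end of parent's content.

    With no children yet, the text belongs in parent.text; otherwise it
    has to go into the tail of the last child.
    """
    if len(parent):
        last = parent[-1]
        last.tail = (last.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


p = ElementTree.Element("p")
append_text(p, "In ")                 # no children yet: stored in p.text
cite = ElementTree.SubElement(p, "cite")
append_text(cite, "When Death Comes")
append_text(p, ", Mary Oliver says ")  # stored in cite.tail
q = ElementTree.SubElement(p, "q")
append_text(q, "When it's over")
append_text(p, ".")                    # stored in q.tail

markup = ElementTree.tostring(p, encoding="unicode")
plain = "".join(p.itertext())
print(markup)
print(plain)
```

Three pieces of text that all belong to the paragraph end up in three different places: p.text, cite.tail and q.tail. For reading, itertext() hides the distinction; for writing, something like the helper above has to make the decision each time.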

What's the solution? etree is simple and works well in most cases. Rather than writing YAXP (Yet Another XML Parser), a better solution is to modify the etree behaviour with a wrapping function. This is one of the reasons I developed the teetree module, detailed elsewhere.

Other illustrations of (and solutions to) this problem can be found at:

- ElementTree text helper
- LXML compatibility