XML Processing with DOM and Perl

Using hashes of (mostly) anonymous subroutines, it is surprisingly easy to write XML translators with the Perl DOM interface.

The Idea
Perl Hashes to the Rescue
Passing Data Around
Anonymous Subroutines
Dispatching on Subroutine Names
Related Documents

The Idea

If you iterate recursively over a DOM tree (such as one provided by the XML::DOM Perl module <http://search.cpan.org/author/TJMATHER/XML-DOM/> ), you need something like a computed goto to dispatch on node types, and, probably more important, different elements. For example, consider the structure-oriented DTD from Notes on XML DTD Design (including the IMAGE element), and an analyzer for the VBOX element. Written in purely imperative style, it could look like this:

my $node = shift;
# We assume that $node itself has already been visited.

foreach my $child ($node->getChildNodes) {
  if ($child->getNodeType == TEXT_NODE) {
    # We will probably want to ignore this if it is
    # whitespace (an signal an error otherwise).
  } elsif ($child->getNodeType == ELEMENT_NODE) {
    if ($child->getNodeName eq 'HBOX') {
      process_HBOX ($child);
    } elsif ($child->getNodeName eq 'LIST') {
      process_LIST ($child);
    } elsif ($child->getNodeName eq 'IMAGE') {
      process_IMAGE ($child);
    } else {
      # Signal an error.
    }
  } else {
    # Process comments etc.
  }
}

Clearly, this is unacceptable. The syntactic overhead is huge. What we really want is some case statement à la Ada (but also for strings).

for Child in getChildNodes (Node) loop
   case getNodeType (Child) is
      when TEXT_NODE =>
         --  Process as above.
      when ELEMENT_NODE =>
         case getNodeName (Child) is
            when "HBOX" => Process_HBOX (Child);
            when "LIST" => Process_LIST (Child);
            when "IMAGE" => Process_IMAGE (Child);
            when others =>
               --  Signal an error.
         end case;
      when others =>
         --  As above.
   end case;
end loop;

This Ada-inspired syntax immediately suggests to use Perl hashes.

Perl Hashes to the Rescue

First of all, we note that a TEXT_NODE does not require radically different processing at this level, so we can work with a single case statement. We choose to represent text nodes with an element name ' TEXT ' (with a leading and trailing space), so our dispatch table could look like this:

' TEXT ' => process_Text_Node
'HBOX'   => process_HBOX
'LIST'   => process_LIST
'IMAGE'  => process_IMAGE

With a small convenience routine called dispatchingName, we get the following Perl snippet.

my %dispatch = (
  ' TEXT ' => \&process_Text_Node,
  HBOX     => \&process_HBOX,
  LIST     => \&process_LIST,
  IMAGE    => \&process_IMAGE,
);

my $node = shift;

foreach my $child ($node->getChildNodes) {
  &{$dispatch{dispatchingName ($child)}} ($child);
}

The last thre lines are generic and can be put into a convenience routine, and we arrive at the code below.

my $node = shift;
dispatchChildNodes ($node, 
  { ' TEXT ' => \&process_Text_Node,
    HBOX     => \&process_HBOX,
    LIST     => \&process_LIST,
    IMAGE    => \&process_IMAGE,
  });

This is quite hard to beat, but one detail is still missing.

Passing Data Around

If you want to write a fairly generic kind of XML translator with node-by-node processing, you need three pieces of information at each node:

the source node,
the target node (where processed information should go),
state.

This means that we have to enhance our code a bit.

my ($source, $target, $state) = @_;
dispatchChildNodes ($source, $target, $state,
  { ' TEXT ' => \&process_Text_Node,
    HBOX     => \&process_HBOX,
    LIST     => \&process_LIST,
    IMAGE    => \&process_IMAGE,
  });

Consequentially, the implementation of process_HBOX should start with the following code:

sub process_HBOX {
  my ($source, $target, $state) = @_;
 
  # probably some invocation of dispatchChildNodes
}

Each processing routine should examine $source, process it (consulting and updating $state as required), and append the result to $target.

Anonymous Subroutines

A neat aspect of the hash dispatch is that it plays so well with anonymous subroutines. Suppose that in our example, processing for the IMAGE element is so simple that we do not want to defer it to a subroutine. We can use an anonymous subroutine instead.

my ($source, $target, $state) = @_;
dispatchChildNodes ($source, $target, $state,
  { ' TEXT ' => \&process_Text_Node,
    HBOX     => \&process_HBOX,
    LIST     => \&process_LIST,
    IMAGE    => sub {
      my ($source, $target, $state) = @_;
      # Process as usual.
    },
  });

The my statement is required becase Perl does not properly support closures. For the same reason, you have to use the $state to pass information from the surrounding code to the nested subprogram.

Dispatching on Subroutine Names

Why do we not use Perl's introspection capability, mangle the element name in some way to get a subroutine name, and call that subroutine directly? First of all, all subroutines would have to be named. Second, if we operate in the "Best of Both Worlds" scenario of Notes on XML DTD Design, some tag names could be overloaded, and we have to dispatch to different subroutines, depending on the enclosing elements. This is trivially handled by our hashes approach.

In general, we think that the name-based subroutine dispatch is not a good idea. It might offer some reduction in terms of code size, but even this is not clear (because anonymous subroutines are not available, for example).

Revisions

2003-07-31: published

Florian Weimer
Home Blog (DE) Blog (EN) RSS Feeds Impressum